Open Model Lab

Coding Agent Failure Taxonomy

Why do small coding agents fail on long-horizon tasks?

Status

Status: planned
Month/theme: November 2026: Agent Evals + Long-Horizon Tasks

This page is a report scaffold. It does not yet contain model scores, charts, or completed run results.

Research question

Why do small coding agents fail on long-horizon tasks?

Planned setup

Create a small long-horizon coding task suite.
Run the agent harness with trace capture.
Classify observed failures with a controlled taxonomy.
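As a minimal sketch of what a "controlled taxonomy" could look like in code, the snippet below models the four failure labels named on this page as a closed vocabulary and attaches them to a captured trace. The names `FailureLabel`, `TraceStep`, and `RunTrace` are hypothetical and are not taken from the planned agent_evals module; the trace fields are likewise assumptions.

```python
from dataclasses import dataclass, field
from enum import Enum


class FailureLabel(Enum):
    """Closed vocabulary of failure labels (hypothetical naming)."""
    LOOPING = "looping"
    PREMATURE_SUCCESS = "premature_success"
    CONTEXT_LOSS = "context_loss"
    TOOL_MISUSE = "tool_misuse"


@dataclass
class TraceStep:
    """One captured agent action; the field choice is an assumption."""
    index: int
    tool: str
    arguments: str
    output: str


@dataclass
class RunTrace:
    """All captured steps for one task, plus the labels assigned to it."""
    task_id: str
    steps: list[TraceStep] = field(default_factory=list)
    labels: set[FailureLabel] = field(default_factory=set)

    def add_label(self, raw: str) -> None:
        # Enum lookup by value raises ValueError for any string outside
        # the vocabulary, which is what keeps the taxonomy controlled.
        self.labels.add(FailureLabel(raw))
```

Rejecting unknown label strings at annotation time is one way to keep free-text labels from drifting into the results; whether the real harness enforces this is not stated here.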

Planned measurements

Task score, where the grader supports one.
Latency and cost, where the run infrastructure can measure them.
Output-quality notes and failure-mode labels.
Known caveats and reproducibility requirements.
Labels for looping, premature success, context loss, and tool misuse.
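To illustrate how one of these labels might be derived from a trace, here is a hedged sketch of a looping check plus a simple label tally. This page does not define how looping is detected; the rule used here (the same tool call repeated consecutively at least `min_repeats` times) and the function names `has_loop` and `label_counts` are assumptions made for illustration.

```python
from collections import Counter
from itertools import groupby


def has_loop(calls: list[tuple[str, str]], min_repeats: int = 3) -> bool:
    """Return True if an identical (tool, arguments) call occurs
    at least min_repeats times in a row. The threshold is an assumption."""
    # groupby collapses runs of consecutive equal items, so the size of
    # each group is the length of one run of identical calls.
    return any(
        sum(1 for _ in group) >= min_repeats
        for _, group in groupby(calls)
    )


def label_counts(labels_per_run: list[set[str]]) -> Counter:
    """Count how many runs carry each failure label."""
    counts: Counter = Counter()
    for labels in labels_per_run:
        counts.update(labels)
    return counts
```

A consecutive-repeat rule will miss loops that alternate between two or more calls, so a human review pass over flagged and unflagged traces would still be needed.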

Planned sections

Research question and claim boundary
Setup, model variants, data versions, and config hashes
Eval suite or task design
Measurements and failure modes
Limitations, caveats, and next decision
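The config hashes mentioned in the setup section could be computed in many ways; the sketch below shows one common approach, assuming the run config is a JSON-serializable dictionary. The function name `config_hash` and the example keys are hypothetical.

```python
import hashlib
import json


def config_hash(config: dict) -> str:
    """Return a SHA-256 hex digest of a canonical JSON encoding of config."""
    # Sorting keys and fixing separators makes the encoding independent
    # of dictionary insertion order and whitespace.
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Hypothetical config values, for illustration only.
a = config_hash({"model": "example-small", "max_steps": 50})
b = config_hash({"max_steps": 50, "model": "example-small"})
assert a == b  # key order does not change the hash
```

Recording this digest next to each run makes it possible to check later that two runs used the same settings.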

Expected artifacts

agent_evals module.
Long-horizon task benchmark.
Failure taxonomy.

Claim boundary

This report classifies observed failures in controlled tasks only.

Related links

Reports index
Related month page
Runs