Open Model Lab
Coding Agent Failure Taxonomy
Why do small coding agents fail on long-horizon tasks?
Status
- Status: planned
- Month/theme: November 2026: Agent Evals + Long-Horizon Tasks
Status: Planned. This page is a report scaffold. It does not contain model scores, charts, or completed run results.
Research question
Why do small coding agents fail on long-horizon tasks?
Planned setup
- Create a small long-horizon coding task suite.
- Run the agent harness with trace capture.
- Classify observed failures with a controlled taxonomy.
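As a rough illustration of what a controlled taxonomy with trace capture might look like, the sketch below defines a closed label set and one simple heuristic. The four label names come from the planned measurements on this page; everything else (`Step`, `Trace`, `has_repeated_call`, `label_trace`, and the three-repeat threshold) is a hypothetical assumption, not the planned harness design.

```python
from dataclasses import dataclass, field
from enum import Enum


class FailureLabel(Enum):
    """Closed label set, so annotators cannot invent ad hoc categories."""
    LOOPING = "looping"
    PREMATURE_SUCCESS = "premature_success"
    CONTEXT_LOSS = "context_loss"
    TOOL_MISUSE = "tool_misuse"


@dataclass(frozen=True)
class Step:
    """One captured agent step (hypothetical trace schema)."""
    tool: str
    args: str


@dataclass
class Trace:
    """All captured steps for one task, plus any assigned labels."""
    task_id: str
    steps: list[Step] = field(default_factory=list)
    labels: set[FailureLabel] = field(default_factory=set)


def has_repeated_call(trace: Trace, min_repeats: int = 3) -> bool:
    """Return True if an identical step occurs min_repeats times in a row."""
    run = 1
    for prev, cur in zip(trace.steps, trace.steps[1:]):
        run = run + 1 if cur == prev else 1
        if run >= min_repeats:
            return True
    return False


def label_trace(trace: Trace) -> Trace:
    """Attach LOOPING when the heuristic fires; other labels stay manual here."""
    if has_repeated_call(trace):
        trace.labels.add(FailureLabel.LOOPING)
    return trace
```

Using an enum rather than free-text labels keeps the taxonomy controlled: a label outside the set cannot be assigned by accident. The repeat heuristic only flags candidates, and a flagged trace would still need review before the label is counted.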
Planned measurements
- Task score, where the grader supports one.
- Latency and cost where the run infrastructure can measure them.
- Output-quality notes and failure-mode labels.
- Known caveats and reproducibility requirements.
- Failure labels for looping, premature success, context loss, and tool misuse.
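Because score, latency, and cost are each recorded only where the grader or infrastructure supports them, a per-run record could treat every measurement as optional. The sketch below shows one way to summarize such records without treating a missing value as zero. The `RunRecord` fields, the units, and the `summarize` helper are hypothetical assumptions, not the planned schema.

```python
from collections import Counter
from dataclasses import dataclass, field
from statistics import mean
from typing import Iterable, Optional


@dataclass
class RunRecord:
    """One run's measurements; None means 'not measured', not zero."""
    task_id: str
    score: Optional[float] = None
    latency_s: Optional[float] = None
    cost_usd: Optional[float] = None
    labels: list[str] = field(default_factory=list)


def mean_of_measured(values: Iterable[Optional[float]]) -> Optional[float]:
    """Average only the values that were actually measured."""
    measured = [v for v in values if v is not None]
    return mean(measured) if measured else None


def summarize(records: list[RunRecord]) -> dict:
    """Aggregate measured values and count failure-mode labels."""
    return {
        "runs": len(records),
        "mean_score": mean_of_measured(r.score for r in records),
        "mean_latency_s": mean_of_measured(r.latency_s for r in records),
        "mean_cost_usd": mean_of_measured(r.cost_usd for r in records),
        "label_counts": Counter(lab for r in records for lab in r.labels),
    }
```

Returning `None` for a metric that no run measured keeps an unmeasured quantity from being reported as a number.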
Planned sections
- Research question and claim boundary
- Setup, model variants, data versions, and config hashes
- Eval suite or task design
- Measurements and failure modes
- Limitations, caveats, and next decision
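The setup section is planned to record config hashes. One common way to derive such a hash, shown here only as a minimal sketch, is to serialize the run configuration to canonical JSON and hash the bytes. The `config_hash` name and the choice of SHA-256 are assumptions; the page does not specify a hashing scheme.

```python
import hashlib
import json


def config_hash(config: dict) -> str:
    """Hash a JSON-serializable config so key order does not change the result."""
    # sort_keys and fixed separators give a canonical byte representation.
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Two configs with the same keys and values in a different order produce the same digest, which makes the hash usable as a reproducibility identifier next to model variant and data version.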
Expected artifacts
- agent_evals module.
- Long-horizon task benchmark.
- Failure taxonomy.
Claim boundary
This report classifies observed failures in controlled tasks only.