Open Model Lab

Coding Agent Failure Taxonomy

Why do small coding agents fail on long-horizon tasks?

Status

Status: Planned
Month/theme: November 2026: Agent Evals + Long-Horizon Tasks

This page is a report scaffold. It does not contain model scores, charts, or completed run results.

Research question

Why do small coding agents fail on long-horizon tasks?

Planned setup

  • Create a small long-horizon coding task suite.
  • Run the agent harness with trace capture.
  • Classify observed failures with a controlled taxonomy.
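A minimal sketch of what trace capture with a controlled taxonomy could look like. The four label names are taken from the planned measurements on this page; the final taxonomy may contain more. Everything else (`TraceStep`, `TaskTrace`, the field names) is hypothetical and is not the `agent_evals` API, which does not exist yet.

```python
from dataclasses import dataclass, field
from enum import Enum


class FailureLabel(Enum):
    """Closed label set: the four labels named under planned measurements."""
    LOOPING = "looping"
    PREMATURE_SUCCESS = "premature_success"
    CONTEXT_LOSS = "context_loss"
    TOOL_MISUSE = "tool_misuse"


@dataclass
class TraceStep:
    """One captured agent step (hypothetical schema)."""
    index: int
    tool: str
    arguments: str
    output: str


@dataclass
class TaskTrace:
    """All captured steps and assigned failure labels for one task run."""
    task_id: str
    steps: list[TraceStep] = field(default_factory=list)
    labels: set[FailureLabel] = field(default_factory=set)

    def record(self, tool: str, arguments: str, output: str) -> None:
        # Steps are numbered in the order they are captured.
        self.steps.append(TraceStep(len(self.steps), tool, arguments, output))

    def add_label(self, name: str) -> None:
        # FailureLabel(name) raises ValueError for any name outside the
        # taxonomy, which is what keeps the label set "controlled".
        self.labels.add(FailureLabel(name))
```

Backing the labels with an `Enum` means an annotator cannot introduce an ad hoc label by typo; extending the taxonomy requires an explicit change to the class.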

Planned measurements

  • Task score, where the grader supports one.
  • Latency and cost, where the run infrastructure can measure them.
  • Output-quality notes.
  • Failure-mode labels: looping, premature success, context loss, and tool misuse.
  • Known caveats and reproducibility requirements.
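One way the failure-mode labels could be summarized across runs is as the fraction of runs that carry each label. The sketch below assumes labels arrive as plain strings keyed by run id; the function name `label_rates` and the input shape are hypothetical, and no real run data is implied.

```python
from collections import Counter


def label_rates(runs: dict[str, list[str]]) -> dict[str, float]:
    """Return, for each label, the fraction of runs that carry it.

    `runs` maps a run id to the failure labels assigned to that run.
    A label repeated within one run is counted once for that run.
    """
    if not runs:
        return {}
    counts = Counter(label for labels in runs.values() for label in set(labels))
    return {label: n / len(runs) for label, n in counts.items()}
```

Runs with no labels still count toward the denominator, so the rates describe all runs rather than only the failed ones.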

Planned sections

  • Research question and claim boundary
  • Setup, model variants, data versions, and config hashes
  • Eval suite or task design
  • Measurements and failure modes
  • Limitations, caveats, and next decision
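The config hashes mentioned in the setup section could be produced along these lines. This is a sketch that assumes run configs are JSON-serializable dictionaries; the function name `config_hash` is hypothetical, and SHA-256 is chosen here for illustration, not confirmed by the plan.

```python
import hashlib
import json


def config_hash(config: dict) -> str:
    """Return a SHA-256 hex digest of a canonical JSON form of `config`."""
    # sort_keys and fixed separators make the serialization independent of
    # key insertion order and whitespace, so equal configs hash equally.
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Recording this digest next to each run makes it possible to check later that two runs used the same settings.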

Expected artifacts

  • agent_evals module.
  • Long-horizon task benchmark.
  • Failure taxonomy.

Claim boundary

This report classifies failures observed in controlled tasks only; it makes no claims beyond those tasks.