Open Model Lab

November 2026: Agent Evals + Long-Horizon Tasks

Measure agent success rates and build a failure taxonomy on multi-step tasks.

Gate status

Month: 2026-11
Status: planned
Report: Coding Agent Failure Taxonomy

Success criterion

The evaluation system reliably categorizes why the agent fails, not just whether it fails.

Focus

  • Bug fix tasks.
  • Test addition tasks.
  • Refactor tasks.
  • Config repair tasks.
  • Small CLI feature tasks.
  • Failure modes: looping, premature success, context loss, tool misuse.
  • Task decomposition and self-correction.
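
As a rough illustration of how the failure modes listed above might be assigned to a run, here is a minimal sketch. Everything in it is an assumption made for illustration and not part of the planned agent_evals module: the `Step` and `Run` record shapes, the `classify` function, the thresholds, and the `UNCLASSIFIED` bucket are all hypothetical. Context loss is left undetected because it likely needs richer signals than a tool-call trace provides.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class FailureMode(Enum):
    LOOPING = "looping"
    PREMATURE_SUCCESS = "premature_success"
    CONTEXT_LOSS = "context_loss"   # listed in the plan; not detected by this sketch
    TOOL_MISUSE = "tool_misuse"
    UNCLASSIFIED = "unclassified"   # hypothetical catch-all, not in the plan


@dataclass
class Step:
    tool: str         # hypothetical tool name, e.g. "run_tests"
    args: str         # serialized arguments of the call
    ok: bool = True   # False if the tool call itself errored


@dataclass
class Run:
    steps: List[Step]
    claimed_success: bool   # the agent declared the task done
    tests_passed: bool      # ground truth from the task's checker


def classify(run: Run, loop_threshold: int = 3) -> Optional[FailureMode]:
    """Return a failure mode for a failed run, or None if the run passed."""
    if run.tests_passed:
        return None

    # Looping (assumed heuristic): the same call repeated consecutively.
    streak = 1
    for prev, cur in zip(run.steps, run.steps[1:]):
        if (prev.tool, prev.args) == (cur.tool, cur.args):
            streak += 1
        else:
            streak = 1
        if streak >= loop_threshold:
            return FailureMode.LOOPING

    # Tool misuse (assumed heuristic): more than half of the calls errored.
    errors = sum(1 for s in run.steps if not s.ok)
    if run.steps and errors * 2 > len(run.steps):
        return FailureMode.TOOL_MISUSE

    # Premature success: the agent said "done" but the checker disagrees.
    if run.claimed_success:
        return FailureMode.PREMATURE_SUCCESS

    return FailureMode.UNCLASSIFIED
```

The checks run in a fixed order, so a run that both loops and claims success is counted as looping. A real taxonomy would probably allow multiple labels per run; a single label keeps the sketch short.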

Expected outputs

  • agent_evals module.
  • Long-horizon task benchmark.
  • Report: failure taxonomy for small coding agents.

End-of-month decision

Do the evals reveal actionable failure patterns rather than just pass/fail rates?
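
One way to make this decision concrete is to report failure categories per task type alongside the pass rate. The sketch below assumes results arrive as `(task_type, failure_mode)` pairs, with `None` meaning the run passed. The `summarize` function and the task-type labels are hypothetical names chosen for illustration.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Optional, Tuple


def summarize(results: Iterable[Tuple[str, Optional[str]]]) -> Dict[str, dict]:
    """Group runs by task type and count each failure mode next to the pass rate."""
    by_type: Dict[str, Counter] = defaultdict(Counter)
    for task_type, failure in results:
        by_type[task_type][failure or "pass"] += 1

    summary = {}
    for task_type, counts in by_type.items():
        total = sum(counts.values())
        summary[task_type] = {
            "pass_rate": counts["pass"] / total,
            # Everything that is not a pass, keyed by failure mode.
            "failures": {k: v for k, v in counts.items() if k != "pass"},
        }
    return summary


# Made-up results, for illustration only (not measured data).
example = [
    ("bug_fix", None),
    ("bug_fix", "looping"),
    ("refactor", "premature_success"),
    ("refactor", "premature_success"),
]
report = summarize(example)
```

In this made-up example the refactor pass rate alone says only that every run failed. The breakdown shows that every failure was a premature success claim, which is the kind of pattern that could point to a specific fix.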