Open Model Lab

November 2026: Agent Evals + Long-Horizon Tasks

Measure agent success rates on multi-step tasks and build a taxonomy of how agents fail.

Gate status

Month: 2026-11
Status: planned
Report: Coding Agent Failure Taxonomy

Success criterion

The evaluation system reliably categorizes why the agent fails, not only whether it fails.

Focus

Bug fix tasks.
Test addition tasks.
Refactor tasks.
Config repair tasks.
Small CLI feature tasks.
Failure modes: looping, premature success, context loss, tool misuse.
Task decomposition and self-correction.
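
As a rough illustration of how the failure modes above might be assigned to a single agent run, here is a minimal sketch. Everything in it is an assumption made for illustration: the `Step` and `Episode` records, the `classify_failure` function, the `UNCLASSIFIED` bucket, and the thresholds are hypothetical and not part of the planned agent_evals module. Context loss is left unhandled because it likely needs more than simple trace heuristics.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class FailureMode(Enum):
    LOOPING = "looping"
    PREMATURE_SUCCESS = "premature_success"
    CONTEXT_LOSS = "context_loss"      # listed for completeness; not detected below
    TOOL_MISUSE = "tool_misuse"
    UNCLASSIFIED = "unclassified"      # assumption: bucket for failures no rule matches


@dataclass
class Step:
    tool: str   # hypothetical tool name, e.g. "run_tests"
    args: str   # serialized arguments of the call
    ok: bool    # whether the tool call itself succeeded


@dataclass
class Episode:
    steps: List[Step]
    claimed_success: bool  # the agent declared the task done
    tests_passed: bool     # ground truth from the task's checker


def classify_failure(
    ep: Episode,
    loop_threshold: int = 3,         # assumed: identical consecutive calls that count as a loop
    misuse_ratio: float = 0.5,       # assumed: share of failed tool calls that counts as misuse
) -> Optional[FailureMode]:
    """Return None for a passing episode, otherwise a heuristic failure label."""
    if ep.tests_passed:
        return None

    # The agent said it was done, but the checker disagrees.
    if ep.claimed_success:
        return FailureMode.PREMATURE_SUCCESS

    # Looping: the same (tool, args) call repeated back to back.
    run = 1
    for prev, cur in zip(ep.steps, ep.steps[1:]):
        same = (prev.tool, prev.args) == (cur.tool, cur.args)
        run = run + 1 if same else 1
        if run >= loop_threshold:
            return FailureMode.LOOPING

    # Tool misuse: most tool calls in the episode errored out.
    failed = sum(1 for s in ep.steps if not s.ok)
    if ep.steps and failed / len(ep.steps) > misuse_ratio:
        return FailureMode.TOOL_MISUSE

    return FailureMode.UNCLASSIFIED
```

The rules are checked in a fixed order, so an episode that both loops and claims success is labeled premature success. A real taxonomy would probably allow multiple labels per episode.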

Expected outputs

agent_evals module.
Long-horizon task benchmark.
Report: failure taxonomy for small coding agents.

End-of-month decision

Do the evals reveal actionable failure patterns rather than just pass/fail rates?
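
One way to make this question concrete is to report failure modes per task category next to the overall pass rate. The sketch below assumes a hypothetical result format, a `(task_category, failure_mode)` pair with `None` for a passing run, and a hypothetical `summarize` helper. Neither is a committed interface.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Optional, Tuple

# Assumed record shape: (task_category, failure_mode), where failure_mode is None on a pass.
Result = Tuple[str, Optional[str]]


def summarize(results: Iterable[Result]) -> Dict[str, object]:
    """Return the overall pass rate plus failure-mode counts per task category."""
    results = list(results)
    total = len(results)
    passed = sum(1 for _, mode in results if mode is None)

    by_category: Dict[str, Counter] = defaultdict(Counter)
    for category, mode in results:
        if mode is not None:
            by_category[category][mode] += 1

    return {
        "pass_rate": passed / total if total else 0.0,
        "failures": {cat: dict(counts) for cat, counts in by_category.items()},
    }


if __name__ == "__main__":
    # Made-up example records, only to show the shape of the summary.
    demo = [
        ("bug_fix", None),
        ("bug_fix", "looping"),
        ("refactor", "looping"),
        ("refactor", "premature_success"),
    ]
    print(summarize(demo))
```

A pass rate alone would read the same whether every failure is a loop or every failure is a premature success claim. The per-category counts are what would point at a specific fix.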
