Open Model Lab
November 2026: Agent Evals + Long-Horizon Tasks
Measure agent success rates and build a failure taxonomy for multi-step tasks.
Gate status
- Month: 2026-11
- Status: planned
Success criterion
A reliable evaluation system categorizes why the agent fails, not only whether it fails.
Focus
- Bug fix tasks.
- Test addition tasks.
- Refactor tasks.
- Config repair tasks.
- Small CLI feature tasks.
- Failure modes: looping, premature success, context loss, tool misuse.
- Task decomposition and self-correction.
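The four failure modes listed above could be tagged from an agent trace roughly as follows. This is a minimal sketch, not the planned agent_evals design: the `Step` record, the `detect_modes` function, and both thresholds are hypothetical names and values chosen for illustration. Context loss is deliberately left undetected here, since it likely needs a task-specific signal (for example, the agent re-reading a file it already summarized).

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum
from typing import List, Set


class FailureMode(Enum):
    LOOPING = "looping"
    PREMATURE_SUCCESS = "premature_success"
    CONTEXT_LOSS = "context_loss"  # listed for completeness; not detected below
    TOOL_MISUSE = "tool_misuse"


@dataclass(frozen=True)
class Step:
    """One tool call in an agent trace (hypothetical record shape)."""
    tool: str
    args: str
    ok: bool  # whether the tool call itself succeeded


def detect_modes(
    steps: List[Step],
    claimed_success: bool,
    tests_passed: bool,
    loop_threshold: int = 3,        # assumed value, tune per benchmark
    misuse_threshold: float = 0.5,  # assumed value, tune per benchmark
) -> Set[FailureMode]:
    """Return every failure mode this heuristic can see in a failed run."""
    modes: Set[FailureMode] = set()
    if tests_passed:
        return modes  # a passing run has no failure mode

    # Looping: the identical (tool, args) call appears many times.
    counts = Counter((s.tool, s.args) for s in steps)
    if counts and max(counts.values()) >= loop_threshold:
        modes.add(FailureMode.LOOPING)

    # Premature success: the agent declared victory but the checks failed.
    if claimed_success:
        modes.add(FailureMode.PREMATURE_SUCCESS)

    # Tool misuse: a large share of tool calls errored out.
    failed = sum(1 for s in steps if not s.ok)
    if steps and failed / len(steps) >= misuse_threshold:
        modes.add(FailureMode.TOOL_MISUSE)

    return modes
```

Returning a set rather than a single label reflects that modes can co-occur: an agent may loop on a failing command and then still claim success.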
Expected outputs
- agent_evals module.
- Long-horizon task benchmark.
- Report: failure taxonomy for small coding agents.
End-of-month decision
Do the evals reveal actionable failure patterns rather than just pass/fail rates?
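One way to make this question concrete is to report failures broken down by task type and failure mode alongside the overall pass rate. The sketch below assumes each run is reduced to a `(task_type, failure_mode)` pair, with `None` meaning the run passed; the `summarize` function and the string labels are hypothetical, not part of any existing module.

```python
from collections import Counter
from typing import Dict, List, Optional, Tuple

# (task_type, failure_mode); failure_mode is None when the run passed.
Result = Tuple[str, Optional[str]]


def summarize(results: List[Result]) -> Dict[str, object]:
    """Report the pass rate plus a (task_type, failure_mode) breakdown."""
    total = len(results)
    passed = sum(1 for _, mode in results if mode is None)
    # Count only failed runs, keyed by where and how they failed.
    breakdown = Counter((task, mode) for task, mode in results if mode is not None)
    return {
        "pass_rate": passed / total if total else 0.0,
        "failures": dict(breakdown),
    }


if __name__ == "__main__":
    runs = [
        ("bug_fix", None),
        ("bug_fix", "looping"),
        ("refactor", "looping"),
        ("refactor", "context_loss"),
    ]
    print(summarize(runs))
```

If the breakdown concentrates in a few cells (say, looping on refactor tasks), that is an actionable pattern; a flat spread suggests the taxonomy or the detectors need refinement.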