Open Model Lab

October 2026: Agent Harness / Tool Use

Move the model from passive question-answering into a tool-using agent setup.

Gate status

Month: 2026-10
Status: planned
Report: Post-Training and Agentic Coding Behavior

Success criterion

Agent behavior is measured step by step, with each failed step assigned a failure mode, rather than scored only as overall success or failure.
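
As a rough illustration of step-level measurement, the sketch below records one entry per agent step and summarizes an episode by failure mode. The `FailureMode` categories, the `Step` fields, and the `summarize` helper are all hypothetical names chosen for this example; the roadmap does not define a failure taxonomy yet.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class FailureMode(Enum):
    # Hypothetical taxonomy, for illustration only.
    BAD_TOOL_CALL = "bad_tool_call"
    TOOL_ERROR = "tool_error"
    MISREAD_ERROR = "misread_error"
    GAVE_UP = "gave_up"


@dataclass
class Step:
    index: int
    tool: str
    ok: bool
    failure: Optional[FailureMode] = None


def summarize(steps: list[Step]) -> dict:
    """Report per-step outcomes instead of a single pass/fail bit."""
    failures = Counter(s.failure.value for s in steps if s.failure is not None)
    first_failure = next((s.index for s in steps if not s.ok), None)
    return {
        "steps": len(steps),
        "failed_steps": sum(not s.ok for s in steps),
        "first_failed_step": first_failure,
        "failure_modes": dict(failures),
    }


# A made-up episode: the agent hits two failed steps, then recovers.
episode = [
    Step(0, "read_file", ok=True),
    Step(1, "run_python", ok=False, failure=FailureMode.TOOL_ERROR),
    Step(2, "run_tests", ok=False, failure=FailureMode.MISREAD_ERROR),
    Step(3, "run_tests", ok=True),
]
summary = summarize(episode)
```

An episode that ends in success can still contain failed steps, which is the information a plain success/failure score would discard.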

Focus

  • Tool registry.
  • File reading/writing.
  • Python execution.
  • Test execution.
  • Planning, retry, error interpretation, and replayable traces.
  • Compare base/SFT/DPO models on agentic coding behavior.
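
One possible shape for the tool registry and trace recording is sketched below, under several assumptions: the `TOOLS` dictionary, the `tool` decorator, the `call` helper, and the trace record fields are hypothetical and are not the planned agents module API. The sketch covers file writing, file reading, and Python execution, with no sandboxing.

```python
import json
import subprocess
import sys
import tempfile
from pathlib import Path
from typing import Callable

# Hypothetical registry: tool name -> callable taking keyword args, returning a string.
TOOLS: dict[str, Callable[..., str]] = {}


def tool(name: str):
    """Decorator that registers a function under a tool name."""
    def register(fn):
        TOOLS[name] = fn
        return fn
    return register


@tool("write_file")
def write_file(path: str, text: str) -> str:
    Path(path).write_text(text, encoding="utf-8")
    return f"wrote {len(text)} chars"


@tool("read_file")
def read_file(path: str) -> str:
    return Path(path).read_text(encoding="utf-8")


@tool("run_python")
def run_python(path: str) -> str:
    # Runs the file in a subprocess; this sketch does no sandboxing.
    proc = subprocess.run(
        [sys.executable, path], capture_output=True, text=True, timeout=30
    )
    return proc.stdout + proc.stderr


def call(trace: list[dict], name: str, **args) -> str:
    """Dispatch a tool call and append a JSON-serializable record to the trace."""
    try:
        result, ok = TOOLS[name](**args), True
    except Exception as exc:  # unknown tool and tool failure both land in the trace
        result, ok = f"{type(exc).__name__}: {exc}", False
    trace.append({"tool": name, "args": args, "ok": ok, "result": result})
    return result


trace: list[dict] = []
with tempfile.TemporaryDirectory() as tmp:
    script = str(Path(tmp) / "hello.py")
    call(trace, "write_file", path=script, text="print(2 + 3)\n")
    output = call(trace, "run_python", path=script)
    call(trace, "no_such_tool")  # recorded as a failed step, not raised

# One JSON object per line, so the calls can be re-issued in order later.
trace_jsonl = "\n".join(json.dumps(rec) for rec in trace)
```

Because every record stores the tool name and its arguments, the same sequence of calls can be fed back through `call` to replay an episode. A test-execution tool could be registered the same way, for example by running `python -m pytest` in a subprocess.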

Expected outputs

  • agents module.
  • Small coding-agent benchmark set.
  • Report: what does post-training change in agentic coding behavior?

End-of-month decision

Can agent traces explain why the model succeeds or fails?