Open Model Lab
Post-Training and Agentic Coding Behavior
How do base, SFT, and DPO variants behave inside a simple tool-using coding-agent harness?
Status
- Status: planned
- Month/theme: October 2026 (Agent Harness / Tool Use)
Status: Planned. This page is a report scaffold. It does not contain model scores, charts, or completed run results.
Research question
How do base, SFT, and DPO variants behave inside a simple tool-using coding-agent harness?
Planned setup
- Build a small tool registry and coding-agent harness.
- Run base, SFT, and DPO variants on the same coding tasks.
- Save replayable traces for every task attempt.
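A minimal sketch of the tool registry and replayable traces described above. It assumes tools are plain Python callables that return text and that traces are stored as JSON Lines. ToolRegistry, run_calls, to_jsonl, and replay_trace are hypothetical names, not an existing API of the planned agents module. The model is stubbed out: a fixed list of (tool, args) calls stands in for the actions a base, SFT, or DPO variant would emit.

```python
import json
import time
from dataclasses import asdict, dataclass, field
from typing import Any, Callable, Optional

# Hypothetical sketch: names are illustrative, not the lab's actual agents module.


@dataclass
class ToolRegistry:
    """Maps tool names to Python callables that return text."""
    tools: dict = field(default_factory=dict)

    def register(self, name: str, fn: Callable[..., str]) -> None:
        if name in self.tools:
            raise ValueError(f"tool already registered: {name}")
        self.tools[name] = fn

    def call(self, name: str, **kwargs: Any) -> str:
        if name not in self.tools:
            raise KeyError(f"unknown tool: {name}")
        return self.tools[name](**kwargs)


@dataclass
class TraceStep:
    """One recorded tool call: the request, the result or error, and wall time."""
    tool: str
    args: dict
    output: Optional[str]
    error: Optional[str]
    elapsed_s: float


def run_calls(registry: ToolRegistry, calls: list) -> list:
    """Execute (tool_name, kwargs) pairs in order, recording failures instead of raising.

    Stand-in for the agent loop: a real harness would get each call from a model variant.
    """
    steps = []
    for name, args in calls:
        start = time.perf_counter()
        try:
            output, error = registry.call(name, **args), None
        except Exception as exc:  # keep tool-use failures in the trace for diagnosis
            output, error = None, f"{type(exc).__name__}: {exc}"
        steps.append(TraceStep(name, args, output, error, time.perf_counter() - start))
    return steps


def to_jsonl(steps: list) -> str:
    """Serialize a trace as JSON Lines: one step per line, keys sorted for stable diffs."""
    return "".join(json.dumps(asdict(s), sort_keys=True) + "\n" for s in steps)


def replay_trace(registry: ToolRegistry, jsonl: str) -> list:
    """Re-run every recorded call; return indices whose output or error changed."""
    recorded = [json.loads(line) for line in jsonl.splitlines() if line.strip()]
    fresh = run_calls(registry, [(r["tool"], r["args"]) for r in recorded])
    return [
        i
        for i, (old, new) in enumerate(zip(recorded, fresh))
        if (old["output"], old["error"]) != (new.output, new.error)
    ]
```

Replay in this sketch re-executes the recorded calls and flags steps whose output or error changed, so it only verifies deterministic tools. Tools that depend on the network or wall-clock time would need their outputs served from the trace instead of re-executed.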
Planned measurements
- Task scores, where the grader supports scoring.
- Latency and cost, where the run infrastructure can measure them.
- Output-quality notes and failure-mode labels.
- Known caveats and reproducibility requirements.
- Tool-use failures, retry behavior, and trace-level diagnosis.
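One way to roll per-attempt records into per-variant measurements, assuming each attempt is stored with optional score, latency, and cost fields plus a retry count and a failure-mode label. The Attempt layout and the summarize function are hypothetical. The main design choice is that a missing score, latency, or cost is excluded from the aggregate rather than counted as zero, matching the conditional wording of those measurements.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass
from statistics import mean, median
from typing import Optional


@dataclass
class Attempt:
    """One task attempt by one model variant (hypothetical record layout)."""
    variant: str                  # e.g. "base", "sft", "dpo"
    task_id: str
    score: Optional[float]        # None when the grader cannot score the task
    latency_s: Optional[float]    # None when latency was not measured
    cost: Optional[float]         # None when cost was not measured
    retries: int                  # tool-call retries within the attempt (assumed definition)
    failure_label: Optional[str]  # e.g. "bad_tool_args"; None if no failure


def summarize(attempts: list) -> dict:
    """Per-variant summary; missing scores, latencies, and costs are excluded, not zeroed."""
    by_variant = defaultdict(list)
    for a in attempts:
        by_variant[a.variant].append(a)

    summary = {}
    for variant, rows in sorted(by_variant.items()):
        scores = [a.score for a in rows if a.score is not None]
        latencies = [a.latency_s for a in rows if a.latency_s is not None]
        costs = [a.cost for a in rows if a.cost is not None]
        summary[variant] = {
            "attempts": len(rows),
            "scored": len(scores),
            "mean_score": mean(scores) if scores else None,
            "median_latency_s": median(latencies) if latencies else None,
            "total_cost": sum(costs) if costs else None,
            "total_retries": sum(a.retries for a in rows),
            "failure_modes": dict(Counter(a.failure_label for a in rows if a.failure_label)),
        }
    return summary
```

Reporting "scored" next to "attempts" keeps it visible when a variant's mean score rests on only a subset of its tasks.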
Planned sections
- Research question and claim boundary
- Setup, model variants, data versions, and config hashes
- Eval suite or task design
- Measurements and failure modes
- Limitations, caveats, and next decision
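The config hashes listed under the setup section can be computed by hashing a canonical serialization of each run config. This sketch assumes configs are JSON-serializable dicts and uses SHA-256 over sorted-key, fixed-separator JSON; config_hash and the 12-character prefix are illustrative choices, not an established convention of this lab.

```python
import hashlib
import json


def config_hash(config: dict, length: int = 12) -> str:
    """Stable short hash of a run config.

    Keys are sorted and separators are fixed, so configs that differ only in
    dict insertion order hash the same. List order and numeric type (0 vs 0.0)
    still change the hash, so normalize those before hashing if they should not.
    """
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:length]
```

Recording this hash next to each trace makes it possible to check that base, SFT, and DPO runs used the same harness settings apart from the model variant.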
Expected artifacts
- An agents module for the tool registry and coding-agent harness.
- Small coding-agent benchmark set.
- Replayable traces.
Claim boundary
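This report will not claim general agent capability. Findings will apply only to the tested base, SFT, and DPO variants in this harness on this task set.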