Open Model Lab
July 2026: Foundation + Eval Harness
Build the basic research infrastructure needed to measure model behavior reliably before any attempt to change that behavior.
Gate status
- Month: 2026-07
- Status: planned
Success criterion
Different open models can be compared on the same tasks and a reproducible report can be generated.
Focus
- Run open models on the same task set.
- Build a small but clean eval set for instruction following, coding, reasoning, factuality, and safety-lite behavior.
- Support both LLM-as-judge and unit-test-based grading.
- Track score, cost, latency, output quality, and failure modes.
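The focus items above can be sketched as a single evaluation loop. This is only a minimal illustration under stated assumptions, not the harness design itself: the names Task, Result, grade, and run_eval are hypothetical, each model is assumed to be a callable that returns (output_text, cost_usd), and the judge is assumed to be a callable that returns a score between 0 and 1. A task carries either a unit-test-style check function or a rubric for an LLM judge, and every run records score, cost, latency, and a failure-mode label.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical types for illustration; the real harness may differ.
ModelFn = Callable[[str], Tuple[str, float]]   # prompt -> (output, cost_usd)
JudgeFn = Callable[[str, str, str], float]     # (prompt, output, rubric) -> score in [0, 1]


@dataclass
class Task:
    task_id: str
    category: str                                  # e.g. "coding", "reasoning"
    prompt: str
    check: Optional[Callable[[str], bool]] = None  # unit-test-style grader
    rubric: Optional[str] = None                   # text handed to an LLM judge


@dataclass
class Result:
    model: str
    task_id: str
    category: str
    score: float
    latency_s: float
    cost_usd: float
    failure_mode: Optional[str]


def grade(task: Task, output: str, judge: Optional[JudgeFn] = None,
          threshold: float = 0.5) -> Tuple[float, Optional[str]]:
    """Return (score, failure_mode). A unit-test check wins over the judge."""
    if task.check is not None:
        try:
            passed = task.check(output)
        except Exception as exc:
            return 0.0, f"grader_error:{type(exc).__name__}"
        return (1.0, None) if passed else (0.0, "wrong_answer")
    if judge is not None and task.rubric is not None:
        score = judge(task.prompt, output, task.rubric)
        failure = None if score >= threshold else "judge_below_threshold"
        return score, failure
    return 0.0, "no_grader"


def run_eval(models: Dict[str, ModelFn], tasks: List[Task],
             judge: Optional[JudgeFn] = None) -> List[Result]:
    """Run every model on every task and record one Result per pair."""
    results: List[Result] = []
    for model_name, generate in models.items():
        for task in tasks:
            start = time.perf_counter()
            try:
                output, cost_usd = generate(task.prompt)
            except Exception as exc:
                # A crashed generation is recorded as a failure, not dropped.
                latency = time.perf_counter() - start
                results.append(Result(model_name, task.task_id, task.category,
                                      0.0, latency, 0.0,
                                      f"model_error:{type(exc).__name__}"))
                continue
            latency = time.perf_counter() - start
            score, failure = grade(task, output, judge)
            results.append(Result(model_name, task.task_id, task.category,
                                  score, latency, cost_usd, failure))
    return results
```

Recording failed generations and grader errors as rows with a failure-mode label, rather than skipping them, keeps the per-model task counts equal, which is what makes the scores comparable across models.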
Expected outputs
- First version of the public Open Model Research Harness repository.
- Model / score / cost / latency / failure-mode report.
- First technical write-up: minimal open-model eval harness.
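As a rough illustration of the model / score / cost / latency / failure-mode report, the sketch below aggregates per-run rows into one summary line per model. The row keys (model, score, cost_usd, latency_s, failure_mode) and the function names summarize and render_report are assumptions made for this example, not a fixed schema. Models are sorted by name so that the same input rows always produce the same report text.

```python
from collections import Counter, defaultdict
from statistics import mean, median
from typing import Dict, List


def summarize(rows: List[dict]) -> List[dict]:
    """Aggregate per-run rows into one summary dict per model."""
    by_model: Dict[str, List[dict]] = defaultdict(list)
    for row in rows:
        by_model[row["model"]].append(row)

    summary = []
    for model in sorted(by_model):  # sorted for a stable, reproducible order
        group = by_model[model]
        failures = Counter(r["failure_mode"] for r in group if r["failure_mode"])
        summary.append({
            "model": model,
            "n": len(group),
            "mean_score": round(mean(r["score"] for r in group), 3),
            "total_cost_usd": round(sum(r["cost_usd"] for r in group), 6),
            "median_latency_s": round(median(r["latency_s"] for r in group), 3),
            "failure_modes": dict(failures),
        })
    return summary


def render_report(summary: List[dict]) -> str:
    """Render the summary as a fixed-width plain-text table."""
    header = f"{'model':<16}{'n':>4}{'score':>8}{'cost_usd':>12}{'p50_lat_s':>11}  failure_modes"
    lines = [header]
    for s in summary:
        # Sort failure modes by name so the text does not depend on row order.
        modes = ", ".join(f"{k}={v}" for k, v in sorted(s["failure_modes"].items())) or "-"
        lines.append(
            f"{s['model']:<16}{s['n']:>4}{s['mean_score']:>8.3f}"
            f"{s['total_cost_usd']:>12.6f}{s['median_latency_s']:>11.3f}  {modes}"
        )
    return "\n".join(lines)
```

Median latency is used here instead of the mean so that a single slow request does not dominate the figure; which statistic the report should actually use is a design choice left open.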
End-of-month decision
Do these outputs make next month's SFT experiments measurable and comparable?