Open Model Lab

July 2026: Foundation + Eval Harness

Build the basic research infrastructure that can measure model behavior reliably before changing it.

Gate status

Month: 2026-07
Status: planned
Report: Minimal Open-Model Eval Harness

Success criterion

Different open models can be compared on the same tasks and a reproducible report can be generated.
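
One way to make "reproducible" checkable is to fingerprint the inputs of a run, so that two reports can be compared only when they share the same task set and settings. The sketch below is a minimal illustration under assumptions: the function name `run_fingerprint` and the config fields (`tasks`, `seed`, `temperature`) are hypothetical and not part of any existing harness.

```python
import hashlib
import json


def run_fingerprint(config: dict) -> str:
    """Return a stable SHA-256 hex digest for a JSON-serializable run config.

    Keys are sorted and whitespace is fixed, so the digest does not depend on
    the order in which the config dictionary was built.
    """
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Hypothetical config layout, for illustration only.
example_config = {
    "tasks": ["coding/add_two_numbers", "reasoning/simple_logic"],
    "seed": 0,
    "temperature": 0.0,
}
fingerprint = run_fingerprint(example_config)
```

Storing the fingerprint next to every report would let a later run confirm that it used the same tasks and decoding settings before its scores are compared.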

Focus

  • Run open models on the same task set.
  • Build a small but clean eval set for instruction following, coding, reasoning, factuality, and safety-lite behavior.
  • Support both LLM-as-judge and unit-test-based grading.
  • Track score, cost, latency, output quality, and failure modes.
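
The focus items above could be sketched roughly as follows. This is a minimal illustration, not the harness itself: every name here (`Task`, `Result`, `unit_test_grader`, `judge_grader`, `run_eval`) is hypothetical, a "model" and a "judge" are assumed to be plain callables from prompt to text, cost is assumed to be a flat per-call figure, and the failure-mode labels are placeholders.

```python
import time
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence, Tuple

Model = Callable[[str], str]     # assumption: prompt in, text out
Grader = Callable[[str], float]  # model output in, score in [0, 1] out


@dataclass
class Task:
    task_id: str
    category: str  # e.g. "coding", "reasoning", "factuality"
    prompt: str
    grader: Grader


@dataclass
class Result:
    model_name: str
    task_id: str
    category: str
    score: float
    latency_s: float
    cost_usd: float
    failure_mode: Optional[str]


def unit_test_grader(func_name: str, cases: Sequence[Tuple[tuple, object]]) -> Grader:
    """Grade generated Python source by calling `func_name` on (args, expected) cases."""
    def grade(output: str) -> float:
        namespace: dict = {}
        try:
            # WARNING: exec runs untrusted code; a real harness needs a sandbox.
            exec(output, namespace)
            fn = namespace[func_name]
            passed = sum(1 for args, expected in cases if fn(*args) == expected)
        except Exception:
            # Syntax errors, missing functions, and crashing cases all score 0.
            return 0.0
        return passed / len(cases)
    return grade


def judge_grader(judge: Model, rubric: str) -> Grader:
    """Grade with an LLM judge that is asked to reply with a number from 0 to 10."""
    def grade(output: str) -> float:
        reply = judge(f"{rubric}\n\nAnswer:\n{output}\n\nReply with a score from 0 to 10.")
        try:
            return max(0.0, min(1.0, float(reply.strip()) / 10))
        except ValueError:
            return 0.0  # unparseable judge reply
    return grade


def run_eval(model_name: str, model: Model, tasks: Sequence[Task],
             cost_per_call_usd: float = 0.0) -> List[Result]:
    """Run every task once and record score, latency, cost, and a failure label."""
    results = []
    for task in tasks:
        failure: Optional[str] = None
        start = time.perf_counter()
        try:
            output = model(task.prompt)
        except Exception as exc:
            output, failure = "", f"model_error:{type(exc).__name__}"
        latency = time.perf_counter() - start
        score = task.grader(output) if failure is None else 0.0
        if failure is None and score == 0.0:
            failure = "zero_score"
        results.append(Result(model_name, task.task_id, task.category,
                              score, latency, cost_per_call_usd, failure))
    return results
```

Keeping both graders behind the same `Grader` signature means a task does not need to know how it is scored, so unit-test-based and judge-based tasks can sit in one task set. A real cost figure would more likely be derived from token counts than from a flat per-call value.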

Expected outputs

  • First version of the public Open Model Research Harness repository.
  • Model / score / cost / latency / failure-mode report.
  • First technical write-up: minimal open-model eval harness.
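
The model / score / cost / latency / failure-mode report could be produced by a small aggregation step like the one below. This is a sketch under assumptions: the row keys (`model`, `score`, `cost_usd`, `latency_s`, `failure_mode`) and the function names `summarize` and `render_report` are hypothetical, and the pipe-separated layout is only one possible output format.

```python
from collections import Counter, defaultdict
from statistics import mean
from typing import Dict, Iterable, List


def summarize(rows: Iterable[dict]) -> Dict[str, dict]:
    """Aggregate per-task rows into one summary entry per model."""
    by_model: Dict[str, List[dict]] = defaultdict(list)
    for row in rows:
        by_model[row["model"]].append(row)

    summary = {}
    for model in sorted(by_model):  # sorted for a stable, repeatable report
        items = by_model[model]
        failures = Counter(r["failure_mode"] for r in items if r["failure_mode"])
        summary[model] = {
            "n": len(items),
            "mean_score": round(mean(r["score"] for r in items), 3),
            "total_cost_usd": round(sum(r["cost_usd"] for r in items), 6),
            "mean_latency_s": round(mean(r["latency_s"] for r in items), 3),
            "failure_modes": dict(sorted(failures.items())),
        }
    return summary


def render_report(summary: Dict[str, dict]) -> str:
    """Render the summary as plain pipe-separated text, one line per model."""
    lines = ["model | n | mean_score | total_cost_usd | mean_latency_s | failure_modes"]
    for model, s in summary.items():
        modes = ", ".join(f"{k}={v}" for k, v in s["failure_modes"].items()) or "none"
        lines.append(
            f"{model} | {s['n']} | {s['mean_score']} | {s['total_cost_usd']} | "
            f"{s['mean_latency_s']} | {modes}"
        )
    return "\n".join(lines)
```

Calling `render_report(summarize(rows))` on the same rows always yields the same text, because models and failure modes are emitted in sorted order.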

End-of-month decision

Do these outputs make next month's SFT experiments measurable and comparable?