Open Model Lab

July 2026: Foundation + Eval Harness

Build the basic research infrastructure needed to measure model behavior reliably before any attempt to change that behavior.

Gate status

Month: 2026-07
Status: planned
Report: Minimal Open-Model Eval Harness

Success criterion

Different open models can be compared on the same tasks and a reproducible report can be generated.

Focus

Run open models on the same task set.
Build a small but clean eval set for instruction following, coding, reasoning, factuality, and safety-lite behavior.
Support both LLM-as-judge and unit-test-based grading.
Track score, cost, latency, output quality, and failure modes.
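
The focus items above can be sketched as a small run loop. This is a minimal illustration under stated assumptions, not the harness itself: the names `Task`, `Result`, `unit_test_grader`, `judge_grader`, and `run_task` are hypothetical, a "model" is assumed to be any callable that returns the output text and its cost, and the judge is assumed to be a callable that answers PASS or FAIL.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional, Sequence, Tuple

# Assumption: a model is any callable from prompt to (output_text, cost_usd).
ModelFn = Callable[[str], Tuple[str, float]]
# Assumption: a grader maps the model output to a score in [0, 1].
Grader = Callable[[str], float]


@dataclass
class Task:
    task_id: str
    category: str  # e.g. "coding", "reasoning", "factuality"
    prompt: str
    grader: Grader


@dataclass
class Result:
    model: str
    task_id: str
    category: str
    score: float
    cost_usd: float
    latency_s: float
    failure_mode: Optional[str]


def unit_test_grader(func_name: str, cases: Sequence[Tuple[tuple, object]]) -> Grader:
    """Score generated code by the fraction of (args, expected) cases it passes."""
    def grade(output: str) -> float:
        namespace: dict = {}
        try:
            # WARNING: exec runs untrusted code; a real harness needs a sandbox.
            exec(output, namespace)
            fn = namespace[func_name]
            passed = sum(1 for args, expected in cases if fn(*args) == expected)
        except Exception:
            return 0.0  # any crash counts as a full failure in this sketch
        return passed / len(cases)
    return grade


def judge_grader(judge: Callable[[str], str], rubric: str) -> Grader:
    """Score an output with a judge callable that replies PASS or FAIL."""
    def grade(output: str) -> float:
        verdict = judge(f"{rubric}\n\nResponse:\n{output}\n\nAnswer PASS or FAIL.")
        return 1.0 if verdict.strip().upper().startswith("PASS") else 0.0
    return grade


def run_task(model_name: str, model: ModelFn, task: Task) -> Result:
    """Run one model on one task and record score, cost, latency, failure mode."""
    start = time.perf_counter()
    try:
        output, cost = model(task.prompt)
        failure = None
    except Exception as exc:
        output, cost = "", 0.0
        failure = f"model_error:{type(exc).__name__}"
    latency = time.perf_counter() - start
    score = task.grader(output) if failure is None else 0.0
    if failure is None and score < 1.0:
        failure = "wrong_output"
    return Result(model_name, task.task_id, task.category, score, cost, latency, failure)


# Usage with a stub model standing in for a real open model.
def stub_model(prompt: str) -> Tuple[str, float]:
    return "def add(a, b):\n    return a + b", 0.0


add_task = Task(
    task_id="coding-001",
    category="coding",
    prompt="Write a Python function add(a, b) that returns the sum.",
    grader=unit_test_grader("add", [((1, 2), 3), ((0, 0), 0)]),
)
result = run_task("stub", stub_model, add_task)
```

Putting both graders behind the same `Grader` signature lets every task category share one run loop, so each model sees exactly the same prompts and scoring path.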

Expected outputs

First version of the public Open Model Research Harness repository.
Model / score / cost / latency / failure-mode report.
First technical write-up: minimal open-model eval harness.
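
One way the model / score / cost / latency / failure-mode report could be assembled is sketched below. It assumes each run is stored as a dictionary with the hypothetical keys `model`, `score`, `cost_usd`, `latency_s`, and `failure_mode`. The function name `build_report` and the pipe-separated layout are illustrative choices, not the repository's actual format.

```python
from collections import Counter, defaultdict
from statistics import mean


def build_report(results):
    """Aggregate per-task records into one deterministic text line per model."""
    by_model = defaultdict(list)
    for record in results:
        by_model[record["model"]].append(record)

    lines = ["model | mean_score | total_cost_usd | mean_latency_s | failure_modes"]
    # Sort models and failure modes so the same inputs always give the same text.
    for model in sorted(by_model):
        rows = by_model[model]
        failures = Counter(r["failure_mode"] for r in rows if r["failure_mode"])
        failure_text = ", ".join(f"{k}={v}" for k, v in sorted(failures.items())) or "none"
        lines.append(
            f"{model} | {mean(r['score'] for r in rows):.3f} | "
            f"{sum(r['cost_usd'] for r in rows):.4f} | "
            f"{mean(r['latency_s'] for r in rows):.3f} | {failure_text}"
        )
    return "\n".join(lines)
```

Sorting the keys and fixing the number formats keeps the output byte-identical across runs on the same records, which is what makes the report diffable between models and months.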

End-of-month decision

Do these outputs make next month's SFT experiments measurable and comparable?
