Eval-first
Model behavior should be measured before it is modified.
Open Model Lab
A month-by-month public plan for building an eval-first research-engineering harness around open models, post-training, agents, safety, monitorability, and systems efficiency.
Base, SFT, DPO, and agent variants should be compared on the same task suite when possible.
Every report states what it can and cannot claim.
Planned dashboards and reports stay planned until runs, configs, datasets, and reports exist.
Build the basic research infrastructure that can measure model behavior reliably before changing it.
Success criterion: Different open models can be compared on the same tasks and a reproducible report can be generated.
End-of-month decision: Does this output make the next month's SFT experiments measurable and comparable?
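One way to picture the success criterion above is a small harness that runs every model on the same task list and emits a report tagged with a hash of that task list. This is a minimal sketch, not the project's harness: the task format, the exact-match scoring, and the names `TASKS`, `suite_fingerprint`, `evaluate`, `model_a`, and `model_b` are all assumptions made for illustration, and the "models" are plain Python functions standing in for real inference calls.

```python
import hashlib
import json

# Hypothetical task format: an id, a prompt, and one expected answer.
TASKS = [
    {"id": "add-1", "prompt": "2+2=", "expected": "4"},
    {"id": "cap-1", "prompt": "Capital of France?", "expected": "Paris"},
]


def suite_fingerprint(tasks):
    """Hash the task suite so a report records exactly which tasks were run."""
    blob = json.dumps(tasks, sort_keys=True).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()[:12]


def evaluate(models, tasks):
    """Run every model on every task and return an exact-match report."""
    report = {"suite": suite_fingerprint(tasks), "n_tasks": len(tasks), "models": {}}
    for name, generate in models.items():
        per_task = {
            t["id"]: generate(t["prompt"]).strip() == t["expected"] for t in tasks
        }
        report["models"][name] = {
            "accuracy": sum(per_task.values()) / len(tasks),
            "per_task": per_task,
        }
    return report


# Stand-in "models": lookup tables, used only to keep the sketch runnable.
def model_a(prompt):
    return {"2+2=": "4", "Capital of France?": "Paris"}.get(prompt, "")


def model_b(prompt):
    return {"2+2=": "4"}.get(prompt, "")


report = evaluate({"model_a": model_a, "model_b": model_b}, TASKS)
print(json.dumps(report, indent=2, sort_keys=True))
```

Because the report carries the suite fingerprint, two reports are comparable only when their `suite` values match, which is the property the later SFT experiments would rely on.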
Measure the behavioral difference between a base model and an instruction-tuned model.
Success criterion: The work does not stop at fine-tuning; it shows measurable behavior changes, including regressions.
End-of-month decision: Is the SFT checkpoint reliable enough to serve as the reference for preference optimization?
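Reporting regressions alongside gains can be sketched as a per-task diff between two checkpoints scored on the same tasks. The function name `behavior_diff`, the pass/fail dictionaries, and the task ids below are hypothetical; real results would come from the eval harness rather than hand-written values.

```python
def behavior_diff(base, tuned):
    """Compare per-task pass/fail results for two checkpoints on shared tasks."""
    shared = sorted(set(base) & set(tuned))
    improved = [t for t in shared if not base[t] and tuned[t]]
    regressed = [t for t in shared if base[t] and not tuned[t]]
    net = (len(improved) - len(regressed)) / len(shared) if shared else 0.0
    return {
        "n_shared": len(shared),
        "improved": improved,
        "regressed": regressed,
        "net_change": net,
    }


# Made-up results for illustration only.
base_results = {"t1": False, "t2": True, "t3": True, "t4": False}
sft_results = {"t1": True, "t2": True, "t3": False, "t4": True}

diff = behavior_diff(base_results, sft_results)
print(diff)
```

Listing the regressed task ids, rather than only the net change, keeps a fine-tune that trades one capability for another from looking like a uniform improvement.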
Use preference data to improve SFT behavior in a more controlled way, then measure behavioral side effects.
Success criterion: The report clearly shows where post-training helps and where it creates risk.
End-of-month decision: Is DPO improving real behavior or only optimizing preference style?
Move the model from passive question-answering into a tool-using agent setup.
Success criterion: Agent behavior is measured step by step with failure modes, not only success/failure.
End-of-month decision: Can agent traces explain why the model succeeds or fails?
Measure agent success on multi-step tasks and build a taxonomy of failures.
Success criterion: A reliable evaluation system categorizes why the agent fails.
End-of-month decision: Do the evals reveal actionable failure patterns rather than just pass/fail rates?
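A failure taxonomy of this kind could start as a rule-based labeler over agent traces, with the label counts reported next to the pass rate. The trace schema (`success`, `steps`, `error`), the label names, and the step budget below are assumptions for illustration; a real taxonomy would be derived from observed traces.

```python
from collections import Counter


def classify_trace(trace, max_steps=10):
    """Assign one coarse label to an agent trace. The rules are illustrative."""
    steps = trace["steps"]
    if trace["success"]:
        return "success"
    if any(s.get("error") == "unknown_tool" for s in steps):
        return "invalid_tool_call"
    if any(s.get("error") for s in steps):
        return "tool_error_not_recovered"
    if len(steps) >= max_steps:
        return "step_budget_exhausted"
    return "wrong_final_answer"


# Made-up traces covering each branch of the labeler.
traces = [
    {"success": True, "steps": [{"tool": "search"}]},
    {"success": False, "steps": [{"tool": "serach", "error": "unknown_tool"}]},
    {"success": False, "steps": [{"tool": "search", "error": "timeout"}]},
    {"success": False, "steps": [{"tool": "search"}] * 10},
    {"success": False, "steps": [{"tool": "search"}]},
]

counts = Counter(classify_trace(t) for t in traces)
for label, n in sorted(counts.items()):
    print(f"{label}: {n}")
```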
Measure safety behavior by quality and balance, not just refusal rate.
Success criterion: The model's behavior is measured on both risky and harmless requests.
End-of-month decision: Can the safety eval distinguish good refusal from lazy or excessive refusal?
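The gap between refusal rate and refusal balance can be made concrete by scoring risky and harmless prompts separately. In this sketch the record fields `harmful` and `refused`, and the function name `refusal_balance`, are hypothetical; judging refusal quality (good versus lazy refusals) would need an additional grader that is not shown here.

```python
def refusal_balance(records):
    """Split refusal behavior into under-refusal and over-refusal rates."""
    harmful = [r for r in records if r["harmful"]]
    benign = [r for r in records if not r["harmful"]]

    def rate(hits, group):
        return hits / len(group) if group else 0.0

    return {
        # Share of all prompts refused: the number that is easy to game.
        "refusal_rate": rate(sum(r["refused"] for r in records), records),
        # Risky prompts the model answered anyway.
        "under_refusal": rate(sum(not r["refused"] for r in harmful), harmful),
        # Harmless prompts the model refused.
        "over_refusal": rate(sum(r["refused"] for r in benign), benign),
    }


# A made-up model that refuses everything, on two risky and two harmless prompts.
refuse_all = [
    {"harmful": True, "refused": True},
    {"harmful": True, "refused": True},
    {"harmful": False, "refused": True},
    {"harmful": False, "refused": True},
]

stats = refusal_balance(refuse_all)
print(stats)
```

A model that refuses every request has zero under-refusal but an over-refusal rate of 1.0, which a single refusal-rate number would not reveal.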
Evaluate not only final answers but also where reasoning processes break down.
Success criterion: The system can flag flawed reasoning behind correct answers and identify the step at which a wrong answer went wrong.
End-of-month decision: Do process signals add useful diagnostic value beyond final-answer grading?
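Whether process signals add diagnostic value can be checked with a cross-tabulation of process validity against answer correctness, plus the index of the first failing step. The record fields `step_ok` and `answer_ok` are assumed to come from some per-step grader that is outside this sketch, and the function names are hypothetical.

```python
from collections import Counter


def first_breakpoint(step_ok):
    """Return the index of the first invalid reasoning step, or None if all pass."""
    for i, ok in enumerate(step_ok):
        if not ok:
            return i
    return None


def process_answer_table(records):
    """Count records by (process valid, final answer correct)."""
    table = Counter()
    for r in records:
        process_ok = first_breakpoint(r["step_ok"]) is None
        table[(process_ok, r["answer_ok"])] += 1
    return table


# Made-up graded records for illustration only.
records = [
    {"step_ok": [True, True, True], "answer_ok": True},    # sound and correct
    {"step_ok": [True, False, True], "answer_ok": True},   # flawed but correct
    {"step_ok": [True, True, False], "answer_ok": False},  # breaks at step 2
    {"step_ok": [True, True], "answer_ok": False},         # steps pass, answer wrong
]

table = process_answer_table(records)
print("flawed process, correct answer:", table[(False, True)])
print("first breakpoint of record 2:", first_breakpoint(records[2]["step_ok"]))
```

If the `(False, True)` and `(True, False)` cells stay near zero, final-answer grading already captures most of the signal and the process grader adds little.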
Start analyzing model behavior with internal signals in addition to external scores.
Success criterion: Behavior changes can be tracked not only from outputs but also from model-internal measurements.
End-of-month decision: Are internal signals useful enough to guide later experiments?
Enter multimodal model work through screenshot and UI-understanding tasks.
Success criterion: The model's ability to understand visual interfaces is measured task-by-task.
End-of-month decision: Can the eval distinguish OCR success from actual UI reasoning?
Add systems and profiling knowledge to make research experiments more efficient.
Success criterion: The main bottlenecks slowing model research infrastructure can be measured and improved.
End-of-month decision: Which bottlenecks most affect experiment velocity and cost?
Measure whether better data mixtures produce better behavior with the same compute.
Success criterion: The effect of data quality on behavior and compute efficiency is quantified.
End-of-month decision: Does data quality improve score/GPU-hour enough to justify the filtering pipeline?
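The score/GPU-hour comparison in the decision above only holds up if the cost of the filtering pipeline is charged to the filtered run. The sketch below uses made-up numbers and a hypothetical `score_per_gpu_hour` helper; it assumes a single scalar eval score and that filtering cost can be expressed in GPU-hours.

```python
def score_per_gpu_hour(score, train_gpu_hours, filter_gpu_hours=0.0):
    """Eval score divided by total compute, including any data-filtering cost."""
    total = train_gpu_hours + filter_gpu_hours
    if total <= 0:
        raise ValueError("total GPU-hours must be positive")
    return score / total


# Made-up numbers: same training budget, filtering adds 20 GPU-hours.
baseline = score_per_gpu_hour(score=0.60, train_gpu_hours=100)
filtered = score_per_gpu_hour(score=0.66, train_gpu_hours=100, filter_gpu_hours=20)

print(f"baseline: {baseline:.4f} score per GPU-hour")
print(f"filtered: {filtered:.4f} score per GPU-hour")
print("filtering justified on this metric:", filtered > baseline)
```

In this invented case the filtered mixture scores higher in absolute terms but lower per GPU-hour, so the answer depends on whether the filtering cost is amortized across several runs.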
Turn the 12 months of work into a presentable, reproducible, publishable portfolio.
Success criterion: An open-source project demonstrates an end-to-end open-model research-engineering loop.
End-of-month decision: Is the project understandable, reproducible, and credible to an external research-engineering reader?