Open Model Lab

Open Model Lab 12-Month Plan

A month-by-month public plan for building an eval-first research-engineering harness around open models, post-training, agents, safety, monitorability, and systems efficiency.

Plan principles

Eval-first

Model behavior should be measured before it is modified.

Same-suite comparison

Base, SFT, DPO, and agent variants should be compared on the same task suite when possible.
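As a minimal sketch of the same-suite principle, a comparison can be restricted to the tasks that every variant was actually run on. The function name, the results layout (variant name mapped to per-task scores), and the example scores are illustrative assumptions, not part of the plan.

```python
from statistics import mean


def compare_on_shared_suite(results):
    """Average each variant's score over the tasks all variants share.

    `results` maps a variant name to {task_id: score}; this layout is an
    assumption made for illustration.
    """
    # Tasks missing from any variant are excluded from the comparison.
    shared = sorted(set.intersection(*(set(scores) for scores in results.values())))
    if not shared:
        raise ValueError("variants share no tasks, so no comparison is possible")
    summary = {name: mean(scores[t] for t in shared) for name, scores in results.items()}
    return summary, shared


# Made-up scores: "t3" and "t4" are dropped because only one variant ran them.
results = {
    "base": {"t1": 0.0, "t2": 1.0, "t3": 1.0},
    "sft": {"t1": 1.0, "t2": 1.0, "t4": 0.0},
}
summary, shared = compare_on_shared_suite(results)
print(shared, summary)
```

Reporting the shared task list alongside the averages makes it visible when the "when possible" qualifier applied and part of the suite was left out.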

Claim boundaries

Every report states what it can and cannot claim.

No unsupported progress

Planned dashboards and reports stay planned until runs, configs, datasets, and reports exist.
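One way to enforce this principle mechanically is to derive a gate's status from the artifacts that exist rather than setting it by hand. The sketch below is hypothetical: the artifact layout, the function name, and the "complete" label are assumptions (the plan itself only uses "planned").

```python
REQUIRED_ARTIFACTS = ("runs", "configs", "datasets", "reports")


def gate_status(artifacts):
    """Return ("planned", missing_kinds) until every required artifact exists.

    `artifacts` maps an artifact kind to a list of paths or IDs (assumed layout).
    "complete" is a hypothetical label for a gate with nothing missing.
    """
    missing = [kind for kind in REQUIRED_ARTIFACTS if not artifacts.get(kind)]
    return ("planned", missing) if missing else ("complete", [])


# Illustrative entries only: runs and configs exist, datasets and reports do not.
status, missing = gate_status({"runs": ["run-001"], "configs": ["eval.yaml"]})
print(status, missing)
```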

Monthly gates

July 2026

Foundation + Eval Harness

Build the basic research infrastructure that can measure model behavior reliably before changing it.

Success criterion: Different open models can be compared on the same tasks and a reproducible report can be generated.

End-of-month decision: Does this output make the next month's SFT experiments measurable and comparable?

planned

August 2026

SFT Pipeline + Data Quality

Measure the behavioral difference between a base model and an instruction-tuned model.

Success criterion: The work does not stop at producing a fine-tuned checkpoint; the report shows measurable behavior changes relative to the base model, including regressions.

End-of-month decision: Is the SFT checkpoint reliable enough to serve as the reference for preference optimization?

planned
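The August criterion asks for regressions to be reported next to improvements. A minimal sketch, assuming per-task scores for the base and tuned checkpoints are available as dictionaries (the task names, scores, and tolerance parameter are made up for illustration):

```python
def behavior_diff(base, tuned, tol=0.0):
    """Bucket shared tasks by how the tuned score moved relative to base.

    `tol` is an assumed noise tolerance; changes within it count as unchanged.
    """
    diff = {"improved": [], "regressed": [], "unchanged": []}
    for task in sorted(set(base) & set(tuned)):
        delta = tuned[task] - base[task]
        if delta > tol:
            diff["improved"].append(task)
        elif delta < -tol:
            diff["regressed"].append(task)
        else:
            diff["unchanged"].append(task)
    return diff


# Hypothetical per-task scores for a base model and its SFT checkpoint.
base = {"follow_format": 0.2, "arithmetic": 0.8, "refuse_harmful": 0.5}
tuned = {"follow_format": 0.9, "arithmetic": 0.6, "refuse_harmful": 0.5}
diff = behavior_diff(base, tuned)
print(diff)
```

Listing regressed tasks by name, rather than reporting only an aggregate delta, keeps a gain on one behavior from hiding a loss on another.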

September 2026

Preference Optimization / DPO

Use preference data to improve SFT behavior in a more controlled way, then measure behavioral side effects.

Success criterion: The report clearly shows where post-training helps and where it creates risk.

End-of-month decision: Is DPO improving real behavior or only optimizing preference style?

planned

October 2026

Agent Harness / Tool Use

Move the model from passive question-answering into a tool-using agent setup.

Success criterion: Agent behavior is measured step by step with failure modes, not only success/failure.

End-of-month decision: Can agent traces explain why the model succeeds or fails?

planned

November 2026

Agent Evals + Long-Horizon Tasks

Measure agent success rates and build a failure taxonomy on multi-step tasks.

Success criterion: A reliable evaluation system categorizes why the agent fails.

End-of-month decision: Do the evals reveal actionable failure patterns rather than just pass/fail rates?

planned
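For the November gate, a report that goes beyond pass/fail could tally failures by category. The sketch below assumes each episode record carries a `success` flag and an optional `failure_category` label; the field names and the example categories are hypothetical, since the plan does not define a taxonomy.

```python
from collections import Counter


def failure_breakdown(episodes):
    """Return the pass rate and failure counts per category, most common first."""
    failures = [e for e in episodes if not e["success"]]
    # Failures without a label are kept visible instead of being dropped.
    counts = Counter(e.get("failure_category") or "uncategorized" for e in failures)
    pass_rate = 1 - len(failures) / len(episodes) if episodes else 0.0
    return pass_rate, counts.most_common()


# Made-up episodes with hypothetical category names.
episodes = [
    {"success": True},
    {"success": False, "failure_category": "wrong_tool"},
    {"success": False, "failure_category": "wrong_tool"},
    {"success": False},
]
pass_rate, breakdown = failure_breakdown(episodes)
print(pass_rate, breakdown)
```

A large "uncategorized" bucket is itself a signal that the taxonomy does not yet cover the observed failures.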

December 2026

Safety / Red Teaming / Refusal Quality

Measure safety behavior by quality and balance, not just refusal rate.

Success criterion: The model's behavior is measured on both risky and harmless requests.

End-of-month decision: Can the safety eval distinguish good refusal from lazy or excessive refusal?

planned
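The balance part of the December criterion can be sketched as two separate refusal rates, one on risky prompts and one on harmless prompts. The record fields below are assumptions, and this sketch covers only balance; judging whether a given refusal is good or lazy would need a separate grader.

```python
def refusal_balance(records):
    """Compute refusal rates separately for risky and harmless prompts.

    Each record is assumed to have boolean "risky" and "refused" fields.
    A rate is None when its group is empty.
    """
    def rate(group):
        return sum(r["refused"] for r in group) / len(group) if group else None

    risky = [r for r in records if r["risky"]]
    harmless = [r for r in records if not r["risky"]]
    return {
        "risky_refusal_rate": rate(risky),        # low values suggest under-refusal
        "harmless_refusal_rate": rate(harmless),  # high values suggest over-refusal
    }


# Made-up records: 2 risky prompts (1 refused), 4 harmless prompts (1 refused).
records = (
    [{"risky": True, "refused": True}, {"risky": True, "refused": False}]
    + [{"risky": False, "refused": True}]
    + [{"risky": False, "refused": False}] * 3
)
balance = refusal_balance(records)
print(balance)
```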

January 2027

Reasoning Behavior + Process Evaluation

Evaluate not only final answers but also where reasoning processes break down.

Success criterion: The system can flag flawed reasoning that still reaches a correct answer, and can locate the step at which reasoning breaks down before a wrong answer.

End-of-month decision: Do process signals add useful diagnostic value beyond final-answer grading?

planned
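The January criterion separates two cases: a correct answer reached through a flawed process, and a wrong answer with an identifiable breakpoint. A minimal sketch, assuming some grader has already marked each reasoning step as valid or invalid (the label strings and function names are hypothetical):

```python
def first_breakpoint(step_valid):
    """Index of the first step judged invalid, or None if every step passes."""
    return next((i for i, ok in enumerate(step_valid) if not ok), None)


def classify(answer_correct, step_valid):
    """Combine final-answer grading with per-step validity into one label."""
    bp = first_breakpoint(step_valid)
    if answer_correct:
        label = "sound" if bp is None else "right_answer_flawed_process"
    else:
        label = "wrong_answer_no_flagged_step" if bp is None else "wrong_answer_with_breakpoint"
    return label, bp


# A correct final answer whose second step (index 1) was judged invalid.
print(classify(True, [True, False, True]))
# A wrong final answer where no step was flagged, so process grading adds nothing here.
print(classify(False, [True, True]))
```

The share of items landing in the two mixed labels is one way to estimate how much diagnostic value process signals add beyond final-answer grading.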

February 2027

Monitorability / Interpretability Start

Start analyzing model behavior with internal signals in addition to external scores.

Success criterion: Behavior changes can be tracked not only from outputs but also from model-internal measurements.

End-of-month decision: Are internal signals useful enough to guide later experiments?

planned

March 2027

Multimodal / UI Understanding / Computer Use

Begin multimodal model work with screenshot and UI-understanding tasks.

Success criterion: The model's ability to understand visual interfaces is measured task-by-task.

End-of-month decision: Can the eval distinguish OCR success from actual UI reasoning?

planned

April 2027

Training / Inference Systems Efficiency

Apply systems and profiling knowledge to make research experiments more efficient.

Success criterion: The main bottlenecks slowing model research infrastructure can be measured and improved.

End-of-month decision: Which bottlenecks most affect experiment velocity and cost?

planned

May 2027

Data Efficiency + Scaling Ladder

Measure whether better data mixtures produce better behavior with the same compute.

Success criterion: The effect of data quality on behavior and compute efficiency is quantified.

End-of-month decision: Does data quality improve score per GPU-hour enough to justify the filtering pipeline?

planned
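The May decision comes down to simple arithmetic once the cost of filtering is charged to the filtered run. The sketch below assumes a single scalar eval score and GPU-hour counts for training and filtering; all numbers are invented for illustration and are not results.

```python
def score_per_gpu_hour(score, train_gpu_hours, filter_gpu_hours=0.0):
    """Eval score divided by total GPU-hours, including any filtering cost."""
    total = train_gpu_hours + filter_gpu_hours
    if total <= 0:
        raise ValueError("total GPU-hours must be positive")
    return score / total


# Invented numbers: the filtered run trains on less data but pays for filtering.
unfiltered = score_per_gpu_hour(0.60, train_gpu_hours=100.0)
filtered = score_per_gpu_hour(0.70, train_gpu_hours=80.0, filter_gpu_hours=20.0)
print(unfiltered, filtered, filtered > unfiltered)
```

Counting filtering compute in the denominator is the design choice that makes the comparison fair: a filtering pipeline that costs more than it saves shows up as a lower ratio.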

June 2027

Final Integration + Public Portfolio

Turn the 12 months of work into a presentable, reproducible, publishable portfolio.

Success criterion: An open-source project demonstrates an end-to-end open-model research-engineering loop.

End-of-month decision: Is the project understandable, reproducible, and credible to an external research-engineering reader?

planned