Foundation + Eval Harness
Different open models can be compared on the same tasks and a reproducible report can be generated.
Planned report: Minimal Open-Model Eval Harness
Open Model Lab
A compact timeline of planned gates, report targets, and status from July 2026 through June 2027.
The work does not stop at fine-tuning; it shows measurable behavior changes, including regressions.
Planned report: Base vs SFT Behavior Change
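A base-versus-SFT comparison that surfaces regressions as well as gains could be sketched as a per-task diff. This is a minimal illustration, not the project's harness: the function name `behavior_diff`, the task ids, and the pass/fail values are all hypothetical.

```python
from typing import Dict, List


def behavior_diff(base: Dict[str, bool], tuned: Dict[str, bool]) -> Dict[str, List[str]]:
    """Split the task ids present in both runs into improved, regressed, unchanged."""
    diff: Dict[str, List[str]] = {"improved": [], "regressed": [], "unchanged": []}
    # Only tasks evaluated for both models are comparable.
    for task_id in sorted(base.keys() & tuned.keys()):
        if tuned[task_id] and not base[task_id]:
            diff["improved"].append(task_id)
        elif base[task_id] and not tuned[task_id]:
            diff["regressed"].append(task_id)
        else:
            diff["unchanged"].append(task_id)
    return diff


# Made-up pass/fail results, for illustration only.
base_results = {"t1": True, "t2": False, "t3": True}
sft_results = {"t1": True, "t2": True, "t3": False}

report = behavior_diff(base_results, sft_results)
print(report)
```

Reporting the three buckets separately keeps a regression on one task from being hidden by a gain on another, which an aggregate accuracy delta alone would not show.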
The report clearly shows where post-training helps and where it creates risk.
Planned report: DPO Behavior Impact
Agent behavior is measured step by step and labeled with failure modes, not only scored as success or failure.
Planned report: Post-Training and Agentic Coding Behavior
A reliable evaluation system categorizes why the agent fails.
Planned report: Coding Agent Failure Taxonomy
The model's behavior is measured on both risky and harmless requests.
Planned report: Refusal and Jailbreak Evaluation
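Measuring both sides of refusal behavior might reduce to two rates: how often the model complies with risky requests, and how often it refuses harmless ones. The sketch below assumes each response has already been labeled; the `Record` type, its field names, and the sample data are hypothetical, and how the labels are produced (human review, a classifier) is out of scope here.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List


@dataclass
class Record:
    risky: bool    # assumed label: the request is risky
    refused: bool  # assumed label: the model refused


def _rate(group: List[Record], pred: Callable[[Record], bool]) -> float:
    """Fraction of records in the group matching pred; NaN for an empty group."""
    return sum(pred(r) for r in group) / len(group) if group else float("nan")


def refusal_rates(records: Iterable[Record]) -> Dict[str, float]:
    records = list(records)
    risky = [r for r in records if r.risky]
    harmless = [r for r in records if not r.risky]
    return {
        # Risky requests the model answered anyway.
        "risky_compliance_rate": _rate(risky, lambda r: not r.refused),
        # Harmless requests the model refused (over-refusal).
        "harmless_refusal_rate": _rate(harmless, lambda r: r.refused),
    }


# Made-up labels, for illustration only.
sample = [
    Record(risky=True, refused=True),
    Record(risky=True, refused=False),
    Record(risky=False, refused=False),
    Record(risky=False, refused=True),
    Record(risky=False, refused=False),
    Record(risky=False, refused=False),
]
rates = refusal_rates(sample)
print(rates)
```

Keeping the two rates separate matters because a model can lower one simply by raising the other, for example by refusing everything.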
The system can flag runs whose process is flawed even though the final answer is correct, and can identify the step at which a run that ends in a wrong answer breaks down.
Planned report: Outcome vs Process Evaluation
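One way to separate outcome from process is to label each run on both axes and record the first step that fails. This is a sketch under the assumption that every step already carries a pass/fail judgment; `first_breakpoint`, `classify`, and the example runs are hypothetical names and data.

```python
from collections import Counter
from typing import List, Optional, Tuple


def first_breakpoint(step_ok: List[bool]) -> Optional[int]:
    """Index of the first failing step, or None if every step passed."""
    for index, ok in enumerate(step_ok):
        if not ok:
            return index
    return None


def classify(answer_correct: bool, step_ok: List[bool]) -> Tuple[str, Optional[int]]:
    """Label a run as '<outcome>/<process>' and report its first breakpoint."""
    breakpoint_index = first_breakpoint(step_ok)
    outcome = "correct" if answer_correct else "wrong"
    process = "sound" if breakpoint_index is None else "flawed"
    return f"{outcome}/{process}", breakpoint_index


# Made-up runs: (final answer correct?, per-step judgments).
runs = [
    (True, [True, True, True]),
    (True, [True, False, True]),
    (False, [True, True, False]),
]
labels = [classify(answer, steps) for answer, steps in runs]
counts = Counter(label for label, _ in labels)
print(labels)
print(counts)
```

The `correct/flawed` bucket is the one an outcome-only metric cannot see, so it is worth counting on its own.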
Behavior changes can be tracked not only from outputs but also from model-internal measurements.
Planned report: Failure Prediction Probes
The model's ability to understand visual interfaces is measured task-by-task.
Planned report: UI Understanding Evaluation
The main bottlenecks that slow model research infrastructure can be measured and reduced.
Planned report: Open-Model Systems Bottlenecks
The effect of data quality on behavior and compute efficiency is quantified.
Planned report: Score per GPU-Hour
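A score-per-GPU-hour figure could be computed as an evaluation score divided by the GPU-hours spent on training. The definition below is an assumption for illustration (GPU-hours taken as GPU count times wall-clock hours); the planned report may define the metric differently, and the run names and numbers are made up.

```python
from typing import Dict, Tuple


def gpu_hours(num_gpus: int, wall_clock_hours: float) -> float:
    """Assumed definition: number of GPUs times wall-clock training hours."""
    return num_gpus * wall_clock_hours


def score_per_gpu_hour(score: float, num_gpus: int, wall_clock_hours: float) -> float:
    hours = gpu_hours(num_gpus, wall_clock_hours)
    if hours <= 0:
        raise ValueError("GPU-hours must be positive")
    return score / hours


# Hypothetical runs: (eval score, GPUs, wall-clock hours).
training_runs: Dict[str, Tuple[float, int, float]] = {
    "raw_data": (0.60, 8, 10.0),
    "filtered_data": (0.66, 8, 5.5),
}
efficiency = {
    name: score_per_gpu_hour(score, gpus, hours)
    for name, (score, gpus, hours) in training_runs.items()
}
for name, value in efficiency.items():
    print(f"{name}: {value:.4f} score per GPU-hour")
```

Comparing runs this way only makes sense when the score comes from the same evaluation set and the GPUs are of the same type.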
An open-source project demonstrates an end-to-end open-model research-engineering loop.
Planned report: Final Technical Report