July 2026: Foundation + Eval Harness
Build the basic research infrastructure that can measure model behavior reliably before changing it.
Open Model Lab
Detailed month pages for the planned Open Model Research Harness gates.
Build the basic research infrastructure that can measure model behavior reliably before changing it.
Measure the behavioral difference between a base model and an instruction-tuned model.
Use preference data to improve SFT behavior in a more controlled way, then measure behavioral side effects.
Move the model from passive question-answering into a tool-using agent setup.
Measure agent success rates and build a failure taxonomy on multi-step tasks.
Measure safety behavior by quality and balance, not just refusal rate.
Evaluate not only final answers but also where reasoning processes break down.
Start analyzing model behavior with internal signals in addition to external scores.
Enter multimodal model work through screenshot and UI-understanding tasks.
Add systems and profiling knowledge to make research experiments more efficient.
Measure whether better data mixtures produce better behavior with the same compute.
Turn the 12 months of work into a presentable, reproducible, publishable portfolio.