Foundation + Eval Harness
Build the basic research infrastructure that can measure model behavior reliably before changing it.
Open Model Lab
A public workspace for tracking open-model evaluation, post-training, agentic behavior, safety, monitorability, and systems-efficiency experiments.
The first milestone is to measure model behavior reliably before changing it. The July gate is complete only when at least three open models can be evaluated on the same small task suite, with score, cost, latency, output quality, and failure modes reported for each.
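As a rough illustration, the July gate condition could be checked mechanically along the following lines. This is a minimal sketch, not the lab's harness: the `ModelReport` class, its field names, and the `gate_passed` helper are all hypothetical, and only the "at least three models, same suite, five reported dimensions" rule comes from the gate description.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Set

REQUIRED_MODELS = 3  # "at least three open models", per the gate description


@dataclass
class ModelReport:
    """One model's report on one task suite (hypothetical field names)."""
    model: str
    suite_id: str
    score: Optional[float] = None
    cost_usd: Optional[float] = None
    latency_s: Optional[float] = None
    output_quality: Optional[str] = None
    failure_modes: Optional[Dict[str, int]] = None

    def is_complete(self) -> bool:
        # A report counts only if every required dimension was recorded.
        required = (self.score, self.cost_usd, self.latency_s,
                    self.output_quality, self.failure_modes)
        return all(value is not None for value in required)


def gate_passed(reports: List[ModelReport]) -> bool:
    """True if some single suite has complete reports for enough distinct models."""
    models_by_suite: Dict[str, Set[str]] = {}
    for report in reports:
        if report.is_complete():
            models_by_suite.setdefault(report.suite_id, set()).add(report.model)
    return any(len(models) >= REQUIRED_MODELS
               for models in models_by_suite.values())
```

Grouping by suite before counting models is the point of the sketch: three models evaluated on three different suites would not satisfy the "same small task suite" requirement.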
Build the basic research infrastructure that can measure model behavior reliably before changing it.
Measure the behavioral difference between a base model and an instruction-tuned model.
Use preference data to improve SFT behavior in a more controlled way, then measure behavioral side effects.
Move the model from passive question-answering into a tool-using agent setup.
Measure agent success and failure taxonomy on multi-step tasks.
Measure safety behavior by quality and balance, not just refusal rate.
Evaluate not only final answers, but where reasoning processes break down.
Start analyzing model behavior with internal signals in addition to external scores.
Enter multimodal model work through screenshot and UI-understanding tasks.
Add systems and profiling knowledge to make research experiments more efficient.
Measure whether better data mixtures produce better behavior with the same compute.
Turn the 12 months of work into a presentable, reproducible, publishable portfolio.
Reproducible task execution, grading, metrics, and failure labels.
Instruction data cleaning, supervised fine-tuning, and behavior comparison.
Chosen/rejected pairs, DPO training, and side-effect measurement.
Tool registry, file operations, test execution, and replayable traces.
Multi-step coding tasks and trace-level failure taxonomy.
Refusal quality, over-refusal, under-refusal, jailbreak robustness, and safe completion.
Outcome scoring compared with observable process and self-correction signals.
Internal-signal probes for failure prediction and representation drift.
Screenshot QA, OCR plus reasoning, UI grounding, and visual failure modes.
Latency, throughput, batching, KV cache, quantization, and profiling.
Data mixtures, filtering, small scaling ladders, and score/GPU-hour.
Reproducible scripts, reports, dashboard, README, and portfolio packaging.
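The score per GPU-hour metric named in the data-mixture theme can be read in more than one way. The sketch below assumes the simplest reading, an eval score divided by total GPU-hours (number of GPUs times wall-clock hours). The function name and parameters are hypothetical, and a real comparison might instead use score gain over a baseline.

```python
def score_per_gpu_hour(score: float, num_gpus: int, wall_clock_hours: float) -> float:
    """Eval score divided by total GPU-hours consumed (assumed definition)."""
    gpu_hours = num_gpus * wall_clock_hours
    if gpu_hours <= 0:
        raise ValueError("GPU-hours must be positive")
    return score / gpu_hours
```

Comparing two data mixtures at equal compute then reduces to comparing this ratio for runs with the same `num_gpus` and `wall_clock_hours`.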
Planned report for the first eval-first gate. No public runs yet.
A documented schema for small task suites and graders.
A reproducible structure for run ids, configs, scores, costs, latency, and caveats.
A controlled vocabulary for output, grader, safety, and agent failures.
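One way the run structure and the controlled failure vocabulary might fit together is sketched below. Everything here is an assumption for illustration: the `RunRecord` fields mirror the items listed above (run id, config, score, cost, latency, caveats), while the specific `FailureLabel` values are invented examples for the four families (output, grader, safety, agent) and are not the lab's actual vocabulary.

```python
import json
from dataclasses import asdict, dataclass, field
from enum import Enum
from typing import Any, Dict, List


class FailureLabel(str, Enum):
    """Hypothetical labels, one or two per failure family."""
    OUTPUT_WRONG_ANSWER = "output/wrong_answer"
    OUTPUT_FORMAT_ERROR = "output/format_error"
    GRADER_FALSE_POSITIVE = "grader/false_positive"
    SAFETY_OVER_REFUSAL = "safety/over_refusal"
    AGENT_TOOL_MISUSE = "agent/tool_misuse"


@dataclass
class RunRecord:
    run_id: str
    config: Dict[str, Any]
    score: float
    cost_usd: float
    latency_s: float
    caveats: List[str] = field(default_factory=list)
    failures: List[Any] = field(default_factory=list)

    def __post_init__(self) -> None:
        # Coerce each entry to FailureLabel; an unknown label raises ValueError,
        # which is what keeps the vocabulary controlled.
        self.failures = [FailureLabel(label) for label in self.failures]

    def to_json(self) -> str:
        # Serialize with sorted keys so identical runs produce identical text.
        data = asdict(self)
        data["failures"] = [label.value for label in self.failures]
        return json.dumps(data, sort_keys=True)
```

Rejecting unknown labels at construction time, rather than at analysis time, means a typo in a failure label fails the run write instead of silently creating a new category.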
The full July 2026 through June 2027 research-engineering plan.
A compact month-by-month view of gates, themes, and report targets.
Detailed monthly gates, focus areas, expected outputs, and decisions.
Planned and published report pages with explicit claim boundaries.
Task schema, initial categories, graders, metrics, and failure taxonomy.
Future run registry and run schema. No public runs yet.
Future model cards and model categories. No model cards yet.
Planned eval, instruction, preference, agent, and safety datasets.
Planned dashboards. No placeholder charts or synthetic metrics.
Public decision log for naming, eval-first scope, and claim boundaries.
Concise definitions for terms used throughout the lab section.