Open Model Lab

A public workspace for tracking open-model evaluation, post-training, agentic behavior, safety, monitorability, and systems-efficiency experiments.

Status facts

Project: Open Model Research Harness
Repository: GitHub
Timeframe: July 2026 - June 2027
Current status: Planned
First gate: July 2026 eval harness
Claims: No model-quality claims yet

Current gate: July 2026 - Foundation + Eval Harness

Before changing model behavior, the first milestone is to measure it reliably. The July gate is complete only when at least three open models can be evaluated on the same small task suite with score, cost, latency, output-quality, and failure-mode reporting.
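
As a rough illustration of the gate condition above, the check could be sketched as follows. This is a minimal sketch, not the harness's actual code: the `ModelReport` class, its field names, and the `gate_passed` helper are all hypothetical, since no report schema has been published yet.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical field names standing in for the five reporting dimensions
# named in the gate: score, cost, latency, output quality, failure modes.
REQUIRED_FIELDS = ("score", "cost_usd", "latency_s", "output_quality", "failure_modes")
MIN_MODELS = 3  # "at least three open models"


@dataclass
class ModelReport:
    model: str
    suite: str
    score: Optional[float] = None
    cost_usd: Optional[float] = None
    latency_s: Optional[float] = None
    output_quality: Optional[float] = None
    failure_modes: Optional[dict] = None  # label -> count; {} means none observed


def gate_passed(reports: list[ModelReport]) -> bool:
    """Return True when at least MIN_MODELS distinct models have a complete
    report (every required field present) on one shared task suite."""
    complete_by_suite: dict[str, set[str]] = {}
    for report in reports:
        if all(getattr(report, name) is not None for name in REQUIRED_FIELDS):
            complete_by_suite.setdefault(report.suite, set()).add(report.model)
    return any(len(models) >= MIN_MODELS for models in complete_by_suite.values())
```

Grouping by suite reflects the "same small task suite" requirement: three complete reports spread across different suites would not satisfy the gate under this reading.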

12-month timeline

July 2026

Foundation + Eval Harness

Build the basic research infrastructure that can measure model behavior reliably before changing it.

planned
August 2026

SFT Pipeline + Data Quality

Measure the behavioral difference between a base model and an instruction-tuned model.

planned
September 2026

Preference Optimization / DPO

Use preference data to refine the SFT model's behavior in a controlled way, then measure any behavioral side effects.

planned
October 2026

Agent Harness / Tool Use

Move the model from passive question-answering into a tool-using agent setup.

planned

Research modules

planned

Eval Harness

Reproducible task execution, grading, metrics, and failure labels.

planned

SFT Pipeline

Instruction data cleaning, supervised fine-tuning, and behavior comparison.

planned

Preference Optimization / DPO

Chosen/rejected pairs, DPO training, and side-effect measurement.

planned

Agent Harness

Tool registry, file operations, test execution, and replayable traces.

planned

Long-Horizon Agent Evals

Multi-step coding tasks and trace-level failure taxonomy.

planned

Safety / Red Teaming

Refusal quality, over-refusal, under-refusal, jailbreak robustness, and safe completion.

planned

Reasoning Process Evaluation

Outcome scoring compared with observable process and self-correction signals.

planned

Monitorability

Internal-signal probes for failure prediction and representation drift.

planned

Multimodal UI Understanding

Screenshot QA, OCR plus reasoning, UI grounding, and visual failure modes.

planned

Systems Efficiency

Latency, throughput, batching, KV cache, quantization, and profiling.

planned

Data Efficiency / Scaling

Data mixtures, filtering, small scaling ladders, and score per GPU-hour.

planned

Final Integration

Reproducible scripts, reports, dashboard, README, and portfolio packaging.
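
For the Preference Optimization / DPO module listed above, the per-pair objective can be sketched in a few lines. This assumes the standard DPO loss, `-log(sigmoid(beta * margin))`, where the margin compares how much the policy and a frozen reference model each prefer the chosen response over the rejected one. The function name, argument names, and the default `beta` are illustrative choices, not settings taken from this project.

```python
import math


def dpo_pair_loss(policy_chosen_logp: float, policy_rejected_logp: float,
                  ref_chosen_logp: float, ref_rejected_logp: float,
                  beta: float = 0.1) -> float:
    """Standard DPO loss for one chosen/rejected pair.

    Each argument is a sequence log-probability (sum of token log-probs)
    of a response given the prompt, under the policy or the reference model.
    """
    margin = ((policy_chosen_logp - ref_chosen_logp)
              - (policy_rejected_logp - ref_rejected_logp))
    x = beta * margin
    # -log(sigmoid(x)) equals softplus(-x); branch to avoid overflow in exp().
    if x >= 0:
        return math.log1p(math.exp(-x))
    return -x + math.log1p(math.exp(x))
```

When the policy matches the reference, the margin is zero and the loss equals `log(2)`; it falls as the policy shifts probability toward the chosen response relative to the reference. Side-effect measurement would then rerun the eval suite on the trained policy, which is outside this sketch.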

Latest public artifacts

Planned

July eval harness report

Planned report for the first eval-first gate. No public runs yet.

Planned

Eval task schema

A documented schema for small task suites and graders.

Planned

Run storage format

A reproducible structure for run ids, configs, scores, costs, latency, and caveats.

Planned

Failure mode taxonomy

A controlled vocabulary for output, grader, safety, and agent failures.
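
To make the three planned artifacts above (task schema, run storage format, failure taxonomy) more concrete, here is one possible shape for them. Every class, field, and label in this sketch is hypothetical. It only mirrors the elements named in the descriptions (run ids, configs, scores, costs, latency, caveats, and the output, grader, safety, and agent failure categories) and is not the lab's published schema.

```python
import json
from dataclasses import asdict, dataclass, field
from enum import Enum


class FailureMode(Enum):
    # Illustrative labels, one per category named in the taxonomy description.
    OUTPUT_WRONG_ANSWER = "output/wrong_answer"
    GRADER_FALSE_NEGATIVE = "grader/false_negative"
    SAFETY_OVER_REFUSAL = "safety/over_refusal"
    AGENT_TOOL_MISUSE = "agent/tool_misuse"


@dataclass
class EvalTask:
    task_id: str
    category: str
    prompt: str
    grader: str      # name of the grading rule, e.g. "exact_match"
    reference: str   # expected answer the grader compares against


@dataclass
class RunRecord:
    run_id: str
    model: str
    config: dict
    scores: dict                                   # task_id -> score
    cost_usd: float
    latency_s: dict                                # task_id -> seconds
    failures: dict = field(default_factory=dict)   # task_id -> list[FailureMode]
    caveats: list = field(default_factory=list)

    def to_json(self) -> str:
        """Serialize the record with failure labels stored as plain strings."""
        data = asdict(self)
        data["failures"] = {
            task_id: [mode.value for mode in modes]
            for task_id, modes in self.failures.items()
        }
        return json.dumps(data, sort_keys=True)


# Usage with made-up values, for illustration only.
task = EvalTask("arith-001", "arithmetic", "What is 2 + 2?", "exact_match", "4")
run = RunRecord(
    run_id="run-0001",
    model="example-model",
    config={"temperature": 0.0},
    scores={task.task_id: 0.0},
    cost_usd=0.001,
    latency_s={task.task_id: 0.8},
    failures={task.task_id: [FailureMode.OUTPUT_WRONG_ANSWER]},
    caveats=["illustrative values, not a real run"],
)
print(run.to_json())
```

Keeping failure labels in an enum rather than free text is one way to enforce a controlled vocabulary: an unknown label fails at construction time instead of silently entering the run registry.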

Claim boundary

Planned scaffolding only. This section tracks a public learning and research-engineering process, and its planned pages and empty dashboards are placeholders, not results. A result will be marked as published only after its underlying run, config, dataset, and report are available.

12-month plan

The full July 2026 through June 2027 research-engineering plan.

Timeline

A compact month-by-month view of gates, themes, and report targets.

Months

Detailed monthly gates, focus areas, expected outputs, and decisions.

Reports

Planned and published report pages with explicit claim boundaries.

Evals

Task schema, initial categories, graders, metrics, and failure taxonomy.

Runs

Future run registry and run schema. No public runs yet.

Models

Future model cards and model categories. No model cards yet.

Datasets

Planned eval, instruction, preference, agent, and safety datasets.

Dashboards

Planned dashboards. No placeholder charts or synthetic metrics.

Decisions

Public decision log for naming, eval-first scope, and claim boundaries.

Glossary

Concise definitions for terms used throughout the lab section.