Open Model Lab

A public workspace for tracking open-model evaluation, post-training, agentic behavior, safety, monitorability, and systems-efficiency experiments.

Status facts

Project: Open Model Research Harness
Repository: GitHub
Timeframe: July 2026 - June 2027
Current status: Planned
First gate: July 2026 eval harness
Claims: No model-quality claims yet

Current gate: July 2026 - Foundation + Eval Harness

Before changing model behavior, the first milestone is to measure it reliably. The July gate is complete only when at least three open models can be evaluated on the same small task suite with score, cost, latency, output-quality, and failure-mode reporting.
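One way to make the gate condition above checkable is sketched below. It is only an illustration under stated assumptions: the `ModelReport` class, its field names, and `july_gate_passed` are hypothetical and are not part of any published harness. The sketch treats a report as complete when all five reporting dimensions are present, and treats the gate as passed when at least three distinct models have complete reports on the same suite.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical field names; the text only lists the five reporting dimensions.
REQUIRED_FIELDS = ("score", "cost_usd", "latency_s", "output_quality", "failure_modes")


@dataclass
class ModelReport:
    model: str
    suite: str
    score: Optional[float] = None
    cost_usd: Optional[float] = None
    latency_s: Optional[float] = None
    output_quality: Optional[float] = None
    failure_modes: Optional[dict] = None  # failure label -> count; {} means none observed

    def is_complete(self) -> bool:
        # A dimension counts as reported when it is not None.
        return all(getattr(self, name) is not None for name in REQUIRED_FIELDS)


def july_gate_passed(reports, min_models=3):
    """True when at least `min_models` distinct models have complete reports on one shared suite."""
    models_by_suite = {}
    for report in reports:
        if report.is_complete():
            models_by_suite.setdefault(report.suite, set()).add(report.model)
    return any(len(models) >= min_models for models in models_by_suite.values())
```

Grouping by suite matters here: three complete reports on three different suites would not satisfy the "same small task suite" condition.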


12-month timeline

July 2026

Foundation + Eval Harness

Build the basic research infrastructure that can measure model behavior reliably before changing it.

planned

August 2026

SFT Pipeline + Data Quality

Measure the behavioral difference between a base model and an instruction-tuned model.

planned

September 2026

Preference Optimization / DPO

Use preference data to refine the SFT model's behavior in a more controlled way, then measure behavioral side effects.

planned

October 2026

Agent Harness / Tool Use

Move the model from passive question-answering into a tool-using agent setup.

planned

November 2026

Agent Evals + Long-Horizon Tasks

Measure agent success rates and classify failures with a taxonomy on multi-step tasks.

planned

December 2026

Safety / Red Teaming / Refusal Quality

Measure safety behavior by quality and balance, not just refusal rate.

planned

January 2027

Reasoning Behavior + Process Evaluation

Evaluate not only final answers but also where reasoning processes break down.

planned

February 2027

Monitorability / Interpretability Start

Start analyzing model behavior with internal signals in addition to external scores.

planned

March 2027

Multimodal / UI Understanding / Computer Use

Enter multimodal model work through screenshot and UI-understanding tasks.

planned

April 2027

Training / Inference Systems Efficiency

Add systems and profiling knowledge to make research experiments more efficient.

planned

May 2027

Data Efficiency + Scaling Ladder

Measure whether better data mixtures produce better behavior with the same compute.

planned

June 2027

Final Integration + Public Portfolio

Turn the 12 months of work into a presentable, reproducible, publishable portfolio.

planned

Research modules

planned

Eval Harness

Reproducible task execution, grading, metrics, and failure labels.

planned

SFT Pipeline

Instruction data cleaning, supervised fine-tuning, and behavior comparison.

planned

Preference Optimization / DPO

Chosen/rejected pairs, DPO training, and side-effect measurement.

planned

Agent Harness

Tool registry, file operations, test execution, and replayable traces.

planned

Long-Horizon Agent Evals

Multi-step coding tasks and trace-level failure taxonomy.

planned

Safety / Red Teaming

Refusal quality, over-refusal, under-refusal, jailbreak robustness, and safe completion.

planned

Reasoning Process Evaluation

Outcome scoring compared with observable process and self-correction signals.

planned

Monitorability

Internal-signal probes for failure prediction and representation drift.

planned

Multimodal UI Understanding

Screenshot QA, OCR plus reasoning, UI grounding, and visual failure modes.

planned

Systems Efficiency

Latency, throughput, batching, KV cache, quantization, and profiling.

planned

Data Efficiency / Scaling

Data mixtures, filtering, small scaling ladders, and score/GPU-hour.

planned

Final Integration

Reproducible scripts, reports, dashboard, README, and portfolio packaging.
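To illustrate why the Safety / Red Teaming module listed above looks at balance rather than refusal rate alone, here is a minimal sketch. The `SafetyCase` class and `refusal_metrics` function are hypothetical names, and the sketch assumes each prompt comes with a dataset label saying whether it should be refused. Over-refusal is counted as refusing a benign prompt and under-refusal as complying with a harmful one.

```python
from dataclasses import dataclass


@dataclass
class SafetyCase:
    harmful: bool  # assumed dataset label: should this prompt be refused?
    refused: bool  # did the model refuse?


def refusal_metrics(cases):
    """Return the overall refusal rate plus over- and under-refusal rates."""
    benign = [c for c in cases if not c.harmful]
    harmful = [c for c in cases if c.harmful]

    def rate(hits, total):
        # None signals "not measurable" when a group is empty.
        return hits / total if total else None

    return {
        "refusal_rate": rate(sum(c.refused for c in cases), len(cases)),
        "over_refusal_rate": rate(sum(c.refused for c in benign), len(benign)),
        "under_refusal_rate": rate(sum(not c.refused for c in harmful), len(harmful)),
    }
```

Two models can share the same overall refusal rate while one refuses the wrong prompts, which is the distinction the separate over- and under-refusal rates are meant to expose.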

Latest public artifacts

Planned

July eval harness report

Planned report for the first eval-first gate. No public runs yet.

Planned

Eval task schema

A documented schema for small task suites and graders.

Planned

Run storage format

A reproducible structure for run ids, configs, scores, costs, latency, and caveats.

Planned

Failure mode taxonomy

A controlled vocabulary for output, grader, safety, and agent failures.
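The three planned artifacts above (task schema, run storage format, failure mode taxonomy) could fit together roughly as sketched below. Every name here is an assumption made for illustration: the `EvalTask` and `RunRecord` fields, the `exact_match` grader, and the four `FailureMode` labels (one per category named above) are hypothetical, since none of these artifacts has been published yet.

```python
import json
from dataclasses import dataclass, field, asdict
from enum import Enum


class FailureMode(str, Enum):
    # Illustrative labels only, one per category: output, grader, safety, agent.
    OUTPUT_WRONG_ANSWER = "output/wrong_answer"
    GRADER_FALSE_NEGATIVE = "grader/false_negative"
    SAFETY_OVER_REFUSAL = "safety/over_refusal"
    AGENT_TOOL_MISUSE = "agent/tool_misuse"


@dataclass
class EvalTask:
    task_id: str
    prompt: str
    expected: str
    grader: str = "exact_match"  # name of the grader to apply


def exact_match(task: EvalTask, output: str) -> float:
    """Simplest possible grader: 1.0 on an exact (whitespace-trimmed) match."""
    return 1.0 if output.strip() == task.expected else 0.0


@dataclass
class RunRecord:
    run_id: str
    config: dict
    scores: dict      # task_id -> score in [0, 1]
    cost_usd: float
    latency_s: float
    failures: dict = field(default_factory=dict)  # task_id -> failure label
    caveats: list = field(default_factory=list)

    def label_failure(self, task_id: str, mode: str) -> None:
        # FailureMode(...) raises ValueError for labels outside the vocabulary.
        self.failures[task_id] = FailureMode(mode).value

    def to_json(self) -> str:
        # Sorted keys keep the stored file stable across runs.
        return json.dumps(asdict(self), sort_keys=True)
```

Routing every label through the enum is one way to keep the vocabulary controlled: an unknown label fails loudly at write time instead of silently entering stored runs.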

Claim boundary

Planned scaffolding only. This section tracks a public learning and research-engineering process. Planned pages and empty dashboards are placeholders, not results. Results will be marked as published only after the underlying run, config, dataset, and report are available.
