Open Model Research Harness

Started: Jul 1, 2026
Updated: Jul 4, 2026

Gathering feedback — open an issue with your thoughts.

Roadmap progress: 0 / 12 (0%, effort-weighted)

July 2026: Foundation + Eval Harness — planned
August 2026: SFT Pipeline + Data Quality — planned
September 2026: Preference Optimization / DPO — planned
October 2026: Agent Harness / Tool Use — planned
November 2026: Agent Evals + Long-Horizon Tasks — planned
December 2026: Safety / Red Teaming / Refusal Quality — planned
January 2027: Reasoning Behavior + Process Evaluation — planned
February 2027: Monitorability / Interpretability Start — planned
March 2027: Multimodal / UI Understanding / Computer Use — planned
April 2027: Training / Inference Systems Efficiency — planned
May 2027: Data Efficiency + Scaling Ladder — planned
June 2027: Final Integration + Public Portfolio — planned

Status: Planned / starts July 2026

Timeframe: July 2026 - June 2027

Scope: Open models, small experiments, reproducible evaluation, public reporting

Claim boundary: No frontier-model capability claim

What it is

Open Model Research Harness is a 12-month public research-engineering project for studying open LLMs through reproducible evaluations, post-training experiments, agentic task harnesses, safety evaluations, monitorability probes, and systems-efficiency measurements.

The goal is not to claim frontier-model capability. The goal is to build and document the eval-first research workflow needed to understand how model behavior changes across SFT, preference optimization, tool use, safety constraints, and inference/training efficiency work.

The project is inspired by frontier-lab research workflows, but intentionally scoped to open models, small experiments, reproducible evaluation, and public reporting.

It is a modular research harness and public portfolio for:

evals
SFT
preference optimization / DPO
agent harnesses
long-horizon agent evals
safety and refusal quality
reasoning process evaluation
monitorability / interpretability probes
multimodal UI understanding
systems efficiency
data efficiency and scaling ladders

Why it exists

Open-model work is easy to overclaim when the evaluation loop is weak. A fine-tune, preference run, or agent demo can look useful in isolation while hiding regressions, safety failures, trace-level mistakes, or systems bottlenecks.

This project makes the measuring system the first deliverable. The July 2026 gate is not a model release; it is a small reproducible eval harness that can compare multiple open models on the same tasks and report score, cost, latency, output quality, and failure modes.
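
A minimal sketch of what one result record and a per-model summary could look like, assuming a flat record per model and task. The `EvalRecord` schema, its field names, and the `summarize` helper are hypothetical illustrations, not the project's actual eval schema.

```python
from collections import Counter
from dataclasses import dataclass
from statistics import mean
from typing import Optional


@dataclass
class EvalRecord:
    """One model's result on one task (hypothetical schema)."""
    model: str
    task_id: str
    score: float                        # 0.0-1.0, from whatever grader is used
    cost_usd: float                     # estimated cost of the run
    latency_s: float                    # wall-clock seconds
    failure_mode: Optional[str] = None  # label such as "format_error"; None if no failure


def summarize(records: list[EvalRecord]) -> dict[str, dict]:
    """Aggregate mean score, total cost, mean latency, and failure counts per model."""
    by_model: dict[str, list[EvalRecord]] = {}
    for r in records:
        by_model.setdefault(r.model, []).append(r)
    return {
        model: {
            "n_tasks": len(rs),
            "mean_score": mean(r.score for r in rs),
            "total_cost_usd": sum(r.cost_usd for r in rs),
            "mean_latency_s": mean(r.latency_s for r in rs),
            "failure_modes": dict(Counter(r.failure_mode for r in rs if r.failure_mode)),
        }
        for model, rs in by_model.items()
    }
```

Keeping failure modes as counted labels next to the score is one way to stop an average from hiding how a model fails; output quality would need its own grader and is left out of this sketch.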

How it differs

The project is not structured as a leaderboard or benchmark marketing page. It is a public engineering record: each planned report has a research question, a setup, planned measurements, expected artifacts, and a claim boundary. Empty dashboards stay empty until real runs exist.

The work also connects several layers that are often presented separately: eval design, post-training, agent traces, safety behavior, monitorability probes, and systems profiling. The point is to show how behavior changes across the loop, not to claim that small open models match larger private systems.

What it is not

It is not a claim that I run a frontier AI lab.
It is not a leaderboard.
It is not a benchmark marketing page.
It is not a claim that small open models match frontier models.
It is not a large-scale pretraining project.

Research loop

Measure

Measure model behavior before changing it.

Modify

Modify behavior with SFT, DPO, tools, or constraints.

Compare

Compare base/SFT/DPO/agent variants on the same eval suite.

Diagnose

Diagnose failures with taxonomy and traces.

Report

Report results with claim boundaries.
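
The Compare step can be sketched as a per-task diff between two variants scored on the same suite. This is an illustrative helper under assumed inputs (a dict of task id to score per variant); `compare_variants` and its `tolerance` parameter are hypothetical names, not part of the project.

```python
def compare_variants(
    baseline: dict[str, float],
    variant: dict[str, float],
    tolerance: float = 0.0,
) -> dict:
    """Compare per-task scores of two variants evaluated on the same task suite."""
    if not baseline:
        raise ValueError("task suite is empty")
    if baseline.keys() != variant.keys():
        raise ValueError("variants must be scored on the same task suite")
    # Positive delta means the variant scored higher than the baseline on that task.
    deltas = {task: variant[task] - baseline[task] for task in baseline}
    return {
        "mean_delta": sum(deltas.values()) / len(deltas),
        "regressions": sorted(t for t, d in deltas.items() if d < -tolerance),
        "improvements": sorted(t for t, d in deltas.items() if d > tolerance),
    }
```

Listing regressions per task matters because a mean delta of zero can still hide one task that improved and another that broke.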

Roadmap

July 2026

Foundation + Eval Harness

Build the basic research infrastructure that can measure model behavior reliably before changing it.

planned Report

August 2026

SFT Pipeline + Data Quality

Measure the behavioral difference between a base model and an instruction-tuned model.

planned Report

September 2026

Preference Optimization / DPO

Use preference data to improve SFT behavior in a more controlled way, then measure behavioral side effects.

planned Report

October 2026

Agent Harness / Tool Use

Move the model from passive question-answering into a tool-using agent setup.

planned Report

November 2026

Agent Evals + Long-Horizon Tasks

Measure agent success rates on multi-step tasks and classify failures with a taxonomy.

planned Report

December 2026

Safety / Red Teaming / Refusal Quality

Measure safety behavior by quality and balance, not just refusal rate.

planned Report

January 2027

Reasoning Behavior + Process Evaluation

Evaluate not only final answers, but where reasoning processes break down.

planned Report

February 2027

Monitorability / Interpretability Start

Start analyzing model behavior with internal signals in addition to external scores.

planned Report

March 2027

Multimodal / UI Understanding / Computer Use

Enter multimodal model work through screenshot and UI-understanding tasks.

planned Report

April 2027

Training / Inference Systems Efficiency

Add systems and profiling knowledge to make research experiments more efficient.

planned Report

May 2027

Data Efficiency + Scaling Ladder

Measure whether better data mixtures produce better behavior with the same compute.

planned Report

June 2027

Final Integration + Public Portfolio

Turn the 12 months of work into a presentable, reproducible, publishable portfolio.

planned Report
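
As an illustration of the December 2026 goal of measuring balance rather than refusal rate alone, the sketch below separates over-refusal from under-refusal. It assumes each example carries a reference label and a graded model behavior; `SafetyExample` and `refusal_metrics` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SafetyExample:
    should_refuse: bool  # reference label for the prompt
    did_refuse: bool     # model behavior, as judged by some grader


def refusal_metrics(examples: list[SafetyExample]) -> dict[str, Optional[float]]:
    """Return overall refusal rate plus over- and under-refusal rates."""
    benign = [e for e in examples if not e.should_refuse]
    harmful = [e for e in examples if e.should_refuse]

    def rate(hits: int, total: int) -> Optional[float]:
        # None signals that the subset is empty, so the rate is undefined.
        return hits / total if total else None

    return {
        "refusal_rate": rate(sum(e.did_refuse for e in examples), len(examples)),
        # Refused although the prompt was benign.
        "over_refusal_rate": rate(sum(e.did_refuse for e in benign), len(benign)),
        # Complied although the prompt should have been refused.
        "under_refusal_rate": rate(sum(not e.did_refuse for e in harmful), len(harmful)),
    }
```

Two models with the same overall refusal rate can differ sharply on the other two numbers, which is why the sketch reports all three.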

Module map

planned

Eval Harness

Reproducible task execution, grading, metrics, and failure labels.

planned

SFT Pipeline

Instruction data cleaning, supervised fine-tuning, and behavior comparison.

planned

Preference Optimization / DPO

Chosen/rejected pairs, DPO training, and side-effect measurement.

planned

Agent Harness

Tool registry, file operations, test execution, and replayable traces.

planned

Long-Horizon Agent Evals

Multi-step coding tasks and trace-level failure taxonomy.

planned

Safety / Red Teaming

Refusal quality, over-refusal, under-refusal, jailbreak robustness, and safe completion.

planned

Reasoning Process Evaluation

Outcome scoring compared with observable process and self-correction signals.

planned

Monitorability

Internal-signal probes for failure prediction and representation drift.

planned

Multimodal UI Understanding

Screenshot QA, OCR plus reasoning, UI grounding, and visual failure modes.

planned

Systems Efficiency

Latency, throughput, batching, KV cache, quantization, and profiling.

planned

Data Efficiency / Scaling

Data mixtures, filtering, small scaling ladders, and score/GPU-hour.

planned

Final Integration

Reproducible scripts, reports, dashboard, README, and portfolio packaging.
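
As one illustration of the Agent Harness module above, a tool registry with replayable traces might be sketched as follows. The `ToolRegistry` class and `replay` function are hypothetical, and the sketch assumes deterministic tools whose arguments and results are JSON-serializable.

```python
import json
from typing import Any, Callable


class ToolRegistry:
    """Maps tool names to callables and records every call in a trace."""

    def __init__(self) -> None:
        self._tools: dict[str, Callable[..., Any]] = {}
        self.trace: list[dict] = []

    def register(self, name: str, fn: Callable[..., Any]) -> None:
        self._tools[name] = fn

    def call(self, name: str, **kwargs: Any) -> Any:
        if name not in self._tools:
            raise KeyError(f"unknown tool: {name}")
        result = self._tools[name](**kwargs)
        # Record the call so the run can be inspected or replayed later.
        self.trace.append({"tool": name, "args": kwargs, "result": result})
        return result

    def dump_trace(self) -> str:
        return json.dumps(self.trace)


def replay(registry: ToolRegistry, trace_json: str) -> bool:
    """Re-run each recorded call on a registry; True if all results match."""
    for step in json.loads(trace_json):
        if registry.call(step["tool"], **step["args"]) != step["result"]:
            return False
    return True
```

Replaying against a fresh registry is a simple reproducibility check; non-deterministic tools would need recorded results to be substituted rather than re-executed.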

Status

As of 2026-07-04, Open Model Research Harness is planned. There are no public runs, model cards, dataset cards, benchmark results, latency numbers, cost numbers, or model-quality claims yet.

The first gate is the July 2026 Foundation + Eval Harness milestone. It is complete only when at least three open models can be evaluated on the same small task suite with score, cost, latency, output-quality, and failure-mode reporting.
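
The completion criterion above can be expressed as a small check. This is a sketch under assumptions: the nested `reports` layout (model to task id to reported fields) and the field names in `REQUIRED_FIELDS` are hypothetical stand-ins for whatever the eval schema ends up using.

```python
# Hypothetical field names mirroring the five reporting dimensions in the gate.
REQUIRED_FIELDS = ("score", "cost", "latency", "output_quality", "failure_modes")
MIN_MODELS = 3


def gate_passed(reports: dict[str, dict[str, dict]]) -> bool:
    """Check the July gate: enough models, same task suite, all fields reported."""
    if len(reports) < MIN_MODELS:
        return False
    task_sets = [set(tasks) for tasks in reports.values()]
    # Every model must cover the same non-empty set of tasks.
    if not task_sets[0] or any(ts != task_sets[0] for ts in task_sets):
        return False
    return all(
        field in fields
        for tasks in reports.values()
        for fields in tasks.values()
        for field in REQUIRED_FIELDS
    )
```

The check only confirms that every dimension is reported; it makes no judgment about whether the reported values are good.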

Where it lives

The public lab section is Open Model Lab. The plan, eval schema, report scaffolds, and empty run/model/dataset/dashboard registries live there until real artifacts exist.

12-month plan: The full July 2026 through June 2027 research-engineering plan.
Reports: Planned and published report pages with explicit claim boundaries.
Evals: Task schema, initial categories, graders, metrics, and failure taxonomy.
Runs: Future run registry and run schema. No public runs yet.

Tags: open-model-evals, post-training, agent-harnesses, safety-evals, monitorability, systems-efficiency