Open Model Research Harness

A 12-month public research-engineering project for reproducible open-model evals, post-training, agents, safety, monitorability, and systems efficiency.

Design stage. No working code yet.

Started: Jul 1, 2026
Updated: Jul 4, 2026

Gathering feedback — open an issue with your thoughts.

Roadmap: 0 / 12 (0%, effort-weighted)
  • July 2026: Foundation + Eval Harness — planned
  • August 2026: SFT Pipeline + Data Quality — planned
  • September 2026: Preference Optimization / DPO — planned
  • October 2026: Agent Harness / Tool Use — planned
  • November 2026: Agent Evals + Long-Horizon Tasks — planned
  • December 2026: Safety / Red Teaming / Refusal Quality — planned
  • January 2027: Reasoning Behavior + Process Evaluation — planned
  • February 2027: Monitorability / Interpretability Start — planned
  • March 2027: Multimodal / UI Understanding / Computer Use — planned
  • April 2027: Training / Inference Systems Efficiency — planned
  • May 2027: Data Efficiency + Scaling Ladder — planned
  • June 2027: Final Integration + Public Portfolio — planned
Status: Planned / starts July 2026
Timeframe: July 2026 - June 2027
Scope: Open models, small experiments, reproducible evaluation, public reporting
Claim boundary: No frontier-model capability claim

What it is

Open Model Research Harness is a 12-month public research-engineering project for studying open LLMs through reproducible evaluations, post-training experiments, agentic task harnesses, safety evaluations, monitorability probes, and systems-efficiency measurements.

The goal is not to claim frontier-model capability. The goal is to build and document the eval-first research workflow needed to understand how model behavior changes across SFT, preference optimization, tool use, safety constraints, and inference/training efficiency work.

The project is inspired by frontier-lab research workflows, but intentionally scoped to open models, small experiments, reproducible evaluation, and public reporting.

It is a modular research harness and public portfolio for:

  • evals
  • SFT
  • preference optimization / DPO
  • agent harnesses
  • long-horizon agent evals
  • safety and refusal quality
  • reasoning process evaluation
  • monitorability / interpretability probes
  • multimodal UI understanding
  • systems efficiency
  • data efficiency and scaling ladders

Why it exists

Open-model work is easy to overclaim when the evaluation loop is weak. A fine-tune, preference run, or agent demo can look useful in isolation while hiding regressions, safety failures, trace-level mistakes, or systems bottlenecks.

This project makes the measuring system the first deliverable. The July 2026 gate is not a model release; it is a small reproducible eval harness that can compare multiple open models on the same tasks and report score, cost, latency, output quality, and failure modes.
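
A minimal sketch of what such a harness could record, assuming exact-match grading and stand-in model callables. No harness code exists yet, so `Task`, `EvalRecord`, `grade`, `run_suite`, `summarize`, the failure labels, and the flat per-call cost are all hypothetical names and simplifications rather than the project's schema. Output quality is left out because it needs a rubric or judge that is not specified here.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Task:
    task_id: str
    prompt: str
    expected: str


@dataclass(frozen=True)
class EvalRecord:
    model: str
    task_id: str
    score: float
    latency_s: float
    cost_usd: float
    failure_mode: Optional[str]  # None when the task passed


def grade(output: str, expected: str) -> Tuple[float, Optional[str]]:
    """Exact-match grading with a coarse, assumed failure taxonomy."""
    text = output.strip()
    if text == expected:
        return 1.0, None
    if not text:
        return 0.0, "empty_output"
    return 0.0, "wrong_answer"


def run_suite(
    models: Dict[str, Callable[[str], str]],
    tasks: List[Task],
    cost_per_call_usd: float = 0.0,  # assumption: flat cost per call
) -> List[EvalRecord]:
    """Run every model on the same tasks; one record per (model, task)."""
    records = []
    for name, generate in models.items():
        for task in tasks:
            start = time.perf_counter()
            output = generate(task.prompt)
            latency = time.perf_counter() - start
            score, failure = grade(output, task.expected)
            records.append(
                EvalRecord(name, task.task_id, score, latency, cost_per_call_usd, failure)
            )
    return records


def summarize(records: List[EvalRecord]) -> Dict[str, dict]:
    """Aggregate mean score, total cost, mean latency, and failure counts per model."""
    summary: Dict[str, dict] = {}
    for rec in records:
        row = summary.setdefault(
            rec.model,
            {"n": 0, "score": 0.0, "cost_usd": 0.0, "latency_s": 0.0, "failures": {}},
        )
        row["n"] += 1
        row["score"] += rec.score
        row["cost_usd"] += rec.cost_usd
        row["latency_s"] += rec.latency_s
        if rec.failure_mode is not None:
            row["failures"][rec.failure_mode] = row["failures"].get(rec.failure_mode, 0) + 1
    for row in summary.values():
        row["mean_score"] = row.pop("score") / row["n"]
        row["mean_latency_s"] = row.pop("latency_s") / row["n"]
    return summary
```

Passing plain callables for the models keeps the sketch independent of any inference backend; a real harness would also need to pin seeds, prompts, and decoding settings for the runs to be reproducible.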

How it differs

The project is not structured as a leaderboard or benchmark marketing page. It is a public engineering record: each planned report has a research question, a setup, planned measurements, expected artifacts, and a claim boundary. Empty dashboards stay empty until real runs exist.
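
One way to encode that report structure is a small record type. The sketch below is illustrative only: `ReportScaffold`, its field names, and the `run_ids` list are hypothetical, chosen to mirror the five parts listed above and the rule that nothing is shown as a result until a real run exists.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ReportScaffold:
    research_question: str
    setup: str
    planned_measurements: List[str]
    expected_artifacts: List[str]
    claim_boundary: str
    # Hypothetical link to real runs; stays empty until runs exist.
    run_ids: List[str] = field(default_factory=list)

    def status(self) -> str:
        """A report counts as having results only once a run is attached."""
        return "has_results" if self.run_ids else "planned"

    def missing_fields(self) -> List[str]:
        """Names of required parts that are still empty."""
        required = (
            "research_question",
            "setup",
            "planned_measurements",
            "expected_artifacts",
            "claim_boundary",
        )
        return [name for name in required if not getattr(self, name)]
```

A scaffold with an empty `claim_boundary` would be flagged by `missing_fields()` before it is published, which keeps the claim boundary from being an afterthought.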

The work also connects several layers that are often presented separately: eval design, post-training, agent traces, safety behavior, monitorability probes, and systems profiling. The point is to show how behavior changes across the loop, not to claim that small open models match larger private systems.

What it is not

  • It is not a claim that I run a frontier AI lab.
  • It is not a leaderboard.
  • It is not a benchmark marketing page.
  • It is not a claim that small open models match frontier models.
  • It is not a large-scale pretraining project.

Research loop

  01. Measure: Measure model behavior before changing it.
  02. Modify: Modify behavior with SFT, DPO, tools, or constraints.
  03. Compare: Compare base/SFT/DPO/agent variants on the same eval suite.
  04. Diagnose: Diagnose failures with taxonomy and traces.
  05. Report: Report results with claim boundaries.
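
The Compare step can be sketched as a per-task diff between two variants scored on the same task IDs. This is an illustration under assumptions: `compare_variants`, the dictionary layout, and the `tolerance` parameter are hypothetical, and scores are taken to be higher-is-better.

```python
from typing import Dict, List, Union


def compare_variants(
    baseline: Dict[str, float],
    variant: Dict[str, float],
    tolerance: float = 0.0,
) -> Dict[str, Union[float, List[str]]]:
    """Compare per-task scores of two variants evaluated on the same suite."""
    if not baseline:
        raise ValueError("no tasks to compare")
    if baseline.keys() != variant.keys():
        raise ValueError("variants must be scored on the same task ids")
    deltas = {task: variant[task] - baseline[task] for task in baseline}
    return {
        "mean_delta": sum(deltas.values()) / len(deltas),
        # Tasks that got worse by more than the tolerance.
        "regressions": sorted(t for t, d in deltas.items() if d < -tolerance),
        # Tasks that got better by more than the tolerance.
        "improvements": sorted(t for t, d in deltas.items() if d > tolerance),
    }
```

Reporting the regression list alongside the mean matters because an unchanged or improved average can still hide individual tasks that got worse.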

Roadmap

Module map

Eval Harness (planned)

Reproducible task execution, grading, metrics, and failure labels.

SFT Pipeline (planned)

Instruction data cleaning, supervised fine-tuning, and behavior comparison.

Preference Optimization / DPO (planned)

Chosen/rejected pairs, DPO training, and side-effect measurement.
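
For orientation, the standard DPO objective for a single chosen/rejected pair can be written in a few lines. This is a sketch, not the planned training code: it assumes the caller supplies summed sequence log-probabilities from the policy and from a frozen reference model, and `beta=0.1` is only an illustrative default.

```python
import math


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # assumed illustrative value
) -> float:
    """DPO loss for one pair: -log(sigmoid(beta * margin))."""
    # How much more the policy prefers chosen over rejected than the reference does.
    margin = (policy_chosen_logp - ref_chosen_logp) - (
        policy_rejected_logp - ref_rejected_logp
    )
    z = beta * margin
    # -log(sigmoid(z)) == log(1 + exp(-z)); branch to avoid overflow.
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))
```

When the policy equals the reference the margin is zero and the loss is log 2; the loss falls as the policy shifts probability toward the chosen response relative to the reference.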

Agent Harness (planned)

Tool registry, file operations, test execution, and replayable traces.
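
A tool registry with a replayable trace might look like the following sketch. `ToolRegistry`, `replay`, and the step layout are hypothetical names for illustration, and the sketch assumes tool arguments and results are JSON-serializable and that tools are deterministic.

```python
import json
from typing import Any, Callable, Dict, List


class ToolRegistry:
    """Maps tool names to callables and records every call in an ordered trace."""

    def __init__(self) -> None:
        self._tools: Dict[str, Callable[..., Any]] = {}
        self.trace: List[Dict[str, Any]] = []

    def register(self, name: str, fn: Callable[..., Any]) -> None:
        if name in self._tools:
            raise ValueError(f"tool already registered: {name}")
        self._tools[name] = fn

    def call(self, name: str, **kwargs: Any) -> Dict[str, Any]:
        """Run a tool and append the step to the trace."""
        try:
            result, ok = self._tools[name](**kwargs), True
        except Exception as exc:  # unknown tools and tool errors become trace entries
            result, ok = f"{type(exc).__name__}: {exc}", False
        step = {"tool": name, "args": kwargs, "ok": ok, "result": result}
        self.trace.append(step)
        return step

    def dump_trace(self) -> str:
        """Serialize the trace so it can be stored and replayed later."""
        return json.dumps(self.trace, sort_keys=True)


def replay(trace_json: str, registry: ToolRegistry) -> bool:
    """Re-run each recorded call and report whether every result matches."""
    for recorded in json.loads(trace_json):
        step = registry.call(recorded["tool"], **recorded["args"])
        if (step["ok"], step["result"]) != (recorded["ok"], recorded["result"]):
            return False
    return True
```

Recording failures as trace entries instead of raising keeps the trace complete, which is what trace-level diagnosis needs.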

Long-Horizon Agent Evals (planned)

Multi-step coding tasks and trace-level failure taxonomy.

Safety / Red Teaming (planned)

Refusal quality, over-refusal, under-refusal, jailbreak robustness, and safe completion.
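
Over-refusal and under-refusal can be made concrete as two rates over labeled prompts. The sketch assumes each example carries a ground-truth label for whether a refusal was appropriate and a judged label for whether the model refused; `refusal_rates` and that labeling scheme are assumptions, and how the labels get produced is out of scope.

```python
from typing import Dict, Iterable, Optional, Tuple


def refusal_rates(
    examples: Iterable[Tuple[bool, bool]],
) -> Dict[str, Optional[float]]:
    """Each example is (should_refuse, did_refuse).

    over_refusal_rate: share of benign prompts the model refused.
    under_refusal_rate: share of should-refuse prompts the model answered.
    A rate is None when its denominator is zero.
    """
    benign = benign_refused = harmful = harmful_answered = 0
    for should_refuse, did_refuse in examples:
        if should_refuse:
            harmful += 1
            harmful_answered += int(not did_refuse)
        else:
            benign += 1
            benign_refused += int(did_refuse)
    return {
        "over_refusal_rate": benign_refused / benign if benign else None,
        "under_refusal_rate": harmful_answered / harmful if harmful else None,
    }
```

Tracking both rates together guards against a change that lowers one simply by raising the other.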

Reasoning Process Evaluation (planned)

Outcome scoring compared with observable process and self-correction signals.

Monitorability (planned)

Internal-signal probes for failure prediction and representation drift.

Multimodal UI Understanding (planned)

Screenshot QA, OCR plus reasoning, UI grounding, and visual failure modes.

Systems Efficiency (planned)

Latency, throughput, batching, KV cache, quantization, and profiling.
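
A first latency and throughput measurement could be as small as the sketch below. It assumes a synchronous `generate` callable, counts whitespace-separated words as a stand-in for tokens, and uses nearest-rank percentiles; `percentile` and `benchmark` are hypothetical helper names, and batching, KV-cache, and quantization effects are not modeled.

```python
import math
import time
from typing import Callable, Dict, List, Sequence


def percentile(values: Sequence[float], q: float) -> float:
    """Nearest-rank percentile for 0 < q <= 100 on a non-empty sequence."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q / 100 * len(ordered)))
    return ordered[rank - 1]


def benchmark(generate: Callable[[str], str], prompts: List[str]) -> Dict[str, float]:
    """Time each call and report latency percentiles and rough throughput."""
    latencies = []
    total_tokens = 0
    for prompt in prompts:
        start = time.perf_counter()
        output = generate(prompt)
        latencies.append(time.perf_counter() - start)
        total_tokens += len(output.split())  # assumption: whitespace tokens
    total_time = sum(latencies)
    return {
        "n": len(prompts),
        "p50_s": percentile(latencies, 50),
        "p95_s": percentile(latencies, 95),
        "total_tokens": total_tokens,
        "tokens_per_s": total_tokens / total_time if total_time > 0 else float("inf"),
    }
```

Reporting p95 next to p50 exposes tail latency, which an average would smooth over.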

Data Efficiency / Scaling (planned)

Data mixtures, filtering, small scaling ladders, and score/GPU-hour.
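
One possible reading of score/GPU-hour, sketched under assumptions: each rung of a scaling ladder reports an eval score, a GPU count, and wall-clock hours, and efficiency is score divided by GPU-hours, with a marginal figure between consecutive rungs. The function name, field names, and this interpretation are hypothetical.

```python
from typing import Dict, List, Optional


def score_per_gpu_hour(runs: List[Dict[str, float]]) -> List[Dict[str, Optional[float]]]:
    """runs: ladder rungs ordered small to large, each with
    'score', 'num_gpus', and 'wall_hours' (wall_hours > 0, num_gpus > 0)."""
    rows: List[Dict[str, Optional[float]]] = []
    prev_score: Optional[float] = None
    prev_hours: Optional[float] = None
    for run in runs:
        gpu_hours = run["num_gpus"] * run["wall_hours"]
        marginal = None
        if prev_score is not None and prev_hours is not None and gpu_hours != prev_hours:
            # Score gained per extra GPU-hour relative to the previous rung.
            marginal = (run["score"] - prev_score) / (gpu_hours - prev_hours)
        rows.append(
            {
                "gpu_hours": gpu_hours,
                "score_per_gpu_hour": run["score"] / gpu_hours,
                "marginal_score_per_gpu_hour": marginal,
            }
        )
        prev_score, prev_hours = run["score"], gpu_hours
    return rows
```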

Final Integration (planned)

Reproducible scripts, reports, dashboard, README, and portfolio packaging.

Status

As of 2026-07-04, Open Model Research Harness is planned. There are no public runs, model cards, dataset cards, benchmark results, latency numbers, cost numbers, or model-quality claims yet.

The first gate is the July 2026 Foundation + Eval Harness milestone. It is complete only when at least three open models can be evaluated on the same small task suite with score, cost, latency, output-quality, and failure-mode reporting.
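
That completion condition is concrete enough to express as a check. The sketch below assumes results are stored as flat dictionaries; `gate_passed`, the key names, and the rule that `failure_mode` may be null for passing tasks are illustrative assumptions, not a published schema.

```python
from typing import Any, Dict, List, Set

# Must be present and non-null on every record (assumed key names).
MEASURED_FIELDS = ("score", "cost", "latency", "output_quality")


def gate_passed(records: List[Dict[str, Any]], min_models: int = 3) -> bool:
    """True if at least `min_models` models were run on the same task set
    and every record carries all reported fields."""
    tasks_by_model: Dict[str, Set[str]] = {}
    for rec in records:
        if any(rec.get(name) is None for name in MEASURED_FIELDS):
            return False
        if "failure_mode" not in rec:  # key required; value may be None on a pass
            return False
        tasks_by_model.setdefault(rec["model"], set()).add(rec["task_id"])
    if len(tasks_by_model) < min_models:
        return False
    task_sets = list(tasks_by_model.values())
    return all(tasks == task_sets[0] for tasks in task_sets)
```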

Where it lives

The project lives in the public lab section, Open Model Lab. The plan, eval schema, and report scaffolds live there, along with run, model, dataset, and dashboard registries that stay empty until real artifacts exist.

Tags: open-model-evals, post-training, agent-harnesses, safety-evals, monitorability, systems-efficiency