Open Model Research Harness
A 12-month public research-engineering project for reproducible open-model evals, post-training, agents, safety, monitorability, and systems efficiency.
Design stage. No working code yet.
- July 2026: Foundation + Eval Harness — planned
- August 2026: SFT Pipeline + Data Quality — planned
- September 2026: Preference Optimization / DPO — planned
- October 2026: Agent Harness / Tool Use — planned
- November 2026: Agent Evals + Long-Horizon Tasks — planned
- December 2026: Safety / Red Teaming / Refusal Quality — planned
- January 2027: Reasoning Behavior + Process Evaluation — planned
- February 2027: Monitorability / Interpretability Start — planned
- March 2027: Multimodal / UI Understanding / Computer Use — planned
- April 2027: Training / Inference Systems Efficiency — planned
- May 2027: Data Efficiency + Scaling Ladder — planned
- June 2027: Final Integration + Public Portfolio — planned
What it is
Open Model Research Harness is a 12-month public research-engineering project for studying open LLMs through reproducible evaluations, post-training experiments, agentic task harnesses, safety evaluations, monitorability probes, and systems-efficiency measurements.
The goal is not to claim frontier-model capability. The goal is to build and document the eval-first research workflow needed to understand how model behavior changes across SFT, preference optimization, tool use, safety constraints, and inference/training efficiency work.
The project is inspired by frontier-lab research workflows, but intentionally scoped to open models, small experiments, reproducible evaluation, and public reporting.
It is a modular research harness and public portfolio for:
- evals
- SFT
- preference optimization / DPO
- agent harnesses
- long-horizon agent evals
- safety and refusal quality
- reasoning process evaluation
- monitorability / interpretability probes
- multimodal UI understanding
- systems efficiency
- data efficiency and scaling ladders
Why it exists
Open-model work is easy to overclaim when the evaluation loop is weak. A fine-tune, preference run, or agent demo can look useful in isolation while hiding regressions, safety failures, trace-level mistakes, or systems bottlenecks.
This project makes the measuring system the first deliverable. The July 2026 gate is not a model release; it is a small reproducible eval harness that can compare multiple open models on the same tasks and report score, cost, latency, output quality, and failure modes.
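As a rough illustration of what such a harness might record, the sketch below defines a per-task result and a per-model summary. It is not project code: `EvalRecord`, `summarize`, and every field name are hypothetical, and output quality is left out because it depends on a grader the plan does not yet specify.

```python
from collections import Counter
from dataclasses import dataclass
from statistics import mean, median
from typing import Optional


@dataclass(frozen=True)
class EvalRecord:
    """One model's result on one task (hypothetical schema)."""
    model: str
    task_id: str
    score: float                 # graded score, assumed to lie in [0, 1]
    cost_usd: float              # cost of this single task run
    latency_s: float             # wall-clock seconds
    failure_mode: Optional[str]  # None when the task passed


def summarize(records: list[EvalRecord]) -> dict[str, dict]:
    """Aggregate records per model so models can be compared side by side."""
    by_model: dict[str, list[EvalRecord]] = {}
    for rec in records:
        by_model.setdefault(rec.model, []).append(rec)

    summary = {}
    for model, recs in by_model.items():
        summary[model] = {
            "mean_score": mean(r.score for r in recs),
            "total_cost_usd": sum(r.cost_usd for r in recs),
            "median_latency_s": median(r.latency_s for r in recs),
            # Count only failed tasks; passes carry failure_mode=None.
            "failure_modes": Counter(
                r.failure_mode for r in recs if r.failure_mode is not None
            ),
        }
    return summary
```

Keeping the failure mode next to the score in the same record is one way to stop a good average from hiding the kinds of mistakes behind it.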
How it differs
The project is not structured as a leaderboard or benchmark marketing page. It is a public engineering record: each planned report has a research question, a setup, planned measurements, expected artifacts, and a claim boundary. Empty dashboards stay empty until real runs exist.
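The report structure described above could be captured in a small record type. This is a minimal sketch under assumed names (`ReportScaffold`, `run_ids`, and `status` are all hypothetical), not the project's actual scaffold format.

```python
from dataclasses import dataclass, field


@dataclass
class ReportScaffold:
    """Hypothetical structure for one planned report."""
    research_question: str
    setup: str
    planned_measurements: list[str]
    expected_artifacts: list[str]
    claim_boundary: str  # what the report may and may not conclude
    run_ids: list[str] = field(default_factory=list)  # empty until real runs exist

    def status(self) -> str:
        """A report stays 'planned' until at least one real run is attached."""
        return "has_results" if self.run_ids else "planned"
```

Deriving the status from attached runs, rather than storing it as a free-text field, is one way to keep a report from being marked done before any measurement exists.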
The work also connects several layers that are often presented separately: eval design, post-training, agent traces, safety behavior, monitorability probes, and systems profiling. The point is to show how behavior changes across the loop, not to claim that small open models match larger private systems.
What it is not
- It is not a claim that I run a frontier AI lab.
- It is not a leaderboard.
- It is not a benchmark marketing page.
- It is not a claim that small open models match frontier models.
- It is not a large-scale pretraining project.
Research loop
Measure
Measure model behavior before changing it.
Modify
Modify behavior with SFT, DPO, tools, or constraints.
Compare
Compare base/SFT/DPO/agent variants on the same eval suite.
Diagnose
Diagnose failures with a failure taxonomy and traces.
Report
Report results with claim boundaries.
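The Compare step is where hidden regressions should surface. A minimal sketch, assuming each variant's results are reduced to a mapping from task id to score (the function name `find_regressions` and the `tolerance` parameter are hypothetical):

```python
def find_regressions(
    baseline: dict[str, float],
    variant: dict[str, float],
    tolerance: float = 0.0,
) -> dict[str, float]:
    """Return the score delta for every task where the variant got worse.

    Both inputs map task_id -> score. Raises if the task sets differ,
    since the comparison is only meaningful on the same eval suite.
    """
    if baseline.keys() != variant.keys():
        raise ValueError("variants must be scored on the same tasks")
    deltas = {task: variant[task] - baseline[task] for task in baseline}
    return {task: d for task, d in deltas.items() if d < -tolerance}
```

For example, with base scores `{"a": 0.25, "b": 1.0, "c": 0.25}` and SFT scores `{"a": 1.0, "b": 0.5, "c": 0.25}`, the mean score rises, yet the function still reports task `b` with a delta of `-0.5`.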
Roadmap
Foundation + Eval Harness
Build the basic research infrastructure that can measure model behavior reliably before changing it.
SFT Pipeline + Data Quality
Measure the behavioral difference between a base model and an instruction-tuned model.
Preference Optimization / DPO
Use preference data to improve SFT behavior in a more controlled way, then measure behavioral side effects.
Agent Harness / Tool Use
Move the model from passive question-answering into a tool-using agent setup.
Agent Evals + Long-Horizon Tasks
Measure agent success on multi-step tasks and classify failures with a taxonomy.
Safety / Red Teaming / Refusal Quality
Measure safety behavior by quality and balance, not just refusal rate.
Reasoning Behavior + Process Evaluation
Evaluate not only final answers, but where reasoning processes break down.
Monitorability / Interpretability Start
Start analyzing model behavior with internal signals in addition to external scores.
Multimodal / UI Understanding / Computer Use
Enter multimodal model work through screenshot and UI-understanding tasks.
Training / Inference Systems Efficiency
Profile training and inference to make research experiments more efficient.
Data Efficiency + Scaling Ladder
Measure whether better data mixtures produce better behavior with the same compute.
Final Integration + Public Portfolio
Turn the 12 months of work into a reproducible, publishable portfolio.
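To make the safety milestone's "balance, not just refusal rate" idea concrete, the sketch below splits refusals by whether the prompt should have been refused. It assumes each prompt already carries a ground-truth `should_refuse` label. The function and metric names are hypothetical, and refusal quality itself is not covered because it needs a grader.

```python
def refusal_balance(results: list[tuple[bool, bool]]) -> dict[str, float]:
    """Compute refusal metrics from (should_refuse, did_refuse) pairs."""
    benign = [did for should, did in results if not should]
    harmful = [did for should, did in results if should]
    if not benign or not harmful:
        raise ValueError("need at least one benign and one harmful prompt")
    return {
        # Fraction of all prompts refused, regardless of label.
        "refusal_rate": sum(did for _, did in results) / len(results),
        # Benign prompts that were refused anyway.
        "over_refusal_rate": sum(benign) / len(benign),
        # Harmful prompts that were answered instead of refused.
        "under_refusal_rate": sum(not did for did in harmful) / len(harmful),
    }
```

Two models can share a refusal rate of 0.5 and still behave in opposite ways: one refuses exactly the harmful prompts (over- and under-refusal both 0.0), the other refuses exactly the benign ones (both 1.0).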
Module map
Eval Harness
Reproducible task execution, grading, metrics, and failure labels.
SFT Pipeline
Instruction data cleaning, supervised fine-tuning, and behavior comparison.
Preference Optimization / DPO
Chosen/rejected pairs, DPO training, and side-effect measurement.
Agent Harness
Tool registry, file operations, test execution, and replayable traces.
Long-Horizon Agent Evals
Multi-step coding tasks and trace-level failure taxonomy.
Safety / Red Teaming
Refusal quality, over-refusal, under-refusal, jailbreak robustness, and safe completion.
Reasoning Process Evaluation
Outcome scoring compared with observable process and self-correction signals.
Monitorability
Internal-signal probes for failure prediction and representation drift.
Multimodal UI Understanding
Screenshot QA, OCR plus reasoning, UI grounding, and visual failure modes.
Systems Efficiency
Latency, throughput, batching, KV cache, quantization, and profiling.
Data Efficiency / Scaling
Data mixtures, filtering, small scaling ladders, and score per GPU-hour.
Final Integration
Reproducible scripts, reports, dashboard, README, and portfolio packaging.
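As one illustration of how the Agent Harness module's tool registry and replayable traces might fit together, the sketch below logs every tool call and can re-run a saved trace. All names (`ToolRegistry`, `replay`) are hypothetical, and it assumes deterministic tools whose arguments and results are JSON-serializable.

```python
import json
from typing import Any, Callable


class ToolRegistry:
    """Hypothetical registry that records every tool call in a trace."""

    def __init__(self) -> None:
        self._tools: dict[str, Callable[..., Any]] = {}
        self.trace: list[dict[str, Any]] = []

    def register(self, name: str, fn: Callable[..., Any]) -> None:
        self._tools[name] = fn

    def call(self, name: str, **kwargs: Any) -> Any:
        """Run a registered tool and append the call to the trace."""
        result = self._tools[name](**kwargs)
        self.trace.append({"tool": name, "args": kwargs, "result": result})
        return result

    def dump_trace(self) -> str:
        """Serialize the trace; assumes args and results are JSON-friendly."""
        return json.dumps(self.trace)


def replay(trace_json: str, registry: ToolRegistry) -> bool:
    """Re-run each recorded call and check that the results still match."""
    for step in json.loads(trace_json):
        if registry.call(step["tool"], **step["args"]) != step["result"]:
            return False
    return True
```

Replaying a trace against a fresh registry gives a cheap check that a recorded agent run can still be reproduced after the tools change.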
Status
As of 2026-07-04, Open Model Research Harness is in the design stage. There are no public runs, model cards, dataset cards, benchmark results, latency numbers, cost numbers, or model-quality claims yet.
The first gate is the July 2026 Foundation + Eval Harness milestone. It is complete only when at least three open models can be evaluated on the same small task suite with score, cost, latency, output-quality, and failure-mode reporting.
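The completion criterion above is specific enough to express as a check. A minimal sketch, assuming results are stored as model -> task id -> reported fields; the field names and the `gate_met` function are hypothetical, not the project's eval schema.

```python
REQUIRED_FIELDS = {"score", "cost", "latency", "output_quality", "failure_mode"}
MIN_MODELS = 3


def gate_met(runs: dict[str, dict[str, dict]]) -> bool:
    """Check the gate; runs maps model -> task_id -> reported fields."""
    if len(runs) < MIN_MODELS:
        return False
    task_sets = [set(tasks) for tasks in runs.values()]
    if not task_sets[0] or any(ts != task_sets[0] for ts in task_sets):
        return False  # empty suite, or models ran different tasks
    # Every model must report every required field on every task.
    return all(
        REQUIRED_FIELDS <= set(fields)
        for tasks in runs.values()
        for fields in tasks.values()
    )
```

A task that passed would still carry a `failure_mode` key, for example set to `None`, so that a missing report can be told apart from a clean result.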
Where it lives
The public lab section is Open Model Lab. The plan, eval schema, and report scaffolds live there, along with run, model, dataset, and dashboard registries that stay empty until real artifacts exist.