Open Model Lab

Open Model Lab Reports

Report scaffolds for the 12-month Open Model Research Harness. Each report stays marked as planned until its underlying run, config, dataset, and write-up all exist.
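
The gating rule above can be sketched as a small status check. This is a minimal illustration, not the project's actual tooling: the artifact names are taken from the sentence above, while the `ReportScaffold` class and the "ready" status label are hypothetical (the index itself only shows "planned").

```python
from dataclasses import dataclass

# Artifact names come from the rule above; how they map to files or
# directories in the real repository is not specified in this index.
REQUIRED_ARTIFACTS = ("run", "config", "dataset", "write-up")


@dataclass
class ReportScaffold:
    """Hypothetical record for one entry in the report index."""
    title: str
    artifacts: frozenset = frozenset()  # artifacts that already exist

    def missing(self) -> list:
        # Preserve the order of REQUIRED_ARTIFACTS for stable output.
        return [a for a in REQUIRED_ARTIFACTS if a not in self.artifacts]

    def status(self) -> str:
        # "ready" is an assumed label for a report with all artifacts present.
        return "planned" if self.missing() else "ready"


scaffold = ReportScaffold("Minimal Open-Model Eval Harness", frozenset({"config"}))
print(scaffold.status(), scaffold.missing())
```

Keeping the check as a pure function of the artifact set makes it easy to run in CI so that a report cannot be promoted by hand while an artifact is missing.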

Report index

planned

Minimal Open-Model Eval Harness

Month: July 2026 - Foundation + Eval Harness

Question: Can three open models be evaluated on the same small task suite with reproducible score, cost, latency, and failure-mode reporting?

Claim boundary: This report will validate the harness, not claim model superiority.
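
As a rough sketch of what per-model reporting of score, cost, latency, and failure mode could look like, the snippet below aggregates per-task records. The `EvalRecord` fields, the USD cost unit, and the `summarize` helper are assumptions for illustration; the harness's real schema is not defined in this index.

```python
from collections import Counter
from dataclasses import dataclass
from statistics import mean
from typing import Optional


@dataclass
class EvalRecord:
    """Hypothetical result of one model on one task."""
    model: str
    task_id: str
    score: float                        # assumed to lie in [0, 1]
    cost_usd: float                     # assumed cost unit
    latency_s: float
    failure_mode: Optional[str] = None  # None when the task succeeded


def summarize(records):
    """Group records by model and report the four quantities side by side."""
    by_model = {}
    for record in records:
        by_model.setdefault(record.model, []).append(record)

    summary = {}
    for model, rows in by_model.items():
        summary[model] = {
            "mean_score": mean(r.score for r in rows),
            "total_cost_usd": sum(r.cost_usd for r in rows),
            "mean_latency_s": mean(r.latency_s for r in rows),
            "failure_modes": dict(
                Counter(r.failure_mode for r in rows if r.failure_mode)
            ),
        }
    return summary
```

Running every model through the same `summarize` call on the same task IDs is what makes the numbers comparable; it says nothing about which model is better outside that suite.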

planned

Base vs SFT Behavior Change

Month: August 2026 - SFT Pipeline + Data Quality

Question: Which behaviors improve or regress after supervised fine-tuning?

Claim boundary: This report will evaluate behavior changes within a controlled task set, not general model quality.

planned

DPO Behavior Impact

Month: September 2026 - Preference Optimization / DPO

Question: How does preference optimization change helpfulness, instruction following, factuality, conciseness, coding, and refusal behavior?

Claim boundary: This report will not treat a lower preference loss as evidence of better model quality.

planned

Post-Training and Agentic Coding Behavior

Month: October 2026 - Agent Harness / Tool Use

Question: How do base, SFT, and DPO variants behave inside a simple tool-using coding-agent harness?

Claim boundary: This report will not claim general agent capability.

planned

Coding Agent Failure Taxonomy

Month: November 2026 - Agent Evals + Long-Horizon Tasks

Question: Why do small coding agents fail on long-horizon tasks?

Claim boundary: This report classifies observed failures in controlled tasks only.

planned

Refusal and Jailbreak Evaluation

Month: December 2026 - Safety / Red Teaming / Refusal Quality

Question: Can refusal quality, over-refusal, under-refusal, and jailbreak robustness be measured together?

Claim boundary: This report is not a safety certification.
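
One possible way to report these quantities together is to compute them from a single labeled prompt set. The definitions below are assumptions, not the report's: over-refusal is taken as the share of benign prompts refused, under-refusal as the share of harmful prompts answered, and jailbreak robustness as the share of jailbreak prompts refused. Refusal quality would need a separate rubric and is not covered here.

```python
def refusal_metrics(samples):
    """Compute refusal rates from (category, was_refused) pairs.

    category is one of "benign", "harmful", or "jailbreak" (assumed labels).
    A rate is None when its category has no samples.
    """
    refused = {"benign": 0, "harmful": 0, "jailbreak": 0}
    total = dict.fromkeys(refused, 0)
    for category, was_refused in samples:
        refused[category] += int(was_refused)
        total[category] += 1

    def frac(numerator, denominator):
        return numerator / denominator if denominator else None

    return {
        "over_refusal": frac(refused["benign"], total["benign"]),
        "under_refusal": frac(
            total["harmful"] - refused["harmful"], total["harmful"]
        ),
        "jailbreak_robustness": frac(refused["jailbreak"], total["jailbreak"]),
    }
```

Reporting all three from one call makes trade-offs visible: a model that refuses everything scores well on jailbreak robustness and under-refusal but poorly on over-refusal.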

planned

Outcome vs Process Evaluation

Month: January 2027 - Reasoning Behavior + Process Evaluation

Question: When do final-answer scores hide flawed reasoning processes?

Claim boundary: This report evaluates observable diagnostic signals and does not rely on access to private chain-of-thought.
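
One simple way to look for final-answer scores that hide flawed processes is to cross-tabulate answer correctness against a separate process check. The sketch below assumes each case already carries two booleans, `answer_correct` and `process_ok`; how `process_ok` would be produced is not specified in this index, and the `hidden_flaw_rate` name is hypothetical.

```python
from collections import Counter


def outcome_process_table(cases):
    """Cross-tabulate (answer_correct, process_ok) pairs.

    Returns the 2x2 counts plus the share of correct answers whose
    process check failed, or None if there are no correct answers.
    """
    table = Counter((bool(a), bool(p)) for a, p in cases)
    hidden = table[(True, False)]            # right answer, flawed process
    correct = table[(True, True)] + hidden   # all right answers
    return {
        "counts": dict(table),
        "hidden_flaw_rate": hidden / correct if correct else None,
    }
```

A high `hidden_flaw_rate` would indicate that the outcome score alone overstates how often the model reached its answers soundly.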

planned

Failure Prediction Probes

Month: February 2027 - Monitorability / Interpretability Start

Question: Can simple internal signals predict wrong answers, refusals, or low-confidence behavior?

Claim boundary: This report is an exploratory monitorability experiment.
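
A minimal way to ask whether a scalar signal predicts wrong answers is to compute its AUROC against correctness labels. The sketch below assumes one signal value per example, with higher values meant to indicate a higher chance of error; which internal signal to use is left open by this index.

```python
def auroc(signal, is_wrong):
    """Area under the ROC curve via pairwise comparison.

    signal:   one float per example (higher = more likely wrong, by assumption)
    is_wrong: one boolean per example
    Ties between a wrong and a correct example count as half a win.
    """
    wrong = [s for s, y in zip(signal, is_wrong) if y]
    right = [s for s, y in zip(signal, is_wrong) if not y]
    if not wrong or not right:
        raise ValueError("need at least one wrong and one correct example")
    wins = sum((w > r) + 0.5 * (w == r) for w in wrong for r in right)
    return wins / (len(wrong) * len(right))
```

A value near 0.5 means the signal carries no ranking information about errors; the pairwise form is quadratic in the number of examples, which is acceptable for small exploratory sets.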

planned

UI Understanding Evaluation

Month: March 2027 - Multimodal / UI Understanding / Computer Use

Question: Can screenshot-based tasks distinguish OCR success from actual UI reasoning?

Claim boundary: This report does not claim robust computer-use ability.

planned

Open-Model Systems Bottlenecks

Month: April 2027 - Training / Inference Systems Efficiency

Question: Which latency, throughput, batching, KV-cache, quantization, and profiling bottlenecks matter most for small experiments?

Claim boundary: This report focuses on practical experiment velocity, not production-scale serving.

planned

Score per GPU-Hour

Month: May 2027 - Data Efficiency + Scaling Ladder

Question: Which data mixtures produce the most behavior gain per GPU-hour?

Claim boundary: This report compares small scaling-ladder experiments only.
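
The comparison in the question can be sketched as score gain over a shared baseline divided by GPU-hours spent. The function names and the choice of a single scalar score are assumptions for illustration; the report's actual metric is not defined in this index.

```python
def gain_per_gpu_hour(baseline_score, run_score, gpu_hours):
    """Score improvement over the baseline per GPU-hour of training."""
    if gpu_hours <= 0:
        raise ValueError("gpu_hours must be positive")
    return (run_score - baseline_score) / gpu_hours


def rank_mixtures(baseline_score, runs):
    """Rank data mixtures by gain per GPU-hour, best first.

    runs maps a mixture name to a (score, gpu_hours) pair.
    """
    scored = [
        (name, gain_per_gpu_hour(baseline_score, score, hours))
        for name, (score, hours) in runs.items()
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Made-up numbers, used only to show the call shape.
print(rank_mixtures(0.5, {"mix_a": (0.75, 2.0), "mix_b": (1.0, 8.0)}))
```

Note that a mixture with the highest final score can still rank lower once compute is accounted for, as `mix_b` does in the made-up example.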

planned

Final Technical Report

Month: June 2027 - Final Integration + Public Portfolio

Question: What did the 12-month open-model research harness demonstrate end to end?

Claim boundary: The final report is a public portfolio and engineering record, not a frontier-model capability claim.