planned Month: July 2026 - Foundation + Eval Harness
Question: Can three open models be evaluated on the same small task suite with reproducible score, cost, latency, and failure-mode reporting?
Claim boundary: This report will validate the harness, not claim model superiority.
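
A minimal sketch of what per-run reporting for this month could look like, assuming one record per (model, task) pair. The names EvalRecord and summarize, the field names, and the 0-to-1 score scale are illustrative assumptions, not part of the plan.

```python
from collections import Counter
from dataclasses import dataclass
from statistics import mean
from typing import Optional


@dataclass(frozen=True)
class EvalRecord:
    """One (model, task) result. All field names are hypothetical."""
    model: str
    task_id: str
    score: float                  # assumed scale: 0.0 (fail) to 1.0 (pass)
    cost_usd: float
    latency_s: float
    failure_mode: Optional[str]   # None when the task passed


def summarize(records):
    """Aggregate score, cost, latency, and failure modes per model."""
    by_model = {}
    for r in records:
        by_model.setdefault(r.model, []).append(r)
    summary = {}
    for model, rows in sorted(by_model.items()):
        summary[model] = {
            "n": len(rows),
            "mean_score": mean(r.score for r in rows),
            "total_cost_usd": sum(r.cost_usd for r in rows),
            "mean_latency_s": mean(r.latency_s for r in rows),
            # Count only records that actually carry a failure label.
            "failure_modes": dict(
                Counter(r.failure_mode for r in rows if r.failure_mode)
            ),
        }
    return summary
```

Because summarize is a pure function of the records, re-running it on the same stored records gives the same report, which is the reproducibility property the question asks about.
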
planned Month: August 2026 - SFT Pipeline + Data Quality
Question: Which behaviors improve or regress after supervised fine-tuning?
Claim boundary: This report will evaluate behavior changes within a controlled task set, not general model quality.
planned Month: September 2026 - Preference Optimization / DPO
Question: How does preference optimization change helpfulness, instruction following, factuality, conciseness, coding, and refusal behavior?
Claim boundary: This report will not treat a lower preference loss as evidence of better model quality.
planned Month: October 2026 - Agent Harness / Tool Use
Question: How do base, SFT, and DPO variants behave inside a simple tool-using coding-agent harness?
Claim boundary: This report will not claim general agent capability.
planned Month: November 2026 - Agent Evals + Long-Horizon Tasks
Question: Why do small coding agents fail on long-horizon tasks?
Claim boundary: This report classifies observed failures in controlled tasks only.
planned Month: December 2026 - Safety / Red Teaming / Refusal Quality
Question: Can refusal quality, over-refusal, under-refusal, and jailbreak robustness be measured together?
Claim boundary: This report is not a safety certification.
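
One way the refusal metrics could be reported side by side, sketched under the assumption that every prompt carries a ground-truth should_refuse label and a binary refused outcome. RefusalCase, refusal_report, and the metric names are hypothetical; graded refusal quality would need a separate rubric and is not modeled here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RefusalCase:
    """One labeled prompt outcome. Field names are hypothetical."""
    should_refuse: bool         # ground-truth label for the prompt
    refused: bool               # what the model actually did
    is_jailbreak: bool = False  # harmful prompt wrapped in a jailbreak attempt


def rate(hits, total):
    """Return hits/total, or None when the bucket is empty."""
    return hits / total if total else None


def refusal_report(cases):
    """Compute over-refusal, under-refusal, and jailbreak success together."""
    benign = [c for c in cases if not c.should_refuse]
    harmful = [c for c in cases if c.should_refuse]
    jailbreak = [c for c in harmful if c.is_jailbreak]
    return {
        # Benign prompts that were refused anyway.
        "over_refusal_rate": rate(sum(c.refused for c in benign), len(benign)),
        # Harmful prompts that were answered.
        "under_refusal_rate": rate(
            sum(not c.refused for c in harmful), len(harmful)
        ),
        # Harmful jailbreak prompts that were answered.
        "jailbreak_success_rate": rate(
            sum(not c.refused for c in jailbreak), len(jailbreak)
        ),
    }
```

Reporting the three rates from the same labeled set makes the trade-off visible: a model that refuses everything scores zero on under-refusal and jailbreak success but maximal over-refusal.
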
planned Month: January 2027 - Reasoning Behavior + Process Evaluation
Question: When do final-answer scores hide flawed reasoning processes?
Claim boundary: This report evaluates diagnostic signals and does not assume access to private chain-of-thought.
planned Month: February 2027 - Monitorability / Interpretability Start
Question: Can simple internal signals predict wrong answers, refusals, or low-confidence behavior?
Claim boundary: This report is an exploratory monitorability experiment.
planned Month: March 2027 - Multimodal / UI Understanding / Computer Use
Question: Can screenshot-based tasks distinguish OCR success from actual UI reasoning?
Claim boundary: This report does not claim robust computer-use ability.
planned Month: April 2027 - Training / Inference Systems Efficiency
Question: Which latency, throughput, batching, KV-cache, quantization, and profiling bottlenecks matter most for small experiments?
Claim boundary: This report focuses on practical experiment velocity, not production-scale serving.
planned Month: May 2027 - Data Efficiency + Scaling Ladder
Question: Which data mixtures produce the most behavior gain per GPU-hour?
Claim boundary: This report compares small scaling-ladder experiments only.
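
A sketch of how "behavior gain per GPU-hour" could be computed, assuming gain is defined as the score improvement over a shared baseline on a fixed task set. That definition, and the names gain_per_gpu_hour and rank_mixtures, are assumptions for illustration only.

```python
def gain_per_gpu_hour(baseline_score, tuned_score, gpu_hours):
    """Score improvement over the baseline, divided by training GPU-hours."""
    if gpu_hours <= 0:
        raise ValueError("gpu_hours must be positive")
    return (tuned_score - baseline_score) / gpu_hours


def rank_mixtures(baseline_score, runs):
    """Rank data mixtures by gain per GPU-hour, best first.

    `runs` maps a mixture name to a (tuned_score, gpu_hours) pair.
    """
    ranked = [
        (name, gain_per_gpu_hour(baseline_score, score, hours))
        for name, (score, hours) in runs.items()
    ]
    return sorted(ranked, key=lambda item: item[1], reverse=True)
```

Under this definition a mixture with a smaller absolute gain can still rank first if it trains much faster, so absolute scores should be reported alongside the ratio.
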
planned Month: June 2027 - Final Integration + Public Portfolio
Question: What did the 12-month open-model research harness demonstrate end to end?
Claim boundary: The final report is a public portfolio and engineering record, not a frontier-model capability claim.