Open Model Lab
Minimal Open-Model Eval Harness
Status
- Status: Planned
- Month/theme: July 2026, Foundation + Eval Harness
Status: Planned. This page is a report scaffold. It does not contain model scores, charts, or completed run results.
Research question
Can three open models be evaluated on the same small task suite with reproducible score, cost, latency, and failure-mode reporting?
Planned setup
- Select a small open-model set only after the harness can record model identity and config.
- Run the same initial task suite across all selected models.
- Record evaluator, seed, prompt, grader, latency, and failure-mode metadata.
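As a rough illustration of the setup above, one way to record model identity, config, and per-run metadata is a small immutable record with a stable config hash. This is only a sketch under stated assumptions: the names `RunRecord`, `hash_config`, and `make_record`, and the exact field list, are hypothetical and not part of any decided harness design.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from typing import Optional


def hash_config(config: dict) -> str:
    """Return a SHA-256 hex digest of a canonical JSON encoding of the config.

    Sorting keys makes the hash independent of dict insertion order.
    """
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


@dataclass(frozen=True)
class RunRecord:
    # Hypothetical field set mirroring the metadata listed above.
    model_id: str
    config_hash: str
    task_id: str
    evaluator: str
    grader: str
    seed: int
    prompt: str
    latency_s: Optional[float] = None     # None when not measured
    failure_mode: Optional[str] = None    # None when no failure was labeled


def make_record(model_id: str, config: dict, task_id: str, evaluator: str,
                grader: str, seed: int, prompt: str,
                latency_s: Optional[float] = None,
                failure_mode: Optional[str] = None) -> RunRecord:
    """Build a record, deriving the config hash from the config itself."""
    return RunRecord(model_id, hash_config(config), task_id, evaluator,
                     grader, seed, prompt, latency_s, failure_mode)
```

Hashing a canonical encoding rather than the raw dict means two runs with the same settings compare equal even if the config was assembled in a different order, which is one plausible way to meet the reproducibility goal.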
Planned measurements
- Score where the grader supports a score.
- Latency and cost where the run infrastructure can measure them.
- Output-quality notes and failure-mode labels.
- Known caveats and reproducibility requirements.
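Because score, latency, and cost are only recorded where they can be measured, any summary has to treat a missing value as missing rather than as zero. A minimal sketch of that idea follows; the function names `summarize` and `count_failure_modes` are illustrative assumptions, not a committed interface.

```python
from collections import Counter
from statistics import mean
from typing import Optional, Sequence


def summarize(values: Sequence[Optional[float]]) -> dict:
    """Summarize a metric, ignoring runs where it was not measured (None)."""
    present = [v for v in values if v is not None]
    return {
        "n_total": len(values),
        "n_measured": len(present),
        # Report None instead of 0.0 when nothing was measured.
        "mean": mean(present) if present else None,
    }


def count_failure_modes(labels: Sequence[Optional[str]]) -> dict:
    """Count failure-mode labels; runs without a label are skipped."""
    return dict(Counter(label for label in labels if label is not None))
```

Reporting `n_measured` next to the mean keeps it visible when a number rests on only a few of the runs.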
Planned sections
- Research question and claim boundary
- Setup, model variants, data versions, and config hashes
- Eval suite or task design
- Measurements and failure modes
- Limitations, caveats, and next decision
Expected artifacts
- Eval task schema.
- Run storage format.
- Model / score / cost / latency / failure-mode report.
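To make the three artifacts concrete, here is one possible shape for them: a task schema as a dataclass, run storage as append-only JSON Lines, and a report row that prints "n/a" for anything not measured. All names here (`EvalTask`, `append_jsonl`, `read_jsonl`, `report_line`, and the `cost_usd` column) are hypothetical placeholders for illustration; the actual schema and storage format are still to be decided.

```python
import json
from dataclasses import asdict, dataclass
from pathlib import Path
from typing import List, Optional, Union


@dataclass(frozen=True)
class EvalTask:
    # Hypothetical minimal task schema.
    task_id: str
    prompt: str
    grader: str
    reference: Optional[str] = None  # expected answer, if the grader needs one


def append_jsonl(path: Union[str, Path], row: dict) -> None:
    """Append one JSON object per line; existing lines are never rewritten."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(row, sort_keys=True) + "\n")


def read_jsonl(path: Union[str, Path]) -> List[dict]:
    """Read every non-blank line back as a dict."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def report_line(row: dict) -> str:
    """Format one model / score / cost / latency / failure-mode row."""
    cols = ("model_id", "score", "cost_usd", "latency_s", "failure_mode")
    return " | ".join("n/a" if row.get(c) is None else str(row[c]) for c in cols)
```

For example, `append_jsonl("runs.jsonl", asdict(task))` would store a task definition, and `report_line` could be applied to each stored run row when building the report table.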
Claim boundary
This report will validate the harness, not claim model superiority.