Open Model Lab

Minimal Open-Model Eval Harness

Status

Status: Planned
Month/theme: July 2026: Foundation + Eval Harness

This page is a report scaffold. It does not contain model scores, charts, or completed run results.

Research question

Can three open models be evaluated on the same small task suite with reproducible score, cost, latency, and failure-mode reporting?

Planned setup

  • Select a small open-model set only after the harness can record model identity and config.
  • Run the same initial task suite across all selected models.
  • Record evaluator, seed, prompt, grader, latency, and failure-mode metadata.
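As a rough illustration of the metadata listed above, one run could be captured in a single record that also carries a hash of the model config. This is only a sketch under stated assumptions: the names `RunRecord` and `config_hash`, the field names, and the choice of SHA-256 over canonical JSON are hypothetical and are not part of the planned harness.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from typing import Optional


def config_hash(config: dict) -> str:
    """Return a SHA-256 hex digest of a JSON-serializable config.

    Keys are sorted so that two configs with the same content but a
    different key order produce the same hash.
    """
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


@dataclass(frozen=True)
class RunRecord:
    """Hypothetical per-run metadata record (all field names are assumptions)."""

    model_id: str
    model_config: dict
    task_id: str
    prompt: str
    seed: int
    grader: str
    evaluator: str
    latency_s: Optional[float] = None    # None when latency was not measured
    failure_mode: Optional[str] = None   # None when no failure was labeled

    def to_dict(self) -> dict:
        """Return a plain dict that also includes the config hash."""
        data = asdict(self)
        data["config_hash"] = config_hash(self.model_config)
        return data
```

Storing the hash next to the full config would let a later report confirm that two runs used the same settings without comparing every field by hand.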

Planned measurements

  • Score where the grader supports a score.
  • Latency and cost where the run infrastructure can measure them.
  • Output-quality notes and failure-mode labels.
  • Known caveats and reproducibility requirements.
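Because score, latency, and cost are each recorded only where they can be measured, any summary has to tolerate missing values. The sketch below shows one possible way to do that; the function name `summarize`, the row keys, and the use of US dollars for cost are assumptions made for illustration, not decisions the harness has taken.

```python
from collections import Counter
from statistics import mean


def summarize(rows: list) -> dict:
    """Aggregate per-run rows for one model, skipping unmeasured values.

    Each row is a dict that may contain "score", "latency_s", "cost_usd",
    and "failure_mode"; a missing key or None means "not measured".
    """
    scores = [r["score"] for r in rows if r.get("score") is not None]
    latencies = [r["latency_s"] for r in rows if r.get("latency_s") is not None]
    costs = [r["cost_usd"] for r in rows if r.get("cost_usd") is not None]
    failures = Counter(r["failure_mode"] for r in rows if r.get("failure_mode"))

    return {
        "n_runs": len(rows),
        "n_scored": len(scores),
        # Report None rather than 0 so "unmeasured" is not read as "zero".
        "mean_score": mean(scores) if scores else None,
        "mean_latency_s": mean(latencies) if latencies else None,
        "total_cost_usd": sum(costs) if costs else None,
        "failure_modes": dict(failures),
    }
```

Returning `None` for an unmeasured quantity keeps the report from implying a zero score or zero cost when the grader or run infrastructure simply could not provide one.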

Planned sections

  • Research question and claim boundary
  • Setup, model variants, data versions, and config hashes
  • Eval suite or task design
  • Measurements and failure modes
  • Limitations, caveats, and next decision

Expected artifacts

  • Eval task schema.
  • Run storage format.
  • Report covering model, score, cost, latency, and failure mode.
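To make the first two artifacts more concrete, here is a minimal sketch of what a task schema and an append-only run store might look like. The `EvalTask` fields, the helper names `append_jsonl` and `read_jsonl`, and the choice of JSON Lines as the storage format are all assumptions for illustration; the actual schema and format have not been decided.

```python
import json
from dataclasses import asdict, dataclass
from pathlib import Path
from typing import Optional


@dataclass(frozen=True)
class EvalTask:
    """Hypothetical eval task schema (field names are assumptions)."""

    task_id: str
    prompt: str
    grader: str                      # name of the grader to apply
    expected: Optional[str] = None   # None when the grader needs no reference
    suite_version: str = "0"         # lets results be tied to a data version


def append_jsonl(path, record: dict) -> None:
    """Append one record as a single JSON line; earlier lines are untouched."""
    with Path(path).open("a", encoding="utf-8") as f:
        f.write(json.dumps(record, sort_keys=True) + "\n")


def read_jsonl(path) -> list:
    """Read every non-empty line of a JSON Lines file back into dicts."""
    with Path(path).open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

A possible usage, with made-up values:

```python
task = EvalTask(task_id="t1", prompt="What is 2 + 2?", grader="exact_match", expected="4")
append_jsonl("runs.jsonl", {"task": asdict(task), "model_id": "example-model", "output": "4"})
```

An append-only format is one option that keeps earlier runs intact, which may help with the reproducibility requirements listed above.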

Claim boundary

This report will validate the harness, not claim model superiority.