Open Model Lab

Minimal Open-Model Eval Harness

Can three open models be evaluated on the same small task suite with reproducible score, cost, latency, and failure-mode reporting?

Status

Planned. This page is a report scaffold and does not yet contain model scores, charts, or completed run results.

Select a small open-model set only after the harness can record model identity and config.
Run the same initial task suite across all selected models.
Record evaluator, seed, prompt, grader, latency, and failure-mode metadata.
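
The steps above could be sketched roughly as follows. This is a minimal illustration, not the harness itself: `ModelSpec`, `RunRecord`, `config_hash`, `exact_match`, and `run_suite` are hypothetical names, the task format (`id`, `prompt`, `expected`) is an assumption, and `generate` stands in for whatever model-calling function the harness ends up using. Cost is not modeled here.

```python
import hashlib
import json
import time
from dataclasses import dataclass, asdict
from typing import Callable, Optional


@dataclass
class ModelSpec:
    # Hypothetical identity fields; the real harness may record more.
    name: str
    revision: str
    config: dict


@dataclass
class RunRecord:
    # One row per (model, task) pair, holding the metadata listed above.
    model_name: str
    model_revision: str
    config_hash: str
    task_id: str
    prompt: str
    seed: int
    evaluator: str
    grader: str
    score: Optional[float]
    latency_s: float
    failure_mode: Optional[str]


def config_hash(config: dict) -> str:
    # Hash a canonical JSON form so key order does not change the hash.
    canonical = json.dumps(config, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


def exact_match(output: str, expected: str) -> float:
    # Placeholder grader (assumption): 1.0 on a whitespace-trimmed exact match.
    return 1.0 if output.strip() == expected.strip() else 0.0


def run_suite(
    models: list,
    tasks: list,
    generate: Callable[[ModelSpec, str, int], str],
    seed: int = 0,
    evaluator: str = "harness-sketch",
) -> list:
    # Run every task against every model with the same prompt and seed.
    records = []
    for spec in models:
        for task in tasks:
            output, failure, score = None, None, None
            start = time.perf_counter()
            try:
                output = generate(spec, task["prompt"], seed)
            except Exception as exc:
                # Record the exception type as a coarse failure mode.
                failure = type(exc).__name__
            latency = time.perf_counter() - start  # times the model call only
            if failure is None:
                score = exact_match(output, task["expected"])
            records.append(
                RunRecord(
                    model_name=spec.name,
                    model_revision=spec.revision,
                    config_hash=config_hash(spec.config),
                    task_id=task["id"],
                    prompt=task["prompt"],
                    seed=seed,
                    evaluator=evaluator,
                    grader=exact_match.__name__,
                    score=score,
                    latency_s=latency,
                    failure_mode=failure,
                )
            )
    return records
```

Each record converts to a plain dictionary with `asdict(record)`, so a run could be written out as one JSON line per (model, task) pair. Keeping failures as records with an empty score, rather than dropping them, is a design choice that lets failure modes be counted alongside scores.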

This report will validate the harness, not claim model superiority.