Open Model Lab Evals
Evals are the foundation of the project: every later SFT, DPO, agent, safety, monitorability, or systems change needs comparable measurement.
Purpose
The eval harness is the first gate: a model change is only meaningful if it can be compared against a stable task suite, grader set, and reporting format. The initial suite is intentionally small and explicit: 25 tasks, five in each of five categories.
Task schema
Schema example, not a published task record:
{
  "id": "coding_001",
  "category": "coding",
  "prompt": "Write a Python function that...",
  "grader": "unit_test",
  "expected_behavior": "The function passes all hidden tests.",
  "difficulty": "easy",
  "tags": ["python", "unit-test", "deterministic"]
}
Initial task categories
| Category | Count | Examples |
|---|---|---|
| Coding | 5 tasks | Short Python function, bug fix, unit-test pass |
| Reasoning | 5 tasks | Multi-step logic, small math, error analysis |
| Factuality | 5 tasks | Infer from given text, avoid unsupported claims |
| Instruction following | 5 tasks | Format, length, tone, constraints |
| Safety-lite | 5 tasks | Steering toward safe responses, avoiding unnecessary refusals |
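The schema example and category table above suggest a simple load-time check for task records. The following is a minimal sketch, not the project's loader: `Task`, `parse_task`, `REQUIRED_FIELDS`, and every category identifier other than `"coding"` are hypothetical names chosen for illustration.

```python
import json
from dataclasses import dataclass

# Assumed identifiers: only "coding" appears verbatim in the schema example;
# the other category names are guesses derived from the category table.
CATEGORIES = {"coding", "reasoning", "factuality", "instruction_following", "safety_lite"}
REQUIRED_FIELDS = ("id", "category", "prompt", "grader", "expected_behavior", "difficulty", "tags")


@dataclass(frozen=True)
class Task:
    id: str
    category: str
    prompt: str
    grader: str
    expected_behavior: str
    difficulty: str
    tags: tuple = ()


def parse_task(raw: str) -> Task:
    """Parse one JSON task record, rejecting missing fields and unknown categories."""
    data = json.loads(raw)
    missing = [name for name in REQUIRED_FIELDS if name not in data]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    if data["category"] not in CATEGORIES:
        raise ValueError(f"unknown category: {data['category']!r}")
    if not isinstance(data["tags"], list):
        raise ValueError("tags must be a list")
    # Copy only the known fields so unexpected keys cannot reach the constructor.
    fields = {name: data[name] for name in REQUIRED_FIELDS}
    fields["tags"] = tuple(data["tags"])
    return Task(**fields)


example = json.dumps({
    "id": "coding_001",
    "category": "coding",
    "prompt": "Write a Python function that...",
    "grader": "unit_test",
    "expected_behavior": "The function passes all hidden tests.",
    "difficulty": "easy",
    "tags": ["python", "unit-test", "deterministic"],
})
task = parse_task(example)
```

Failing fast on a malformed record keeps the suite stable, which is the property the harness depends on for comparisons across model changes.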
Grader types
| Grader | Use |
|---|---|
| Unit-test grader | Coding, deterministic correctness, format validation |
| LLM-as-judge | Helpfulness, clarity, instruction following, reasoning quality |
| Rubric grader | Human-readable criteria scored 1-5 |
| Exact-match grader | Short-answer and classification tasks |
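As an illustration of how the simpler graders in the table might be wired up, here is a hedged sketch of an exact-match grader, a rubric aggregator, and a name-based dispatch. The function names, the `GRADERS` registry, the normalization rule (trim and lowercase), and averaging the 1-5 rubric scores are all assumptions rather than documented behavior.

```python
from typing import Callable, Dict


def exact_match(output: str, expected: str) -> float:
    """Return 1.0 when the answers match after trimming whitespace and lowercasing."""
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0


def rubric_score(criteria_scores: Dict[str, int]) -> float:
    """Average per-criterion scores on the 1-5 scale, rejecting out-of-range values."""
    if not criteria_scores:
        raise ValueError("at least one criterion is required")
    for name, score in criteria_scores.items():
        if not 1 <= score <= 5:
            raise ValueError(f"criterion {name!r} has out-of-range score {score}")
    return sum(criteria_scores.values()) / len(criteria_scores)


# Hypothetical registry keyed by the task record's "grader" field.
GRADERS: Dict[str, Callable[..., float]] = {
    "exact_match": exact_match,
    "rubric": rubric_score,
}


def grade(grader_name: str, *args) -> float:
    """Look up a grader by name and apply it to the given arguments."""
    try:
        grader = GRADERS[grader_name]
    except KeyError:
        raise ValueError(f"unknown grader: {grader_name!r}") from None
    return grader(*args)
```

For example, `grade("exact_match", " Paris ", "paris")` treats the two strings as equal, and `grade("rubric", {"clarity": 4, "accuracy": 5})` averages the two criteria. Unit-test and LLM-as-judge graders would need a sandbox and a model call respectively, so they are left out of this sketch.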
Metrics
- Score
- Latency
- Cost
- Failure mode
- Reproducibility
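One possible way to record these metrics per task and roll them up into a report is sketched below. The field names, units (seconds, US dollars), the assumption that scores are normalized to [0, 1], and the choice to express reproducibility as the score spread across repeated runs are all illustrative, not part of the documented reporting format.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List, Optional


@dataclass
class EvalResult:
    task_id: str
    score: float                        # grader output, assumed normalized to [0, 1]
    latency_s: float                    # wall-clock seconds for the model call
    cost_usd: float                     # assumed unit; any consistent unit works
    failure_mode: Optional[str] = None  # None when the task passed cleanly


def summarize(results: List[EvalResult]) -> Dict[str, float]:
    """Aggregate per-task results into suite-level numbers."""
    if not results:
        raise ValueError("no results to summarize")
    return {
        "mean_score": mean(r.score for r in results),
        "mean_latency_s": mean(r.latency_s for r in results),
        "total_cost_usd": sum(r.cost_usd for r in results),
        "failure_rate": sum(1 for r in results if r.failure_mode) / len(results),
    }


def reproducibility_spread(scores_across_runs: List[float]) -> float:
    """Max minus min score over repeated runs of one task; 0.0 means identical runs."""
    if not scores_across_runs:
        raise ValueError("at least one run is required")
    return max(scores_across_runs) - min(scores_across_runs)
```

Keeping the summary keys fixed across runs is what makes two reports directly comparable.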
Failure mode taxonomy
- Hallucination
- Wrong reasoning
- Instruction miss
- Format failure
- Over-answering
- Unsafe answer
- Judge uncertainty
- Looping
- Premature success
- Context loss
- Tool misuse
- Over-refusal
- Under-refusal
- Factuality drift
- Style collapse
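The taxonomy above could be encoded as a closed set so that graders and reports cannot drift into ad hoc labels. This is a minimal sketch: the `FailureMode` enum, its snake_case string values, and the `tally` helper are assumed names, not an existing API.

```python
from collections import Counter
from enum import Enum
from typing import Iterable


class FailureMode(Enum):
    HALLUCINATION = "hallucination"
    WRONG_REASONING = "wrong_reasoning"
    INSTRUCTION_MISS = "instruction_miss"
    FORMAT_FAILURE = "format_failure"
    OVER_ANSWERING = "over_answering"
    UNSAFE_ANSWER = "unsafe_answer"
    JUDGE_UNCERTAINTY = "judge_uncertainty"
    LOOPING = "looping"
    PREMATURE_SUCCESS = "premature_success"
    CONTEXT_LOSS = "context_loss"
    TOOL_MISUSE = "tool_misuse"
    OVER_REFUSAL = "over_refusal"
    UNDER_REFUSAL = "under_refusal"
    FACTUALITY_DRIFT = "factuality_drift"
    STYLE_COLLAPSE = "style_collapse"


def tally(labels: Iterable[str]) -> Counter:
    """Count failure labels; a label outside the taxonomy raises ValueError."""
    return Counter(FailureMode(label) for label in labels)
```

For instance, `tally(["looping", "looping", "over_refusal"])` counts two `FailureMode.LOOPING` entries and one `FailureMode.OVER_REFUSAL`, while a misspelled label fails loudly instead of creating a new category.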