Open Model Lab

Open Model Lab Evals

Evals are the foundation of the project: every later change, whether SFT, DPO, agent, safety, monitorability, or systems work, needs measurement that is comparable across runs.

Purpose

The eval harness is the first gate because a model change is only meaningful when it can be compared against a stable task suite, grader set, and reporting format. The initial suite is intentionally small and explicit.

Task schema

Schema example, not a published task record:

{
  "id": "coding_001",
  "category": "coding",
  "prompt": "Write a Python function that...",
  "grader": "unit_test",
  "expected_behavior": "The function passes all hidden tests.",
  "difficulty": "easy",
  "tags": ["python", "unit-test", "deterministic"]
}
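
A minimal sketch of how a record shaped like the example could be checked before it enters the suite. The field names and types are taken from the example above; the validate_task helper and its error messages are hypothetical, not part of the harness.

```python
import json

# Field names and types mirror the schema example; nothing here constrains
# the allowed values of "grader" or "difficulty", since the source does not list them.
REQUIRED_FIELDS = {
    "id": str,
    "category": str,
    "prompt": str,
    "grader": str,
    "expected_behavior": str,
    "difficulty": str,
    "tags": list,
}


def validate_task(record: dict) -> list[str]:
    """Return a list of problems; an empty list means the record is well-formed."""
    problems = []
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            problems.append(f"{field} should be {expected_type.__name__}")
    # Tags are expected to be a list of strings, as in the example.
    tags = record.get("tags")
    if isinstance(tags, list) and not all(isinstance(t, str) for t in tags):
        problems.append("tags should contain only strings")
    return problems


example = json.loads("""
{
  "id": "coding_001",
  "category": "coding",
  "prompt": "Write a Python function that...",
  "grader": "unit_test",
  "expected_behavior": "The function passes all hidden tests.",
  "difficulty": "easy",
  "tags": ["python", "unit-test", "deterministic"]
}
""")

print(validate_task(example))
```

Returning a list of problems rather than raising on the first one makes it easier to report every issue in a task file at once.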

Initial task categories

Category	Count	Examples
Coding	5 tasks	Short Python function, bug fix, unit-test pass
Reasoning	5 tasks	Multi-step logic, small math, error analysis
Factuality	5 tasks	Infer from given text, avoid unsupported claims
Instruction following	5 tasks	Format, length, tone, constraints
Safety-lite	5 tasks	Steering toward a safe answer, avoiding unnecessary refusals
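
The composition in the table can be checked mechanically. The sketch below assumes category identifiers in the lowercase style of the schema example ("coding"); the other four identifiers and the check_suite helper are hypothetical.

```python
from collections import Counter

# Expected tasks per category, from the table above.
# Only "coding" appears in the source as an identifier; the rest are assumed spellings.
EXPECTED_COUNTS = {
    "coding": 5,
    "reasoning": 5,
    "factuality": 5,
    "instruction_following": 5,
    "safety_lite": 5,
}


def check_suite(tasks: list[dict]) -> dict[str, tuple[int, int]]:
    """Return {category: (expected, actual)} for every category whose count is off."""
    counts = Counter(task["category"] for task in tasks)
    mismatches = {}
    for category in sorted(set(EXPECTED_COUNTS) | set(counts)):
        expected = EXPECTED_COUNTS.get(category, 0)
        if counts[category] != expected:
            mismatches[category] = (expected, counts[category])
    return mismatches


# Build a placeholder suite with the expected shape and verify it.
suite = [
    {"id": f"{category}_{i:03d}", "category": category}
    for category, n in EXPECTED_COUNTS.items()
    for i in range(1, n + 1)
]
print(check_suite(suite))
```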

Grader types

Grader	Use
Unit-test grader	Coding, deterministic correctness, format validation
LLM-as-judge	Helpfulness, clarity, instruction following, reasoning quality
Rubric grader	Human-readable criteria scored 1-5
Exact-match grader	Short-answer and classification tasks
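
Two of the simpler graders can be sketched in a few lines. Both functions are illustrations under stated assumptions: the whitespace and case normalization in exact_match, and the rescaling of 1-5 rubric ratings to a 0-1 score in rubric_score, are choices made for this example, not behavior confirmed by the harness.

```python
def exact_match(response: str, expected: str) -> float:
    """Score 1.0 when the answers match after normalization, else 0.0.

    Normalization (collapse whitespace, ignore case) is an assumption.
    """
    def normalize(text: str) -> str:
        return " ".join(text.split()).casefold()

    return 1.0 if normalize(response) == normalize(expected) else 0.0


def rubric_score(ratings: dict[str, int]) -> float:
    """Average per-criterion ratings on the 1-5 scale and rescale to 0..1."""
    if not ratings:
        raise ValueError("at least one criterion is required")
    for criterion, rating in ratings.items():
        if not 1 <= rating <= 5:
            raise ValueError(f"{criterion}: rating {rating} is outside 1-5")
    mean_rating = sum(ratings.values()) / len(ratings)
    return (mean_rating - 1) / 4


print(exact_match("  Paris ", "paris"))
print(rubric_score({"clarity": 5, "accuracy": 3}))
```

Mapping every grader onto the same 0-1 range is one way to keep scores comparable across grader types; the harness may report raw rubric values instead.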

Metrics

Score
Latency
Cost
Failure mode
Reproducibility
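
One possible shape for a per-task result that carries these metrics is sketched below. The EvalResult class, the units (seconds, US dollars), and the definition of reproducibility as the share of repeated runs that agree with the most common score are all assumptions for illustration.

```python
from collections import Counter
from dataclasses import dataclass
from statistics import mean
from typing import Optional


@dataclass
class EvalResult:
    task_id: str
    score: float                        # assumed to lie in 0..1
    latency_s: float                    # wall-clock seconds (assumed unit)
    cost_usd: float                     # assumed unit
    failure_mode: Optional[str] = None  # None when the task passed


def summarize(results: list[EvalResult]) -> dict:
    """Aggregate one run into a small report."""
    return {
        "mean_score": mean(r.score for r in results),
        "mean_latency_s": mean(r.latency_s for r in results),
        "total_cost_usd": sum(r.cost_usd for r in results),
        "failure_modes": dict(
            Counter(r.failure_mode for r in results if r.failure_mode)
        ),
    }


def reproducibility(scores: list[float]) -> float:
    """Share of repeated runs of one task that produced the most common score."""
    most_common_count = Counter(scores).most_common(1)[0][1]
    return most_common_count / len(scores)


run = [
    EvalResult("coding_001", score=1.0, latency_s=1.0, cost_usd=0.25),
    EvalResult("coding_002", score=0.0, latency_s=3.0, cost_usd=0.25,
               failure_mode="format_failure"),
]
print(summarize(run))
print(reproducibility([1.0, 1.0, 0.0, 1.0]))
```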

Failure mode taxonomy

Hallucination
Wrong reasoning
Instruction miss
Format failure
Over-answering
Unsafe answer
Judge uncertainty
Looping
Premature success
Context loss
Tool misuse
Over-refusal
Under-refusal
Factuality drift
Style collapse
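
Encoding the taxonomy as an enumeration keeps failure labels consistent across graders and reports. The sketch below lists the fifteen modes above; the snake_case string values and the tally helper are hypothetical.

```python
from collections import Counter
from enum import Enum


class FailureMode(Enum):
    # Member names follow the list above; the string values are assumed spellings.
    HALLUCINATION = "hallucination"
    WRONG_REASONING = "wrong_reasoning"
    INSTRUCTION_MISS = "instruction_miss"
    FORMAT_FAILURE = "format_failure"
    OVER_ANSWERING = "over_answering"
    UNSAFE_ANSWER = "unsafe_answer"
    JUDGE_UNCERTAINTY = "judge_uncertainty"
    LOOPING = "looping"
    PREMATURE_SUCCESS = "premature_success"
    CONTEXT_LOSS = "context_loss"
    TOOL_MISUSE = "tool_misuse"
    OVER_REFUSAL = "over_refusal"
    UNDER_REFUSAL = "under_refusal"
    FACTUALITY_DRIFT = "factuality_drift"
    STYLE_COLLAPSE = "style_collapse"


def tally(labels: list[str]) -> dict[str, int]:
    """Count failure labels, most frequent first.

    FailureMode(label) raises ValueError for a label outside the taxonomy,
    so typos surface immediately instead of creating a new bucket.
    """
    counts = Counter(FailureMode(label) for label in labels)
    return {mode.value: n for mode, n in counts.most_common()}


print(tally(["looping", "tool_misuse", "looping"]))
```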