Open Model Lab Evals

Evals are the foundation of the project: every later SFT, DPO, agent, safety, monitorability, or systems change needs measurements that can be compared across runs.

Purpose

The eval harness is the first gate because the effect of a model change cannot be judged unless it is measured against a stable task suite, grader set, and reporting format. The initial suite is intentionally small and explicit.

Task schema

Schema example, not a published task record:

{
  "id": "coding_001",
  "category": "coding",
  "prompt": "Write a Python function that...",
  "grader": "unit_test",
  "expected_behavior": "The function passes all hidden tests.",
  "difficulty": "easy",
  "tags": ["python", "unit-test", "deterministic"]
}
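
A minimal sketch of how a record in this shape could be loaded and validated before a run. Only the field names and the values "coding" and "unit_test" come from the example above; the other category and grader identifiers, the Task class, and load_task are assumptions made for illustration, not part of the project's code.

```python
import json
from dataclasses import dataclass

# Assumed identifiers: only "coding" and "unit_test" appear in the schema
# example; the rest are hypothetical spellings of the listed categories/graders.
CATEGORIES = {"coding", "reasoning", "factuality", "instruction_following", "safety_lite"}
GRADERS = {"unit_test", "llm_judge", "rubric", "exact_match"}


@dataclass(frozen=True)
class Task:
    id: str
    category: str
    prompt: str
    grader: str
    expected_behavior: str
    difficulty: str
    tags: tuple = ()

    def __post_init__(self):
        # Reject records whose category or grader is not in the assumed sets.
        if self.category not in CATEGORIES:
            raise ValueError(f"unknown category: {self.category!r}")
        if self.grader not in GRADERS:
            raise ValueError(f"unknown grader: {self.grader!r}")


def load_task(raw: str) -> Task:
    """Parse one JSON task record; missing or extra keys raise TypeError."""
    data = json.loads(raw)
    data["tags"] = tuple(data.get("tags", []))
    return Task(**data)
```

Failing fast on an unknown category or grader keeps a malformed record from silently entering the suite and skewing later comparisons.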

Initial task categories

Category               Count    Examples
Coding                 5 tasks  Short Python function, bug fix, unit-test pass
Reasoning              5 tasks  Multi-step logic, small math, error analysis
Factuality             5 tasks  Infer from given text, avoid unsupported claims
Instruction following  5 tasks  Format, length, tone, constraints
Safety-lite            5 tasks  Safe direction and unnecessary refusal behavior
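
Because the suite is small and fixed, its composition can be checked mechanically. The sketch below assumes tasks are dicts with a "category" key as in the schema example; the category identifiers other than "coding" and the function name are hypothetical.

```python
from collections import Counter

# Five tasks per category, as in the table above. Identifier spellings
# other than "coding" are assumptions.
EXPECTED_COUNTS = {
    "coding": 5,
    "reasoning": 5,
    "factuality": 5,
    "instruction_following": 5,
    "safety_lite": 5,
}


def check_suite_composition(tasks, expected=EXPECTED_COUNTS):
    """Return {category: (expected, actual)} for every category that is off."""
    actual = Counter(task["category"] for task in tasks)
    mismatches = {}
    for category in set(expected) | set(actual):
        want = expected.get(category, 0)
        have = actual.get(category, 0)
        if want != have:
            mismatches[category] = (want, have)
    return mismatches
```

An empty result means the suite matches the intended 25-task layout.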

Grader types

Grader              Use
Unit-test grader    Coding, deterministic correctness, format validation
LLM-as-judge        Helpfulness, clarity, instruction following, reasoning quality
Rubric grader       Human-readable criteria scored 1-5
Exact-match grader  Short-answer and classification tasks
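
The deterministic graders in this table are simple enough to sketch. The functions below are illustrative assumptions, not the project's implementation: the names, the whitespace and case normalization, the fractional unit-test score, and the mapping of rubric scores onto a 0 to 1 range are all choices made for this example. The LLM-as-judge grader is left out because it depends on a model call.

```python
def exact_match_grader(output: str, expected: str) -> float:
    """Return 1.0 if output equals expected after normalization, else 0.0."""
    def normalize(text: str) -> str:
        # Assumed normalization: trim, lowercase, collapse inner whitespace.
        return " ".join(text.strip().lower().split())
    return 1.0 if normalize(output) == normalize(expected) else 0.0


def unit_test_grader(func, cases) -> float:
    """Run func on (args, expected) pairs and return the fraction that pass."""
    if not cases:
        return 0.0
    passed = 0
    for args, expected in cases:
        try:
            if func(*args) == expected:
                passed += 1
        except Exception:
            # A candidate that raises counts as a failed case.
            pass
    return passed / len(cases)


def rubric_to_unit_score(rubric_score: int) -> float:
    """Map a 1-5 rubric score linearly onto [0.0, 1.0]."""
    if not 1 <= rubric_score <= 5:
        raise ValueError("rubric score must be between 1 and 5")
    return (rubric_score - 1) / 4
```

For example, exact_match_grader("  Paris ", "paris") returns 1.0 and rubric_to_unit_score(3) returns 0.5. Putting every grader on the same 0 to 1 scale is one way to make scores comparable across task categories; running untrusted model-written code through unit_test_grader would additionally need a sandbox, which this sketch does not provide.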

Metrics

  • Score
  • Latency
  • Cost
  • Failure mode
  • Reproducibility
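
One possible shape for a per-task result that carries these metrics, with a small aggregation step. The EvalResult class, the field names, and the units (seconds, US dollars) are assumptions for illustration; the source does not specify them. Reproducibility is modeled here as two runs producing the same per-task scores within a tolerance, which is one interpretation among several.

```python
import time
from dataclasses import dataclass
from statistics import mean
from typing import Optional


@dataclass
class EvalResult:
    task_id: str
    score: float                         # grader output, assumed in [0, 1]
    latency_s: float                     # wall-clock seconds (assumed unit)
    cost_usd: float                      # assumed unit
    failure_mode: Optional[str] = None   # None when the task passed


def timed_call(fn, *args):
    """Call fn(*args) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


def summarize(results):
    """Aggregate a non-empty list of EvalResult into run-level metrics."""
    if not results:
        raise ValueError("no results to summarize")
    failures = sum(1 for r in results if r.failure_mode is not None)
    return {
        "mean_score": mean(r.score for r in results),
        "mean_latency_s": mean(r.latency_s for r in results),
        "total_cost_usd": sum(r.cost_usd for r in results),
        "failure_rate": failures / len(results),
    }


def is_reproducible(run_a, run_b, tol=0.0):
    """True if both runs cover the same tasks with scores within tol."""
    a = {r.task_id: r.score for r in run_a}
    b = {r.task_id: r.score for r in run_b}
    return a.keys() == b.keys() and all(abs(a[k] - b[k]) <= tol for k in a)
```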

Failure mode taxonomy

  • Hallucination
  • Wrong reasoning
  • Instruction miss
  • Format failure
  • Over-answering
  • Unsafe answer
  • Judge uncertainty
  • Looping
  • Premature success
  • Context loss
  • Tool misuse
  • Over-refusal
  • Under-refusal
  • Factuality drift
  • Style collapse
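
A fixed taxonomy is easiest to keep consistent when labels are drawn from a closed set. The sketch below encodes the fifteen modes listed above as an enum and tallies them for a run; the FailureMode class, its snake_case string values, and tally_failures are hypothetical names chosen for this example.

```python
from collections import Counter
from enum import Enum


class FailureMode(Enum):
    HALLUCINATION = "hallucination"
    WRONG_REASONING = "wrong_reasoning"
    INSTRUCTION_MISS = "instruction_miss"
    FORMAT_FAILURE = "format_failure"
    OVER_ANSWERING = "over_answering"
    UNSAFE_ANSWER = "unsafe_answer"
    JUDGE_UNCERTAINTY = "judge_uncertainty"
    LOOPING = "looping"
    PREMATURE_SUCCESS = "premature_success"
    CONTEXT_LOSS = "context_loss"
    TOOL_MISUSE = "tool_misuse"
    OVER_REFUSAL = "over_refusal"
    UNDER_REFUSAL = "under_refusal"
    FACTUALITY_DRIFT = "factuality_drift"
    STYLE_COLLAPSE = "style_collapse"


def tally_failures(labels):
    """Count failure labels; an unknown label raises ValueError."""
    return Counter(FailureMode(label) for label in labels)
```

Rejecting unknown labels keeps ad hoc categories from creeping into reports, so failure counts stay comparable from one run to the next.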