Open Model Lab

Glossary

Concise definitions for terms used throughout the Open Model Lab section.

Terms

| Term | Definition |
| --- | --- |
| eval harness | Infrastructure that runs tasks, grades outputs, records metadata, and produces reports. |
| task suite | A versioned group of tasks used to compare models or model variants. |
| grader | The scoring method attached to a task, such as unit tests, exact match, a rubric, or LLM-as-judge. |
| LLM-as-judge | A model used to score another model's output against a rubric or expected behavior. |
| unit-test grader | A deterministic grader that runs tests against generated code or structured output. |
| rubric grader | A grader that applies explicit criteria, usually with a bounded score such as 1-5. |
| SFT | Supervised fine-tuning on instruction-response examples. |
| DPO | Direct Preference Optimization, a preference-optimization method using chosen/rejected response pairs. |
| preference pair | A chosen response and a rejected response for the same prompt. |
| base model | A model before instruction tuning or project-specific post-training. |
| checkpoint | A saved model state from training or fine-tuning. |
| adapter | A small trainable component attached to a base model and trained in place of updating all model weights. |
| failure mode | A named category explaining how or why a model output failed. |
| hallucination | An unsupported or invented claim presented as if it were grounded. |
| instruction miss | A failure to follow an explicit format, tone, length, or task constraint. |
| over-refusal | Refusing a harmless or allowed request unnecessarily. |
| under-refusal | Complying with a request that should have been refused or safely redirected. |
| factuality drift | A post-training side effect where factual behavior worsens or becomes less grounded. |
| style collapse | A narrowing of output style caused by over-optimization or narrow preference data. |
| agent harness | Infrastructure that lets a model use tools, inspect files, run tests, and produce replayable traces. |
| tool use | Model-initiated calls to external capabilities such as file access, code execution, or test runners. |
| long-horizon task | A task that requires multiple steps, decisions, and corrections before completion. |
| replayable trace | A recorded sequence of prompts, tool calls, outputs, errors, and decisions that can be inspected later. |
| monitorability | The ability to observe behavior through outputs, metadata, and sometimes internal model signals. |
| representation drift | A change in internal activation or hidden-state patterns across model variants. |
| entropy | A measure of uncertainty in a model's predicted token distribution. |
| logprob margin | The gap between the log probability of the chosen token and that of competing tokens, typically the most likely alternative. |
| multimodal eval | An evaluation involving more than text, such as screenshots or UI images. |
| UI grounding | Linking a model's answer to the actual visible UI elements in a screenshot or interface. |
| systems efficiency | The practical speed, memory, throughput, and cost behavior of training or inference infrastructure. |
| KV cache | Cached attention keys and values from earlier tokens, reused so that autoregressive inference does not recompute them at every step. |
| quantization | Reducing numeric precision to lower memory or compute cost, with possible quality trade-offs. |
| score/GPU-hour | A compute-normalized measure that relates behavior gain to the GPU time spent. |
| claim boundary | An explicit statement of what a report does and does not prove. |
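
Three of the terms above (entropy, logprob margin, and score/GPU-hour) are quantitative, so a small worked sketch may help. The Python below is illustrative only and is not the project's implementation. The function names `entropy`, `logprob_margin`, and `score_per_gpu_hour` are hypothetical, and the numbers are made up. It assumes entropy is measured in nats, that "competing tokens" means the single most likely alternative, and that behavior gain means the new score minus a baseline score.

```python
import math


def entropy(probs):
    """Shannon entropy, in nats, of one predicted token distribution."""
    # Zero-probability tokens contribute nothing, so they are skipped.
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def logprob_margin(logprobs, chosen):
    """Log probability of the chosen token minus the best competitor's.

    Assumption: the margin is taken against the single most likely
    alternative token, which is one common reading of the definition.
    """
    best_other = max(lp for i, lp in enumerate(logprobs) if i != chosen)
    return logprobs[chosen] - best_other


def score_per_gpu_hour(baseline_score, new_score, gpu_hours):
    """Score gain over a baseline, divided by the GPU-hours spent.

    Assumption: "behavior gain" is new_score - baseline_score.
    """
    if gpu_hours <= 0:
        raise ValueError("gpu_hours must be positive")
    return (new_score - baseline_score) / gpu_hours


# Toy next-token distribution over a four-token vocabulary (made-up values).
probs = [0.7, 0.2, 0.05, 0.05]
logprobs = [math.log(p) for p in probs]

print(f"entropy: {entropy(probs):.3f} nats")
print(f"logprob margin: {logprob_margin(logprobs, chosen=0):.3f}")
print(f"score/GPU-hour: {score_per_gpu_hour(0.60, 0.66, gpu_hours=12.0):.4f}")
```

A peaked distribution gives low entropy and a large margin, while a flat one gives high entropy and a margin near zero. The score/GPU-hour figure is only comparable across runs that share the same task suite and grader.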