Open Model Lab

Glossary

Concise definitions for terms used throughout the Open Model Lab section.

Terms

Term	Definition
eval harness	Infrastructure that runs tasks, grades outputs, records metadata, and produces reports.
task suite	A versioned group of tasks used to compare models or model variants.
grader	The scoring method attached to a task, such as unit tests, exact match, a rubric, or LLM-as-judge.
LLM-as-judge	A model used to score another model's output against a rubric or expected behavior.
unit-test grader	A deterministic grader that runs tests against generated code or structured output.
rubric grader	A grader that applies explicit criteria, usually with a bounded score such as 1-5.
SFT	Supervised fine-tuning on instruction-response examples.
DPO	Direct Preference Optimization, a preference-optimization method using chosen/rejected response pairs.
preference pair	A chosen response and rejected response for the same prompt.
base model	A model before instruction tuning or project-specific post-training.
checkpoint	A saved model state from training or fine-tuning.
adapter	A small set of trainable parameters (for example, LoRA weights) attached to a frozen base model, so that fine-tuning does not change all model weights.
failure mode	A named category explaining how or why a model output failed.
hallucination	An unsupported or invented claim presented as if it were grounded.
instruction miss	A failure to follow an explicit format, tone, length, or task constraint.
over-refusal	Refusing a harmless or allowed request unnecessarily.
under-refusal	Complying with a request that should have been refused or safely redirected.
factuality drift	A post-training side effect where factual behavior worsens or becomes less grounded.
style collapse	A narrowing of output style caused by over-optimization or narrow preference data.
agent harness	Infrastructure that lets a model use tools, inspect files, run tests, and produce replayable traces.
tool use	Model-initiated calls to external capabilities such as file access, code execution, or test runners.
long-horizon task	A task that requires multiple steps, decisions, and corrections before completion.
replayable trace	A recorded sequence of prompts, tool calls, outputs, errors, and decisions that can be inspected later.
monitorability	The ability to observe behavior through outputs, metadata, and sometimes internal model signals.
representation drift	A change in internal activation or hidden-state patterns across model variants.
entropy	A measure of uncertainty in a model's predicted token distribution.
logprob margin	The gap between the log probability of the chosen token and that of competing tokens, often measured against the most likely alternative.
multimodal eval	An evaluation involving more than text, such as screenshots or UI images.
UI grounding	Linking a model's answer to the actual visible UI elements in a screenshot or interface.
systems efficiency	The practical speed, memory, throughput, and cost behavior of training or inference infrastructure.
KV cache	Cached attention keys and values from earlier tokens, reused so that autoregressive inference does not recompute them at every decoding step.
quantization	Reducing numeric precision to lower memory or compute cost, with possible quality trade-offs.
score/GPU-hour	A compute-normalized measure: the behavior gain (change in score) relative to the GPU time spent to achieve it.
claim boundary	An explicit statement of what a report does and does not prove.
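
Three of the terms above (entropy, logprob margin, and score/GPU-hour) are quantitative, so a small worked example may help. The following is a minimal sketch, not the project's implementation. The function names, the three-token distribution, and the score and GPU-hour figures are all hypothetical. Two choices are assumptions: the margin is taken against the single most likely competing token (the glossary only says "competing tokens"), and score/GPU-hour is computed as score gain divided by GPU-hours.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a predicted token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def logprob_margin(logprobs, chosen):
    """Chosen token's log probability minus the best competitor's.

    Assumption: "competing tokens" is reduced to the single most
    likely alternative token.
    """
    best_other = max(lp for tok, lp in logprobs.items() if tok != chosen)
    return logprobs[chosen] - best_other


def score_per_gpu_hour(score_before, score_after, gpu_hours):
    """Score gain divided by GPU-hours spent (assumed definition)."""
    if gpu_hours <= 0:
        raise ValueError("gpu_hours must be positive")
    return (score_after - score_before) / gpu_hours


# Hypothetical next-token distribution over a three-token vocabulary.
probs = {"yes": 0.7, "no": 0.2, "maybe": 0.1}
logprobs = {tok: math.log(p) for tok, p in probs.items()}

token_entropy = entropy(probs.values())
margin = logprob_margin(logprobs, chosen="yes")

# Hypothetical run: eval score moves from 0.50 to 0.75 using 10 GPU-hours.
efficiency = score_per_gpu_hour(0.50, 0.75, gpu_hours=10)

print(f"entropy        = {token_entropy:.3f} nats")
print(f"logprob margin = {margin:.3f} nats")
print(f"score/GPU-hour = {efficiency:.3f}")
```

A flatter distribution raises the entropy and shrinks the margin, so the two values tend to move in opposite directions. Under the assumed definition, score/GPU-hour can be negative when a run lowers the score.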