Open Model Lab
Glossary
Concise definitions for terms used throughout the Open Model Lab section.
Terms
| Term | Definition |
|---|---|
| eval harness | Infrastructure that runs tasks, grades outputs, records metadata, and produces reports. |
| task suite | A versioned group of tasks used to compare models or model variants. |
| grader | The scoring method attached to a task, such as unit tests, exact match, a rubric, or LLM-as-judge. |
| LLM-as-judge | A model used to score another model's output against a rubric or expected behavior. |
| unit-test grader | A deterministic grader that runs tests against generated code or structured output. |
| rubric grader | A grader that applies explicit criteria, usually with a bounded score such as 1 to 5. |
| SFT | Supervised fine-tuning on instruction-response examples. |
| DPO | Direct Preference Optimization, a preference-optimization method using chosen/rejected response pairs. |
| preference pair | A chosen response and a rejected response for the same prompt. |
| base model | A model before instruction tuning or project-specific post-training. |
| checkpoint | A saved model state from training or fine-tuning. |
| adapter | A small trainable component attached to a base model, so that fine-tuning updates the adapter rather than all model weights. |
| failure mode | A named category explaining how or why a model output failed. |
| hallucination | An unsupported or invented claim presented as if it were grounded. |
| instruction miss | A failure to follow an explicit format, tone, length, or task constraint. |
| over-refusal | Refusing a harmless or allowed request unnecessarily. |
| under-refusal | Complying with a request that should have been refused or safely redirected. |
| factuality drift | A post-training side effect where factual behavior worsens or becomes less grounded. |
| style collapse | A narrowing of output style caused by over-optimization or narrow preference data. |
| agent harness | Infrastructure that lets a model use tools, inspect files, run tests, and produce replayable traces. |
| tool use | Model-initiated calls to external capabilities such as file access, code execution, or test runners. |
| long-horizon task | A task that requires multiple steps, decisions, and corrections before completion. |
| replayable trace | A recorded sequence of prompts, tool calls, outputs, errors, and decisions that can be inspected later. |
| monitorability | The ability to observe behavior through outputs, metadata, and sometimes internal model signals. |
| representation drift | A change in internal activation or hidden-state patterns across model variants. |
| entropy | A measure of uncertainty in a model's predicted token distribution. |
| logprob margin | The gap between the log probability of the chosen token and that of its competing tokens, typically the strongest competitor. |
| multimodal eval | An evaluation involving more than text, such as screenshots or UI images. |
| UI grounding | Linking a model's answer to the actual visible UI elements in a screenshot or interface. |
| systems efficiency | The practical speed, memory, throughput, and cost behavior of training or inference infrastructure. |
| KV cache | Cached attention keys and values from earlier tokens, reused so that autoregressive inference does not recompute them at every step. |
| quantization | Reducing numeric precision to lower memory or compute cost, with possible quality trade-offs. |
| score/GPU-hour | A compute-normalized measure: behavior gain divided by the GPU hours spent to obtain it. |
| claim boundary | An explicit statement of what a report does and does not prove. |
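
Three of the terms above (entropy, logprob margin, and score/GPU-hour) are numeric, so a small worked sketch may help make them concrete. The glossary does not fix exact formulas, so everything here is an assumption for illustration: entropy is taken as Shannon entropy in nats, the margin is measured against the single strongest competing token, behavior gain is taken as a plain score difference, and the function names `entropy`, `logprob_margin`, and `score_per_gpu_hour` are hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a predicted token distribution.

    Assumes `probs` is a list of probabilities that sums to 1.
    Zero-probability tokens contribute nothing and are skipped.
    """
    return -sum(p * math.log(p) for p in probs if p > 0)


def logprob_margin(logprobs, chosen):
    """Gap between the chosen token's log probability and the best competitor's.

    Assumes `logprobs` maps token -> log probability and contains at least
    one token other than `chosen`. A larger margin means the model preferred
    the chosen token more decisively.
    """
    competitors = [lp for tok, lp in logprobs.items() if tok != chosen]
    return logprobs[chosen] - max(competitors)


def score_per_gpu_hour(baseline_score, new_score, gpu_hours):
    """Behavior gain (assumed here to be a score difference) per GPU hour."""
    if gpu_hours <= 0:
        raise ValueError("gpu_hours must be positive")
    return (new_score - baseline_score) / gpu_hours


if __name__ == "__main__":
    # Illustrative numbers only; they are not results from any real run.
    token_probs = {"yes": 0.7, "no": 0.2, "maybe": 0.1}
    token_logprobs = {tok: math.log(p) for tok, p in token_probs.items()}

    print("entropy:", entropy(list(token_probs.values())))
    print("margin:", logprob_margin(token_logprobs, "yes"))
    print("gain per GPU-hour:", score_per_gpu_hour(0.60, 0.66, 12))
```

Under these assumptions, a uniform distribution gives the highest entropy and a confident prediction gives a large positive margin. A report using these measures would still need to state which definitions it applies.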