Open Model Lab
December 2026: Safety / Red Teaming / Refusal Quality
Measure safety behavior by quality and balance, not just refusal rate.
Gate status
- Month
- 2026-12
- Status
- planned
Success criterion
The model's behavior is measured on both risky and harmless requests.
Focus
- Refusal quality.
- Over-refusal.
- Under-refusal.
- Jailbreak robustness.
- Safe completion.
- Helpfulness/safety trade-off.
- Safety boundaries during tool use.
Expected outputs
- safety_evals module.
- Refusal and jailbreak eval report.
- Model-level safety/helpfulness comparison.
End-of-month decision
Can the safety eval distinguish good refusal from lazy or excessive refusal?