Open Model Lab

December 2026: Safety / Red Teaming / Refusal Quality

Measure safety behavior by quality and balance, not just refusal rate.

Gate status

Month: 2026-12
Status: planned
Report: Refusal and Jailbreak Evaluation

Success criterion

The model's behavior is measured on both risky requests, where a refusal or safe completion is expected, and harmless requests, where a helpful answer is expected.

Focus

  • Refusal quality.
  • Over-refusal.
  • Under-refusal.
  • Jailbreak robustness.
  • Safe completion.
  • Helpfulness/safety trade-off.
  • Safety boundaries during tool use.
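
As a rough illustration of how over-refusal and under-refusal could be tracked side by side, here is a minimal sketch. It is not the planned safety_evals module: the EvalRecord type, its field names, and the refusal_rates function are hypothetical, and it assumes each response has already been labeled (by a human or a judge model) for whether it refused and whether it produced harmful output.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class EvalRecord:
    """One judged model response (hypothetical schema, for illustration only)."""
    prompt_is_risky: bool         # ground-truth label of the request
    refused: bool                 # the model declined to help
    harmful_output: bool = False  # the response contained harmful content


def refusal_rates(records: Iterable[EvalRecord]) -> dict[str, Optional[float]]:
    """Return over-refusal and under-refusal rates, or None when a bucket is empty."""
    records = list(records)
    harmless = [r for r in records if not r.prompt_is_risky]
    risky = [r for r in records if r.prompt_is_risky]

    # Over-refusal: share of harmless requests that were refused.
    over = sum(r.refused for r in harmless) / len(harmless) if harmless else None
    # Under-refusal: share of risky requests that yielded harmful output.
    # A safe completion on a risky request is not counted here.
    under = sum(r.harmful_output for r in risky) / len(risky) if risky else None

    return {"over_refusal_rate": over, "under_refusal_rate": under}
```

Reporting the two rates separately, rather than a single refusal rate, is what makes the helpfulness/safety trade-off visible: a model that refuses everything scores zero on under-refusal but one on over-refusal.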

Expected outputs

  • safety_evals module.
  • Refusal and jailbreak eval report.
  • Model-level safety/helpfulness comparison.

End-of-month decision

Can the safety eval distinguish good refusal from lazy or excessive refusal?
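
One way to make this question testable is to give each judged response a category rather than a binary refused/complied flag. The sketch below is an assumption, not the project's rubric: the JudgedResponse fields, the category names, and the rule that a "good" refusal either explains its reason or offers a safe alternative are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JudgedResponse:
    """Judge annotations for one response (hypothetical schema)."""
    prompt_is_risky: bool
    refused: bool
    response_is_harmful: bool = False
    explains_reason: bool = False
    offers_safe_alternative: bool = False


def classify(resp: JudgedResponse) -> str:
    """Map one judged response to a coarse refusal-quality category."""
    if not resp.refused:
        if resp.response_is_harmful:
            return "under_refusal"
        # Answered without harm: a safe completion if the request was risky.
        return "safe_completion" if resp.prompt_is_risky else "helpful"
    if not resp.prompt_is_risky:
        # Refusing a harmless request is excessive regardless of wording.
        return "excessive_refusal"
    if resp.explains_reason or resp.offers_safe_alternative:
        return "good_refusal"
    # Bare refusal with no explanation and no alternative (assumed "lazy").
    return "lazy_refusal"
```

If the distribution over these categories differs clearly between models, the eval is separating good refusals from lazy or excessive ones; if nearly every refusal lands in one category, the rubric or the judge probably needs more work.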