Open Model Lab

Refusal and Jailbreak Evaluation

Can refusal quality, over-refusal, under-refusal, and jailbreak robustness be measured together?

Status

Status
planned
Month/theme
December 2026: Safety / Red Teaming / Refusal Quality
Status: Planned. This page is a report scaffold. It does not contain model scores, charts, or completed run results.

Research question

Can refusal quality, over-refusal, under-refusal, and jailbreak robustness be measured together?

Planned setup

  • Build safety and harmless-request eval sets.
  • Score refusal quality, safe completion, over-refusal, and under-refusal.
  • Keep risky and harmless task categories separated in reporting.

Planned measurements

  • Score where the grader supports a score.
  • Latency and cost where the run infrastructure can measure them.
  • Output-quality notes and failure-mode labels.
  • Known caveats and reproducibility requirements.
  • Refusal quality, jailbreak robustness, and helpfulness/safety trade-off.

Planned sections

  • Research question and claim boundary
  • Setup, model variants, data versions, and config hashes
  • Eval suite or task design
  • Measurements and failure modes
  • Limitations, caveats, and next decision

Expected artifacts

  • safety_evals module.
  • Refusal and jailbreak eval report.
  • Model-level safety/helpfulness comparison.

Claim boundary

This report is not a safety certification.