Open Model Lab
Refusal and Jailbreak Evaluation
Can refusal quality, over-refusal, under-refusal, and jailbreak robustness be measured together?
Status
Status: Planned. This page is a report scaffold. It does not contain model scores, charts, or completed run results.
Research question
Can refusal quality, over-refusal, under-refusal, and jailbreak robustness be measured together?
Planned setup
- Build safety and harmless-request eval sets.
- Score refusal quality, safe completion, over-refusal, and under-refusal.
- Keep risky and harmless task categories separated in reporting.
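As an illustration of the setup above, over-refusal and under-refusal rates could be computed per task category so that risky and harmless categories never get averaged together. This is a minimal sketch under stated assumptions, not the planned safety_evals module: the record fields (`category`, `is_harmful`, `refused`), the function name, and the example category names are hypothetical. The definitions used here are also assumptions: over-refusal means a refusal of a harmless request, and under-refusal means compliance with a harmful one.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalRecord:
    # Hypothetical schema; the planned safety_evals module may differ.
    category: str     # task category, e.g. "cooking" or "weapons" (illustrative names)
    is_harmful: bool  # ground-truth label for the request
    refused: bool     # grader's judgment of the model response


def refusal_rates_by_category(records):
    """Return {category: {"n", "over_refusal", "under_refusal"}}.

    A rate is None when the category has no requests of the relevant kind,
    so an empty denominator is never reported as 0.0.
    """
    counts = defaultdict(lambda: {"harmless": 0, "over": 0, "harmful": 0, "under": 0})
    for r in records:
        c = counts[r.category]
        if r.is_harmful:
            c["harmful"] += 1
            c["under"] += int(not r.refused)  # complied with a harmful request
        else:
            c["harmless"] += 1
            c["over"] += int(r.refused)       # refused a harmless request

    report = {}
    for category, c in sorted(counts.items()):
        report[category] = {
            "n": c["harmless"] + c["harmful"],
            "over_refusal": c["over"] / c["harmless"] if c["harmless"] else None,
            "under_refusal": c["under"] / c["harmful"] if c["harmful"] else None,
        }
    return report
```

Reporting `None` instead of zero for an empty denominator keeps a harmless-only category from appearing to have a perfect under-refusal rate.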
Planned measurements
- Scores, where the grader supports a numeric score.
- Latency and cost where the run infrastructure can measure them.
- Output-quality notes and failure-mode labels.
- Known caveats and reproducibility requirements.
- Refusal quality, jailbreak robustness, and helpfulness/safety trade-off.
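One possible way to put a number on jailbreak robustness is an attack success rate over paired prompts. The sketch below assumes each harmful prompt is run twice, once plainly and once wrapped in a jailbreak, and that a grader returns a boolean refusal judgment for each run. The function name and the pairing scheme are hypothetical; the report may adopt a different definition.

```python
def attack_success_rate(baseline_refused, attacked_refused):
    """Share of baseline-refused prompts that the model answers under attack.

    Both arguments are equal-length sequences of booleans, paired per prompt.
    Returns None when the model refused nothing at baseline, since there is
    then nothing for the jailbreak to break.
    """
    if len(baseline_refused) != len(attacked_refused):
        raise ValueError("inputs must be paired per prompt")

    # Only prompts refused without the attack count toward the denominator.
    attacked_outcomes = [a for b, a in zip(baseline_refused, attacked_refused) if b]
    if not attacked_outcomes:
        return None

    broken = sum(1 for still_refused in attacked_outcomes if not still_refused)
    return broken / len(attacked_outcomes)
```

Conditioning on the baseline refusal separates jailbreak failures from plain under-refusal, which is measured on its own.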
Planned sections
- Research question and claim boundary
- Setup, model variants, data versions, and config hashes
- Eval suite or task design
- Measurements and failure modes
- Limitations, caveats, and next decision
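For the config hashes mentioned in the setup section, one common approach is to hash a canonical JSON serialization of the run configuration. This is a sketch only: the function name and the choice of SHA-256 over sorted-key JSON are assumptions, not a format this page specifies.

```python
import hashlib
import json


def config_hash(config: dict) -> str:
    """Return a hex SHA-256 digest that is stable across key ordering."""
    # Sorted keys and fixed separators give one canonical byte string
    # for any two configs with equal contents.
    canonical = json.dumps(
        config, sort_keys=True, separators=(",", ":"), ensure_ascii=True
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Two runs whose configs differ only in key order get the same hash, while any changed value produces a different one. The config must contain only JSON-serializable values for this to work.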
Expected artifacts
- safety_evals module.
- Refusal and jailbreak eval report.
- Model-level safety/helpfulness comparison.
Claim boundary
This report is not a safety certification.