Open Model Lab
Failure Prediction Probes
Can simple internal signals predict wrong answers, refusals, or low-confidence behavior?
Status
Status: Planned. This page is a report scaffold. It does not contain model scores, charts, or completed run results.
Research question
Can simple internal signals predict wrong answers, refusals, or low-confidence behavior?
Planned setup
- Log internal signals for a small model where instrumentation is practical.
- Compare representation drift across the Base, SFT, and DPO variants where possible.
- Train or inspect simple failure-prediction probes.
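As a rough illustration of what a simple failure-prediction probe could look like, the sketch below fits a logistic-regression probe to feature vectors labeled 1 for failure and 0 otherwise. Everything in it is an assumption made for illustration: the function names (`train_probe`, `predict_failure_prob`, `make_synthetic`) are hypothetical, and the features are synthetic placeholders, not activations or results from any model.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_probe(features, labels, lr=0.1, epochs=200):
    """Fit a logistic-regression probe by batch gradient descent.

    features: list of equal-length float lists (stand-ins for internal signals).
    labels: list of ints, 1 = failure, 0 = no failure.
    """
    dim = len(features[0])
    n = len(features)
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * dim
        grad_b = 0.0
        for x, y in zip(features, labels):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            for j in range(dim):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict_failure_prob(w, b, x):
    # Probe output: estimated probability that this example is a failure.
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def make_synthetic(n, dim, seed):
    """Hypothetical placeholder data: failures are shifted along dimension 0."""
    rng = random.Random(seed)
    features, labels = [], []
    for _ in range(n):
        y = rng.randint(0, 1)
        x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        x[0] += 1.5 if y == 1 else -1.5
        features.append(x)
        labels.append(y)
    return features, labels


if __name__ == "__main__":
    train_x, train_y = make_synthetic(200, 4, seed=0)
    test_x, test_y = make_synthetic(200, 4, seed=1)
    w, b = train_probe(train_x, train_y)
    correct = sum(
        (predict_failure_prob(w, b, x) >= 0.5) == (y == 1)
        for x, y in zip(test_x, test_y)
    )
    print("held-out accuracy on synthetic data:", correct / len(test_y))
```

A linear probe is used here only because it is easy to inspect. In a real run, the feature vectors would come from the logged internal signals and the probe would be evaluated on held-out examples.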
Planned measurements
- Activation-derived or logprob-derived signals.
- Failure prediction against wrong answers, refusals, or low-confidence outputs.
- Limits of the instrumentation and task scope.
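One way the logprob-derived signals and the failure-prediction measurement might fit together is sketched below. It summarizes per-token log-probabilities into a few scalar signals and scores a signal against failure labels with AUROC. The function names (`logprob_signals`, `auroc`) and the choice of AUROC as the metric are assumptions for illustration, not part of the planned design.

```python
import math


def logprob_signals(token_logprobs):
    """Summarize per-token log-probabilities of one generated answer."""
    if not token_logprobs:
        raise ValueError("token_logprobs must be non-empty")
    mean_lp = sum(token_logprobs) / len(token_logprobs)
    return {
        "mean_logprob": mean_lp,
        "min_logprob": min(token_logprobs),
        "perplexity": math.exp(-mean_lp),
    }


def auroc(scores, labels):
    """Probability that a random failure outscores a random non-failure.

    scores: higher means "more likely a failure".
    labels: 1 = failure, 0 = no failure. Ties count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one example of each class")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))
```

Under this sketch, a natural failure score is the negated mean log-probability (`-logprob_signals(lp)["mean_logprob"]`), so that lower model confidence maps to a higher score. An AUROC near 0.5 would indicate the signal carries little information about failures on that task.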
Planned sections
- Research question and claim boundary
- Setup, model variants, data versions, and config hashes
- Eval suite or task design
- Measurements and failure modes
- Limitations, caveats, and next decision
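For the config hashes mentioned in the setup section, one possible approach is to hash a canonical JSON serialization of each run's configuration. The sketch below assumes configs are JSON-serializable dictionaries; the function name `config_hash` and the field names in the usage note are hypothetical.

```python
import hashlib
import json


def config_hash(config):
    """Return a SHA-256 hex digest of a config dict, independent of key order."""
    # Sorting keys and fixing separators makes the serialization canonical,
    # so logically identical configs map to the same digest.
    canonical = json.dumps(
        config, sort_keys=True, separators=(",", ":"), ensure_ascii=True
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

A run could then record something like `config_hash({"model_variant": "sft", "data_version": "v0"})` next to its outputs, so that results can later be matched to the exact configuration that produced them.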
Expected artifacts
- Monitorability module.
- Simple probe or failure-prediction experiment.
- Internal-signal analysis report.
Claim boundary
This report is an exploratory monitorability experiment. As a planned scaffold, it currently supports no claims about model behavior.