Open Model Lab

Failure Prediction Probes

Can simple internal signals predict wrong answers, refusals, or low-confidence behavior?

Status

Status: planned
Month/theme: February 2027: Monitorability / Interpretability Start

This page is a report scaffold. It does not contain model scores, charts, or completed run results.

Research question

Can simple internal signals predict wrong answers, refusals, or low-confidence behavior?

Planned setup

Log internal signals for a small model where instrumentation is practical.
Compare representation drift across the Base, SFT, and DPO variants where possible.
Train or inspect simple failure-prediction probes.
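
As an illustration of what a "simple failure-prediction probe" could look like, here is a minimal sketch, assuming a logistic-regression probe over per-example feature vectors. Nothing in it comes from a real run: the features are synthetic stand-ins for logged activations, and the names `train_probe`, `predict_failure_prob`, and `make_synthetic`, along with the learning rate and epoch count, are hypothetical choices made for this example only.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_probe(features, labels, lr=0.5, epochs=300):
    """Fit a logistic-regression probe with plain batch gradient descent.

    features: list of equal-length float vectors (hypothetical activations).
    labels:   list of 0/1 ints, where 1 marks a failure.
    """
    dim = len(features[0])
    n = len(features)
    weights = [0.0] * dim
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * dim
        grad_b = 0.0
        for x, y in zip(features, labels):
            p = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
            err = p - y  # gradient of the log loss w.r.t. the logit
            for i, xi in enumerate(x):
                grad_w[i] += err * xi
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def predict_failure_prob(weights, bias, x):
    """Probe output: estimated probability that this example is a failure."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


def make_synthetic(n, dim, rng):
    """Synthetic data only: failures are shifted along dimension 0."""
    features, labels = [], []
    for _ in range(n):
        y = 1 if rng.random() < 0.5 else 0
        x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        x[0] += 2.0 * y
        features.append(x)
        labels.append(y)
    return features, labels


rng = random.Random(0)
train_x, train_y = make_synthetic(400, 4, rng)
test_x, test_y = make_synthetic(200, 4, rng)

weights, bias = train_probe(train_x, train_y)
preds = [predict_failure_prob(weights, bias, x) >= 0.5 for x in test_x]
accuracy = sum(int(p) == y for p, y in zip(preds, test_y)) / len(test_y)
print(f"held-out accuracy on synthetic data: {accuracy:.2f}")
```

In an actual run, the synthetic generator would be replaced by logged signals and failure labels from the instrumented model, and the probe would be evaluated on a held-out split as done here.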

Planned measurements

Activation-derived or logprob-derived signals.
Failure prediction against wrong answers, refusals, or low-confidence outputs.
Limits of the instrumentation and task scope.
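
The measurements above could be prototyped along these lines. This is a minimal sketch, assuming per-token log-probabilities are available for each output; the helper names (`mean_logprob`, `entropy`, `auroc`) and the choice of AUROC as the scoring metric are assumptions for illustration, not the report's settled method. The toy numbers are hand-written placeholders, not model outputs.

```python
import math


def mean_logprob(token_logprobs):
    """Average log-probability of the generated tokens (higher = more confident)."""
    return sum(token_logprobs) / len(token_logprobs)


def entropy(probs):
    """Shannon entropy in nats of a single next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def auroc(scores, labels):
    """Chance that a random failure scores higher than a random non-failure.

    scores: higher should mean "more likely to fail".
    labels: 1 for failure, 0 otherwise. Ties count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one failure and one non-failure")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hand-written placeholder values, one list of token log-probs per output.
toy_logprobs = [
    [-0.1, -0.2, -0.1],
    [-1.5, -2.0, -0.9],
    [-0.3, -0.2, -0.4],
    [-2.2, -1.1, -1.8],
]
toy_labels = [0, 1, 0, 1]  # 1 = wrong answer, refusal, or low-confidence output

# Negate so that a lower mean log-prob yields a higher failure score.
failure_scores = [-mean_logprob(lp) for lp in toy_logprobs]
print("AUROC on placeholder data:", auroc(failure_scores, toy_labels))
```

A rank-based metric such as AUROC is used here because it does not depend on choosing a decision threshold, which seems appropriate while the signals are still exploratory.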

Planned sections

Research question and claim boundary
Setup, model variants, data versions, and config hashes
Eval suite or task design
Measurements and failure modes
Limitations, caveats, and next decision

Expected artifacts

Monitorability module.
Simple probe or failure-prediction experiment.
Internal-signal analysis report.

Claim boundary

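This report is planned as an exploratory monitorability experiment on a small model. Its scope is bounded by the limits of the instrumentation and tasks listed under Planned measurements.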

Related links

Reports index
Related month page
Runs