Open Model Lab

Failure Prediction Probes

Can simple internal signals predict wrong answers, refusals, or low-confidence behavior?

Status

Status: Planned
Month/theme: February 2027: Monitorability / Interpretability Start

This page is a report scaffold. It does not contain model scores, charts, or completed run results.

Research question

Can simple internal signals predict wrong answers, refusals, or low-confidence behavior?

Planned setup

  • Log internal signals for a small model where instrumentation is practical.
  • Compare representation drift across the Base, SFT, and DPO variants where possible.
  • Train or inspect simple failure-prediction probes.
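
As a rough illustration of what a "simple failure-prediction probe" could look like, the sketch below fits a logistic-regression probe on feature vectors and scores it on held-out examples. Everything in it is an assumption made for illustration: the data is synthetic (random vectors standing in for logged activations, with failures shifted along one dimension), and the names train_probe, predict_failure_prob, and make_synthetic are hypothetical, not part of any planned module. It uses only the Python standard library.

```python
import math
import random


def sigmoid(z: float) -> float:
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_probe(features, labels, lr=0.2, epochs=500):
    """Fit a logistic-regression probe with plain batch gradient descent."""
    dim = len(features[0])
    n = len(features)
    weights = [0.0] * dim
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * dim
        grad_b = 0.0
        for x, y in zip(features, labels):
            p = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
            err = p - y  # gradient of the log loss w.r.t. the logit
            for i in range(dim):
                grad_w[i] += err * x[i]
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def predict_failure_prob(weights, bias, x):
    """Probe output: estimated probability that this example is a failure."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


def make_synthetic(n, dim, rng):
    """Synthetic stand-in for logged activations (assumption, not real data).

    Failures (label 1) are shifted by +2 along dimension 0; all other
    dimensions are pure noise.
    """
    feats, labels = [], []
    for _ in range(n):
        y = 1 if rng.random() < 0.5 else 0
        x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        x[0] += 2.0 * y
        feats.append(x)
        labels.append(y)
    return feats, labels


rng = random.Random(0)
train_x, train_y = make_synthetic(400, 8, rng)
test_x, test_y = make_synthetic(200, 8, rng)

weights, bias = train_probe(train_x, train_y)
preds = [1 if predict_failure_prob(weights, bias, x) >= 0.5 else 0 for x in test_x]
accuracy = sum(int(p == y) for p, y in zip(preds, test_y)) / len(test_y)
print(f"held-out probe accuracy on synthetic data: {accuracy:.2f}")
```

In a real run the synthetic vectors would be replaced by activations logged from the instrumented model, and the labels by observed wrong answers or refusals. The printed accuracy says nothing about any actual model.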

Planned measurements

  • Activation-derived or logprob-derived signals.
  • Failure prediction against wrong answers, refusals, or low-confidence outputs.
  • Limits of the instrumentation and task scope.
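
The measurements above could be prototyped along the following lines. This is a minimal sketch under stated assumptions: the per-token log-probabilities and failure labels are made up for illustration, the helper names logprob_signals and auroc are hypothetical, and AUROC is used here as one plausible way to score failure prediction rather than a metric the plan has committed to.

```python
import math


def logprob_signals(token_logprobs):
    """Summarize the per-token log-probabilities of one generated answer."""
    mean_lp = sum(token_logprobs) / len(token_logprobs)
    return {
        "mean_logprob": mean_lp,
        "min_logprob": min(token_logprobs),
        "perplexity": math.exp(-mean_lp),
    }


def auroc(scores, labels):
    """Probability that a random failure scores higher than a random
    non-failure, counting ties as one half (pairwise definition)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one example of each class")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Made-up examples: (token log-probabilities, failure label).
examples = [
    ([-0.1, -0.2, -0.1], 0),
    ([-0.3, -0.2, -0.4], 0),
    ([-1.5, -2.0, -0.9], 1),
    ([-0.2, -2.5, -0.3], 1),
]

# Use negative mean log-probability as the failure score:
# lower confidence should mean a higher score.
scores = [-logprob_signals(lps)["mean_logprob"] for lps, _ in examples]
labels = [y for _, y in examples]
print("AUROC of mean-logprob signal on toy data:", auroc(scores, labels))
```

The pairwise AUROC is quadratic in the number of examples, which is fine for small exploratory sets. A rank-based implementation would be the usual choice at larger scale.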

Planned sections

  • Research question and claim boundary
  • Setup, model variants, data versions, and config hashes
  • Eval suite or task design
  • Measurements and failure modes
  • Limitations, caveats, and next decision
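
The "config hashes" item in the list above could be produced in several ways. One possible approach, sketched here as an assumption rather than the lab's actual tooling, is to hash a canonical JSON serialization of the run configuration so that the same settings always map to the same identifier. The function name config_hash and the example keys are hypothetical.

```python
import hashlib
import json


def config_hash(config: dict) -> str:
    """Return a SHA-256 hex digest of a canonical JSON form of the config.

    Sorting keys and fixing separators makes the digest independent of
    dictionary ordering and whitespace.
    """
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Hypothetical run configuration, for illustration only.
run_config = {"model_variant": "sft", "probe": "logistic", "layer": 6, "seed": 0}
print("config hash prefix:", config_hash(run_config)[:12])
```

Recording this digest next to each result makes it possible to check later that two runs being compared actually used identical settings.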

Expected artifacts

  • Monitorability module.
  • Simple probe or failure-prediction experiment.
  • Internal-signal analysis report.

Claim boundary

This report describes an exploratory monitorability experiment. Until runs are completed, it makes no claims about model scores or probe performance.