Open Model Lab

February 2027: Monitorability / Interpretability Start

Start analyzing model behavior with internal signals in addition to external scores.

Gate status

Month: 2027-02
Status: planned
Report: Failure Prediction Probes

Success criterion

Behavior changes can be tracked not only from outputs but also from model-internal measurements.

Focus

  • Activation logging.
  • Hidden-state analysis.
  • Entropy.
  • Logprob margin.
  • Representation drift across Base/SFT/DPO.
  • Failure prediction for wrong answers or refusals.
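
As a rough illustration of three of the signals listed above, the sketch below computes entropy, logprob margin, and a simple drift score in plain Python. It assumes that "entropy" means the Shannon entropy of the next-token distribution, that "logprob margin" means the gap between the top-1 and top-2 log-probabilities, and that drift is measured as cosine distance between hidden-state vectors taken from two checkpoints (for example Base and SFT). The function names are hypothetical and not part of any existing module; the lab may define these signals differently.

```python
import math


def log_softmax(logits):
    """Convert raw logits to log-probabilities in a numerically stable way."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def entropy(logits):
    """Shannon entropy (in nats) of the next-token distribution (assumed definition)."""
    logp = log_softmax(logits)
    return -sum(math.exp(lp) * lp for lp in logp)


def logprob_margin(logits):
    """Gap between the top-1 and top-2 log-probabilities (assumed definition)."""
    top = sorted(log_softmax(logits), reverse=True)
    return top[0] - top[1]


def cosine_drift(h_a, h_b):
    """1 - cosine similarity between two hidden-state vectors (assumed drift metric)."""
    dot = sum(a * b for a, b in zip(h_a, h_b))
    norm = math.sqrt(sum(a * a for a in h_a)) * math.sqrt(sum(b * b for b in h_b))
    return 1.0 - dot / norm
```

With uniform logits over four tokens the entropy equals ln 4, the maximum for that vocabulary size, and the margin is zero. Low entropy with a large margin indicates a confident prediction; whether that confidence tracks correctness is the question this month is meant to answer.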

Expected outputs

  • monitorability module.
  • Simple probe or failure-prediction experiment.
  • Report: can failures be predicted in small models?
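
One possible shape for the simple probe is a logistic-regression classifier that predicts failure from per-example internal signals. The sketch below is a minimal version under stated assumptions: the features are (entropy, logprob margin) pairs, the data is synthetic and generated only so the code runs end to end, and all names are hypothetical. It says nothing about what a real model will show.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def make_synthetic(n, rng):
    """Hypothetical stand-in data: [entropy, margin] features, label 1 = failure."""
    rows = []
    for _ in range(n):
        failed = rng.random() < 0.5
        if failed:
            x = [rng.gauss(2.0, 0.5), rng.gauss(0.5, 0.3)]  # uncertain outputs
        else:
            x = [rng.gauss(1.0, 0.5), rng.gauss(2.0, 0.5)]  # confident outputs
        rows.append((x, 1 if failed else 0))
    return rows


def predict(x, w, b):
    """Probability of failure under the linear probe."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def train_probe(rows, lr=0.1, epochs=500):
    """Fit logistic-regression weights with full-batch gradient descent."""
    dim = len(rows[0][0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        gw, gb = [0.0] * dim, 0.0
        for x, y in rows:
            err = predict(x, w, b) - y
            for i in range(dim):
                gw[i] += err * x[i]
            gb += err
        w = [wi - lr * g / len(rows) for wi, g in zip(w, gw)]
        b -= lr * gb / len(rows)
    return w, b


def accuracy(rows, w, b):
    """Fraction of examples where the thresholded probe matches the label."""
    hits = sum((predict(x, w, b) >= 0.5) == (y == 1) for x, y in rows)
    return hits / len(rows)


rng = random.Random(0)  # fixed seed so the run is repeatable
train, test = make_synthetic(400, rng), make_synthetic(200, rng)
w, b = train_probe(train)
test_accuracy = accuracy(test, w, b)
print(f"held-out accuracy on synthetic data: {test_accuracy:.2f}")
```

In a real experiment the synthetic rows would be replaced by signals logged from the model, with labels taken from graded answers or refusal flags. Comparing the probe against a majority-class baseline on held-out data is one way to frame the report's question.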

End-of-month decision

Are internal signals useful enough to guide later experiments?