Open Model Lab
February 2027: Monitorability / Interpretability Start
Start analyzing model behavior with internal signals in addition to external scores.
Gate status
- Month: 2027-02
- Status: planned
- Report: Failure Prediction Probes
Success criterion
Behavior changes can be tracked not only from outputs but also from model-internal measurements.
Focus
- Activation logging.
- Hidden-state analysis.
- Entropy.
- Logprob margin.
- Representation drift across Base/SFT/DPO.
- Failure prediction for wrong answers or refusals.
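As a rough illustration of three of the signals above, the sketch below computes next-token entropy, a logprob margin, and a simple drift score from raw numbers. The definitions are assumptions, since the plan does not fix them: the margin is taken here as the gap between the two highest token log-probabilities, and drift as cosine distance between hidden-state vectors for the same prompt from two checkpoints of the same architecture (for example Base and SFT). All function names are hypothetical.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def entropy(logits):
    """Shannon entropy (in nats) of the next-token distribution."""
    logp = log_softmax(logits)
    return -sum(math.exp(lp) * lp for lp in logp)


def logprob_margin(logits):
    """Gap between the top two token log-probabilities (assumed definition)."""
    top = sorted(log_softmax(logits), reverse=True)
    return top[0] - top[1]


def cosine_drift(h_a, h_b):
    """1 - cosine similarity between two hidden-state vectors.

    Assumes both vectors come from the same layer and prompt, and that the
    two checkpoints share a hidden size.
    """
    dot = sum(a * b for a, b in zip(h_a, h_b))
    norm_a = math.sqrt(sum(a * a for a in h_a))
    norm_b = math.sqrt(sum(b * b for b in h_b))
    return 1.0 - dot / (norm_a * norm_b)


if __name__ == "__main__":
    confident = [10.0, 0.0, 0.0, 0.0]   # made-up logits, one clear winner
    uncertain = [0.0, 0.0, 0.0, 0.0]    # made-up logits, uniform
    print(entropy(confident), logprob_margin(confident))
    print(entropy(uncertain), logprob_margin(uncertain))
    print(cosine_drift([1.0, 0.0], [0.0, 1.0]))
```

High entropy and a small margin both indicate an uncertain prediction, which is why they are natural candidate inputs for the failure-prediction item in the list.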
Expected outputs
- Monitorability module.
- Simple probe or failure-prediction experiment.
- Report: can failures be predicted in small models?
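One possible shape for the simple probe is a logistic regression that maps per-example internal signals to a failure probability. The sketch below is only an assumption about how such an experiment could start: the two features (entropy, logprob margin), the hand-made toy rows, and the names `train_probe` and `predict_failure` are hypothetical, and the numbers are illustrative placeholders, not measurements.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_failure(w, b, x):
    """Probe output: estimated probability that the example is a failure."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def train_probe(features, labels, lr=0.5, epochs=500):
    """Fit a logistic-regression probe with full-batch gradient descent."""
    dim = len(features[0])
    n = len(features)
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * dim
        grad_b = 0.0
        for x, y in zip(features, labels):
            err = predict_failure(w, b, x) - y  # gradient of the log loss
            for j in range(dim):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


# Toy rows of [entropy, logprob_margin]; label 1 = wrong answer or refusal.
# Hand-made placeholder values, not experimental data.
TOY_FEATURES = [
    [0.2, 3.0], [0.3, 2.5], [0.1, 3.5],   # confident, labeled correct
    [1.2, 0.3], [1.4, 0.1], [1.1, 0.5],   # uncertain, labeled failure
]
TOY_LABELS = [0, 0, 0, 1, 1, 1]

if __name__ == "__main__":
    w, b = train_probe(TOY_FEATURES, TOY_LABELS)
    print("weights:", w, "bias:", b)
    print("P(failure | uncertain):", predict_failure(w, b, [1.3, 0.2]))
    print("P(failure | confident):", predict_failure(w, b, [0.2, 3.0]))
```

In a real run the feature rows would come from logged activations or output statistics, and the probe would be scored on held-out examples; the same training loop also accepts hidden-state vectors as features.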
End-of-month decision
Are internal signals useful enough to guide later experiments?