Open Model Lab
Outcome vs Process Evaluation
When do final-answer scores hide flawed reasoning processes?
Status
Status: Planned. This page is a report scaffold. It does not contain model scores, charts, or completed run results.
Research question
When do final-answer scores hide flawed reasoning processes?
Planned setup
- Create tasks with outcome-based and process-oriented rubrics.
- Compare final-answer correctness with observable reasoning-path signals.
- Record uncertainty and self-correction behavior.
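One way to picture the comparison above is a per-item record that holds both an outcome score and a process score, plus a label for items where the two disagree. This is a minimal sketch under stated assumptions: `EvalRecord`, its field names, and the 0.5 threshold are hypothetical and are not taken from any existing module or grader.

```python
from dataclasses import dataclass
from typing import Optional


# Hypothetical record type; all field names are illustrative assumptions.
@dataclass(frozen=True)
class EvalRecord:
    task_id: str
    final_correct: bool           # outcome rubric: did the final answer match?
    process_score: float          # process rubric: 0.0 (flawed) to 1.0 (sound)
    self_corrected: bool = False  # did the visible output revise an earlier step?


# Assumed cutoff for calling a reasoning path "sound"; a real rubric may differ.
PROCESS_THRESHOLD = 0.5


def disagreement(record: EvalRecord,
                 threshold: float = PROCESS_THRESHOLD) -> Optional[str]:
    """Return a label when outcome and process rubrics disagree, else None."""
    process_ok = record.process_score >= threshold
    if record.final_correct and not process_ok:
        return "correct_answer_flawed_process"
    if not record.final_correct and process_ok:
        return "wrong_answer_sound_process"
    return None
```

The first label is the case the research question targets: a final-answer score that looks fine while the observable reasoning path does not.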
Planned measurements
- Scores, for tasks where the grader can produce one.
- Latency and cost, where the run infrastructure can measure them.
- Output-quality notes and failure-mode labels.
- Known caveats and reproducibility requirements.
- Final-answer vs reasoning-path disagreements.
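The disagreement measurement in the list above could be summarized as a two-by-two table of outcome against process. The sketch below assumes each item has already been reduced to a pair of booleans; `contingency` and `hidden_flaw_rate` are hypothetical helper names, and no real run data is implied.

```python
from collections import Counter
from typing import Dict, Iterable, Optional, Tuple


def contingency(pairs: Iterable[Tuple[bool, bool]]) -> Dict[str, int]:
    """Count items by (final_correct, process_ok) into four named cells."""
    counts = Counter((bool(f), bool(p)) for f, p in pairs)
    return {
        "correct_sound": counts[(True, True)],
        "correct_flawed": counts[(True, False)],
        "wrong_sound": counts[(False, True)],
        "wrong_flawed": counts[(False, False)],
    }


def hidden_flaw_rate(table: Dict[str, int]) -> Optional[float]:
    """Share of correct final answers whose reasoning path was judged flawed.

    Returns None when there are no correct answers, so an empty denominator
    is not reported as a rate of zero.
    """
    correct = table["correct_sound"] + table["correct_flawed"]
    if correct == 0:
        return None
    return table["correct_flawed"] / correct
```

Reporting the rate conditional on correct answers keeps it separate from overall accuracy, which is the distinction the research question depends on.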
Planned sections
- Research question and claim boundary
- Setup, model variants, data versions, and config hashes
- Eval suite or task design
- Measurements and failure modes
- Limitations, caveats, and next decision
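The config hashes mentioned in the list above could be produced by hashing a canonical serialization of each run configuration. This is a sketch under assumptions: `config_hash` is a hypothetical helper, the config is assumed to be a JSON-serializable dict, and SHA-256 is one reasonable choice rather than a stated requirement.

```python
import hashlib
import json


def config_hash(config: dict) -> str:
    """Return a SHA-256 hex digest of a JSON-serializable config.

    Keys are sorted and whitespace is fixed so that two dicts with the same
    contents hash identically regardless of key insertion order.
    """
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Recording this digest next to the model variant and data version makes it possible to check later whether two runs really used the same settings.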
Expected artifacts
- reasoning_evals module.
- Final-answer vs reasoning-path comparison.
- Process-signal diagnostic report.
Claim boundary
This report evaluates observable diagnostic signals. It does not rely on, or claim access to, private chain-of-thought.