Open Model Lab

Outcome vs Process Evaluation

When do final-answer scores hide flawed reasoning processes?

Status

Status: planned
Month/theme: January 2027: Reasoning Behavior + Process Evaluation

This page is a report scaffold. It does not contain model scores, charts, or completed run results.

Research question

When do final-answer scores hide flawed reasoning processes?

Planned setup

Create tasks with outcome-based and process-oriented rubrics.
Compare final-answer correctness with observable reasoning-path signals.
Record uncertainty and self-correction behavior.
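
As an illustration of the setup above, one possible record layout keeps the outcome rubric and the process rubric as separate fields so they can be compared later. This is a minimal sketch, not the planned reasoning_evals module: the names TaskResult and ProcessSignals, the exact-match outcome check, and the single steps_valid flag are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class ProcessSignals:
    """Observable reasoning-path signals (hypothetical field names)."""
    steps_valid: bool            # process rubric: visible steps hold up under review
    expressed_uncertainty: bool  # the output flagged doubt about its answer
    self_corrected: bool         # the output revised an earlier step


@dataclass
class TaskResult:
    """One graded task, scored by both rubrics (hypothetical layout)."""
    task_id: str
    final_answer: str
    reference_answer: str
    signals: ProcessSignals

    @property
    def outcome_pass(self) -> bool:
        # Outcome rubric: exact match after trimming whitespace (an assumption;
        # a real grader may need normalization or a judge).
        return self.final_answer.strip() == self.reference_answer.strip()

    @property
    def process_pass(self) -> bool:
        # Process rubric: here reduced to a single flag for illustration.
        return self.signals.steps_valid

    @property
    def hides_flawed_process(self) -> bool:
        # The case the research question targets: right answer, flawed path.
        return self.outcome_pass and not self.process_pass
```

Keeping uncertainty and self-correction as recorded fields, rather than folding them into a pass/fail score, leaves room to decide later how they should count.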

Planned measurements

Scores, for tasks where the grader can assign one.
Latency and cost, where the run infrastructure can measure them.
Output-quality notes and failure-mode labels.
Known caveats and reproducibility requirements.
Final-answer vs reasoning-path disagreements.
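
The final-answer vs reasoning-path disagreement measurement could be tallied along the following lines. This is a sketch under stated assumptions: the function name disagreement_summary, the four label names, and the choice to report the hidden-flaw rate as a share of outcome passes are illustrative, not a committed design.

```python
from collections import Counter


def disagreement_summary(pairs):
    """Tally (outcome_pass, process_pass) pairs into four cells.

    `pairs` is an iterable of (bool, bool) tuples. All names here are
    hypothetical and exist only for this example.
    """
    counts = Counter()
    for outcome_ok, process_ok in pairs:
        if outcome_ok and process_ok:
            label = "both_pass"
        elif outcome_ok:
            label = "answer_right_process_flawed"
        elif process_ok:
            label = "answer_wrong_process_sound"
        else:
            label = "both_fail"
        counts[label] += 1

    hidden = counts["answer_right_process_flawed"]
    outcome_passes = counts["both_pass"] + hidden
    return {
        "counts": dict(counts),
        "total": sum(counts.values()),
        # Share of correct final answers that sit on a flawed reasoning path.
        # None when there are no correct answers to take a share of.
        "hidden_flaw_rate": hidden / outcome_passes if outcome_passes else None,
    }
```

Reporting the rate against outcome passes, rather than against all tasks, answers the question directly: of the answers a final-answer score would count as correct, how many had a flawed path.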

Planned sections

Research question and claim boundary
Setup, model variants, data versions, and config hashes
Eval suite or task design
Measurements and failure modes
Limitations, caveats, and next decision
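
For the config hashes listed in the setup section, one common approach is to hash a canonical JSON serialization of the run config, so that two runs with the same settings produce the same identifier. The sketch below assumes the config is a JSON-serializable dict; the function name config_hash and the example keys are hypothetical.

```python
import hashlib
import json


def config_hash(config: dict) -> str:
    """Return a SHA-256 hex digest of a run config (hypothetical helper).

    Keys are sorted and whitespace is fixed so that key order in the
    source dict does not change the hash.
    """
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Example with made-up keys: both dicts describe the same config.
a = config_hash({"model_variant": "example-a", "data_version": "v0"})
b = config_hash({"data_version": "v0", "model_variant": "example-a"})
```

Here a and b are equal because only key order differs between the two dicts.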

Expected artifacts

reasoning_evals module.
Final answer vs reasoning path comparison.
Process-signal diagnostic report.

Claim boundary

This report evaluates observable diagnostic signals; it does not assume access to private chain-of-thought.

Related links

Reports index
Related month page
Runs