Open Model Lab

Outcome vs Process Evaluation

When do final-answer scores hide flawed reasoning processes?

Status

Status: Planned
Month/theme: January 2027: Reasoning Behavior + Process Evaluation

This page is a report scaffold. It does not contain model scores, charts, or completed run results.

Research question

When do final-answer scores hide flawed reasoning processes?

Planned setup

  • Create tasks with outcome-based and process-oriented rubrics.
  • Compare final-answer correctness with observable reasoning-path signals.
  • Record uncertainty and self-correction behavior.
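
As a rough illustration of the setup above, the sketch below pairs one outcome rubric (exact match on the final answer) with a list of process checks applied to the visible response text. Everything here is an assumption made for illustration: the names Task, ProcessCheck, Grade, and grade are hypothetical and are not taken from the planned reasoning_evals module, and the keyword-based checks are crude placeholders for whatever process rubric the report eventually adopts.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class ProcessCheck:
    """One observable check on the visible response text (hypothetical)."""
    name: str
    passed: Callable[[str], bool]


@dataclass
class Task:
    """A task carrying both an outcome rubric and process checks (hypothetical)."""
    prompt: str
    expected_answer: str
    process_checks: List[ProcessCheck] = field(default_factory=list)


@dataclass
class Grade:
    outcome_correct: bool
    process_passed: List[str]
    process_failed: List[str]


def grade(task: Task, response_text: str, final_answer: str) -> Grade:
    """Score the final answer and the visible reasoning path separately."""
    # Outcome rubric: exact match after trimming whitespace (an assumption).
    outcome = final_answer.strip() == task.expected_answer.strip()
    passed: List[str] = []
    failed: List[str] = []
    # Process rubric: run each check against the visible text only.
    for check in task.process_checks:
        (passed if check.passed(response_text) else failed).append(check.name)
    return Grade(outcome, passed, failed)


# Illustrative task and response; not drawn from any real eval run.
task = Task(
    prompt="What is 17 * 6?",
    expected_answer="102",
    process_checks=[
        ProcessCheck("shows_intermediate_step", lambda t: "17 * 6" in t),
        ProcessCheck("mentions_verification", lambda t: "check" in t.lower()),
    ],
)
g = grade(task, "17 * 6 = 102.", "102")
```

Keeping the two rubrics in separate fields means a response can be marked correct on outcome while still failing a process check, which is the case the research question targets.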

Planned measurements

  • Scores, for tasks where the grader can produce one.
  • Latency and cost where the run infrastructure can measure them.
  • Output-quality notes and failure-mode labels.
  • Known caveats and reproducibility requirements.
  • Final-answer vs reasoning-path disagreements.
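
One possible way to tabulate the final-answer vs reasoning-path disagreements listed above is a 2x2 count over (outcome correct, process sound) pairs. The sketch below is hypothetical: the function name disagreement_summary, the reduction of process signals to a single boolean, and the input values are all assumptions for illustration, and the example records are made-up placeholders rather than results.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def disagreement_summary(records: Iterable[Tuple[bool, bool]]) -> Dict[str, float]:
    """Summarize (outcome_correct, process_ok) pairs (hypothetical helper)."""
    counts: Counter = Counter()
    for outcome_correct, process_ok in records:
        counts[(outcome_correct, process_ok)] += 1

    total = sum(counts.values())
    # The two off-diagonal cells are the disagreements of interest.
    correct_flawed = counts[(True, False)]
    wrong_sound = counts[(False, True)]
    rate = (correct_flawed + wrong_sound) / total if total else 0.0
    return {
        "total": total,
        "correct_answer_flawed_process": correct_flawed,
        "wrong_answer_sound_process": wrong_sound,
        "disagreement_rate": rate,
    }


# Placeholder inputs for illustration only; not measured data.
example_records = [(True, True), (True, False), (False, False), (False, True)]
summary = disagreement_summary(example_records)
```

The correct_answer_flawed_process cell is the one a final-answer score alone would hide, so it may be worth reporting separately from the overall disagreement rate.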

Planned sections

  • Research question and claim boundary
  • Setup, model variants, data versions, and config hashes
  • Eval suite or task design
  • Measurements and failure modes
  • Limitations, caveats, and next decision

Expected artifacts

  • reasoning_evals module.
  • Final answer vs reasoning path comparison.
  • Process-signal diagnostic report.

Claim boundary

This report evaluates diagnostic signals that are observable in model outputs. It does not access or evaluate private chain-of-thought.