Open Model Lab

January 2027: Reasoning Behavior + Process Evaluation

Evaluate not only final answers but also where reasoning processes break down.

Gate status

Month: 2027-01
Status: planned
Report: Outcome vs Process Evaluation

Success criterion

The system can flag flawed reasoning that still reaches a correct answer, and can locate the step where reasoning breaks down when the answer is wrong.
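
To make the criterion concrete, here is a minimal sketch of how one response might be diagnosed. It assumes per-step validity labels are already available from some grader. The `Trace` record, `first_breakpoint`, and `diagnose` are hypothetical names used for illustration and are not part of any existing module.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Trace:
    """Hypothetical record: one model response with per-step validity labels."""
    step_ok: List[bool]   # True where a step was judged valid (labels assumed given)
    answer_correct: bool  # outcome grade of the final answer


def first_breakpoint(trace: Trace) -> Optional[int]:
    """Return the index of the first invalid step, or None if every step is valid."""
    for i, ok in enumerate(trace.step_ok):
        if not ok:
            return i
    return None


def diagnose(trace: Trace) -> str:
    """Combine the outcome grade and the breakpoint into one diagnostic label."""
    bp = first_breakpoint(trace)
    if trace.answer_correct and bp is None:
        return "sound process, correct answer"
    if trace.answer_correct:
        return f"flawed process despite correct answer (first bad step: {bp})"
    if bp is None:
        return "wrong answer, no breakpoint located"
    return f"wrong answer, breakpoint at step {bp}"
```

The two middle branches correspond to the two cases named in the criterion. The "no breakpoint located" branch is kept separate because it points to a gap in the step labels rather than in the model.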

Focus

  • Outcome-based grading vs process-based grading.
  • Math, logic, code reasoning, and planning tasks.
  • Self-correction.
  • Uncertainty signals.
  • Final-answer vs reasoning-path comparison.
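
The contrast between outcome-based and process-based grading can be sketched on a toy arithmetic task. This assumes each reasoning step is written as `<int> <op> <int> = <int>`, which is a simplification chosen for the example. The function names are hypothetical, and real math, code, or planning traces would need a much richer step checker.

```python
import operator
import re
from typing import List

# Assumed step format for this sketch only: "<int> <op> <int> = <int>"
_STEP = re.compile(r"^\s*(-?\d+)\s*([+\-*])\s*(-?\d+)\s*=\s*(-?\d+)\s*$")
_OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}


def step_is_valid(step: str) -> bool:
    """Check one arithmetic step in isolation; unparseable steps count as invalid."""
    m = _STEP.match(step)
    if m is None:
        return False
    a, op, b, c = m.groups()
    return _OPS[op](int(a), int(b)) == int(c)


def outcome_grade(final_answer: int, reference: int) -> bool:
    """Outcome-based grading: only the final answer is compared."""
    return final_answer == reference


def process_grade(steps: List[str]) -> List[bool]:
    """Process-based grading: every step receives its own verdict."""
    return [step_is_valid(s) for s in steps]


# Task: compute (3 + 4) * 2. Reference answer: 14.
# Both steps are wrong, but the two errors cancel in the final answer.
steps = ["3 + 4 = 8", "8 * 2 = 14"]
print(outcome_grade(14, 14))  # True
print(process_grade(steps))   # [False, False]
```

Here the outcome grade passes while both steps fail, which is the kind of case the final-answer vs reasoning-path comparison is meant to surface. Note that this checker validates each step locally and does not verify that one step's result feeds into the next.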

Expected outputs

  • reasoning_evals module.
  • Final-answer vs reasoning-path comparison.
  • Report: why outcome scores are not enough.

End-of-month decision

Do process signals add useful diagnostic value beyond final-answer grading?
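
One possible way to put a number on this question is to cross-tabulate outcome and process grades over a batch of responses and measure how often they disagree. The sketch below assumes each response has already been reduced to a pair `(answer_correct, process_sound)`. The function names are hypothetical, and the disagreement rate is only one candidate summary, not a settled metric.

```python
from collections import Counter
from typing import Iterable, List, Tuple

Result = Tuple[bool, bool]  # (answer_correct, process_sound)


def cross_tab(results: Iterable[Result]) -> Counter:
    """Count how many responses fall into each (answer_correct, process_sound) cell."""
    return Counter(results)


def disagreement_rate(results: Iterable[Result]) -> float:
    """Fraction of responses where the process grade differs from the outcome grade."""
    items: List[Result] = list(results)
    if not items:
        return 0.0
    disagree = sum(1 for answer_ok, process_ok in items if answer_ok != process_ok)
    return disagree / len(items)


# Made-up illustrative grades, not measured results.
example = [(True, True), (True, False), (False, False), (True, True)]
table = cross_tab(example)
print(table[(True, False)])        # correct answer, flawed process: 1
print(disagreement_rate(example))  # 0.25
```

A rate near zero would suggest process grades mostly restate the outcome grade. A higher rate shows only that the two signals differ, so judging whether the difference is useful would still require inspecting the disagreeing cases.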