Open Model Lab

January 2027: Reasoning Behavior + Process Evaluation

Evaluate not only final answers but also where reasoning processes break down.

Gate status

Month: 2027-01
Status: planned
Report: Outcome vs Process Evaluation

Success criterion

The system can surface flawed reasoning that still reaches a correct answer, or identify the breakpoint at which reasoning goes wrong and leads to an incorrect answer.
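
As a rough illustration of this criterion, the sketch below classifies a single reasoning trace. It assumes per-step validity labels are already available from some grader; the names Trace, first_breakpoint, and diagnose are hypothetical and are not part of any existing module.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Trace:
    # Hypothetical record: one validity flag per reasoning step,
    # plus whether the final answer matched the reference.
    step_valid: List[bool]
    answer_correct: bool


def first_breakpoint(trace: Trace) -> Optional[int]:
    """Return the index of the first invalid step, or None if all steps are valid."""
    for i, ok in enumerate(trace.step_valid):
        if not ok:
            return i
    return None


def diagnose(trace: Trace) -> str:
    """Combine the outcome signal with the process signal into one label."""
    bp = first_breakpoint(trace)
    if trace.answer_correct:
        return "sound" if bp is None else "flawed process, correct answer"
    if bp is None:
        return "wrong answer, no flagged step"
    return f"wrong answer, first flagged step {bp}"


if __name__ == "__main__":
    print(diagnose(Trace([True, False, True], answer_correct=True)))
    print(diagnose(Trace([True, True, False], answer_correct=False)))
```

The two cases named in the criterion map to the "flawed process, correct answer" label and the "first flagged step" label; how the per-step flags are produced is left open here.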

Focus

Outcome-based grading vs process-based grading.
Math, logic, code reasoning, and planning tasks.
Self-correction.
Uncertainty signals.
Final-answer vs reasoning-path comparison.
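
To make the contrast between outcome-based and process-based grading concrete, here is a minimal sketch for a toy arithmetic task. It assumes each reasoning step is written as "a op b = c" with integer operands; the step format and the function names are illustrative assumptions, not a planned interface.

```python
import operator
import re
from typing import List

# Assumed step format: "<int> <op> <int> = <int>", e.g. "2 + 3 = 5".
OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}
STEP = re.compile(r"^\s*(-?\d+)\s*([+\-*])\s*(-?\d+)\s*=\s*(-?\d+)\s*$")


def step_is_valid(step: str) -> bool:
    """A step is valid if it parses and its arithmetic actually holds."""
    m = STEP.match(step)
    if m is None:
        return False
    a, op, b, c = m.groups()
    return OPS[op](int(a), int(b)) == int(c)


def outcome_score(final_answer: str, reference: str) -> float:
    """Outcome-based grading: exact match on the final answer only."""
    return 1.0 if final_answer.strip() == reference.strip() else 0.0


def process_score(steps: List[str]) -> float:
    """Process-based grading: fraction of steps that are individually valid."""
    if not steps:
        return 0.0
    return sum(step_is_valid(s) for s in steps) / len(steps)


if __name__ == "__main__":
    # Task: (2 + 3) * 2. Two arithmetic errors cancel out to the right answer.
    steps = ["2 + 3 = 6", "6 * 2 = 10"]
    print(outcome_score("10", "10"), process_score(steps))
```

In this example the outcome grader awards full credit while the process grader awards none, which is the kind of gap the final-answer vs reasoning-path comparison is meant to expose. Real tasks in logic, code reasoning, or planning would need a different step verifier.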

Expected outputs

reasoning_evals module.
Final-answer vs reasoning-path comparison.
Report: why outcome scores are not enough.

End-of-month decision

Do process signals add useful diagnostic value beyond final-answer grading?
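
One possible way to frame this decision is to measure how often the two signals disagree across an evaluation set. The sketch below assumes each item has been reduced to a pair (answer_correct, process_sound); the function name and the output keys are hypothetical.

```python
from typing import Dict, Iterable, Tuple


def disagreement_summary(records: Iterable[Tuple[bool, bool]]) -> Dict[str, float]:
    """Summarize where outcome and process signals diverge.

    Each record is (answer_correct, process_sound). Both flags are assumed
    to come from upstream graders that are not shown here.
    """
    items = list(records)
    n = len(items)
    if n == 0:
        return {"outcome_accuracy": 0.0, "correct_but_flawed": 0.0, "wrong_but_sound": 0.0}
    correct = sum(1 for a, _ in items if a)
    correct_but_flawed = sum(1 for a, p in items if a and not p)
    wrong_but_sound = sum(1 for a, p in items if not a and p)
    return {
        "outcome_accuracy": correct / n,
        "correct_but_flawed": correct_but_flawed / n,
        "wrong_but_sound": wrong_but_sound / n,
    }


if __name__ == "__main__":
    sample = [(True, True), (True, False), (False, False), (False, True)]
    print(disagreement_summary(sample))
```

If the two disagreement rates are near zero, process signals add little beyond final-answer grading on that set; larger rates suggest they carry separate diagnostic information. Any threshold for "useful" would have to be chosen by the team.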
