Open Model Lab

August 2026: SFT Pipeline + Data Quality

Measure the behavioral difference between a base model and an instruction-tuned model.

Gate status

Month: 2026-08
Status: planned
Report: Base vs SFT Behavior Change

Success criterion

The work does not stop at producing a fine-tuned checkpoint: it must show measurable behavior changes relative to the base model, including regressions.

Focus

  • Build an SFT pipeline.
  • Define and clean instruction data.
  • Compare base vs SFT behavior using the same eval harness.
  • Measure eval changes per task: what improved and what regressed?
  • Create dataset and model cards.
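
As an illustration of the "define and clean instruction data" step, here is a minimal sketch of a cleaning pass. It assumes each record is a dict with "instruction" and "response" keys; that schema, the function names, and the min_response_chars parameter are hypothetical and not taken from the actual pipeline. It only drops incomplete records and exact duplicates after normalization; near-duplicate detection and quality filtering are out of scope here.

```python
import hashlib
import unicodedata


def normalize(text):
    """Apply NFKC normalization and collapse runs of whitespace."""
    return " ".join(unicodedata.normalize("NFKC", text).split())


def clean_instruction_data(records, min_response_chars=1):
    """Drop incomplete records and exact duplicates (after normalization).

    `records` is an iterable of dicts with the assumed keys
    "instruction" and "response". Returns (kept_records, stats).
    """
    seen = set()
    kept = []
    stats = {"input": 0, "empty": 0, "duplicate": 0, "kept": 0}
    for rec in records:
        stats["input"] += 1
        # Treat missing or None fields as empty strings.
        instruction = normalize(rec.get("instruction") or "")
        response = normalize(rec.get("response") or "")
        if not instruction or len(response) < min_response_chars:
            stats["empty"] += 1
            continue
        # Case-insensitive fingerprint of the (instruction, response) pair.
        fingerprint = hashlib.sha256(
            (instruction.lower() + "\x1f" + response.lower()).encode("utf-8")
        ).hexdigest()
        if fingerprint in seen:
            stats["duplicate"] += 1
            continue
        seen.add(fingerprint)
        kept.append({"instruction": instruction, "response": response})
        stats["kept"] += 1
    return kept, stats
```

The returned stats dict (input, empty, duplicate, kept counts) is the kind of figure a dataset card can report, so the cleaning step stays auditable.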

Expected outputs

  • training/sft module.
  • Base/SFT behavior comparison table.
  • Dataset card for instruction_v1.
  • Report: what behavior does SFT improve and what can it damage?
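
One possible shape for the Base/SFT behavior comparison table, as a sketch only: it assumes the eval harness yields one scalar score per task where higher is better, and the function names and the tolerance threshold are hypothetical. Tasks present in only one of the two runs are skipped, since a delta is meaningful only when both models were scored on the same task.

```python
def compare_scores(base, sft, tolerance=0.01):
    """Build comparison rows from two {task: score} dicts.

    Assumes higher scores are better. A change within +/- tolerance
    is labeled "unchanged" to avoid reading noise as signal.
    """
    rows = []
    for task in sorted(set(base) & set(sft)):
        delta = sft[task] - base[task]
        if delta > tolerance:
            verdict = "improved"
        elif delta < -tolerance:
            verdict = "regressed"
        else:
            verdict = "unchanged"
        rows.append((task, base[task], sft[task], delta, verdict))
    return rows


def format_table(rows):
    """Render comparison rows as a fixed-width plain-text table."""
    lines = [f"{'task':<20} {'base':>6} {'sft':>6} {'delta':>7}  verdict"]
    for task, base_score, sft_score, delta, verdict in rows:
        lines.append(
            f"{task:<20} {base_score:>6.3f} {sft_score:>6.3f} "
            f"{delta:>+7.3f}  {verdict}"
        )
    return "\n".join(lines)
```

Calling format_table(compare_scores(base_scores, sft_scores)) with the two score dicts from the same harness run gives a table in which regressions sit next to improvements, which is what the report needs to answer both halves of its question.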

End-of-month decision

Is the SFT checkpoint reliable enough to serve as the reference for preference optimization?