Open Model Lab

May 2027: Data Efficiency + Scaling Ladder

Measure whether better data mixtures produce better behavior with the same compute.

Gate status

Month: 2027-05
Status: planned
Report: Score per GPU-Hour

Success criterion

The effect of data quality on behavior and compute efficiency is quantified.

Focus

  • Data mixture.
  • Quality filtering.
  • Synthetic data.
  • Contamination control.
  • Small model ladder: 30M, 70M, 150M, optional 350M.
  • Score/GPU-hour.
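
For the contamination-control item, one common approach is to drop any training document that shares a long word n-gram with an evaluation set. The sketch below is illustrative only: the function names, the whitespace tokenization, and the default n of 8 are assumptions, not part of the planned data_scaling module.

```python
from typing import Iterable, Set, Tuple

NGram = Tuple[str, ...]


def ngrams(text: str, n: int) -> Set[NGram]:
    """Return the set of lowercase word n-grams in text (whitespace tokenization)."""
    tokens = text.lower().split()
    # Texts shorter than n tokens yield an empty set.
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def build_eval_index(eval_texts: Iterable[str], n: int = 8) -> Set[NGram]:
    """Collect every n-gram that appears in any evaluation example."""
    index: Set[NGram] = set()
    for text in eval_texts:
        index |= ngrams(text, n)
    return index


def is_contaminated(doc: str, eval_index: Set[NGram], n: int = 8) -> bool:
    """Flag a training document that shares at least one n-gram with the eval set."""
    return not ngrams(doc, n).isdisjoint(eval_index)


def filter_corpus(docs: Iterable[str], eval_index: Set[NGram], n: int = 8) -> list:
    """Keep only documents with no n-gram overlap against the eval index."""
    return [d for d in docs if not is_contaminated(d, eval_index, n)]
```

Exact n-gram matching misses paraphrased leaks, so a check like this is a lower bound on contamination rather than a guarantee of a clean split.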

Expected outputs

  • data_scaling module.
  • Data mixture comparison report.
  • Report: which data mixture produces the most behavior gain per GPU-hour?
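
The report question above can be made concrete as a small ranking computation. The following is a minimal sketch, assuming each run records a mixture name, a ladder size, an evaluation score, and the GPU-hours it consumed, and that one mixture is designated as the baseline. The Run type, the field names, and the choice to average gains across ladder sizes are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Run:
    mixture: str      # name of the data mixture (hypothetical label)
    model_size: str   # ladder rung, e.g. "30M"
    score: float      # behavior score on the evaluation suite
    gpu_hours: float  # GPU-hours spent on this run


def gain_per_gpu_hour(run: Run, baseline: Run) -> float:
    """Score gain over the baseline run, divided by this run's GPU-hours."""
    if run.gpu_hours <= 0:
        raise ValueError("gpu_hours must be positive")
    return (run.score - baseline.score) / run.gpu_hours


def rank_mixtures(runs: List[Run], baseline_mixture: str) -> List[Tuple[str, float]]:
    """Rank non-baseline mixtures by mean gain per GPU-hour across ladder sizes."""
    baselines: Dict[str, Run] = {
        r.model_size: r for r in runs if r.mixture == baseline_mixture
    }
    gains: Dict[str, List[float]] = defaultdict(list)
    for r in runs:
        if r.mixture == baseline_mixture:
            continue
        base = baselines.get(r.model_size)
        if base is None:
            continue  # no baseline at this size, so no comparison is possible
        gains[r.mixture].append(gain_per_gpu_hour(r, base))
    ranked = [(name, mean(values)) for name, values in gains.items()]
    return sorted(ranked, key=lambda item: item[1], reverse=True)
```

Comparing only runs at the same ladder size keeps model scale from being confused with mixture quality. Averaging across sizes is one possible aggregation; reporting the per-size gains separately may be more informative if the ranking changes along the ladder.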

End-of-month decision

Does data quality improve score/GPU-hour enough to justify the filtering pipeline?
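
One way to frame this decision is to charge the filtering pipeline's own cost against the filtered run before comparing score/GPU-hour. The sketch below assumes the pipeline cost can be expressed in GPU-hours (CPU-only filtering would need a conversion) and uses a hypothetical min_relative_gain threshold. The actual decision threshold is not specified here.

```python
def score_per_gpu_hour(score: float, train_gpu_hours: float,
                       pipeline_gpu_hours: float = 0.0) -> float:
    """Score divided by total GPU-hours, including any data-pipeline cost."""
    total = train_gpu_hours + pipeline_gpu_hours
    if total <= 0:
        raise ValueError("total GPU-hours must be positive")
    return score / total


def filtering_justified(baseline_score: float, filtered_score: float,
                        train_gpu_hours: float, filter_gpu_hours: float,
                        min_relative_gain: float = 0.05) -> bool:
    """True if filtered data improves score/GPU-hour by at least the threshold.

    min_relative_gain is an illustrative placeholder, not a project value.
    """
    base = score_per_gpu_hour(baseline_score, train_gpu_hours)
    if base <= 0:
        raise ValueError("baseline score per GPU-hour must be positive")
    filtered = score_per_gpu_hour(filtered_score, train_gpu_hours, filter_gpu_hours)
    return (filtered - base) / base >= min_relative_gain
```

If the filtered dataset is reused across many training runs, the pipeline cost could instead be amortized over those runs, which would make filtering look more favorable than this per-run accounting does.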