Open Model Lab
May 2027: Data Efficiency + Scaling Ladder
Measure whether better data mixtures produce better behavior with the same compute.
Gate status
- Month: 2027-05
- Status: planned
- Report: Score per GPU-Hour
Success criterion
The effect of data quality on behavior and compute efficiency is quantified.
Focus
- Data mixture.
- Quality filtering.
- Synthetic data.
- Contamination control.
- Small model ladder: 30M, 70M, 150M, and optionally 350M parameters.
- Score/GPU-hour.
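As an illustration of the contamination-control item above, here is a minimal sketch that drops training documents sharing any word-level n-gram with an evaluation set. The n-gram overlap heuristic, the function names (ngrams, is_contaminated), the choice of n, and the sample strings are all assumptions made for this example; the plan does not specify a method.

```python
def ngrams(text: str, n: int = 8) -> set:
    """Return the set of word-level n-grams of a lowercased, whitespace-split text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def is_contaminated(train_doc: str, eval_ngrams: set, n: int = 8) -> bool:
    """Flag a training document that shares at least one n-gram with the eval set."""
    return not ngrams(train_doc, n).isdisjoint(eval_ngrams)


# Hypothetical evaluation item and training documents, for illustration only.
eval_items = ["what is the capital of france the answer is paris"]
eval_ngrams = set().union(*(ngrams(item, n=4) for item in eval_items))

train_docs = [
    "trivia: what is the capital of france the answer is paris obviously",
    "a recipe for sourdough bread with a long cold ferment",
]

# Keep only documents with no n-gram overlap against the eval set.
clean_docs = [d for d in train_docs if not is_contaminated(d, eval_ngrams, n=4)]
print(clean_docs)
```

A shorter n catches more paraphrased leaks but also removes more innocent text, so the value would likely need tuning against the actual evaluation suite.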
Expected outputs
- data_scaling module.
- Data mixture comparison report.
- Report: which data mixture produces the most behavior gain per GPU-hour?
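One possible way to answer the report question is to compute, for each ladder rung, the score gain of a mixture over a baseline mixture divided by the GPU-hours spent, then average across rungs. The sketch below assumes this definition; the Run record, the mixture labels, and every number in it are placeholders invented for the example, not results.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    mixture: str      # hypothetical mixture label
    params_m: int     # ladder rung, assumed to be millions of parameters
    score: float      # behavior score, assumed higher is better
    gpu_hours: float  # training cost of this run


def gain_per_gpu_hour(run: Run, baseline: Run) -> float:
    """Score gain over the baseline at the same rung, per GPU-hour of this run."""
    if run.params_m != baseline.params_m:
        raise ValueError("compare runs at the same ladder rung")
    return (run.score - baseline.score) / run.gpu_hours


def rank_mixtures(runs, baseline_name="baseline"):
    """Rank non-baseline mixtures by mean gain per GPU-hour across rungs."""
    baselines = {r.params_m: r for r in runs if r.mixture == baseline_name}
    gains = {}
    for r in runs:
        if r.mixture == baseline_name:
            continue
        gains.setdefault(r.mixture, []).append(
            gain_per_gpu_hour(r, baselines[r.params_m])
        )
    mean_gain = {m: sum(g) / len(g) for m, g in gains.items()}
    return sorted(mean_gain.items(), key=lambda kv: kv[1], reverse=True)


# Placeholder numbers only; they do not come from any experiment.
runs = [
    Run("baseline", 30, 0.40, 10.0),
    Run("filtered", 30, 0.46, 10.0),
    Run("synthetic", 30, 0.43, 10.0),
    Run("baseline", 70, 0.50, 25.0),
    Run("filtered", 70, 0.58, 25.0),
    Run("synthetic", 70, 0.52, 25.0),
]

ranking = rank_mixtures(runs)
for mixture, gain in ranking:
    print(f"{mixture}: {gain:.4f} score gain per GPU-hour")
```

Averaging across rungs weights each model size equally; a report could instead weight by rung or show per-rung curves if the gain changes with scale.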
End-of-month decision
Does data quality improve score/GPU-hour enough to justify the filtering pipeline?
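The decision could be made mechanical by charging the filtering pipeline's cost to the filtered run and requiring a minimum relative improvement in score/GPU-hour. The sketch below is one such rule under stated assumptions: the pipeline cost is expressed in GPU-hours, the 5% threshold (min_relative_gain) is arbitrary, and both function names are hypothetical.

```python
def score_per_gpu_hour(score: float, train_gpu_hours: float,
                       pipeline_gpu_hours: float = 0.0) -> float:
    """Score divided by total GPU-hours (training plus any data pipeline cost)."""
    total = train_gpu_hours + pipeline_gpu_hours
    if total <= 0:
        raise ValueError("total GPU-hours must be positive")
    return score / total


def filtering_is_justified(baseline_score: float, filtered_score: float,
                           train_gpu_hours: float, pipeline_gpu_hours: float,
                           min_relative_gain: float = 0.05) -> bool:
    """True if the filtered run beats the baseline by the required relative margin.

    The threshold is an assumed value; the plan does not define one.
    """
    base = score_per_gpu_hour(baseline_score, train_gpu_hours)
    if base <= 0:
        raise ValueError("baseline score per GPU-hour must be positive")
    filt = score_per_gpu_hour(filtered_score, train_gpu_hours, pipeline_gpu_hours)
    return (filt - base) / base >= min_relative_gain


# Placeholder inputs: same training budget, different pipeline costs.
print(filtering_is_justified(0.50, 0.58, 25.0, pipeline_gpu_hours=2.0))
print(filtering_is_justified(0.50, 0.58, 25.0, pipeline_gpu_hours=5.0))
```

Because the pipeline cost enters the denominator, the same score improvement can pass or fail depending on how expensive filtering is, which is the trade-off the end-of-month question asks about.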