GCTX passes its first data-readiness gate
GCTX has passed its first strict data-readiness gate.
That is a small sentence with a lot packed inside it. It does not mean there is a released model yet. It does not mean the dataset is public. It does not mean GCTX is already a useful commit-message model. The claim is narrower and more important for this stage: the data pipeline has produced enough reviewed, split-aware, source-attributed private supervised examples to justify the first proof language-model run.
The current proof artifact is gctx1-strict.v0.
The reason this took longer than “just scrape Git logs” is that historical commit subjects are not clean training labels. Some are excellent. Many are not: merge commits, dependency bumps, vague subjects, issue-number-only messages, release bookkeeping, project-specific house style, or changes that are too broad for a narrow Conventional Commit model.
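To make those failure categories concrete, here is a minimal sketch of a heuristic pre-filter that flags obviously unusable subjects. It is not the GCTX filter: the function name `flag_noisy_subject`, the rule names, and every regular expression are assumptions chosen for illustration. House style and overly broad changes cannot be detected from the subject line alone, so they are left out.

```python
import re
from typing import Optional

# Hypothetical rules: each pairs a reason code with a pattern.
# These patterns are illustrative assumptions, not GCTX's real filters.
_RULES = [
    ("merge", re.compile(r"^Merge (?:branch|pull request|remote-tracking branch|tag)\b")),
    ("dependency_bump", re.compile(r"^(?:(?:chore|build)\(deps[^)]*\)|bump\b)", re.IGNORECASE)),
    ("issue_only", re.compile(r"^(?:fixes|closes|resolves)?\s*#\d+\.?$", re.IGNORECASE)),
    ("release", re.compile(r"^(?:release\s+)?v?\d+\.\d+(?:\.\d+)?$", re.IGNORECASE)),
    ("vague", re.compile(r"^(?:fix|fixes|update|updates|wip|misc|changes|cleanup)\.?$", re.IGNORECASE)),
]


def flag_noisy_subject(subject: str) -> Optional[str]:
    """Return a reason code if the subject looks unusable as a label, else None."""
    stripped = subject.strip()
    if not stripped:
        return "empty"
    for reason, pattern in _RULES:
        if pattern.search(stripped):
            return reason
    return None


if __name__ == "__main__":
    for example in [
        "Merge pull request #12 from someone/branch",
        "Bump lodash from 4.17.20 to 4.17.21",
        "#1234",
        "v1.2.3",
        "update",
        "fix: handle empty input in tokenizer",
    ]:
        print(f"{example!r} -> {flag_noisy_subject(example)}")
```

A filter like this only catches the easy cases. A subject that passes every rule can still describe the diff badly, which is why the label itself gets regenerated and reviewed rather than trusted.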
So GCTX uses Git history as the source of real code changes, not as the final label source. The label is regenerated from the diff, validated, reviewed, and only then promoted.
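As a rough illustration of the "validated" step, the sketch below checks a regenerated subject against the basic Conventional Commits shape, `type(scope): description`. It is an assumption-laden example, not the GCTX validator: `validate_label`, the `ALLOWED_TYPES` set, the 72-character limit, and the trailing-period rule are all hypothetical choices.

```python
import re
from typing import List

# Assumed policy values; the real GCTX rules are not described in this post.
ALLOWED_TYPES = {"feat", "fix", "docs", "refactor", "test", "perf", "build", "ci", "chore"}
MAX_SUBJECT_LEN = 72

# type, optional (scope), optional "!" for breaking changes, then ": description".
_SUBJECT_RE = re.compile(
    r"(?P<type>[a-z]+)(?:\((?P<scope>[^()\s]+)\))?(?P<breaking>!)?: (?P<desc>\S.*)"
)


def validate_label(subject: str) -> List[str]:
    """Return a list of problems; an empty list means the label passes."""
    match = _SUBJECT_RE.fullmatch(subject)
    if match is None:
        # Also rejects multi-line input, because "." does not match newlines.
        return ["not in 'type(scope): description' form"]

    problems = []
    if match.group("type") not in ALLOWED_TYPES:
        problems.append(f"type {match.group('type')!r} is not allowed")
    if len(subject) > MAX_SUBJECT_LEN:
        problems.append(f"subject longer than {MAX_SUBJECT_LEN} characters")
    if match.group("desc").endswith("."):
        problems.append("description ends with a period")
    return problems


if __name__ == "__main__":
    print(validate_label("fix(parser): handle empty input"))
    print(validate_label("Fixed stuff"))
```

Format validation of this kind says nothing about whether the label matches the diff. That judgment is what the review step before promotion is for.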
The current artifact lineage looks like this:
| Layer | Count | Notes |
|---|---|---|
| Source diffs | 33,152 | Real public Git changes from the GCTX-1 source plan and expansion passes. |
| Teacher inputs | 14,104 | Source diffs that passed review and became prompt payloads. |
| Generated labels | 14,090 | Local teacher outputs; 14 missing labels are recorded and not promoted. |
| Reviewed training records | 11,926 | Accepted or edited labels promoted into gctx1-strict.v0. |
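The counts in the table imply a retention rate at each step. The short sketch below does that arithmetic using only the numbers from the table; the `funnel` helper is a name introduced here for illustration.

```python
from typing import List, Tuple

# Counts copied from the lineage table above.
STAGES = [
    ("Source diffs", 33_152),
    ("Teacher inputs", 14_104),
    ("Generated labels", 14_090),
    ("Reviewed training records", 11_926),
]


def funnel(stages: List[Tuple[str, int]]) -> List[Tuple[str, int, float, float]]:
    """For each stage, return (name, count, share of previous stage, share of first stage)."""
    first = stages[0][1]
    previous = first
    rows = []
    for name, count in stages:
        rows.append((name, count, count / previous, count / first))
        previous = count
    return rows


if __name__ == "__main__":
    for name, count, step, overall in funnel(STAGES):
        print(f"{name:<28}{count:>8,}  step {step:6.1%}  overall {overall:6.1%}")
    # The gap between teacher inputs and generated labels is the missing-label count.
    print("missing labels:", STAGES[1][1] - STAGES[2][1])
```

Read this way, about 42.5% of source diffs became teacher inputs, about 99.9% of those received a generated label, and about 84.6% of generated labels survived review. Roughly 36% of the original source diffs end up as reviewed training records.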
The proof-readiness report is green. The smoke checks are also green, but they need to be read for what they are: they validate the artifact path, not final model quality.
The important boundary is this: GCTX is now ready for a proof language-model experiment, not a public model-quality announcement.
The next step is to train a 60M-100M proof model from the gctx1-strict.v0 DEV records and evaluate it on the locked REPORT records before making any stronger claim. If that model cannot beat the baselines in a meaningful way, the right answer is not to wrap it in a CLI and pretend. The right answer is to improve the data, training setup, or model size, then measure again.
That is the shape I want for GCTX: small, useful, inspectable, and honest about what has actually been proven.
Full current status lives on the GCTX project page.