GCTX passes its first data-readiness gate

GCTX has passed its first strict data-readiness gate.

That is a small sentence with a lot packed inside it. It does not mean there is a released model yet. It does not mean the dataset is public. It does not mean GCTX is already a useful commit-message model. The claim is narrower and more important for this stage: the data pipeline has produced enough reviewed, split-aware, source-attributed private supervised examples to justify the first proof language-model run.

The current proof artifact is gctx1-strict.v0.

| Item | Current | Status | Meaning |
| --- | --- | --- | --- |
| Reviewed supervised records | 11,926 | Done | Enough reviewed examples exist for the first proof run. |
| DEV records | 10,299 | Done | The proof trainer has crossed the 10,000-record planning target. |
| REPORT records | 1,627 | Done | The first visible evaluation set is large enough to be useful. |
| HELD_OUT in training | 0 | Done | Reserved records stay out of this training artifact. |
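
The table's numbers are internally consistent: 10,299 DEV plus 1,627 REPORT records account for all 11,926 reviewed records. A gate of this kind can be sketched in a few lines. This is only an illustration, not the actual GCTX gate: the record shape (a dict with a `split` field) and the function names `summarize_splits` and `is_ready` are assumptions made for the example.

```python
from collections import Counter

def summarize_splits(records):
    """Count records per split. The 'split' field name is an assumed record shape."""
    counts = Counter(record["split"] for record in records)
    return {
        "total": sum(counts.values()),
        "dev": counts["DEV"],
        "report": counts["REPORT"],
        "held_out": counts["HELD_OUT"],
    }

def is_ready(summary, dev_target=10_000):
    """Hypothetical gate: no reserved records, DEV target met, no stray splits."""
    return (
        summary["held_out"] == 0
        and summary["dev"] >= dev_target
        and summary["dev"] + summary["report"] == summary["total"]
    )

# Synthetic records that reproduce the published counts.
records = [{"split": "DEV"}] * 10_299 + [{"split": "REPORT"}] * 1_627
summary = summarize_splits(records)
print(summary, is_ready(summary))
```

The useful property is that a single reserved record in the artifact turns the gate red, regardless of how large the DEV split is.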

The reason this took longer than “just scrape Git logs” is that historical commit subjects are not clean training labels. Some are excellent. Many are not: merge commits, dependency bumps, vague subjects, issue-number-only messages, release bookkeeping, project-specific house style, or changes that are too broad for a narrow Conventional Commit model.
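
To make the noise categories concrete, here is a minimal sketch of subject-line heuristics that flag some of them. The rule names and regular expressions are illustrative assumptions, not the filters GCTX actually uses. Subject-only rules also cannot detect project-specific house style or changes that are too broad; those need the diff and a reviewer.

```python
import re

# Hypothetical rejection rules, roughly one per noise category in the text.
NOISE_RULES = [
    ("merge", re.compile(r"^Merge (branch|pull request|remote-tracking branch|tag)\b", re.I)),
    ("dependency_bump", re.compile(r"^(build\(deps[^)]*\): )?(bump|update dependency)\b", re.I)),
    ("issue_only", re.compile(r"^(fix(es)?\s+)?#\d+\.?$", re.I)),
    ("release", re.compile(r"^(release|version)?\s*v?\d+\.\d+(\.\d+)?\b", re.I)),
    ("vague", re.compile(r"^(wip|fix|fixes|update|updates|changes|misc|cleanup|tmp|stuff)\.?$", re.I)),
]

def noise_reason(subject):
    """Return the name of the first matching noise rule, or None if none match."""
    subject = subject.strip()
    for reason, pattern in NOISE_RULES:
        if pattern.search(subject):
            return reason
    return None

for subject in [
    "Merge pull request #12 from foo/bar",
    "Bump lodash from 4.17.20 to 4.17.21",
    "#482",
    "v1.4.0",
    "fix",
    "fix(parser): handle empty scope in header",
]:
    print(f"{subject!r} -> {noise_reason(subject)}")
```

A filter like this only decides which historical subjects are unusable as labels. It says nothing about whether the underlying diff is a good training input.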

So GCTX uses Git history as the source of real code changes, not as the final label source. The label is regenerated from the diff, validated, reviewed, and only then promoted.
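
The "validated" step can be pictured as a structural check on the regenerated subject line. The sketch below parses a Conventional Commit header of the form `type(scope)!: description`. It is an assumption-laden illustration: the allowed type list is a common convention rather than GCTX's own list, and `parse_header` is a hypothetical name, not the project's parser.

```python
import re

# Assumption: a commonly used type list; the real GCTX type set may differ.
ALLOWED_TYPES = (
    "feat", "fix", "docs", "style", "refactor",
    "perf", "test", "build", "ci", "chore", "revert",
)

HEADER = re.compile(
    r"^(?P<type>[a-z]+)"               # lowercase type, e.g. "fix"
    r"(?:\((?P<scope>[^()\r\n]+)\))?"  # optional scope in parentheses
    r"(?P<breaking>!)?"                # optional breaking-change marker
    r": (?P<description>\S.*)$"        # colon, one space, non-empty description
)

def parse_header(subject):
    """Return the parsed header fields, or None if the subject is not valid."""
    match = HEADER.match(subject)
    if match is None or match["type"] not in ALLOWED_TYPES:
        return None
    return match.groupdict()

print(parse_header("feat(api)!: add pagination"))
print(parse_header("Update stuff"))
```

A check like this only confirms that a label is well-formed. Whether the label describes the diff accurately is what the review step is for.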

The current artifact lineage looks like this:

| Layer | Count | Notes |
| --- | --- | --- |
| Source diffs | 33,152 | Real public Git changes from the GCTX-1 source plan and expansion passes. |
| Teacher inputs | 14,104 | Source diffs that passed review and became prompt payloads. |
| Generated labels | 14,090 | Local teacher outputs; 14 missing labels are recorded and not promoted. |
| Reviewed training records | 11,926 | Accepted or edited labels promoted into gctx1-strict.v0. |

The proof-readiness report is green. The smoke checks are also green, but they should be read correctly: they validate the artifact path, not the final model quality.

| Check | Result | Status | Interpretation |
| --- | --- | --- | --- |
| Reviewed target format | 11,926/11,926 valid | Done | The promoted targets satisfy the Conventional Commit parser. |
| REPORT target format | 1,627/1,627 valid | Done | The locked visible eval split is structurally clean. |
| Path-type smoke | 932/1,627 type matches | Done | A simple dependency-free prototype can learn useful coarse signal. |
| Tiny softmax smoke | loss 1.061279 -> 0.820147; 988/1,627 type matches | Done | The training/eval path is alive, but this is not the GCTX language model. |
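
For a sense of what a path-based smoke check can look like, here is a minimal dependency-free sketch: it predicts the commit type by majority vote over coarse path features and reports the fraction of type matches. The feature choice (top-level directory and file extension) and all function names are assumptions for illustration. The source does not describe how the actual GCTX prototype works.

```python
from collections import Counter, defaultdict

def path_features(paths):
    """Coarse features from changed paths: top-level directory and file extension."""
    features = set()
    for path in paths:
        parts = path.split("/")
        if len(parts) > 1:
            features.add("dir:" + parts[0])
        name = parts[-1]
        if "." in name:
            features.add("ext:" + name.rsplit(".", 1)[1])
    return features

def fit(examples):
    """Count commit types per feature, plus overall counts as a fallback."""
    by_feature = defaultdict(Counter)
    overall = Counter()
    for paths, commit_type in examples:
        overall[commit_type] += 1
        for feature in path_features(paths):
            by_feature[feature][commit_type] += 1
    return by_feature, overall

def predict(model, paths):
    """Sum the votes of every known feature; fall back to the overall counts."""
    by_feature, overall = model
    votes = Counter()
    for feature in path_features(paths):
        votes.update(by_feature.get(feature, {}))
    ranked = votes or overall
    return ranked.most_common(1)[0][0]

def type_match_rate(model, examples):
    """Fraction of examples whose predicted type equals the labeled type."""
    hits = sum(predict(model, paths) == commit_type for paths, commit_type in examples)
    return hits / len(examples)

# Tiny synthetic data, invented for the example.
train = [
    (["docs/intro.md"], "docs"),
    (["docs/api.md", "README.md"], "docs"),
    (["src/parser.py"], "fix"),
    (["src/lexer.py", "src/parser.py"], "fix"),
    (["src/cli.py"], "feat"),
    (["tests/test_parser.py"], "test"),
]
model = fit(train)
held_back = [(["docs/guide.md"], "docs"), (["src/new.py"], "feat")]
print(type_match_rate(model, held_back))
```

For scale, the table's 932/1,627 and 988/1,627 results correspond to roughly 57.3% and 60.7% type matches. A baseline this crude is useful precisely because any real model has to beat it.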

The important boundary is this: GCTX is now ready for a proof language-model experiment, not a public model-quality announcement.

The next step is to train a 60M to 100M parameter proof model from the gctx1-strict.v0 DEV records and evaluate it on the locked REPORT split before making any stronger claim. If that model cannot beat the baselines in a meaningful way, the right answer is not to wrap it in a CLI and pretend. The right answer is to improve the data, training setup, or model size, then measure again.

That is the shape I want for GCTX: small, useful, inspectable, and honest about what has actually been proven.

Full current status lives on the GCTX project page.