Starting GCTX: a small model for Git diffs
I published the project page for GCTX, a small language model built from scratch to understand Git diffs and write Conventional Commit messages.
This note is not about a product shell. The core project is the model: a narrow code-context model that should read a Git diff, infer the intent of the change, and write a grounded Conventional Commit message. Not a general coding assistant. Not a chat model. One small model for one small but useful job.
The important distinction is that no model claim has been made yet. GCTX is still in the data-production stage. The current work is about whether the first proof-model batch deserves to exist at all.
The planned ladder looks like this:
The first real decision is therefore not “how big should the model be?” It is whether GCTX-1 has enough clean, reviewed, split-aware data to make a 60M-100M proof model worth training.
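One way to read "split-aware" is that every record from a given repository lands in the same split, so near-duplicate changes cannot leak from DEV into HELD_OUT. The sketch below shows that idea only; the repository-level rule, the `assign_split` name, and the 10% held-out fraction are assumptions for illustration, not the project's documented method.

```python
import hashlib


def assign_split(repo: str, held_out_fraction: float = 0.1) -> str:
    """Deterministically map a repository name to DEV or HELD_OUT.

    Hypothetical helper: hashing the repository name keeps all of a
    repository's records in one split and gives the same answer on every run.
    """
    if not 0.0 <= held_out_fraction <= 1.0:
        raise ValueError("held_out_fraction must be between 0 and 1")
    digest = hashlib.sha256(repo.encode("utf-8")).digest()
    # The first 4 bytes give an integer in [0, 2**32); scale it into [0, 1).
    bucket = int.from_bytes(digest[:4], "big") / 2**32
    return "HELD_OUT" if bucket < held_out_fraction else "DEV"
```

Because the split depends only on the repository name, adding more repositories to the source manifest later would not move existing records between splits.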
The current GCTX-1 source artifact has 17,511 source-diff records from 37 permissively licensed public repositories. After source review, 7,989 records were accepted for teacher labeling and 9,522 were rejected before generation. HELD_OUT remains reserved.
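As a quick sanity check, the counts above can be reconciled in a few lines. The figures are the ones quoted in this note; the constant and function names are made up for the sketch.

```python
# Counts quoted in the note for the GCTX-1 source artifact.
TOTAL_RECORDS = 17_511
ACCEPTED_FOR_LABELING = 7_989
REJECTED_BEFORE_GENERATION = 9_522


def acceptance_rate(accepted: int, total: int) -> float:
    """Fraction of source-diff records that passed source review."""
    if total <= 0:
        raise ValueError("total must be positive")
    return accepted / total


# Accepted and rejected records should account for every source record.
assert ACCEPTED_FOR_LABELING + REJECTED_BEFORE_GENERATION == TOTAL_RECORDS

rate = acceptance_rate(ACCEPTED_FOR_LABELING, TOTAL_RECORDS)
print(f"{rate:.1%}")  # prints 45.6%
```

So a little under half of the source diffs survive source review, before any generated label has been reviewed.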
The gate status is:
That table is the current caveat. The batch is strong enough to continue teacher generation and review, but the accepted DEV count is still below the first 10,000-record target. After generated-label review, the decision is either to treat this as a smaller proof run or to expand the source manifest before making the first real language-model training claim.
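The gate logic itself is simple enough to sketch. The 10,000-record target comes from this note; `GateResult` and `check_dev_gate` are hypothetical names. The example call uses 7,989 as an upper bound, on the assumption that accepted DEV records are a subset of the 7,989 records accepted for teacher labeling.

```python
from dataclasses import dataclass
from typing import Tuple

DEV_TARGET = 10_000  # first accepted-DEV target named in the note


@dataclass(frozen=True)
class GateResult:
    meets_target: bool
    shortfall: int
    options: Tuple[str, ...]


def check_dev_gate(accepted_dev: int, target: int = DEV_TARGET) -> GateResult:
    """Compare the accepted DEV count with the target and list the next steps."""
    shortfall = max(0, target - accepted_dev)
    if shortfall == 0:
        return GateResult(True, 0, ("proceed with the planned proof run",))
    return GateResult(
        False,
        shortfall,
        ("treat this as a smaller proof run", "expand the source manifest"),
    )


# Upper bound only: the real accepted DEV count cannot exceed 7,989
# under the subset assumption stated above.
result = check_dev_gate(7_989)
```

Even at that upper bound the batch is at least 2,011 records short, which is why both options stay open until generated-label review is finished.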
The work immediately in front of the model is:
accept, edit, or reject; only accepted/edited records can become supervised data.
This is slower than scraping Git logs and training on commit subjects directly. But historical commit messages are not automatically good labels. Some are vague, some are project-specific, some are release bookkeeping, and many are not Conventional Commit messages at all. For GCTX, public Git history is the source of real code changes, not the final truth.
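Both points can be made concrete with a minimal sketch. The first function checks whether a commit subject has the Conventional Commit header shape; the type list is the common Angular-style set, which is an assumption here, since the specification itself only fixes `feat` and `fix`. The second applies the accept/edit/reject rule. The field names (`diff_id`, `decision`, `generated`, `edited`) are hypothetical and not the project's schema.

```python
import re
from typing import Dict, Iterable, Iterator, Tuple

# Header shape: type, optional (scope), optional "!", then ": " and a description.
CONVENTIONAL_HEADER = re.compile(
    r"(build|chore|ci|docs|feat|fix|perf|refactor|revert|style|test)"
    r"(\([^()\r\n]+\))?"
    r"!?"
    r": \S.*"
)


def is_conventional_header(subject: str) -> bool:
    """True if a one-line commit subject matches the header pattern above."""
    return CONVENTIONAL_HEADER.fullmatch(subject) is not None


def supervised_labels(reviews: Iterable[Dict[str, str]]) -> Iterator[Tuple[str, str]]:
    """Yield (diff_id, label) only for accepted or edited review records."""
    for review in reviews:
        decision = review["decision"]
        if decision == "accept":
            yield review["diff_id"], review["generated"]
        elif decision == "edit":
            # An edited record contributes the reviewer's corrected label.
            yield review["diff_id"], review["edited"]
        elif decision == "reject":
            continue
        else:
            raise ValueError(f"unknown review decision: {decision!r}")
```

A historical subject such as "Bump version to 1.2.3" fails the header check, which is the kind of release bookkeeping that makes raw Git logs a poor source of labels.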
So the current milestone is not a model release. It is stricter than that: prove that GCTX-1 has a clean enough data path to deserve the first model-quality experiment.