Starting GCTX: a small model for Git diffs

I published the project page for GCTX, a from-scratch small language model for understanding Git diffs and writing Conventional Commit messages.

This note is not about a product shell. The core project is the model: a narrow code-context model that should read a Git diff, infer the intent of the change, and write a grounded Conventional Commit message. Not a general coding assistant. Not a chat model. One small model for one small but useful job.

The important distinction is that no model claim has been made yet. GCTX is still in the data-production stage. The current work is about whether the first proof-model batch deserves to exist at all.

The planned ladder looks like this:

| Step | Size | Status | Purpose |
| --- | --- | --- | --- |
| GCTX-0 | rules/templates | Done | Deterministic baseline, parser, scorer, and pipeline checks. |
| GCTX-1 | 60M-100M | Waiting on data gate | First proof model. It must beat deterministic and teacher baselines on REPORT before scaling is justified. |
| GCTX-2 | 150M-300M | Planned | First serious public-model candidate, only with a larger reviewed dataset and signed model/data/eval cards. |
| GCTX-3 | ~500M | Open | Only useful if the task grows beyond commit messages into broader repository-operator behavior. |
| GCTX-4 | ~1B | Not planned | Too large for the first question; only makes sense if later evidence justifies it. |

The first real decision is therefore not “how big should the model be?” It is whether GCTX-1 has enough clean, reviewed, split-aware data to make a 60M-100M proof model worth training.

The current GCTX-1 source artifact has 17,511 source-diff records from 37 permissively licensed public repositories. After source review, 7,989 records were accepted for teacher labeling and 9,522 were rejected before generation. HELD_OUT remains reserved.
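The counts in the paragraph above can be sanity-checked with a few lines of arithmetic. This is a minimal sketch: the numbers are the ones quoted in this note, while the constant and function names are illustrative and not taken from the GCTX codebase.

```python
# Counts quoted in this note; the names below are hypothetical.
SOURCE_RECORDS = 17_511   # source-diff records extracted
ACCEPTED = 7_989          # accepted for teacher labeling
REJECTED = 9_522          # rejected before generation


def acceptance_rate(accepted: int, total: int) -> float:
    """Return the fraction of source records that passed source review."""
    if total <= 0:
        raise ValueError("total must be positive")
    return accepted / total


# Every source record should be either accepted or rejected.
assert ACCEPTED + REJECTED == SOURCE_RECORDS

rate = acceptance_rate(ACCEPTED, SOURCE_RECORDS)
print(f"accepted after source review: {rate:.1%}")
```

Under these numbers, a little under half of the extracted diffs survive source review, which is why the extracted count alone says little about how much supervised data the batch can yield.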

The gate status is:

| Gate | Target | Current | Status |
| --- | --- | --- | --- |
| Source diffs extracted | planned batch | 17,511 from 37 repositories | Done |
| DEV accepted records | 10,000 | 6,704 accepted before teacher generation | Waiting |
| REPORT accepted records | 1,000 | 1,285 accepted before teacher generation | Done |
| HELD_OUT reserved records | 1,000 | 1,513 extracted and reserved | Done |
| DEV repositories | 25 | 37 represented in DEV | Done |
| REPORT repositories | 5 | 24 represented in REPORT | Done |
| HELD_OUT repositories | 5 | 21 represented in HELD_OUT | Done |
| Ecosystems | at least 2 | encoded in the source plan | Done |
| Maximum single DEV repo share | 25% | enforced by split-readiness checks | Done |

That table is the current caveat. The batch is strong enough to continue teacher generation and review, but the accepted DEV count (6,704) is still below the first 10,000-record target. After generated-label review, the decision is either to treat this as a smaller proof run or to expand the source manifest before making the first real language-model training claim.
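The numeric gates can be expressed as a small check. The sketch below is one way to do it, not the project's actual split-readiness code: the `Gate` class, the `max_repo_share` helper, and the example repository names are all hypothetical. The targets and current values are copied from the gate table above.

```python
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Gate:
    """One numeric gate: passes when current >= target (hypothetical model)."""
    name: str
    target: int
    current: int

    @property
    def passed(self) -> bool:
        return self.current >= self.target

    @property
    def shortfall(self) -> int:
        return max(0, self.target - self.current)


# Targets and current values as listed in the gate table.
GATES = [
    Gate("DEV accepted records", 10_000, 6_704),
    Gate("REPORT accepted records", 1_000, 1_285),
    Gate("HELD_OUT reserved records", 1_000, 1_513),
    Gate("DEV repositories", 25, 37),
    Gate("REPORT repositories", 5, 24),
    Gate("HELD_OUT repositories", 5, 21),
]


def max_repo_share(records_per_repo: Mapping[str, int]) -> float:
    """Largest fraction of a split contributed by any single repository."""
    total = sum(records_per_repo.values())
    if total <= 0:
        raise ValueError("split has no records")
    return max(records_per_repo.values()) / total


waiting = [g for g in GATES if not g.passed]
for g in waiting:
    print(f"{g.name}: {g.current:,} of {g.target:,} (short by {g.shortfall:,})")

# Made-up per-repo counts, only to show how the 25% cap would be checked.
example_dev = {"repo-a": 200, "repo-b": 300, "repo-c": 500}
print(f"max single-repo share: {max_repo_share(example_dev):.0%}")
```

With the table's numbers, only the DEV accepted-records gate fails, short by 3,296 records. The made-up example split would also fail a 25% share cap, since one repository contributes half of its records.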

The work immediately in front of the model is:

| Milestone | Status | What changes when it is done |
| --- | --- | --- |
| Finish local teacher-label generation | Running | Every accepted source diff has a generated Conventional Commit candidate. |
| Review generated labels | Waiting | Labels become accept, edit, or reject; only accepted/edited records can become supervised data. |
| Promote a supervised artifact | Waiting | A data-card/output-use decision says what the reviewed artifact may be used for. |
| Decide GCTX-1 proof run vs. source expansion | Waiting | The project either trains a smaller proof model or expands the source manifest first. |
| Train GCTX-1 | Waiting | The first 60M-100M language-model result can be evaluated against REPORT. |
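The review rule in the milestones above (labels become accept, edit, or reject, and only accepted or edited records can become supervised data) could be sketched as follows. The record fields and function names are assumptions made for illustration; the note does not describe the real review schema.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Decision(Enum):
    ACCEPT = "accept"
    EDIT = "edit"
    REJECT = "reject"


@dataclass(frozen=True)
class ReviewedLabel:
    """Hypothetical review record for one teacher-generated label."""
    record_id: str
    generated: str                 # teacher-generated commit message
    decision: Decision
    edited: Optional[str] = None   # reviewer's replacement, for EDIT only


def supervised_label(review: ReviewedLabel) -> Optional[str]:
    """Return the label to train on, or None if the record is excluded."""
    if review.decision is Decision.ACCEPT:
        return review.generated
    if review.decision is Decision.EDIT:
        if not review.edited:
            raise ValueError(f"{review.record_id}: EDIT without edited text")
        return review.edited
    return None  # REJECT: never becomes supervised data


reviews = [
    ReviewedLabel("r1", "fix(parser): handle empty hunks", Decision.ACCEPT),
    ReviewedLabel("r2", "feat: update", Decision.EDIT,
                  edited="feat(scorer): add scope-aware scoring"),
    ReviewedLabel("r3", "chore: misc", Decision.REJECT),
]

supervised = {}
for r in reviews:
    label = supervised_label(r)
    if label is not None:
        supervised[r.record_id] = label
print(supervised)
```

Treating an edit without replacement text as an error, rather than silently falling back to the generated label, keeps unreviewed teacher output from leaking into the supervised set.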

This is slower than scraping Git logs and training on commit subjects directly. But historical commit messages are not automatically good labels. Some are vague, some are project-specific, some are release bookkeeping, and many are not Conventional Commit messages at all. For GCTX, public Git history is the source of real code changes, not the final truth.
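To make the "not Conventional Commit messages at all" point concrete, here is a simplified check of a commit subject line against the `type(scope)!: description` shape. It is a sketch under stated assumptions: it looks at the subject line only, it is stricter than the specification about letter case, and the list of allowed types is an assumption based on commonly used ones, not GCTX's actual parser or scorer rules.

```python
import re
from typing import Dict, Optional

# Assumed type list (commonly used types); not GCTX's actual rule set.
ASSUMED_TYPES = (
    "build", "chore", "ci", "docs", "feat", "fix",
    "perf", "refactor", "revert", "style", "test",
)

# type, optional (scope), optional "!", then ": " and a description.
SUBJECT_RE = re.compile(
    r"^(?P<type>[a-z]+)"
    r"(?:\((?P<scope>[^()\r\n]+)\))?"
    r"(?P<breaking>!)?"
    r": (?P<description>\S.*)$"
)


def parse_subject(subject: str) -> Optional[Dict[str, Optional[str]]]:
    """Return the parsed parts of a subject line, or None if it does not fit."""
    match = SUBJECT_RE.match(subject.strip())
    if match is None or match.group("type") not in ASSUMED_TYPES:
        return None
    return match.groupdict()


for subject in (
    "fix(parser): handle empty diff hunks",
    "feat!: drop legacy scorer",
    "Bump version to 1.2.3",
    "update stuff",
):
    print(f"{subject!r:45} -> {parse_subject(subject)}")
```

The last two subjects are the kind of historical message the paragraph describes: real changes sit behind them, but the message itself cannot be used as a Conventional Commit label without rewriting.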

So the current milestone is not a model release. It is stricter than that: prove that GCTX-1 has a clean enough data path to deserve the first model-quality experiment.