GCTX — a language model for Git diffs

Started: Jun 17, 2026
Updated: Jun 21, 2026


Gathering feedback — open an issue with your thoughts.

Roadmap: 23 / 29 (67%, effort-weighted)

✓ Public repository scaffold and project identity
✓ Conventional Commit parser, scorer, and fixture evals
✓ License-reviewed source manifest for the first public-repo audit set
✓ Pilot source-diff extraction from Click, Requests, and Pluggy: 250 source-diff records extracted; 114 accepted for teacher labeling after source review
✓ Local teacher-input pipeline for named artifacts
✓ Pilot teacher-label generation with a local Qwen Coder model: 114/114 labels generated; retry-safe generation handled evidence-path cleanup
✓ Pilot generated-label human review: 105 accepted, 9 edited, 0 rejected
✓ Expanded source-diff batch with split-plan lineage: 1,000 next-batch source-diff records extracted and reviewed; 356 accepted for teacher labeling
✓ Teacher-input artifact for the expanded batch: 356 next-batch teacher-input records generated and validated
✓ Teacher-label generation for the expanded batch: 356/356 generated labels produced with zero failed records
✓ Generated-label review for the expanded batch: 301 accepted, 34 edited, 21 rejected
✓ Expanded supervised training artifact: 335 reviewed SFT records materialized as the next private artifact
✓ Expanded baseline evaluation report: 335/335 reviewed targets are format-valid; REPORT subset is 33/33 format-valid
✓ Record-level REPORT inspection: 33 REPORT records inspected; reviewed targets have 0 format or scope errors
✓ Training pipeline smoke: a dependency-free prototype trained on 302 DEV records and produced 33/33 format-valid REPORT predictions
✓ Promotion/data-card decision for the expanded supervised artifact: next-v0 is approved for private pipeline/eval/model-artifact validation and aggregate public reporting, not public dataset or model release
✓ Tiny neural training smoke: a dependency-free softmax model trained on 302 DEV records; REPORT predictions were 33/33 format-valid with 15/33 type matches
✓ Split-readiness gate for GCTX-1 planning: the earlier next-v0 plan correctly failed because it lacked HELD_OUT windows, target-record counts, enough DEV repos, and ecosystem metadata
✓ GCTX-1 source manifest and split plan: 37 permissively licensed repositories planned with DEV, REPORT, and HELD_OUT splits
✓ GCTX-1 source-diff extraction: 17,511 source-diff records extracted: 13,567 DEV, 2,431 REPORT, 1,513 HELD_OUT
✓ GCTX-1 source-review policy: 7,989 source diffs accepted for teacher labeling; 9,522 rejected; HELD_OUT remains reserved
✓ GCTX-1 teacher-input artifact: 7,989 validated teacher-input records materialized for local teacher generation
✓ Retry-safe teacher generation with progress output: generation writes records incrementally, skips existing outputs on rerun, and reports progress during long runs
Complete GCTX-1 local teacher-label generation
Review generated GCTX-1 labels and promote a supervised artifact
Train the first specialized commit-message model
Evaluate against deterministic checks plus held-out repository diffs
Package a CLI that runs against a local Git diff
Publish model weights, dataset cards, model card, and release notes

What it is

GCTX is a from-scratch small language model family for understanding Git diffs and writing Conventional Commit messages.

The model goal comes first: build a narrow language model that can read a code change, infer intent, and produce a concise Conventional Commit message. gitctx is the CLI and product shell around that model; the main work is the data, training, evaluation, and release process behind GCTX.

The model is not released yet. The current artifact is a conservative data-production pipeline that has moved from pilot validation into the first GCTX-1 teacher-generation batch.

Why it exists

Commit messages are small, structured, repetitive, and still surprisingly easy to get wrong. They are also a useful narrow target for a small language model: the model does not need to be a general coding assistant; it needs to understand a diff, identify the intent, choose an accurate type and scope, and explain the change without inventing context.

That makes GCTX a good proving ground for an open-source model workflow:

the task is valuable enough to become a real developer tool;
the output is easy to inspect;
quality can be checked with deterministic Conventional Commit rules plus human review;
the dataset can be built from public repositories with explicit license review;
the same infrastructure can later support larger or different code-context models.

The practical bet is that a small model can be useful when the task is narrow enough and the data loop is strict enough. A general assistant has to answer almost anything. GCTX has a smaller job: read one Git change, decide what kind of change it is, and write one message in a known format. That makes every part of the project easier to audit. Bad examples can be rejected. Outputs can be scored mechanically. Held-out repositories can be reserved before generation. Progress can be measured without pretending that the model is generally intelligent.

How it differs

The first principle is that GCTX should be more than an open-weight release. The intended release path includes source manifests, data cards, model cards, review artifacts, and reproducible scripts for turning approved public Git history into supervised examples.

The current pipeline is deliberately conservative:

source repositories are selected through a reviewed manifest;
source diffs are extracted into named artifacts;
source diffs are reviewed before any teacher labeling;
teacher labels are generated locally, one diff at a time, with retry-safe provenance;
generated labels are reviewed as accept, edit, or reject;
labels are not training data until a promotion/data-card decision says how they may be used.

The first local teacher is a Qwen Coder model served through Ollama. That teacher is only part of the data-production process; the project target is a smaller specialized language model released under an open-source-friendly process, with gitctx as the first practical CLI interface.
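As a rough illustration of what "one diff at a time" against a local Ollama server could look like, here is a minimal sketch. It uses Ollama's `/api/generate` endpoint on the default local port. The model tag `qwen2.5-coder`, the prompt wording, and the function names are assumptions for illustration, not the project's actual teacher configuration.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint
TEACHER_MODEL = "qwen2.5-coder"  # assumption: the text only says "a Qwen Coder model"


def build_teacher_request(diff_text: str, model: str = TEACHER_MODEL) -> dict:
    """Build the JSON body for one teacher call covering exactly one diff."""
    # Hypothetical prompt; the real teacher-input payloads are not shown in the text.
    prompt = (
        "Write a Conventional Commit message for this diff. "
        "Reply as JSON with keys: type, scope, subject.\n\n" + diff_text
    )
    # stream=False asks for a single response object; format="json" requests JSON output.
    return {"model": model, "prompt": prompt, "stream": False, "format": "json"}


def generate_label(diff_text: str, timeout: float = 300.0) -> dict:
    """Send one diff to the local teacher and parse its JSON reply."""
    body = json.dumps(build_teacher_request(diff_text)).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        envelope = json.loads(response.read())
    # With stream=False the generated text is in the "response" field.
    # json.loads raises if the teacher did not return valid JSON, which a
    # caller would treat as a failed record to retry.
    return json.loads(envelope["response"])
```

Keeping the request builder separate from the network call makes the payload easy to inspect and test without a running server.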

Why not just train on Git logs?

The tempting shortcut is to take public repositories, collect every historical commit message, and train on those messages directly. The problem is that public Git history is not automatically high-quality instruction data. Many commits are merge commits, dependency bumps, release bookkeeping, vague subjects, issue-number-only messages, inconsistent house styles, or broad mixed changes. Even good maintainers often optimize commit messages for their own project history, not for a reusable Conventional Commit assistant.

The current data shows this clearly. In the expanded artifact, reviewed targets are Conventional Commit format-valid on 335/335 records. Raw teacher labels are 333/335 format-valid. Original historical commit subjects are only 9/335 format-valid, and 0/33 format-valid on the REPORT subset. That does not mean historical messages are useless; they still help identify real diffs and real developer intent. But they are not clean enough to be the main supervised label without filtering or regeneration.

So the pipeline uses public Git history as the source of real code changes, not as the final label source. The label is regenerated from the diff, then reviewed. That is slower than scraping commit logs, but it produces cleaner examples and gives every training record a provenance trail.

Data production loop

The core loop is intentionally explicit:

Choose source repositories. Each repo enters through a manifest with license and revision metadata.
Extract source diffs. The extractor records one source-diff example per commit, with changed paths, stats, parent/source commit IDs, split assignment, and provenance.
Review source diffs. Records that are docs-only, CI-only, release bookkeeping, very broad, or weak for the target task are rejected before teacher labeling.
Create teacher inputs. Only accepted source diffs become full prompt payloads.
Generate one label per call. Each diff is sent independently to the local teacher. This keeps retries, validation, and provenance simple.
Validate generated labels. Outputs must be JSON-shaped and must satisfy the Conventional Commit parser/scorer.
Review generated labels. Human review marks labels as accept, edit, or reject.
Promote to training data. A data-card/output-use decision says exactly what the resulting artifact may be used for.

This design is deliberately not fully automatic. Automation helps generate candidates and run validators, but a small model trained on a narrow task is only as good as the examples that survive the gates.
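The source-review step in the loop above can be pictured as a small filter over changed paths. The sketch below is only an illustration: the `SourceDiff` fields are a subset of those named in the text, while the path heuristics, the reason strings, and the `MAX_CHANGED_PATHS` threshold are hypothetical and not the project's review policy.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath


@dataclass(frozen=True)
class SourceDiff:
    commit: str
    parent: str
    split: str  # "DEV", "REPORT", or "HELD_OUT"
    changed_paths: tuple


# Hypothetical heuristics, chosen only for illustration.
DOC_SUFFIXES = {".md", ".rst", ".txt"}
CI_PREFIXES = (".github/", ".circleci/")
MAX_CHANGED_PATHS = 20


def is_doc_path(path: str) -> bool:
    return path.startswith("docs/") or PurePosixPath(path).suffix.lower() in DOC_SUFFIXES


def is_ci_path(path: str) -> bool:
    return path.startswith(CI_PREFIXES)


def review_source_diff(diff: SourceDiff) -> tuple:
    """Return (accepted, reason) for one source diff."""
    paths = diff.changed_paths
    if not paths:
        return (False, "empty")
    if all(is_doc_path(p) for p in paths):
        return (False, "docs-only")
    if all(is_ci_path(p) for p in paths):
        return (False, "ci-only")
    if all(is_doc_path(p) or is_ci_path(p) for p in paths):
        return (False, "non-code-only")
    if len(paths) > MAX_CHANGED_PATHS:
        return (False, "too-broad")
    return (True, "accepted")
```

Returning a reason alongside the decision keeps rejections auditable, which matches the spirit of the gates described here.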

Status

As of 2026-06-21, the pilot and expanded next-v0 passes have validated the pipeline, and GCTX-1 has moved into its first larger teacher-generation batch. The project is still in data production. There is no released model yet, and none of the current numbers should be read as model-quality claims.

The pilot source set used three permissively licensed public repositories: Click, Requests, and Pluggy. The source extractor produced 250 source-diff records. Source review accepted 114 records for teacher labeling and rejected 136 records that were documentation-only, CI/config-only, release bookkeeping, too broad, or otherwise weak for the first label-quality target.

The local teacher-generation run produced 114 labels for those 114 accepted diffs. The first attempt generated 112 labels and failed on two records because the teacher returned evidence paths that did not exactly match the changed paths. The generator was changed to normalize line-suffixed evidence paths and convert invalid evidence paths into warnings instead of fatal errors; the retry completed the remaining two labels.
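A minimal sketch of that fix, under stated assumptions: it treats a "line-suffixed" path as one ending in `:12` or `:12-20`, which is a guess at the shape the teacher returned, and the function name and example paths are hypothetical.

```python
import re

# Assumption: line suffixes look like "path:12" or "path:12-20".
LINE_SUFFIX = re.compile(r":\d+(?:-\d+)?$")


def normalize_evidence_paths(evidence_paths, changed_paths):
    """Strip line suffixes, keep paths that match the diff, warn on the rest."""
    changed = set(changed_paths)
    kept, warnings = [], []
    for raw in evidence_paths:
        path = LINE_SUFFIX.sub("", raw.strip())
        if path in changed:
            if path not in kept:  # avoid duplicates after normalization
                kept.append(path)
        else:
            # Invalid evidence becomes a warning instead of a fatal error.
            warnings.append(f"evidence path not in changed paths: {raw}")
    return kept, warnings
```

For example, `normalize_evidence_paths(["src/click/core.py:120", "tests/missing.py"], ["src/click/core.py"])` keeps the first path without its suffix and reports the second as a warning.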

Human review of the generated labels is now complete:

Decision	Count
Accept	105
Edit	9
Reject	0
Needs review	0

The edited labels corrected type, scope, evidence, or subject issues in the flagged set. The pilot pass established the review and promotion workflow, but it is too small for a useful model.

The next batch expands the pipeline beyond the pilot shape. It extracted 1,000 source-diff records from a reviewed source manifest and explicit split plan. Source review accepted 356 records for teacher labeling and rejected 644 records, mostly because they were documentation-only, CI/config-only, release bookkeeping, broad mixed changes, or otherwise weak for the first supervised quality target.

The accepted next-batch records were materialized as 356 validated teacher-input payloads. Teacher-label generation produced 356 labels with zero failed records. Human review then marked 301 as accepted, 34 as edited, and 21 as rejected. The resulting private supervised artifact contains 335 reviewed SFT records.

The expanded baseline report now compares reviewed targets, raw teacher labels, and original historical commit subjects. The reviewed targets are format-valid on 335/335 records, including 33/33 in the REPORT subset. Raw teacher labels are 333/335 format-valid. Historical commit subjects are only 9/335 format-valid, and 0/33 on REPORT. This is a data-quality checkpoint, not a model-quality claim. It supports the decision to regenerate and review labels instead of training directly on public Git subjects.

The REPORT subset now also has a record-level inspection artifact. It contains 33 reviewed records: 30 accepted as-is and 3 edited during review. Reviewed targets have 0 format errors and 0 scope errors; raw teacher labels have 1 scope error. This closes the immediate inspection step before deciding whether to run a tiny proof-model pipeline validation or expand the dataset first.

The first training pipeline smoke is also complete. It uses a dependency-free prototype model, not a neural model: the goal is to validate that reviewed SFT records can produce a model artifact, prediction artifact, and REPORT eval report. The prototype trained on 302 DEV records and evaluated on 33 REPORT records. Its predictions are 33/33 format-valid with 0 scope errors, but only 17/33 type matches and 0/33 exact message matches. That is useful pipeline evidence, not model-quality evidence.
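To make the two match metrics concrete, here is one way they could be computed. This is a sketch under assumptions: "type match" is taken to mean the predicted header type equals the reviewed target's type, and "exact match" is taken to mean whitespace-stripped string equality. The project's scorer may define them differently.

```python
import re

# Captures the type at the start of a header such as "fix(core)!: subject".
TYPE_PREFIX = re.compile(r"^([a-z]+)(?:\([^()]*\))?!?: ")


def commit_type(message: str):
    """Return the Conventional Commit type of a message, or None if absent."""
    match = TYPE_PREFIX.match(message)
    return match.group(1) if match else None


def match_counts(pairs):
    """Count type and exact matches over (target, prediction) pairs."""
    type_matches = 0
    exact_matches = 0
    for target, prediction in pairs:
        predicted_type = commit_type(prediction)
        if predicted_type is not None and predicted_type == commit_type(target):
            type_matches += 1
        if prediction.strip() == target.strip():
            exact_matches += 1
    return {
        "total": len(pairs),
        "type_matches": type_matches,
        "exact_matches": exact_matches,
    }
```

Reporting both counts side by side shows how a prediction can be format-valid and still miss the type or the wording.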

The first tiny neural smoke is now complete as well. It is a dependency-free single-layer softmax classifier, not the target language model. It trained on 302 DEV records for 25 epochs, with loss moving from 1.459991 to 1.043913, and evaluated on 33 REPORT records. Its predictions were 33/33 format-valid with 0 prediction scope errors, 15/33 type matches, and 0/33 exact message matches. This proves the first gradient-descent model-artifact path works; it is still not a public model-quality claim.
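For readers unfamiliar with the term, a dependency-free single-layer softmax classifier can be written in a few dozen lines of standard-library Python. The sketch below is a generic illustration, not the project's code: the dense feature vectors, learning rate, and function names are assumptions, and the text does not say how diffs were turned into features.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def class_scores(weights, biases, x):
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]


def train_softmax(features, labels, num_classes, epochs=25, lr=0.5):
    """Full-batch gradient descent on mean cross-entropy loss."""
    dim = len(features[0])
    n = len(features)
    weights = [[0.0] * dim for _ in range(num_classes)]
    biases = [0.0] * num_classes
    losses = []
    for _ in range(epochs):
        grad_w = [[0.0] * dim for _ in range(num_classes)]
        grad_b = [0.0] * num_classes
        loss = 0.0
        for x, y in zip(features, labels):
            probs = softmax(class_scores(weights, biases, x))
            loss -= math.log(probs[y])
            for c in range(num_classes):
                # Gradient of cross-entropy w.r.t. the score of class c.
                err = probs[c] - (1.0 if c == y else 0.0)
                grad_b[c] += err
                for j, v in enumerate(x):
                    grad_w[c][j] += err * v
        losses.append(loss / n)  # loss measured before this epoch's update
        for c in range(num_classes):
            biases[c] -= lr * grad_b[c] / n
            for j in range(dim):
                weights[c][j] -= lr * grad_w[c][j] / n
    return weights, biases, losses


def predict(weights, biases, x):
    scores = class_scores(weights, biases, x)
    return max(range(len(scores)), key=scores.__getitem__)
```

In a setup like the one described, the predicted class would be a commit type, and the per-epoch loss list is what lets a run report loss moving from one value to another.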

The expanded artifact now has a public aggregate data card and output-use decision. The decision is intentionally narrow: the private supervised artifact may be used for private pipeline, evaluation, and model-artifact validation, and its aggregate statistics may be reported publicly. It is not approved as a full public JSONL dataset, not enough for a public model release, and not evidence of neural model quality.

The split-readiness gate then blocked a premature larger run. The next-v0 plan correctly failed because it had no HELD_OUT windows, no explicit target-record counts, insufficient DEV repo count for the proof gate, and no ecosystem metadata. That failure was useful: it forced GCTX-1 to become a planned source batch rather than a loose continuation of the pilot.

GCTX-1 now has a larger reviewed source plan and extracted source artifact. The current GCTX-1 source artifact contains 17,511 source-diff records from 37 permissively licensed public repositories:

Split	Source diffs	Repositories
DEV	13,567	37
REPORT	2,431	24
HELD_OUT	1,513	21

The source-review policy accepted 7,989 records for teacher labeling and rejected 9,522 records before generation. Accepted records are currently split as 6,704 DEV and 1,285 REPORT. HELD_OUT records remain reserved and are not sent to teacher generation at this stage.

Source-review decision	Count
Accepted for teacher labeling	7,989
Rejected before teacher labeling	9,522
Needs review	0

The accepted records have been materialized as 7,989 validated teacher-input payloads. These are large prompt artifacts, so the artifact store tracks them through Git LFS instead of normal Git blobs. Generation is designed to be resumable: each output record is written incrementally, reruns skip already-generated IDs, and the generator prints progress during long local teacher runs.
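The resumable behavior described above can be sketched with an append-only JSONL file. This is a minimal illustration under assumptions: the `id` and `label` field names, the JSONL layout, and the `generate` callable are hypothetical stand-ins for the project's real record format and teacher call.

```python
import json
from pathlib import Path


def load_done_ids(output_path: Path):
    """Return (IDs already written, whether the file ends with a newline)."""
    done = set()
    ends_cleanly = True
    if output_path.exists():
        with output_path.open(encoding="utf-8") as handle:
            for line in handle:
                ends_cleanly = line.endswith("\n")
                try:
                    done.add(json.loads(line)["id"])
                except (json.JSONDecodeError, KeyError, TypeError):
                    # Unreadable lines (e.g. truncated by a crash) are skipped,
                    # so their records are generated again.
                    continue
    return done, ends_cleanly


def generate_all(inputs, output_path: Path, generate, progress_every=100):
    """Generate labels for records not yet in output_path; return count written."""
    done, ends_cleanly = load_done_ids(output_path)
    total = len(inputs)
    written = 0
    with output_path.open("a", encoding="utf-8") as handle:
        if not ends_cleanly:
            handle.write("\n")  # keep a truncated tail from merging with new output
        for index, record in enumerate(inputs, start=1):
            if record["id"] not in done:
                label = generate(record)
                handle.write(json.dumps({"id": record["id"], "label": label}) + "\n")
                handle.flush()  # write each record as soon as it is produced
                written += 1
            if index % progress_every == 0 or index == total:
                print(f"{index}/{total} processed, {written} newly generated", flush=True)
    return written
```

Because each record is flushed as soon as it is produced, an interrupted run loses at most the record in flight, and a rerun only pays for what is missing.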

The important caveat is that GCTX-1 is still below the initial 10,000-record DEV training target after source review. It has enough REPORT candidates for the visible evaluation target and a reserved HELD_OUT pool, but only 6,704 accepted DEV records before generated-label review, which is 3,296 short of the target. After generation and review, the likely decision is either to expand the source manifest again or to treat this pass as a smaller proof run rather than the first full GCTX-1 training set.

Current architecture

The working pipeline has four artifact layers:

Source manifest — license-reviewed public repositories and pinned revisions.
Source diffs — extracted Git diffs with paths, stats, source commit, parent commit, split, and provenance.
Teacher inputs — full prompt payloads for source diffs accepted for teacher labeling.
Generated labels and reviews — model-produced Conventional Commit candidates plus human decisions and edits.

The code path supports named artifacts such as smoke, pilot, next, and gctx1, so future batches can be generated, validated, retried, reviewed, and promoted without mixing stages.

Evaluation philosophy

GCTX uses multiple evaluation layers because no single metric is enough.

The first layer is deterministic format scoring. A candidate message must parse as a Conventional Commit: it must have a valid type and a subject, may include a scope, and must follow the required body/footer structure when those parts are present. This catches many failures cheaply, but it does not prove semantic quality.
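A deterministic format check of this kind can be small. The sketch below follows the header shape from the Conventional Commits specification (`type(scope)!: description`, with any body separated by one blank line). The allowed-type list and all names are assumptions for illustration; the project's own parser and scorer may be stricter or accept different types.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Assumption: a commonly used type list; the project's list is not given in the text.
ALLOWED_TYPES = {
    "build", "chore", "ci", "docs", "feat", "fix",
    "perf", "refactor", "revert", "style", "test",
}

HEADER = re.compile(
    r"^(?P<type>[a-z]+)"
    r"(?:\((?P<scope>[^()\r\n]+)\))?"
    r"(?P<breaking>!)?"
    r": (?P<subject>\S.*)$"
)


@dataclass(frozen=True)
class ParsedCommit:
    type: str
    scope: Optional[str]
    breaking: bool
    subject: str
    body: str


def parse_commit(message: str) -> Optional[ParsedCommit]:
    """Return a ParsedCommit if the message is format-valid, else None."""
    lines = message.strip("\n").split("\n")
    match = HEADER.match(lines[0])
    if match is None or match.group("type") not in ALLOWED_TYPES:
        return None
    if len(lines) > 1 and lines[1].strip():
        return None  # a body must be separated from the header by a blank line
    return ParsedCommit(
        type=match.group("type"),
        scope=match.group("scope"),
        breaking=match.group("breaking") is not None,
        subject=match.group("subject"),
        body="\n".join(lines[2:]).strip(),
    )
```

A historical subject such as `Update README` returns `None` here, which is the kind of failure that makes raw Git subjects score poorly on format validity.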

The second layer is split-based reporting. DEV records are allowed to influence training. REPORT records are used for visible evaluation during development. HELD_OUT records must be reserved before generation and kept away from tuning decisions. That separation matters because small projects can accidentally overfit their evaluation set simply by repeatedly looking at it.

The third layer is record-level inspection. Aggregate scores can hide bad behavior. The REPORT inspection artifact exists so individual failures can be reviewed: wrong type, wrong scope, invented context, mixed-change confusion, or a message that is valid but unhelpful.

The fourth layer is model-artifact validation. The dependency-free prototype and tiny neural smoke do not claim useful model quality. They prove that reviewed SFT records can drive a training run, produce a model artifact, produce predictions, and be evaluated reproducibly. The next model work should move from pipeline validation toward actual language-model behavior.

Current gates

The current GCTX-1 planning target is:

Gate	Target	Current status
DEV target records	10,000	13,567 extracted; 6,704 accepted before teacher generation
REPORT target records	1,000	2,431 extracted; 1,285 accepted before teacher generation
HELD_OUT target records	1,000	1,513 extracted and reserved
DEV repositories	25	37 repositories represented in DEV
REPORT repositories	5	24 repositories represented in REPORT
HELD_OUT repositories	5	21 repositories represented in HELD_OUT
Ecosystems	at least 2	encoded in the GCTX-1 source plan
Maximum single DEV repo share	25%	enforced by split-readiness checks before expansion

The earlier next-v0 artifact is useful because it proved the pipeline. GCTX-1 is the first pass that looks like a real proof-model source batch, but it still may need more accepted DEV records before it should be promoted into a first model-training claim.
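The planning targets above lend themselves to a mechanical check. The following sketch encodes the listed targets and returns a list of failures. The function name, data shapes, and messages are hypothetical, and the real split-readiness gate also checks things not modeled here, such as HELD_OUT windows.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SplitStats:
    records: int
    repositories: int


# Targets taken from the planning list above.
TARGETS = {
    "DEV": SplitStats(records=10_000, repositories=25),
    "REPORT": SplitStats(records=1_000, repositories=5),
    "HELD_OUT": SplitStats(records=1_000, repositories=5),
}
MIN_ECOSYSTEMS = 2
MAX_DEV_REPO_SHARE = 0.25


def check_readiness(actual, ecosystems, largest_dev_repo_records):
    """Return human-readable failures; an empty list means the gate passes."""
    failures = []
    for split, target in TARGETS.items():
        stats = actual.get(split)
        if stats is None:
            failures.append(f"{split}: split missing from plan")
            continue
        if stats.records < target.records:
            failures.append(f"{split}: {stats.records} records < target {target.records}")
        if stats.repositories < target.repositories:
            failures.append(
                f"{split}: {stats.repositories} repositories < target {target.repositories}"
            )
    if ecosystems < MIN_ECOSYSTEMS:
        failures.append(f"ecosystems: {ecosystems} < {MIN_ECOSYSTEMS}")
    dev = actual.get("DEV")
    if dev is not None and dev.records > 0:
        share = largest_dev_repo_records / dev.records
        if share > MAX_DEV_REPO_SHARE:
            failures.append(f"DEV: largest repo share {share:.0%} > {MAX_DEV_REPO_SHARE:.0%}")
    return failures
```

Fed the accepted record counts reported on this page (6,704 DEV, 1,285 REPORT, 1,513 HELD_OUT), a check like this would flag only the DEV record count, assuming the repository, ecosystem, and repo-share inputs pass.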

The immediate milestone is to finish the GCTX-1 local teacher-generation and review loop without weakening the data gates.

The concrete next sequence is:

Complete local teacher-label generation for the 7,989 accepted GCTX-1 teacher inputs.
Validate generated labels for JSON shape, Conventional Commit format, scope consistency, and evidence paths.
Review generated labels as accept, edit, or reject.
Promote only accepted or edited labels into a supervised artifact with a data-card/output-use decision.
Decide whether the reviewed GCTX-1 artifact is large enough for a small proof model or whether the source manifest needs another expansion first.
Train the first specialized commit-message language model once the reviewed artifact is large and clean enough to justify a model-quality experiment.

CLI packaging for gitctx.com will wait until the model/eval loop shows measurable behavior. The CLI should be a product shell around a real model, not a substitute for the model work.

Where it lives

Repo: github.com/serkanaltuntas/gitctx
Planned CLI/product home: gitctx.com
This page tracks the build-in-public status until the first public model/data release.

Tags: python, ollama, local-llm, conventional-commits