This project provides an end-to-end pipeline for automatically generating high-quality Git commit messages by fine-tuning a Large Language Model (LLM) using real-world Git repository data.
It covers:
- Discovering and cloning popular GitHub repositories
- Extracting and normalizing commit-related data
- Converting commits into LLM-friendly prompt/target pairs
- Filtering and cleaning datasets
- Fine-tuning an LLM using LoRA (Low-Rank Adaptation) for efficient training
The final result is a lightweight LoRA adapter that can generate concise, imperative commit messages based on code diffs and repository context.
```
GitHub Search
      ↓
Clone Repositories
      ↓
Extract Commit Data (JSONL)
      ↓
Normalize & Filter
      ↓
Convert to LLM Prompts
      ↓
LoRA Fine-Tuning
```
```
.
├── clone-repos.sh          # Clone repositories in parallel
├── scraping-repo-list.sh   # Fetch top GitHub repositories by language
├── exclude-repos.sh        # Manually exclude problematic repositories
├── process-repos.sh        # Extract structured commit data from repos
├── normalize-charset.py    # Normalize Unicode and clean text
├── language-filter.py      # Filter commit messages by language
├── sequentize-for-llm.py   # Convert commit data to LLM prompt/target pairs
├── finetune-via-lora.py    # Fine-tune an LLM with LoRA
└── repos/                  # Cloned repositories (generated)
```
- git
- jq
- curl
- GNU parallel
- Python 3.9+
- Python packages: `transformers`, `datasets`, `peft`, `torch`, `langdetect`, `orjson` (optional, recommended)
Search for popular repositories by language and stars:
```bash
./scraping-repo-list.sh <max_pages> <lang1,lang2,...> --min-stars 100
```
Example:
```bash
./scraping-repo-list.sh 5 python,cpp --min-stars 500
```
This generates a text file containing `owner/repo` entries.
Optionally exclude repositories by adding them to `exclude-repos.sh`.
Clone repositories in parallel:
```bash
./clone-repos.sh repo_list.txt 8
```
All repositories will be cloned into the `repos/` directory.
Convert Git history into structured JSONL files:
```bash
./process-repos.sh -r repos -o commit_data -m 1000 -t 8
```
Each repository produces one `.jsonl` file containing:
- Cleaned commit message
- Code diff (truncated)
- Recent commit history
- Code style guidelines (if available)
- Affected files
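
For illustration, a record can be inspected as in the sketch below. The field names used here ("message", "files", "diff") are assumptions based on the list above; check `process-repos.sh` for the actual keys.

```python
import json

# Read the first record of one repository's JSONL output and print a few
# (assumed) fields. The path is illustrative.
with open("commit_data/example-repo.jsonl", encoding="utf-8") as f:
    record = json.loads(f.readline())

print(record.get("message"))         # cleaned commit message
print(record.get("files"))           # affected files
print(record.get("diff", "")[:200])  # truncated code diff
```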
Normalize Unicode and clean invisible characters:
```bash
python normalize-charset.py commit_data repo_data_normalized
```
This step ensures consistent encoding across multilingual repositories.
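
Conceptually, this step amounts to applying standard Unicode normalization and stripping invisible characters. A minimal sketch of that idea (not the script's exact logic):

```python
import unicodedata

# Zero-width and BOM characters that often leak into commit messages.
INVISIBLE = {"\u200b", "\u200c", "\u200d", "\ufeff"}

def normalize_text(text: str) -> str:
    # Canonical Unicode normalization, then drop invisible characters.
    text = unicodedata.normalize("NFC", text)
    return "".join(ch for ch in text if ch not in INVISIBLE)

print(normalize_text("Fix\u200b parser bug\ufeff"))  # -> "Fix parser bug"
```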
Keep only commits in specific languages (default: English):
```bash
python language-filter.py input.jsonl output.jsonl --target-lang en
```
Generate prompt/target pairs suitable for causal language modeling:
```bash
python sequentize-for-llm.py commit_data samples.jsonl
```
Each sample contains:
- A structured prompt (diff, affected files, history, style)
- A target commit message
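
A quick way to sanity-check the generated samples is sketched below; the field names "prompt" and "target" are assumptions, so verify them against `sequentize-for-llm.py`.

```python
import json

# Print the first prompt/target pair from the generated dataset.
with open("samples.jsonl", encoding="utf-8") as f:
    sample = json.loads(f.readline())

print("--- PROMPT ---")
print(sample["prompt"][:500])  # structured context: diff, files, history, style
print("--- TARGET ---")
print(sample["target"])        # the reference commit message
```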
Train a LoRA adapter on top of a base LLM:
```bash
python finetune-via-lora.py \
    --model_name meta-llama/Llama-3-8b-hf \
    --data_path samples.jsonl \
    --fourbit \
    --bf16 \
    --output_dir ./output \
    --final_save_path ./commit-message-lora
```
Key features:
- 4-bit or 8-bit training support
- Configurable LoRA target modules
- Optional validation dataset
- Efficient training on limited GPU memory
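
For reference, the sketch below shows what a 4-bit LoRA setup typically looks like with `transformers` and `peft`. It is a minimal illustration of the technique, not the exact configuration used by `finetune-via-lora.py`; the rank, target modules, and quantization settings are assumptions.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_name = "meta-llama/Llama-3-8b-hf"

# Load the base model in 4-bit (roughly what --fourbit enables).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=bnb_config)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Attach LoRA adapters to the attention projections (target modules are configurable).
model = prepare_model_for_kbit_training(model)
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```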
Each prompt includes:
- Affected files
- Code diff
- Recent commit examples
- Code style guidelines
The model is instructed to generate a concise, imperative commit message.
The final output is a LoRA adapter directory containing:
- LoRA weights
- Tokenizer configuration
This adapter can be merged with or loaded alongside the base model for inference.
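
A minimal sketch of loading the adapter for inference; the paths and generation settings are illustrative:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3-8b-hf", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained("./commit-message-lora")

# Load the LoRA adapter on top of the base model
# (or call model.merge_and_unload() afterwards to fuse the weights).
model = PeftModel.from_pretrained(base, "./commit-message-lora")

prompt = "..."  # a prompt built the same way as during training
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```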
- Merge, revert, squash, and fixup commits are filtered out automatically
- Large diffs and style files are truncated for efficiency
- Use diverse repositories for better generalization
- Always validate data quality before training
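
As a starting point for validating data quality, a small script can report basic statistics over the training samples (field names are assumed, as above):

```python
import json

# Collect simple statistics over the generated samples to spot obvious problems
# (empty targets, very long prompts, heavily duplicated messages).
lengths, targets = [], set()
empty, total = 0, 0
with open("samples.jsonl", encoding="utf-8") as f:
    for line in f:
        sample = json.loads(line)
        total += 1
        if not sample.get("target", "").strip():
            empty += 1
        lengths.append(len(sample.get("prompt", "")))
        targets.add(sample.get("target", "").strip())

print(f"samples: {total}, empty targets: {empty}, unique targets: {len(targets)}")
print(f"avg prompt length (chars): {sum(lengths) / max(len(lengths), 1):.0f}")
```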
This project is intended for research and educational purposes. Please ensure that your use of GitHub data complies with repository licenses and GitHub's Terms of Service.
- Hugging Face Transformers & Datasets
- PEFT / LoRA
- GNU Parallel
- The open-source GitHub community