This repository contains code to reproduce results from the paper https://arxiv.org/abs/2601.02031.
Our model training code can be found in the folder nanoGPT. It is based on nanoGPT (commit 7a1614e).
Our changes include (but are not limited to) the use of:
- FineWeb
- RoPE, SwiGLU
- qk-layernorm & independent weight decay (see the sketch below)
- sequence length 2048
- customizable batch size & learning rate + automatic choice of micro batch size
- optional Xavier weight initialization
- optional weight tying
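For illustration, qk-layernorm normalizes the query and key vectors per attention head before the attention scores are computed. The following is a minimal PyTorch sketch of the idea, not the repository's actual implementation; the class name and layout are hypothetical:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class QKNormAttention(nn.Module):
    """Causal self-attention with qk-layernorm (illustrative sketch only)."""

    def __init__(self, d_model: int, n_head: int):
        super().__init__()
        assert d_model % n_head == 0
        self.n_head = n_head
        self.head_dim = d_model // n_head
        self.qkv = nn.Linear(d_model, 3 * d_model, bias=False)
        self.proj = nn.Linear(d_model, d_model, bias=False)
        # qk-layernorm: normalize q and k over the per-head feature dimension
        self.q_norm = nn.LayerNorm(self.head_dim)
        self.k_norm = nn.LayerNorm(self.head_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, C = x.shape
        q, k, v = self.qkv(x).split(C, dim=2)
        q = q.view(B, T, self.n_head, self.head_dim).transpose(1, 2)
        k = k.view(B, T, self.n_head, self.head_dim).transpose(1, 2)
        v = v.view(B, T, self.n_head, self.head_dim).transpose(1, 2)
        q, k = self.q_norm(q), self.k_norm(k)  # the qk-layernorm step
        y = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.proj(y.transpose(1, 2).contiguous().view(B, T, C))
```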
Our analysis code can be found in the folder results.
```bash
# e.g. using conda
conda create -n venv26 python=3.11
conda activate venv26
pip install torch==2.6 numpy transformers datasets tiktoken wandb tqdm rotary-embedding-torch scipy matplotlib seaborn jupyter scienceplots
```
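A quick optional check that the installed PyTorch version sees a GPU:

```bash
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
```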
Preparing the FineWeb data (~100B tokens) requires roughly 300GB of disk space:

```bash
cd nanoGPT/data
python prepare_fineweb.py
```

Note that the above python script contains a TARGET_DIRECTORY variable that should be adjusted beforehand.
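The assignment to adjust looks roughly like this; the path below is only a placeholder:

```python
# In nanoGPT/data/prepare_fineweb.py: write the tokenized data to a disk with ~300GB free
TARGET_DIRECTORY = "/path/to/fineweb"  # placeholder -- replace with your own path
```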
The experiments can be run step by step with the bash scripts listed in the following table.
| Script Name | Purpose |
|---|---|
| nanoGPT/config/*.sh | Create training config files |
| 0_run_training.sh | Run training |
| 1_prepare_validation.sh | Create validation config files |
| 2_run_validation.sh | Run validation |
| 3_aggregate.sh | Aggregate results |
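For orientation, one pass through the pipeline could look like the following; the config script name is a placeholder and the working directory of the numbered scripts may need adjusting:

```bash
bash nanoGPT/config/<your_config_script>.sh  # placeholder: pick one of the scripts in nanoGPT/config/
bash 0_run_training.sh                       # train the models
bash 1_prepare_validation.sh                 # create validation config files
bash 2_run_validation.sh                     # run validation
bash 3_aggregate.sh                          # aggregate results into loss_overview.csv
```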
Note:
- Each bash script contains the commands for all main experiments (not the hyperparameter sensitivity experiments).
- The "Scale" variable in the bash scripts corresponds to the model size as follows:

  | Scale | Model Size |
  |---|---|
  | 4 | 16M |
  | 6 | 29M |
  | 8 | 57M |
  | A | 109M |
  | C | 221M |

- The "Method" variable in the bash scripts corresponds to the mitigation strategy as follows:

  | Method | Mitigation Strategy |
  |---|---|
  | A | baseline |
  | E | mu-loss |
  | R | mu-centering |
  | Z | z-loss |

- W&B logging is turned off by default. To turn it on, change `wandb_log = False` to `wandb_log = True` in the config files and log in to W&B.
- The output checkpoints from each experiment can be found in the subfolders of `nanoGPT/output`.
- The aggregated results can be found in `nanoGPT/output/loss_overview.csv`.
- The actual experiments were conducted in parallel using slurm scripts.
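As an illustration only (not one of the scripts actually used), a minimal sbatch wrapper around a single step might look like this; all directives are placeholders to adapt to your cluster:

```bash
#!/bin/bash
#SBATCH --job-name=nanogpt-train    # placeholder job name
#SBATCH --gres=gpu:1                # adjust GPU count to your cluster
#SBATCH --time=24:00:00             # placeholder wall-time limit

bash 0_run_training.sh
```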
The actual results, namely
- `loss_overview.csv` (main experiments)
- `loss_overview_all.csv` (main + hyperparameter sensitivity experiments)
- the checkpoints

are analyzed using the Jupyter notebooks in the results folder:

```bash
cd results
jupyter notebook
```

They produce figures and tables that can be found in `results/figs` and `results/tables`, respectively.
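The aggregated CSV files can also be inspected directly outside the notebooks, e.g. with pandas (the column layout depends on the aggregation script):

```python
import pandas as pd

# Load the aggregated results of the main experiments
df = pd.read_csv("nanoGPT/output/loss_overview.csv")
print(df.head())
```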