This project builds a GPT-style decoder-only transformer from scratch in PyTorch and
pre-trains it on a real language modeling dataset used in published research. The goal is not
to call AutoModelForCausalLM.from_pretrained() — it is to implement every architectural
decision by hand and watch language modeling loss and perplexity fall as the model learns to
predict text.
By the end of this project you will have:
- Written every component of a transformer (attention, FFN, positional encoding, weight tying)
- Trained a BPE tokenizer on real data
- Observed attention head patterns as heatmaps
- Analyzed weight distributions and spectral norms per layer
- Generated coherent short stories from a model you trained yourself
TinyStories — Eldan & Li, 2023
Eldan, R., & Li, Y. (2023). TinyStories: How Small Can Language Models Be and Still Speak Coherent English? arXiv:2305.07759.
A synthetic dataset of ~2.1M short children's stories generated by GPT-3.5 and GPT-4, specifically designed to train small language models that produce coherent English text. Each story uses a vocabulary accessible to a 3–4 year old, making it ideal for small models on limited hardware.
- Source:
roneneldan/TinyStoriesvia Hugging Face Datasets (streaming) - Subset used: First 50,000 stories (~95/5 train/val split)
- No manual download required — streamed automatically on first run
| Hyperparameter | Value |
|---|---|
| Architecture | Decoder-only (GPT-style) |
| Vocab size | 8,000 (BPE trained on data) |
| Context length | 256 tokens |
| d_model | 128 |
| n_layers | 4 |
| n_heads | 4 |
| d_ff | 512 (4× d_model) |
| Activation | GELU |
| Normalization | Pre-LayerNorm |
| Weight tying | Yes (embedding ↔ LM head) |
| Total parameters | ~3.5M |
Key design decisions:
- Pre-LayerNorm (normalize before attention/FFN, not after) — more stable training
- Weight tying between token embedding and LM head — reduces parameters, improves perplexity
- Causal mask registered as a buffer — no future token leakage by construction
- GELU activation in FFN — smoother gradient flow than ReLU
| Hyperparameter | Value |
|---|---|
| Optimizer | AdamW (β₁=0.9, β₂=0.95) |
| Learning rate | 3e-4 (peak) |
| LR schedule | Cosine decay with linear warmup |
| Warmup steps | 500 |
| Max steps | 10,000 |
| Batch size | 16 (per micro-batch) |
| Gradient accumulation | 16 steps (effective batch=256) |
| Gradient clip norm | 1.0 |
| Weight decay | 0.1 |
| Dropout | 0.1 |
pretrain/
├── config.py — TransformerConfig dataclass + YAML loader
├── tokenizer.py — BPETokenizer (wraps HuggingFace tokenizers)
├── data.py — TinyStories streaming, chunking into context windows
├── model.py — GPTModel, TransformerBlock, CausalSelfAttention
├── train.py — Pre-training loop with grad accum, checkpointing, logging
├── evaluate.py — Perplexity computation on validation set
├── generate.py — Greedy decoding and nucleus sampling
├── visualize.py — Attention heatmaps, weight distribution plots
└── tests/ — Property-based and unit tests (Hypothesis + pytest)
# Install dependencies (from repo root)
make setup
# Train the model (runs ~30 min on CPU)
python -m pretrain.train
# Generate text from a trained checkpoint
python -m pretrain.generate --prompt "Once upon a time"
# Run tests
python -m pytest pretrain/tests/ -vAfter training, plots are saved to outputs/project2/plots/:
training_curves.png— train loss and validation perplexity vs stepattention_heatmaps/— per-head attention weight matricesweight_distributions/— weight histograms and spectral norms per layer
After 10,000 training steps on TinyStories, the model generates coherent short stories. Example (nucleus sampling, top_p=0.9, temperature=0.8):
Once upon a time, there was a little girl named Lily. She loved to play in the garden with her dog. One day, she found a small bird that could not fly. Lily was very sad. She asked her mom for help...
| Metric | Value (after 10K steps) |
|---|---|
| Final train loss | ~1.8 |
| Final val perplexity | ~25–35 |
| Untrained baseline PPL | ~8,000 (random weights) |
The model achieves perplexity demonstrably lower than an untrained baseline, confirming it has learned language structure from the TinyStories corpus.
| Property | Description |
|---|---|
| Property 5 | Transformer output shape: (B, T, vocab_size) for any valid (B, T) |
| Property 6 | Tokenizer round-trip: decode(encode(text)) == text |
| Property 7 | Causal mask: upper triangle of attention weights is zero after softmax |
| Property 8 | Inference reproducibility: same seed → identical output sequences |
| Property 17 | Config round-trip: YAML serialize/deserialize preserves all fields |
| Paper | Why it matters |
|---|---|
| Vaswani et al., 2017 — Attention Is All You Need | The original transformer architecture — every component here traces back to this paper |
| Radford et al., 2019 — Language Models are Unsupervised Multitask Learners (GPT-2) | Introduced the decoder-only GPT architecture, weight tying, and pre-norm |
| Eldan & Li, 2023 — TinyStories | The dataset used here; demonstrates small models can learn coherent language on focused data |
| Loshchilov & Hutter, 2019 — Decoupled Weight Decay Regularization (AdamW) | The optimizer used — decouples weight decay from gradient update |
| Loshchilov & Hutter, 2017 — SGDR: Stochastic Gradient Descent with Warm Restarts | The cosine LR schedule with warmup used throughout training |