Project 2 — Transformer Architecture and Language Model Pre-training

Motivation

This project builds a GPT-style decoder-only transformer from scratch in PyTorch and pre-trains it on TinyStories, a language modeling dataset used in published research. The goal is not to call AutoModelForCausalLM.from_pretrained(); it is to implement every architectural component by hand and watch the language modeling loss and perplexity fall as the model learns to predict text.

By the end of this project you will have:

  • Written every component of a transformer (attention, FFN, positional encoding, weight tying)
  • Trained a BPE tokenizer on real data
  • Observed attention head patterns as heatmaps
  • Analyzed weight distributions and spectral norms per layer
  • Generated coherent short stories from a model you trained yourself

Dataset

TinyStories — Eldan & Li, 2023

Eldan, R., & Li, Y. (2023). TinyStories: How Small Can Language Models Be and Still Speak Coherent English? arXiv:2305.07759.

A synthetic dataset of ~2.1M short children's stories generated by GPT-3.5 and GPT-4, specifically designed to train small language models that produce coherent English text. Each story uses a vocabulary accessible to a 3–4 year old, making it ideal for small models on limited hardware.

  • Source: roneneldan/TinyStories via Hugging Face Datasets (streaming)
  • Subset used: First 50,000 stories (~95/5 train/val split)
  • No manual download required — streamed automatically on first run
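
The subset selection described above can be sketched as follows. This is a minimal illustration, not the project's data.py: `take_and_split` is a hypothetical helper, and holding out the last 5% of the 50,000 stories (rather than a random 5%) is an assumption. With Hugging Face Datasets the stream would typically come from `load_dataset("roneneldan/TinyStories", split="train", streaming=True)`; a stand-in generator of plain strings is used here so the sketch runs offline.

```python
from itertools import islice
from typing import Iterable, List, Tuple


def take_and_split(stream: Iterable[str], n_total: int = 50_000,
                   val_fraction: float = 0.05) -> Tuple[List[str], List[str]]:
    """Take the first n_total items from a stream and split them train/val.

    Hypothetical helper: the hold-out is the tail of the subset (an assumption).
    """
    stories = list(islice(stream, n_total))      # stops early if the stream is shorter
    n_val = round(len(stories) * val_fraction)
    n_train = len(stories) - n_val
    return stories[:n_train], stories[n_train:]


# Stand-in for the real stream so the example needs no network access.
fake_stream = (f"story {i}" for i in range(1_000_000))
train, val = take_and_split(fake_stream)
print(len(train), len(val))
```

Because `islice` consumes the stream lazily, only the first 50,000 items are ever pulled, which is the point of streaming a dataset with roughly 2.1M rows.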

Architecture

| Hyperparameter | Value |
| --- | --- |
| Architecture | Decoder-only (GPT-style) |
| Vocab size | 8,000 (BPE trained on data) |
| Context length | 256 tokens |
| d_model | 128 |
| n_layers | 4 |
| n_heads | 4 |
| d_ff | 512 (4× d_model) |
| Activation | GELU |
| Normalization | Pre-LayerNorm |
| Weight tying | Yes (embedding ↔ LM head) |
| Total parameters | ~3.5M |
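
As a rough cross-check of the table, the parameter count can be estimated from the hyperparameters alone. The sketch below is an estimate under stated assumptions, not a readout of the actual model: it assumes a learned positional embedding table, biases on all linear layers, two LayerNorms per block plus a final LayerNorm, and no bias on the LM head. `count_parameters` is a hypothetical helper. Under these assumptions the estimate is about 1.85M parameters with weight tying and about 2.87M without; both are below the ~3.5M listed in the table, and the exact count depends on details of model.py that are not shown here.

```python
def count_parameters(vocab_size=8000, context_len=256, d_model=128,
                     n_layers=4, d_ff=512, tied=True):
    """Estimate trainable parameters for a GPT-style decoder (assumptions in comments)."""
    tok_emb = vocab_size * d_model
    pos_emb = context_len * d_model              # assumes a learned positional table
    attn = 4 * (d_model * d_model + d_model)     # Q, K, V and output projections, with biases
    ffn = (d_model * d_ff + d_ff) + (d_ff * d_model + d_model)
    layer_norms = 2 * 2 * d_model                # two LayerNorms per block, scale and shift
    per_block = attn + ffn + layer_norms
    final_ln = 2 * d_model                       # assumes a final LayerNorm before the head
    lm_head = 0 if tied else vocab_size * d_model  # tied head reuses the embedding matrix
    return tok_emb + pos_emb + n_layers * per_block + final_ln + lm_head


print(f"tied:   {count_parameters(tied=True):,}")
print(f"untied: {count_parameters(tied=False):,}")
```

The difference between the two figures is exactly the 8,000 × 128 output matrix that weight tying removes.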

Key design decisions:

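  • Pre-LayerNorm (normalize before attention/FFN, not after): leaves the residual path unnormalized and generally trains more stably than post-LayerNorm
  • Weight tying between token embedding and LM head: removes a separate 8,000 × 128 output matrix (~1.0M parameters) and usually improves perplexity
  • Causal mask registered as a buffer: positions cannot attend to future tokens by construction, and the mask moves with the model across devices without being a trainable parameter
  • GELU activation in FFN: a smooth alternative to ReLU that keeps a nonzero gradient for small negative inputs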
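
The design decisions above can be sketched in PyTorch as follows. This is a minimal illustration under assumptions, not the project's model.py: the class names match those listed for model.py, but the layer layout (a fused QKV projection, dropout placement, default arguments) is assumed for the example.

```python
import math

import torch
import torch.nn as nn
import torch.nn.functional as F


class CausalSelfAttention(nn.Module):
    def __init__(self, d_model=128, n_heads=4, context_len=256, dropout=0.1):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)   # fused Q, K, V projection (assumed layout)
        self.proj = nn.Linear(d_model, d_model)
        self.drop = nn.Dropout(dropout)
        # Lower-triangular mask stored as a buffer: not trainable, follows .to(device).
        mask = torch.tril(torch.ones(context_len, context_len, dtype=torch.bool))
        self.register_buffer("causal_mask", mask)

    def forward(self, x):
        B, T, C = x.shape
        q, k, v = self.qkv(x).split(C, dim=2)
        # (B, T, C) -> (B, n_heads, T, head_dim)
        q, k, v = (t.view(B, T, self.n_heads, C // self.n_heads).transpose(1, 2)
                   for t in (q, k, v))
        scores = (q @ k.transpose(-2, -1)) / math.sqrt(q.size(-1))
        # Future positions get -inf, so softmax assigns them exactly zero weight.
        scores = scores.masked_fill(~self.causal_mask[:T, :T], float("-inf"))
        weights = self.drop(F.softmax(scores, dim=-1))
        out = (weights @ v).transpose(1, 2).contiguous().view(B, T, C)
        return self.proj(out)


class TransformerBlock(nn.Module):
    def __init__(self, d_model=128, n_heads=4, d_ff=512, context_len=256, dropout=0.1):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = CausalSelfAttention(d_model, n_heads, context_len, dropout)
        self.ln2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model), nn.Dropout(dropout)
        )

    def forward(self, x):
        x = x + self.attn(self.ln1(x))   # pre-LN: normalize, then sublayer, then add residual
        x = x + self.ffn(self.ln2(x))
        return x


# Weight tying: the LM head shares the (vocab_size, d_model) embedding tensor.
tok_emb = nn.Embedding(8000, 128)
lm_head = nn.Linear(128, 8000, bias=False)
lm_head.weight = tok_emb.weight

block = TransformerBlock().eval()
print(block(torch.randn(2, 10, 128)).shape)
```

Slicing the buffer to `[:T, :T]` lets the same mask serve any sequence length up to the context length.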

Training Hyperparameters

| Hyperparameter | Value |
| --- | --- |
| Optimizer | AdamW (β₁=0.9, β₂=0.95) |
| Learning rate | 3e-4 (peak) |
| LR schedule | Cosine decay with linear warmup |
| Warmup steps | 500 |
| Max steps | 10,000 |
| Batch size | 16 (per micro-batch) |
| Gradient accumulation | 16 steps (effective batch=256) |
| Gradient clip norm | 1.0 |
| Weight decay | 0.1 |
| Dropout | 0.1 |
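
The learning-rate schedule in the table can be written as a small function of the step index. This is a sketch under assumptions: `lr_at` is a hypothetical name, the warmup is assumed to start from zero, and the floor `min_lr=0.0` is assumed because the table gives no minimum learning rate.

```python
import math


def lr_at(step, peak_lr=3e-4, warmup_steps=500, max_steps=10_000, min_lr=0.0):
    """Linear warmup to peak_lr, then cosine decay to min_lr (min_lr is an assumption)."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    # Fraction of the decay phase completed, clamped so steps past max_steps stay at min_lr.
    progress = min(1.0, (step - warmup_steps) / (max_steps - warmup_steps))
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


for step in (0, 250, 500, 5_250, 10_000):
    print(step, f"{lr_at(step):.2e}")
```

With 16 sequences per micro-batch and 16 accumulation steps, each optimizer update sees 16 × 16 = 256 sequences, which is 256 × 256 = 65,536 tokens when every window is full.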

Module Structure

pretrain/
├── config.py       — TransformerConfig dataclass + YAML loader
├── tokenizer.py    — BPETokenizer (wraps HuggingFace tokenizers)
├── data.py         — TinyStories streaming, chunking into context windows
├── model.py        — GPTModel, TransformerBlock, CausalSelfAttention
├── train.py        — Pre-training loop with grad accum, checkpointing, logging
├── evaluate.py     — Perplexity computation on validation set
├── generate.py     — Greedy decoding and nucleus sampling
├── visualize.py    — Attention heatmaps, weight distribution plots
└── tests/          — Property-based and unit tests (Hypothesis + pytest)
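
The "chunking into context windows" step listed for data.py could look roughly like the sketch below. It is an illustration only: `chunk_into_windows` is a hypothetical function, and the choices to concatenate stories with an end-of-story token (`eos_id`), to use non-overlapping windows, and to drop the incomplete tail are all assumptions.

```python
from typing import Iterable, List, Tuple


def chunk_into_windows(token_streams: Iterable[List[int]], context_len: int = 256,
                       eos_id: int = 0) -> List[Tuple[List[int], List[int]]]:
    """Concatenate tokenized stories and cut them into (input, target) windows."""
    flat: List[int] = []
    for ids in token_streams:
        flat.extend(ids)
        flat.append(eos_id)          # assumed separator between stories
    windows = []
    # Each example needs context_len + 1 tokens: inputs are the first context_len,
    # targets are the same tokens shifted left by one position.
    for start in range(0, len(flat) - context_len, context_len):
        chunk = flat[start:start + context_len + 1]
        windows.append((chunk[:-1], chunk[1:]))
    return windows


examples = chunk_into_windows([[1, 2, 3], [4, 5]], context_len=2)
for inputs, targets in examples:
    print(inputs, targets)
```

The one-token shift between inputs and targets is what turns plain text into a next-token prediction task.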

Quickstart

# Install dependencies (from repo root)
make setup

# Train the model (runs ~30 min on CPU)
python -m pretrain.train

# Generate text from a trained checkpoint
python -m pretrain.generate --prompt "Once upon a time"

# Run tests
python -m pytest pretrain/tests/ -v

Training Curves

After training, plots are saved to outputs/project2/plots/:

  • training_curves.png — train loss and validation perplexity vs step
  • attention_heatmaps/ — per-head attention weight matrices
  • weight_distributions/ — weight histograms and spectral norms per layer
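
The spectral norm reported per layer is the largest singular value of a weight matrix. The sketch below estimates it by power iteration in plain Python to show the idea; it is not taken from visualize.py, and `spectral_norm` is a hypothetical helper. In PyTorch one would more likely call `torch.linalg.matrix_norm(W, ord=2)` on the weight tensor.

```python
import math
import random
from typing import List


def spectral_norm(W: List[List[float]], n_iter: int = 200, seed: int = 0) -> float:
    """Estimate the largest singular value of W by power iteration on W^T W."""
    rng = random.Random(seed)
    n_rows, n_cols = len(W), len(W[0])
    v = [rng.gauss(0.0, 1.0) for _ in range(n_cols)]   # random start vector
    for _ in range(n_iter):
        u = [sum(W[i][j] * v[j] for j in range(n_cols)) for i in range(n_rows)]  # u = W v
        v = [sum(W[i][j] * u[i] for i in range(n_rows)) for j in range(n_cols)]  # v = W^T u
        norm = math.sqrt(sum(x * x for x in v))
        if norm == 0.0:              # zero matrix, or start vector in the null space
            return 0.0
        v = [x / norm for x in v]
    u = [sum(W[i][j] * v[j] for j in range(n_cols)) for i in range(n_rows)]
    return math.sqrt(sum(x * x for x in u))              # ||W v|| with ||v|| = 1


print(spectral_norm([[3.0, 0.0], [0.0, 4.0]]))
```

A spectral norm that grows much faster in one layer than in the others is a common sign that the layer is amplifying its inputs, which is why it is tracked alongside the weight histograms.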

Sample Generated Stories

After 10,000 training steps on TinyStories, the model generates coherent short stories. Example (nucleus sampling, top_p=0.9, temperature=0.8):

Once upon a time, there was a little girl named Lily. She loved to play in the garden with her dog. One day, she found a small bird that could not fly. Lily was very sad. She asked her mom for help...
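
Nucleus (top-p) sampling with temperature, as used for the sample above, can be sketched in plain Python. This is an illustration of the technique rather than the project's generate.py; the function names are hypothetical, and the logits are passed in as a plain list for one decoding step.

```python
import math
import random
from typing import List, Optional


def softmax(logits: List[float], temperature: float = 1.0) -> List[float]:
    scaled = [x / temperature for x in logits]
    m = max(scaled)                               # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def nucleus_indices(probs: List[float], top_p: float = 0.9) -> List[int]:
    """Smallest set of highest-probability tokens whose total mass reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    return kept


def sample_nucleus(logits: List[float], top_p: float = 0.9, temperature: float = 0.8,
                   rng: Optional[random.Random] = None) -> int:
    rng = rng or random.Random()
    probs = softmax(logits, temperature)
    kept = nucleus_indices(probs, top_p)
    # random.choices renormalizes the weights over the kept tokens.
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]


print(nucleus_indices([0.5, 0.3, 0.15, 0.05], top_p=0.9))
print(sample_nucleus([2.0, 1.0, 0.5, -1.0], rng=random.Random(0)))
```

A temperature below 1 sharpens the distribution before the cutoff is applied, so fewer tokens are needed to reach the top-p mass. Greedy decoding is the limiting case that always takes the first index in the sorted order.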


Results

| Metric | Value (after 10K steps) |
| --- | --- |
| Final train loss | ~1.8 |
| Final val perplexity | ~25–35 |
| Untrained baseline PPL | ~8,000 (random weights) |

Validation perplexity falls from ~8,000, roughly the level of a uniform guess over the 8,000-token vocabulary, to ~25–35 after training, confirming that the model has learned language structure from the TinyStories corpus.
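
Loss and perplexity in the table are two views of the same quantity: perplexity is the exponential of the mean per-token cross-entropy, measured in nats. The short sketch below shows the conversion; `perplexity` and `loss_from_perplexity` are hypothetical helper names.

```python
import math


def perplexity(mean_nll: float) -> float:
    """Perplexity from mean per-token negative log-likelihood (natural log)."""
    return math.exp(mean_nll)


def loss_from_perplexity(ppl: float) -> float:
    return math.log(ppl)


# A uniform guess over V tokens has loss ln(V), so its perplexity equals V.
print(f"uniform over 8,000 tokens: loss={math.log(8000):.2f}, ppl={perplexity(math.log(8000)):.0f}")
print(f"loss 1.8 -> ppl {perplexity(1.8):.2f}")
print(f"ppl 25-35 -> loss {loss_from_perplexity(25):.2f} to {loss_from_perplexity(35):.2f}")
```

This is why the untrained baseline sits near the vocabulary size: a randomly initialized model predicts close to uniformly over all 8,000 tokens.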


Correctness Properties Tested

| Property | Description |
| --- | --- |
| Property 5 | Transformer output shape: (B, T, vocab_size) for any valid (B, T) |
| Property 6 | Tokenizer round-trip: decode(encode(text)) == text |
| Property 7 | Causal mask: the strict upper triangle of the attention weights (entries above the diagonal) is zero after softmax |
| Property 8 | Inference reproducibility: same seed → identical output sequences |
| Property 17 | Config round-trip: YAML serialize/deserialize preserves all fields |
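
To illustrate what a check like Property 7 looks like, the sketch below tests the causal-mask property on random score matrices. It is a dependency-free stand-in and not the project's test suite: the real tests are described as using Hypothesis and pytest, whereas this example uses a seeded `random.Random` loop and plain lists. `causal_softmax` and `check_causal_property` are hypothetical names.

```python
import math
import random
from typing import List


def causal_softmax(scores: List[List[float]]) -> List[List[float]]:
    """Row-wise softmax over a T x T score matrix with future positions masked out."""
    T = len(scores)
    weights = []
    for i in range(T):
        # Position i may only attend to positions j <= i.
        masked = [scores[i][j] if j <= i else float("-inf") for j in range(T)]
        m = max(masked)                       # finite: the diagonal is never masked
        exps = [math.exp(s - m) for s in masked]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


def check_causal_property(n_trials: int = 50, max_len: int = 8, seed: int = 0) -> bool:
    rng = random.Random(seed)
    for _ in range(n_trials):
        T = rng.randint(1, max_len)
        scores = [[rng.uniform(-5, 5) for _ in range(T)] for _ in range(T)]
        w = causal_softmax(scores)
        for i in range(T):
            # Entries above the diagonal must be exactly zero; each row must sum to 1.
            assert all(w[i][j] == 0.0 for j in range(i + 1, T))
            assert abs(sum(w[i]) - 1.0) < 1e-9
    return True


print(check_causal_property())
```

The check is on the strict upper triangle only: the diagonal must stay nonzero, since every position attends at least to itself.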

Foundational Papers

| Paper | Why it matters |
| --- | --- |
| Vaswani et al., 2017: Attention Is All You Need | The original transformer architecture; the attention, feed-forward, and residual structure used here trace back to this paper |
| Radford et al., 2019: Language Models are Unsupervised Multitask Learners (GPT-2) | The decoder-only GPT design this model follows; moved layer normalization to the input of each sub-block (pre-norm) |
| Eldan & Li, 2023: TinyStories | The dataset used here; demonstrates that small models can learn coherent language on focused data |
| Loshchilov & Hutter, 2019: Decoupled Weight Decay Regularization (AdamW) | The optimizer used; decouples weight decay from the gradient-based update |
| Loshchilov & Hutter, 2017: SGDR: Stochastic Gradient Descent with Warm Restarts | Introduced cosine annealing of the learning rate, the basis for the cosine decay used here (the linear warmup is not part of this paper) |