Project 2 — Transformer Architecture and Language Model Pre-training

Motivation

This project builds a GPT-style decoder-only transformer from scratch in PyTorch and pre-trains it on a real language modeling dataset used in published research. The goal is not to call AutoModelForCausalLM.from_pretrained() — it is to implement every architectural decision by hand and watch language modeling loss and perplexity fall as the model learns to predict text.

By the end of this project you will have:

Written every component of a transformer (attention, FFN, positional encoding, weight tying)
Trained a BPE tokenizer on real data
Observed attention head patterns as heatmaps
Analyzed weight distributions and spectral norms per layer
Generated coherent short stories from a model you trained yourself

Dataset

TinyStories — Eldan & Li, 2023

Eldan, R., & Li, Y. (2023). TinyStories: How Small Can Language Models Be and Still Speak Coherent English? arXiv:2305.07759.

A synthetic dataset of ~2.1M short children's stories generated by GPT-3.5 and GPT-4, specifically designed to train small language models that produce coherent English text. Each story uses a vocabulary accessible to a 3–4 year old, making it ideal for small models on limited hardware.

Source: roneneldan/TinyStories via Hugging Face Datasets (streaming)
Subset used: First 50,000 stories (~95/5 train/val split)
No manual download required — streamed automatically on first run
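
As an illustration of the subset and split described above, the sketch below shows one way a flat stream of token ids might be cut into 256-token training windows and how a 95/5 split of 50,000 stories works out. The function names `chunk_tokens` and `split_counts` are hypothetical, and the real `data.py` may chunk or split differently (for example, by story rather than by token stream).

```python
from typing import Iterable, List, Tuple


def chunk_tokens(
    token_ids: Iterable[int], context_length: int = 256
) -> List[Tuple[List[int], List[int]]]:
    """Cut a flat token stream into (input, target) pairs of length context_length.

    Targets are the inputs shifted left by one token, so each window reads
    context_length + 1 tokens. A trailing partial window is dropped.
    """
    ids = list(token_ids)
    windows = []
    for start in range(0, len(ids) - context_length, context_length):
        block = ids[start : start + context_length + 1]
        windows.append((block[:-1], block[1:]))
    return windows


def split_counts(n_stories: int = 50_000, val_fraction: float = 0.05) -> Tuple[int, int]:
    """Return (n_train, n_val) story counts for the given validation fraction."""
    n_val = round(n_stories * val_fraction)
    return n_stories - n_val, n_val


if __name__ == "__main__":
    print(split_counts())                       # story counts for train and val
    print(chunk_tokens(range(10), context_length=4))
```

Shifting the targets by one position is what turns each window into a next-token prediction task.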

Architecture

Hyperparameter	Value
Architecture	Decoder-only (GPT-style)
Vocab size	8,000 (BPE trained on data)
Context length	256 tokens
d_model	128
n_layers	4
n_heads	4
d_ff	512 (4× d_model)
Activation	GELU
Normalization	Pre-LayerNorm
Weight tying	Yes (embedding ↔ LM head)
Total parameters	~3.5M
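
The parameter total can be sanity-checked from the table with a back-of-the-envelope count. The sketch below is not the repository's code: it assumes learned positional embeddings, biases on every linear layer, two LayerNorms per block plus a final LayerNorm, and an LM head with no bias. Under those assumptions the listed hyperparameters give about 1.85M parameters with tied weights and about 2.87M untied, so the ~3.5M in the table presumably reflects implementation details not listed here. On the real model, `sum(p.numel() for p in model.parameters())` gives the authoritative number.

```python
def count_parameters(
    vocab_size: int = 8000,
    context_length: int = 256,
    d_model: int = 128,
    n_layers: int = 4,
    d_ff: int = 512,
    tie_weights: bool = True,
) -> int:
    """Estimate the parameter count of a GPT-style decoder (assumptions in comments)."""
    token_embedding = vocab_size * d_model
    # Assumption: learned (not sinusoidal) positional embeddings.
    position_embedding = context_length * d_model
    # Attention: fused QKV projection plus output projection, both with bias.
    attention = (d_model * 3 * d_model + 3 * d_model) + (d_model * d_model + d_model)
    # Feed-forward: d_model -> d_ff -> d_model, both with bias.
    ffn = (d_model * d_ff + d_ff) + (d_ff * d_model + d_model)
    # Two LayerNorms per block, each with a scale and a shift vector.
    norms = 2 * 2 * d_model
    block = attention + ffn + norms
    final_norm = 2 * d_model
    # A tied LM head reuses the token embedding matrix and adds nothing.
    lm_head = 0 if tie_weights else vocab_size * d_model
    return token_embedding + position_embedding + n_layers * block + final_norm + lm_head


if __name__ == "__main__":
    print(count_parameters())                   # tied embedding and LM head
    print(count_parameters(tie_weights=False))  # separate LM head
```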

Key design decisions:

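Pre-LayerNorm (normalize before attention/FFN, not after): leaves the residual path unnormalized, which tends to make training more stable
Weight tying between token embedding and LM head: removes a separate 8,000 × 128 output matrix (~1.0M parameters) and often improves perplexity
Causal mask registered as a buffer: the lower-triangular mask moves with the model across devices without being a trainable parameter, and rules out future-token leakage by construction
GELU activation in FFN: smooth around zero, so negative pre-activations still receive gradient, whereas ReLU's gradient there is exactly zero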
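
To make the causal-mask decision concrete, here is a minimal pure-Python sketch of a masked row-wise softmax. It is an illustration rather than the code in `model.py`; `causal_softmax` is a hypothetical name. In PyTorch the same effect is typically obtained by building a lower-triangular mask with `torch.tril`, storing it with `register_buffer`, and applying `masked_fill` with negative infinity before the softmax.

```python
import math
from typing import List


def causal_softmax(scores: List[List[float]]) -> List[List[float]]:
    """Row-wise softmax over a T x T score matrix with future positions masked out."""
    T = len(scores)
    weights = []
    for i in range(T):
        # Query position i may attend only to key positions 0..i.
        masked = [scores[i][j] if j <= i else float("-inf") for j in range(T)]
        row_max = max(masked)  # finite, because the diagonal is never masked
        exps = [math.exp(s - row_max) for s in masked]  # exp(-inf) == 0.0
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


if __name__ == "__main__":
    for row in causal_softmax([[0.5, 2.0, -1.0], [1.0, 1.0, 3.0], [0.0, 0.0, 0.0]]):
        print([round(w, 3) for w in row])
```

Because masked scores become exactly zero after the softmax, every entry strictly above the diagonal is zero and each row still sums to one.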

Training Hyperparameters

Hyperparameter	Value
Optimizer	AdamW (β₁=0.9, β₂=0.95)
Learning rate	3e-4 (peak)
LR schedule	Cosine decay with linear warmup
Warmup steps	500
Max steps	10,000
Batch size	16 (per micro-batch)
Gradient accumulation	16 steps (effective batch=256)
Gradient clip norm	1.0
Weight decay	0.1
Dropout	0.1
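
The schedule and batch settings in the table can be written out as a small sketch. This is one common formulation rather than necessarily the one in `train.py`: the function name `lr_at` is hypothetical, the warmup is assumed to reach the peak on the last warmup step, and the floor `min_lr` is an assumption (the table does not give a minimum learning rate, so it defaults to 0 here).

```python
import math

MICRO_BATCH = 16
GRAD_ACCUM_STEPS = 16
CONTEXT_LENGTH = 256

# Sequences and tokens consumed per optimizer step.
EFFECTIVE_BATCH = MICRO_BATCH * GRAD_ACCUM_STEPS       # 256 sequences
TOKENS_PER_STEP = EFFECTIVE_BATCH * CONTEXT_LENGTH     # 65,536 tokens


def lr_at(
    step: int,
    peak_lr: float = 3e-4,
    warmup_steps: int = 500,
    max_steps: int = 10_000,
    min_lr: float = 0.0,  # assumption: not specified in the table
) -> float:
    """Linear warmup to peak_lr, then cosine decay to min_lr at max_steps."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    if step >= max_steps:
        return min_lr
    progress = (step - warmup_steps) / (max_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    for s in (0, 499, 500, 5250, 9999):
        print(s, f"{lr_at(s):.2e}")
```

With gradient accumulation, the loss of each micro-batch is usually divided by the number of accumulation steps so that the summed gradient matches what a single batch of 256 sequences would produce.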

Module Structure

pretrain/
├── config.py       — TransformerConfig dataclass + YAML loader
├── tokenizer.py    — BPETokenizer (wraps HuggingFace tokenizers)
├── data.py         — TinyStories streaming, chunking into context windows
├── model.py        — GPTModel, TransformerBlock, CausalSelfAttention
├── train.py        — Pre-training loop with grad accum, checkpointing, logging
├── evaluate.py     — Perplexity computation on validation set
├── generate.py     — Greedy decoding and nucleus sampling
├── visualize.py    — Attention heatmaps, weight distribution plots
└── tests/          — Property-based and unit tests (Hypothesis + pytest)

Quickstart

# Install dependencies (from repo root)
make setup

# Train the model (runs ~30 min on CPU)
python -m pretrain.train

# Generate text from a trained checkpoint
python -m pretrain.generate --prompt "Once upon a time"

# Run tests
python -m pytest pretrain/tests/ -v

Training Curves

After training, plots are saved to outputs/project2/plots/:

training_curves.png — train loss and validation perplexity vs step
attention_heatmaps/ — per-head attention weight matrices
weight_distributions/ — weight histograms and spectral norms per layer

Sample Generated Stories

After 10,000 training steps on TinyStories, the model generates coherent short stories. Example (nucleus sampling, top_p=0.9, temperature=0.8):

Once upon a time, there was a little girl named Lily. She loved to play in the garden with her dog. One day, she found a small bird that could not fly. Lily was very sad. She asked her mom for help...
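
The sampling settings used for this example (top_p=0.9, temperature=0.8) can be illustrated with a small pure-Python sketch of nucleus sampling for a single decoding step. It is not the code in `generate.py`; the helper names `softmax`, `nucleus_ids`, and `nucleus_sample` are hypothetical, and the real implementation presumably works on tensors.

```python
import math
import random
from typing import List, Optional


def softmax(logits: List[float], temperature: float = 1.0) -> List[float]:
    """Temperature-scaled softmax; temperature < 1 sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def nucleus_ids(probs: List[float], top_p: float) -> List[int]:
    """Smallest set of token ids, most probable first, whose mass reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    return kept


def nucleus_sample(
    logits: List[float],
    top_p: float = 0.9,
    temperature: float = 0.8,
    rng: Optional[random.Random] = None,
) -> int:
    """Sample one token id from the renormalized nucleus of the distribution."""
    if rng is None:
        rng = random.Random()
    probs = softmax(logits, temperature)
    kept = nucleus_ids(probs, top_p)
    # random.choices renormalizes the weights of the kept tokens internally.
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]


if __name__ == "__main__":
    logits = [math.log(p) for p in (0.5, 0.3, 0.15, 0.05)]
    rng = random.Random(0)
    print([nucleus_sample(logits, rng=rng) for _ in range(10)])
```

Passing a seeded `random.Random` makes the draw repeatable, which is the behavior the inference reproducibility property relies on.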

Results

Metric	Value (after 10K steps)
Final train loss	~1.8
Final val perplexity	~25–35
Untrained baseline PPL	~8,000 (random weights, roughly the vocab size)

Validation perplexity drops from ~8,000 at initialization (about what uniform guessing over the 8,000-token vocabulary gives) to ~25–35 after training, confirming that the model has learned language structure from the TinyStories corpus.
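
The relationship between loss and perplexity behind these numbers is a one-liner: perplexity is the exponential of the mean per-token cross-entropy. The sketch below assumes the loss is measured in nats (natural log), as PyTorch's cross-entropy does; the function names are hypothetical.

```python
import math


def perplexity(mean_cross_entropy_nats: float) -> float:
    """Perplexity is exp of the mean per-token cross-entropy (in nats)."""
    return math.exp(mean_cross_entropy_nats)


def uniform_baseline_loss(vocab_size: int = 8000) -> float:
    """Cross-entropy of a model that assigns probability 1/vocab_size to every token."""
    return math.log(vocab_size)


if __name__ == "__main__":
    print(f"uniform loss: {uniform_baseline_loss():.3f} nats")
    print(f"uniform PPL:  {perplexity(uniform_baseline_loss()):.0f}")
    print(f"loss at PPL 25 and 35: {math.log(25):.2f} and {math.log(35):.2f} nats")
```

A freshly initialized model predicts close to uniformly, which is why its perplexity sits near the vocabulary size. By the same formula, a validation perplexity of 25 to 35 corresponds to a validation loss of roughly 3.2 to 3.6 nats, and a train loss of ~1.8 nats to a train perplexity of about 6.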

Correctness Properties Tested

Property	Description
Property 5	Transformer output shape: `(B, T, vocab_size)` for any valid `(B, T)`
Property 6	Tokenizer round-trip: `decode(encode(text)) == text`
Property 7	Causal mask: attention weights strictly above the diagonal are zero after softmax
Property 8	Inference reproducibility: same seed → identical output sequences
Property 17	Config round-trip: YAML serialize/deserialize preserves all fields

Foundational Papers

Paper	Why it matters
Vaswani et al., 2017 — Attention Is All You Need	The original transformer architecture — every component here traces back to this paper
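Radford et al., 2019 — Language Models are Unsupervised Multitask Learners (GPT-2)	Scaled up the decoder-only GPT architecture (introduced in GPT-1, Radford et al., 2018) and moved LayerNorm to the input of each sub-block (pre-norm); also ties input and output embeddings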
Eldan & Li, 2023 — TinyStories	The dataset used here; demonstrates small models can learn coherent language on focused data
Loshchilov & Hutter, 2019 — Decoupled Weight Decay Regularization (AdamW)	The optimizer used — decouples weight decay from gradient update
Loshchilov & Hutter, 2017 — SGDR: Stochastic Gradient Descent with Warm Restarts	Introduced cosine annealing of the learning rate, which the cosine decay schedule used here builds on (the linear warmup is a separate addition)
