This is my attempt at implementing a PyTorch version of GPT-1 (Generative Pre-trained Transformer) from scratch, based on the original paper "Improving Language Understanding by Generative Pre-Training" by OpenAI.
I am not a researcher, and this implementation is not perfect; it is an exercise in testing things out and understanding how the architecture works.
This project implements the GPT-1 architecture from the ground up, including all core components:
- Decoder-only transformer architecture with stacked decoder layers
- Multi-head self-attention with causal masking
- Layer normalization (custom implementation)
- Position-wise feedforward neural networks with GELU activation
- Learned positional embeddings (instead of static sinusoidal embeddings)
- KV caching for efficient inference
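As a rough sketch of the causal self-attention at the heart of these components (function and variable names here are illustrative, not the classes in this repo):

```python
import torch
import torch.nn.functional as F

def causal_attention(q, k, v):
    # q, k, v: (batch, heads, seq_len, head_dim)
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5  # (batch, heads, seq, seq)
    # Causal mask: position i may only attend to positions <= i
    seq_len = q.size(-2)
    mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
    scores = scores.masked_fill(~mask, float("-inf"))
    return F.softmax(scores, dim=-1) @ v

q = k = v = torch.randn(1, 2, 4, 8)
out = causal_attention(q, k, v)
```

Because of the mask, the first position can only attend to itself, so its output is exactly its own value vector.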
The implementation follows the GPT-1 specifications:
- Embedding dimension: 768 (configurable)
- Number of layers: Configurable (n layers)
- Attention heads: Configurable (h heads)
- Feedforward inner dimension: 3072 (configurable)
- Activation function: GELU
- Dropout: 0.1 for regularization (residual, embedding, and attention dropouts)
- Max sequence length: 512 tokens (configurable)
- Weight initialization: N(0, 0.02)
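The specifications above can be collected into a small config object; this is a sketch with illustrative field names, not the repo's actual configuration API:

```python
from dataclasses import dataclass

@dataclass
class GPT1Config:
    # Values mirror the GPT-1 specs listed above; field names are illustrative.
    vocab_size: int = 40482   # 40,478 BPE tokens + 4 special tokens
    d_model: int = 768        # embedding dimension
    n_layers: int = 12        # number of decoder layers (2 in this repo's training run)
    n_heads: int = 12         # attention heads
    d_ff: int = 3072          # feedforward inner dimension
    max_seq_len: int = 512
    dropout: float = 0.1
    init_std: float = 0.02    # weights drawn from N(0, 0.02)

cfg = GPT1Config()
```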
├── custom_gpt1.py # Main GPT-1 model class
├── gpt1_layer.py # Decoder layer implementation
├── gpt1_sublayers.py # Core sublayers (attention, layer norm, FFN, embeddings)
├── gpt1_utils.py # Utility functions (attention, masking, position IDs)
└── notebook/
├── training_with_prebuilt_tokenizer.ipynb # Training script
├── testing_sublayers.ipynb # Testing and validation
├── testing_eval.ipynb # Comprehensive evaluation on multiple benchmarks
└── *.pt # Model checkpoints
The main model class that:
- Stacks multiple decoder layers
- Handles token and positional embeddings
- Provides both standard forward pass and KV-cached forward pass for efficient inference
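The idea behind the KV-cached forward pass can be sketched as follows: at each decoding step, only the new token's key/value are computed and appended to the cache, and a single query attends over the whole cache (so no causal mask is needed). This is an illustrative helper, not the repo's actual API:

```python
import torch
import torch.nn.functional as F

def attend_with_cache(q_new, k_new, v_new, cache):
    # q_new/k_new/v_new: (batch, heads, 1, head_dim) for the newest token
    if cache is None:
        k_all, v_all = k_new, v_new
    else:
        # Append the new key/value to everything cached so far
        k_all = torch.cat([cache[0], k_new], dim=2)
        v_all = torch.cat([cache[1], v_new], dim=2)
    scores = q_new @ k_all.transpose(-2, -1) / q_new.size(-1) ** 0.5
    out = F.softmax(scores, dim=-1) @ v_all
    return out, (k_all, v_all)

cache = None
for _ in range(3):  # three decoding steps
    q = k = v = torch.randn(1, 2, 1, 8)
    out, cache = attend_with_cache(q, k, v, cache)
```

This turns each generation step from O(n²) attention over the full recomputed sequence into O(n) attention against cached keys and values.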
Implements a single decoder layer with:
- Multi-head self-attention with residual connection and layer norm
- Position-wise feedforward network with residual connection and layer norm
- Dropout for regularization
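A minimal sketch of such a layer, assuming the post-LN ordering of the original GPT-1 paper (sublayer, dropout, residual add, then layer norm) and using `nn.MultiheadAttention` for brevity where the repo implements attention itself:

```python
import torch
import torch.nn as nn

class DecoderLayerSketch(nn.Module):
    """Illustrative post-LN decoder layer; not the repo's gpt1_layer.py."""
    def __init__(self, d_model=768, n_heads=12, d_ff=3072, dropout=0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads,
                                          dropout=dropout, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(),
                                 nn.Linear(d_ff, d_model))
        self.ln1, self.ln2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)
        self.drop = nn.Dropout(dropout)

    def forward(self, x, attn_mask=None):
        a, _ = self.attn(x, x, x, attn_mask=attn_mask, need_weights=False)
        x = self.ln1(x + self.drop(a))          # attention sublayer + residual
        x = self.ln2(x + self.drop(self.ffn(x)))  # feedforward sublayer + residual
        return x
```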
- LearnedPositionalEmbedding: Combines token and positional embeddings
- MultiHeadSelfAttention: Implements scaled dot-product attention with multiple heads
- LayerNorm: Custom layer normalization implementation
- PositionwiseFeedForwardNeuralNetwork: Two-layer MLP with GELU activation
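To illustrate what a custom LayerNorm computes (normalize over the last dimension, then apply a learned gain and bias), here is a minimal version; the repo's implementation may differ in details such as the epsilon value:

```python
import torch
import torch.nn as nn

class LayerNormSketch(nn.Module):
    """Minimal layer normalization, for illustration only."""
    def __init__(self, d_model, eps=1e-5):
        super().__init__()
        self.gain = nn.Parameter(torch.ones(d_model))
        self.bias = nn.Parameter(torch.zeros(d_model))
        self.eps = eps

    def forward(self, x):
        mean = x.mean(dim=-1, keepdim=True)
        var = x.var(dim=-1, keepdim=True, unbiased=False)
        return self.gain * (x - mean) / torch.sqrt(var + self.eps) + self.bias
```

At initialization this matches PyTorch's built-in `nn.LayerNorm`.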
Helper functions for:
- Scaled dot-product attention computation
- Creating attention masks (padding and causal)
- Generating position IDs
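Illustrative versions of what such helpers typically compute; the names and mask conventions (True = attend vs. True = block) may differ from `gpt1_utils.py`:

```python
import torch

def causal_mask(seq_len):
    # True where attention is allowed: lower-triangular matrix
    return torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

def padding_mask(token_ids, pad_id):
    # True for real tokens, False for padding
    return token_ids != pad_id

def position_ids(token_ids):
    # 0 .. seq_len-1 for each sequence in the batch
    return torch.arange(token_ids.size(1)).unsqueeze(0).expand_as(token_ids)
```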
- ✅ Full GPT-1 architecture implementation
- ✅ KV caching for efficient autoregressive generation
- ✅ Custom layer normalization
- ✅ Causal masking for autoregressive language modeling
- ✅ Padding mask support
- ✅ Training and checkpointing utilities
- ✅ Modular design for easy experimentation
The model is trained on the Open-Phi Textbooks dataset from Hugging Face:
- Dataset: `open-phi/textbooks`
- Source: Hugging Face Datasets
- Training examples: 1,795 original samples
- Preprocessed samples: ~160,412 training sequences (after chunking)
- Content: AI-generated textbook content in markdown format
- Fields: Each example contains `topic`, `model`, `concepts`, `outline`, `markdown`, `field`, `subfield`, and `rag`
- Training data: Uses the `markdown` field, which contains the full textbook content
- Long sequences are split into chunks of 512 tokens (max sequence length)
- Each chunk becomes a separate training example
- Sequences are tokenized using OpenAI GPT tokenizer
- Special tokens added: `<start/>`, `<end/>`, `<pad/>`, `<mask/>`
- Tokenizer: OpenAI GPT tokenizer (`OpenAIGPTTokenizer`) from Hugging Face
- Vocabulary size: 40,478 + 4 special tokens = 40,482
- Encoding: Byte Pair Encoding (BPE) with 40,000 merges (matching original GPT-1)
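The chunking step described above can be sketched in a few lines; the notebook's exact logic may differ (e.g. in how it handles special tokens or the final short chunk):

```python
def chunk_token_ids(token_ids, max_len=512):
    """Split one long tokenized document into fixed-size training chunks."""
    return [token_ids[i:i + max_len] for i in range(0, len(token_ids), max_len)]

# A 1,200-token document becomes three training examples:
chunks = chunk_token_ids(list(range(1200)), max_len=512)
# -> chunks of lengths 512, 512, 176
```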
The training notebook (training_with_prebuilt_tokenizer.ipynb) includes:
- Dataset loading and preprocessing
- Model initialization with proper weight initialization
- Training loop with checkpointing
- Loss tracking and visualization
- Custom learning rate scheduler (linear warmup + cosine annealing)
- Adam optimizer with weight decay (L2 regularization)
- Batch size: 32
- Epochs: 300
- Learning rate: 2.5e-4 (max) with warmup and cosine annealing
- Warmup steps: 2,000
- Weight decay: 0.01 (L2 regularization on non-bias/gain weights)
- Dropout: 0.1
- Model size: 2 decoder layers (configurable)
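The learning-rate schedule (linear warmup to the max rate, then cosine annealing) can be sketched as a pure function of the step count. `max_lr` and `warmup_steps` match the hyperparameters above; `total_steps` is an assumed placeholder for the length of the training run:

```python
import math

def lr_at_step(step, max_lr=2.5e-4, warmup_steps=2000, total_steps=100_000):
    """Linear warmup to max_lr, then cosine annealing down to zero."""
    if step < warmup_steps:
        return max_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return max_lr * 0.5 * (1 + math.cos(math.pi * progress))
```

A schedule like this is commonly plugged into PyTorch via `torch.optim.lr_scheduler.LambdaLR`.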
The evaluation notebook (testing_eval.ipynb) provides comprehensive evaluation on multiple benchmarks:
- Perplexity: Measures how well the model predicts the next token (lower is better)
- Cross-entropy loss: Direct loss metric
- Next token prediction accuracy: Top-1 and Top-5 accuracy for token prediction
- Qualitative text generation: Sample text generation for manual inspection
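Perplexity and cross-entropy are directly related: perplexity is the exponential of the mean per-token cross-entropy (in nats). A minimal illustration:

```python
import math

def perplexity(token_nlls):
    """token_nlls: negative log-likelihoods of each target token (nats)."""
    return math.exp(sum(token_nlls) / len(token_nlls))

# A model that assigns every token probability 1/40482 (uniform over this
# repo's vocabulary) has perplexity equal to the vocabulary size:
uniform_nll = math.log(40482)
```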
- Open-Phi Textbooks Dataset
- Evaluation on held-out test set from training data
- Measures in-domain performance
- 80/20 train/test split
- WikiText-2 Benchmark
- Standard language modeling benchmark
- Provides comparison to published GPT-1 results (~40.9 perplexity for full 12-layer model)
- Evaluates on both validation and test splits
- HellaSwag Benchmark
- Commonsense reasoning benchmark
- Zero-shot evaluation (no fine-tuning required)
- Multiple-choice task: model selects best continuation from 4 options
- Original GPT-1 achieved ~78.9% accuracy (zero-shot)
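Zero-shot multiple-choice scoring typically works by summing the model's log-probabilities of each candidate ending given the context and picking the highest-scoring one. This is an illustrative sketch (normalization details, e.g. dividing by ending length, vary between evaluation setups), demonstrated with a stand-in model rather than the repo's:

```python
import torch
import torch.nn.functional as F

def score_choices(model, context_ids, endings_ids):
    scores = []
    for ending in endings_ids:
        ids = torch.tensor([context_ids + ending])
        logp = F.log_softmax(model(ids), dim=-1)  # (1, seq_len, vocab)
        # Sum log-probs of each ending token, predicted from the previous position
        total = sum(logp[0, pos - 1, ids[0, pos]].item()
                    for pos in range(len(context_ids), ids.size(1)))
        scores.append(total)
    return max(range(len(scores)), key=scores.__getitem__)

# Stand-in "model" that always strongly predicts token 3, for demonstration:
def dummy_model(ids):
    logits = torch.zeros(1, ids.size(1), 5)
    logits[..., 3] = 5.0
    return logits

best = score_choices(dummy_model, [1, 2], [[3, 3], [4, 4]])
```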
- Multi-checkpoint evaluation (epochs 0, 30, 60, 90, 120)
- Visualization of results over training epochs
- Comparison to original GPT-1 performance where applicable
- The current implementation does not include weight tying between the embedding and output layers (as noted in the code comments)
- The model follows the original GPT-1 paper specifications closely
- Checkpoints are saved during training for model persistence
- Evaluation includes both in-domain (open-phi) and out-of-domain (WikiText-2, HellaSwag) benchmarks
- The 2-layer model will have higher perplexity than the full 12-layer GPT-1, which is expected given the smaller model size
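Since weight tying is called out as absent, for reference here is the one-line change it would involve in a typical PyTorch setup (module names here are illustrative, not the repo's):

```python
import torch.nn as nn

vocab_size, d_model = 40482, 768
tok_emb = nn.Embedding(vocab_size, d_model)
lm_head = nn.Linear(d_model, vocab_size, bias=False)
# Weight tying: the output projection shares the embedding matrix,
# as in the original GPT-1 (this repo intentionally leaves them separate).
lm_head.weight = tok_emb.weight
```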
See SETUP.md for detailed installation and setup instructions.
- PyTorch
- Transformers (Hugging Face)
- Datasets (Hugging Face)
- NumPy
- Pandas
- Matplotlib
- tqdm
- Jupyter Notebook
See requirements.txt for a complete list of dependencies.