This is my attempt at implementing a PyTorch version of GPT-1 (Generative Pre-trained Transformer) from scratch, based on the original paper "Improving Language Understanding by Generative Pre-Training" by OpenAI.
I am not a researcher, and this implementation is not perfect; it is an exercise in testing things out and understanding how the architecture works.
This project implements the GPT-1 architecture from the ground up, including all core components:
- Decoder-only transformer architecture with stacked decoder layers
- Multi-head self-attention with causal masking
- Layer normalization (custom implementation)
- Position-wise feedforward neural networks with GELU activation
- Learned positional embeddings (instead of static sinusoidal embeddings)
- KV caching for efficient inference
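As a rough sketch of the causal self-attention at the heart of these components (function and variable names here are illustrative, not the classes in this repo):

```python
import torch
import torch.nn.functional as F

def causal_attention(q, k, v):
    # q, k, v: (batch, heads, seq_len, head_dim)
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5  # (batch, heads, seq, seq)
    # Causal mask: position i may only attend to positions <= i
    seq_len = q.size(-2)
    mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
    scores = scores.masked_fill(~mask, float("-inf"))
    return F.softmax(scores, dim=-1) @ v

q = k = v = torch.randn(1, 2, 4, 8)
out = causal_attention(q, k, v)
```

Because of the mask, the first position can only attend to itself, so its output is exactly its own value vector.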
The implementation follows the GPT-1 specifications:
- Embedding dimension: 768 (configurable)
- Number of layers: Configurable (n layers)
- Attention heads: Configurable (h heads)
- Feedforward inner dimension: 3072 (configurable)
- Activation function: GELU
- Dropout: 0.1 for regularization (residual, embedding, and attention dropouts)
- Max sequence length: 512 tokens (configurable)
- Weight initialization: N(0, 0.02)
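The specifications above can be collected into a small config object; this is a sketch with illustrative field names, not the repo's actual configuration API:

```python
from dataclasses import dataclass

@dataclass
class GPT1Config:
    # Values mirror the GPT-1 specs listed above; field names are illustrative.
    vocab_size: int = 40482   # 40,478 BPE tokens + 4 special tokens
    d_model: int = 768        # embedding dimension
    n_layers: int = 12        # number of decoder layers (2 in this repo's training run)
    n_heads: int = 12         # attention heads
    d_ff: int = 3072          # feedforward inner dimension
    max_seq_len: int = 512
    dropout: float = 0.1
    init_std: float = 0.02    # weights drawn from N(0, 0.02)

cfg = GPT1Config()
```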
├── custom_gpt1.py # Main GPT-1 model class
├── gpt1_layer.py # Decoder layer implementation
├── gpt1_sublayers.py # Core sublayers (attention, layer norm, FFN, embeddings)
├── gpt1_utils.py # Utility functions (attention, masking, position IDs)
└── notebook/
├── training_with_prebuilt_tokenizer.ipynb # Training script
├── testing_sublayers.ipynb # Testing and validation
├── testing_eval.ipynb # Comprehensive evaluation on multiple benchmarks
└── *.pt # Model checkpoints
The main model class that:
- Stacks multiple decoder layers
- Handles token and positional embeddings
- Provides both standard forward pass and KV-cached forward pass for efficient inference
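The idea behind the KV-cached forward pass can be sketched as follows: at each decoding step, only the new token's key/value are computed and appended to the cache, and a single query attends over the whole cache (so no causal mask is needed). This is an illustrative helper, not the repo's actual API:

```python
import torch
import torch.nn.functional as F

def attend_with_cache(q_new, k_new, v_new, cache):
    # q_new/k_new/v_new: (batch, heads, 1, head_dim) for the newest token
    if cache is None:
        k_all, v_all = k_new, v_new
    else:
        # Append the new key/value to everything cached so far
        k_all = torch.cat([cache[0], k_new], dim=2)
        v_all = torch.cat([cache[1], v_new], dim=2)
    scores = q_new @ k_all.transpose(-2, -1) / q_new.size(-1) ** 0.5
    out = F.softmax(scores, dim=-1) @ v_all
    return out, (k_all, v_all)

cache = None
for _ in range(3):  # three decoding steps
    q = k = v = torch.randn(1, 2, 1, 8)
    out, cache = attend_with_cache(q, k, v, cache)
```

This turns each generation step from O(n²) attention over the full recomputed sequence into O(n) attention against cached keys and values.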
Implements a single decoder layer with:
- Multi-head self-attention with residual connection and layer norm
- Position-wise feedforward network with residual connection and layer norm
- Dropout for regularization
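A minimal sketch of such a layer, assuming the post-LN ordering of the original GPT-1 paper (sublayer, dropout, residual add, then layer norm) and using `nn.MultiheadAttention` for brevity where the repo implements attention itself:

```python
import torch
import torch.nn as nn

class DecoderLayerSketch(nn.Module):
    """Illustrative post-LN decoder layer; not the repo's gpt1_layer.py."""
    def __init__(self, d_model=768, n_heads=12, d_ff=3072, dropout=0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads,
                                          dropout=dropout, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(),
                                 nn.Linear(d_ff, d_model))
        self.ln1, self.ln2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)
        self.drop = nn.Dropout(dropout)

    def forward(self, x, attn_mask=None):
        a, _ = self.attn(x, x, x, attn_mask=attn_mask, need_weights=False)
        x = self.ln1(x + self.drop(a))          # attention sublayer + residual
        x = self.ln2(x + self.drop(self.ffn(x)))  # feedforward sublayer + residual
        return x
```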
- LearnedPositionalEmbedding: Combines token and positional embeddings
- MultiHeadSelfAttention: Implements scaled dot-product attention with multiple heads
- LayerNorm: Custom layer normalization implementation
- PositionwiseFeedForwardNeuralNetwork: Two-layer MLP with GELU activation
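To illustrate what a custom LayerNorm computes (normalize over the last dimension, then apply a learned gain and bias), here is a minimal version; the repo's implementation may differ in details such as the epsilon value:

```python
import torch
import torch.nn as nn

class LayerNormSketch(nn.Module):
    """Minimal layer normalization, for illustration only."""
    def __init__(self, d_model, eps=1e-5):
        super().__init__()
        self.gain = nn.Parameter(torch.ones(d_model))
        self.bias = nn.Parameter(torch.zeros(d_model))
        self.eps = eps

    def forward(self, x):
        mean = x.mean(dim=-1, keepdim=True)
        var = x.var(dim=-1, keepdim=True, unbiased=False)
        return self.gain * (x - mean) / torch.sqrt(var + self.eps) + self.bias
```

At initialization this matches PyTorch's built-in `nn.LayerNorm`.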
Helper functions for:
- Scaled dot-product attention computation
- Creating attention masks (padding and causal)
- Generating position IDs
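Illustrative versions of what such helpers typically compute; the names and mask conventions (True = attend vs. True = block) may differ from `gpt1_utils.py`:

```python
import torch

def causal_mask(seq_len):
    # True where attention is allowed: lower-triangular matrix
    return torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

def padding_mask(token_ids, pad_id):
    # True for real tokens, False for padding
    return token_ids != pad_id

def position_ids(token_ids):
    # 0 .. seq_len-1 for each sequence in the batch
    return torch.arange(token_ids.size(1)).unsqueeze(0).expand_as(token_ids)
```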
- ✅ Full GPT-1 architecture implementation
- ✅ KV caching for efficient autoregressive generation
- ✅ Custom layer normalization
- ✅ Causal masking for autoregressive language modeling
- ✅ Padding mask support
- ✅ Training and checkpointing utilities
- ✅ Modular design for easy experimentation
The model is trained on the Open-Phi Textbooks dataset from Hugging Face:
- Dataset: `open-phi/textbooks`
- Source: Hugging Face Datasets
- Training examples: 1,795 original samples
- Preprocessed samples: ~160,412 training sequences (after chunking)
- Content: AI-generated textbook content in markdown format
- Fields: Each example contains `topic`, `model`, `concepts`, `outline`, `markdown`, `field`, `subfield`, and `rag`
- Training data: Uses the `markdown` field, which contains the full textbook content
- Long sequences are split into chunks of 512 tokens (max sequence length)
- Each chunk becomes a separate training example
- Sequences are tokenized using OpenAI GPT tokenizer
- Special tokens added: `<start/>`, `<end/>`, `<pad/>`, `<mask/>`
- Tokenizer: OpenAI GPT tokenizer (`OpenAIGPTTokenizer`) from Hugging Face
- Vocabulary size: 40,478 + 4 special tokens = 40,482
- Encoding: Byte Pair Encoding (BPE) with 40,000 merges (matching original GPT-1)
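The chunking step described above can be sketched in a few lines; the notebook's exact logic may differ (e.g. in how it handles special tokens or the final short chunk):

```python
def chunk_token_ids(token_ids, max_len=512):
    """Split one long tokenized document into fixed-size training chunks."""
    return [token_ids[i:i + max_len] for i in range(0, len(token_ids), max_len)]

# A 1,200-token document becomes three training examples:
chunks = chunk_token_ids(list(range(1200)), max_len=512)
# -> chunks of lengths 512, 512, 176
```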
The training notebook (training_with_prebuilt_tokenizer.ipynb) includes:
- Dataset loading and preprocessing
- Model initialization with proper weight initialization
- Training loop with checkpointing
- Loss tracking and visualization
- Custom learning rate scheduler (linear warmup + cosine annealing)
- Adam optimizer with weight decay (L2 regularization)
- Batch size: 32
- Epochs: 300
- Learning rate: 2.5e-4 (max) with warmup and cosine annealing
- Warmup steps: 2,000
- Weight decay: 0.01 (L2 regularization on non-bias/gain weights)
- Dropout: 0.1
- Model size: 2 decoder layers (configurable)
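The learning-rate schedule (linear warmup to the max rate, then cosine annealing) can be sketched as a pure function of the step count. `max_lr` and `warmup_steps` match the hyperparameters above; `total_steps` is an assumed placeholder for the length of the training run:

```python
import math

def lr_at_step(step, max_lr=2.5e-4, warmup_steps=2000, total_steps=100_000):
    """Linear warmup to max_lr, then cosine annealing down to zero."""
    if step < warmup_steps:
        return max_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return max_lr * 0.5 * (1 + math.cos(math.pi * progress))
```

A schedule like this is commonly plugged into PyTorch via `torch.optim.lr_scheduler.LambdaLR`.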
The evaluation notebook (testing_eval.ipynb) provides comprehensive evaluation on multiple benchmarks:
- Perplexity: Measures how well the model predicts the next token (lower is better)
- Cross-entropy loss: Direct loss metric
- Next token prediction accuracy: Top-1 and Top-5 accuracy for token prediction
- Qualitative text generation: Sample text generation for manual inspection
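Perplexity and cross-entropy are directly related: perplexity is the exponential of the mean per-token cross-entropy (in nats). A minimal illustration:

```python
import math

def perplexity(token_nlls):
    """token_nlls: negative log-likelihoods of each target token (nats)."""
    return math.exp(sum(token_nlls) / len(token_nlls))

# A model that assigns every token probability 1/40482 (uniform over this
# repo's vocabulary) has perplexity equal to the vocabulary size:
uniform_nll = math.log(40482)
```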
- Open-Phi Textbooks Dataset
- Evaluation on held-out test set from training data
- Measures in-domain performance
- 80/20 train/test split
- WikiText-2 Benchmark
- Standard language modeling benchmark
- Provides comparison to published GPT-1 results (~40.9 perplexity for full 12-layer model)
- Evaluates on both validation and test splits
- HellaSwag Benchmark
- Commonsense reasoning benchmark
- Zero-shot evaluation (no fine-tuning required)
- Multiple-choice task: model selects best continuation from 4 options
- Original GPT-1 achieved ~78.9% accuracy (zero-shot)
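Zero-shot multiple-choice scoring typically works by summing the model's log-probabilities of each candidate ending given the context and picking the highest-scoring one. This is an illustrative sketch (normalization details, e.g. dividing by ending length, vary between evaluation setups), demonstrated with a stand-in model rather than the repo's:

```python
import torch
import torch.nn.functional as F

def score_choices(model, context_ids, endings_ids):
    scores = []
    for ending in endings_ids:
        ids = torch.tensor([context_ids + ending])
        logp = F.log_softmax(model(ids), dim=-1)  # (1, seq_len, vocab)
        # Sum log-probs of each ending token, predicted from the previous position
        total = sum(logp[0, pos - 1, ids[0, pos]].item()
                    for pos in range(len(context_ids), ids.size(1)))
        scores.append(total)
    return max(range(len(scores)), key=scores.__getitem__)

# Stand-in "model" that always strongly predicts token 3, for demonstration:
def dummy_model(ids):
    logits = torch.zeros(1, ids.size(1), 5)
    logits[..., 3] = 5.0
    return logits

best = score_choices(dummy_model, [1, 2], [[3, 3], [4, 4]])
```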
- Multi-checkpoint evaluation (epochs 0, 30, 60, 90, 120)
- Visualization of results over training epochs
- Comparison to original GPT-1 performance where applicable
- The current implementation does not include weight tying between the embedding and output layers (as noted in the code comments)
- The model follows the original GPT-1 paper specifications closely
- Checkpoints are saved during training for model persistence
- Evaluation includes both in-domain (open-phi) and out-of-domain (WikiText-2, HellaSwag) benchmarks
- The 2-layer model will have higher perplexity than the full 12-layer GPT-1, which is expected given the smaller model size
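Since weight tying is called out as absent, for reference here is the one-line change it would involve in a typical PyTorch setup (module names here are illustrative, not the repo's):

```python
import torch.nn as nn

vocab_size, d_model = 40482, 768
tok_emb = nn.Embedding(vocab_size, d_model)
lm_head = nn.Linear(d_model, vocab_size, bias=False)
# Weight tying: the output projection shares the embedding matrix,
# as in the original GPT-1 (this repo intentionally leaves them separate).
lm_head.weight = tok_emb.weight
```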
See SETUP.md for detailed installation and setup instructions.
- PyTorch
- Transformers (Hugging Face)
- Datasets (Hugging Face)
- NumPy
- Pandas
- Matplotlib
- tqdm
- Jupyter Notebook
See requirements.txt for a complete list of dependencies.