ktk-07/GPT1_For_Fun

GPT-1 Implementation from Scratch

This is my attempt at implementing a PyTorch version of GPT-1 (Generative Pre-trained Transformer) from scratch, based on the original paper "Improving Language Understanding by Generative Pre-Training" by OpenAI.

Note:

I am not a researcher, and this implementation is not perfect; it is a learning project to test out and understand how GPT-1 works.

Overview

This project implements the GPT-1 architecture from the ground up, including all core components:

  • Decoder-only transformer architecture with stacked decoder layers
  • Multi-head self-attention with causal masking
  • Layer normalization (custom implementation)
  • Position-wise feedforward neural networks with GELU activation
  • Learned positional embeddings (instead of static sinusoidal embeddings)
  • KV caching for efficient inference

Architecture Details

The implementation follows the GPT-1 specifications:

  • Embedding dimension: 768 (configurable)
  • Number of layers: configurable (the original GPT-1 uses 12)
  • Attention heads: configurable (the original GPT-1 uses 12)
  • Feedforward inner dimension: 3072 (configurable)
  • Activation function: GELU
  • Dropout: 0.1 for regularization (residual, embedding, and attention dropouts)
  • Max sequence length: 512 tokens (configurable)
  • Weight initialization: N(0, 0.02)
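For reference, these hyperparameters can be gathered into one configuration object. A minimal sketch (field names here are illustrative and may not match the repo's actual code):

```python
from dataclasses import dataclass

@dataclass
class GPT1Config:
    """Illustrative hyperparameter bundle mirroring the GPT-1 defaults above."""
    vocab_size: int = 40482   # 40,478 BPE tokens + 4 special tokens
    d_model: int = 768        # embedding dimension
    n_layers: int = 12        # decoder layers (this repo trains with 2)
    n_heads: int = 12         # attention heads
    d_ff: int = 3072          # feedforward inner dimension
    max_seq_len: int = 512    # maximum sequence length
    dropout: float = 0.1      # residual/embedding/attention dropout
    init_std: float = 0.02    # weights drawn from N(0, 0.02)
```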

Project Structure

├── custom_gpt1.py          # Main GPT-1 model class
├── gpt1_layer.py           # Decoder layer implementation
├── gpt1_sublayers.py       # Core sublayers (attention, layer norm, FFN, embeddings)
├── gpt1_utils.py           # Utility functions (attention, masking, position IDs)
└── notebook/
    ├── training_with_prebuilt_tokenizer.ipynb  # Training script
    ├── testing_sublayers.ipynb                # Testing and validation
    ├── testing_eval.ipynb                     # Comprehensive evaluation on multiple benchmarks
    └── *.pt                                   # Model checkpoints

Key Components

Custom_GPT1 (custom_gpt1.py)

The main model class that:

  • Stacks multiple decoder layers
  • Handles token and positional embeddings
  • Provides both standard forward pass and KV-cached forward pass for efficient inference
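The cached path follows the usual autoregressive decoding pattern: the prompt is processed once, and each later call feeds only the newest token while attention reads keys and values from the cache. A sketch of the loop, where `step_fn` is a stand-in for the model's KV-cached forward pass (not the repo's actual API):

```python
def generate(step_fn, prompt_ids, n_new):
    """Autoregressive decoding loop. step_fn(ids, cache) -> (next_id, cache)
    stands in for a KV-cached forward pass: the first call processes the
    whole prompt, later calls feed only the single newest token."""
    cache = None
    ids = list(prompt_ids)
    next_id, cache = step_fn(ids, cache)       # full prompt, cache filled
    for _ in range(n_new - 1):
        ids.append(next_id)
        next_id, cache = step_fn([next_id], cache)  # one token per step
    ids.append(next_id)
    return ids
```

With `step_fn` bound to the real model, this avoids recomputing keys and values for the whole prefix at every step.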

GPT1_DecoderLayer (gpt1_layer.py)

Implements a single decoder layer with:

  • Multi-head self-attention with residual connection and layer norm
  • Position-wise feedforward network with residual connection and layer norm
  • Dropout for regularization

Sublayers (gpt1_sublayers.py)

  • LearnedPositionalEmbedding: Combines token and positional embeddings
  • MultiHeadSelfAttention: Implements scaled dot-product attention with multiple heads
  • LayerNorm: Custom layer normalization implementation
  • PositionwiseFeedForwardNeuralNetwork: Two-layer MLP with GELU activation
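As a reference for what the custom LayerNorm computes, here is the per-feature-vector math in plain Python (`gain` and `bias` play the role of the module's learned parameters; names are illustrative):

```python
import math

def layer_norm(x, gain, bias, eps=1e-5):
    """Normalize a feature vector to zero mean and unit variance,
    then apply a learned scale (gain) and shift (bias)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [gain * (v - mean) / math.sqrt(var + eps) + bias for v in x]
```

The learned gain and bias let the network undo the normalization wherever that is useful.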

Utilities (gpt1_utils.py)

Helper functions for:

  • Scaled dot-product attention computation
  • Creating attention masks (padding and causal)
  • Generating position IDs
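The masking and position-ID helpers are simple enough to sketch directly (function names are illustrative, not the repo's actual API):

```python
def causal_mask(seq_len):
    """Lower-triangular mask: position i may attend to positions j <= i."""
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]

def position_ids(seq_len, past_len=0):
    """Position indices for the current tokens. With KV caching, new
    tokens continue counting from the length of the cached prefix."""
    return list(range(past_len, past_len + seq_len))
```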

Features

  • ✅ Full GPT-1 architecture implementation
  • ✅ KV caching for efficient autoregressive generation
  • ✅ Custom layer normalization
  • ✅ Causal masking for autoregressive language modeling
  • ✅ Padding mask support
  • ✅ Training and checkpointing utilities
  • ✅ Modular design for easy experimentation

Dataset

The model is trained on the Open-Phi Textbooks dataset from Hugging Face:

  • Dataset: open-phi/textbooks
  • Source: Hugging Face Datasets
  • Training examples: 1,795 original samples
  • Preprocessed samples: ~160,412 training sequences (after chunking)
  • Content: AI-generated textbook content in markdown format
  • Fields: Each example contains topic, model, concepts, outline, markdown, field, subfield, and rag
  • Training data: Uses the markdown field which contains full textbook content

Preprocessing

  • Long sequences are split into chunks of 512 tokens (max sequence length)
  • Each chunk becomes a separate training example
  • Sequences are tokenized using OpenAI GPT tokenizer
  • Special tokens added: <start/>, <end/>, <pad/>, <mask/>
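The chunking step amounts to a fixed-stride split of each tokenized document (a sketch, not the notebook's exact code):

```python
def chunk_tokens(token_ids, max_len=512):
    """Split one long tokenized document into max_len-sized chunks;
    each chunk becomes an independent training example."""
    return [token_ids[i:i + max_len] for i in range(0, len(token_ids), max_len)]
```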

Tokenizer

  • Tokenizer: OpenAI GPT tokenizer (OpenAIGPTTokenizer) from Hugging Face
  • Vocabulary size: 40,478 + 4 special tokens = 40,482
  • Encoding: Byte Pair Encoding (BPE) with 40,000 merges (matching original GPT-1)

Training

The training notebook (training_with_prebuilt_tokenizer.ipynb) includes:

  • Dataset loading and preprocessing
  • Model initialization with proper weight initialization
  • Training loop with checkpointing
  • Loss tracking and visualization
  • Custom learning rate scheduler (linear warmup + cosine annealing)
  • Adam optimizer with weight decay (L2 regularization)
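The learning-rate schedule itself is easy to state: linear warmup to the peak rate, then cosine decay. A sketch (the `total_steps` value and a decay floor of zero are assumptions; the notebook may differ):

```python
import math

def lr_at_step(step, max_lr=2.5e-4, warmup=2000, total_steps=100_000):
    """Linear warmup to max_lr over `warmup` steps, then cosine
    annealing down to zero by `total_steps` (a placeholder value)."""
    if step < warmup:
        return max_lr * step / warmup
    progress = (step - warmup) / (total_steps - warmup)
    return max_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
```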

Training Configuration

  • Batch size: 32
  • Epochs: 300
  • Learning rate: 2.5e-4 (max) with warmup and cosine annealing
  • Warmup steps: 2,000
  • Weight decay: 0.01 (L2 regularization on non-bias/gain weights)
  • Dropout: 0.1
  • Model size: 2 decoder layers (configurable)

Evaluation

The evaluation notebook (testing_eval.ipynb) provides comprehensive evaluation on multiple benchmarks:

Evaluation Metrics

  • Perplexity: Measures how well the model predicts the next token (lower is better)
  • Cross-entropy loss: Direct loss metric
  • Next token prediction accuracy: Top-1 and Top-5 accuracy for token prediction
  • Qualitative text generation: Sample text generation for manual inspection
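Perplexity follows directly from the mean cross-entropy loss (in nats):

```python
import math

def perplexity(mean_cross_entropy):
    """Perplexity = exp(mean per-token cross-entropy in nats).
    As a sanity check, a model that guesses uniformly over a
    vocabulary of V tokens scores perplexity exactly V."""
    return math.exp(mean_cross_entropy)
```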

Benchmarks Evaluated

  1. Open-Phi Textbooks Dataset

    • Evaluation on held-out test set from training data
    • Measures in-domain performance
    • 80/20 train/test split
  2. WikiText-2 Benchmark

    • Standard language modeling benchmark
    • Provides comparison to published GPT-1 results (~40.9 perplexity for full 12-layer model)
    • Evaluates on both validation and test splits
  3. HellaSwag Benchmark

    • Commonsense reasoning benchmark
    • Zero-shot evaluation (no fine-tuning required)
    • Multiple-choice task: model selects best continuation from 4 options
    • For reference, the HellaSwag paper reports roughly 41.7% accuracy for the original GPT model (random chance is 25%)
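The zero-shot multiple-choice protocol scores each candidate ending by the model's average per-token negative log-likelihood given the context, then picks the lowest-loss ending. A sketch with the model-backed scorer left as a placeholder (`avg_nll` is hypothetical):

```python
def pick_continuation(context, endings, avg_nll):
    """Zero-shot multiple choice: avg_nll(context, ending) is a placeholder
    for running the model over context + ending and averaging the
    cross-entropy over the ending's tokens; lowest loss wins."""
    scores = [avg_nll(context, e) for e in endings]
    return min(range(len(endings)), key=scores.__getitem__)
```

Averaging over the ending's tokens (rather than summing) keeps longer candidates from being penalized just for their length.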

Evaluation Features

  • Multi-checkpoint evaluation (epochs 0, 30, 60, 90, 120)
  • Visualization of results over training epochs
  • Comparison to original GPT-1 performance where applicable

Notes

  • The current implementation does not include weight tying between the embedding and output layers (as noted in the code comments)
  • The model follows the original GPT-1 paper specifications closely
  • Checkpoints are saved during training for model persistence
  • Evaluation includes both in-domain (open-phi) and out-of-domain (WikiText-2, HellaSwag) benchmarks
  • The 2-layer model will have higher perplexity than the full 12-layer GPT-1, which is expected given the smaller model size

Quick Start

See SETUP.md for detailed installation and setup instructions.

Requirements

  • PyTorch
  • Transformers (Hugging Face)
  • Datasets (Hugging Face)
  • NumPy
  • Pandas
  • Matplotlib
  • tqdm
  • Jupyter Notebook

See requirements.txt for a complete list of dependencies.

About

Just an implementation of the original GPT-1 paper. However, I have already found errors in it.
