# Word2Vec-From-Scratch

A PyTorch implementation of the Word2Vec Skip-gram model with negative sampling, built entirely from scratch for educational purposes. This project demonstrates the core concepts behind word embeddings without relying on pre-built NLP libraries.
- Overview
- Architecture
- Technical Background
- Project Structure
- Installation
- Usage
- Configuration
- Training Process
- Results
- Limitations
- Future Improvements
- License
- References
## Overview

Word2Vec is a family of neural network models that learn dense vector representations (embeddings) of words from large corpora of text. These embeddings capture semantic relationships between words, enabling tasks such as analogy completion, similarity measurement, and downstream NLP applications.
This implementation focuses on the Skip-gram variant with negative sampling, where the model learns to predict context words given a center word while distinguishing true context words from randomly sampled negative examples.
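Concretely, skip-gram turns each sentence into (center, context) training pairs before any negatives are drawn. A minimal sketch of that pair extraction (the `make_pairs` helper and `window` parameter here are illustrative names, not the project's actual code):

```python
def make_pairs(tokens, window):
    # For each center word, pair it with every word within `window` positions.
    pairs = []
    for i, center in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + window + 1]
        pairs.extend((center, ctx) for ctx in context)
    return pairs

pairs = make_pairs(["a", "dog", "runs"], window=1)
# → [("a", "dog"), ("dog", "a"), ("dog", "runs"), ("runs", "dog")]
```

Each pair becomes a positive example; negatives are then sampled per center word, as described below.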
## Architecture

The model uses a single shared embedding dictionary (`embds`) for all words. Both center words and context words are looked up from this same dictionary. The dot product between embeddings measures similarity, and BCE loss trains the model to score positive pairs high and negative pairs low.
## Technical Background

For each word in the corpus, the model:
- Takes the center word and looks up its embedding from `embds`
- Extracts context words within a window and looks up their embeddings
- Samples random negative words from the vocabulary
- Computes dot product similarity between center and all context/negative embeddings
- Uses BCE loss to push positive pairs together and negative pairs apart
Loss function:

```
L = -[y * log(σ(e_center @ e_positive.T)) + (1 - y) * log(1 - σ(e_center @ e_negative.T))]
```
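The same objective can be checked numerically with PyTorch's built-in `BCEWithLogitsLoss`, which applies the sigmoid internally and is numerically stabler than computing `log(σ(·))` by hand. The toy tensors below are illustrative, not taken from the project:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
center = torch.randn(8)         # e_center
contexts = torch.randn(4, 8)    # 2 positive + 2 negative context embeddings
labels = torch.tensor([1., 1., 0., 0.])

logits = center @ contexts.T    # dot-product similarity scores, shape (4,)
loss = nn.BCEWithLogitsLoss()(logits, labels)

# Identical to the formula above, written out with the sigmoid explicitly:
p = torch.sigmoid(logits)
manual = -(labels * torch.log(p) + (1 - labels) * torch.log(1 - p)).mean()
assert torch.allclose(loss, manual)
```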
**Embedding Initialization** (`model.py`):

```python
embds = {word: torch.randn((EMB_DIM), device=DEVICE, requires_grad=True) for word in vocab}
```

**Forward Pass** (`model.py`):

```python
class Word2Vec(nn.Module):
    def __init__(self):
        super().__init__()

    def forward(self, center_emb, y_embds):
        return center_emb @ y_embds.T
```

## Project Structure

```
Word2Vec-From-Scratch/
├── Word2Vec/
│   ├── model.py          # Model definition and embedding initialization
│   ├── train.py          # Training loop
│   ├── data_prep.py      # Data loading and preprocessing
│   ├── utils.py          # Config and utility functions
│   ├── check_top_k.py    # Top-K nearest words checking
│   └── check_sim.py      # Word similarity checking
├── data/
│   └── train/
│       └── data-00000-of-00001.arrow
├── checkpoint.md
├── images/
├── .gitignore
├── LICENSE
└── README.md
```
| File | Description |
|---|---|
| `model.py` | Defines the `Word2Vec` class, initializes `embds`, runs training |
| `train.py` | Training loop with context extraction and negative sampling |
| `data_prep.py` | Loads Flickr30k, removes punctuation, builds sorted vocabulary |
| `utils.py` | Hyperparameters and utilities (`sample_negatives`, `euc_dist`, `cosine_sim`) |
| `check_top_k.py` | Top-K nearest words checking |
| `check_sim.py` | Computes similarity between `WORD_A` and `WORD_B` |
## Installation

```shell
git clone https://github.com/franciszekparma/Word2Vec-From-Scratch.git
cd Word2Vec-From-Scratch
pip install torch numpy tqdm datasets
```

| Package | Purpose |
|---|---|
| `torch` | Neural network framework |
| `numpy` | Numerical operations |
| `tqdm` | Progress bars |
| `datasets` | HuggingFace data loading |
## Usage

Train the model:

```shell
cd Word2Vec
python model.py
```

Configure the words to compare in `utils.py`:

```python
WORD_A = "hard"
WORD_B = "work"
```

Then run:

```shell
python check_sim.py
```

Output:

```
Euclidean distance between "hard" and "work": 0.1234
Cosine Similarity between "hard" and "work": 0.8765
```
Use the trained embeddings directly:

```python
from model import embds
from utils import cosine_sim, euc_dist

# Get the embedding of a single word
vec = embds["dog"]

# Compare two words
emb_a = embds["cat"].unsqueeze(0)
emb_b = embds["dog"].unsqueeze(0)
print(f"Cosine similarity: {cosine_sim(emb_a, emb_b):.4f}")
```

## Configuration

Hyperparameters in `utils.py`:
| Parameter | Default | Recommended | Description |
|---|---|---|---|
| `EMB_DIM` | 128 | 128-300 | Embedding dimensions |
| `WINDOW` | 5 | 5-10 | Context window size |
| `EPOCHS` | 32 | 10-50 | Training epochs |
| `WEIGHT_DECAY` | 1e-2 | 1e-2 | Weight decay |
| `LR` | 1e-4 | 1e-4 | Learning rate |
| `SEED` | 24 | - | Random seed |
| `DEVICE` | auto | cuda | Tensor storage location |
| `SHOW_DATA_STATS` | False | True | Show stats about the data |
| `LOWER_WORDS` | True | True | Lowercase all words |
| `LOAD_CHECKPOINT` | False | - | Load a model checkpoint |
| `PATH_CHECHPOINT` | "" | - | Checkpoint file path |
| `WORD_A` | "man" | "boy" | Comparison word A |
| `WORD_B` | "woman" | "girl" | Comparison word B |
| `TOPK_WORD` | "road" | "water" | Query word for top-K |
| `K` | 3 | 5 | Number of neighbors (K) |
## Training Process

The training loop (from `train.py`):

```python
for epoch in tqdm(range(epochs)):
    losses = []
    for w_l, word_list in enumerate(all_words_in_sen):
        for w, word in enumerate(word_list):
            # Context words inside the window: what the center word should be similar to
            y_pos_words = word_list[max(0, w - WINDOW) : w] + word_list[w + 1 : w + WINDOW + 1]
            if len(y_pos_words) == 0:
                continue
            y_pos_embds = [embds[x] for x in y_pos_words]
            y_pos_labels = [1 for _ in range(len(y_pos_words))]

            # Random negatives drawn from the vocabulary
            y_neg_words = sample_negatives(vocab, n=2 * WINDOW, context_words=y_pos_words, center_word=word)
            y_neg_embds = [embds[x] for x in y_neg_words]
            y_neg_labels = [0 for _ in range(len(y_neg_words))]

            y_embds = torch.stack(y_pos_embds + y_neg_embds)
            y_labels = torch.tensor(y_pos_labels + y_neg_labels, device=DEVICE, dtype=torch.float32)

            center_emb = embds[word]  # embedding of the current center word
            y_preds = model(center_emb, y_embds)
            loss = loss_fn(y_preds, y_labels)
            losses.append(loss.item())

            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```

## Results

With sufficient training (`EMB_DIM=128+`, `EPOCHS=20+`), semantically similar words cluster together in embedding space.
The distance and similarity utilities (from `utils.py`):

```python
def euc_dist(emb1, emb2):
    # Euclidean (L2) distance between two embeddings
    return ((torch.sum((emb1 - emb2) ** 2)) ** 0.5).item()

def cosine_sim(emb1, emb2):
    # Cosine similarity: normalized dot product
    return ((emb1 @ emb2.T) / (torch.norm(emb1) * torch.norm(emb2))).item()
```

## Limitations

| Limitation | Solution |
|---|---|
| Uniform negative sampling | Frequency-based sampling (f^0.75) |
| No subsampling | Probabilistic downsampling |
| Dictionary storage | nn.Embedding layer |
| Sequential training | Mini-batching |
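The frequency-based sampling fix can be sketched in a few lines: raise unigram counts to the 0.75 power and draw negatives from the resulting distribution, so frequent words still dominate but rare words get a boost. The `neg_sampler` helper and the toy counts below are illustrative, not part of the project:

```python
import numpy as np

def neg_sampler(word_counts):
    # Unigram distribution raised to the 3/4 power, as in Mikolov et al. (2013).
    words = list(word_counts)
    probs = np.array([word_counts[w] for w in words], dtype=np.float64) ** 0.75
    probs /= probs.sum()
    return lambda n: list(np.random.choice(words, size=n, p=probs))

sample = neg_sampler({"the": 1000, "dog": 50, "zebra": 2})
negatives = sample(5)  # e.g. mostly "the"/"dog", occasionally "zebra"
```

This would slot in as a drop-in replacement for the uniform `sample_negatives` in `utils.py`.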
## Future Improvements

- Frequency-based negative sampling
- Subsampling frequent words
- Batch processing
- `nn.Embedding` layer
- Evaluation benchmarks (SimLex-999)
- CBOW variant
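Subsampling frequent words, listed above, can be sketched with the discard rule from the Word2Vec paper, where a word with corpus frequency f(w) is dropped with probability 1 − sqrt(t / f(w)). The `keep_prob` helper and the threshold `t = 1e-3` below are illustrative choices:

```python
import math

def keep_prob(freq, t=1e-3):
    # Probability of keeping a word whose relative corpus frequency is `freq`.
    # Words rarer than the threshold are always kept.
    return min(1.0, math.sqrt(t / freq))

print(keep_prob(0.05))   # very frequent word (e.g. "the") → kept only ~14% of the time
print(keep_prob(1e-5))   # rare word → kept with probability 1.0
```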
## References

- Mikolov et al. (2013). *Efficient Estimation of Word Representations in Vector Space*
- Mikolov et al. (2013). *Distributed Representations of Words and Phrases and their Compositionality*
- Flickr30k Captions Dataset
## License

This project is licensed under the MIT License.
© franciszekparma