# False Friends Are Not Foes: Investigating Vocabulary Overlap in Multilingual Language Models

Official codebase for the paper (Kallini et al., 2025).
This repository contains the code and scripts for reproducing the experiments in our paper, which systematically investigates how subword vocabulary overlap affects cross-lingual transfer in bilingual language models.
Multilingual tokenizers naturally produce overlapping tokens. We explore whether such overlap facilitates or hinders transfer by training bilingual autoregressive models across six language pairs and under four controlled overlap settings:
- **Full Overlap**: all naturally overlapping tokens are shared
- **High-Similarity Overlap**: only semantically similar tokens are shared
- **Low-Similarity Overlap**: only semantically dissimilar tokens are shared
- **No Overlap**: vocabularies are made fully disjoint
The overlap settings are implemented by remapping token IDs so that only the chosen subset is shared; all other tokens are offset and therefore disjoint between the two languages. We train on CCMatrix bilingual data covering diverse language families and scripts: English is paired with each target language, and sentences are interleaved during pre-training. The language pairs we consider are EN–ES, EN–DE, EN–TR, EN–ZH, EN–AR, and EN–SW.
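The ID remapping described above can be illustrated with a minimal sketch. This is a hypothetical helper, not the repository's actual implementation: tokens in the chosen shared subset keep a single ID, while every other target-language token is offset past the source vocabulary, so the two vocabularies are disjoint outside the shared set.

```python
def remap_ids(vocab_size, shared_ids):
    """Build per-language token-ID maps: shared tokens keep their original ID;
    every other target-language token is offset past the source vocabulary,
    making the two vocabularies disjoint outside the shared set."""
    shared = set(shared_ids)
    src_map = {i: i for i in range(vocab_size)}  # source language keeps all IDs
    tgt_map = {}
    offset = vocab_size                          # first free ID slot
    for i in range(vocab_size):
        if i in shared:
            tgt_map[i] = i                       # overlapping token: same ID
        else:
            tgt_map[i] = offset                  # disjoint token: fresh ID
            offset += 1
    return src_map, tgt_map

# Toy example: vocabulary of 5 tokens, only tokens 1 and 3 shared
src_map, tgt_map = remap_ids(5, [1, 3])
# tgt_map -> {0: 5, 1: 1, 2: 6, 3: 3, 4: 7}
```

With `[1, 3]` shared, the only IDs the two languages have in common are 1 and 3; setting the shared set to the full overlap, a similarity-filtered subset, or the empty set yields the four overlap settings.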
Our findings:
- Overlap creates embedding spaces that better capture cross-lingual relationships.
- Any overlap improves cross-lingual transfer compared to disjoint vocabularies.
- Overlapping semantically similar tokens is most beneficial to transfer.
If you use our code, please cite our paper:

```bibtex
@inproceedings{kallini2025false,
  title={False Friends Are Not Foes: Investigating Vocabulary Overlap in Multilingual Language Models},
  author={Julie Kallini and Dan Jurafsky and Christopher Potts and Martijn Bartelds},
  booktitle={The 2025 Conference on Empirical Methods in Natural Language Processing},
  year={2025},
  url={https://openreview.net/forum?id=mIpRFuCa2h}
}
```

## Setup

Clone the repo and install dependencies:
```bash
git clone https://github.com/jkallini/false-friends.git
cd false-friends
conda create -n falsefriends python=3.9
conda activate falsefriends
pip install -r requirements.txt
```

Open `utils.py` and set the following fields:
- `BASE_PATH`: root for datasets, preprocessed files, and model checkpoints.
- `CCMATRIX_RAW_PATH`: directory containing gzipped TSV files named `{src}-{tgt}.tsv.gz`. We recommend downloading these via the Amazon CCMatrix helper (amazon-science/multi-way-parallel-ccmatrix).
- `TOKENIZER_NAMES`: base tokenizers to consider for analysis (e.g., `xlmr`).
- `LANGUAGE_PAIRS`: list of `{src}-{tgt}` pairs to process and train (e.g., `en-es`, `en-de`, `en-tr`, `en-zh`, `en-ar`, `en-sw`).
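A filled-in `utils.py` might look like the sketch below. The field names are from the list above; the concrete paths are placeholders you should replace with locations on your own machine.

```python
# utils.py -- example configuration (paths are placeholders; adjust to your setup)
BASE_PATH = "/data/false-friends"       # datasets, preprocessed files, checkpoints
CCMATRIX_RAW_PATH = "/data/ccmatrix"    # gzipped TSVs named {src}-{tgt}.tsv.gz
TOKENIZER_NAMES = ["xlmr"]              # base tokenizers used for analysis
LANGUAGE_PAIRS = ["en-es", "en-de", "en-tr", "en-zh", "en-ar", "en-sw"]
```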
## Data pipeline

Throughout, the placeholders `{src}`, `{tgt}`, `{tokenizer_name}`, `{setting}`, and `{split}` are used consistently.
```bash
python tokenize_ccmatrix.py {src} {tgt} {tokenizer_name}
```

Tokenizes CCMatrix `{src}-{tgt}.tsv.gz` with `{tokenizer_name}`, producing tokenized shards and vocabulary statistics.
```bash
python get_overlap_sets.py
```

Computes the native overlap tokens between `{src}` and `{tgt}`.
```bash
python extract_occurrences.py \
  --src {src} \
  --tgt {tgt} \
  --tokenizer_name {tokenizer_name} \
  --sentence_N 400_000_000 \
  --token_list overlap_tokens/{src}-{tgt}-{tokenizer_name}.txt
```

Samples occurrences of each overlapping token to enable similarity scoring.
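Sampling a fixed number of occurrences per token from a corpus too large to hold in memory can be done in one pass with reservoir sampling. The sketch below is illustrative only; `extract_occurrences.py` may use a different strategy, and the function and stream format here are hypothetical.

```python
import random

def sample_occurrences(stream, target_ids, k=100, seed=0):
    """Reservoir-sample up to k occurrence positions per token of interest
    from a stream of (sentence_idx, token_id) pairs, in a single pass."""
    rng = random.Random(seed)
    seen = {t: 0 for t in target_ids}       # occurrences observed so far
    samples = {t: [] for t in target_ids}   # current reservoir per token
    for sent_idx, tok in stream:
        if tok not in samples:
            continue
        seen[tok] += 1
        if len(samples[tok]) < k:
            samples[tok].append(sent_idx)   # reservoir not yet full
        else:
            j = rng.randrange(seen[tok])    # replace with prob k / seen
            if j < k:
                samples[tok][j] = sent_idx
    return samples

# Toy stream: 1000 positions, token IDs cycling through 0, 1, 2
stream = [(i, i % 3) for i in range(1000)]
s = sample_occurrences(stream, [0, 1], k=5)
# each reservoir holds exactly 5 sentence indices for its token
```

Each token's reservoir is a uniform sample of its occurrences, which keeps memory bounded even with `--sentence_N` in the hundreds of millions.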
```bash
python analyze_token_similarity.py pretrained \
  --src {src} \
  --tgt {tgt} \
  --model_name {tokenizer_name} \
  --layer_index {layer_index}
```

Computes static embeddings for overlapping tokens and ranks them by cross-lingual similarity (defaults: `xlmr`, layer 5).
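The similarity score reduces to a cosine between mean-pooled contextual embeddings. A numpy sketch, with random arrays standing in for the model's hidden states (the function name and array shapes are illustrative, not the script's actual API):

```python
import numpy as np

def token_similarity(src_contexts, tgt_contexts):
    """Cosine similarity between a token's mean-pooled contextual embeddings
    in the source and target language (one row per sampled context)."""
    src_vec = src_contexts.mean(axis=0)   # "static" embedding for {src}
    tgt_vec = tgt_contexts.mean(axis=0)   # "static" embedding for {tgt}
    return float(np.dot(src_vec, tgt_vec) /
                 (np.linalg.norm(src_vec) * np.linalg.norm(tgt_vec)))

rng = np.random.default_rng(0)
src = rng.normal(size=(100, 768))         # 100 sampled contexts, hidden size 768
tgt = src + rng.normal(scale=0.1, size=(100, 768))  # slightly perturbed copy
sim = token_similarity(src, tgt)
# near 1.0, since the two sets of contexts are almost identical
```

A token whose contexts in the two languages yield nearby mean vectors scores high; "false friends" with divergent usage score low.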
```bash
python process_data_for_training.py \
  --src {src} \
  --tgt {tgt} \
  --tokenizer_name {tokenizer_name} \
  --setting {setting} \
  --split {split}
```

## Training

Launch pre-training as a batch job:

```bash
sbatch training/train_array.slurm training/args/pretraining/{args_file}
```

Or run directly:

```bash
accelerate launch train.py \
  --src {src} \
  --tgt {tgt} \
  --tokenizer_name {tokenizer_name} \
  --setting {setting} \
  --random_seed {seed} \
  --rope \
  --resume_from_checkpoint
```

Model: GPT-2-style decoder-only Transformer with 12 layers, 12 heads, and d=768, trained for 100k steps with AdamW and a cosine learning-rate schedule; the effective batch is 64×1024 tokens.
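The training budget implied by these hyperparameters is easy to check (pure arithmetic; the 64×1024 figure comes from the description above):

```python
steps = 100_000              # optimizer steps
sequences_per_step = 64      # effective batch size, in sequences
seq_len = 1024               # tokens per sequence
tokens_per_step = sequences_per_step * seq_len
total_tokens = steps * tokens_per_step
print(tokens_per_step, total_tokens)  # 65536 6553600000
```

So each model sees roughly 6.6B tokens over the course of pre-training.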
## Analysis

```bash
python select_similar_tokens.py --src {src} --tgt {tgt} --tokenizer_name {tokenizer_name}
```

- Most/least similar:
```bash
python analyze_token_similarity.py custom \
  --setting {setting} \
  --model_seed {seed} \
  --src {src} \
  --tgt {tgt} \
  --layer_index {layer_index} \
  --tokens_file {tokens_file}
```

- Random pairs:
```bash
python analyze_token_similarity.py custom \
  --setting {setting} \
  --model_seed {seed} \
  --src {src} \
  --tgt {tgt} \
  --uid random \
  --layer_index {layer_index} \
  --tokens_file {all_tokens_file} \
  --random_pairs 500
```

Or run in batch with:
```bash
sbatch training/analysis_array.slurm {args_file}
```

## Fine-tuning

We fine-tune pretrained models on English and evaluate zero-shot on `{tgt}`:

- XNLI: MultiNLI → XNLI (`num_train_epochs=5`, `save_steps=500`)
- XQuAD: SQuAD → XQuAD (`num_train_epochs=7`, `save_steps=200`)
Args file (`training/args/finetune_args.txt`):

```text
TASK  SRC TGT TOKENIZER SETTING      MODEL_SEED RANDOM_SEED BATCH_SIZE LEARNING_RATE EVAL_STEPS NUM_EPOCHS DEVICE_BS
xnli  en  es  xlmr      full_overlap 21         21          256        5e-5          500        5          64
xquad en  ar  xlmr      full_overlap 21         21          128        5e-5          200        7          16
```
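Each row of the args file is whitespace-separated and positional, so a minimal parser is just a zip over the column names (this sketch is illustrative; the slurm script's actual parsing may differ):

```python
def parse_finetune_args(line):
    """Parse one row of the whitespace-separated fine-tuning args file
    into a dict keyed by the column names in the header."""
    fields = ["task", "src", "tgt", "tokenizer", "setting", "model_seed",
              "random_seed", "batch_size", "learning_rate", "eval_steps",
              "num_epochs", "device_bs"]
    return dict(zip(fields, line.split()))

row = parse_finetune_args("xnli en es xlmr full_overlap 21 21 256 5e-5 500 5 64")
# row["task"] == "xnli", row["learning_rate"] == "5e-5", row["device_bs"] == "64"
```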
Launch:

```bash
sbatch training/finetune_sweep.slurm training/args/finetune_args.txt
```

XNLI:
```bash
python3 finetune.py \
  --task xnli \
  --src {src} \
  --tgt {tgt} \
  --tokenizer_name {tokenizer_name} \
  --setting {setting} \
  --effective_batch_size 256 \
  --per_device_train_batch_size 64 \
  --num_train_epochs 5 \
  --save_steps 500 \
  --eval_steps 500 \
  --learning_rate 5e-5 \
  --model_seed 21 \
  --random_seed 21 \
  --resume_from_checkpoint
```

XQuAD:
```bash
python3 finetune.py \
  --task xquad \
  --src {src} \
  --tgt {tgt} \
  --tokenizer_name {tokenizer_name} \
  --setting {setting} \
  --effective_batch_size 128 \
  --per_device_train_batch_size 16 \
  --num_train_epochs 7 \
  --save_steps 200 \
  --eval_steps 200 \
  --learning_rate 5e-5 \
  --model_seed 21 \
  --random_seed 21 \
  --resume_from_checkpoint
```

## Released data

We release the overlap sets and their similarity rankings used in our experiments:
- Overlap token lists: `datasets/overlap_tokens/{src}-{tgt}-{tokenizer_name}.txt`
  Each file contains the native overlap O = V_src ∩ V_tgt under the XLM-R tokenizer.
- Similarity scores: `datasets/similarities/pretrained_{src}_{tgt}_{tokenizer_name}.tsv`
  Each file has lines of the form `token_id token_string similarity_score`, where `similarity_score` is the cosine similarity between `{src}` and `{tgt}` contextual embeddings (mean-pooled across 100 sampled contexts).
These files allow you to:

- Reproduce the High- and Low-Similarity overlap splits (`O_hi`, `O_lo`).
- Inspect which tokens are most/least semantically aligned across languages.
- Run your own analyses with `select_similar_tokens.py` or custom thresholds.
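Reproducing a similarity split from the released TSVs can be sketched as follows. The half-and-half split here is illustrative; the exact selection rule lives in `select_similar_tokens.py` and may differ.

```python
def split_by_similarity(rows, frac=0.5):
    """Split (token_id, token_string, score) rows into high- and
    low-similarity sets by ranking on the score column. `frac` controls
    the size of each half; the paper's exact rule may differ."""
    ranked = sorted(rows, key=lambda r: r[2], reverse=True)
    k = int(len(ranked) * frac)
    return ranked[:k], ranked[-k:]    # O_hi, O_lo

# Toy rows in the TSV's (token_id, token_string, similarity_score) shape
rows = [(1, "de", 0.9), (2, "la", 0.2), (3, "no", 0.7), (4, "a", 0.1)]
o_hi, o_lo = split_by_similarity(rows)
# o_hi -> [(1, 'de', 0.9), (3, 'no', 0.7)]
# o_lo -> [(2, 'la', 0.2), (4, 'a', 0.1)]
```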
This project is licensed under the MIT License β see the LICENSE file for details.