
False Friends Are Not Foes

Official codebase for the paper:
False Friends Are Not Foes: Investigating Vocabulary Overlap in Multilingual Language Models
(Kallini et al., 2025)

πŸ“– Overview

This repository contains the code and scripts for reproducing the experiments in our paper, which systematically investigates how subword vocabulary overlap affects cross-lingual transfer in bilingual language models.

Multilingual tokenizers naturally produce overlapping tokens. We explore whether such overlap facilitates or hinders transfer by training bilingual autoregressive models across six language pairs and under four controlled overlap settings:

  1. 🟒 Full Overlap: all naturally overlapping tokens are shared
  2. 🟑 High-Similarity Overlap: only semantically similar tokens are shared
  3. 🟠 Low-Similarity Overlap: only semantically dissimilar tokens are shared
  4. πŸ”΄ No Overlap: vocabularies made fully disjoint

The overlap settings are implemented by remapping token IDs so only a chosen subset is shared, leaving all other tokens offset and therefore disjoint between languages. We train on CCMatrix bilingual data and cover diverse language families/scripts; English is paired with each target language and sentences are interleaved during pre-training. The language pairs we consider are ENβ€”ES, ENβ€”DE, ENβ€”TR, ENβ€”ZH, ENβ€”AR, and ENβ€”SW.
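The remapping idea can be sketched as follows. This is a simplified illustration of the mechanism described above, not the repository's exact implementation: tokens in the chosen shared set keep their original IDs in both languages, while every other target-language token is offset by the vocabulary size, placing it in a range disjoint from the source vocabulary.

```python
def remap_ids(token_ids, shared_ids, vocab_size):
    """Remap target-language token IDs so that only `shared_ids` overlap
    with the source vocabulary; all other tokens are offset into a
    disjoint ID range (illustrative sketch)."""
    return [t if t in shared_ids else t + vocab_size for t in token_ids]

# With vocab size 100 and only token 7 shared, token 7 keeps its ID
# while tokens 3 and 42 move to the disjoint range.
print(remap_ids([7, 3, 42], {7}, 100))  # [7, 103, 142]
```

Under Full Overlap the shared set is the entire native overlap; under No Overlap it is empty, so every target token is offset.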

Our findings:

  • Overlap creates embedding spaces that better capture cross-lingual relationships.
  • Any overlap improves cross-lingual transfer compared to disjoint vocabularies.
  • Overlapping semantically similar tokens is most beneficial to transfer.

✨ Citation

If you use our code, please cite our paper:

@inproceedings{kallini2025false,
  title={False Friends Are Not Foes: Investigating Vocabulary Overlap in Multilingual Language Models},
  author={Julie Kallini and Dan Jurafsky and Christopher Potts and Martijn Bartelds},
  booktitle={The 2025 Conference on Empirical Methods in Natural Language Processing},
  year={2025},
  url={https://openreview.net/forum?id=mIpRFuCa2h}
}

πŸš€ Installation

Clone the repo and install dependencies:

git clone https://github.com/jkallini/false-friends.git
cd false-friends
conda create -n falsefriends python=3.9
conda activate falsefriends
pip install -r requirements.txt

πŸ”§ Preliminaries (one-time setup)

Open utils.py and set the following configuration variables:

  • BASE_PATH: root for datasets, preprocessed files, and model checkpoints.
  • CCMATRIX_RAW_PATH: directory containing gzipped TSV files named {src}-{tgt}.tsv.gz. We recommend downloading these via the Amazon CCMatrix helper (amazon-science/multi-way-parallel-ccmatrix).
  • TOKENIZER_NAMES: base tokenizers to consider for analysis (e.g., xlmr).
  • LANGUAGE_PAIRS: list of {src}-{tgt} pairs to process and train (e.g., en-es, en-de, en-tr, en-zh, en-ar, en-sw).

πŸ“‚ Data Preparation

Use placeholders consistently: {src}, {tgt}, {tokenizer_name}, {setting}, {split}.

1) Tokenize CCMatrix

python tokenize_ccmatrix.py {src} {tgt} {tokenizer_name}

Tokenizes CCMatrix {src}-{tgt}.tsv.gz with {tokenizer_name}, producing tokenized shards and vocab stats.

2) Compute overlap sets

python get_overlap_sets.py

Computes native overlap tokens between {src} and {tgt}.
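The core of this step is a set intersection over the two languages' subword vocabularies. A minimal sketch, assuming each vocabulary is available as a set of token strings (the script itself derives these from the tokenized CCMatrix data):

```python
def native_overlap(vocab_src, vocab_tgt):
    """Native overlap O = V_src ∩ V_tgt: token strings that appear in
    both languages' subword vocabularies (illustrative sketch)."""
    return vocab_src & vocab_tgt

# Hypothetical toy vocabularies: two tokens overlap.
print(native_overlap({"_taxi", "_pan", "_the"}, {"_taxi", "_pan", "_el"}))
```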

3) Extract token occurrences

python extract_occurrences.py \
    --src {src} \
    --tgt {tgt} \
    --tokenizer_name {tokenizer_name} \
    --sentence_N 400_000_000 \
    --token_list overlap_tokens/{src}-{tgt}-{tokenizer_name}.txt

Samples occurrences for each overlapping token to enable similarity scoring.

4) Cross-lingual token similarity

python analyze_token_similarity.py pretrained \
    --src {src} \
    --tgt {tgt} \
    --model_name {tokenizer_name} \
    --layer_index {layer_index}

Computes static embeddings for overlapping tokens and ranks them by cross-lingual similarity (defaults: xlmr, layer 5).
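The scoring idea behind this step can be sketched in pure Python: mean-pool a token's contextual embeddings over its sampled contexts in each language, then take the cosine similarity of the two pooled vectors. This is a hedged illustration of the procedure, with toy 2-d vectors standing in for model hidden states:

```python
import math

def mean_pool(vectors):
    """Average a list of equal-length embedding vectors component-wise."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def token_similarity(src_contexts, tgt_contexts):
    """Cross-lingual similarity of one token: cosine similarity between
    its mean-pooled contextual embeddings in each language."""
    return cosine(mean_pool(src_contexts), mean_pool(tgt_contexts))

# Identical context embeddings in both languages give similarity 1.0;
# orthogonal ones give 0.0.
print(token_similarity([[1.0, 0.0], [1.0, 0.0]], [[1.0, 0.0]]))  # 1.0
print(token_similarity([[1.0, 0.0]], [[0.0, 1.0]]))              # 0.0
```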

5) Build training/validation splits

python process_data_for_training.py \
    --src {src} \
    --tgt {tgt} \
    --tokenizer_name {tokenizer_name} \
    --setting {setting} \
    --split {split}

πŸ‹οΈ Pre-training

Slurm (recommended)

sbatch training/train_array.slurm training/args/pretraining/{args_file}

Local (Accelerate)

accelerate launch train.py \
    --src {src} \
    --tgt {tgt} \
    --tokenizer_name {tokenizer_name} \
    --setting {setting} \
    --random_seed {seed} \
    --rope \
    --resume_from_checkpoint

Model: GPT-2 style decoder-only Transformer, 12 layers, 12 heads, d=768. Trained for 100k steps with AdamW, cosine LR schedule, effective batch = 64Γ—1024 tokens.
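The stated hyperparameters imply the following token budget per run; a back-of-the-envelope check using only the numbers above:

```python
# Token budget implied by the stated pre-training hyperparameters
# (purely arithmetic; no assumptions beyond the numbers in the text).
seq_len = 1024        # tokens per sequence
batch_size = 64       # sequences per effective batch
steps = 100_000       # total optimizer steps

tokens_per_step = batch_size * seq_len
total_tokens = tokens_per_step * steps
print(tokens_per_step)  # 65536
print(total_tokens)     # 6553600000, i.e. ~6.6B tokens per run
```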


πŸ”¬ Analysis

1) Select tokens for analysis

python select_similar_tokens.py --src {src} --tgt {tgt} --tokenizer_name {tokenizer_name}

2) Similarity with trained models

  • Most/least similar:
python analyze_token_similarity.py custom \
    --setting {setting} \
    --model_seed {seed} \
    --src {src} \
    --tgt {tgt} \
    --layer_index {layer_index} \
    --tokens_file {tokens_file}
  • Random pairs:
python analyze_token_similarity.py custom \
    --setting {setting} \
    --model_seed {seed} \
    --src {src} \
    --tgt {tgt} \
    --uid random \
    --layer_index {layer_index} \
    --tokens_file {all_tokens_file} \
    --random_pairs 500

Or run in batch with:

sbatch training/analysis_array.slurm {args_file}

🎯 Fine-tuning

We fine-tune pretrained models on English and evaluate zero-shot on {tgt}.

  • XNLI: MultiNLI β†’ XNLI (num_train_epochs=5, save_steps=500)
  • XQuAD: SQuAD β†’ XQuAD (num_train_epochs=7, save_steps=200)

A) Slurm

Args file (training/args/finetune_args.txt):

TASK SRC TGT TOKENIZER SETTING MODEL_SEED RANDOM_SEED BATCH_SIZE LEARNING_RATE EVAL_STEPS NUM_EPOCHS DEVICE_BS
xnli en es xlmr full_overlap 21 21 256 5e-5 500 5 64
xquad en ar xlmr full_overlap 21 21 128 5e-5 200 7 16

Launch:

sbatch training/finetune_sweep.slurm training/args/finetune_args.txt

B) Direct run

XNLI:

python3 finetune.py \
    --task xnli \
    --src {src} \
    --tgt {tgt} \
    --tokenizer_name {tokenizer_name} \
    --setting {setting} \
    --effective_batch_size 256 \
    --per_device_train_batch_size 64 \
    --num_train_epochs 5 \
    --save_steps 500 \
    --eval_steps 500 \
    --learning_rate 5e-5 \
    --model_seed 21 \
    --random_seed 21 \
    --resume_from_checkpoint

XQuAD:

python3 finetune.py \
    --task xquad \
    --src {src} \
    --tgt {tgt} \
    --tokenizer_name {tokenizer_name} \
    --setting {setting} \
    --effective_batch_size 128 \
    --per_device_train_batch_size 16 \
    --num_train_epochs 7 \
    --save_steps 200 \
    --eval_steps 200 \
    --learning_rate 5e-5 \
    --model_seed 21 \
    --random_seed 21 \
    --resume_from_checkpoint

πŸ”‘ Overlap Tokens & Similarity Scores

We release the overlap sets and their similarity rankings used in our experiments:

  • Overlap token lists: datasets/overlap_tokens/{src}-{tgt}-{tokenizer_name}.txt
    Each file contains the native overlap O = V_src ∩ V_tgt under the XLM-R tokenizer.

  • Similarity scores: datasets/similarities/pretrained_{src}_{tgt}_{tokenizer_name}.tsv
    Each file has one line per token, of the form token_id token_string similarity_score, where similarity_score is the cosine similarity between the token's {src} and {tgt} contextual embeddings (mean-pooled across 100 sampled contexts).

These files allow you to:

  • Reproduce the High- and Low-similarity overlap splits (O_hi, O_lo).
  • Inspect which tokens are most/least semantically aligned across languages.
  • Run your own analyses with select_similar_tokens.py or custom thresholds.
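One plausible way to rebuild high/low-similarity splits from the released scores is a simple threshold over the parsed rows. This is a sketch under the assumption that a score threshold (rather than, say, a fixed-size ranking cut) is acceptable for your analysis; the toy rows below are hypothetical:

```python
def split_by_similarity(rows, threshold):
    """Partition (token_id, token_string, score) rows into a
    high-similarity set O_hi and a low-similarity set O_lo
    (illustrative sketch; thresholding strategy is an assumption)."""
    o_hi = [r for r in rows if r[2] >= threshold]
    o_lo = [r for r in rows if r[2] < threshold]
    return o_hi, o_lo

# Hypothetical parsed rows from a similarities TSV.
rows = [(5, "_taxi", 0.82), (9, "_pan", 0.11), (3, "_hotel", 0.74)]
o_hi, o_lo = split_by_similarity(rows, 0.5)
print([r[1] for r in o_hi])  # ['_taxi', '_hotel']
print([r[1] for r in o_lo])  # ['_pan']
```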

πŸ“„ License

This project is licensed under the MIT License – see the LICENSE file for details.
