We use a modified fork of Hugging Face Transformers for our experiments.
$ git clone https://github.com/csebuetnlp/xl-sum
$ cd xl-sum/seq2seq
$ conda create python==3.7.9 pytorch==1.7.1 torchvision==0.8.2 torchaudio==0.7.2 cudatoolkit=10.2 -c pytorch -p ./env
$ conda activate ./env # or source activate ./env (for older versions of anaconda)
$ bash setup.sh
Use the newly created environment for running the rest of the commands.
Before running the extractor, place all the .jsonl files (train, val, test) for all the languages you want to work with, under a single directory (without any subdirectories).
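As a sketch of the expected layout, the snippet below builds a flat directory of `.jsonl` splits. The directory name and file names here (e.g. `bengali_train.jsonl`) are illustrative assumptions; use the names from the released archive.

```python
# Illustrative layout: all splits for all languages directly under one
# directory, with no subdirectories. Names are hypothetical examples.
from pathlib import Path

input_dir = Path("XLSum_raw")  # hypothetical directory name
input_dir.mkdir(exist_ok=True)

for lang in ["bengali", "hindi"]:           # any subset of languages
    for split in ["train", "val", "test"]:
        (input_dir / f"{lang}_{split}.jsonl").touch()

# Every .jsonl file sits directly under input_dir
assert all(p.is_file() for p in input_dir.glob("*.jsonl"))
print(sorted(p.name for p in input_dir.iterdir()))
```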
For example, to replicate our multilingual setup with all languages, run the following commands:
$ wget --load-cookies /tmp/cookies.txt "https://docs.google.com/uc?export=download&confirm=$(wget --quiet --save-cookies /tmp/cookies.txt --keep-session-cookies --no-check-certificate 'https://docs.google.com/uc?export=download&id=1fKxf9jAj0KptzlxUsI3jDbp4XLv_piiD' -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1\n/p')&id=1fKxf9jAj0KptzlxUsI3jDbp4XLv_piiD" -O XLSum_complete_v2.0.tar.bz2 && rm -rf /tmp/cookies.txt
$ tar -xjvf XLSum_complete_v2.0.tar.bz2
$ python extract_data.py -i XLSum_complete_v2.0/ -o XLSum_input/
This will create the source and target files for multilingual training within XLSum_input/multilingual, and per-language training and evaluation file pairs under XLSum_input/individual/<language>.
To see a list of all available options, run python pipeline.py -h
- For multilingual training on a single GPU, a minimal example is as follows:
$ python pipeline.py \
--model_name_or_path "google/mt5-base" \
--data_dir "XLSum_input/multilingual" \
--output_dir "XLSum_output/multilingual" \
--lr_scheduler_type="transformer" \
--learning_rate=1 \
--warmup_steps 5000 \
--weight_decay 0.01 \
--per_device_train_batch_size=2 \
--gradient_accumulation_steps=16 \
--max_steps 50000 \
--save_steps 5000 \
--evaluation_strategy "no" \
--logging_first_step \
--adafactor \
--label_smoothing_factor 0.1 \
--upsampling_factor 0.5 \
--do_train
- For multilingual training on multiple nodes / GPUs, launch the script with torch.distributed.launch, i.e.
$ python -m torch.distributed.launch \
--nproc_per_node=<NPROC_PER_NODE> \
--nnodes=<NUM_NODES> \
--node_rank=<PROCID> \
--master_addr=<ADDR> \
--master_port=<PORT> \
pipeline.py ...
To replicate our setup on 8 GPUs (4 nodes with 2 NVIDIA Tesla P100 GPUs each) using SLURM, refer to job.sh and distributed_trainer.sh.
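The --upsampling_factor flag in the multilingual example above controls how training examples are sampled across languages. Assuming it implements the usual exponent-smoothed sampling (as popularized by multilingual models such as mBERT and mT5), each language is drawn with probability proportional to its dataset size raised to the factor, which boosts low-resource languages. The sketch below is illustrative only; check pipeline.py for the exact implementation.

```python
# Exponent-smoothed language sampling: with factor a, language l is
# sampled with probability proportional to n_l ** a. Smaller a flattens
# the distribution toward low-resource languages.
def sampling_probs(sizes, factor=0.5):
    smoothed = {lang: n ** factor for lang, n in sizes.items()}
    total = sum(smoothed.values())
    return {lang: s / total for lang, s in smoothed.items()}

# Made-up example counts, for illustration only
sizes = {"english": 300_000, "bengali": 8_000}
probs = sampling_probs(sizes, factor=0.5)
# With factor=0.5, the low-resource language receives a much larger
# share than its raw proportion (~2.6%) would give it.
print(probs)
```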
- A minimal training example (e.g., on Bengali) on a single GPU is given below:
$ python pipeline.py \
--model_name_or_path "google/mt5-base" \
--data_dir "XLSum_input/individual/bengali" \
--output_dir "XLSum_output/individual/bengali" \
--lr_scheduler_type="linear" \
--learning_rate=5e-4 \
--warmup_steps 100 \
--weight_decay 0.01 \
--per_device_train_batch_size=2 \
--gradient_accumulation_steps=16 \
--num_train_epochs=10 \
--save_steps 100 \
--predict_with_generate \
--evaluation_strategy "epoch" \
--logging_first_step \
--adafactor \
--label_smoothing_factor 0.1 \
--do_train \
--do_eval
Hyperparameters such as warmup_steps should be updated according to the language. For a detailed example, refer to trainer.sh.
- To calculate ROUGE scores on test sets (e.g., on Hindi) using a trained model, use the following snippet:
$ python pipeline.py \
--model_name_or_path <path/to/trained/model/directory> \
--data_dir "XLSum_input/individual/hindi" \
--output_dir "XLSum_output/individual/hindi" \
--rouge_lang "hindi" \
--predict_with_generate \
--length_penalty 0.6 \
--no_repeat_ngram_size 2 \
--max_source_length 512 \
--test_max_target_length 84 \
--do_predict
For a detailed example, refer to evaluate.sh.
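The --length_penalty flag above controls length normalization during beam search. As a rough sketch of the transformers-style scoring rule (illustrative, not the exact generation code): a finished hypothesis is scored by its summed log-probability divided by length ** length_penalty, so larger values favor longer summaries, and 0.6 applies a milder normalization than 1.0.

```python
# Simplified beam-search hypothesis scoring with length normalization.
# Higher length_penalty divides the (negative) log-prob sum by a larger
# factor, making long hypotheses score relatively better.
def beam_score(sum_logprob, length, length_penalty=0.6):
    return sum_logprob / (length ** length_penalty)

# A shorter vs. a longer candidate (made-up log-probabilities)
short = beam_score(-10.0, 20)
long_ = beam_score(-14.0, 40)
print(short, long_)  # with penalty 0.6 the longer candidate wins here
```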