
Bug in calculating num_input_tokens_seen in multi-gpu environments #34503

@Tender-Su

Description

System Info

  • transformers version: 4.47.0.dev0
  • Platform: Linux-5.4.0-171-generic-x86_64-with-glibc2.35
  • Python version: 3.11.10
  • Huggingface_hub version: 0.26.2
  • Safetensors version: 0.4.5
  • Accelerate version: 1.0.1
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.5.1+cu124 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using distributed or parallel set-up in script?:
  • Using GPU in script?:
  • GPU type: NVIDIA A100-SXM4-80GB

Who can help?

@muellerzr @ArthurZucker

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

I'm using LLaMA-Factory to run Qwen2-VL multimodal SFT tasks on an 8*A100 machine. Everything works fine with transformers 4.45.2, but after upgrading to 4.46.0 or later, training no longer runs:
[rank0]: Traceback (most recent call last):
[rank0]: File "/mnt/zj-gpfs/home/sz/LLaMA-Factory/src/llamafactory/launcher.py", line 23, in
[rank0]: launch()
[rank0]: File "/mnt/zj-gpfs/home/sz/LLaMA-Factory/src/llamafactory/launcher.py", line 19, in launch
[rank0]: run_exp()
[rank0]: File "/mnt/zj-gpfs/home/sz/LLaMA-Factory/src/llamafactory/train/tuner.py", line 50, in run_exp
[rank0]: run_sft(model_args, data_args, training_args, finetuning_args, generating_args, callbacks)
[rank0]: File "/mnt/zj-gpfs/home/sz/LLaMA-Factory/src/llamafactory/train/sft/workflow.py", line 96, in run_sft
[rank0]: train_result = trainer.train(resume_from_checkpoint=training_args.resume_from_checkpoint)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/mnt/zj-gpfs/home/sz/anaconda3/envs/flas/lib/python3.11/site-packages/transformers/trainer.py", line 2122, in train
[rank0]: return inner_training_loop(
[rank0]: ^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/mnt/zj-gpfs/home/sz/anaconda3/envs/flas/lib/python3.11/site-packages/transformers/trainer.py", line 2453, in _inner_training_loop
[rank0]: self.state.num_input_tokens_seen += self.accelerator.gather(input_tokens).cpu().item()
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: RuntimeError: a Tensor with 8 elements cannot be converted to Scalar
I looked at the blame; the offending code was introduced in #34198.

The original code was:

    self.state.num_input_tokens_seen += torch.sum(
        self.accelerator.gather(
            torch.tensor(inputs[main_input_name].numel(), device=self.args.device, dtype=torch.int64)
        )
    )

The modified code is:

    input_tokens = inputs[main_input_name].numel()
    input_tokens = torch.tensor(input_tokens, device=self.args.device, dtype=torch.int64)
    self.state.num_input_tokens_seen += self.accelerator.gather(input_tokens).cpu().item()
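
As a quick standalone illustration (this is not the Trainer code, just a toy reproduction of the error): .item() only works on single-element tensors, and on an 8-GPU machine the gathered tensor has 8 elements, one count per rank.

    import torch

    # Stand-in for the result of accelerator.gather() across 8 ranks.
    gathered = torch.arange(8, dtype=torch.int64)
    try:
        gathered.item()  # fails: a Tensor with 8 elements cannot be converted to Scalar
    except RuntimeError as e:
        print(e)
    print(gathered.sum().item())  # reducing first gives a plain Python int (28)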

I'm not an expert on this, but I consulted GPT and it told me that both snippets behave the same in a single-GPU environment.
In a multi-GPU environment, the first snippet correctly accumulates the number of input tokens across all devices, while the second one either errors out or only counts the tokens on the current device.
They are therefore not equivalent in distributed training, and the first version should be used to ensure correctness.
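
If reverting is not the preferred route, keeping the new structure but summing the gathered per-device counts before converting to a scalar should, as far as I can tell, restore the old behavior. This is only a sketch of what that could look like, reusing the same names as in trainer.py, not a confirmed fix:

    input_tokens = inputs[main_input_name].numel()
    input_tokens = torch.tensor(input_tokens, device=self.args.device, dtype=torch.int64)
    # Sum across devices first, then convert the single-element result to a Python int.
    self.state.num_input_tokens_seen += self.accelerator.gather(input_tokens).sum().cpu().item()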

I tried reverting the change locally and rebuilding transformers, and the problem seems to be solved. I would love to hear what you think!

Expected behavior

Please fix this bug and release a new version as soon as possible.
