
Bug in calculating num_input_tokens_seen in multi-gpu environments #34503

@Tender-Su

Description

System Info

  • transformers version: 4.47.0.dev0
  • Platform: Linux-5.4.0-171-generic-x86_64-with-glibc2.35
  • Python version: 3.11.10
  • Huggingface_hub version: 0.26.2
  • Safetensors version: 0.4.5
  • Accelerate version: 1.0.1
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.5.1+cu124 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using distributed or parallel set-up in script?:
  • Using GPU in script?:
  • GPU type: NVIDIA A100-SXM4-80GB

Who can help?

@muellerzr @ArthurZucker

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

I'm using LLaMA-Factory to run Qwen2-VL multimodal SFT tasks on an 8*A100 machine. Everything works fine with transformers 4.45.2, but after upgrading to 4.46.0 or later, training no longer runs:
[rank0]: Traceback (most recent call last):
[rank0]: File "/mnt/zj-gpfs/home/sz/LLaMA-Factory/src/llamafactory/launcher.py", line 23, in
[rank0]: launch()
[rank0]: File "/mnt/zj-gpfs/home/sz/LLaMA-Factory/src/llamafactory/launcher.py", line 19, in launch
[rank0]: run_exp()
[rank0]: File "/mnt/zj-gpfs/home/sz/LLaMA-Factory/src/llamafactory/train/tuner.py", line 50, in run_exp
[rank0]: run_sft(model_args, data_args, training_args, finetuning_args, generating_args, callbacks)
[rank0]: File "/mnt/zj-gpfs/home/sz/LLaMA-Factory/src/llamafactory/train/sft/workflow.py", line 96, in run_sft
[rank0]: train_result = trainer.train(resume_from_checkpoint=training_args.resume_from_checkpoint)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/mnt/zj-gpfs/home/sz/anaconda3/envs/flas/lib/python3.11/site-packages/transformers/trainer.py", line 2122, in train
[rank0]: return inner_training_loop(
[rank0]: ^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/mnt/zj-gpfs/home/sz/anaconda3/envs/flas/lib/python3.11/site-packages/transformers/trainer.py", line 2453, in _inner_training_loop
[rank0]: self.state.num_input_tokens_seen += self.accelerator.gather(input_tokens).cpu().item()
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: RuntimeError: a Tensor with 8 elements cannot be converted to Scalar
I looked at the blame; the offending code was introduced in #34198.

The original code was:

    self.state.num_input_tokens_seen += torch.sum(
        self.accelerator.gather(
            torch.tensor(inputs[main_input_name].numel(), device=self.args.device, dtype=torch.int64)
        )
    )

The modified code is:

    input_tokens = inputs[main_input_name].numel()
    input_tokens = torch.tensor(input_tokens, device=self.args.device, dtype=torch.int64)
    self.state.num_input_tokens_seen += self.accelerator.gather(input_tokens).cpu().item()
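
As a quick standalone illustration (this is not the Trainer code, just a toy reproduction of the error): .item() only works on single-element tensors, and on an 8-GPU machine the gathered tensor has 8 elements, one count per rank.

    import torch

    # Stand-in for the result of accelerator.gather() across 8 ranks.
    gathered = torch.arange(8, dtype=torch.int64)
    try:
        gathered.item()  # fails: a Tensor with 8 elements cannot be converted to Scalar
    except RuntimeError as e:
        print(e)
    print(gathered.sum().item())  # reducing first gives a plain Python int (28)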

I'm not an expert on this, but I consulted GPT and it told me that both snippets behave the same in a single-GPU environment.
In a multi-GPU environment, the first snippet correctly accumulates the number of input tokens across all devices, while the second one either errors out or only counts the tokens on the current device.
They are therefore not equivalent in distributed training, and the first version should be used to ensure correctness.
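
If reverting is not the preferred route, keeping the new structure but summing the gathered per-device counts before converting to a scalar should, as far as I can tell, restore the old behavior. This is only a sketch of what that could look like, reusing the same names as in trainer.py, not a confirmed fix:

    input_tokens = inputs[main_input_name].numel()
    input_tokens = torch.tensor(input_tokens, device=self.args.device, dtype=torch.int64)
    # Sum across devices first, then convert the single-element result to a Python int.
    self.state.num_input_tokens_seen += self.accelerator.gather(input_tokens).sum().cpu().item()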

I tried reverting the change locally and rebuilding transformers, and the problem seems to be solved. I would love to hear what you think!

Expected behavior

Please fix this bug and release a new version as soon as possible.
