
Distributed training will hang if log_vars has different length among GPUs #6495

@fingertap

Description


This is similar to an issue in mmseg; I quote it here:

In the _parse_log function of mmseg.segmentors.base.BaseSegmentor, the loss values are synchronized across all GPUs. The problem occurs in this loop (line 194):

for loss_name, loss_value in log_vars.items():
    # reduce loss when distributed training
    if dist.is_available() and dist.is_initialized():
        loss_value = loss_value.data.clone()
        dist.all_reduce(loss_value.div_(dist.get_world_size()))
    log_vars[loss_name] = loss_value.item()

Suppose GPU A does not have "roi_acc" as a loss_name (and that it is the last key in log_vars). GPU A then thinks it has finished all its work and exits the loop, while the other GPUs, which do have "roi_acc", call torch.distributed.all_reduce on it and wait forever for a reply from GPU A that never comes.

This bug is hard to debug, because training simply blocks without any error message.
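One way to avoid the hang (a sketch of a possible workaround, not necessarily the fix adopted upstream) is to make every rank reduce the same set of keys: first gather the union of key names across all ranks, then pad missing entries with zeros before entering the all_reduce loop. The simulation below uses plain dicts standing in for per-rank log_vars; in real distributed code the key union would come from torch.distributed.all_gather_object and each padded value would still go through dist.all_reduce. The helper names union_of_keys and pad_log_vars are hypothetical.

```python
def union_of_keys(per_rank_keys):
    """Simulate gathering key names from every rank.

    In real code each rank would contribute its local key list via
    dist.all_gather_object, so all ranks end up with the same sorted union.
    """
    keys = set()
    for rank_keys in per_rank_keys:
        keys.update(rank_keys)
    return sorted(keys)


def pad_log_vars(log_vars, all_keys, fill=0.0):
    """Give this rank an entry for every key seen on any rank.

    Every rank then issues the same number of all_reduce calls in the
    same order, so no rank exits the loop early while its peers block.
    """
    return {key: log_vars.get(key, fill) for key in all_keys}


# Two simulated ranks: rank 1 computed an extra "roi_acc" metric.
rank0 = {"loss": 1.0}
rank1 = {"loss": 3.0, "roi_acc": 0.8}

keys = union_of_keys([rank0.keys(), rank1.keys()])
padded0 = pad_log_vars(rank0, keys)
padded1 = pad_log_vars(rank1, keys)

# Both ranks now iterate over identical keys, so none skips an
# all_reduce that the others are waiting on.
assert list(padded0) == list(padded1) == ["loss", "roi_acc"]
```

Note that zero-padding skews the averaged value of a metric that only some ranks produce; an alternative is to gather the key lists, assert they are identical on every rank, and fail fast with a clear error instead of hanging.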

Metadata


Labels

bug (Something isn't working)
