
Distributed training will hang if log_vars has different length among GPUs #6495

@fingertap

Description


This is similar to an issue in mmseg; I quote it here:

In the _parse_log function of mmseg.segmentors.base.BaseSegmentor, the loss values are synchronized across all GPUs. The problem occurs in this loop (line 194):

for loss_name, loss_value in log_vars.items():
    # reduce loss when distributed training
    if dist.is_available() and dist.is_initialized():
        loss_value = loss_value.data.clone()
        dist.all_reduce(loss_value.div_(dist.get_world_size()))
    log_vars[loss_name] = loss_value.item()

Suppose GPU A does not have "roi_acc" as a loss_name (and that it is the last key in log_vars). GPU A then thinks it has finished all its work and exits the loop, while the other GPUs, which do have "roi_acc", call torch.distributed.all_reduce on it and wait forever for a reply from GPU A that never comes.

This bug is hard to debug, because training simply blocks without any error message.
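One way to avoid the hang (a sketch of a possible workaround, not necessarily the fix adopted upstream) is to make every rank reduce the same set of keys: first gather the union of key names across all ranks, then pad missing entries with zeros before entering the all_reduce loop. The simulation below uses plain dicts standing in for per-rank log_vars; in real distributed code the key union would come from torch.distributed.all_gather_object and each padded value would still go through dist.all_reduce. The helper names union_of_keys and pad_log_vars are hypothetical.

```python
def union_of_keys(per_rank_keys):
    """Simulate gathering key names from every rank.

    In real code each rank would contribute its local key list via
    dist.all_gather_object, so all ranks end up with the same sorted union.
    """
    keys = set()
    for rank_keys in per_rank_keys:
        keys.update(rank_keys)
    return sorted(keys)


def pad_log_vars(log_vars, all_keys, fill=0.0):
    """Give this rank an entry for every key seen on any rank.

    Every rank then issues the same number of all_reduce calls in the
    same order, so no rank exits the loop early while its peers block.
    """
    return {key: log_vars.get(key, fill) for key in all_keys}


# Two simulated ranks: rank 1 computed an extra "roi_acc" metric.
rank0 = {"loss": 1.0}
rank1 = {"loss": 3.0, "roi_acc": 0.8}

keys = union_of_keys([rank0.keys(), rank1.keys()])
padded0 = pad_log_vars(rank0, keys)
padded1 = pad_log_vars(rank1, keys)

# Both ranks now iterate over identical keys, so none skips an
# all_reduce that the others are waiting on.
assert list(padded0) == list(padded1) == ["loss", "roi_acc"]
```

Note that zero-padding skews the averaged value of a metric that only some ranks produce; an alternative is to gather the key lists, assert they are identical on every rank, and fail fast with a clear error instead of hanging.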

Metadata


Labels

bug (Something isn't working)
