Closed
Labels: bug
Description
This is similar to the issue in mmseg; I quote it here:
In the `_parse_log` function of `mmseg.segmentors.base.BaseSegmentor`, the loss values are synchronized among all GPUs in this loop (line 194):

```python
for loss_name, loss_value in log_vars.items():
    # reduce loss when distributed training
    if dist.is_available() and dist.is_initialized():
        loss_value = loss_value.data.clone()
        dist.all_reduce(loss_value.div_(dist.get_world_size()))
    log_vars[loss_name] = loss_value.item()
```

Suppose one GPU (call it A) does not have `"roi_acc"` as a `loss_name`, and that `"roi_acc"` is the last key in `log_vars`. GPU A then thinks it has done all its work and exits the loop. The other GPUs, which do have the final `"roi_acc"` key, call `torch.distributed.all_reduce` for it and wait indefinitely for a reply from GPU A that never comes.
This bug is hard to debug, because training simply hangs with no error message.
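One way to avoid the hang (a sketch of a possible workaround, not mmseg's actual code) is to make every rank reduce the same set of keys in the same order, filling in missing entries with zero so each rank issues the same sequence of `all_reduce` calls. The key-alignment logic can be illustrated without GPUs; the function name `align_log_vars` is hypothetical:

```python
def align_log_vars(per_rank_log_vars):
    """Align loss dicts across (simulated) ranks so that every rank
    would call all_reduce the same number of times, in the same order.

    per_rank_log_vars: list of dicts, one per rank, mapping
    loss_name -> scalar value.
    """
    # Union of keys across all ranks, sorted so every rank agrees on order.
    all_keys = sorted(set().union(*(d.keys() for d in per_rank_log_vars)))
    # Fill missing keys with 0.0 so every rank contributes to every reduce.
    return [{k: d.get(k, 0.0) for k in all_keys} for d in per_rank_log_vars]
```

In real distributed code the union of keys would itself have to be agreed on collectively, e.g. with `torch.distributed.all_gather_object` (available in recent PyTorch) before the reduce loop; note that filling with zero also changes the averaged value for keys that only some ranks report, so the divisor may need adjusting per key.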