Open
Labels
bug (Something isn't working), community-request, needs-follow-up (Issue needs follow-up)
Description
ProcessGroupCollection.__repr__ assumes every field is a single ProcessGroup with a .size() method. When hierarchical_context_parallel_sizes is set, the hierarchical CP field stores a list of ProcessGroup objects, causing AttributeError: 'list' object has no attribute 'size'.
Error
  File "megatron/core/process_groups_config.py", line 150, in __repr__
    active_pgs.append(f"{field_info.name}({pg.size()})")
AttributeError: 'list' object has no attribute 'size'
Triggered during checkpoint saving when modelopt's _parse_transformer_config calls str() on the config, which invokes __repr__.
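The failure mode can be sketched without Megatron or a distributed job. The snippet below is a minimal illustration, not the real implementation: `FakeProcessGroup`, `FakeCollection`, the field names `cp` and `hcp`, the helper `try_repr`, and the returned string format are all hypothetical stand-ins. Only the `active_pgs.append(...)` line mirrors the traceback above.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


class FakeProcessGroup:
    """Hypothetical stand-in for a process group; only size() is modeled."""

    def __init__(self, world_size: int) -> None:
        self._world_size = world_size

    def size(self) -> int:
        return self._world_size


@dataclass(repr=False)
class FakeCollection:
    """Hypothetical stand-in for ProcessGroupCollection (field names are assumptions)."""

    cp: Optional[FakeProcessGroup] = None
    # List-valued field, like the hierarchical CP field described above.
    hcp: Optional[List[FakeProcessGroup]] = None

    def __repr__(self) -> str:
        active_pgs = []
        for field_info in fields(self):
            pg = getattr(self, field_info.name, None)
            if pg is not None:
                # Mirrors the failing line: assumes every field has .size().
                active_pgs.append(f"{field_info.name}({pg.size()})")
        # The return format is an assumption made for this sketch.
        return f"FakeCollection({', '.join(active_pgs)})"


def try_repr(collection: FakeCollection) -> str:
    """Return str(collection), or the error text if __repr__ raises."""
    try:
        return str(collection)
    except AttributeError as exc:
        return f"AttributeError: {exc}"


# Only single-group fields set: __repr__ succeeds.
ok = FakeCollection(cp=FakeProcessGroup(16))
# A list-valued field set: str() reaches pg.size() on a list and fails.
broken = FakeCollection(
    cp=FakeProcessGroup(16),
    hcp=[FakeProcessGroup(8), FakeProcessGroup(2)],
)

print(try_repr(ok))
print(try_repr(broken))
```

Because `str()` falls back to `__repr__` when no `__str__` is defined, any caller that stringifies the object hits the same error.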
Reproduction
model:
  cp_comm_type: a2a+p2p
  hierarchical_context_parallel_sizes: [8, 2]
  context_parallel_size: 16
Suggested Fix
Handle list-typed fields in __repr__:
def __repr__(self):
    active_pgs = []
    for field_info in fields(self):
        pg = getattr(self, field_info.name, None)
        if pg is not None:
            if isinstance(pg, list):
                sizes = [g.size() for g in pg]
                active_pgs.append(f"{field_info.name}({sizes})")
            else:
                active_pgs.append(f"{field_info.name}({pg.size()})")
    ...
Environment
- Container: nvcr.io/nvidia/nemo:26.02
- Megatron-LM: core_r0.16.0
Related
- save_sharded_modelopt_state crashes with hierarchical context parallel groups Model-Optimizer#981 — the modelopt side of the same crash