ProcessGroupCollection.__repr__ crashes with list-typed hierarchical CP groups #3723

@shanecmoran

Description

ProcessGroupCollection.__repr__ assumes every field is a single ProcessGroup with a .size() method. When hierarchical_context_parallel_sizes is set, the hierarchical CP field instead stores a list of ProcessGroup objects, so calling .size() on it raises AttributeError: 'list' object has no attribute 'size'.

Error

File "megatron/core/process_groups_config.py", line 150, in __repr__
    active_pgs.append(f"{field_info.name}({pg.size()})")
AttributeError: 'list' object has no attribute 'size'

The error is triggered during checkpoint saving: modelopt's _parse_transformer_config calls str() on the config, which invokes __repr__.
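The failure mode can be shown without Megatron or a distributed setup. The sketch below is only an illustration: StubGroup and StubCollection are hypothetical stand-ins (not Megatron or PyTorch classes), and the field names are made up. It mirrors the loop from the traceback and shows that str() falls back to __repr__ when no __str__ is defined, so the list-typed field raises the same AttributeError.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


class StubGroup:
    """Hypothetical stand-in for a process group; only size() is modeled."""

    def __init__(self, size: int) -> None:
        self._size = size

    def size(self) -> int:
        return self._size


@dataclass(repr=False)
class StubCollection:
    """Hypothetical stand-in for the collection; field names are illustrative."""

    cp: Optional[StubGroup] = None
    hcp: Optional[List[StubGroup]] = None  # list-typed, like hierarchical CP

    def __repr__(self) -> str:
        # Same pattern as the line in the traceback: assumes pg has .size().
        active_pgs = []
        for field_info in fields(self):
            pg = getattr(self, field_info.name, None)
            if pg is not None:
                active_pgs.append(f"{field_info.name}({pg.size()})")
        return f"StubCollection({', '.join(active_pgs)})"


stub = StubCollection(cp=StubGroup(16), hcp=[StubGroup(8), StubGroup(2)])
try:
    str(stub)  # no __str__ defined, so this calls __repr__
    repro_error = None
except AttributeError as exc:
    repro_error = str(exc)
print(repro_error)
```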

Reproduction

model:
  cp_comm_type: a2a+p2p
  hierarchical_context_parallel_sizes: [8, 2]
  context_parallel_size: 16

Suggested Fix

Handle list-typed fields in __repr__:

def __repr__(self):
    active_pgs = []
    for field_info in fields(self):
        pg = getattr(self, field_info.name, None)
        if pg is not None:
            if isinstance(pg, list):
                sizes = [g.size() for g in pg]
                active_pgs.append(f"{field_info.name}({sizes})")
            else:
                active_pgs.append(f"{field_info.name}({pg.size()})")
    ...
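To sanity-check the suggested branch in isolation, here is a minimal, self-contained sketch. FakeGroup, FakeCollection, and describe_groups are hypothetical names introduced for illustration and are not part of Megatron-LM; accepting tuples as well as lists is an extra assumption, not something the report requires.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


class FakeGroup:
    """Hypothetical stand-in for a process group; only size() is modeled."""

    def __init__(self, size: int) -> None:
        self._size = size

    def size(self) -> int:
        return self._size


def describe_groups(obj) -> List[str]:
    """Return 'name(size)' entries, handling list-typed fields."""
    entries = []
    for field_info in fields(obj):
        pg = getattr(obj, field_info.name, None)
        if pg is None:
            continue
        if isinstance(pg, (list, tuple)):
            # Hierarchical fields hold several groups: report every size.
            sizes = [g.size() for g in pg]
            entries.append(f"{field_info.name}({sizes})")
        else:
            entries.append(f"{field_info.name}({pg.size()})")
    return entries


@dataclass(repr=False)
class FakeCollection:
    """Hypothetical stand-in for the collection; field names are illustrative."""

    tp: Optional[FakeGroup] = None
    cp: Optional[FakeGroup] = None
    hcp: Optional[List[FakeGroup]] = None

    def __repr__(self) -> str:
        return f"FakeCollection({', '.join(describe_groups(self))})"


fixed = FakeCollection(cp=FakeGroup(16), hcp=[FakeGroup(8), FakeGroup(2)])
print(str(fixed))
```

With the list branch in place, str() on the stand-in no longer raises, and unset fields are skipped as before.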

Environment

  • Container: nvcr.io/nvidia/nemo:26.02
  • Megatron-LM: core_r0.16.0
