--packed flag causes cudaErrorIllegalAddress in NCCL watchdog during multi-GPU distributed training via mp.spawn
Environment
- gsplat (a8d88d3)
- CUDA 13.0
- PyTorch 2.x with NCCL 2.27.7
- 4x NVIDIA A10G GPUs (AWS g5.12xlarge)
- Python 3.12
- Launched via python simple_trainer.py default --packed ... --world_size 4 (matching basic_4gpus.sh)
Description
Similar issue: #845
When running simple_trainer.py with --packed and world_size > 1, training crashes at step 0 with a cudaErrorIllegalAddress in the NCCL process group watchdog thread across all ranks simultaneously.
Without --packed, multi-GPU training runs correctly with the same configuration.
Steps to Reproduce
python simple_trainer.py default \
--packed \
--world_size 4 \
--data-dir /path/to/dataset \
--result-dir /path/to/output \
--disable_viewer \
--max_steps 15000
Error
[rank0]:[E] ProcessGroupNCCL.cpp:2057] [PG ID 0 PG GUID 0(default_pg) Rank 0]
Process group watchdog thread terminated with exception:
CUDA error: an illegal memory access was encountered
Exception raised from c10_cuda_check_implementation at
/pytorch/c10/cuda/CUDAException.cpp:44
terminate called after throwing an instance of 'c10::DistBackendError'
what(): CUDA error: an illegal memory access was encountered
All 4 ranks (0-3) crash simultaneously at the first training step with the same error.
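Because the error surfaces asynchronously in the watchdog thread, the reported stack trace does not point at the faulting kernel. A standard first diagnosis step (these are generic CUDA/PyTorch environment variables, not anything gsplat-specific) is to force synchronous kernel launches so the illegal access is raised at the offending call:

```python
# Set before any CUDA work (e.g. at the top of simple_trainer.py, before
# torch/mp.spawn initialize CUDA). These are standard env vars, shown here
# as a debugging suggestion, not a fix.
import os

# Report the illegal access at the launching call instead of a later sync.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"
# Have NCCL errors propagate as Python exceptions rather than a watchdog abort.
os.environ["TORCH_NCCL_ASYNC_ERROR_HANDLING"] = "1"
```

With these set, the traceback should identify whether the fault originates in the rasterization kernels or in the subsequent collective.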
Expected Behavior
Multi-GPU training with --packed should work as demonstrated in examples/benchmarks/basic_4gpus.sh.
Additional Context
- Removing --packed resolves the crash: training completes successfully across all 4 ranks, but the multi-GPU speed-up is negligible
- The crash occurs at step 0, before any meaningful training, suggesting the issue lies in the initial NCCL all_reduce over the packed sparse tensors produced by the first rasterization pass
- Hypothesis: packed mode produces variable-length sparse tensors per rank whose memory layout is incompatible with how NCCL accesses GPU memory during all_reduce, triggering an illegal memory access that surfaces in the NCCL watchdog
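If the hypothesis is right, the underlying contract is that an all_reduce requires every rank to contribute a buffer of identical size, while packed mode yields ragged per-rank buffers. A pure-Python sketch of that contract and a padding-style workaround (no torch; all_reduce_sum and pad_and_reduce are hypothetical names used only for illustration):

```python
# Illustrates the shape contract behind all_reduce(sum): every rank must
# contribute a buffer of the same length. Packed mode yields ragged
# per-rank buffers; padding each rank to the global max restores it.

def all_reduce_sum(buffers):
    """Simulated all_reduce: elementwise sum across ranks' buffers.
    Raises on mismatched lengths, standing in for the undefined
    behavior (illegal address) of mismatched NCCL buffers."""
    if len({len(b) for b in buffers}) != 1:
        raise RuntimeError("mismatched buffer sizes across ranks")
    return [sum(vals) for vals in zip(*buffers)]

def pad_and_reduce(ragged_buffers, pad_value=0.0):
    """Pad each rank's buffer to the global max length, then reduce."""
    max_len = max(len(b) for b in ragged_buffers)
    padded = [b + [pad_value] * (max_len - len(b)) for b in ragged_buffers]
    return all_reduce_sum(padded)

# Ragged per-rank buffers, as packed mode would produce across 4 ranks:
ranks = [[1.0, 2.0], [3.0], [4.0, 5.0, 6.0], [7.0]]
print(pad_and_reduce(ranks))  # [15.0, 7.0, 6.0]
```

In a real fix the equivalent step would be done with a collective (e.g. exchanging per-rank lengths first, then padding or falling back to variable-size point-to-point ops), since NCCL itself offers no ragged all_reduce.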