--packed flag causes cudaErrorIllegalAddress in NCCL watchdog during multi-GPU distributed training via mp.spawn
Environment
- gsplat (a8d88d3)
- CUDA 13.0
- PyTorch 2.x with NCCL 2.27.7
- 4x NVIDIA A10G GPUs (AWS g5.12xlarge)
- Python 3.12
- Launched via python simple_trainer.py default --packed ... --world_size 4 (matching basic_4gpus.sh)
Description
Similar issue: #845
When running simple_trainer.py with --packed and world_size > 1, training crashes at step 0 with a cudaErrorIllegalAddress in the NCCL process group watchdog thread across all ranks simultaneously.
Without --packed, multi-GPU training runs correctly with the same configuration.
Steps to Reproduce
python simple_trainer.py default \
--packed \
--world_size 4 \
--data-dir /path/to/dataset \
--result-dir /path/to/output \
--disable_viewer \
--max_steps 15000
Error
[rank0]:[E] ProcessGroupNCCL.cpp:2057] [PG ID 0 PG GUID 0(default_pg) Rank 0]
Process group watchdog thread terminated with exception:
CUDA error: an illegal memory access was encountered
Exception raised from c10_cuda_check_implementation at
/pytorch/c10/cuda/CUDAException.cpp:44
terminate called after throwing an instance of 'c10::DistBackendError'
what(): CUDA error: an illegal memory access was encountered
All 4 ranks (0-3) crash simultaneously at the first training step with the same error.
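Because the error surfaces asynchronously in the watchdog thread, the reported stack trace does not point at the faulting kernel. A standard first diagnosis step (these are generic CUDA/PyTorch environment variables, not anything gsplat-specific) is to force synchronous kernel launches so the illegal access is raised at the offending call:

```python
# Set before any CUDA work (e.g. at the top of simple_trainer.py, before
# torch/mp.spawn initialize CUDA). These are standard env vars, shown here
# as a debugging suggestion, not a fix.
import os

# Report the illegal access at the launching call instead of a later sync.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"
# Have NCCL errors propagate as Python exceptions rather than a watchdog abort.
os.environ["TORCH_NCCL_ASYNC_ERROR_HANDLING"] = "1"
```

With these set, the traceback should identify whether the fault originates in the rasterization kernels or in the subsequent collective.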
Expected Behavior
Multi-GPU training with --packed should work as demonstrated in examples/benchmarks/basic_4gpus.sh.
Additional Context
- Removing --packed resolves the crash: training completes successfully across all 4 ranks, but the multi-GPU speed-up is negligible
- The crash occurs at step 0, before any meaningful training, suggesting the issue lies in the initial NCCL all_reduce over the packed sparse tensors produced by the first rasterization pass
- Hypothesis: packed mode produces variable-length sparse tensors per rank whose memory layout is incompatible with how NCCL accesses GPU memory during all_reduce, triggering an illegal memory access that surfaces in the NCCL watchdog
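If the hypothesis is right, the underlying contract is that an all_reduce requires every rank to contribute a buffer of identical size, while packed mode yields ragged per-rank buffers. A pure-Python sketch of that contract and a padding-style workaround (no torch; all_reduce_sum and pad_and_reduce are hypothetical names used only for illustration):

```python
# Illustrates the shape contract behind all_reduce(sum): every rank must
# contribute a buffer of the same length. Packed mode yields ragged
# per-rank buffers; padding each rank to the global max restores it.

def all_reduce_sum(buffers):
    """Simulated all_reduce: elementwise sum across ranks' buffers.
    Raises on mismatched lengths, standing in for the undefined
    behavior (illegal address) of mismatched NCCL buffers."""
    if len({len(b) for b in buffers}) != 1:
        raise RuntimeError("mismatched buffer sizes across ranks")
    return [sum(vals) for vals in zip(*buffers)]

def pad_and_reduce(ragged_buffers, pad_value=0.0):
    """Pad each rank's buffer to the global max length, then reduce."""
    max_len = max(len(b) for b in ragged_buffers)
    padded = [b + [pad_value] * (max_len - len(b)) for b in ragged_buffers]
    return all_reduce_sum(padded)

# Ragged per-rank buffers, as packed mode would produce across 4 ranks:
ranks = [[1.0, 2.0], [3.0], [4.0, 5.0, 6.0], [7.0]]
print(pad_and_reduce(ranks))  # [15.0, 7.0, 6.0]
```

In a real fix the equivalent step would be done with a collective (e.g. exchanging per-rank lengths first, then padding or falling back to variable-size point-to-point ops), since NCCL itself offers no ragged all_reduce.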