Open
Description
We ran test_internode.py over a RoCE network on 4 H800 servers with 8 GPUs per server. The results are much poorer than those from the same test on 4 H800 servers over an IB network.
case#1, 4 H800 servers on IB network

case#2, 4 H800 servers on RoCE network
[tuning] Best dispatch (FP8): SMs 24, NVL chunk 8, RDMA chunk 8: 29.92 GB/s (RDMA), 60.35 GB/s (NVL)
[tuning] Best dispatch (BF16): SMs 24, NVL chunk 12, RDMA chunk 4: 29.54 GB/s (RDMA), 59.58 GB/s (NVL)
[tuning] Best combine: SMs 24, NVL chunk 1, RDMA chunk 16: 13.59 GB/s (RDMA), 27.41 GB/s (NVL)
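For anyone comparing the two cases, the `[tuning]` summary lines can be turned into numbers for a side-by-side comparison. Below is a minimal Python sketch, assuming the log format is exactly as quoted in case#2; `TUNING_RE` and `parse_tuning_line` are hypothetical helper names and are not part of test_internode.py.

```python
import re

# Hypothetical pattern matching the "[tuning] Best ..." summary lines quoted above.
# The "(FP8)" / "(BF16)" suffix is optional because the combine line has none.
TUNING_RE = re.compile(
    r"\[tuning\] Best (?P<op>dispatch|combine)(?: \((?P<dtype>\w+)\))?: "
    r"SMs (?P<sms>\d+), NVL chunk (?P<nvl_chunk>\d+), RDMA chunk (?P<rdma_chunk>\d+): "
    r"(?P<rdma_gbps>\d+(?:\.\d+)?) GB/s \(RDMA\), "
    r"(?P<nvl_gbps>\d+(?:\.\d+)?) GB/s \(NVL\)"
)


def parse_tuning_line(line):
    """Return a dict of the fields in one tuning summary line, or None if it does not match."""
    m = TUNING_RE.search(line)
    if m is None:
        return None
    d = m.groupdict()
    return {
        "op": d["op"],
        "dtype": d["dtype"],  # None for the combine line
        "sms": int(d["sms"]),
        "nvl_chunk": int(d["nvl_chunk"]),
        "rdma_chunk": int(d["rdma_chunk"]),
        "rdma_gbps": float(d["rdma_gbps"]),
        "nvl_gbps": float(d["nvl_gbps"]),
    }


if __name__ == "__main__":
    # The three lines reported for case#2 (RoCE).
    roce_log = [
        "[tuning] Best dispatch (FP8): SMs 24, NVL chunk 8, RDMA chunk 8: 29.92 GB/s (RDMA), 60.35 GB/s (NVL)",
        "[tuning] Best dispatch (BF16): SMs 24, NVL chunk 12, RDMA chunk 4: 29.54 GB/s (RDMA), 59.58 GB/s (NVL)",
        "[tuning] Best combine: SMs 24, NVL chunk 1, RDMA chunk 16: 13.59 GB/s (RDMA), 27.41 GB/s (NVL)",
    ]
    for line in roce_log:
        print(parse_tuning_line(line))
```

Running the same parser over the IB log would give two lists of records that can be compared field by field, for example the RDMA bandwidth of the combine step in each case.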
I am not sure whether benchmark results for RoCE networks are available. Any comments would be highly appreciated.
Many thanks.