[PyTorch Migration] Add Reusable EFA Tests#5985

Open
Eren-Jeager123 wants to merge 47 commits into main from PT-EFA

Eren-Jeager123 commented Apr 21, 2026

Add reusable EC2 EFA integration test

What

A reusable EFA integration test that launches 2x p4d.24xlarge EC2 instances, runs NCCL all_reduce_perf across nodes, and verifies EFA (not TCP) is the actual transport. Ported from V1 (test/dlc_tests/ec2/test_efa.py + test/v2/ec2/efa/) and adapted for V2's AL2023-based runtime.

Architecture

reusable-efa-tests.yml              (reusable — takes image-uri as input)
  └── pytest test/efa/test_efa.py
        ├── efa_instances() context manager (ec2_helpers.py)
        │     ├── Create ephemeral SG (SSH from runner + all traffic within SG, both directions)
        │     ├── Launch 2x p4d.24xlarge via capacity reservations
        │     ├── Allocate + associate Elastic IPs (multi-NIC EFA has no auto public IP)
        │     ├── SFTP scripts, ECR login, docker pull on both hosts
        │     ├── Start containers with --network host + --device /dev/infiniband/uverbs*
        │     ├── Configure inter-container SSH on port 2022
        │     └── Write MPI hostfile
        ├── Build nccl-tests → all_reduce_perf on both nodes
        ├── Run EFA sanity on master (fi_info, ib_uverbs, fi_pingpong, ibv_devinfo, GDR)
        ├── Run all_reduce_perf across 2 nodes (16 ranks) via mpirun
        └── Cleanup: terminate instances, release EIPs, delete SG, delete keypair

Framework-agnostic — any DLC with NCCL + EFA can call reusable-efa-tests.yml. Same pattern as reusable-sanity-tests.yml and reusable-security-tests.yml.
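
For readers unfamiliar with the hostfile and mpirun steps in the tree above, here is a minimal sketch of what they could look like. It is not the code in ec2_helpers.py or nccl_allreduce.sh: the function names, the binary path, the message-size flags, and the exported environment variables are all assumptions chosen for illustration. Only the port (2022) and the rank count (2 nodes x 8 GPUs = 16) come from this PR.

```python
import shlex


def build_hostfile(hosts, slots_per_host=8):
    """Return Open MPI hostfile text: one 'host slots=N' line per node."""
    return "".join(f"{host} slots={slots_per_host}\n" for host in hosts)


def build_mpirun_command(hostfile_path, num_ranks, ssh_port=2022,
                         binary="/opt/nccl-tests/build/all_reduce_perf"):
    """Build a shell-safe mpirun command line (binary path is hypothetical)."""
    argv = [
        "mpirun", "--allow-run-as-root",
        "-np", str(num_ranks),
        "--hostfile", hostfile_path,
        # Point Open MPI's rsh/ssh launcher at the in-container sshd port.
        "--mca", "plm_rsh_args", f"-p {ssh_port}",
        # Assumed environment: verbose NCCL logs and the EFA Libfabric provider.
        "-x", "NCCL_DEBUG=INFO",
        "-x", "FI_PROVIDER=efa",
        # Assumed sweep: 8 B to 1 GiB, doubling each step, one GPU per rank.
        binary, "-b", "8", "-e", "1G", "-f", "2", "-g", "1",
    ]
    return shlex.join(argv)
```

Using `shlex.join` keeps the `-p 2022` argument quoted as a single token when the command is sent over SSH to the master container.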

What it verifies

| Check | Script | What it proves |
|---|---|---|
| `fi_info -p efa` | `efa_sanity.sh` | EFA Libfabric provider detected |
| `test -d /sys/module/ib_uverbs` | `efa_sanity.sh` | `ib_uverbs` kernel module loaded |
| `fi_pingpong -p efa` | `efa_pingpong.sh` | EFA loopback data transfer works (quick pre-flight) |
| `ibv_devinfo` | `efa_sanity.sh` | RDMA devices present |
| GDR device check | `efa_sanity.sh` | GPU Direct RDMA available |
| `grep "Selected provider is efa"` | `nccl_allreduce.sh` | NCCL picked EFA transport |
| `grep "Using network Libfabric"` | `nccl_allreduce.sh` | Libfabric active (not sockets) |
| `grep "NET/Libfabric/0/GDRDMA"` | `nccl_allreduce.sh` | NCCL uses GDR over EFA |
| Bandwidth ≥ 3 GB/s | `nccl_allreduce.sh` | EFA path actually carries traffic (TCP fallback is ~1 GB/s) |
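
The log assertions in the table are implemented in shell (nccl_allreduce.sh). The same logic can be sketched in Python to make it explicit. This is an illustration, not the script's code: the function name and failure messages are hypothetical. The sketch also assumes the 3 GB/s threshold is applied to the `# Avg bus bandwidth` line that nccl-tests prints at the end of a run, which the description above does not state.

```python
import re

# Log markers taken from the table above; the messages are illustrative.
REQUIRED_MARKERS = {
    "Selected provider is efa": "NCCL did not select the EFA provider",
    "Using network Libfabric": "Libfabric network plugin is not active",
}
GDR_MARKER = "NET/Libfabric/0/GDRDMA"
AVG_BUSBW = re.compile(r"#\s*Avg bus bandwidth\s*:\s*([0-9]+(?:\.[0-9]+)?)")


def check_nccl_log(log, min_busbw_gbps=3.0, require_gdr=True):
    """Return a list of failure messages; an empty list means all checks pass."""
    failures = [msg for marker, msg in REQUIRED_MARKERS.items() if marker not in log]
    if require_gdr and GDR_MARKER not in log:
        failures.append("GPUDirect RDMA path is not in use")
    match = AVG_BUSBW.search(log)
    if match is None:
        failures.append("no average bus bandwidth line found")
    elif float(match.group(1)) < min_busbw_gbps:
        failures.append(
            f"bus bandwidth {match.group(1)} GB/s is below {min_busbw_gbps} GB/s"
        )
    return failures
```

The `require_gdr` flag lets the caller skip the GDRDMA assertion on instance types where it does not apply.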

Security group configuration

The ephemeral SG is configured per AWS's EFA + NCCL setup guide:

  • Ingress, SSH (port 22) from CodeBuild runner's public IP — for fabric.Connection() orchestration.
  • Ingress, ALL traffic from the SG itself — required for EFA inter-node traffic.
  • Egress, ALL traffic to the SG itself — required for EFA inter-node traffic. The default wide-open egress rule (0.0.0.0/0) is not sufficient for EFA's SRD handshake to complete; AWS's setup docs explicitly require an SG-scoped outbound rule.
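
As a rough illustration of the three rules above, the following sketch builds the `IpPermissions` payloads and applies them through a boto3 EC2 client. The function names and the runner-IP parameter are hypothetical and the actual code in ec2_helpers.py may differ. Only `authorize_security_group_ingress` and `authorize_security_group_egress` are real boto3 calls.

```python
def _all_traffic_within(sg_id):
    """Rule matching every protocol and port, scoped to members of sg_id."""
    return {"IpProtocol": "-1", "UserIdGroupPairs": [{"GroupId": sg_id}]}


def build_efa_sg_rules(sg_id, runner_ip):
    """Build ingress/egress IpPermissions for an EFA test security group."""
    ssh_from_runner = {
        "IpProtocol": "tcp",
        "FromPort": 22,
        "ToPort": 22,
        "IpRanges": [{"CidrIp": f"{runner_ip}/32", "Description": "SSH from CI runner"}],
    }
    return {
        "ingress": [ssh_from_runner, _all_traffic_within(sg_id)],
        "egress": [_all_traffic_within(sg_id)],
    }


def apply_efa_sg_rules(ec2_client, sg_id, runner_ip):
    """Apply the rules using a boto3 EC2 client passed in by the caller."""
    rules = build_efa_sg_rules(sg_id, runner_ip)
    ec2_client.authorize_security_group_ingress(
        GroupId=sg_id, IpPermissions=rules["ingress"]
    )
    ec2_client.authorize_security_group_egress(
        GroupId=sg_id, IpPermissions=rules["egress"]
    )
```

Passing the client in, rather than creating it inside the helper, keeps the rule construction testable without AWS credentials.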

Concurrency and capacity

  • Serialization: reusable-efa-tests.yml uses a global workflow-level concurrency group (efa-test-global, cancel-in-progress: false) so only one EFA test runs at a time across all PRs, avoiding p4d contention.
  • Capacity source: targeted capacity reservations only. No on-demand fallback (p4d on-demand availability is effectively 0%). Test fails fast with a clear error if no reservation has ≥ 2 slots.
  • NIC count: queried dynamically via DescribeInstanceTypes (NetworkInfo.EfaInfo.MaximumEfaInterfaces) — matches V1's approach, no hardcoded instance-type map.
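
The capacity and NIC-count logic above can be sketched as two small functions over the response shapes of `DescribeCapacityReservations` and `DescribeInstanceTypes`. The function names and the error message are hypothetical, and the real selection logic in ec2_helpers.py may apply further filters such as availability zone.

```python
def pick_reservation(reservations_response, instance_type, needed=2):
    """Return the first active reservation with at least `needed` free slots."""
    for cr in reservations_response.get("CapacityReservations", []):
        if (
            cr.get("InstanceType") == instance_type
            and cr.get("State") == "active"
            and cr.get("AvailableInstanceCount", 0) >= needed
        ):
            return cr["CapacityReservationId"]
    # Fail fast: there is no on-demand fallback.
    raise RuntimeError(
        f"no active capacity reservation with >= {needed} free {instance_type} slots"
    )


def max_efa_interfaces(instance_types_response):
    """Read NetworkInfo.EfaInfo.MaximumEfaInterfaces from DescribeInstanceTypes."""
    info = instance_types_response["InstanceTypes"][0]
    return info["NetworkInfo"]["EfaInfo"]["MaximumEfaInterfaces"]
```

With boto3, the inputs would come from `ec2.describe_capacity_reservations()` and `ec2.describe_instance_types(InstanceTypes=[instance_type])`.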

Files

| File | Purpose |
|---|---|
| `.github/workflows/reusable-efa-tests.yml` | Reusable workflow with global concurrency |
| `.github/scripts/efa/ec2_helpers.py` | EC2 + EFA lifecycle (launch, SSH, container setup, cleanup) |
| `test/efa/test_efa.py` | Pytest orchestrator |
| `test/efa/scripts/efa_sanity.sh` | EFA sanity checks |
| `test/efa/scripts/efa_pingpong.sh` | `fi_pingpong` over EFA loopback |
| `test/efa/scripts/build_nccl_tests.sh` | Build `all_reduce_perf` from NVIDIA nccl-tests |
| `test/efa/scripts/nccl_allreduce.sh` | Run all_reduce + verify EFA transport + bandwidth |

AL2023 adaptations (vs V1 Ubuntu)

| Change | Reason |
|---|---|
| `service ssh start` → `/usr/sbin/sshd` | AL2023 containers have no `service` wrapper |
| `service ssh status` → `pgrep -x sshd` | Verify sshd is running without sysvinit |
| `git clone` → `curl` tarball + `tar xz` | Runtime image doesn't include git |
| Install `cuda-cudart-devel-<ver>` at test time | Runtime image has `cuda-nvcc` but not CUDA headers |
| `NCCL_HOME=/usr/local` → `nvidia.nccl` pip package path | NCCL ships via pip; headers aren't in `/usr/local` |
| Create `libnccl.so` unversioned symlink at test time | Pip NCCL ships versioned `libnccl.so.N` only; linker needs unversioned `.so` |
| `lsmod \| grep ib_uverbs` → `test -d /sys/module/ib_uverbs` | AL2023 minimal containers don't install kmod; `/sys/module/` is populated by the kernel directly |
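
The unversioned-symlink adaptation is done in shell at test time. A Python sketch of the same idea follows, for illustration only. The function name is hypothetical, and the caller is assumed to already know the `lib` directory of the `nvidia.nccl` pip package.

```python
import re
from pathlib import Path


def ensure_unversioned_libnccl(lib_dir):
    """Create libnccl.so -> libnccl.so.N in lib_dir if it does not exist yet."""
    lib_dir = Path(lib_dir)
    link = lib_dir / "libnccl.so"
    if link.exists() or link.is_symlink():
        return link  # already present, nothing to do
    versioned = [
        p for p in lib_dir.iterdir() if re.fullmatch(r"libnccl\.so\.\d+", p.name)
    ]
    if not versioned:
        raise FileNotFoundError(f"no libnccl.so.N found in {lib_dir}")
    # Pick the highest major version numerically, not lexicographically.
    newest = max(versioned, key=lambda p: int(p.name.rsplit(".", 1)[1]))
    # A relative target keeps the link valid if the directory is relocated.
    link.symlink_to(newest.name)
    return link
```

After this runs, `-lnccl` resolves at link time because the linker looks for the unversioned `libnccl.so` name.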

Dockerfile changes

  • Added ENV CUDA_HOME=/usr/local/cuda to Dockerfile.cuda runtime-base. V1 Ubuntu CUDA base image set this; AL2023 nvidia/cuda base does not.
  • Added /opt/amazon/openmpi/lib64 and /opt/amazon/efa/lib64 to LD_LIBRARY_PATH. AL2023's EFA installer places libraries under lib64 (V1 Ubuntu used lib). Without this, libmpi.so.40 etc. aren't found at runtime. Matches the pattern used by vLLM and SGLang AL2023 Dockerfiles.

Workflow integration

Added efa-test job to pr-pytorch-ec2-cuda.yml. Runs in parallel with single-gpu-test after build + sanity + security + unit pass. ~19-25 min runtime (most of which is instance provisioning).

Instance type configuration

The reusable workflow accepts an efa-instance-type input (default p4d.24xlarge). Any EFA-capable instance type will work — ec2_helpers.py queries DescribeInstanceTypes dynamically for the NIC count.

Supported targets:

| Instance type | EFA NICs | Use case |
|---|---|---|
| p4d.24xlarge | 4 | Default (prod target) |
| p4de.24xlarge | 4 | Prod target |
| p5.48xlarge | 32 | Prod target |

The GDRDMA NCCL log assertion in nccl_allreduce.sh is conditional on p4d/p5 — other instance types skip it.

Follow-ups

  • Migrate to a permanent SG (like V1). The ephemeral SG created by each test run sometimes fails to delete with DependencyViolation because ENIs are still detaching when cleanup runs, and our 30-second sleep isn't always enough. Every leaked SG counts against the VPC security group quota (2,500 per Region by default) and accumulates over time. V1's approach is cleaner: one long-lived SG created by infrastructure-as-code (CDK), looked up by name at test start, never deleted. Requires a CDK change in DLContainersInfraCDK/lib/stacks/github_cicd_infra_stack.ts and an ec2_helpers.py change to look up instead of create/delete.
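
A possible shape for the lookup half of this follow-up, sketched under the assumption that the permanent SG has a unique name within the VPC. The function name is hypothetical. `describe_security_groups` and its `group-name` and `vpc-id` filters are real boto3/EC2 API features.

```python
def find_security_group_id(ec2_client, group_name, vpc_id):
    """Look up a long-lived security group by name instead of creating one."""
    response = ec2_client.describe_security_groups(
        Filters=[
            {"Name": "group-name", "Values": [group_name]},
            {"Name": "vpc-id", "Values": [vpc_id]},
        ]
    )
    groups = response.get("SecurityGroups", [])
    if len(groups) != 1:
        # Zero matches means the IaC stack is missing; more than one is ambiguous.
        raise LookupError(
            f"expected exactly one SG named {group_name!r} in {vpc_id}, "
            f"found {len(groups)}"
        )
    return groups[0]["GroupId"]
```

With this in place, cleanup no longer has to delete a security group, so the DependencyViolation race goes away.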

Signed-off-by: Kevin Wang <kwanggg@amazon.com>
Eren-Jeager123 and others added 27 commits April 21, 2026 23:01
Eren-Jeager123 and others added 18 commits April 24, 2026 06:09