[PyTorch Migration] Add Reusable EFA Tests #5985
Open
Eren-Jeager123 wants to merge 47 commits into main
Signed-off-by: Kevin Wang <kwanggg@amazon.com>
Add reusable EC2 EFA integration test
What
A reusable EFA integration test that launches 2x p4d.24xlarge EC2 instances, runs NCCL all_reduce_perf across nodes, and verifies that EFA (not TCP) is the actual transport. Ported from V1 (test/dlc_tests/ec2/test_efa.py + test/v2/ec2/efa/) and adapted for V2's AL2023-based runtime.

Architecture
Framework-agnostic: any DLC with NCCL + EFA can call reusable-efa-tests.yml. Same pattern as reusable-sanity-tests.yml and reusable-security-tests.yml.

What it verifies
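The NCCL log checks in this section are substring assertions on the all_reduce_perf output. The PR performs them with grep inside nccl_allreduce.sh; the Python below is only an equivalent sketch. The function names are hypothetical, and reading "p4d/p5" as "instance type name starts with p4d or p5" is an assumption.

```python
# Hypothetical helper: the PR itself runs these checks with grep in nccl_allreduce.sh.
REQUIRED_MARKERS = (
    "Selected provider is efa",
    "Using network Libfabric",
)
GDRDMA_MARKER = "NET/Libfabric/0/GDRDMA"


def expects_gdrdma(instance_type: str) -> bool:
    """Assumption: 'p4d/p5' covers any type whose name starts with p4d or p5."""
    return instance_type.startswith(("p4d", "p5"))


def missing_markers(nccl_log: str, instance_type: str) -> list[str]:
    """Return the markers that should appear in the NCCL log but do not."""
    wanted = list(REQUIRED_MARKERS)
    if expects_gdrdma(instance_type):
        # The GDRDMA assertion is only applied on p4d/p5 instance types.
        wanted.append(GDRDMA_MARKER)
    return [marker for marker in wanted if marker not in nccl_log]
```

An empty return value means every expected marker was found, so a test could assert `not missing_markers(log, instance_type)`.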
- fi_info -p efa (efa_sanity.sh)
- test -d /sys/module/ib_uverbs (efa_sanity.sh): ib_uverbs kernel module loaded
- fi_pingpong -p efa (efa_pingpong.sh)
- ibv_devinfo (efa_sanity.sh)
- … (efa_sanity.sh)
- grep "Selected provider is efa" (nccl_allreduce.sh)
- grep "Using network Libfabric" (nccl_allreduce.sh)
- grep "NET/Libfabric/0/GDRDMA" (nccl_allreduce.sh)
- … (nccl_allreduce.sh)

Security group configuration
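As a rough illustration of the SG-scoped rules this section refers to, the sketch below builds a self-referencing permission and applies it in both directions. AWS's EFA guide asks for all traffic to be allowed to and from the security group itself. The function names are hypothetical, `ec2_client` is assumed to be a boto3 EC2 client, and the SSH rule used for orchestration is left out.

```python
def self_referencing_permissions(sg_id: str) -> list[dict]:
    """All protocols and ports, limited to members of the same security group."""
    return [{"IpProtocol": "-1", "UserIdGroupPairs": [{"GroupId": sg_id}]}]


def authorize_efa_traffic(ec2_client, sg_id: str) -> None:
    """Add SG-scoped inbound and outbound rules to the group.

    ec2_client is assumed to be a boto3 EC2 client, e.g. boto3.client("ec2").
    """
    permissions = self_referencing_permissions(sg_id)
    ec2_client.authorize_security_group_ingress(GroupId=sg_id, IpPermissions=permissions)
    ec2_client.authorize_security_group_egress(GroupId=sg_id, IpPermissions=permissions)
```

Keeping the permission builder separate from the API calls makes the rule shape easy to unit-test without AWS access.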
The ephemeral SG is configured per AWS's EFA + NCCL setup guide:
- … fabric.Connection() orchestration.
- … 0.0.0.0/0) is not sufficient for EFA's SRD handshake to complete; AWS's setup docs explicitly require an SG-scoped outbound rule.

Concurrency and capacity
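The dynamic NIC-count lookup used for capacity could look roughly like the sketch below. It reads NetworkInfo.EfaInfo.MaximumEfaInterfaces from a DescribeInstanceTypes response. The function name is hypothetical, `ec2_client` is assumed to be a boto3 EC2 client, and the actual code in ec2_helpers.py may differ.

```python
def max_efa_interfaces(ec2_client, instance_type: str) -> int:
    """Return how many EFA interfaces the instance type supports.

    ec2_client is assumed to be a boto3 EC2 client, e.g. boto3.client("ec2").
    """
    response = ec2_client.describe_instance_types(InstanceTypes=[instance_type])
    network_info = response["InstanceTypes"][0]["NetworkInfo"]
    if not network_info.get("EfaSupported", False):
        # Fail early instead of launching an instance that cannot run the test.
        raise ValueError(f"{instance_type} does not support EFA")
    return network_info["EfaInfo"]["MaximumEfaInterfaces"]
```

Because the count comes from the API, a new EFA-capable instance type needs no code change.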
- reusable-efa-tests.yml uses a global workflow-level concurrency group (efa-test-global, cancel-in-progress: false) so only one EFA test runs at a time across all PRs, avoiding p4d contention.
- The EFA NIC count is read from DescribeInstanceTypes (NetworkInfo.EfaInfo.MaximumEfaInterfaces), matching V1's approach; there is no hardcoded instance-type map.

Files
- .github/workflows/reusable-efa-tests.yml
- .github/scripts/efa/ec2_helpers.py
- test/efa/test_efa.py
- test/efa/scripts/efa_sanity.sh
- test/efa/scripts/efa_pingpong.sh
- test/efa/scripts/build_nccl_tests.sh: builds all_reduce_perf from NVIDIA nccl-tests
- test/efa/scripts/nccl_allreduce.sh

AL2023 adaptations (vs V1 Ubuntu)
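Two of the adaptations in this section, checking /sys/module/ instead of calling lsmod and creating an unversioned libnccl.so symlink, can be sketched in Python as follows. The PR does both in shell scripts. The function names and the `sysfs_root` and `soname` parameters are hypothetical and exist only to keep the sketch testable.

```python
from pathlib import Path


def kernel_module_loaded(name: str, sysfs_root: str = "/sys/module") -> bool:
    """Equivalent of `test -d /sys/module/<name>`; needs no kmod tooling."""
    return (Path(sysfs_root) / name).is_dir()


def ensure_unversioned_lib(lib_dir: str, soname: str = "libnccl.so") -> Path:
    """Create <soname> as a symlink to a versioned <soname>.N if it is missing."""
    link = Path(lib_dir) / soname
    if link.is_symlink() or link.exists():
        return link
    versioned = sorted(Path(lib_dir).glob(soname + ".*"))
    if not versioned:
        raise FileNotFoundError(f"no {soname}.N found in {lib_dir}")
    # Relative target, so the link stays valid if the directory is moved.
    link.symlink_to(versioned[-1].name)
    return link
```

The symlink matters because the linker resolves `-lnccl` to the unversioned name, while the pip package only ships the versioned file.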
- service ssh start → /usr/sbin/sshd: … service wrapper
- service ssh status → pgrep -x sshd
- git clone → curl tarball + tar xz: … git
- … cuda-cudart-devel-<ver> at test time: … cuda-nvcc but not CUDA headers
- NCCL_HOME=/usr/local → nvidia.nccl pip package path: … /usr/local
- … libnccl.so unversioned symlink at test time: … libnccl.so.N only; linker needs unversioned .so
- lsmod | grep ib_uverbs → test -d /sys/module/ib_uverbs: … kmod; /sys/module/ is populated by the kernel directly

Dockerfile changes
- Added ENV CUDA_HOME=/usr/local/cuda to Dockerfile.cuda runtime-base. The V1 Ubuntu CUDA base image set this; the AL2023 nvidia/cuda base does not.
- Added /opt/amazon/openmpi/lib64 and /opt/amazon/efa/lib64 to LD_LIBRARY_PATH. AL2023's EFA installer places libraries under lib64 (V1 Ubuntu used lib). Without this, libmpi.so.40 etc. aren't found at runtime. Matches the pattern used by the vLLM and SGLang AL2023 Dockerfiles.

Workflow integration
Added efa-test job to pr-pytorch-ec2-cuda.yml. Runs in parallel with single-gpu-test after build + sanity + security + unit pass. ~19-25 min runtime (most of which is instance provisioning).

Instance type configuration
The reusable workflow accepts an efa-instance-type input (default p4d.24xlarge). Any EFA-capable instance type will work: ec2_helpers.py queries DescribeInstanceTypes dynamically for the NIC count.

Supported targets:
- p4d.24xlarge
- p4de.24xlarge
- p5.48xlarge

The GDRDMA NCCL log assertion in nccl_allreduce.sh is conditional on p4d/p5; other instance types skip it.

Follow-ups
- … DependencyViolation: ENIs are still detaching when the cleanup runs, and our 30-second sleep isn't always enough. Every leaked SG counts against the 2500-per-VPC quota, and leaks accumulate over time. V1's approach is cleaner: one long-lived SG created by infrastructure-as-code (CDK), looked up by name at test start, never deleted. Requires a CDK change in DLContainersInfraCDK/lib/stacks/github_cicd_infra_stack.ts and an ec2_helpers.py change to look up instead of create/delete.
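The "look up instead of create/delete" change to ec2_helpers.py might look something like the sketch below. It is not part of this PR. The function name and error handling are hypothetical, and `ec2_client` is assumed to be a boto3 EC2 client.

```python
def find_security_group_id(ec2_client, group_name: str, vpc_id: str) -> str:
    """Resolve a long-lived security group by name within one VPC.

    ec2_client is assumed to be a boto3 EC2 client, e.g. boto3.client("ec2").
    """
    response = ec2_client.describe_security_groups(
        Filters=[
            {"Name": "group-name", "Values": [group_name]},
            {"Name": "vpc-id", "Values": [vpc_id]},
        ]
    )
    groups = response["SecurityGroups"]
    if len(groups) != 1:
        # Zero matches means the CDK stack is missing; several would be ambiguous.
        raise LookupError(
            f"expected exactly one security group named {group_name!r} in {vpc_id}, "
            f"found {len(groups)}"
        )
    return groups[0]["GroupId"]
```

With a lookup like this the test never deletes the group, so a DependencyViolation at cleanup can no longer leak one.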