[PG][TENT] Fix first-collective hangs on NVLink/MNNVL bootstrap #1755

yuechen-sys merged 6 commits into kvcache-ai:main
Conversation
Summary of Changes (Gemini Code Assist)

This pull request addresses several issues that caused bootstrap and first-collective hangs in Mooncake PG, particularly on TENT with NVLink/MNNVL configurations. The changes proactively load CUDA kernels, correctly classify host memory regions for transport, improve RPC server initialization, and robustly handle inactive ranks. Together, these fixes eliminate several failure points and lead to a much more stable startup and collective operation.
Code Review
This pull request refactors memory registration in ConnectionContext and P2PProxy to consistently use kWildcardLocation for host-side memory, and initializes newly allocated arrays to zero. It introduces a preloadReduceKernels mechanism to address CUDA lazy loading for the supported data types. Additionally, the MooncakeWorker is enhanced with kInvalidTaskId checks for safer task management, and the RPC server now includes error handling and retry logic around async_start() to prevent silent failures from port conflicts. The std::string location parameter was removed from the ConnectionContext constructor.
The added error checking for server_->async_start() is a critical improvement. Previously, RPC port conflicts could lead to silent failures and incorrect reporting of successful startup. This change ensures that such errors are detected, logged, and handled by retrying port selection, directly addressing solution point 3 and significantly enhancing the robustness of the RPC server initialization.
```cpp
const auto err = server_->get_errc();
if (err) {
    LOG(WARNING) << "Failed to start RPC server(async_start) on port " << port
                 << ": " << err.message();
    delete server_;
    server_ = nullptr;
    port = 0;
    continue;
}
```

```cpp
int rc = engine_->registerLocalMemory(resources_.ctrl_send_region_,
                                      kMaxNumRanks * sizeof(P2PControlSlot),
                                      kWildcardLocation);  // was: location_
```
Registering resources_.ctrl_send_region_ with kWildcardLocation is a crucial change. This correctly identifies the P2P control regions as host-side memory, preventing misclassification and ensuring proper transport behavior on NVLink/MNNVL setups, as detailed in solution point 2.
```cpp
rc = engine_->registerLocalMemory(resources_.ctrl_recv_region_,
                                  kMaxNumRanks * sizeof(P2PControlSlot),
                                  kWildcardLocation);  // was: location_
```
Similarly, registering resources_.ctrl_recv_region_ with kWildcardLocation ensures that this P2P control region is also correctly identified as host-side memory. This consistency is vital for avoiding bootstrap hangs and ensuring reliable P2P communication.
Codecov Report: ✅ All modified and coverable lines are covered by tests.
Do we need to merge #1733 first?
…che-ai#1755)

* [PG]: integrate with tent's nvlink problem
* [PG]: code format
* [tent] code format
Description
This PR fixes several bootstrap and first-collective hangs when Mooncake PG runs on TENT, especially on NVLink / MNNVL setups.
The issue comes from a combination of initialization-time problems:
The first reduce kernel can be affected by CUDA lazy loading
During the first collective, the first reduce kernel may be launched only after a resident/spin kernel is already occupying the stream. In this case, the first reduce kernel effectively becomes a tail kernel.
Based on our investigation, this is closely related to CUDA lazy loading. NVIDIA documents that lazy loading can affect concurrent kernel execution, and explicitly recommends preloading kernels that are expected to run concurrently. NVIDIA also documents that `cudaFuncGetAttributes()` can be used to force a kernel to load eagerly at runtime. Relevant CUDA documentation:
We observed a particularly surprising symptom here: when the first tail kernel has not been preloaded yet, a `cudaMemcpyAsync` submitted on another stream can remain blocked until the resident spin kernel is released. We did not find this exact cross-stream memcpy symptom described explicitly in the CUDA documentation, but the following minimal demo reproduces it reliably: `cuda_tail_preload_deadlock_demo.cu`. The demo shows the cross-stream `cudaMemcpyAsync` stalling behind the resident spin kernel until that kernel is released.

Mooncake PG can hit the same pattern because the reduce kernel may be the first real CUDA kernel touched by the process.
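A minimal sketch of the scenario (this is not the actual demo file; kernel names, stream setup, and the release mechanism are illustrative, and error checking is omitted):

```cuda
#include <cuda_runtime.h>

__device__ int g_flag = 0;

// Resident kernel that occupies a stream until released from the host.
__global__ void spinKernel() {
    volatile int* f = &g_flag;
    while (*f == 0) {}
}

// Stand-in for the first "real" kernel the process ever launches; under
// CUDA lazy loading, its module is only loaded at this first launch.
__global__ void tailKernel(int* p) { *p = 1; }

void reproduce(int* dev_buf, int* pinned_host_buf) {
    cudaStream_t a, b, c, d;
    cudaStreamCreateWithFlags(&a, cudaStreamNonBlocking);
    cudaStreamCreateWithFlags(&b, cudaStreamNonBlocking);
    cudaStreamCreateWithFlags(&c, cudaStreamNonBlocking);
    cudaStreamCreateWithFlags(&d, cudaStreamNonBlocking);

    // The fix: querying attributes forces eager loading, so tailKernel is
    // already resident before the spin kernel occupies the device.
    // cudaFuncAttributes attr;
    // cudaFuncGetAttributes(&attr, tailKernel);

    spinKernel<<<1, 1, 0, a>>>();         // resident/spin kernel on stream a
    tailKernel<<<1, 1, 0, b>>>(dev_buf);  // lazily loaded tail kernel on stream b

    // Observed symptom: without the preload above, this copy on stream c can
    // remain blocked until the spin kernel is released.
    cudaMemcpyAsync(pinned_host_buf, dev_buf, sizeof(int),
                    cudaMemcpyDeviceToHost, c);

    int one = 1;  // release the spin kernel from an independent stream
    cudaMemcpyToSymbolAsync(g_flag, &one, sizeof(int), 0,
                            cudaMemcpyHostToDevice, d);
    cudaDeviceSynchronize();
}
```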
Host-side control/warmup/sync buffers were registered with a GPU-specific location
CPU sync regions, warmup regions, and P2P control regions were previously registered using the local GPU location. Under TENT NVLink, this can steer transport selection toward a GPU/NVLink path even though these buffers are plain host memory.
On MNNVL/fabric setups, such host buffers are not valid NVLink fabric targets. This can break warmup/control traffic and lead to bootstrap hangs.
This is also related to the earlier location changes in:
PR [PG] force register local memory for P2P memory regions #1690 adjusted the location in a way that helped the TE path work around the issue, but it did not reflect the actual memory type of these host-side regions. PR [TE] Fix simultaneous open handshake in RdmaEndpoint #1733 later addressed the real TE-side issue. After that fix, the PG-side location for these host control/warmup/sync buffers also needs to be corrected accordingly.
In other words, these buffers should use a host-compatible registration (`kWildcardLocation`) rather than inherit the GPU location. This is required so that both TENT and TE choose the correct transport behavior.

RPC port conflicts could be reported as successful startup
TENT selects an RPC port when starting the local RPC server. If two ranks choose the same port, startup should fail and retry. Previously, `async_start()` errors were not checked, so a failed bind could still be treated as success. That can publish duplicated `server_name` / segment names into the store and corrupt later peer connection logic.

Inactive ranks could leave invalid task indices in worker threads
Some ranks may be skipped because they are inactive or excluded by collective semantics. However, their task indices were not explicitly initialized to an invalid value, and later status polling could still touch those entries.
Solution
This PR applies four fixes:
Preload reduce kernels during backend initialization
* Add `preloadReduceKernels()`, which forces eager kernel loading via `cudaFuncGetAttributes()`

Register host control/sync/warmup memory with wildcard location
The following host-side regions now use `kWildcardLocation` instead of a GPU-specific location:

* `ConnectionContext` warmup send/recv regions
* `P2PProxy` control send/recv regions

These buffers are host-side metadata/control memory and should not inherit the GPU location. This aligns PG with the TE-side fix in PR [TE] Fix simultaneous open handshake in RdmaEndpoint #1733 and makes both TENT and TE choose the correct transport behavior.
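The intended registration rule can be sketched as follows; the location-string format and the value of `kWildcardLocation` here are illustrative assumptions, not Mooncake's actual constants:

```cpp
#include <string>

// Assumed sentinel meaning "no specific device location" (illustrative value).
const std::string kWildcardLocation = "*";

// Sketch of the rule applied in this PR: device buffers keep a GPU-specific
// location so NVLink transports can be selected, while host-side
// control/warmup/sync buffers use the wildcard location so they are never
// treated as NVLink/fabric targets.
std::string registrationLocation(bool is_device_memory, int gpu_id) {
    if (is_device_memory) {
        return "cuda:" + std::to_string(gpu_id);  // assumed format
    }
    return kWildcardLocation;  // host metadata/control memory
}
```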
Fail fast on RPC port conflicts

* Check the `async_start()` result in TENT RPC startup
* On failure, release the server and retry port selection before `server_name` publication

Skip inactive ranks safely in worker-thread polling
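A minimal sketch of this sentinel pattern (`kInvalidTaskId` is real per the review summary, but the surrounding structure and names here are illustrative, not the actual MooncakeWorker code):

```cpp
#include <array>
#include <cstdint>
#include <vector>

// Illustrative sentinel for "this rank submitted no task".
constexpr uint64_t kInvalidTaskId = ~0ull;

struct WorkerSlots {
    std::array<uint64_t, 8> task_ids;
    // Initialize every slot to the invalid id so inactive or skipped ranks
    // can never be mistaken for ranks holding a live task.
    WorkerSlots() { task_ids.fill(kInvalidTaskId); }
};

// The polling loop only touches slots that hold a real task id.
std::vector<std::size_t> pollableRanks(const WorkerSlots& slots) {
    std::vector<std::size_t> active;
    for (std::size_t r = 0; r < slots.task_ids.size(); ++r) {
        if (slots.task_ids[r] != kInvalidTaskId) {
            active.push_back(r);
        }
    }
    return active;
}
```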
Why This Fix Works
The fragile part of the old path was the very first bootstrap / collective window, where lazy kernel loading could stall the first reduce kernel and the cross-stream `cudaMemcpyAsync` path, host buffers could be registered with the wrong location, and RPC startup could silently fail. This PR removes those failure points, making PG bootstrap and the first collective much more stable on TENT/NVLink.
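The fail-fast idea for port binding can be illustrated with a plain POSIX-socket sketch (this is not Mooncake's RPC code; it only shows the detect-release-retry pattern):

```cpp
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>
#include <cstdint>
#include <cstdio>

// Try to bind and listen on a port; on any failure, release the socket and
// report failure to the caller so it can pick a different port, instead of
// silently treating the bind as a success. Returns the listening fd, or -1.
int tryStartOnPort(uint16_t port) {
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) return -1;
    sockaddr_in addr{};
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = htons(port);
    if (bind(fd, reinterpret_cast<sockaddr*>(&addr), sizeof(addr)) != 0 ||
        listen(fd, 16) != 0) {
        std::fprintf(stderr, "port %u unavailable, retrying elsewhere\n", port);
        close(fd);  // fail fast: tear everything down before retrying
        return -1;
    }
    return fd;
}
```

A caller would loop: pick a port, call `tryStartOnPort()`, and only publish its address (the analogue of `server_name` publication) once a valid fd is returned.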
cc @caozhanhao @UNIDY2002 @yuechen-sys
Module
* mooncake-transfer-engine
* mooncake-store
* mooncake-ep
* mooncake-integration
* mooncake-p2p-store
* mooncake-wheel
* mooncake-pg
* mooncake-rl

Type of Change
How Has This Been Tested?
Tested with `mooncake-wheel/tests/test_mooncake_backend.py`.

The CUDA symptom can also be illustrated with `cuda_tail_preload_deadlock_demo.cu`. Expected behavior from the demo: without preloading, the cross-stream `cudaMemcpyAsync` stalls until the spin kernel is released; with preloading, it does not.
Checklist
* Ran `./scripts/code_format.sh` before submitting.