[TE] Fix simultaneous open handshake in RdmaEndpoint#1733
alogfans merged 9 commits into kvcache-ai:main
Conversation
Summary of Changes (Gemini Code Assist): This pull request resolves a critical race condition in the RDMA transport layer that occurred when two endpoints attempted to establish a connection simultaneously. The fix introduces a new connection state, refactors the handshake process to be lock-free during RPC calls, and enables symmetric merging of connection parameters, ensuring that endpoints are reused rather than aggressively deleted. This change aims to prevent connection hangs and improve the stability of RDMA communication, especially in scenarios with high concurrency.
Code Review
This pull request introduces a robust fix for a race condition during simultaneous RDMA handshake attempts. The solution is well-designed, incorporating a new CONNECTING state to serialize handshake attempts from the same endpoint and releasing the lock during the blocking RPC to prevent deadlocks. The changes correctly handle the "simultaneous open" scenario by allowing endpoints to merge symmetrically. My feedback includes a minor suggestion to improve code readability by replacing a magic number with a named constant.
Force-pushed from e2f389e to af5c0f6.
Great fix for the simultaneous open bug! One alternative worth considering: instead of merging after a collision, we could eliminate the collision entirely by assigning roles deterministically: the side with the lexicographically smaller GID always initiates, and the other side waits passively. This would remove the need for the symmetric merge logic.

Both approaches have precedent: this PR follows a pattern similar to Raft's conflict resolution (let collisions happen, then reconcile), while the role-assignment approach mirrors etcd's peer connection convention ("lower ID initiates").

Curious whether you considered this direction, and if so, what made you prefer the current approach.
Thanks for the insightful comment! The main consideration is that endpoints are established on demand, exactly when a worker needs to transmit data. Since we do not know the data flow ahead of time, we cannot deterministically assign the initiator in advance.

If we enforce a rule such as "the smaller GID initiates, the larger GID waits," we run into a problem when the larger-GID side is the first (or only) side that needs to send data. It would remain stuck in a passive state (until a timeout), while the smaller-GID side might never initiate if it has nothing to send.

In practice, our data flow is usually unidirectional, which naturally avoids simultaneous opens and is probably why this bug remained hidden until recently.

By the way, your suggestion is perfectly valid for a proactive connection setup. In fact, Mooncake PG has a warm-up path that runs in the background, which could adopt this convention.
Resolved review thread (outdated) on mooncake-transfer-engine/src/transport/rdma_transport/rdma_endpoint.cpp.
Resolved review thread on mooncake-transfer-engine/src/transport/rdma_transport/rdma_transport.cpp.
Hi @alogfans, @stmatengss, I've got access to an eRDMA machine and uploaded a simple test to verify the endpoint re-establishment behavior. The test scenario is as follows:
I tested two different implementations of the eRDMA re-establishment path: "Reset only" (resetting the existing QPs) and "Destroy + Recreate" (the current version).

These results confirm that this workaround is necessary for eRDMA compatibility and that this PR does not compromise that behavior. Detailed logs are attached below, and the reproduction commands are included in the test file comments.
Pull request overview
Fixes a race in TE’s RDMA connection handshake (“simultaneous open”) by adjusting the RdmaEndPoint state machine so concurrent active/passive handshakes reuse/merge an endpoint instead of deleting/recreating it.
Changes:
- Add a `CONNECTING` state and a lock-free (unlock-before-RPC) active handshake flow to avoid deadlocks and duplicate handshakes.
- Update passive handshake handling to reuse existing endpoints rather than aggressively deleting them.
- Add a manual eRDMA endpoint re-establishment test executable.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| mooncake-transfer-engine/src/transport/rdma_transport/rdma_transport.cpp | Stops deleting endpoints in the passive handshake path; always reuses/creates via context->endpoint(). |
| mooncake-transfer-engine/src/transport/rdma_transport/rdma_endpoint.cpp | Implements CONNECTING, lock-free RPC handshake, peer-QP matching/idempotence, and eRDMA reconstruct path. |
| mooncake-transfer-engine/include/transport/rdma_transport/rdma_endpoint.h | Adds CONNECTING, peer QP tracking, and handshake wait/backoff constants; introduces disconnectForReestablish(). |
| mooncake-transfer-engine/tests/erdma_endpoint_reestablish_test.cpp | Adds a manual two-process eRDMA re-establishment validation program. |
| mooncake-transfer-engine/tests/CMakeLists.txt | Builds the new manual test executable. |
```cpp
    // RPC and it is passively established in setupConnectionsByPassive, simply
    // reuse the existing endpoint.
    if (connected()) {
        if (peer_qp_num_list_ == peer_desc.qp_num) {
```
Can peer_qp_num_list vectors be directly compared?
Yes, we can compare them directly.
qp_num_list contains the peer QP numbers and is order-sensitive. If we receive a different qp_num_list from the same peer NIC path, we can confirm that the peer QPs have been re-initialized (e.g. due to a process restart), and therefore a connection re-establishment is required.
If both the peer NIC path and qp_num_list remain unchanged, we treat it as a duplicate connection setup and avoid resetting the connection. This situation can occur when multiple endpoints are being established concurrently from the same peer NIC path. Reusing the existing connection is necessary; otherwise, later requests could overwrite and invalidate previously created QPs (as reported in TENT's PR #1705; that PR also compares qp_num_list). Although this PR introduces a 'wait existing handshake' mechanism that should prevent concurrent endpoint setup, I've chosen to keep the reuse logic for safety and for consistency with TENT, since concurrent endpoint setup is more likely to occur (if there are cases the 'wait existing handshake' mechanism fails to cover) than the rare edge case described below.
There is, however, a theoretically possible but rare edge case where peer QPs are re-initialized but happen to reuse the exact same QP numbers with the exact same order. In such cases, we would incorrectly skip the connection re-establishment.
A more robust solution could involve introducing a session ID, but that seems to add unnecessary complexity at the current stage. My initial thought is that simply attaching a session ID from the active side would not eliminate the need for the qp_num_list comparison in setupConnectionsByActive, because during a simultaneous open a node acts as both the active and passive side at once. Consequently, the session ID sent by the active-side thread would differ from the one received by the passive-side thread handling the peer's request. Perhaps a more viable solution would be for both sides to generate and exchange their own unique IDs -- similar to how QP numbers are assigned -- forming a (local_session_id, peer_session_id) pair that uniquely identifies the connection. However, integrating this into the handshake process in a robust manner doesn't seem straightforward.
```
 * 1. Start etcd:
 *    etcd --listen-client-urls http://127.0.0.1:18222 \
 *         --advertise-client-urls http://127.0.0.1:18222
 *
```
1. Make this test more general. 2. Add this test to CI.yml.
I've updated the test to use GTest, making it much cleaner and more generic. However, I think we can't enable it in CI, since this test is meaningless without RDMA devices.
Our CI servers have RDMA NICs, so we're ready to test.
> 1. Make this test more general. 2. Add this test to CI.yml.
@stmatengss Just double-checking: should I add this test to the T-one CI (under scripts/tone_tests/) instead of CI.yml? I noticed that the T-one CI happens to have two eRDMA NICs available, whereas ci.yml uses GitHub-hosted runners which don't seem to have RDMA NICs. Let me know what you think!
> 1. Make this test more general. 2. Add this test to CI.yml.
>
> @stmatengss Just double-checking: should I add this test to the T-one CI (under scripts/tone_tests/) instead of CI.yml? I noticed that the T-one CI happens to have two eRDMA NICs available, whereas ci.yml uses GitHub-hosted runners which don't seem to have RDMA NICs. Let me know what you think!
I looked into the T-one CI workflow and realized that integrating this test perhaps need to modify the external tone-cli repository.
Currently, ci.yml uploads the mooncake wheel as an artifact, and integration-test.yml passes that artifact ID to T-one CI, which then runs scripts/tone_tests/ on their servers. To include the transfer engine test, we would need to upload it as an additional artifact. However, passing an extra artifact ID also requires tone-cli changes to handle it.
Since this expands the scope quite a bit, would it be okay to handle this in a separate PR?
@alogfans Please double-check. These code changes are on the critical path.
* [TE] Fix simultaneous open handshake in RdmaEndpoint
* Keep the same logic for ERDMA.
* Apply gemini-code-assist's suggestion.
* Fix endpoint reinitialization.
* Add disconnect and waiting with back-off.
* Add eRDMA Endpoint Re-establishment Test
* Include the test in cmake
* Address reviewer comments
* Better log message.
Description
In the current RdmaTransport in TE, when two workers on different ranks attempt to establish an endpoint with each other concurrently (similar to a "simultaneous open" in TCP), they can fall into a race condition that eventually leads to the use of deleted endpoints.
The exact failure sequence is as follows:

1. Workers on two ranks each create an endpoint (A1 and B1) and send their handshake RPCs to each other concurrently.
2. Each passive handler (`RdmaTransport::onSetupRdmaConnections`) finds the existing (but unconnected) endpoint A1/B1. Instead of reusing it, it aggressively calls `deleteEndpoint()` and creates new endpoints (A2 and B2) to respond.
3. `reclaimEndpoint()` eventually triggers and completely destroys their QPs, so the workers end up using deleted endpoints.

Solution
We modified the connection establishment state machine to support simultaneous open, allowing endpoints to merge symmetrically rather than overwrite each other.
- `CONNECTING` state: introduced to indicate that an endpoint is actively waiting for an RPC reply. This prevents duplicate RPC calls from the same worker.
- Lock released during RPC: the `RWSpinlock` is released before sending the blocking RPC. This avoids deadlocks and allows the passive RPC handler to process incoming requests concurrently on the same endpoint.
- Symmetric merge: the passive handler reuses a `CONNECTING` endpoint, directly applying the peer's parameters to it and advancing its state to `CONNECTED`.

Note: Due to the lack of an eRDMA testing environment, the original destruction and construction logic is preserved in `RdmaEndPoint::setupConnectionsByPassive` when `CONFIG_ERDMA` is defined.

Related PRs
Hopefully this PR can resolve the root cause of the above issues.
cc @UNIDY2002 @jiafuzha
Module
- [x] `mooncake-transfer-engine`
- [ ] `mooncake-store`
- [ ] `mooncake-ep`
- [ ] `mooncake-integration`
- [ ] `mooncake-p2p-store`
- [ ] `mooncake-wheel`
- [ ] `mooncake-pg`
- [ ] `mooncake-rl`

Type of Change
How Has This Been Tested?
Tested with `test_mooncake_backend.py` and confirmed the previous hang issue is fixed: no hang is observed when location is not specified in `registerLocalMemory`.

Checklist
- [x] Ran `./scripts/code_format.sh` before submitting.