
[TE] Fix simultaneous open handshake in RdmaEndpoint#1733

Merged
alogfans merged 9 commits into kvcache-ai:main from caozhanhao:pr/fix_simultaneous_open
Mar 30, 2026

Conversation

@caozhanhao
Contributor

Description

In the current RdmaTransport in TE, when two workers on different ranks attempt to establish an endpoint with each other concurrently (similar to a "simultaneous open" in TCP), they can hit a race condition that eventually leads to the use of deleted endpoints.

The exact failure sequence is as follows:

  1. Concurrent Active Requests: Both workers (e.g., Rank A and Rank B) reserve an endpoint in their local EndpointStore (call them A1 and B1) and hold the lock while sending an RPC handshake to each other.
  2. Aggressive Deletion: When the RPC requests arrive at the peers, the Passive handler (RdmaTransport::onSetupRdmaConnections) finds the existing (but unconnected) endpoints A1/B1. Instead of reusing them, it aggressively calls deleteEndpoint() and creates new endpoints (A2 and B2) to respond.
  3. Crossed Connection: The RPCs return to the original workers. The workers unknowingly initialize the original, now orphaned endpoints (A1 and B1) using the QPNs of the newly created endpoints (B2 and A2). Now A1 communicates with B2, and B1 communicates with A2.
  4. Fatal Garbage Collection: Because A1 and B1 were deleted from the store, reclaimEndpoint() eventually triggers and completely destroys their QPs.
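The orphaning in steps 2-4 can be modeled with a toy store. This is an illustrative sketch only; `Ep`, `Store`, `reserve`, and `passiveReplace` are hypothetical names, not the actual EndpointStore API:

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <string>

// Minimal stand-in for an endpoint and the store that owns it.
struct Ep { bool connected = false; };
using Store = std::map<std::string, std::shared_ptr<Ep>>;

// Active side: reserve an endpoint under a peer key (like A1).
std::shared_ptr<Ep> reserve(Store& s, const std::string& k) {
    auto ep = std::make_shared<Ep>();
    s[k] = ep;
    return ep;
}

// Buggy passive handler: aggressively deletes the existing entry and
// creates a fresh one (like A2) to answer the handshake. The active
// side still holds A1, which the store no longer tracks.
std::shared_ptr<Ep> passiveReplace(Store& s, const std::string& k) {
    s.erase(k);           // deleteEndpoint(): A1 becomes orphaned
    return reserve(s, k); // A2 is what the peer actually talks to
}
```

After `passiveReplace`, the active side "connects" its orphaned handle while the store-tracked endpoint stays unconnected, which is exactly the crossed state that reclamation later destroys.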

Solution

We modified the connection establishment state machine to support simultaneous open, allowing endpoints to merge symmetrically rather than overwrite each other.

  1. New CONNECTING State: Introduced to indicate that an endpoint is actively waiting for an RPC reply. This prevents duplicate RPC calls from the same worker.
  2. Lock-Free RPC: The RWSpinlock is released before sending the blocking RPC. This avoids deadlocks and allows the Passive RPC handler to process incoming requests concurrently on the same endpoint.
  3. Symmetric Connection Merging: Instead of deleting existing endpoints, the Passive handler now accepts the incoming handshake and reuses the CONNECTING endpoint, directly applying the peer's parameters to it and advancing its state to CONNECTED.

Note: Due to the lack of an eRDMA testing environment, the original destruction and construction logic is preserved in RdmaEndPoint::setupConnectionsByPassive when CONFIG_ERDMA is defined.
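The three changes above can be sketched as a small status machine. All names here (`Status`, `tryStartHandshake`, `acceptPassive`) are illustrative simplifications, not the real RdmaEndPoint interface:

```cpp
#include <cstdint>
#include <vector>

// Simplified endpoint states: CONNECTING marks an in-flight active handshake.
enum class Status { INITIALIZED, CONNECTING, CONNECTED };

struct Endpoint {
    Status status = Status::INITIALIZED;
    std::vector<uint32_t> peer_qp_num_list;

    // Active side: claim the handshake. Returns false if a handshake is
    // already in flight, preventing duplicate RPCs from the same worker.
    // The blocking RPC itself is then sent with the lock released.
    bool tryStartHandshake() {
        if (status != Status::INITIALIZED) return false;
        status = Status::CONNECTING;
        return true;
    }

    // Passive side: instead of deleting a colliding CONNECTING endpoint,
    // apply the peer's parameters to it and advance it to CONNECTED.
    void acceptPassive(const std::vector<uint32_t>& peer_qps) {
        peer_qp_num_list = peer_qps;
        status = Status::CONNECTED;
    }
};
```

During a simultaneous open, both sides end up in CONNECTING; each passive handler then merges the incoming handshake into the same endpoint rather than racing to replace it.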

Related PRs

Hopefully this PR can resolve the root cause of the above issues.

cc @UNIDY2002 @jiafuzha

Module

  • Transfer Engine (mooncake-transfer-engine)
  • Mooncake Store (mooncake-store)
  • Mooncake EP (mooncake-ep)
  • Integration (mooncake-integration)
  • P2P Store (mooncake-p2p-store)
  • Python Wheel (mooncake-wheel)
  • PyTorch Backend (mooncake-pg)
  • Mooncake RL (mooncake-rl)
  • CI/CD
  • Docs
  • Other

Type of Change

  • Bug fix
  • New feature
  • Refactor
  • Breaking change
  • Documentation update
  • Other

How Has This Been Tested?

Tested with test_mooncake_backend.py and confirmed the previous hang issue is fixed: no hang is observed when location is not specified in registerLocalMemory.

Checklist

  • I have performed a self-review of my own code.
  • I have formatted my own code using ./scripts/code_format.sh before submitting.
  • I have updated the documentation.
  • I have added tests to prove my changes are effective.

@gemini-code-assist
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request resolves a critical race condition in the RDMA transport layer that occurred when two endpoints attempted to establish a connection simultaneously. The fix introduces a new connection state, refactors the handshake process to be lock-free during RPC calls, and enables symmetric merging of connection parameters, ensuring that endpoints are reused rather than aggressively deleted. This change aims to prevent connection hangs and improve the stability of RDMA communication, especially in scenarios with high concurrency.

Highlights

  • Simultaneous Open Race Condition Fix: Addressed a race condition in RdmaEndpoint during simultaneous connection attempts from two workers, which previously led to the usage of deleted endpoints and connection failures.
  • New CONNECTING State: Introduced a CONNECTING state to the RdmaEndPoint state machine to prevent duplicate active handshake attempts and manage concurrent connection setups more robustly.
  • Lock-Free RPC for Handshake: Modified the active connection setup to release the endpoint lock before sending the blocking RPC handshake, preventing deadlocks and allowing passive handlers to process incoming requests concurrently.
  • Symmetric Connection Merging: Updated the passive connection handler to reuse existing CONNECTING endpoints and merge connection parameters symmetrically, rather than aggressively deleting and recreating endpoints.
  • ERDMA Specific Logic Preservation: Preserved the original endpoint destruction and construction logic within setupConnectionsByPassive when CONFIG_ERDMA is defined, due to the lack of an eRDMA testing environment.


Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a robust fix for a race condition during simultaneous RDMA handshake attempts. The solution is well-designed, incorporating a new CONNECTING state to serialize handshake attempts from the same endpoint and releasing the lock during the blocking RPC to prevent deadlocks. The changes correctly handle the "simultaneous open" scenario by allowing endpoints to merge symmetrically. My feedback includes a minor suggestion to improve code readability by replacing a magic number with a named constant.

@codecov-commenter

codecov-commenter commented Mar 24, 2026

⚠️ Please install the Codecov app to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

❌ Patch coverage is 0% with 254 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
...ine/src/transport/rdma_transport/rdma_endpoint.cpp 0.00% 130 Missing ⚠️
...er-engine/tests/rdma_endpoint_reestablish_test.cpp 0.00% 124 Missing ⚠️


@caozhanhao caozhanhao force-pushed the pr/fix_simultaneous_open branch from e2f389e to af5c0f6 on March 24, 2026 06:37
@KMSorSMS KMSorSMS self-requested a review March 24, 2026 06:59
@staryxchen
Collaborator

Great fix for the simultaneous open bug! One alternative worth considering: instead of merging after collision, we could eliminate the collision entirely by assigning roles deterministically — the side with the lexicographically smaller ibv_gid always initiates, the other side waits passively.

This would remove the need for the CONNECTING state, lock-free RPC, and peer_qp_num_list_ bookkeeping. The trade-off is that the passive side needs a timeout to promote itself to initiator if the other side fails to connect.

Both approaches have precedent: this PR follows a pattern similar to Raft's conflict resolution (let collisions happen, then reconcile), while the role-assignment approach mirrors etcd's peer connection convention ("lower ID initiates").

Curious whether you considered this direction, and if so, what made you prefer the current approach.

@caozhanhao
Contributor Author

caozhanhao commented Mar 24, 2026

Great fix for the simultaneous open bug! One alternative worth considering: instead of merging after collision, we could eliminate the collision entirely by assigning roles deterministically — the side with the lexicographically smaller ibv_gid always initiates, the other side waits passively.

This would remove the need for the CONNECTING state, lock-free RPC, and peer_qp_num_list_ bookkeeping. The trade-off is that the passive side needs a timeout to promote itself to initiator if the other side fails to connect.

Both approaches have precedent: this PR follows a pattern similar to Raft's conflict resolution (let collisions happen, then reconcile), while the role-assignment approach mirrors etcd's peer connection convention ("lower ID initiates").

Curious whether you considered this direction, and if so, what made you prefer the current approach.

Thanks for the insightful comment! The main consideration is that endpoints are established on demand, exactly when a worker needs to transmit data. Since we do not know the data flow ahead of time, we cannot deterministically assign the initiator in advance. If we enforce a rule such as "the smaller GID initiates, the larger GID waits," we run into a problem when the larger-GID side is the first (or only) side that needs to send data: it would remain stuck in a passive state (at least until a timeout fired), while the smaller-GID side might never initiate if it has nothing to send. In other words, in RdmaTransport the "passive" side is purely reactive -- it doesn't even know a connection is needed until the active side explicitly sends the handshake RPC.

In practice, our data flow is usually unidirectional, which naturally avoids simultaneous opens and is probably why this bug remained hidden until recently.

By the way, your suggestion is perfectly valid for a proactive connection setup. In fact, Mooncake PG has a warm-up path in a background ConnectionPoller that does use a role-assignment approach.
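For reference, the role-assignment alternative discussed above boils down to a deterministic tiebreak on the GID. A minimal sketch, assuming raw 16-byte GIDs and a hypothetical `shouldInitiate` helper (not part of the Mooncake codebase):

```cpp
#include <array>
#include <cstdint>

// A GID is 16 raw bytes; std::array's operator< compares lexicographically.
using Gid = std::array<uint8_t, 16>;

// The side with the lexicographically smaller GID initiates; the other
// side waits passively. Ties are assumed impossible (GIDs are unique).
inline bool shouldInitiate(const Gid& local, const Gid& remote) {
    return local < remote;
}
```

Exactly one side returns true for any distinct pair, which is what eliminates the collision -- at the cost of the passive-side timeout/promotion logic described above.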

Contributor

@KMSorSMS KMSorSMS left a comment


LGTM, good job

@caozhanhao
Contributor Author

caozhanhao commented Mar 27, 2026

Hi @alogfans, @stmatengss,

I've got access to an eRDMA machine and uploaded a simple test to verify the endpoint re-establishment behavior.

The test scenario is as follows:

  1. Target Setup: The target starts and registers a memory segment to etcd.
  2. Initiator Phase 1 (Write & "Crash"):
    • The initiator starts and connects to the target.
    • It writes a test data pattern to the target's memory.
    • It then shuts down completely to simulate a crash or a process restart.
  3. Initiator Phase 2 (Restart, Read & Verify):
    • The initiator restarts with a brand-new transfer engine instance.
    • It re-opens the target segment and reads the data back from the target.
    • Finally, it compares the read data against the original pattern to verify that the connection was successfully re-established and the data is correct.

I tested two different implementations of disconnectForReestablish on this machine:

  • Reset-only (replace disconnectForReestablish with disconnectUnlocked): This approach resets the QPs without destroying them.
    • Result: The initiator hangs when reading data back in Phase 2.
  • Full Recreate (current implementation): This approach fully destroys and recreates the QPs.
    • Result: The test passed and showed no abnormal behavior.

These results confirm that this workaround is necessary for eRDMA compatibility and this PR does not compromise that behavior.

Detailed logs are attached below, and the reproduction commands are included in the test file comments.

Details

ibv_devinfo

$ ibv_devinfo  
hca_id: erdma_0  
       transport:                      eRDMA (0)  
       fw_ver:                         0.2.0  
       node_guid:                      <REDACTED>
       sys_image_guid:                 <REDACTED>
       vendor_id:                      0x1ded  
       vendor_part_id:                 4223  
       hw_ver:                         0x0  
       phys_port_cnt:                  1  
               port:   1  
                       state:                  PORT_ACTIVE (4)  
                       max_mtu:                1024 (3)  
                       active_mtu:             1024 (3)  
                       sm_lid:                 0  
                       port_lid:               0  
                       port_lmc:               0x00  
                       link_layer:             Ethernet  
  
hca_id: erdma_1  
       transport:                      eRDMA (0)  
       fw_ver:                         0.2.0  
       node_guid:                      <REDACTED>
       sys_image_guid:                 <REDACTED>
       vendor_id:                      0x1ded  
       vendor_part_id:                 4223  
       hw_ver:                         0x0  
       phys_port_cnt:                  1  
               port:   1  
                       state:                  PORT_ACTIVE (4)  
                       max_mtu:                1024 (3)  
                       active_mtu:             1024 (3)  
                       sm_lid:                 0  
                       port_lid:               0  
                       port_lmc:               0x00  
                       link_layer:             Ethernet

Reset only

Target

$ sudo ./build/mooncake-transfer-engine/tests/erdma_endpoint_reestablish_test --mode=target \  
 --metadata_server=127.0.0.1:18222 \  
 --local_server_name=127.0.0.1:12345 --device_name=erdma_0  
WARNING: Logging before InitGoogleLogging() is written to STDERR  
I0327 21:47:23.000655 154215 transfer_engine_impl.cpp:597] Metrics reporting is disabled (set MC_TE_METRIC=1 to enable)  
I0327 21:47:23.000694 154215 transfer_engine_impl.cpp:105] Transfer Engine parseHostNameWithPort. server_name: 127.0.0.1 port: 12345  
I0327 21:47:23.001685 154215 transfer_metadata_plugin.cpp:1127] Found active interface eth0 with IP 192.168.0.139  
I0327 21:47:23.001693 154215 transfer_metadata_plugin.cpp:1127] Found active interface eth1 with IP 192.168.0.140  
I0327 21:47:23.001695 154215 transfer_metadata_plugin.cpp:1127] Found active interface br-9ad27c85edb6 with IP 172.160.0.1  
I0327 21:47:23.001698 154215 transfer_metadata_plugin.cpp:1116] Skipping interface docker0 (not UP or not RUNNING)  
I0327 21:47:23.001700 154215 transfer_metadata_plugin.cpp:1127] Found active interface br-84edf5efe0de with IP 172.18.0.1  
I0327 21:47:23.001703 154215 transfer_metadata_plugin.cpp:1127] Found active interface flannel.1 with IP 10.60.0.0  
I0327 21:47:23.001706 154215 transfer_metadata_plugin.cpp:1127] Found active interface cni0 with IP 10.60.0.1  
I0327 21:47:23.001724 154215 transfer_engine_impl.cpp:172] Transfer Engine RPC using new RPC mapping, listening on 192.168.0.139:15338  
I0327 21:47:23.010514 154215 rdma_transport.cpp:63] [RDMA] Relaxed ordering disabled via MC_IB_PCI_RELAXED_ORDERING=0. Falling back to strict ordering.  
I0327 21:47:23.010569 154215 rdma_context.cpp:77] Using SIEVE endpoint store  
I0327 21:47:23.011476 154215 rdma_context.cpp:605] Find best gid index: 1 on erdma_0/ (network state: with network device)  
I0327 21:47:23.011823 154215 rdma_context.cpp:140] RDMA device: erdma_0, LID: 0, GID: (GID_Index 1) 00:00:00:00:00:00:00:00:00:00:ff:ff:c0:a8:00:8b  
I0327 21:47:23.115234 154215 erdma_endpoint_reestablish_test.cpp:276] Target is up. Waiting for RDMA connections and operations...  
W0327 21:47:29.284736 154257 rdma_endpoint.cpp:321] Re-establish connection: EndPoint: local 127.0.0.1:12345@erdma_0, peer 127.0.0.1:12346@erdma_1  

Initiator

$ sudo ./build/mooncake-transfer-engine/tests/erdma_endpoint_reestablish_test --metadata_server=127.0.0.1:18222 \  
 --segment_id=127.0.0.1:12345 --local_server_name=127.0.0.1:12346 \  
 --device_name=erdma_1  
WARNING: Logging before InitGoogleLogging() is written to STDERR  
I0327 21:47:26.205616 154294 erdma_endpoint_reestablish_test.cpp:136] ========== Phase 1: Start, Connect & Write ==========  
I0327 21:47:26.205670 154294 transfer_engine_impl.cpp:597] Metrics reporting is disabled (set MC_TE_METRIC=1 to enable)  
I0327 21:47:26.205679 154294 transfer_engine_impl.cpp:105] Transfer Engine parseHostNameWithPort. server_name: 127.0.0.1 port: 12346  
I0327 21:47:26.206720 154294 transfer_metadata_plugin.cpp:1127] Found active interface eth0 with IP 192.168.0.139  
I0327 21:47:26.206728 154294 transfer_metadata_plugin.cpp:1127] Found active interface eth1 with IP 192.168.0.140  
I0327 21:47:26.206732 154294 transfer_metadata_plugin.cpp:1127] Found active interface br-9ad27c85edb6 with IP 172.160.0.1  
I0327 21:47:26.206735 154294 transfer_metadata_plugin.cpp:1116] Skipping interface docker0 (not UP or not RUNNING)  
I0327 21:47:26.206738 154294 transfer_metadata_plugin.cpp:1127] Found active interface br-84edf5efe0de with IP 172.18.0.1  
I0327 21:47:26.206741 154294 transfer_metadata_plugin.cpp:1127] Found active interface flannel.1 with IP 10.60.0.0  
I0327 21:47:26.206745 154294 transfer_metadata_plugin.cpp:1127] Found active interface cni0 with IP 10.60.0.1  
I0327 21:47:26.206764 154294 transfer_engine_impl.cpp:172] Transfer Engine RPC using new RPC mapping, listening on 192.168.0.139:15714  
I0327 21:47:26.215229 154294 rdma_transport.cpp:63] [RDMA] Relaxed ordering disabled via MC_IB_PCI_RELAXED_ORDERING=0. Falling back to strict ordering.  
I0327 21:47:26.215288 154294 rdma_context.cpp:77] Using SIEVE endpoint store  
I0327 21:47:26.216233 154294 rdma_context.cpp:605] Find best gid index: 1 on erdma_1/ (network state: with network device)  
I0327 21:47:26.217396 154294 rdma_context.cpp:140] RDMA device: erdma_1, LID: 0, GID: (GID_Index 1) 00:00:00:00:00:00:00:00:00:00:ff:ff:c0:a8:00:8c  
I0327 21:47:26.277781 154294 erdma_endpoint_reestablish_test.cpp:167] Writing 16777216 bytes to Target...  
I0327 21:47:26.283797 154294 erdma_endpoint_reestablish_test.cpp:181] Phase 1: Write Completed. Tearing down connection...  
I0327 21:47:27.221803 154294 erdma_endpoint_reestablish_test.cpp:188] Simulating Initiator Crash/Restart... Waiting 2 seconds.  
I0327 21:47:29.221902 154294 erdma_endpoint_reestablish_test.cpp:191] ========== Phase 2: Restart, Re-establish Endpoint & Read ==========  
I0327 21:47:29.222019 154294 transfer_engine_impl.cpp:597] Metrics reporting is disabled (set MC_TE_METRIC=1 to enable)  
I0327 21:47:29.222043 154294 transfer_engine_impl.cpp:105] Transfer Engine parseHostNameWithPort. server_name: 127.0.0.1 port: 12346  
I0327 21:47:29.223660 154294 transfer_metadata_plugin.cpp:1127] Found active interface eth0 with IP 192.168.0.139  
I0327 21:47:29.223670 154294 transfer_metadata_plugin.cpp:1127] Found active interface eth1 with IP 192.168.0.140  
I0327 21:47:29.223673 154294 transfer_metadata_plugin.cpp:1127] Found active interface br-9ad27c85edb6 with IP 172.160.0.1  
I0327 21:47:29.223676 154294 transfer_metadata_plugin.cpp:1116] Skipping interface docker0 (not UP or not RUNNING)  
I0327 21:47:29.223680 154294 transfer_metadata_plugin.cpp:1127] Found active interface br-84edf5efe0de with IP 172.18.0.1  
I0327 21:47:29.223681 154294 transfer_metadata_plugin.cpp:1127] Found active interface flannel.1 with IP 10.60.0.0  
I0327 21:47:29.223685 154294 transfer_metadata_plugin.cpp:1127] Found active interface cni0 with IP 10.60.0.1  
I0327 21:47:29.223697 154294 transfer_engine_impl.cpp:172] Transfer Engine RPC using new RPC mapping, listening on 192.168.0.139:15952  
I0327 21:47:29.225520 154294 rdma_transport.cpp:63] [RDMA] Relaxed ordering disabled via MC_IB_PCI_RELAXED_ORDERING=0. Falling back to strict ordering.  
I0327 21:47:29.225531 154294 rdma_context.cpp:77] Using SIEVE endpoint store  
I0327 21:47:29.225744 154294 rdma_context.cpp:605] Find best gid index: 1 on erdma_1/ (network state: with network device)  
I0327 21:47:29.226050 154294 rdma_context.cpp:140] RDMA device: erdma_1, LID: 0, GID: (GID_Index 1) 00:00:00:00:00:00:00:00:00:00:ff:ff:c0:a8:00:8c  
I0327 21:47:29.283594 154294 erdma_endpoint_reestablish_test.cpp:213] Re-opening segment to reconstruct RDMA Endpoint...  
I0327 21:47:29.283883 154294 erdma_endpoint_reestablish_test.cpp:219] Reading data back over new Endpoint...

Destroy + Recreate (current version)

Target

$ sudo ./build/mooncake-transfer-engine/tests/erdma_endpoint_reestablish_test --mode=target \  
 --metadata_server=127.0.0.1:18222 \  
 --local_server_name=127.0.0.1:12345 --device_name=erdma_0  
WARNING: Logging before InitGoogleLogging() is written to STDERR  
I0327 21:49:35.164558 157413 transfer_engine_impl.cpp:597] Metrics reporting is disabled (set MC_TE_METRIC=1 to enable)  
I0327 21:49:35.164608 157413 transfer_engine_impl.cpp:105] Transfer Engine parseHostNameWithPort. server_name: 127.0.0.1 port: 12345  
I0327 21:49:35.165606 157413 transfer_metadata_plugin.cpp:1127] Found active interface eth0 with IP 192.168.0.139  
I0327 21:49:35.165613 157413 transfer_metadata_plugin.cpp:1127] Found active interface eth1 with IP 192.168.0.140  
I0327 21:49:35.165616 157413 transfer_metadata_plugin.cpp:1127] Found active interface br-9ad27c85edb6 with IP 172.160.0.1  
I0327 21:49:35.165619 157413 transfer_metadata_plugin.cpp:1116] Skipping interface docker0 (not UP or not RUNNING)  
I0327 21:49:35.165622 157413 transfer_metadata_plugin.cpp:1127] Found active interface br-84edf5efe0de with IP 172.18.0.1  
I0327 21:49:35.165624 157413 transfer_metadata_plugin.cpp:1127] Found active interface flannel.1 with IP 10.60.0.0  
I0327 21:49:35.165627 157413 transfer_metadata_plugin.cpp:1127] Found active interface cni0 with IP 10.60.0.1  
I0327 21:49:35.165642 157413 transfer_engine_impl.cpp:172] Transfer Engine RPC using new RPC mapping, listening on 192.168.0.139:15833  
I0327 21:49:35.174588 157413 rdma_transport.cpp:63] [RDMA] Relaxed ordering disabled via MC_IB_PCI_RELAXED_ORDERING=0. Falling back to strict ordering.  
I0327 21:49:35.174639 157413 rdma_context.cpp:77] Using SIEVE endpoint store  
I0327 21:49:35.175544 157413 rdma_context.cpp:605] Find best gid index: 1 on erdma_0/ (network state: with network device)  
I0327 21:49:35.184914 157413 rdma_context.cpp:140] RDMA device: erdma_0, LID: 0, GID: (GID_Index 1) 00:00:00:00:00:00:00:00:00:00:ff:ff:c0:a8:00:8b  
I0327 21:49:35.296639 157413 erdma_endpoint_reestablish_test.cpp:276] Target is up. Waiting for RDMA connections and operations...  
W0327 21:49:42.237241 157454 rdma_endpoint.cpp:321] Re-establish connection: EndPoint: local 127.0.0.1:12345@erdma_0, peer 127.0.0.1:12346@erdma_1

Initiator

$ sudo ./build/mooncake-transfer-engine/tests/erdma_endpoint_reestablish_test --metadata_server=127.0.0.1:18222 \  
 --segment_id=127.0.0.1:12345 --local_server_name=127.0.0.1:12346 \  
 --device_name=erdma_1  
WARNING: Logging before InitGoogleLogging() is written to STDERR  
I0327 21:49:38.987094 157507 erdma_endpoint_reestablish_test.cpp:136] ========== Phase 1: Start, Connect & Write ==========  
I0327 21:49:38.987149 157507 transfer_engine_impl.cpp:597] Metrics reporting is disabled (set MC_TE_METRIC=1 to enable)  
I0327 21:49:38.987159 157507 transfer_engine_impl.cpp:105] Transfer Engine parseHostNameWithPort. server_name: 127.0.0.1 port: 12346  
I0327 21:49:38.988116 157507 transfer_metadata_plugin.cpp:1127] Found active interface eth0 with IP 192.168.0.139  
I0327 21:49:38.988129 157507 transfer_metadata_plugin.cpp:1127] Found active interface eth1 with IP 192.168.0.140  
I0327 21:49:38.988132 157507 transfer_metadata_plugin.cpp:1127] Found active interface br-9ad27c85edb6 with IP 172.160.0.1  
I0327 21:49:38.988135 157507 transfer_metadata_plugin.cpp:1116] Skipping interface docker0 (not UP or not RUNNING)  
I0327 21:49:38.988138 157507 transfer_metadata_plugin.cpp:1127] Found active interface br-84edf5efe0de with IP 172.18.0.1  
I0327 21:49:38.988142 157507 transfer_metadata_plugin.cpp:1127] Found active interface flannel.1 with IP 10.60.0.0  
I0327 21:49:38.988143 157507 transfer_metadata_plugin.cpp:1127] Found active interface cni0 with IP 10.60.0.1  
I0327 21:49:38.988157 157507 transfer_engine_impl.cpp:172] Transfer Engine RPC using new RPC mapping, listening on 192.168.0.139:16225  
I0327 21:49:39.006590 157507 rdma_transport.cpp:63] [RDMA] Relaxed ordering disabled via MC_IB_PCI_RELAXED_ORDERING=0. Falling back to strict ordering.  
I0327 21:49:39.006641 157507 rdma_context.cpp:77] Using SIEVE endpoint store  
I0327 21:49:39.007591 157507 rdma_context.cpp:605] Find best gid index: 1 on erdma_1/ (network state: with network device)  
I0327 21:49:39.007987 157507 rdma_context.cpp:140] RDMA device: erdma_1, LID: 0, GID: (GID_Index 1) 00:00:00:00:00:00:00:00:00:00:ff:ff:c0:a8:00:8c  
I0327 21:49:39.101894 157507 erdma_endpoint_reestablish_test.cpp:167] Writing 16777216 bytes to Target...  
I0327 21:49:39.107182 157507 erdma_endpoint_reestablish_test.cpp:181] Phase 1: Write Completed. Tearing down connection...  
I0327 21:49:40.021786 157507 erdma_endpoint_reestablish_test.cpp:188] Simulating Initiator Crash/Restart... Waiting 2 seconds.  
I0327 21:49:42.021867 157507 erdma_endpoint_reestablish_test.cpp:191] ========== Phase 2: Restart, Re-establish Endpoint & Read ==========  
I0327 21:49:42.021926 157507 transfer_engine_impl.cpp:597] Metrics reporting is disabled (set MC_TE_METRIC=1 to enable)  
I0327 21:49:42.021936 157507 transfer_engine_impl.cpp:105] Transfer Engine parseHostNameWithPort. server_name: 127.0.0.1 port: 12346  
I0327 21:49:42.023159 157507 transfer_metadata_plugin.cpp:1127] Found active interface eth0 with IP 192.168.0.139  
I0327 21:49:42.023166 157507 transfer_metadata_plugin.cpp:1127] Found active interface eth1 with IP 192.168.0.140  
I0327 21:49:42.023169 157507 transfer_metadata_plugin.cpp:1127] Found active interface br-9ad27c85edb6 with IP 172.160.0.1  
I0327 21:49:42.023172 157507 transfer_metadata_plugin.cpp:1116] Skipping interface docker0 (not UP or not RUNNING)  
I0327 21:49:42.023175 157507 transfer_metadata_plugin.cpp:1127] Found active interface br-84edf5efe0de with IP 172.18.0.1  
I0327 21:49:42.023178 157507 transfer_metadata_plugin.cpp:1127] Found active interface flannel.1 with IP 10.60.0.0  
I0327 21:49:42.023181 157507 transfer_metadata_plugin.cpp:1127] Found active interface cni0 with IP 10.60.0.1  
I0327 21:49:42.023192 157507 transfer_engine_impl.cpp:172] Transfer Engine RPC using new RPC mapping, listening on 192.168.0.139:15045  
I0327 21:49:42.056448 157507 rdma_transport.cpp:63] [RDMA] Relaxed ordering disabled via MC_IB_PCI_RELAXED_ORDERING=0. Falling back to strict ordering.  
I0327 21:49:42.056463 157507 rdma_context.cpp:77] Using SIEVE endpoint store  
I0327 21:49:42.056684 157507 rdma_context.cpp:605] Find best gid index: 1 on erdma_1/ (network state: with network device)  
I0327 21:49:42.057088 157507 rdma_context.cpp:140] RDMA device: erdma_1, LID: 0, GID: (GID_Index 1) 00:00:00:00:00:00:00:00:00:00:ff:ff:c0:a8:00:8c  
I0327 21:49:42.236001 157507 erdma_endpoint_reestablish_test.cpp:213] Re-opening segment to reconstruct RDMA Endpoint...  
I0327 21:49:42.236313 157507 erdma_endpoint_reestablish_test.cpp:219] Reading data back over new Endpoint...  
I0327 21:49:42.253219 157507 erdma_endpoint_reestablish_test.cpp:244] >>> ENDPOINT RECONSTRUCTION VERIFICATION: SUCCESS <<<

Contributor

Copilot AI left a comment


Pull request overview

Fixes a race in TE’s RDMA connection handshake (“simultaneous open”) by adjusting the RdmaEndPoint state machine so concurrent active/passive handshakes reuse/merge an endpoint instead of deleting/recreating it.

Changes:

  • Add a CONNECTING state and lock-free (unlock-before-RPC) active handshake flow to avoid deadlocks and duplicate handshakes.
  • Update passive handshake handling to reuse existing endpoints rather than aggressively deleting them.
  • Add a manual eRDMA endpoint re-establishment test executable.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
mooncake-transfer-engine/src/transport/rdma_transport/rdma_transport.cpp Stops deleting endpoints in the passive handshake path; always reuses/creates via context->endpoint().
mooncake-transfer-engine/src/transport/rdma_transport/rdma_endpoint.cpp Implements CONNECTING, lock-free RPC handshake, peer-QP matching/idempotence, and eRDMA reconstruct path.
mooncake-transfer-engine/include/transport/rdma_transport/rdma_endpoint.h Adds CONNECTING, peer QP tracking, and handshake wait/backoff constants; introduces disconnectForReestablish().
mooncake-transfer-engine/tests/erdma_endpoint_reestablish_test.cpp Adds a manual two-process eRDMA re-establishment validation program.
mooncake-transfer-engine/tests/CMakeLists.txt Builds the new manual test executable.


// RPC and it is passively established in setupConnectionsByPassive, simply
// reuse the existing endpoint.
if (connected()) {
if (peer_qp_num_list_ == peer_desc.qp_num) {
Collaborator


Can peer_qp_num_list vectors be directly compared?

Contributor Author


Yes, we can compare them directly.

qp_num_list contains the peer QP numbers and is order-sensitive. If we received a different qp_num_list from the same peer NIC path, we can confirm the peer QP has been re-initialized (e.g. due to a process restart), and therefore a connection re-establishment is required.

If both the peer NIC path and qp_num_list remain unchanged, we treat it as a duplicate connection setup and avoid resetting the connection. This can occur when multiple endpoints are being established concurrently from the same peer NIC path. Reusing the existing connection is necessary; otherwise, later requests could overwrite and invalidate previously created QPs (as reported in TENT's PR #1705, which also compares qp_num_list). Although this PR introduces a 'wait for existing handshake' mechanism that should prevent concurrent endpoint setup, I've chosen to keep the reuse logic for safety and for consistency with TENT: concurrent endpoint setup (should any case escape the 'wait for existing handshake' path) is more likely than the rare edge case described below.

There is, however, a theoretically possible but rare edge case where peer QPs are re-initialized but happen to reuse the exact same QP numbers with the exact same order. In such cases, we would incorrectly skip the connection re-establishment.

A more robust solution could involve introducing a session id, but it seems adding unnecessary complexity at current stage. My initial thought is that simply attaching a session ID from the active side would not eliminate the need for qp_num_list comparison in setupConnectionsByActive. Because during simultaneous open, a node acts as both the active and passive side simultaneously. Consequently, the session ID sent by the active side thread would differ from the one received by the another passive thread handling the peer's request. Perhaps a more viable solution would be for both sides to generate and exchange their own unique IDs -- similar to how QP numbers are assigned -- forming a (local_session_id, peer_session_id) pair to uniquely identify the connection. However, integrating this into the handshake process in a robust manner doesn't seem straightforward.
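The order-sensitive comparison discussed above can be sketched as a stand-alone snippet. `needsReestablish` and the `uint32_t` QP-number vectors are hypothetical stand-ins for the actual RdmaEndpoint fields; the point is that `std::vector::operator==` already provides exactly the needed semantics:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// std::vector::operator== compares sizes first and then elements in order,
// which matches the order-sensitive semantics described above.
// needsReestablish is an illustrative helper, not the actual TE code.
bool needsReestablish(const std::vector<uint32_t>& stored_peer_qp_num_list,
                      const std::vector<uint32_t>& incoming_qp_num_list) {
    // Identical list (same numbers, same order): duplicate setup request
    // from the same peer NIC path -- reuse the existing connection.
    // Any difference: the peer re-initialized its QPs (e.g. after a process
    // restart), so the connection must be re-established.
    return stored_peer_qp_num_list != incoming_qp_num_list;
}
```

As noted above, this check cannot distinguish a peer restart that happens to reuse identical QP numbers in identical order; a (local_session_id, peer_session_id) pair would close that gap at the cost of extra handshake complexity.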

Collaborator

👍

```cpp
 * 1. Start etcd:
 *     etcd --listen-client-urls http://127.0.0.1:18222 \
 *          --advertise-client-urls http://127.0.0.1:18222
 *
```
Collaborator

  1. Make this test more general. 2. Add this test to CI.yml.

Contributor Author
@caozhanhao Mar 28, 2026

I've updated the test to use GTest, making it much cleaner and more generic. However, I think we can't enable it in CI, since this test is meaningless without RDMA devices.

Collaborator

Our CI servers have RDMA NICs, so we're ready to test.

Contributor Author

> 1. Make this test more general. 2. Add this test to CI.yml.

@stmatengss Just double-checking: should I add this test to the T-one CI (under scripts/tone_tests/) instead of CI.yml? I noticed that the T-one CI happens to have two eRDMA NICs available, whereas ci.yml uses GitHub-hosted runners, which don't seem to have RDMA NICs. Let me know what you think!

Contributor Author
@caozhanhao Mar 29, 2026

> > 1. Make this test more general. 2. Add this test to CI.yml.
>
> @stmatengss Just double-checking: should I add this test to the T-one CI (under scripts/tone_tests/) instead of CI.yml? I noticed that the T-one CI happens to have two eRDMA NICs available, whereas ci.yml uses GitHub-hosted runners, which don't seem to have RDMA NICs. Let me know what you think!

I looked into the T-one CI workflow and realized that integrating this test would likely require modifying the external tone-cli repository.

Currently, ci.yml uploads the mooncake wheel as an artifact, and integration-test.yml passes that artifact ID to T-one CI, which then runs scripts/tone_tests/ on their servers. To include the transfer engine test, we would need to upload it as an additional artifact. However, passing an extra artifact ID would also require tone-cli changes to handle it.

Since this expands the scope quite a bit, would it be okay to handle this in a separate PR?

Collaborator

@stmatengss left a comment

LGTM

@stmatengss
Collaborator

@alogfans Please double-check. These code changes are on the critical path.

Collaborator

@alogfans left a comment

LGTM

@alogfans alogfans merged commit 4b3d44f into kvcache-ai:main Mar 30, 2026
17 of 19 checks passed
Vincent-Bo-ali pushed a commit to openanolis/Mooncake that referenced this pull request Mar 30, 2026
* [TE] Fix simultaneous open handshake in RdmaEndpoint

* Keep the same logic for ERDMA.

* apply gemini-code-assist's suggestion.

* Fix endpoint reinitialization.

* Add disconnect and waiting with back-off.

* Add eRDMA Endpoint Re-establishment Test

* Include the test in cmake

* Address reviewer comments

* Better log message.
whn09 pushed a commit to whn09/Mooncake that referenced this pull request Apr 4, 2026

7 participants