fix qp_leak bug after PR#384 #1643
Conversation
Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request resolves a critical Queue Pair (QP) leak that emerged following PR#384. While PR#384 fixed a concurrency bug by maintaining endpoint status, it inadvertently introduced resource leaks under network instability. The current changes add a targeted fix that deactivates endpoints when they are removed from active use, preventing the QP leak and ensuring robust resource management without reintroducing the original concurrency problem.
Code Review
This pull request addresses a Queue Pair (QP) leak by marking endpoints as inactive upon deletion, which facilitates the reclamation of associated RDMA resources. While this is a correct step towards resolving the issue, the fix is currently incomplete. The set_active(false) call has not been applied to the eviction logic within FIFOEndpointStore, meaning QP leaks will persist when endpoints are evicted due to store capacity. To fully resolve the QP leak, it is recommended to also apply the set_active(false) call within FIFOEndpointStore::evictEndpoint.
```cpp
// in case it is setting up connection or submitting slice
if (iter != endpoint_map_.end()) {
    waiting_list_.insert(iter->second);
    iter->second->set_active(false);
}
```
The pull request aims to fix a QP leak by setting the endpoint status to inactive when it is deleted. This ensures that the endpoint can be reclaimed once all outstanding transfers are complete. However, the fix is only applied to the deleteEndpoint method and is missing from the FIFOEndpointStore::evictEndpoint method. When an endpoint is evicted from the FIFO store because it has reached its maximum size, it is moved to the waiting_list_ but its active_ status remains true. As a result, RdmaEndPoint::hasOutstandingSlice() will always return true, and FIFOEndpointStore::reclaimEndpoint() will never remove the endpoint from the waiting_list_. This leads to a leak of RDMA Queue Pairs (QPs) and other resources, which can eventually lead to a Denial of Service (DoS) when the system runs out of available QPs.
Note that SIEVEEndpointStore::evictEndpoint correctly calls set_active(false) on line 212, so FIFOEndpointStore should be updated to match this behavior.
Overall LGTM. But what do the authors think about this issue?
The behaviors of SIEVEEndpointStore and FIFOEndpointStore do indeed need to match, but the purpose of this PR is to fix the QP leakage issue introduced in PR#384. PR#384 did not modify the evictEndpoint function, and I cannot predict what impact changing it would have.
This reverts commit ff6274f.
@litiantian00 Thank you for your contribution! Nevertheless, it seems to affect the function of Mooncake PG, a collective communication library. To be specific, Mooncake PG hangs during initialization after this commit is introduced. We're investigating this issue. Do you have any possible clues? FYI, Mooncake PG involves massive RDMA connection setup among all peers during initialization. Each process can serve as a sender or a receiver to its peer. The correctness implicitly depends on the detailed implementation of TE (through a way I haven't figured out yet). During initialization, …
This reverts commit 4fb64ae.
Update: When running …
@UNIDY2002 This PR simply reverted the commit of PR #384. PR #384 did not destroy the endpoint; this PR triggered the endpoint's destruction by calling …
@litiantian00 Thank you very much for your assistance, and for your PR unveiling the underlying problem. I have located the issue. |
This reverts commit ff6274f.
Co-authored-by: litiantian.118 <litiantian.118@jd.com>
PR#384 leaves the endpoint status unchanged to fix a concurrency bug in the following scenarios: "1 NIC for the prefill server, and config another NIC for the decode server. Or try setting up a greater MC_TRANSFER_TIMEOUT when your servers are under heavy load. Or try intra-node nvlink." However, that commit causes a QP leak during network fluctuations, accompanied by the log "Worker: Process failed for slice ......: transport retry counter exceeded". Therefore, we would like to revert this change, or find an alternative approach that avoids the QP leak issue.