
fix qp_leak bug after PR#384 #1643

Merged
alogfans merged 1 commit into kvcache-ai:main from litiantian00:qp-leak-patch
Mar 11, 2026

Conversation

@litiantian00 (Contributor)

PR#384 leaves the endpoint status unchanged to fix a concurrency bug in scenarios such as: "1 NIC for the prefill server, and config another NIC for the decode server. Or try setting up a greater MC_TRANSFER_TIMEOUT when your servers are under heavy load. Or try intra-node nvlink." However, that commit causes a QP leak during network fluctuations, accompanied by the log "Worker: Process failed for slice ......: transport retry counter exceeded". Therefore, we would like to revert this change, or find an alternative approach that avoids the QP leak.

@gemini-code-assist (Contributor)

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request resolves a critical Queue Pair (QP) leak issue that emerged following the implementation of PR#384. While PR#384 aimed to fix a concurrency bug by maintaining endpoint status, it inadvertently led to resource leaks under network instability. The current changes introduce a targeted fix to properly deactivate endpoints when they are removed from active use, thereby preventing the QP leak and ensuring robust resource management without reintroducing the original concurrency problem.

Highlights

  • QP Leak Fix: Addressed a Queue Pair (QP) leak bug that was introduced by a previous pull request (PR#384) during network fluctuations.
  • Endpoint Deactivation: Ensured that endpoints are explicitly marked as inactive (set_active(false)) when they are being deleted and moved to a waiting list, preventing resource leaks.


Changelog
  • mooncake-transfer-engine/src/transport/rdma_transport/endpoint_store.cpp
    • Added a call to set_active(false) when an endpoint is moved to the waiting_list_ during deletion in FIFOEndpointStore::deleteEndpoint.
    • Added a call to set_active(false) when an endpoint is moved to the waiting_list_ during deletion in SIEVEEndpointStore::deleteEndpoint.
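The two deleteEndpoint changes above can be sketched in miniature. This is a hypothetical, simplified model, not the real code in endpoint_store.cpp (which carries locks, QP handles, and RDMA state): names like waiting_list_, set_active, hasOutstandingSlice, and reclaimEndpoint mirror the diff and the review discussion, but the struct layouts and the reclaim loop here are illustrative only.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <unordered_map>
#include <unordered_set>

// Hypothetical, simplified endpoint: an active endpoint is treated as busy;
// an inactive one becomes reclaimable once its slice count drains to zero.
struct Endpoint {
    bool active = true;
    int outstanding_slices = 0;
    void set_active(bool a) { active = a; }
    bool hasOutstandingSlice() const { return active || outstanding_slices > 0; }
};

// Hypothetical, simplified store modeling the fix in this PR.
struct EndpointStore {
    std::unordered_map<std::string, std::shared_ptr<Endpoint>> endpoint_map_;
    std::unordered_set<std::shared_ptr<Endpoint>> waiting_list_;

    void deleteEndpoint(const std::string& peer) {
        auto iter = endpoint_map_.find(peer);
        if (iter != endpoint_map_.end()) {
            // The fix: deactivate the endpoint as it is parked on the
            // waiting list; otherwise reclaimEndpoint() would see it as
            // busy forever and the QP would leak.
            waiting_list_.insert(iter->second);
            iter->second->set_active(false);
            endpoint_map_.erase(iter);
        }
    }

    // Periodic reclaim pass: frees endpoints whose transfers have drained.
    size_t reclaimEndpoint() {
        size_t reclaimed = 0;
        for (auto it = waiting_list_.begin(); it != waiting_list_.end();) {
            if (!(*it)->hasOutstandingSlice()) {
                it = waiting_list_.erase(it);  // QP would be destroyed here
                ++reclaimed;
            } else {
                ++it;
            }
        }
        return reclaimed;
    }
};
```

With the set_active(false) call in place, a deleted endpoint with no outstanding slices is reclaimed on the next pass; without it, the model reproduces the leak this PR fixes.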


@gemini-code-assist bot left a comment


Code Review

This pull request addresses a Queue Pair (QP) leak by marking endpoints as inactive upon deletion, which facilitates the reclamation of associated RDMA resources. While this is a correct step towards resolving the issue, the fix is currently incomplete. The set_active(false) call has not been applied to the eviction logic within FIFOEndpointStore, meaning QP leaks will persist when endpoints are evicted due to store capacity. To fully resolve the QP leak, it is recommended to also apply the set_active(false) call within FIFOEndpointStore::evictEndpoint.

```cpp
// in case it is setting up connection or submitting slice
if (iter != endpoint_map_.end()) {
    waiting_list_.insert(iter->second);
    iter->second->set_active(false);
}
```

security-high

The pull request aims to fix a QP leak by setting the endpoint status to inactive when it is deleted. This ensures that the endpoint can be reclaimed once all outstanding transfers are complete. However, the fix is only applied to the deleteEndpoint method and is missing from the FIFOEndpointStore::evictEndpoint method. When an endpoint is evicted from the FIFO store because it has reached its maximum size, it is moved to the waiting_list_ but its active_ status remains true. As a result, RdmaEndPoint::hasOutstandingSlice() will always return true, and FIFOEndpointStore::reclaimEndpoint() will never remove the endpoint from the waiting_list_. This leads to a leak of RDMA Queue Pairs (QPs) and other resources, which can eventually lead to a Denial of Service (DoS) when the system runs out of available QPs.

Note that SIEVEEndpointStore::evictEndpoint correctly calls set_active(false) on line 212, so FIFOEndpointStore should be updated to match this behavior.
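The reviewer's proposed companion fix in FIFOEndpointStore::evictEndpoint can be sketched as follows. Again, this is a minimal hypothetical model, not the real implementation (the actual eviction path involves the store's FIFO bookkeeping and locking); it only shows the one-line change the review asks for: deactivating the victim as it moves to waiting_list_, matching what SIEVEEndpointStore already does.

```cpp
#include <cassert>
#include <deque>
#include <memory>
#include <unordered_set>

// Hypothetical endpoint with just the flag relevant to the review comment.
struct Endpoint {
    bool active = true;
    void set_active(bool a) { active = a; }
};

// Hypothetical FIFO store: eviction pops the oldest endpoint and parks it
// on the waiting list for later reclamation.
struct FIFOEndpointStore {
    std::deque<std::shared_ptr<Endpoint>> fifo_;
    std::unordered_set<std::shared_ptr<Endpoint>> waiting_list_;

    void evictEndpoint() {
        if (fifo_.empty()) return;
        auto victim = fifo_.front();
        fifo_.pop_front();
        waiting_list_.insert(victim);
        // Proposed matching fix: without this the evicted endpoint stays
        // active on the waiting list and is never reclaimed.
        victim->set_active(false);
    }
};
```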

Collaborator

Overall LGTM. But what do the authors think about this issue?

Contributor (Author)

The behaviors of SIEVEEndpointStore and FIFOEndpointStore do indeed need to match, but the purpose of this PR is to fix the QP leak introduced in PR#384. PR#384 did not modify the evictEndpoint function, and I cannot predict what impact changing it would have.

@codecov-commenter

⚠️ Please install the Codecov app to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

❌ Patch coverage is 0% with 2 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| ...ne/src/transport/rdma_transport/endpoint_store.cpp | 0.00% | 2 Missing ⚠️ |


@alogfans alogfans merged commit ff6274f into kvcache-ai:main Mar 11, 2026
15 of 16 checks passed
UNIDY2002 added a commit that referenced this pull request Mar 13, 2026
@UNIDY2002 (Collaborator)

@litiantian00 Thank you for your contribution! Nevertheless, it seems to affect the function of Mooncake PG, a collective communication library. Specifically, Mooncake PG hangs during initialization after this commit is introduced.

We're investigating this issue. Do you have any possible clues?

FYI, Mooncake PG involves massive RDMA connection setup among all peers during initialization. Each process can serve as a sender or a receiver to its peers. The correctness implicitly depends on the detailed implementation of TE (in a way I haven't figured out yet).

During initialization, deleteEndpoint is called here:

```cpp
if (context->deleteEndpoint(peer_desc.local_nic_path)) return ERR_ENDPOINT;
```

UNIDY2002 added a commit that referenced this pull request Mar 13, 2026
@UNIDY2002 (Collaborator)

Update: When running mooncake-wheel/tests/test_mooncake_backend.py, I have located a hang at `iter->second.first->set_active(false);`. There seems to be a deadlock, but I haven't yet identified which two functions are competing for the lock.
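A hang of this shape, where one thread blocks forever on a lock, can be illustrated generically. This sketch is NOT the actual Mooncake code path (the competing functions had not been identified at this point in the thread); all names are hypothetical. It only demonstrates how work triggered while a store lock is held can fail to acquire that same lock from another thread; try_lock_for with a short timeout stands in for the indefinite block of a real deadlock so the program terminates.

```cpp
#include <cassert>
#include <chrono>
#include <future>
#include <mutex>

// Hypothetical store lock; a real deadlock would involve whichever mutex
// the two competing functions share.
std::timed_mutex store_mutex;

// Runs on another thread: tries to take the store lock, but gives up after
// a short timeout instead of hanging the way the real deadlock does.
bool try_acquire_from_other_thread() {
    if (store_mutex.try_lock_for(std::chrono::milliseconds(50))) {
        store_mutex.unlock();
        return true;   // acquired: no contention
    }
    return false;      // could not acquire: real code would block forever
}

// Models a caller that holds the store lock across the whole operation
// (e.g., across a set_active(false) that kicks off further work).
bool contender_succeeds_while_lock_held() {
    std::lock_guard<std::timed_mutex> guard(store_mutex);
    auto fut = std::async(std::launch::async, try_acquire_from_other_thread);
    return fut.get();  // waits for the other thread while still holding the lock
}
```

Because the lock is held for the entire wait, the second thread's acquisition necessarily fails, which is the signature to look for when narrowing down the two competing functions.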

@litiantian00 (Contributor, Author)

@UNIDY2002 This PR simply reverts the commit from PR #384. PR #384 did not destroy the endpoint; this PR triggers the endpoint's destruction by calling set_active(false) in the deleteEndpoint method, thus destroying the invalid endpoint. I noticed that the changes in mooncake-wheel/tests/test_mooncake_backend.py and Mooncake PG came later than PR #384, so I suspect improper usage: a deleted endpoint is used for sending/receiving packets without first checking endpoint availability.

@UNIDY2002 (Collaborator)

@litiantian00 Thank you very much for your assistance, and for your PR revealing the underlying problem. I have located the issue.

UNIDY2002 added a commit that referenced this pull request Mar 18, 2026
whn09 pushed a commit to whn09/Mooncake that referenced this pull request Apr 4, 2026
Co-authored-by: litiantian.118 <litiantian.118@jd.com>


4 participants