fix qp_leak bug after PR#384 #1643
Conversation
Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request resolves a critical Queue Pair (QP) leak that emerged following PR#384. While PR#384 fixed a concurrency bug by maintaining endpoint status, it inadvertently introduced resource leaks under network instability. The current changes add a targeted fix that deactivates endpoints when they are removed from active use, preventing the QP leak and ensuring robust resource management without reintroducing the original concurrency problem.
Code Review
This pull request addresses a Queue Pair (QP) leak by marking endpoints as inactive upon deletion, which facilitates the reclamation of associated RDMA resources. While this is a correct step towards resolving the issue, the fix is currently incomplete. The set_active(false) call has not been applied to the eviction logic within FIFOEndpointStore, meaning QP leaks will persist when endpoints are evicted due to store capacity. To fully resolve the QP leak, it is recommended to also apply the set_active(false) call within FIFOEndpointStore::evictEndpoint.
```cpp
// in case it is setting up connection or submitting slice
if (iter != endpoint_map_.end()) {
    waiting_list_.insert(iter->second);
    iter->second->set_active(false);
}
```
The pull request aims to fix a QP leak by setting the endpoint status to inactive when it is deleted. This ensures that the endpoint can be reclaimed once all outstanding transfers are complete. However, the fix is only applied to the deleteEndpoint method and is missing from the FIFOEndpointStore::evictEndpoint method. When an endpoint is evicted from the FIFO store because it has reached its maximum size, it is moved to the waiting_list_ but its active_ status remains true. As a result, RdmaEndPoint::hasOutstandingSlice() will always return true, and FIFOEndpointStore::reclaimEndpoint() will never remove the endpoint from the waiting_list_. This leads to a leak of RDMA Queue Pairs (QPs) and other resources, which can eventually lead to a Denial of Service (DoS) when the system runs out of available QPs.
Note that SIEVEEndpointStore::evictEndpoint correctly calls set_active(false) on line 212, so FIFOEndpointStore should be updated to match this behavior.
Overall LGTM. But what do the authors think about this issue?
The behaviors of SIEVEEndpointStore and FIFOEndpointStore do indeed need to match, but the purpose of this PR is to fix the QP leakage issue introduced in PR#384. PR#384 did not modify the evictEndpoint function, and I cannot predict what impact changing it would have.
This reverts commit ff6274f.
@litiantian00 Thank you for your contribution! Nevertheless, it seems to affect the function of Mooncake PG, a collective communication library. To be specific, Mooncake PG hangs during initialization after this commit is introduced. We're investigating this issue. Do you have any possible clues? FYI, Mooncake PG involves massive RDMA connection setup among all peers during initialization. Each process can serve as a sender or a receiver to its peer. The correctness implicitly depends on the detailed implementation of TE (through a way I haven't figured out yet). During initialization, …
This reverts commit 4fb64ae.
Update: When running …
@UNIDY2002 This PR simply reverted the commit of PR #384. PR #384 did not destroy the endpoint; this PR triggered the endpoint's destruction by calling …
@litiantian00 Thank you very much for your assistance, and for your PR unveiling the underlying problem. I have located the issue. |
This reverts commit ff6274f.
Co-authored-by: litiantian.118 <litiantian.118@jd.com>
PR#384 leaves the endpoint status unchanged to fix a concurrency bug in the following scenarios: "1 NIC for the prefill server, and config another NIC for the decode server. Or try setting up a greater MC_TRANSFER_TIMEOUT when your servers are under heavy load. Or try intra-node nvlink." However, that commit causes a QP leak during network fluctuations, accompanied by the log "Worker: Process failed for slice ......: transport retry counter exceeded". Therefore, we would like to revert this change, or find an alternative approach that avoids the QP leak issue.