[Bugfix][DO NOT MERGE] Free req.req_pool_idx when it is finished in the mixed_chunk mode to solve memory leak#10871
[Bugfix][DO NOT MERGE] Free req.req_pool_idx when it is finished in the mixed_chunk mode to solve memory leak#10871Muqi1029 wants to merge 2 commits intosgl-project:mainfrom
Conversation
Summary of ChangesHello @Muqi1029, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request resolves a critical memory leak within the SGLang Engine's mixed chunking mode. Previously, when a request finished during the initial prefill stage, the Highlights
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
There was a problem hiding this comment.
Code Review
This pull request addresses a memory leak that occurs in mixed-chunk mode when a request finishes during the prefill stage. The original code failed to free the req_pool_idx, leading to a resource leak. The proposed change adds the necessary call to self.req_to_token_pool.free(req.req_pool_idx), which correctly releases the request slot and resolves the memory leak. The fix is targeted and appears correct.
|
not work, still have memory leak |
|
@MichelleWu351 Maybe the situation of memory leak is not the same as yours. I test it in the latest sglang, it still happens. |
|
I can easily reproduce this bug about memory leak: Using the latest sglang: Serverpython -m sglang.launch_server --model-path Qwen/Qwen3-8B --port 8888 --enable-mixed-chunkClientimport requests
import json
url = "http://127.0.0.1:8888/generate"
headers = {
"Content-Type": "application/json",
"Authorization": "Just Keep Me"
}
data = {
"input_ids": [1] * 40957,
}
response = requests.post(url, headers=headers, json=data)
if response.status_code == 200:
result = response.json()
print(json.dumps(result, ensure_ascii=False, indent=2))
else:
print(f"Error: {response.status_code}")
print(response.text)
Result of server:A bad request will make sglang engine fail to check memory leak, leading to the following error, making the whole sglang engine down: ReasonA request with an sglang/python/sglang/srt/managers/tokenizer_manager.py Lines 607 to 616 in a4a3d82 However, it fails later in the engine: sglang/python/sglang/srt/managers/scheduler.py Lines 1369 to 1381 in a4a3d82 This causes the request to be marked as aborted. The aborted request is then added to the sglang/python/sglang/srt/managers/scheduler_output_processor_mixin.py Lines 85 to 89 in a4a3d82 Eventually, this leads to a |
|
DO NOT MERGE since it may be incompatible with retracted req, I am fixing it now |
|
Thanks for bringing this. I think this issue has been solved by #12312 and #12224. In the previous implementation, we set abort and set the finish reason at the same time, which is not compatible with the regular finish check and the overlap scheduler. But now, we set |
Motivation
When using mixed chunking, a request may complete in the first prefill stage. In the original code, only
out_cache_locis freed, whilereq_pool_idxremains allocated. As a result, the scheduler’scheck_memoryfunction detects this as a memory leak and raises an error, which causes the entire SGLang Engine to fail and significantly impacts its stability.Modifications
Free
req_pool_idxinprocess_batch_result_prefill.Accuracy Tests
Benchmarking and Profiling
Checklist