Unify memory management across (overlap, non-overlap) x (page>=1) x (spec, non-spec, spec v2) x (retract, finished) #12224

Merged

hnyls2002 merged 51 commits into main from lsyin/committed-kv-len on Nov 10, 2025
Conversation

hnyls2002 (Collaborator) commented Oct 27, 2025

This PR replaces the old approach to releasing KV cache, which relied on `len(self.origin_input_ids) + max(len(self.output_ids) - 1, 0)` to determine the KV length.

That approach is brittle with overlap scheduling and with multiple finish paths (normal completion, disaggregation-decode finish, retract, abort). With speculative decoding, we also perform over-allocation, which makes allocation/freeing logic even more error-prone.

This PR introduces two explicit notions for a request’s KV cache state:

  • KV committed len: number of KV token slots that have actually been written with real tokens (i.e., valid KV).
  • KV allocated len: number of KV token slots reserved from the allocator (page-aligned), regardless of whether they have been populated.

We tightly couple the allocation steps with updates to kv_committed_len and kv_allocated_len, so these fields faithfully reflect request-level memory usage at all times.
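The coupling between allocation and these two counters can be sketched as follows. This is a minimal illustration under assumed names (`ReqKVState`, `allocate`, `commit` are hypothetical, not the actual SGLang API): allocation rounds the reservation up to whole pages, while committing tracks slots actually written and must never exceed the reservation.

```python
import math
from dataclasses import dataclass

@dataclass
class ReqKVState:
    """Illustrative per-request KV accounting; names are hypothetical."""
    page_size: int
    kv_committed_len: int = 0  # slots actually written with real tokens
    kv_allocated_len: int = 0  # slots reserved from the allocator (page-aligned)

    def allocate(self, num_new_tokens: int) -> None:
        # Reserve whole pages covering committed + incoming tokens.
        needed = self.kv_committed_len + num_new_tokens
        pages = math.ceil(needed / self.page_size)
        self.kv_allocated_len = max(self.kv_allocated_len, pages * self.page_size)

    def commit(self, num_tokens: int) -> None:
        # Mark slots as written; must never exceed the reservation.
        self.kv_committed_len += num_tokens
        assert self.kv_committed_len <= self.kv_allocated_len

req = ReqKVState(page_size=16)
req.allocate(20)  # rounds up to 2 pages: kv_allocated_len == 32
req.commit(20)    # kv_committed_len == 20
```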

The previous flow for freeing a finished request's KV cache looked roughly like this:

```python
for req in reqs_to_process:
    if req.is_finished():  # overlap scheduling may produce "extra" KV allocations
        # Previously required complex and error-prone offset math
        deallocate_extra_tokens(req)
        continue

    check_finish(req)

    if req.is_finished():
        # Free KV cache using the inferred length:
        # len(origin_input_ids) + max(len(output_ids) - 1, 0)
        deallocate_inferred_kv_length(req)
```

We now replace the ad-hoc “extra token” arithmetic with the recorded kv_committed_len and kv_allocated_len. Consumers no longer need to infer offsets: they simply consult these fields.

To ensure correctness:

  • All memory resources are freed in exactly one place.
  • As soon as a request is decided to be finished (check_finish, retract, abort, …), we immediately remove all of its KV cache.
  • We never allocate any memory resources after a request has been found finished.
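Under these invariants, the finish path reduces to a single free call driven by the recorded lengths, with no offset arithmetic. Here is a minimal self-contained sketch; `StubAllocator`, `FinishedReq`, and the allocator interface are illustrative assumptions, not SGLang's real classes:

```python
from dataclasses import dataclass

class StubAllocator:
    """Stand-in allocator that records free calls (illustrative only)."""
    def __init__(self):
        self.freed = []

    def free(self, pool_idx, start, length):
        self.freed.append((pool_idx, start, length))

@dataclass
class FinishedReq:
    req_pool_idx: int
    kv_allocated_len: int
    kv_freed_len: int = 0

def free_finished_req(req, allocator):
    # The single place a finished request's KV memory is released:
    # free exactly what was reserved, using the recorded lengths.
    to_free = req.kv_allocated_len - req.kv_freed_len
    if to_free > 0:
        allocator.free(req.req_pool_idx, start=req.kv_freed_len, length=to_free)
        req.kv_freed_len = req.kv_allocated_len
    # Invariant: nothing is allocated after finish, so freed == allocated.
    assert req.kv_freed_len == req.kv_allocated_len

alloc = StubAllocator()
req = FinishedReq(req_pool_idx=7, kv_allocated_len=32)
free_finished_req(req, alloc)  # frees the full 32 reserved slots
```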

Future TODOs (cc @cctry)

  • Resolve the possible data race when we enable the overlap scheduler: the just-released req_to_token_idx is reused, and the page mapping changes during the current forwarding cycle.
  • Further decouple the memory deallocation logic from the prefix cache logic.

@hnyls2002 hnyls2002 changed the title Unify memory management across (overlap, non-overlap) x (page>=1) x (spec, non-spec, spec v2) x (retract, finished) [WIP] Unify memory management across (overlap, non-overlap) x (page>=1) x (spec, non-spec, spec v2) x (retract, finished) Oct 27, 2025
gemini-code-assist bot commented

Summary of Changes

Hello @hnyls2002, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly refactors and unifies the Key-Value (KV) cache memory management logic across various operational modes and states within the system. By introducing dedicated request-level attributes for tracking allocated, committed, and freed KV cache lengths, the changes aim to standardize memory handling, improve accuracy, and enhance the overall robustness of memory management during request extension, decoding, and completion.

Highlights

  • Request-level KV Cache Tracking: Introduced new attributes kv_committed_len, kv_allocated_len, and kv_freed_len to the Req object to enable more granular and unified memory management for Key-Value (KV) cache across different request states.
  • Memory Allocation Updates: The newly added KV cache length fields are now consistently updated during critical phases such as request extension (prepare_for_extend) and token decoding (prepare_for_decode), ensuring accurate tracking of allocated and committed memory.
  • Memory Deallocation Logic Refinement: The cache_finished_req function has been updated to leverage req.kv_allocated_len for precise memory deallocation upon request completion, and req.kv_freed_len is now used to track freed memory, including an assertion to prevent incorrect freeing.
  • Consistency and Debugging Assertions: An assertion was added in process_batch_result_decode to verify that kv_allocated_len matches kv_freed_len when handling the freeing of delayed tokens, enhancing the robustness and debuggability of the memory management system.
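The consistency assertion described in the last highlight amounts to something like the following. This is an illustrative sketch, not the actual code in process_batch_result_decode; the function and field names are assumptions:

```python
from types import SimpleNamespace

def check_delayed_token_accounting(req):
    """Sketch of the decode-path consistency check (hypothetical names)."""
    if req.is_finished:
        # Once a request is finished, the finish path must already have
        # released everything it reserved.
        assert req.kv_allocated_len == req.kv_freed_len, (
            f"KV leak or double free: allocated={req.kv_allocated_len}, "
            f"freed={req.kv_freed_len}"
        )

ok = SimpleNamespace(is_finished=True, kv_allocated_len=32, kv_freed_len=32)
check_delayed_token_accounting(ok)  # consistent accounting passes silently

bad = SimpleNamespace(is_finished=True, kv_allocated_len=32, kv_freed_len=16)
try:
    check_delayed_token_accounting(bad)
    caught = False
except AssertionError:
    caught = True  # the mismatch is detected
```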

gemini-code-assist bot left a comment
Code Review

This pull request unifies request-level memory management by introducing kv_committed_len, kv_allocated_len, and kv_freed_len to the Req class. These fields are consistently updated during prefill and decode stages. The logic for freeing memory for finished requests is now centralized in cache_finished_req, which uses req.kv_allocated_len as the source of truth. This is a good refactoring that improves robustness. I have one minor suggestion to remove a stale comment that became misleading after the changes.

@hnyls2002 hnyls2002 requested a review from ByronHsu as a code owner October 28, 2025 14:56
@xiezhq-hermann xiezhq-hermann self-assigned this Oct 31, 2025
@github-actions github-actions bot added the speculative-decoding and hicache (Hierarchical Caching for SGLang) labels Nov 10, 2025
@hnyls2002 hnyls2002 changed the title [WIP] Unify memory management across (overlap, non-overlap) x (page>=1) x (spec, non-spec, spec v2) x (retract, finished) Unify memory management across (overlap, non-overlap) x (page>=1) x (spec, non-spec, spec v2) x (retract, finished) Nov 10, 2025
@hnyls2002 hnyls2002 merged commit 665416f into main Nov 10, 2025
93 of 138 checks passed
@hnyls2002 hnyls2002 deleted the lsyin/committed-kv-len branch November 10, 2025 18:56
cctry (Collaborator) commented Nov 10, 2025

> Resolve the possible data race when we enable the overlap scheduler: the just-released req_to_token_idx is reused, and the page mapping changes during the current forwarding cycle

There is another thing we can do:
instead of assigning out_cache_loc to req_to_token in prepare_for_decode, we can assign it during result processing.
Any concerns with this approach for spec decoding?

hnyls2002 (Collaborator, Author) commented

@cctry Not sure, we can discuss offline.
