6 changes: 5 additions & 1 deletion python/sglang/srt/managers/schedule_batch.py
@@ -1359,7 +1359,11 @@ def new_page_count_next_decode(self):
             return len(self.reqs)
         # In the decoding phase, the length of a request's KV cache should be
         # the total length of the request minus 1
-        return sum(1 for req in self.reqs if (req.seqlen - 1) % page_size == 0)
+        return (
+            sum(1 for req in self.reqs if req.seqlen % page_size == 0)
+            if self.enable_overlap
+            else sum(1 for req in self.reqs if (req.seqlen - 1) % page_size == 0)
+        )
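Pulled out of the scheduler, the changed logic can be sketched as a standalone function (a minimal sketch: `Req` is a stand-in for the real request object, and the `page_size == 1` fast path is inferred from the surrounding context):

```python
from dataclasses import dataclass


@dataclass
class Req:
    seqlen: int  # total request length (prompt + generated tokens)


def new_page_count_next_decode(reqs, page_size, enable_overlap):
    """Estimate how many requests will need a fresh KV-cache page
    on the next decode step."""
    if page_size == 1:
        # One token per page: every decoding request needs a new page.
        return len(reqs)
    if enable_overlap:
        # Overlap mode: req.seqlen already includes the token being decoded.
        return sum(1 for req in reqs if req.seqlen % page_size == 0)
    # Non-overlap mode: the KV cache currently holds seqlen - 1 tokens,
    # so a new page is needed exactly when that length fills a page boundary.
    return sum(1 for req in reqs if (req.seqlen - 1) % page_size == 0)
```

For example, with `page_size = 4`, a request at `seqlen = 5` has 4 cached tokens (one full page) in non-overlap mode, so the next decoded token starts a new page.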
Collaborator
Does this fix prevent OOM, or does it simply reduce the chance of retracting decodes? If it is the latter, let's keep it simple and conservative.

Collaborator Author
@pansicheng Jun 13, 2025

The -1 adjustment here primarily aligns the scheduler's estimate of new pages in non-overlap mode with the actual allocation logic (where req.seqlen == seqlens in non-overlap scenarios). Without the -1, the number of new pages estimated during scheduling would diverge from what is actually allocated, which could ultimately lead to OOM.

The actual allocation code logic: https://github.com/sgl-project/sglang/blob/v0.4.7/python/sglang/srt/mem_cache/paged_allocator.py#L133
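One way to see why the estimate must match the allocator: a fresh page is required exactly when appending one token to a cache of seq_len - 1 tokens crosses a page boundary. A small sketch of this equivalence (hypothetical helper names, not the actual paged_allocator code):

```python
def pages_needed(num_tokens, page_size):
    """Pages required to hold num_tokens tokens (ceiling division)."""
    return -(-num_tokens // page_size)


def needs_new_page(seq_len, page_size):
    """True when growing the KV cache from seq_len - 1 to seq_len tokens
    forces the allocator to hand out a fresh page."""
    return pages_needed(seq_len, page_size) > pages_needed(seq_len - 1, page_size)


# This is equivalent to the scheduler's (seq_len - 1) % page_size == 0 check,
# which is why dropping the -1 in non-overlap mode miscounts new pages.
```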



def check_decode_mem(self, buf_multiplier=1):
tokens_required = (