Clean match_prefix and prepare_for_extend for mem cache V2 #11200

Merged
cctry merged 8 commits into main from shiyang/mem_v2/clean on Oct 8, 2025

Conversation

@cctry
Collaborator

@cctry cctry commented Oct 3, 2025

Motivation

Preparation for mem_cache V2.
This PR cleans the existing code to separate memory operations from other procedures.

Modifications

Two function implementations are changed:

  1. init_next_round: remove adjust_max_prefix_ids so that most field setting happens before match_prefix

  2. prepare_for_extend:

  • Move all memory-related code together.
  • Change the write to req_to_token_pool to cover both prefix_indices and out_cache_loc.
  • As a result, the get_last_loc function is replaced by a simpler implementation that does not search req_to_token_pool.

Also

  • For req.prefix_indices, the default value is changed from [] to torch.empty((0,), dtype=torch.int64) to respect the type hint.
  • Fix chunk_cache to use the correct token-index dtype (int64).
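A minimal sketch of the default-value change above; the Req class here is a hypothetical stand-in for the real request object, showing only why the empty int64 tensor is preferable to a Python list:

```python
import torch

class Req:
    def __init__(self):
        # Old default was the Python list []; the new default is an empty
        # int64 tensor, matching the torch.Tensor type hint.
        self.prefix_indices = torch.empty((0,), dtype=torch.int64)

req = Req()
# Tensor ops now work uniformly whether or not any prefix was matched,
# with no list/tensor type checks on the consumer side.
combined = torch.cat([req.prefix_indices, torch.arange(3, dtype=torch.int64)])
```

With a list default, every consumer would need an isinstance check or an ad hoc conversion before calling tensor ops such as torch.cat.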

Accuracy Tests

Benchmarking and Profiling

Checklist

@gemini-code-assist
Contributor

Summary of Changes

Hello @cctry, I'm Gemini Code Assist[1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request focuses on internal code cleanup and refactoring, primarily within the schedule_batch.py and schedule_policy.py files. The main goal is to prepare the codebase for mem_cache V2 by separating memory allocation operations from other procedures. This involves streamlining prefix matching logic, centralizing memory writing functions, and reorganizing the prepare_for_extend method for improved clarity and maintainability.

Highlights

  • Refactored Prefix Matching Logic: The logic for calculating the maximum prefix length and token IDs has been moved directly into the init_next_round_input method, eliminating the separate adjust_max_prefix_ids function for cleaner code and better flow.
  • Centralized Memory Writing: A new write_cache_indices method has been introduced to centralize the process of writing both prefix tokens and allocated output cache locations (out_cache_loc) to the req_to_token_pool, simplifying memory management within prepare_for_extend.
  • Streamlined prepare_for_extend: The prepare_for_extend method has been significantly refactored to consolidate memory allocation operations and delegate the complex req_to_token_pool writing logic to the new write_cache_indices method, improving modularity and readability.
  • Enhanced Triton Kernel for Prefix Handling: The write_req_to_token_pool_triton kernel has been updated to explicitly handle and write prefix tensors, ensuring consistent behavior across different memory writing paths (Triton and non-Triton).
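The consolidated write described in the highlights can be sketched as follows. write_cache_indices, req_to_token_pool, prefix_indices, and out_cache_loc are named in the PR, but the signature, shapes, and the small driver below are hypothetical:

```python
import torch

def write_cache_indices(req_to_token_pool, req_idx, prefix_indices, out_cache_loc):
    """Write both the matched prefix and the newly allocated cache
    locations into one request's row of the token pool, in one place."""
    prefix_len = prefix_indices.numel()
    req_to_token_pool[req_idx, :prefix_len] = prefix_indices
    req_to_token_pool[req_idx, prefix_len : prefix_len + out_cache_loc.numel()] = out_cache_loc
    # Because the prefix is written here, the "last location" is simply the
    # tail of prefix_indices (or -1 when empty) -- no pool search needed.
    return prefix_indices[-1].item() if prefix_len > 0 else -1

pool = torch.full((2, 8), -1, dtype=torch.int64)
last = write_cache_indices(
    pool, 0,
    torch.tensor([10, 11], dtype=torch.int64),
    torch.tensor([20, 21, 22], dtype=torch.int64),
)
```

Keeping both writes in one function is what lets get_last_loc shrink to a tail lookup instead of scanning req_to_token_pool.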

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@gemini-code-assist gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request refactors init_next_round and prepare_for_extend to improve code structure and separate memory operations. The changes are generally good and improve code clarity. I've provided a few suggestions to further enhance maintainability and performance, including simplifying tensor creation logic and optimizing a Triton kernel by pre-computing a cumulative sum.
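The reviewer's cumulative-sum suggestion can be illustrated in plain PyTorch (names and shapes here are hypothetical): pre-compute each request's start offset into out_cache_loc once on the host, so a kernel instance can slice its segment directly instead of re-summing extend lengths inside the kernel.

```python
import torch

# Per-request number of newly extended tokens (hypothetical values).
extend_lens = torch.tensor([3, 1, 2], dtype=torch.int64)
# Exclusive prefix sum: start offset of each request's segment.
starts = torch.cumsum(extend_lens, dim=0) - extend_lens
# Flat buffer of allocated cache locations for the whole batch.
out_cache_loc = torch.arange(100, 106, dtype=torch.int64)

# Each "kernel instance" reads only its own segment via (start, length).
segments = [
    out_cache_loc[s : s + l]
    for s, l in zip(starts.tolist(), extend_lens.tolist())
]
```

The prefix sum costs one O(batch) pass up front and removes an O(batch) reduction from every kernel program, which is the usual trade the reviewer is pointing at.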

@cctry cctry requested a review from zhyncs as a code owner October 3, 2025 22:37
@cctry cctry changed the title from "Clean match_prefix and prepare_for_extend" to "Clean match_prefix and prepare_for_extend for mem cache V2" on Oct 3, 2025
@cctry cctry merged commit f3764c2 into main Oct 8, 2025
96 of 110 checks passed
@cctry cctry deleted the shiyang/mem_v2/clean branch October 8, 2025 00:54
ch-tiger1 pushed a commit to ch-tiger1/sglang that referenced this pull request Oct 9, 2025
