[Feature] Mixed ChunkPrefill Optimization Roadmap #13626

@hzh0425

Background

Mixed ChunkPrefill is a scheduling mode in SGLang that executes Prefill and Decode requests within the same batch to improve GPU utilization. The current implementation needs refactoring to improve performance and to stay compatible with other features, notably the overlap scheduler and speculative decoding.
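To make the idea concrete, here is a minimal sketch of how a mixed batch could be assembled under a per-step token budget. It is a toy model, not SGLang's scheduler: `Request`, `build_mixed_batch`, and `token_budget` are hypothetical names introduced for illustration, and the policy (decode requests first at one token each, leftover budget filled with prefill chunks) is an assumption.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Request:
    """Hypothetical request record (not an SGLang type)."""
    rid: str
    prompt_len: int     # total prompt tokens
    prefilled: int = 0  # prompt tokens already processed

    @property
    def is_decoding(self) -> bool:
        return self.prefilled >= self.prompt_len


def build_mixed_batch(
    requests: List[Request], token_budget: int
) -> List[Tuple[str, str, int]]:
    """Return (rid, phase, num_tokens) entries for one forward pass."""
    batch = []
    remaining = token_budget

    # Decode requests are admitted first; each costs exactly one token.
    for req in requests:
        if req.is_decoding and remaining > 0:
            batch.append((req.rid, "decode", 1))
            remaining -= 1

    # The leftover budget is filled with chunks of pending prefills.
    for req in requests:
        if remaining == 0:
            break
        if not req.is_decoding:
            chunk = min(remaining, req.prompt_len - req.prefilled)
            batch.append((req.rid, "prefill", chunk))
            req.prefilled += chunk
            remaining -= chunk

    return batch


# Request "a" has finished prefill; request "b" still has a 10-token prompt.
reqs = [Request("a", prompt_len=4, prefilled=4), Request("b", prompt_len=10)]
step1 = build_mixed_batch(reqs, token_budget=8)
step2 = build_mixed_batch(reqs, token_budget=8)
```

With a budget of 8, the first step pairs one decode token for `a` with a 7-token prefill chunk for `b`; the second step finishes the remaining 3 prompt tokens of `b` alongside another decode token for `a`.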

Action Items

  1. Kernel Optimization

    • Introduce a dedicated mixed chunk attention backend
    • Explore POD-Attention or separate kernel launches for Extend/Decode operations
    • Benchmark and adopt the optimal approach for performance
  2. Scheduler Refactoring

    • Refactor memory allocation/deallocation logic for mixed chunk mode
    • Clean up inconsistent memory management patterns
    • Improve code maintainability and clarity
  3. Overlap Scheduler Compatibility

    • Fix memory leak issues when running with overlap scheduling
    • Support decode future tokens mode in mixed chunk context
    • Ensure proper integration with overlap pipeline
  4. Speculative Decoding Support

    • Enable mixed chunk mode with speculative decoding
  5. Testing & Validation

    • Add comprehensive unit tests for mixed chunk mode
    • Cover feature interactions and edge cases
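As one possible shape for the memory-related tests above, the sketch below checks a simple invariant: after a series of mixed steps, freeing everything that was allocated should return the pool to its initial state. `TokenPool` and `run_mixed_step` are hypothetical stand-ins written for this example and do not correspond to SGLang's actual allocator or scheduler API.

```python
class TokenPool:
    """Hypothetical fixed-size pool of KV cache slots."""

    def __init__(self, size: int):
        self.size = size
        self.free_slots = list(range(size))

    def alloc(self, n: int) -> list:
        if n > len(self.free_slots):
            raise MemoryError(f"requested {n}, only {len(self.free_slots)} free")
        taken, self.free_slots = self.free_slots[:n], self.free_slots[n:]
        return taken

    def free(self, slots: list) -> None:
        self.free_slots.extend(slots)

    def available(self) -> int:
        return len(self.free_slots)


def run_mixed_step(pool: TokenPool, prefill_tokens: int, num_decode: int) -> list:
    """Allocate slots for one mixed batch: a prefill chunk plus one slot per decode request."""
    return pool.alloc(prefill_tokens) + pool.alloc(num_decode)


def test_no_leak_after_mixed_steps():
    pool = TokenPool(64)
    held = []
    # (prefill_tokens, num_decode) per step, including a decode-only step.
    for prefill_tokens, num_decode in [(16, 2), (8, 4), (0, 6)]:
        held.extend(run_mixed_step(pool, prefill_tokens, num_decode))
    assert pool.available() == 64 - 36

    # Releasing every slot must restore the full pool with no duplicates.
    pool.free(held)
    assert pool.available() == pool.size
    assert sorted(pool.free_slots) == list(range(pool.size))
```

A real test would drive the actual scheduler, ideally with overlap scheduling enabled, and compare allocator state before and after; the invariant being checked would be the same.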

Related resources

#12224

Metadata

Labels

enhancement: New feature or request
