Background
Mixed ChunkPrefill is an advanced scheduling mode in SGLang that runs prefill and decode requests in the same batch to improve GPU utilization. The current implementation needs refactoring to improve performance and to stay compatible with other features, notably the overlap scheduler and speculative decoding.
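To make the idea concrete, here is a minimal sketch of how one mixed batch might be formed under a per-step token budget. It is an illustration only, not SGLang's scheduler: `Request`, `build_mixed_batch`, and `token_budget` are hypothetical names, and the policy (decode first, then fill the leftover budget with prefill chunks) is an assumption.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Request:
    """Hypothetical request record, not an SGLang type."""
    rid: str
    prompt_len: int      # total number of prompt tokens
    prefilled: int = 0   # prompt tokens already processed

    @property
    def is_decoding(self) -> bool:
        # A request decodes once its whole prompt has been prefilled.
        return self.prefilled >= self.prompt_len


def build_mixed_batch(
    requests: List[Request], token_budget: int
) -> Tuple[List[str], List[Tuple[str, int, int]]]:
    """Pick work for one forward pass.

    Assumed policy: every decoding request costs one token of the budget;
    whatever is left is filled with prefill chunks, in request order.
    Returns (decode_rids, prefill_chunks), where each chunk is
    (rid, start_offset, length).
    """
    decode_rids = [r.rid for r in requests if r.is_decoding][:token_budget]
    remaining = token_budget - len(decode_rids)

    prefill_chunks = []
    for r in requests:
        if r.is_decoding or remaining == 0:
            continue
        length = min(remaining, r.prompt_len - r.prefilled)
        prefill_chunks.append((r.rid, r.prefilled, length))
        remaining -= length
    return decode_rids, prefill_chunks


reqs = [Request("a", 10, 10), Request("b", 100), Request("c", 5)]
decode, chunks = build_mixed_batch(reqs, token_budget=8)
print(decode)  # ['a']
print(chunks)  # [('b', 0, 7)]
```

With a budget of 8, the decoding request `a` takes one token and the remaining 7 go to the first chunk of `b`; `c` waits for a later step.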
Action Items
- Kernel Optimization
  - Introduce a dedicated mixed chunk attention backend
  - Explore PodAttention or separate kernel launches for Extend/Decode operations
  - Benchmark and adopt the optimal approach for performance
- Scheduler Refactoring
  - Refactor memory allocation/deallocation logic for mixed chunk mode
  - Clean up inconsistent memory management patterns
  - Improve code maintainability and clarity
- Overlap Scheduler Compatibility
  - Fix memory leak issues when running with overlap scheduling
  - Support decode future tokens mode in mixed chunk context
  - Ensure proper integration with overlap pipeline
- Speculative Decoding Support
  - Enable mixed chunk mode with speculative decoding
- Testing & Validation
  - Add comprehensive unit tests for mixed chunk mode
  - Cover feature interactions and edge cases
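As one possible shape for the memory-leak and unit-test items above, the sketch below uses a toy slot pool to check that every slot allocated during a mixed step is returned once the requests finish. `TokenSlotPool` and its methods are hypothetical and stand in for whatever allocator the scheduler actually uses; the point is only the invariant being asserted.

```python
class TokenSlotPool:
    """Toy KV-slot allocator (hypothetical, for illustrating a leak check)."""

    def __init__(self, size: int):
        self.size = size
        self._free = list(range(size))
        self._owner = {}  # slot index -> request id

    @property
    def available(self) -> int:
        return len(self._free)

    def alloc(self, rid: str, n: int) -> list:
        if n > len(self._free):
            raise MemoryError(f"need {n} slots, only {len(self._free)} free")
        slots = [self._free.pop() for _ in range(n)]
        for s in slots:
            self._owner[s] = rid
        return slots

    def free_request(self, rid: str) -> int:
        """Release every slot owned by `rid`; return how many were freed."""
        slots = [s for s, owner in self._owner.items() if owner == rid]
        for s in slots:
            del self._owner[s]
            self._free.append(s)
        return len(slots)

    def leaked(self) -> dict:
        """Slots still owned by some request, as {slot: rid}."""
        return dict(self._owner)


def test_mixed_step_returns_all_slots():
    pool = TokenSlotPool(16)
    pool.alloc("prefill-0", 7)  # one prefill chunk
    pool.alloc("decode-0", 1)   # one decode token in the same batch
    assert pool.available == 8

    pool.free_request("prefill-0")
    pool.free_request("decode-0")
    # Invariant: nothing is left allocated after both requests finish.
    assert pool.leaked() == {}
    assert pool.available == pool.size


test_mixed_step_returns_all_slots()
```

A real test would drive the scheduler itself, with and without overlap scheduling, and assert the same invariant on its memory pool.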