[Feature] Mixed ChunkPrefill Optimization Roadmap

### Background
Mixed ChunkPrefill is a scheduling mode in SGLang that runs prefill chunks and decode requests in the same batch to improve GPU utilization. The current implementation needs refactoring to improve performance and to stay compatible with other features, notably the overlap scheduler and speculative decoding.
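
To make the idea concrete, here is a minimal sketch of how a mixed batch could be composed under a per-step token budget. It is illustrative only: `PrefillRequest`, `build_mixed_batch`, and `token_budget` are hypothetical names, and the policy (decode requests first, prefill chunks fill the remainder) is an assumption rather than SGLang's actual scheduler logic.

```python
from dataclasses import dataclass


@dataclass
class PrefillRequest:
    """Hypothetical prefill request whose prompt may span several chunks."""
    rid: str
    prompt_len: int
    done: int = 0  # prompt tokens already prefilled

    @property
    def remaining(self) -> int:
        return self.prompt_len - self.done


def build_mixed_batch(decode_rids, prefill_queue, token_budget):
    """Return (decode_rids_in_batch, [(rid, chunk_len), ...]).

    Assumed policy: each decode request contributes one token, and
    prefill chunks fill whatever is left of the per-step token budget.
    """
    decode_part = list(decode_rids[:token_budget])
    budget = token_budget - len(decode_part)
    prefill_part = []
    for req in prefill_queue:
        if budget == 0:
            break
        if req.remaining == 0:
            continue
        chunk = min(req.remaining, budget)
        prefill_part.append((req.rid, chunk))
        req.done += chunk
        budget -= chunk
    return decode_part, prefill_part


queue = [PrefillRequest("p0", prompt_len=10)]
step1 = build_mixed_batch(["d0", "d1"], queue, token_budget=8)
step2 = build_mixed_batch(["d0", "d1"], queue, token_budget=8)
print(step1)  # (['d0', 'd1'], [('p0', 6)])
print(step2)  # (['d0', 'd1'], [('p0', 4)])
```

In this sketch the 10-token prompt is prefilled over two steps while both decode requests keep producing one token per step.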

### Action Items

1. **Kernel Optimization**
   - Introduce a dedicated mixed chunk attention backend
   - Explore **PodAttention** or separate kernel launches for Extend/Decode operations
   - Benchmark both approaches and adopt the faster one

2. **Scheduler Refactoring**
   - Refactor memory allocation/deallocation logic for mixed chunk mode
   - Clean up inconsistent memory management patterns
   - Improve code maintainability and clarity

3. **Overlap Scheduler Compatibility**
   - Fix memory leak issues when running with overlap scheduling
   - Support decode future-token mode in the mixed chunk context
   - Ensure proper integration with the overlap pipeline

4. **Speculative Decoding Support**
   - Enable mixed chunk mode with speculative decoding

5. **Testing & Validation**
   - Add comprehensive unit tests for mixed chunk mode
   - Cover feature interactions and edge cases
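
For the kernel item, one of the options listed is to launch the Extend and Decode kernels separately. A rough sketch of the index bookkeeping this would need follows. It assumes, hypothetically, that the flattened mixed batch stores all extend tokens first, followed by one token per decode request. `split_mixed_batch` is an illustrative name, not an SGLang function.

```python
def split_mixed_batch(extend_lens, num_decode):
    """Compute token ranges inside a flattened mixed batch.

    Assumed layout (hypothetical): [extend req 0 | extend req 1 | ... | decode tokens].
    Returns per-request (start, end) ranges for the extend part and a
    single (start, end) range for the decode part, so each part can be
    handed to its own kernel launch.
    """
    extend_ranges = []
    start = 0
    for n in extend_lens:
        extend_ranges.append((start, start + n))
        start += n
    decode_range = (start, start + num_decode)
    return extend_ranges, decode_range


extend_ranges, decode_range = split_mixed_batch([6, 3], num_decode=2)
print(extend_ranges)  # [(0, 6), (6, 9)]
print(decode_range)   # (9, 11)
```

A fused approach such as PodAttention would process both ranges in one launch instead. Which is faster is exactly what the benchmark item is meant to settle.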

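For the scheduler, overlap, and testing items, a leak check can be phrased as an accounting invariant: free slots plus owned slots must always equal the pool size. The sketch below uses a hypothetical `TokenSlotPool`. It is not SGLang's actual allocator and only shows the shape such a unit test might take.

```python
class TokenSlotPool:
    """Hypothetical KV-cache slot pool used only for this illustration."""

    def __init__(self, size):
        self.size = size
        self.free_slots = list(range(size))
        self.owner = {}  # slot index -> request id

    def alloc(self, rid, n):
        """Reserve n slots for a request, or return None if the pool is short."""
        if n > len(self.free_slots):
            return None
        slots = [self.free_slots.pop() for _ in range(n)]
        for s in slots:
            self.owner[s] = rid
        return slots

    def free_request(self, rid):
        """Release every slot owned by rid and return how many were freed."""
        slots = [s for s, r in self.owner.items() if r == rid]
        for s in slots:
            del self.owner[s]
        self.free_slots.extend(slots)
        return len(slots)

    def check_no_leak(self):
        """Assert the accounting invariant and return the number of slots in use."""
        assert len(self.free_slots) + len(self.owner) == self.size
        return len(self.owner)


def test_slots_return_after_mixed_steps():
    pool = TokenSlotPool(16)
    pool.alloc("p0", 6)  # first prefill chunk
    pool.alloc("d0", 1)  # decode token in the same step
    pool.alloc("p0", 4)  # second prefill chunk
    pool.alloc("d0", 1)  # decode token in the next step
    assert pool.check_no_leak() == 12
    pool.free_request("p0")
    pool.free_request("d0")
    assert pool.check_no_leak() == 0
    assert len(pool.free_slots) == 16


test_slots_return_after_mixed_steps()
```

The same invariant could be asserted after every scheduler step in a debug build, which would localize a leak to the step that introduced it.
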
### Related resources

https://github.com/sgl-project/sglang/pull/12224

[Feature] Mixed ChunkPrefill Optimization Roadmap #13626
