7 of 8 issues completed
Checklist
- 1. If the issue you raised is not a feature but a question, please raise a discussion at https://github.com/sgl-project/sglang/discussions/new/choose Otherwise, it will be closed.
- 2. Please use English, otherwise it will be closed.
Features
CY25H2
- Overlapped LoRA updates (Support overlapped lora updates #8213) @lifuhuang
- Compatibility with radix attention ([Bug] Why can't I use multi-lora adapter and radix attention together? #2880, [Feature] Further support for Lora Radix Cache #9144, Support radix cache for Lora feature #7216) @Fridge003
- Adapter GPU pinning ([Feature] LRU Eviction Strategy for Lora Adapters: Evicting Adapters with Priority #8053, Support GPU pinning for LoRA #8697, Support pinning adapter via server args. #9249) @lifuhuang
- LRU cache support for the LoRA memory pool ([Feature] LRU Eviction Strategy for Lora Adapters: Evicting Adapters with Priority #8053, Implement LRU eviction policy for LoRA adapters #11041) @ConnorLi96
- FlashInfer deprecation ([Refactor] Deprecate FlashInfer lora backend #7809) @lifuhuang
- Perf: LoRA batch preparation optimization ([Perf] Speed up LoRA Batch Initialization #6961) @lifuhuang @Fridge003
- Perf: kernel optimization ([Perf] LoRA Kernel benchmark & optimization #9040, [Feature] Cutlass kernels for LoRA #7910, [2/4] Introduce Chunked-SGMV kernels and corresponding LoRA backend for improved performance #10286) @Qiaolin-Yu @Fridge003 @lifuhuang
- Perf: async LoRA prefetch ([Feature] Asynchronous LoRA prefetch #8712)
- Support LoRA for speculative decoding @ConnorLi96
- Support LoRA for embedding layers ([Feature] Add LoRA support for embedding layers #14177)
- Support LoRA for MoE layers ([Feature] Comprehensive LoRA Adapter Support for MOE Models: Including Expert Weights Integration #9897, [Feature] Comprehensive LoRA Adapter Support for MOE Models #11894) @ConnorLi96
- Unified paging, i.e. support for LoRA adapters with different ranks ([Feature] Support unified paging in multi-lora serving #3647) @Sunt-ing @jcbjcbjc
- OpenAI-compatible API ([Feature] OpenAI compatible API in LoRA #11551, [FEATURE] Add OpenAI-Compatible LoRA Adapter Selection #11570) @ConnorLi96 @neelabhsinha
- LRU offloading ([Feature] Optimize LoRA Loading Mechanism to Decouple User Limits from CPU Memory Constraints #10266)
- Support PDL for LoRA shrink & expand kernels ([LoRA] Add PDL to LoRA shrink and expand #14346, https://www.databricks.com/blog/fast-peft-serving-scale)
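The GPU-pinning and LRU-eviction items above combine naturally: pinned adapters stay resident while unpinned ones are evicted in least-recently-used order. A minimal sketch of that policy is below; the `AdapterPool` class and its method names are illustrative assumptions for this roadmap, not SGLang's actual implementation.

```python
from collections import OrderedDict
from typing import Optional


class AdapterPool:
    """Illustrative LRU pool for LoRA adapter slots.

    Pinned adapters are never evicted; unpinned adapters are evicted
    in least-recently-used order when the pool is full.
    (Sketch only -- not SGLang's LoRAMemoryPool.)
    """

    def __init__(self, capacity: int):
        self.capacity = capacity
        self._slots: "OrderedDict[str, bool]" = OrderedDict()  # name -> pinned?

    def acquire(self, name: str, pin: bool = False) -> Optional[str]:
        """Ensure `name` is resident; return the evicted adapter's name, if any."""
        if name in self._slots:
            self._slots.move_to_end(name)  # mark as most recently used
            self._slots[name] = self._slots[name] or pin
            return None
        evicted = None
        if len(self._slots) >= self.capacity:
            # Scan from the LRU end for the first unpinned victim.
            victim = next(
                (k for k, pinned in self._slots.items() if not pinned), None
            )
            if victim is None:
                raise RuntimeError("all slots are pinned; cannot evict")
            del self._slots[victim]
            evicted = victim
        self._slots[name] = pin
        return evicted
```

With a two-slot pool, pinning `base` and then loading two other adapters evicts only the unpinned one, which is the behavior the pinning feature is meant to guarantee.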
CY25H1
- Triton kernel & benchmark ([Feature] Define backends and add Triton backend for Lora #3161) @Fridge003
- Dynamic load/unload (Refactor LoRAManager and LoRAMemoryPool state management logic for dynamic LoRA loading support #7412, Support dynamic LoRA loading / unloading in engine/server API #7446) @lifuhuang @Fridge003
- Accuracy alignment ([Bug] HuggingFace and SGLang inference don't match #2671, [Fix] Fix accuracy bug and refactor codes for lora #3413) @Fridge003
- Test case enhancement ([Feature] Test case enhancement for Lora features #3414, [Fix] Fix bugs and refactor codes in lora for better scalability. #3652, [Feature] add multi-rank support for Lora #4492, [Fix] Improve Lora tests and reduce CI runtime #4925) @aoshen524 @jcbjcbjc
- Support multi-rank adapters ([Feature] add multi-rank support for Lora #4492) @jcbjcbjc
- Support tensor parallelism ([Bug] tensor_model_parallel_all_reduce' is not defined #2931, [Feature] Support Tensor Parallelism and Weight Slicing for Lora #4274) @aoshen524
- Compatibility with CUDA graph ([Feature] Support compatibility between Cuda Graph and Lora #3282, Feat: support cuda graph for LoRA #4115) @Qiaolin-Yu @Beichen-Ma
- Support Phi-4-MM ([Feature] Phi-4-MM support #6544) @lifuhuang
- Documentation (Add document for LoRA serving #5521) @Fridge003
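The dynamic load/unload item above amounts to runtime state management: the server must register a new adapter, reject invalid transitions, and resolve adapter names on each request. A minimal registry sketch is below; the `LoRARegistry` class and its method names are hypothetical, not SGLang's actual LoRAManager API.

```python
from typing import Dict


class LoRARegistry:
    """Minimal sketch of dynamic adapter load/unload state management.

    Tracks which adapters are registered and rejects invalid transitions
    (double-load, unload of an unknown adapter).
    Illustrative only -- not SGLang's LoRAManager.
    """

    def __init__(self):
        self._paths: Dict[str, str] = {}  # adapter name -> weight path

    def load(self, name: str, path: str) -> None:
        if name in self._paths:
            raise ValueError(f"adapter {name!r} is already loaded")
        # A real server would fetch and validate the weights here.
        self._paths[name] = path

    def unload(self, name: str) -> None:
        if name not in self._paths:
            raise ValueError(f"adapter {name!r} is not loaded")
        del self._paths[name]

    def resolve(self, name: str) -> str:
        """Return the weight path used to serve requests for `name`."""
        if name not in self._paths:
            raise KeyError(f"unknown adapter {name!r}")
        return self._paths[name]
```

The point of the refactor tracked in #7412 was to make this kind of state transition explicit so that loading and unloading can happen safely while the engine is serving traffic.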
Related resources