### Motivation
prepare_lora_batch is triggered once per forward pass and is one of the main sources of perf overhead from LoRA. Based on a suggestion from @Fridge003, there is some low-hanging fruit for perf optimization, such as eliminating unnecessary CUDA device syncs.
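To make the overhead concrete, here is a minimal, hypothetical sketch of the kind of per-batch work that causes it; the function and variable names are illustrative only and are not SGLang's actual prepare_lora_batch:

```python
import torch

# Illustrative only: not SGLang's real prepare_lora_batch.
def prepare_lora_batch_naive(lora_ranks_per_req: list[int], device: str = "cuda"):
    # Building a CUDA tensor straight from a Python list issues a fresh
    # H2D transfer on every forward pass.
    ranks = torch.tensor(lora_ranks_per_req, dtype=torch.int32, device=device)

    # Reading a GPU scalar back on the host (.item()) makes the CPU wait
    # for the device, i.e. an implicit CUDA device sync per batch.
    max_rank = ranks.max().item()
    return ranks, max_rank
```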
Current status:
| Metric | Baseline | #6960 | + #6994 | + #8940 |
|---|---|---|---|---|
| ITL@P95 | 78.42 ms | 68.24 ms (-13.0%) | 52.51 ms (-33.0%) | 38.40 ms (-51.0%) |
| ITL@P50 | 34.36 ms | 32.85 ms (-4.4%) | 22.68 ms (-34.0%) | 18.30 ms (-46.7%) |
| TTFT@P50 | 91.37 ms | 85.52 ms (-6.5%) | 62.65 ms (-31.4%) | 53.79 ms (-41.1%) |
- Eliminate unnecessary H2D transfer in set_lora_info (Speed up set_lora_info by eliminating unnecessary H2D transfers #6960)
- Eliminate CUDA stream syncs and redundant compute in LoRAManager ([Perf] Refactor LoRAManager to eliminate stream syncs and redundant computations #6994) (see the pattern sketch after this list)
- Experiment with torch.compile / CUDA graph for prepare_lora_batch to reduce gaps between kernels (idea from @hebiao064, to be verified) (deprioritized, since most tensor ops are now moved to the server init phase by Improve LoRA Perf by Deprecating FlashInfer and Eliminating Redundant Tensor Ops #8940)
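As referenced in the list above, the common pattern behind these items is to allocate tensors once at server init, stage per-batch metadata in pinned host memory, and keep reductions on the CPU so no per-batch device sync is needed. Below is a minimal sketch under those assumptions; all names here are hypothetical and not the actual LoRAManager API:

```python
import torch

class LoraBatchBuffers:
    """Hypothetical sketch of the 'allocate once at init' pattern; not sglang's real API."""

    def __init__(self, max_running_requests: int, device: str = "cuda"):
        # Allocate a reusable GPU buffer once at server init instead of per batch.
        self.ranks_gpu = torch.zeros(max_running_requests, dtype=torch.int32, device=device)
        # Pinned host staging buffer so the H2D copy can be issued asynchronously.
        self.ranks_cpu = torch.zeros(max_running_requests, dtype=torch.int32).pin_memory()

    def prepare(self, lora_ranks_per_req: list[int]):
        bs = len(lora_ranks_per_req)
        # Fill the pinned buffer on the CPU; no GPU work happens here.
        self.ranks_cpu[:bs] = torch.tensor(lora_ranks_per_req, dtype=torch.int32)
        # Asynchronous H2D copy from pinned memory: no stream sync, and the
        # transfer can overlap with kernels already queued on the GPU.
        self.ranks_gpu[:bs].copy_(self.ranks_cpu[:bs], non_blocking=True)
        # Keep the reduction on the host to avoid a .item() device sync.
        max_rank = max(lora_ranks_per_req)
        return self.ranks_gpu[:bs], max_rank
```

With this shape, the only per-batch GPU work is an asynchronous slice copy that can overlap with the model's forward kernels.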