### Motivation
prepare_lora_batch is triggered once per forward pass and is one of the main sources of perf overhead from LoRA. Based on a suggestion from @Fridge003, there is some low-hanging fruit for perf optimization, such as eliminating unnecessary CUDA device syncs.
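To make the overhead concrete, here is a minimal, hypothetical sketch of the kind of per-batch work that causes it; the function and variable names are illustrative only and are not SGLang's actual prepare_lora_batch:

```python
import torch

# Illustrative only: not SGLang's real prepare_lora_batch.
def prepare_lora_batch_naive(lora_ranks_per_req: list[int], device: str = "cuda"):
    # Building a CUDA tensor straight from a Python list issues a fresh
    # H2D transfer on every forward pass.
    ranks = torch.tensor(lora_ranks_per_req, dtype=torch.int32, device=device)

    # Reading a GPU scalar back on the host (.item()) makes the CPU wait
    # for the device, i.e. an implicit CUDA device sync per batch.
    max_rank = ranks.max().item()
    return ranks, max_rank
```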
Current status:
| Metric | Baseline | #6960 | + #6994 | + #8940 |
|---|---|---|---|---|
| ITL@P95 | 78.42 ms | 68.24 ms (-13.0%) | 52.51 ms (-33.0%) | 38.40 ms (-51.0%) |
| ITL@P50 | 34.36 ms | 32.85 ms (-4.4%) | 22.68 ms (-34.0%) | 18.30 ms (-46.7%) |
| TTFT@P50 | 91.37 ms | 85.52 ms (-6.5%) | 62.65 ms (-31.4%) | 53.79 ms (-41.1%) |
- Eliminate unnecessary H2D transfer in set_lora_info (Speed up set_lora_info by eliminating unnecessary H2D transfers #6960)
- Eliminate CUDA stream syncs and redundant compute in LoRAManager ([Perf] Refactor LoRAManager to eliminate stream syncs and redundant computations #6994) (see the pattern sketch after this list)
- Experiment with torch.compile / CUDA graph for prepare_lora_batch to reduce gaps between kernels (idea from @hebiao064, to be verified) (deprioritized, since most tensor ops are now moved to the server init phase by Improve LoRA Perf by Deprecating FlashInfer and Eliminating Redundant Tensor Ops #8940)
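As referenced in the list above, the common pattern behind these items is to allocate tensors once at server init, stage per-batch metadata in pinned host memory, and keep reductions on the CPU so no per-batch device sync is needed. Below is a minimal sketch under those assumptions; all names here are hypothetical and not the actual LoRAManager API:

```python
import torch

class LoraBatchBuffers:
    """Hypothetical sketch of the 'allocate once at init' pattern; not sglang's real API."""

    def __init__(self, max_running_requests: int, device: str = "cuda"):
        # Allocate a reusable GPU buffer once at server init instead of per batch.
        self.ranks_gpu = torch.zeros(max_running_requests, dtype=torch.int32, device=device)
        # Pinned host staging buffer so the H2D copy can be issued asynchronously.
        self.ranks_cpu = torch.zeros(max_running_requests, dtype=torch.int32).pin_memory()

    def prepare(self, lora_ranks_per_req: list[int]):
        bs = len(lora_ranks_per_req)
        # Fill the pinned buffer on the CPU; no GPU work happens here.
        self.ranks_cpu[:bs] = torch.tensor(lora_ranks_per_req, dtype=torch.int32)
        # Asynchronous H2D copy from pinned memory: no stream sync, and the
        # transfer can overlap with kernels already queued on the GPU.
        self.ranks_gpu[:bs].copy_(self.ranks_cpu[:bs], non_blocking=True)
        # Keep the reduction on the host to avoid a .item() device sync.
        max_rank = max(lora_ranks_per_req)
        return self.ranks_gpu[:bs], max_rank
```

With this shape, the only per-batch GPU work is an asynchronous slice copy that can overlap with the model's forward kernels.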