[LoRA, Performance] Add gemm expand triton kernel for multi-LoRA #1728
Closed
Conversation
This PR adds the option `--lora-backend` to choose between the triton and flashinfer backends. See the launch example below.
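A minimal launch sketch, assuming SGLang's `python -m sglang.launch_server` entry point; only `--lora-backend` and `--max-loras-per-batch` are taken from the PR text, and the model path and `--lora-paths` arguments are hypothetical placeholders:

```bash
python -m sglang.launch_server \
  --model-path meta-llama/Llama-2-7b-hf \
  --lora-paths lora0=/path/to/adapter0 lora1=/path/to/adapter1 \
  --max-loras-per-batch 4 \
  --lora-backend triton
```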
For multi-LoRA serving, what has been done: this PR gives initial multi-LoRA serving support. Currently, it supports LoRA on the attention (`qkvo`) and MLP (`gate`, `up`, `down`) linear layers. It supports dynamic loading and offloading, but it does not support unified memory; the memory pool for LoRA adapters is pre-allocated, so please use a smaller `--mem-frac` to launch the server with a larger `--max-loras-per-batch`.
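A hypothetical illustration of what a pre-allocated pool implies (not SGLang's actual implementation): buffers for `--max-loras-per-batch` adapters are reserved up front and never grow, which is why the fraction reserved via `--mem-frac` has to shrink to make room.

```python
# Hypothetical sketch, not SGLang's implementation: a fixed-size LoRA pool.
import torch

class LoRAMemoryPool:
    def __init__(self, max_loras_per_batch, max_rank, hidden, device="cuda"):
        # One fixed (A, R, H) buffer holds every resident adapter's B matrix;
        # no further GPU allocation happens after startup.
        self.buf_B = torch.empty(max_loras_per_batch, max_rank, hidden,
                                 dtype=torch.float16, device=device)
        self.slot_of = {}                          # adapter name -> slot index
        self.free = list(range(max_loras_per_batch))

    def load(self, name, weight_B):
        # Dynamic loading: copy the adapter into a free slot, evicting an
        # inactive adapter (naive policy here) when the pool is full.
        if name in self.slot_of:
            return self.slot_of[name]
        if not self.free:
            _, slot = self.slot_of.popitem()       # offload = give up the slot
            self.free.append(slot)
        slot = self.free.pop()
        r, h = weight_B.shape
        self.buf_B[slot, :r, :h].copy_(weight_B)
        self.slot_of[name] = slot
        return slot
```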
What is in progress: the triton kernels for shrink and 2-D segmented GEMM will come up in follow-up PRs.
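For context on the expand kernel in the title: the LoRA expand step computes `y[s] += x[s] @ B[idx[s]]` for each token `s`, where `x` holds the rank-`R` intermediates from the shrink step and `idx` maps each token to its adapter. Below is a minimal Triton sketch of such a gather-style expand GEMM; it is not the PR's actual kernel, and the one-token-per-program layout, names, and block sizes are illustrative assumptions (a production kernel would tile over tokens and use `tl.dot`).

```python
# Minimal sketch (not the PR's kernel): per-token LoRA expand,
# y[s, :] += x[s, :] @ B[idx[s]], with the whole rank dimension in one block.
import torch
import triton
import triton.language as tl

@triton.jit
def lora_expand_kernel(
    x_ptr, b_ptr, y_ptr, idx_ptr,   # (S, R), (A, R, H), (S, H), (S,)
    H, R,
    stride_xs, stride_xr,
    stride_ba, stride_br, stride_bh,
    stride_ys, stride_yh,
    BLOCK_H: tl.constexpr, BLOCK_R: tl.constexpr,
):
    s = tl.program_id(0)                 # one program per token row
    pid_h = tl.program_id(1)             # and per block of output columns
    adapter = tl.load(idx_ptr + s)       # gather this token's adapter id

    offs_r = tl.arange(0, BLOCK_R)
    offs_h = pid_h * BLOCK_H + tl.arange(0, BLOCK_H)
    mask_r = offs_r < R
    mask_h = offs_h < H

    # Load the token's rank-R activation row (LoRA ranks are small).
    x = tl.load(x_ptr + s * stride_xs + offs_r * stride_xr,
                mask=mask_r, other=0.0)
    # Load the (R, BLOCK_H) tile of the selected adapter's B matrix.
    b = tl.load(b_ptr + adapter * stride_ba
                + offs_r[:, None] * stride_br
                + offs_h[None, :] * stride_bh,
                mask=mask_r[:, None] & mask_h[None, :], other=0.0)
    # acc[h] = sum_r x[r] * B[adapter, r, h]; accumulate into y in place.
    acc = tl.sum(x[:, None] * b, axis=0)
    y_ptrs = y_ptr + s * stride_ys + offs_h * stride_yh
    y = tl.load(y_ptrs, mask=mask_h, other=0.0)
    tl.store(y_ptrs, y + acc, mask=mask_h)

def lora_expand(x, b, y, idx, BLOCK_H=128):
    S, R = x.shape
    _, _, H = b.shape
    grid = (S, triton.cdiv(H, BLOCK_H))
    lora_expand_kernel[grid](
        x, b, y, idx, H, R,
        x.stride(0), x.stride(1),
        b.stride(0), b.stride(1), b.stride(2),
        y.stride(0), y.stride(1),
        BLOCK_H=BLOCK_H, BLOCK_R=triton.next_power_of_2(R),
    )
```

With `x` of shape `(S, R)`, `b` of shape `(A, R, H)`, `y` of shape `(S, H)`, and `idx` an int32 tensor of adapter indices, `lora_expand(x, b, y, idx)` adds each token's LoRA delta to the base-model output in place; the flashinfer backend covers the same operation.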
References:
S-LoRA: Serving Thousands of Concurrent LoRA Adapters
Punica: Multi-Tenant LoRA Serving