Skip to content

docs: update SGLang status#173

Merged
Liyuhui-12 merged 1 commit intoSafeAILab:mainfrom
zhyncs:main
Jan 2, 2025
Merged

docs: update SGLang status#173
Liyuhui-12 merged 1 commit intoSafeAILab:mainfrom
zhyncs:main

Conversation

@zhyncs
Copy link
Contributor

@zhyncs zhyncs commented Jan 2, 2025

SGLang now has finished supported EAGLE speculative decoding. The following results are obtained on a single H100.
This work was done by @merrymercy and @yukavio

The EAGLE implemented in SGLang is likely the most efficient among open-source LLM engines.

cc @Liyuhui-12 @hongyanz

Official eagle code: 200 token/s

see https://github.com/SafeAILab/EAGLE

Normal decoding speed (SGLang): 156 token/s

python3 -m sglang.launch_server --model meta-llama/Llama-2-7b-chat-hf

Eagle decoding speed (SGLang): 297 token/s

python3 -m sglang.launch_server --model meta-llama/Llama-2-7b-chat-hf  --speculative-algo EAGLE --speculative-draft lmzheng/sglang-EAGLE-llama2-chat-7B --speculative-num-steps 5 --speculative-eagle-topk 8 --speculative-num-draft-tokens 64 --mem-fraction 0.7

Eagle decoding speed (SGLang w/ torch.comopile): 316 token/s

python3 -m sglang.launch_server --model meta-llama/Llama-2-7b-chat-hf  --speculative-algo EAGLE --speculative-draft lmzheng/sglang-EAGLE-llama2-chat-7B --speculative-num-steps 5 --speculative-eagle-topk 8 --speculative-num-draft-tokens 64 --mem-fraction 0.7 --enable-torch-compile --cuda-graph-max-bs 2

Benchmark script

import time
import requests

tic = time.time()
response = requests.post(
    "http://localhost:30000/generate",
    json={
        "text": "[INST] Give me a simple FastAPI server. Show the python code. [/INST]",
        "sampling_params": {
            "temperature": 0,
            "max_new_tokens": 256,
        },
    },
)
latency = time.time() - tic
ret = response.json()

print(ret["text"])
speed = ret["meta_info"]["completion_tokens"]
print(f"speed: {speed / latency:.2f} token/s")

sgl-project/sglang#2150

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants