Benchmark information is required. Thanks a lot!
The following content is necessary for this PR. @pathfinder-pf

- Cache Miss: Please add tests in test_features.py; you can follow the existing tests. Note: please set logprob_start_len to -1 or 1, top_logprobs_num to a value greater than 1, and token_ids_logprob to a non-None value.
- Feature: Please add these tests to CI to ensure the feature works. Note: please cover both batch_size = 1 and batch_size > 1.
- Accuracy: How do we ensure the returned logprobs are as expected? This needs a discussion.
- Profile and Benchmark (Baseline): Please add baselines for the three scenarios in the blog.
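A minimal, hypothetical sketch of the cases the checklist above asks for, assuming sgl_jax keeps SGLang's native /generate request fields (return_logprob, logprob_start_len, top_logprobs_num, token_ids_logprob); the server address, sampling params, and token ids are placeholders, and the real tests should live in test_features.py:

```python
import requests

BASE_URL = "http://localhost:30000"  # assumed server address

def generate_with_logprobs(prompts, logprob_start_len, top_logprobs_num, token_ids_logprob):
    """Send one batched /generate request and return the parsed JSON response."""
    payload = {
        "text": prompts,
        "sampling_params": {"temperature": 0.7, "max_new_tokens": 16},
        "return_logprob": True,
        "logprob_start_len": logprob_start_len,
        "top_logprobs_num": top_logprobs_num,
        "token_ids_logprob": token_ids_logprob,
    }
    resp = requests.post(f"{BASE_URL}/generate", json=payload, timeout=120)
    resp.raise_for_status()
    return resp.json()

# Cover both batch_size = 1 and batch_size > 1, as the checklist requests.
for prompts in (["Hello!"], ["Hello!", "What is 2 + 2?"]):
    out = generate_with_logprobs(
        prompts, logprob_start_len=1, top_logprobs_num=2, token_ids_logprob=[0, 1]
    )
    print(out)
```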
Co-authored-by: pathfinder-fp <slackexplorer@gmail.com>
This PR returns input and output logprobs as float32 when return_logprobs=true is set. Note: performance is not optimized yet, because the many conditional branches in logits_process can lead to cache-miss cases.
func test
environment
v6e-1
test
server mode
JAX_COMPILATION_CACHE_DIR=/tmp/jit_cache python3 -u -m sgl_jax.launch_server --model-path deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B --trust-remote-code --dist-init-addr=0.0.0.0:10011 --nnodes=1 --tp-size=1 --device=tpu --random-seed=27 --node-rank=0 --mem-fraction-static=0.8 --chunked-prefill-size=8192 --download-dir=/tmp --dtype=bfloat16 --precompile-bs-paddings 1 64 --max-running-requests 64 --max-total-tokens 257536 --skip-server-warmup --attention-backend=fa --precompile-token-paddings 8192 --page-size=64 --disable-overlap-schedule
curl localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "qwen",
        "messages": [
          {
            "role": "user",
            "content": "Hello!"
          }
        ],
        "temperature": 0.7,
        "logprobs": true,
        "top_logprobs": 1
      }'
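For reference, the same request from Python with a loop over the returned per-token logprobs; this is a sketch assuming the response follows the OpenAI-compatible chat completions logprobs layout:

```python
import requests

resp = requests.post(
    "http://localhost:30000/v1/chat/completions",
    json={
        "model": "qwen",
        "messages": [{"role": "user", "content": "Hello!"}],
        "temperature": 0.7,
        "logprobs": True,
        "top_logprobs": 1,
    },
    timeout=120,
)
resp.raise_for_status()
for item in resp.json()["choices"][0]["logprobs"]["content"]:
    # Each entry carries the sampled token, its logprob, and the top_logprobs list.
    print(item["token"], item["logprob"], item["top_logprobs"])
```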
engine mode
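A hypothetical sketch of the engine-mode path, assuming sgl_jax mirrors SGLang's offline Engine API; the import path, constructor arguments, and output field names below are assumptions, not the confirmed sgl_jax interface:

```python
import sgl_jax  # import path is an assumption

engine = sgl_jax.Engine(
    model_path="deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B",
    device="tpu",
    tp_size=1,
    dtype="bfloat16",
)
outputs = engine.generate(
    ["Hello!"],
    sampling_params={"temperature": 0.7, "max_new_tokens": 16},
    return_logprob=True,
    top_logprobs_num=1,
)
for out in outputs:
    # In SGLang, output_token_logprobs is a list of (logprob, token_id, token_text) tuples.
    print(out["meta_info"]["output_token_logprobs"])
engine.shutdown()
```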
performance & accuracy
environment
v6e-4
performance & accuracy
JAX_COMPILATION_CACHE_DIR=/tmp/jit_cache python3 -u -m sgl_jax.launch_server --model-path Qwen/Qwen3-8B --trust-remote-code --dist-init-addr=0.0.0.0:10011 --nnodes=1 --tp-size=1 --device=tpu --random-seed=3 --node-rank=0 --mem-fraction-static=0.8 --chunked-prefill-size=8192 --download-dir=/tmp --dtype=bfloat16 --precompile-bs-paddings 1 64 --max-running-requests 64 --max-total-tokens 257536 --skip-server-warmup --attention-backend=fa --precompile-token-paddings 8192 --page-size=64
performance
python3 -m sgl_jax.bench_serving --backend sgl-jax --dataset-name random --num-prompts 24 --random-input 1024 --random-output 1024 --random-range-ratio 1 --max-concurrency 8 --warmup-requests 0
accuracy
evalscope eval --model Qwen/Qwen3-8B --api-url http://127.0.0.1:30000/v1 --api-key EMPTY --eval-type server --datasets gsm8k --eval-batch-size 64