From 5972322dc7d75f1490538b786a81df29e41fe867 Mon Sep 17 00:00:00 2001
From: Ubuntu
Date: Mon, 3 Nov 2025 10:56:04 +0000
Subject: [PATCH 1/5] add qwen3 vl docs

---
 docs/basic_usage/qwen3_vl.md | 78 ++++++++++++++++++++++++++++++++++++
 1 file changed, 78 insertions(+)
 create mode 100644 docs/basic_usage/qwen3_vl.md

diff --git a/docs/basic_usage/qwen3_vl.md b/docs/basic_usage/qwen3_vl.md
new file mode 100644
index 000000000000..a1f9f188319f
--- /dev/null
+++ b/docs/basic_usage/qwen3_vl.md
@@ -0,0 +1,78 @@
+# Qwen3-VL Usage
+
+[Qwen3-VL](https://huggingface.co/collections/Qwen/qwen3-vl)
+is Alibaba’s latest multimodal large language model with strong text, vision, and reasoning capabilities.
+SGLang supports the Qwen3-VL family of models with both image and video inputs.
+
+## Launch Qwen3-VL with SGLang
+
+To serve the model:
+
+```bash
+python3 -m sglang.launch_server \
+    --model-path Qwen/Qwen3-VL-30B-A3B-Instruct \
+    --host 0.0.0.0 \
+    --tp 4
+```
+
+## Sending Image/Video Requests
+
+#### Image input:
+
+```python
+import requests
+
+url = "http://localhost:30000/v1/chat/completions"
+
+data = {
+    "model": "Qwen/Qwen3-VL-30B-A3B-Instruct",
+    "messages": [
+        {
+            "role": "user",
+            "content": [
+                {"type": "text", "text": "What’s in this image?"},
+                {
+                    "type": "image_url",
+                    "image_url": {
+                        "url": "https://github.com/sgl-project/sglang/blob/main/test/lang/example_image.png?raw=true"
+                    },
+                },
+            ],
+        }
+    ],
+    "max_tokens": 300,
+}
+
+response = requests.post(url, json=data)
+print(response.text)
+```
+
+#### Video Input:
+
+```python
+import requests
+
+url = "http://localhost:30000/v1/chat/completions"
+
+data = {
+    "model": "Qwen/Qwen3-VL-30B-A3B-Instruct",
+    "messages": [
+        {
+            "role": "user",
+            "content": [
+                {"type": "text", "text": "What’s happening in this video?"},
+                {
+                    "type": "video_url",
+                    "video_url": {
+                        "url": "https://github.com/sgl-project/sgl-test-files/raw/refs/heads/main/videos/jobs_presenting_ipod.mp4"
+                    },
+                },
+            ],
+        }
+    ],
+    "max_tokens": 300,
+}
+
+response = requests.post(url, json=data)
+print(response.text)
+```

From 9f5a94adfd1a9f5abdf25016626073c3aee0dde1 Mon Sep 17 00:00:00 2001
From: Ubuntu
Date: Mon, 3 Nov 2025 10:58:52 +0000
Subject: [PATCH 2/5] add to index

---
 docs/index.rst | 1 +
 1 file changed, 1 insertion(+)

diff --git a/docs/index.rst b/docs/index.rst
index ae2bd684ebf5..b8eee4ed4073 100644
--- a/docs/index.rst
+++ b/docs/index.rst
@@ -31,6 +31,7 @@ Its core features include:
    basic_usage/gpt_oss.md
    basic_usage/llama4.md
    basic_usage/qwen3.md
+   basic_usage/qwen3_vl.md
 
 .. toctree::
    :maxdepth: 1

From ca2542b634565e2146e140532fc81fd4ff07056d Mon Sep 17 00:00:00 2001
From: Adarsh Shirawalmath <114558126+adarshxs@users.noreply.github.com>
Date: Mon, 3 Nov 2025 21:22:15 +0530
Subject: [PATCH 3/5] Update qwen3_vl.md

---
 docs/basic_usage/qwen3_vl.md | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/docs/basic_usage/qwen3_vl.md b/docs/basic_usage/qwen3_vl.md
index a1f9f188319f..5bb925fb5a11 100644
--- a/docs/basic_usage/qwen3_vl.md
+++ b/docs/basic_usage/qwen3_vl.md
@@ -10,14 +10,14 @@ To serve the model:
 
 ```bash
 python3 -m sglang.launch_server \
-    --model-path Qwen/Qwen3-VL-30B-A3B-Instruct \
-    --host 0.0.0.0 \
-    --tp 4
+    --model-path Qwen/Qwen3-VL-235B-A22B-Instruct-FP8 \
+    --tp 8 \
+    --ep 8
 ```
 
 ## Sending Image/Video Requests
 
-#### Image input:
+### Image input:
 
 ```python
 import requests
@@ -47,7 +47,7 @@ response = requests.post(url, json=data)
 print(response.text)
 ```
 
-#### Video Input:
+### Video Input:
 
 ```python
 import requests

From 2257966d871aadfaa75ac878d14d762163379137 Mon Sep 17 00:00:00 2001
From: Adarsh Shirawalmath <114558126+adarshxs@users.noreply.github.com>
Date: Thu, 6 Nov 2025 13:05:17 +0530
Subject: [PATCH 4/5] Revise Qwen3-VL launch commands and recommendations

Updated launch commands and added hardware-specific recommendations for
the Qwen3-VL model in the SGLang documentation.

---
 docs/basic_usage/qwen3_vl.md | 99 ++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 96 insertions(+), 3 deletions(-)

diff --git a/docs/basic_usage/qwen3_vl.md b/docs/basic_usage/qwen3_vl.md
index 5bb925fb5a11..d973ce35765a 100644
--- a/docs/basic_usage/qwen3_vl.md
+++ b/docs/basic_usage/qwen3_vl.md
@@ -4,17 +4,39 @@
 is Alibaba’s latest multimodal large language model with strong text, vision, and reasoning capabilities.
 SGLang supports the Qwen3-VL family of models with both image and video inputs.
 
-## Launch Qwen3-VL with SGLang
+## Launch commands for SGLang
 
-To serve the model:
+Below are suggested launch commands for different hardware and precision modes.
 
+### FP8 (quantized) mode
+For memory-efficient, latency-optimized deployments (e.g., on H100 or H200) where the FP8 checkpoint is supported:
 ```bash
 python3 -m sglang.launch_server \
     --model-path Qwen/Qwen3-VL-235B-A22B-Instruct-FP8 \
     --tp 8 \
-    --ep 8
+    --ep 8 \
+    --host 0.0.0.0 \
+    --port 30000 \
+    --keep-mm-feature-on-device
 ```
 
+### Non-FP8 (BF16 / full precision) mode
+For deployments on A100/H100 where BF16 is used (or the FP8 checkpoint is not used):
+```bash
+python3 -m sglang.launch_server \
+    --model-path Qwen/Qwen3-VL-235B-A22B-Instruct \
+    --tp 8 \
+    --ep 8 \
+    --host 0.0.0.0 \
+    --port 30000
+```
+
+## Hardware-specific notes / recommendations
+
+- On H100 with FP8: Use the FP8 checkpoint for best memory efficiency. 
+- On A100 / H100 with BF16 (non-FP8): It’s recommended to use `--mm-max-concurrent-calls` to control parallel throughput and GPU memory usage during image/video inference.
+- On H200 & B200: The model can be run “out of the box”, supporting full context length plus concurrent image + video processing.
+
 ## Sending Image/Video Requests
 
 ### Image input:
@@ -76,3 +98,74 @@
 response = requests.post(url, json=data)
 print(response.text)
 ```
+
+## Important Server Parameters and Flags
+
+When launching the model server for **multimodal support**, you can use the following command-line arguments and environment variables to fine-tune performance and behavior:
+
+- `--mm-attention-backend`: Specifies the multimodal attention backend, e.g. `fa3` (Flash Attention 3).
+- `--mm-max-concurrent-calls <n>`: Sets the **maximum number of concurrent asynchronous multimodal data processing calls** allowed on the server. Use this to control parallel throughput and GPU memory usage during image/video inference (see the sketch after this list).
+- `--mm-per-request-timeout <seconds>`: Defines the **timeout (in seconds)** for each multimodal request. A request that exceeds this limit (e.g., a very large video input) is automatically terminated.
+- `--keep-mm-feature-on-device`: Instructs the server to **keep multimodal feature tensors on the GPU** after processing. This avoids device-to-host (D2H) memory copies and improves performance for repeated or high-frequency inference workloads.
+- `SGLANG_USE_CUDA_IPC_TRANSPORT=1`: An environment variable (not a CLI flag) that enables shared-memory-pool-based CUDA IPC for multimodal data transport, which can significantly improve end-to-end latency.
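+
+As a quick sketch of how the two throughput-related flags above might be combined on a launch command (the flag values here are illustrative placeholders, not tuned recommendations):
+
+```bash
+python3 -m sglang.launch_server \
+    --model-path Qwen/Qwen3-VL-235B-A22B-Instruct \
+    --tp 8 \
+    --mm-max-concurrent-calls 8 \
+    --mm-per-request-timeout 120
+```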
+
+### Example usage with the above optimizations:
+```bash
+SGLANG_USE_CUDA_IPC_TRANSPORT=1 \
+SGLANG_VLM_CACHE_SIZE_MB=0 \
+python -m sglang.launch_server \
+    --model-path Qwen/Qwen3-VL-235B-A22B-Instruct \
+    --host 0.0.0.0 \
+    --port 30000 \
+    --trust-remote-code \
+    --tp-size 8 \
+    --enable-cache-report \
+    --log-level info \
+    --max-running-requests 64 \
+    --mem-fraction-static 0.65 \
+    --chunked-prefill-size 8192 \
+    --attention-backend fa3 \
+    --mm-attention-backend fa3 \
+    --enable-metrics
+```
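+
+### Using an OpenAI-compatible client
+
+Since the requests above target the OpenAI-compatible `/v1/chat/completions` endpoint, the official `openai` Python client can also be used in place of raw `requests`. A minimal sketch, assuming `openai` v1.x is installed and reusing the image URL from the example above (the API key is a dummy value, since a local server does not check it):
+
+```python
+from openai import OpenAI
+
+# Point the client at the local SGLang server; the key is unused locally.
+client = OpenAI(base_url="http://localhost:30000/v1", api_key="none")
+
+response = client.chat.completions.create(
+    model="Qwen/Qwen3-VL-235B-A22B-Instruct",
+    messages=[
+        {
+            "role": "user",
+            "content": [
+                {"type": "text", "text": "What’s in this image?"},
+                {
+                    "type": "image_url",
+                    "image_url": {
+                        "url": "https://github.com/sgl-project/sglang/blob/main/test/lang/example_image.png?raw=true"
+                    },
+                },
+            ],
+        }
+    ],
+    max_tokens=300,
+)
+print(response.choices[0].message.content)
+```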

From 26070ba7a278477b37dc79aa890247e810db3131 Mon Sep 17 00:00:00 2001
From: adarshxs
Date: Thu, 6 Nov 2025 13:13:08 +0530
Subject: [PATCH 5/5] lint

---
 docs/basic_usage/qwen3_vl.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/basic_usage/qwen3_vl.md b/docs/basic_usage/qwen3_vl.md
index d973ce35765a..b8af19b26c90 100644
--- a/docs/basic_usage/qwen3_vl.md
+++ b/docs/basic_usage/qwen3_vl.md
@@ -33,7 +33,7 @@
 
 ## Hardware-specific notes / recommendations
 
-- On H100 with FP8: Use the FP8 checkpoint for best memory efficiency. 
+- On H100 with FP8: Use the FP8 checkpoint for best memory efficiency.
 - On A100 / H100 with BF16 (non-FP8): It’s recommended to use `--mm-max-concurrent-calls` to control parallel throughput and GPU memory usage during image/video inference.
 - On H200 & B200: The model can be run “out of the box”, supporting full context length plus concurrent image + video processing.
 