diff --git a/docs/basic_usage/qwen3_vl.md b/docs/basic_usage/qwen3_vl.md
new file mode 100644
index 000000000000..b8af19b26c90
--- /dev/null
+++ b/docs/basic_usage/qwen3_vl.md
@@ -0,0 +1,130 @@
+# Qwen3-VL Usage
+
+[Qwen3-VL](https://huggingface.co/collections/Qwen/qwen3-vl)
+is Alibaba's latest multimodal large language model family, with strong text, vision, and reasoning capabilities.
+SGLang supports the Qwen3-VL family of models with both image and video input.
+
+## Launch commands for SGLang
+
+Below are suggested launch commands for different hardware and precision modes.
+
+### FP8 (quantized) mode
+For memory-efficient, latency-optimized deployments (e.g., on H100 or H200) where an FP8 checkpoint is available:
+```bash
+python3 -m sglang.launch_server \
+    --model-path Qwen/Qwen3-VL-235B-A22B-Instruct-FP8 \
+    --tp 8 \
+    --ep 8 \
+    --host 0.0.0.0 \
+    --port 30000 \
+    --keep-mm-feature-on-device
+```
+
+### Non-FP8 (BF16 / full precision) mode
+For deployments on A100/H100 where BF16 is used (or no FP8 checkpoint is available):
+```bash
+python3 -m sglang.launch_server \
+    --model-path Qwen/Qwen3-VL-235B-A22B-Instruct \
+    --tp 8 \
+    --ep 8 \
+    --host 0.0.0.0 \
+    --port 30000
+```
+
+## Hardware-specific notes / recommendations
+
+- On H100 with FP8: use the FP8 checkpoint for the best memory efficiency.
+- On A100/H100 with BF16 (non-FP8): it is recommended to set `--mm-max-concurrent-calls` to control parallel throughput and GPU memory usage during image/video inference.
+- On H200 and B200: the model runs out of the box, supporting the full context length together with concurrent image and video processing.
+
+## Sending Image/Video Requests
+
+### Image input
+
+```python
+import requests
+
+url = "http://localhost:30000/v1/chat/completions"
+
+data = {
+    "model": "Qwen/Qwen3-VL-30B-A3B-Instruct",
+    "messages": [
+        {
+            "role": "user",
+            "content": [
+                {"type": "text", "text": "What's in this image?"},
+                {
+                    "type": "image_url",
+                    "image_url": {
+                        "url": "https://github.com/sgl-project/sglang/blob/main/test/lang/example_image.png?raw=true"
+                    },
+                },
+            ],
+        }
+    ],
+    "max_tokens": 300,
+}
+
+response = requests.post(url, json=data)
+print(response.text)
+```
+
+### Video input
+
+```python
+import requests
+
+url = "http://localhost:30000/v1/chat/completions"
+
+data = {
+    "model": "Qwen/Qwen3-VL-30B-A3B-Instruct",
+    "messages": [
+        {
+            "role": "user",
+            "content": [
+                {"type": "text", "text": "What's happening in this video?"},
+                {
+                    "type": "video_url",
+                    "video_url": {
+                        "url": "https://github.com/sgl-project/sgl-test-files/raw/refs/heads/main/videos/jobs_presenting_ipod.mp4"
+                    },
+                },
+            ],
+        }
+    ],
+    "max_tokens": 300,
+}
+
+response = requests.post(url, json=data)
+print(response.text)
+```
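+
+### Local image input (base64)
+
+The server exposes an OpenAI-compatible chat completions API, so a local file can also be sent inline as a base64 `data:` URL instead of a remote link. The following is a minimal sketch, assuming standard OpenAI-style base64 data URLs are accepted; the file path and prompt are placeholders.
+
+```python
+import base64
+
+import requests
+
+url = "http://localhost:30000/v1/chat/completions"
+
+# Read a local image and wrap it in a base64 data URL (the path is a placeholder).
+with open("example_image.png", "rb") as f:
+    image_b64 = base64.b64encode(f.read()).decode("utf-8")
+
+data = {
+    "model": "Qwen/Qwen3-VL-30B-A3B-Instruct",
+    "messages": [
+        {
+            "role": "user",
+            "content": [
+                {"type": "text", "text": "Describe this image."},
+                {
+                    "type": "image_url",
+                    "image_url": {"url": f"data:image/png;base64,{image_b64}"},
+                },
+            ],
+        }
+    ],
+    "max_tokens": 300,
+}
+
+response = requests.post(url, json=data)
+# Print only the generated text from the OpenAI-style response body.
+print(response.json()["choices"][0]["message"]["content"])
+```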
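+
+### Using the OpenAI Python client
+
+Because the endpoint is OpenAI-compatible, the official `openai` client can be pointed at the local server instead of using raw `requests`. A minimal sketch, assuming no `--api-key` was set at launch (the key below is a dummy value):
+
+```python
+from openai import OpenAI
+
+# Point the client at the local SGLang server; the API key is a dummy value.
+client = OpenAI(base_url="http://localhost:30000/v1", api_key="not-used")
+
+response = client.chat.completions.create(
+    model="Qwen/Qwen3-VL-30B-A3B-Instruct",
+    messages=[
+        {
+            "role": "user",
+            "content": [
+                {"type": "text", "text": "What's in this image?"},
+                {
+                    "type": "image_url",
+                    "image_url": {
+                        "url": "https://github.com/sgl-project/sglang/blob/main/test/lang/example_image.png?raw=true"
+                    },
+                },
+            ],
+        }
+    ],
+    max_tokens=300,
+)
+print(response.choices[0].message.content)
+```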
+
+## Important Server Parameters and Flags
+
+When launching the model server for **multimodal support**, the following command-line arguments can be used to tune performance and behavior:
+
+- `--mm-attention-backend`: Specifies the multimodal attention backend, e.g. `fa3` (FlashAttention 3).
+- `--mm-max-concurrent-calls`: Specifies the **maximum number of concurrent asynchronous multimodal data processing calls** allowed on the server. Use it to control parallel throughput and GPU memory usage during image/video inference.
+- `--mm-per-request-timeout`: Defines the **timeout (in seconds)** for each multimodal request. A request that exceeds this limit (e.g., a very large video input) is automatically terminated.
+- `--keep-mm-feature-on-device`: Instructs the server to **keep multimodal feature tensors on the GPU** after processing. This avoids device-to-host (D2H) memory copies and improves performance for repeated or high-frequency inference workloads.
+- `SGLANG_USE_CUDA_IPC_TRANSPORT=1`: Environment variable that enables a shared-memory-pool-based CUDA IPC transport for multimodal data, which can significantly improve end-to-end latency.
+
+### Example usage with the above optimizations
+```bash
+SGLANG_USE_CUDA_IPC_TRANSPORT=1 \
+SGLANG_VLM_CACHE_SIZE_MB=0 \
+python -m sglang.launch_server \
+    --model-path Qwen/Qwen3-VL-235B-A22B-Instruct \
+    --host 0.0.0.0 \
+    --port 30000 \
+    --trust-remote-code \
+    --tp-size 8 \
+    --enable-cache-report \
+    --log-level info \
+    --max-running-requests 64 \
+    --mem-fraction-static 0.65 \
+    --chunked-prefill-size 8192 \
+    --attention-backend fa3 \
+    --mm-attention-backend fa3 \
+    --enable-metrics
+```
diff --git a/docs/index.rst b/docs/index.rst
index bc167571b27e..07e128155875 100644
--- a/docs/index.rst
+++ b/docs/index.rst
@@ -31,6 +31,7 @@ Its core features include:
    basic_usage/gpt_oss.md
    basic_usage/llama4.md
    basic_usage/qwen3.md
+   basic_usage/qwen3_vl.md
 
 .. toctree::
    :maxdepth: 1