From 5972322dc7d75f1490538b786a81df29e41fe867 Mon Sep 17 00:00:00 2001
From: Ubuntu
Date: Mon, 3 Nov 2025 10:56:04 +0000
Subject: [PATCH 1/5] add qwen3 vl docs

---
 docs/basic_usage/qwen3_vl.md | 78 ++++++++++++++++++++++++++++++++++++
 1 file changed, 78 insertions(+)
 create mode 100644 docs/basic_usage/qwen3_vl.md

diff --git a/docs/basic_usage/qwen3_vl.md b/docs/basic_usage/qwen3_vl.md
new file mode 100644
index 000000000000..a1f9f188319f
--- /dev/null
+++ b/docs/basic_usage/qwen3_vl.md
@@ -0,0 +1,78 @@
+# Qwen3-VL Usage
+
+[Qwen3-VL](https://huggingface.co/collections/Qwen/qwen3-vl)
+is Alibaba’s latest multimodal large language model with strong text, vision, and reasoning capabilities.
+SGLang supports the Qwen3-VL family of models with both image and video inputs.
+
+## Launch Qwen3-VL with SGLang
+
+To serve the model:
+
+```bash
+python3 -m sglang.launch_server \
+    --model-path Qwen/Qwen3-VL-30B-A3B-Instruct \
+    --host 0.0.0.0 \
+    --tp 4
+```
+
+## Sending Image/Video Requests
+
+#### Image input:
+
+```python
+import requests
+
+url = "http://localhost:30000/v1/chat/completions"
+
+data = {
+    "model": "Qwen/Qwen3-VL-30B-A3B-Instruct",
+    "messages": [
+        {
+            "role": "user",
+            "content": [
+                {"type": "text", "text": "What’s in this image?"},
+                {
+                    "type": "image_url",
+                    "image_url": {
+                        "url": "https://github.com/sgl-project/sglang/blob/main/test/lang/example_image.png?raw=true"
+                    },
+                },
+            ],
+        }
+    ],
+    "max_tokens": 300,
+}
+
+response = requests.post(url, json=data)
+print(response.text)
+```
+
+#### Video Input:
+
+```python
+import requests
+
+url = "http://localhost:30000/v1/chat/completions"
+
+data = {
+    "model": "Qwen/Qwen3-VL-30B-A3B-Instruct",
+    "messages": [
+        {
+            "role": "user",
+            "content": [
+                {"type": "text", "text": "What’s happening in this video?"},
+                {
+                    "type": "video_url",
+                    "video_url": {
+                        "url": "https://github.com/sgl-project/sgl-test-files/raw/refs/heads/main/videos/jobs_presenting_ipod.mp4"
+                    },
+                },
+            ],
+        }
+    ],
+    "max_tokens": 300,
+}
+
+response = requests.post(url, json=data)
+print(response.text)
+```

From 9f5a94adfd1a9f5abdf25016626073c3aee0dde1 Mon Sep 17 00:00:00 2001
From: Ubuntu
Date: Mon, 3 Nov 2025 10:58:52 +0000
Subject: [PATCH 2/5] add to index

---
 docs/index.rst | 1 +
 1 file changed, 1 insertion(+)

diff --git a/docs/index.rst b/docs/index.rst
index ae2bd684ebf5..b8eee4ed4073 100644
--- a/docs/index.rst
+++ b/docs/index.rst
@@ -31,6 +31,7 @@ Its core features include:
    basic_usage/gpt_oss.md
    basic_usage/llama4.md
    basic_usage/qwen3.md
+   basic_usage/qwen3_vl.md
 
 .. toctree::
    :maxdepth: 1

From ca2542b634565e2146e140532fc81fd4ff07056d Mon Sep 17 00:00:00 2001
From: Adarsh Shirawalmath <114558126+adarshxs@users.noreply.github.com>
Date: Mon, 3 Nov 2025 21:22:15 +0530
Subject: [PATCH 3/5] Update qwen3_vl.md

---
 docs/basic_usage/qwen3_vl.md | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/docs/basic_usage/qwen3_vl.md b/docs/basic_usage/qwen3_vl.md
index a1f9f188319f..5bb925fb5a11 100644
--- a/docs/basic_usage/qwen3_vl.md
+++ b/docs/basic_usage/qwen3_vl.md
@@ -10,14 +10,14 @@ To serve the model:
 
 ```bash
 python3 -m sglang.launch_server \
-    --model-path Qwen/Qwen3-VL-30B-A3B-Instruct \
-    --host 0.0.0.0 \
-    --tp 4
+    --model-path Qwen/Qwen3-VL-235B-A22B-Instruct-FP8 \
+    --tp 8 \
+    --ep 8
 ```
 
 ## Sending Image/Video Requests
 
-#### Image input:
+### Image input:
 
 ```python
 import requests
@@ -47,7 +47,7 @@ response = requests.post(url, json=data)
 print(response.text)
 ```
 
-#### Video Input:
+### Video Input:
 
 ```python
 import requests

From 2257966d871aadfaa75ac878d14d762163379137 Mon Sep 17 00:00:00 2001
From: Adarsh Shirawalmath <114558126+adarshxs@users.noreply.github.com>
Date: Thu, 6 Nov 2025 13:05:17 +0530
Subject: [PATCH 4/5] Revise Qwen3-VL launch commands and recommendations

Updated launch commands and added hardware-specific recommendations for
the Qwen3-VL model in the SGLang documentation.

---
 docs/basic_usage/qwen3_vl.md | 99 ++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 96 insertions(+), 3 deletions(-)

diff --git a/docs/basic_usage/qwen3_vl.md b/docs/basic_usage/qwen3_vl.md
index 5bb925fb5a11..d973ce35765a 100644
--- a/docs/basic_usage/qwen3_vl.md
+++ b/docs/basic_usage/qwen3_vl.md
@@ -4,17 +4,39 @@
 is Alibaba’s latest multimodal large language model with strong text, vision, and reasoning capabilities.
 SGLang supports the Qwen3-VL family of models with both image and video inputs.
 
-## Launch Qwen3-VL with SGLang
+## Launch commands for SGLang
 
-To serve the model:
+Below are suggested launch commands for different hardware and precision modes.
 
+### FP8 (quantized) mode
+For memory-efficient, latency-optimized deployments (e.g., on H100 or H200) where the FP8 checkpoint is supported:
 ```bash
 python3 -m sglang.launch_server \
     --model-path Qwen/Qwen3-VL-235B-A22B-Instruct-FP8 \
     --tp 8 \
-    --ep 8
+    --ep 8 \
+    --host 0.0.0.0 \
+    --port 30000 \
+    --keep-mm-feature-on-device
 ```
 
+### Non-FP8 (BF16 / full precision) mode
+For deployments on A100/H100 where BF16 is used (or the FP8 checkpoint is not used):
+```bash
+python3 -m sglang.launch_server \
+    --model-path Qwen/Qwen3-VL-235B-A22B-Instruct \
+    --tp 8 \
+    --ep 8 \
+    --host 0.0.0.0 \
+    --port 30000
+```
+
+## Hardware-specific notes / recommendations
+
+- On H100 with FP8: Use the FP8 checkpoint for best memory efficiency. 
+- On A100 / H100 with BF16 (non-FP8): It’s recommended to use `--mm-max-concurrent-calls` to control parallel throughput and GPU memory usage during image/video inference.
+- On H200 & B200: The model can be run “out of the box”, supporting full context length plus concurrent image + video processing.
+
 ## Sending Image/Video Requests
 
 ### Image input:
@@ -76,3 +98,74 @@
 response = requests.post(url, json=data)
 print(response.text)
 ```
+
+## Important Server Parameters and Flags
+
+When launching the model server for **multimodal support**, you can use the following command-line arguments and environment variables to fine-tune performance and behavior:
+
+- `--mm-attention-backend`: Specifies the multimodal attention backend, e.g. `fa3` (Flash Attention 3).
+- `--mm-max-concurrent-calls <n>`: Sets the **maximum number of concurrent asynchronous multimodal data processing calls** allowed on the server. Use this to control parallel throughput and GPU memory usage during image/video inference (see the sketch after this list).
+- `--mm-per-request-timeout <seconds>`: Defines the **timeout (in seconds)** for each multimodal request. A request that exceeds this limit (e.g., a very large video input) is automatically terminated.
+- `--keep-mm-feature-on-device`: Instructs the server to **keep multimodal feature tensors on the GPU** after processing. This avoids device-to-host (D2H) memory copies and improves performance for repeated or high-frequency inference workloads.
+- `SGLANG_USE_CUDA_IPC_TRANSPORT=1`: An environment variable (not a CLI flag) that enables shared-memory-pool-based CUDA IPC for multimodal data transport, which can significantly improve end-to-end latency.
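+
+As a quick sketch of how the two throughput-related flags above might be combined on a launch command (the flag values here are illustrative placeholders, not tuned recommendations):
+
+```bash
+python3 -m sglang.launch_server \
+    --model-path Qwen/Qwen3-VL-235B-A22B-Instruct \
+    --tp 8 \
+    --mm-max-concurrent-calls 8 \
+    --mm-per-request-timeout 120
+```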
+
+### Example usage with the above optimizations:
+```bash
+SGLANG_USE_CUDA_IPC_TRANSPORT=1 \
+SGLANG_VLM_CACHE_SIZE_MB=0 \
+python -m sglang.launch_server \
+    --model-path Qwen/Qwen3-VL-235B-A22B-Instruct \
+    --host 0.0.0.0 \
+    --port 30000 \
+    --trust-remote-code \
+    --tp-size 8 \
+    --enable-cache-report \
+    --log-level info \
+    --max-running-requests 64 \
+    --mem-fraction-static 0.65 \
+    --chunked-prefill-size 8192 \
+    --attention-backend fa3 \
+    --mm-attention-backend fa3 \
+    --enable-metrics
+```
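+
+### Using an OpenAI-compatible client
+
+Since the requests above target the OpenAI-compatible `/v1/chat/completions` endpoint, the official `openai` Python client can also be used in place of raw `requests`. A minimal sketch, assuming `openai` v1.x is installed and reusing the image URL from the example above (the API key is a dummy value, since a local server does not check it):
+
+```python
+from openai import OpenAI
+
+# Point the client at the local SGLang server; the key is unused locally.
+client = OpenAI(base_url="http://localhost:30000/v1", api_key="none")
+
+response = client.chat.completions.create(
+    model="Qwen/Qwen3-VL-235B-A22B-Instruct",
+    messages=[
+        {
+            "role": "user",
+            "content": [
+                {"type": "text", "text": "What’s in this image?"},
+                {
+                    "type": "image_url",
+                    "image_url": {
+                        "url": "https://github.com/sgl-project/sglang/blob/main/test/lang/example_image.png?raw=true"
+                    },
+                },
+            ],
+        }
+    ],
+    max_tokens=300,
+)
+print(response.choices[0].message.content)
+```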

From 26070ba7a278477b37dc79aa890247e810db3131 Mon Sep 17 00:00:00 2001
From: adarshxs
Date: Thu, 6 Nov 2025 13:13:08 +0530
Subject: [PATCH 5/5] lint

---
 docs/basic_usage/qwen3_vl.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/basic_usage/qwen3_vl.md b/docs/basic_usage/qwen3_vl.md
index d973ce35765a..b8af19b26c90 100644
--- a/docs/basic_usage/qwen3_vl.md
+++ b/docs/basic_usage/qwen3_vl.md
@@ -33,7 +33,7 @@
 
 ## Hardware-specific notes / recommendations
 
-- On H100 with FP8: Use the FP8 checkpoint for best memory efficiency. 
+- On H100 with FP8: Use the FP8 checkpoint for best memory efficiency.
 - On A100 / H100 with BF16 (non-FP8): It’s recommended to use `--mm-max-concurrent-calls` to control parallel throughput and GPU memory usage during image/video inference.
 - On H200 & B200: The model can be run “out of the box”, supporting full context length plus concurrent image + video processing.
 