Conversation
@yizhang2077 @mickqian could you help review this?
python/sglang/srt/conversation.py
This modification also applies to qwen2.5-vl, and perhaps it could be adjusted to make it compatible with more models.
I tried your test, but it didn't work. Do you have any advice to help me fix it? Thanks @xiaomin-D
root@71b970322e07:/sgl-workspace/sglang_1/test/srt# python3 test_vision_openai_server.py TestInternVL2_5Server
command=python3 -m sglang.launch_server --model-path OpenGVLab/InternVL2_5-2B --trust-remote-code --chat-template internvl-2-5 --host 127.0.0.1 --port 8000
INFO 04-20 14:10:02 [__init__.py:239] Automatically detected platform cuda.
INFO 04-20 14:10:02 [__init__.py:239] Automatically detected platform cuda.
INFO 04-20 14:10:02 [__init__.py:239] Automatically detected platform cuda.
INFO 04-20 14:10:02 [__init__.py:239] Automatically detected platform cuda.
INFO 04-20 14:10:03 [__init__.py:239] Automatically detected platform cuda.
[2025-04-20 14:10:04] server_args=ServerArgs(model_path='OpenGVLab/InternVL2_5-2B', tokenizer_path='OpenGVLab/InternVL2_5-2B', tokenizer_mode='auto', skip_tokenizer_init=False, load_format='auto', trust_remote_code=True, dtype='auto', kv_cache_dtype='auto', quantization=None, quantization_param_path=None, context_length=None, device='cuda', served_model_name='OpenGVLab/InternVL2_5-2B', chat_template='internvl-2-5', completion_template=None, is_embedding=False, revision=None, host='127.0.0.1', port=8000, mem_fraction_static=0.88, max_running_requests=None, max_total_tokens=None, chunked_prefill_size=8192, max_prefill_tokens=16384, schedule_policy='fcfs', schedule_conservativeness=1.0, cpu_offload_gb=0, page_size=1, tp_size=1, stream_interval=1, stream_output=False, random_seed=723275796, constrained_json_whitespace_pattern=None, watchdog_timeout=300, dist_timeout=None, download_dir=None, base_gpu_id=0, gpu_id_step=1, log_level='info', log_level_http=None, log_requests=False, log_requests_level=0, show_time_cost=False, enable_metrics=False, decode_log_interval=40, api_key=None, file_storage_path='sglang_storage', enable_cache_report=False, reasoning_parser=None, dp_size=1, load_balance_method='round_robin', ep_size=1, dist_init_addr=None, nnodes=1, node_rank=0, json_model_override_args='{}', lora_paths=None, max_loras_per_batch=8, lora_backend='triton', attention_backend=None, sampling_backend='flashinfer', grammar_backend='xgrammar', speculative_algorithm=None, speculative_draft_model_path=None, speculative_num_steps=None, speculative_eagle_topk=None, speculative_num_draft_tokens=None, speculative_accept_threshold_single=1.0, speculative_accept_threshold_acc=1.0, speculative_token_map=None, enable_double_sparsity=False, ds_channel_config_path=None, ds_heavy_channel_num=32, ds_heavy_token_num=256, ds_heavy_channel_type='qk', ds_sparse_decode_threshold=4096, disable_radix_cache=False, disable_cuda_graph=False, disable_cuda_graph_padding=False, enable_nccl_nvls=False, 
disable_outlines_disk_cache=False, disable_custom_all_reduce=False, enable_llama4_multimodal=None, disable_overlap_schedule=False, enable_mixed_chunk=False, enable_dp_attention=False, enable_ep_moe=False, enable_deepep_moe=False, deepep_mode='auto', enable_torch_compile=False, torch_compile_max_bs=32, cuda_graph_max_bs=160, cuda_graph_bs=None, torchao_config='', enable_nan_detection=False, enable_p2p_check=False, triton_attention_reduce_in_fp32=False, triton_attention_num_kv_splits=8, num_continuous_decode_steps=1, delete_ckpt_after_loading=False, enable_memory_saver=False, allow_auto_truncate=False, enable_custom_logit_processor=False, tool_call_parser=None, enable_hierarchical_cache=False, hicache_ratio=2.0, flashinfer_mla_disable_ragged=False, warmups=None, n_share_experts_fusion=0, disable_shared_experts_fusion=False, disable_chunked_prefix_cache=False, disable_fast_image_processor=False, debug_tensor_dump_output_folder=None, debug_tensor_dump_input_file=None, debug_tensor_dump_inject=False, disaggregation_mode='null', disaggregation_bootstrap_port=8998, disaggregation_transfer_backend='mooncake')
config.json: 100% 4.05k/4.05k [00:00<00:00, 25.2MB/s]
configuration_internvl_chat.py: 100% 4.09k/4.09k [00:00<00:00, 38.0MB/s]
configuration_internlm2.py: 100% 7.00k/7.00k [00:00<00:00, 19.1MB/s]
A new version of the following files was downloaded from https://huggingface.co/OpenGVLab/InternVL2_5-2B:
- configuration_internlm2.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.
configuration_intern_vit.py: 100% 5.55k/5.55k [00:00<00:00, 15.1MB/s]
A new version of the following files was downloaded from https://huggingface.co/OpenGVLab/InternVL2_5-2B:
- configuration_intern_vit.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.
A new version of the following files was downloaded from https://huggingface.co/OpenGVLab/InternVL2_5-2B:
- configuration_internvl_chat.py
- configuration_internlm2.py
- configuration_intern_vit.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.
[2025-04-20 14:10:04] vision_select_layer: -1
[2025-04-20 14:10:04] ps_version: v2
[2025-04-20 14:10:04] min_dynamic_patch: 1
[2025-04-20 14:10:04] max_dynamic_patch: 12
[2025-04-20 14:10:04] Ignore import error when loading sglang.srt.managers.multimodal_processors.gemma3: Failed to import transformers.modeling_utils because of the following error (look up to see its traceback):
cannot import name 'Int4WeightOnlyConfig' from 'torchao.quantization' (/usr/local/lib/python3.10/dist-packages/torchao/quantization/__init__.py)
[2025-04-20 14:10:05] Ignore import error when loading sglang.srt.managers.multimodal_processors.internvl: Failed to import transformers.modeling_utils because of the following error (look up to see its traceback):
cannot import name 'Int4WeightOnlyConfig' from 'torchao.quantization' (/usr/local/lib/python3.10/dist-packages/torchao/quantization/__init__.py)
[2025-04-20 14:10:05] Ignore import error when loading sglang.srt.managers.multimodal_processors.janus_pro: Failed to import transformers.modeling_utils because of the following error (look up to see its traceback):
cannot import name 'Int4WeightOnlyConfig' from 'torchao.quantization' (/usr/local/lib/python3.10/dist-packages/torchao/quantization/__init__.py)
[2025-04-20 14:10:05] Ignore import error when loading sglang.srt.managers.multimodal_processors.llava: Failed to import transformers.models.clip.modeling_clip because of the following error (look up to see its traceback):
/usr/local/lib/python3.10/dist-packages/flash_attn_2_cuda.cpython-310-x86_64-linux-gnu.so: undefined symbol: _ZN3c105ErrorC2ENS_14SourceLocationENSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE
[2025-04-20 14:10:05] Ignore import error when loading sglang.srt.managers.multimodal_processors.minicpm: Failed to import transformers.models.llama.modeling_llama because of the following error (look up to see its traceback):
/usr/local/lib/python3.10/dist-packages/flash_attn_2_cuda.cpython-310-x86_64-linux-gnu.so: undefined symbol: _ZN3c105ErrorC2ENS_14SourceLocationENSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE
[2025-04-20 14:10:05] Ignore import error when loading sglang.srt.managers.multimodal_processors.mlama: Failed to import transformers.modeling_utils because of the following error (look up to see its traceback):
cannot import name 'Int4WeightOnlyConfig' from 'torchao.quantization' (/usr/local/lib/python3.10/dist-packages/torchao/quantization/__init__.py)
[2025-04-20 14:10:05] Ignore import error when loading sglang.srt.managers.multimodal_processors.mllama4: Failed to import transformers.models.llama4.modeling_llama4 because of the following error (look up to see its traceback):
/usr/local/lib/python3.10/dist-packages/flash_attn_2_cuda.cpython-310-x86_64-linux-gnu.so: undefined symbol: _ZN3c105ErrorC2ENS_14SourceLocationENSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE
[2025-04-20 14:10:05] Ignore import error when loading sglang.srt.managers.multimodal_processors.qwen_vl: /usr/local/lib/python3.10/dist-packages/flash_attn_2_cuda.cpython-310-x86_64-linux-gnu.so: undefined symbol: _ZN3c105ErrorC2ENS_14SourceLocationENSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE
preprocessor_config.json: 100% 287/287 [00:00<00:00, 1.01MB/s]
tokenizer_config.json: 100% 4.01k/4.01k [00:00<00:00, 14.0MB/s]
tokenization_internlm2.py: 100% 8.79k/8.79k [00:00<00:00, 24.1MB/s]
A new version of the following files was downloaded from https://huggingface.co/OpenGVLab/InternVL2_5-2B:
- tokenization_internlm2.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.
tokenizer.model: 100% 1.48M/1.48M [00:00<00:00, 18.0MB/s]
added_tokens.json: 100% 179/179 [00:00<00:00, 668kB/s]
special_tokens_map.json: 100% 844/844 [00:00<00:00, 3.22MB/s]
Traceback (most recent call last):
File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/sgl-workspace/sglang_1/python/sglang/launch_server.py", line 14, in <module>
launch_server(server_args)
File "/sgl-workspace/sglang_1/python/sglang/srt/entrypoints/http_server.py", line 704, in launch_server
tokenizer_manager, scheduler_info = _launch_subprocesses(server_args=server_args)
File "/sgl-workspace/sglang_1/python/sglang/srt/entrypoints/engine.py", line 569, in _launch_subprocesses
tokenizer_manager = TokenizerManager(server_args, port_args)
File "/sgl-workspace/sglang_1/python/sglang/srt/managers/tokenizer_manager.py", line 197, in __init__
self.mm_processor = get_mm_processor(
File "/sgl-workspace/sglang_1/python/sglang/srt/managers/multimodal_processor.py", line 61, in get_mm_processor
raise ValueError(
ValueError: No processor registered for architecture: ['InternVLChatModel'].
Registered architectures: ['CLIPModel', 'DeepseekVL2ForCausalLM']
/usr/lib/python3.10/multiprocessing/resource_tracker.py:104: UserWarning: resource_tracker: process died unexpectedly, relaunching. Some resources might leak.
warnings.warn('resource_tracker: process died unexpectedly, '
Traceback (most recent call last):
File "/usr/lib/python3.10/multiprocessing/resource_tracker.py", line 209, in main
cache[rtype].remove(name)
KeyError: '/mp-vo_juru8'
E
======================================================================
ERROR: setUpClass (__main__.TestInternVL2_5Server)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/sgl-workspace/sglang_1/test/srt/test_vision_openai_server.py", line 615, in setUpClass
cls.process = popen_launch_server(
File "/sgl-workspace/sglang_1/python/sglang/test/test_utils.py", line 453, in popen_launch_server
raise Exception(f"Server unexpectedly exits ({return_code=}).")
Exception: Server unexpectedly exits (return_code=1).
----------------------------------------------------------------------
Ran 0 tests in 20.022s
FAILED (errors=1)
root@71b970322e07:/sgl-workspace/sglang_1/test/srt# git branch
client
doc
doc_timeout
extrabody
* internVL
main
nest_asyncio
separate_reasoning
structural_tag
unbalance
verl
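The `ValueError` in the log above is explained by the earlier "Ignore import error" lines: the internvl multimodal processor module failed to import (the torchao `Int4WeightOnlyConfig` error), so `InternVLChatModel` was never registered, and the later lookup fails. A minimal sketch of that register-then-lookup pattern, with hypothetical names rather than the actual sglang code:

```python
# Sketch of a lazy processor registry: modules that fail to import are
# logged and skipped, so their architectures never appear in the registry,
# and a later lookup by architecture raises ValueError.
import importlib

PROCESSOR_REGISTRY: dict[str, str] = {}  # architecture -> processor name


def try_register(module_name: str, architecture: str, processor: str) -> None:
    """Register a processor for an architecture; skip on import failure."""
    try:
        importlib.import_module(module_name)
    except ImportError as e:
        # Mirrors the "Ignore import error when loading ..." log lines.
        print(f"Ignore import error when loading {module_name}: {e}")
        return
    PROCESSOR_REGISTRY[architecture] = processor


def get_mm_processor(architectures: list[str]) -> str:
    """Look up a processor by model architecture, as the traceback does."""
    for arch in architectures:
        if arch in PROCESSOR_REGISTRY:
            return PROCESSOR_REGISTRY[arch]
    raise ValueError(
        f"No processor registered for architecture: {architectures}. "
        f"Registered architectures: {list(PROCESSOR_REGISTRY)}"
    )


# The "json" stdlib module imports fine; the second module does not exist,
# so its architecture is skipped, reproducing the failure mode in the log.
try_register("json", "CLIPModel", "CLIPProcessor")
try_register("definitely_missing_module_xyz", "InternVLChatModel", "InternVLProcessor")
```

Under this pattern, fixing the environment (torchao / flash-attn versions) so the processor module imports cleanly is what makes the architecture show up in the registry.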
@xiaomin-D please rebase with main; we're going to merge it pretty quickly.
docker run \
-itd \
--gpus all \
--privileged --cap-add=IPC_LOCK \
--ulimit memlock=-1 --ulimit stack=67108864 \
-v /data0:/data \
-v /cfs:/cfs \
--net=host \
--ipc=host \
--name=sgldev lmsysorg/sglang:dev
docker exec -it sgldev bash
cd /sgl-workspace && mv sglang sglang_back
cd /sgl-workspace && git clone --branch internVL https://github.com/xiaomin-D/sglang.git
cd /sgl-workspace/sglang/python && pip install -e .
Then, launch the sglang server and post a request.
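The "post a request" step can be sketched as an OpenAI-compatible chat request with one image. This is a hedged example: the port matches the launch command earlier in the thread, but the prompt and image URL are placeholders, not from the PR itself:

```python
# Build an OpenAI-compatible chat payload with one image and show how it
# would be posted to the server launched above (assumed reachable on :8000).
import json


def build_chat_payload(model: str, text: str, image_url: str) -> dict:
    """Assemble a /v1/chat/completions payload with text + image content."""
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": text},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }
        ],
    }


payload = build_chat_payload(
    "OpenGVLab/InternVL2_5-2B",
    "Describe this image.",
    "https://example.com/cat.jpg",  # placeholder image URL
)
print(json.dumps(payload, indent=2))

# To actually post (requires the server from the launch command above):
#   import requests
#   resp = requests.post("http://127.0.0.1:8000/v1/chat/completions", json=payload)
#   print(resp.json()["choices"][0]["message"]["content"])
```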
LGTM. Can we paste a benchmark result here? MMMU or an lmms benchmark would work.
FYI: InternVL3 currently implements its own flash attention for the vision model, which does not support TP and may be a bottleneck; we should replace it in a future PR.
Co-authored-by: Mick <mickjagger19@icloud.com> Co-authored-by: Chayenne <zhaochen20@outlook.com>
* Use device_id in dist init to reduce NCCL communicator warmup & creation overhead (sgl-project#5728)
* [fix] fix potential bumpy throughtput with deepgemm (sgl-project#5722)
* Resolves the `404 Not Found` error when running `compile_deep_gemm.py` in multi-node setups (sgl-project#5720)
* perf: update H20 fused_moe_triton kernel config to get higher throughput during prefilling (sgl-project#5716)
* we fix the non existent access of `decrypted_config_file` (sgl-project#5685)
* CI: rewrite test_vision_chunked_prefill to speedup (sgl-project#5682)
* Fuse MLA set kv cache kernel (sgl-project#5748)
* Update amd docker image to `sglang:v0.4.5.post3-rocm630`. (sgl-project#5697)
* [feature] support for roberta embedding models (sgl-project#5730)
* [fix] fix bench_one_batch_server (sgl-project#5607)
* support for the DeepSeek model by enabling streaming response parsing (sgl-project#5592)
* fix: Use `is not None` instead of `!= None` for None checks. (sgl-project#5687)
* Add Llama 4 to FA3 test (sgl-project#5509)
* [misc] more decode step log for batch_one_batch (sgl-project#5565)
* Handle JSONDecodeError while processing request data (sgl-project#5599)
* fix(srt): check if sample_indices is not None before usage. (sgl-project#5633)
* update llguidance to 0.7.11; adds StructTag (sgl-project#4870)
* Use sgl-kernel sgl_per_token_group_quant_int8 (sgl-project#4971)
* Add memory_saver check (sgl-project#4986) Signed-off-by: Kebe <mail@kebe7jun.com>
* add switch to disable open api doc (sgl-project#3744) Signed-off-by: congcongke <zhanweidu@163.com>
* Revert "fix: import vllm_rotary_embedding error when head_size not in 64, 128, 256, 512" (sgl-project#5772)
* Fix eagle test case (sgl-project#5776)
* Split local attention test from fa3 test (sgl-project#5774)
* Revert "Revert "fix: import vllm_rotary_embedding error when head_size not in 64, 128, 256, 512"" (sgl-project#5777)
* Simplify FA3 tests (sgl-project#5779)
* Revert "[fix] fix bench_one_batch_server" (sgl-project#5785)
* Revert "Use device_id in dist init to reduce NCCL communicator warmup & creation overhead" (sgl-project#5786)
* [CI] Tune threshold (sgl-project#5787)
* [CI] fix port conflicts (sgl-project#5789)
* [CI] Fix ci tests (sgl-project#5769)
* [PD]Reduce kv transfer threads (sgl-project#5791)
* [CI] Fix test case (sgl-project#5790)
* Add 8-GPU Test for Deepseek-V3 (sgl-project#5691) Co-authored-by: Lianmin Zheng <lianminzheng@gmail.com>
* Release v0.4.6 (sgl-project#5795)
* Update nightly-test.yml (sgl-project#5797)
* [CI] Improve github summary & enable fa3 for more models (sgl-project#5796)
* [Docs] update grafana setup guide in production metrics (sgl-project#5643) Co-authored-by: NoahM <88418672+zhudianGG@users.noreply.github.com>
* [Misc] add structure logging, write to file and log tracing for SGL Router
* Improve overlap scheduling (sgl-project#5788)
* Add Cutlass MLA attention backend (sgl-project#5390)
* chore: upgrade sgl-kernel 0.1.0 (sgl-project#5690)
* Dockerfile.dev pip scikit_build_core (sgl-project#5807)
* Add a doc to fix sgl-kernel build link error in py39 with ccache (sgl-project#5809)
* Turn on overlap scheduler for multimodal models (sgl-project#5771)
* Tiny refactor DefaultModelLoader.Source (sgl-project#5482)
* [Docs] Replace lists with tables for cleanup and readability in server_arguments (sgl-project#5276)
* Revert "Tiny refactor DefaultModelLoader.Source" (sgl-project#5825)
* Feat: add support for thinking mode via chat_template_kwargs.enable_t… (sgl-project#5551) Co-authored-by: shuaills <shishuaiuoe@gmail.com> Co-authored-by: Chayenne <zhaochen20@outlook.com> Co-authored-by: Lianmin Zheng <lianminzheng@gmail.com> Co-authored-by: Yineng Zhang <me@zhyncs.com>
* fix: fix the error where the content is None when reasoning and tool … (sgl-project#5838)
* feat: Add fused moe triton config for qwen3 moe on h100 (sgl-project#5833)
* fused moe triton tuning script support qwen3 (sgl-project#5842)
* feat: Add fused moe triton config for qwen3bf16 moe on h20 (sgl-project#5839)
* [PD] support pd fake transfer for warmup (sgl-project#5726)
* [config] qwen3moe_tune_h20 fp8 tp4 (sgl-project#5846)
* [Doc] Recover history of server_arguments.md (sgl-project#5851)
* feat: Add fused moe triton config for qwen3-30b-fp8 moe on h20 (sgl-project#5850)
* [CI] test chunked prefill more (sgl-project#5798)
* ROCm: update AITER (sgl-project#5816)
* [Feat] QWen-1M context support[1/2]: Update block sparse attention backend utils kernel (sgl-project#5847) Co-authored-by: sighingnow <sighingnow@gmail.com>
* [Fix] Missing bootstrap_port field (sgl-project#5823)
* feat: update is_fa3_default_architecture (sgl-project#5854)
* add fused moe config for qwen3moe fp8/bf16 (sgl-project#5849)
* chore: bump v0.4.6.post1 (sgl-project#5845)
* Support `max_completion_tokens` for OpenAIChatCompletions (sgl-project#5857)
* simplify fused_moe config logging (sgl-project#5801)
* [CI] tune the test order to warmup the server (sgl-project#5860)
* Cutlass MLA decode - fix dtype error (sgl-project#5868)
* cutlass 3.9 supported to improve fp8_blockwise_gemm (sgl-project#5820)
* [Feature] support auto chat template (sgl-project#4949)
* Feat: support cuda graph for LoRA (sgl-project#4115) Co-authored-by: Beichen Ma <mabeichen12@gmail.com>
* Add qwen3 30b fused moe config (sgl-project#5859)
* [Fix] Fix a bug for flashmla to run R1 model (sgl-project#5875) Co-authored-by: pengcuo <dgpengcuo@gmail.com>
* Add A800 fused moe config for qwen3 30b (sgl-project#5880)
* [Misc] add service discovery for sgl router
* [fix]: PyO3 macOS linking and consolidate on tracing for logging
* chore: update Dockerfile (sgl-project#5894)
* [Docs] Update docs for Qwen3 and Qwen3MoE (sgl-project#5836)
* [Doc] Tables instead of bulletpoints for sampling doc (sgl-project#5841)
* chore: update CODEOWNERS (sgl-project#5895)
* [FEATURE] Enhance platform compatibility for ARM (sgl-project#5746)
* [CI] Add test_function_calling.py to run_suite.py (sgl-project#5896)
* Auto set draft model path for MTP (sgl-project#5793)
* [fix] relax mem_fraction_static for h200 (sgl-project#5893) Co-authored-by: alcanerian <alcanerian@gmail.com>
* feat: support pythonic tool call and index in tool call streaming (sgl-project#5725)
* [Bugfix]: fix missing queue_time_start for requests from grammar_queue (sgl-project#5696)
* Add AMD MI300x Nightly Testing. (sgl-project#5861)
* chore: use torch 2.6 for sgl-kernel build (sgl-project#5898)
* Fix check_env script (sgl-project#5901)
* [PD] Fix Assertion failed: /DeepEP/csrc/kernels/internode.cu:483, condition: ibgda_get_state()->num_rc_per_pe >= num_channels sgl-project#134 (sgl-project#5830)
* Bump Flashinfer to 0.2.5 (sgl-project#5870) Co-authored-by: Yuhao Chen <yxckeis8@gmail.com>
* [Fix] Unload lora in HF_Runner if needed (sgl-project#5899)
* Add A800 fused moe config for qwen3 235b (sgl-project#5900)
* Add sm_120 for blackwell (sgl-project#5903)
* [Feature] add support kimi vl model (sgl-project#5383) Co-authored-by: wenju.li <wenju.li@deepctr.cn>
* support vlm benchmark profile (sgl-project#5905)
* [fix] kimi-vl test in test_vision_openai_server.py (sgl-project#5910)
* [Misc] use parallel build for cmake in sgl-kernel (sgl-project#5919)
* [qwen3] support qwen3 ep moe (sgl-project#5917) Co-authored-by: sleepcoo <sleepcoo@gmail.com>
* Add TP2 MOE benchmarks for AMD. (sgl-project#5909)
* [Feat] Scale up fa3 kernel to sm8x arch (sgl-project#5912) Co-authored-by: zhyncs <me@zhyncs.com>
* chore: bump sgl-kernel 0.1.1 (sgl-project#5932)
* chore: upgrade sgl-kernel 0.1.1 (sgl-project#5933)
* Remove unused method `calculate_num_image_tokens` from qwen2_vl.py (sgl-project#5783)
* [PP] Add pipeline parallelism (sgl-project#5724)
* Fix lora batch processing when input lora_path contains None (sgl-project#5930)
* add Thor & Spark (sgl-project#5915)
* fix: correct stream response when enable_thinking is set to false (sgl-project#5881)
* fix: update model runner (sgl-project#5934)
* chore: bump v0.4.6.post2 (sgl-project#5939)
* Support XiaomiMiMo/MiMo model inference (sgl-project#5921)
* [PD] Vectorise group_concurrent_contiguous in NumPy (sgl-project#5834) Co-authored-by: luoyuan.luo <luoyuan.luo@antgroup.com>
* Remove extra contiguous (sgl-project#5953)
* Update ci test and doc for MTP api change (sgl-project#5952)
* docs: Fix Qwen model typo (sgl-project#5944) Signed-off-by: JiangJiaWei1103 <waynechuang97@gmail.com>
* Optimize a pad operation to accelerate 25us (sgl-project#5945)
* Properly return error response in vertex_generate HTTP endpoint (sgl-project#5956)
* feat: add concurrency evaluation logic in mmmu benchmark (sgl-project#5782)
* Add 1 gpu perf and 2 gpu accuracy tests for AMD MI300x CI. (sgl-project#5960)
* feat: Refactor DeepSeekV3 function call (sgl-project#5908)
* Remove token in token out in Native API (sgl-project#5967)
* Support InternVL3 (sgl-project#5350) Co-authored-by: Mick <mickjagger19@icloud.com> Co-authored-by: Chayenne <zhaochen20@outlook.com>
* Support MMMU benchmark for InternVL (sgl-project#5968)
* FA3 speed up: skip len operation and get batch size directly from forward batch (sgl-project#5969) Signed-off-by: Lifu Huang <lifu.hlf@gmail.com>
* [PD] NIXL backend Prefill TP & Decode TP+DP (sgl-project#5681)
* Fix set kv cache multi-stream (sgl-project#5975)
* Overlap qk norm with two streams (sgl-project#5977)
* fix: only upgrade nccl for cu128 (sgl-project#5986)
* Fix Phi3 serving which was broke by earlier change (sgl-project#5991) Co-authored-by: Lifu Huang <lifu.hlf@gmail.com>
* [perf] H100 DeepSeek-V3 fused moe tuned config (sgl-project#5998)
* [Fix] Suppress dynamo logging when using flashinfer backend with torch compile (sgl-project#5992)
* [Minor] Fix duplicate method definitions in conversation.py (sgl-project#6012) Signed-off-by: Lifu Huang <lifu.hlf@gmail.com>
* Fix flaky issues of lora and add multi batch tests (sgl-project#5957)
* Tool Call: Add `chat_template_kwargs` documentation (sgl-project#5679)
* fix: fix broadcast_pyobj breaking VerlEngine (sgl-project#5997)
* [PD] Allow customizing reserved tokens to avoid KV cache waste (sgl-project#6002)
* Update dev container config to support live code sync and improve docker setup guide (sgl-project#6018) Signed-off-by: Lifu Huang <lifu.hlf@gmail.com>
* [PD] Optimize disaggregation ib device help info (sgl-project#5781)
* [Test] Add flashmla attention backend test (sgl-project#5587)
* Fix "Avoid computing lse in Ragged Prefill when there's no prefix match" (sgl-project#5555)
* feat: Add a unified merge_state API (sgl-project#5428)
* feat: append more comprehensive fields in messages instead of merely role and content (sgl-project#5996)
* [Security][Bug] Prevent binding to all TCP interfaces (sgl-project#5752)
* Fix prefill OOM error in the case of large page size (sgl-project#5081)
* Fix problem of large page size with chunked prefill (sgl-project#6046)
* docs: add Google Cloud Vertex AI in Adoption and Sponsorship (sgl-project#6047)
* docs: add new blog (sgl-project#6048)
* Fix not "import os" (sgl-project#6057)
* Better PD initialization (sgl-project#5751)
* fix: deepep dockerfile, use pip install deepep. (sgl-project#5885)
* [Fix] Fix and rename flashmla CI test (sgl-project#6045)
* chore: upgrade cutlass 3.9.2 (sgl-project#6004) Co-authored-by: yizhang2077 <1109276519@qq.com>
* Fix sgl-kernel build on aarch64 platforms (sgl-project#6062)
* Add DeepEP to CI PR Test (sgl-project#5655) Co-authored-by: Jinyan Chen <jinyanc@nvidia.com>
* fix custom_allreduce namespace (sgl-project#6039)
* feat: add release workflow for SGLang kernels on aarch64 (sgl-project#6010) Co-authored-by: Qiaolin-Yu <liin1211@outlook.com> Co-authored-by: Yineng Zhang <me@zhyncs.com>
* [Feature] Support for Ascend NPU backend (sgl-project#3853) Signed-off-by: Song Zhang <gepin.zs@antgroup.com> Co-authored-by: 22dimensions <waitingwind@foxmail.com>
* Fix the timeout for 8 gpu tests (sgl-project#6084)
* Hint users DeepEP normal mode is incompatible with CUDA Graph (sgl-project#5014)
* Super tiny fix doc (sgl-project#5233)
* [Doc]Fix description for dp_size argument (sgl-project#6063)
* feat(engine): add bootstrap parameters to generate methods (dynamo) (sgl-project#6075)
* [refactor] slightly tidy fp8 module (sgl-project#5993)
* Clean up fa3 test from 8 gpus (sgl-project#6105)
* Deferring 8 GPU test (sgl-project#6102)
* Update doc for MLA attention backends (sgl-project#6034)
* Clean logs for DeepSeek-V3 launching (sgl-project#6079)
* [CI]Add performance CI for VLM (sgl-project#6038) Signed-off-by: Xinyuan Tong <justinning0323@outlook.com>
* adding Triton configs for DeepSeekV3 FusedMoE kernel on Blackwell (sgl-project#6111)
* optimize pad operations in fa3 to accelarate 100+us (sgl-project#6077)
* Overlap shared expert and routed expert computations (sgl-project#5121)
* Tiny refactor ModelConfig.from_server_args (sgl-project#5219)
* Tiny refactor weight loading logic (sgl-project#5232)
* [PD] Add control to slow down a server (sgl-project#5572)
* Change AMD test threshold (sgl-project#6091)
* DeepEP normal support deepgemm-contiguous (sgl-project#5626) Co-authored-by: Yingyi Huang <yingyihuang2000@outlook.com> Co-authored-by: Cheng Wan <54331508+ch-wan@users.noreply.github.com> Co-authored-by: Xuting Zhou <xutingz@nvidia.com> Co-authored-by: ZhengHSI <zhenghsi@qq.com>
* [fix] fix pyproject.toml dependencies (sgl-project#6119)
* [Feature] Add FlashAttention3 as a backend for VisionAttention (sgl-project#5764) Co-authored-by: othame <chenzhu_912@zju.edu.cn> Co-authored-by: Mick <mickjagger19@icloud.com> Co-authored-by: Yi Zhang <1109276519@qq.com>
* [perf] dsv3 bmm fallback to bf16 (sgl-project#5662)
* [AMD] switch to custom allreduce regardless of MSCCL setting on ROCm (sgl-project#6097)
* [sgl-kernel] fix: fix cu118 compile error (sgl-project#6123) Co-authored-by: zhyncs <me@zhyncs.com>
* upgrade xgrammar to 0.1.19 (sgl-project#6129)
* Remove unecessary is_fa3_supported check (sgl-project#6112)
* chore: bump sgl-kernel 0.1.2 (sgl-project#6131)
* docs: update README (sgl-project#6132)
* [Fix] Incorrect Memory Allocation on CUDA:0 by Non-Zero CUDA Processes in TP/DP (sgl-project#5745)
* Cutlass MLA: Disable split kv due to NVIDIA/cutlass#2274 (sgl-project#6101)
* opt flashinfer mla cat (sgl-project#5822) Co-authored-by: xuyongfei.xyf <xuyongfei.xyf@antgroup.com>
* Update amd nightly concurrency. (sgl-project#6141)
* feat: add thinking_budget (sgl-project#6089)
* [Bugfix] Fix Llama4 gibberish output with long context and CUDA graph (sgl-project#6162)
* fix bug that gpu0 occupies more memory when hicache is turned on (sgl-project#5778) Co-authored-by: Zhiqiang Xie <xiezhq@stanford.edu>
* chore: bump v0.4.6.post3 (sgl-project#6165)
* KV‑Cache (MHA, MLA): add missing start_layer / end_layer fields to MHATokenToKVPoolHost and MLATokenToKVPoolHost (sgl-project#6016) Co-authored-by: 继优 <jiyou.ljy@alibaba-inc.com> Co-authored-by: chus-chus <chus-chus@users.noreply.github.com> Co-authored-by: Zhiqiang Xie <xiezhq@stanford.edu>
* [fix] fix determine_n_share_experts_fusion (sgl-project#6118)
* Fix and Clean up chat-template requirement for VLM (sgl-project#6114) Signed-off-by: Xinyuan Tong <justinning0323@outlook.com>
* [Docs]Delete duplicate content (sgl-project#6146) Co-authored-by: ximing.wxm <ximing.wxm@antgroup.com>
* Revert "feat: add thinking_budget (sgl-project#6089)" (sgl-project#6181)
* Added async_encode method to Engine (sgl-project#4701)
* Fix data parallel perf regression (sgl-project#6183)
* Fix request abortion (sgl-project#6184)
* Add typo checker in pre-commit (sgl-project#6179) Co-authored-by: Brayden Zhong <b8zhong@uwaterloo.ca>
* Remove duplicate IO Struct test (sgl-project#6180) Signed-off-by: Emmanuel Ferdman <emmanuelferdman@gmail.com>
* [PD] Add simple unit test for disaggregation feature (sgl-project#5654) Signed-off-by: Shangming Cai <caishangming@linux.alibaba.com>
* [CI] Disabled deepep tests temporarily because it takes too much time. (sgl-project#6186)
* feat: support loogle eval (sgl-project#6190)
* [fix] remove mixtral from is_fa3_default_architecture (sgl-project#6191)
* fix: handle None multimodal_inputs during merging and filtering batches in disaggregation decode mode (sgl-project#6169)
* chore: upgrade deepgemm (sgl-project#6073)
* chore: bump sgl-kernel v0.1.2.post1 (sgl-project#6195)
* chore: upgrade sgl-kernel v0.1.2.post1 (sgl-project#6196) Co-authored-by: alcanderian <alcanderian@gmail.com>
* Handle empty input string for embedding models (sgl-project#5621) Co-authored-by: Ravi Theja Desetty <ravitheja@Ravis-MacBook-Pro.local>
* doc: fix the erroneous documents and example codes about Alibaba-NLP/gme-Qwen2-VL-2B-Instruct (sgl-project#6199)
* [Docs] minor Qwen3 and reasoning parser docs fix (sgl-project#6032)
* Improve structured outputs: fix race condition, server crash, metrics and style (sgl-project#6188)
* [CI] Reorganize the 8 gpu tests (sgl-project#6192)
* Add dev-deepep docker image (sgl-project#6198)
* Replace time.time() to time.perf_counter() for benchmarking. (sgl-project#6178) Signed-off-by: Lifu Huang <lifu.hlf@gmail.com>
* Update README.md (sgl-project#6202)
* Fix release-docs.yml to not use python 3.9 (sgl-project#6204)
* Fix start_profile does not support with_stack and record_shapes (sgl-project#6043)
* [doc] add a note for --n-share-experts-fusion args (sgl-project#6154)
* Performing Vocabulary Parallelism for LM Head across Attention TP Groups (sgl-project#5558) Co-authored-by: liusy58 <liusy58@linux.alibaba.com>
* Update AMD CI docker to v0.4.6.post3-rocm630. (sgl-project#6213)
* Log if cuda graph is used & extend cuda graph capture to cuda-graph-max-bs (sgl-project#6201) Co-authored-by: SangBin Cho <rkooo567@gmail.com>
* [CI] Fix PD mooncake dependency error (sgl-project#6212) Signed-off-by: Shangming Cai <caishangming@linux.alibaba.com>
* [CI] Re-enable pd disaggregation test (sgl-project#6231) Signed-off-by: Shangming Cai <caishangming@linux.alibaba.com>
* fix some typos (sgl-project#6209) Co-authored-by: Brayden Zhong <b8zhong@uwaterloo.ca>
* [Docs] Add docs for `SGLANG_` and `SGL_` environment variables (sgl-project#6206)
* [PP] Fix init_memory_pool desync & add PP for mixtral (sgl-project#6223)
* Revert "fix some typos" (sgl-project#6244)
* chore: add hf_xet dep (sgl-project#6243)
* Update AMD nightly deps. (sgl-project#6241)
* [PD] Add support for different TP sizes per DP rank (sgl-project#5922) Signed-off-by: Shangming Cai <caishangming@linux.alibaba.com>
* Support incremental streaming of logprob/token_ids between scheduler and detokenizer (sgl-project#6225) Co-authored-by: SangBin Cho <rkooo567@gmail.com>
* fix typo (sgl-project#6248)
* Support tuning moe for llama 4 model (sgl-project#6042)
* Skip the flaky test_stateful_custom_logit_processor (sgl-project#6251)
* [Llama4] Add docs note about enable multimodal (sgl-project#6235)
* [VERL Use Case] Add torch_memory_saver into deps (sgl-project#6247)
* Fix two issues related to `--moe-dense-tp-size=1` (sgl-project#5657) Co-authored-by: liusy58 <liusy58@linux.alibaba.com> Co-authored-by: 颉沆 <xiehang.lsy@alibaba-inc.com>
* model(vlm): pixtral (sgl-project#5084)
* [misc] deep_gemm fallback to NVRTC when NVCC not found (sgl-project#6252)
* Enable MI325X AMD CI. (sgl-project#6259)
* chore: bump v0.4.6.post4 (sgl-project#6245)
* formatting fix for the rebased commit for 4.6.0_post4 Signed-off-by: Mohit Sinha <msinha@habana.ai>
* fix issues in model runner and python packages; fixes for the following issues: > vLLM dependency for xgrammar==0.1.17 > 'Scheduler' object has no attribute 'device > 'pp_proxy_tensors' unexpected arg in HPUGraphRunner > TODO: Add pipeline parallelism support in HPUGraphRunner Signed-off-by: Mohit Sinha <msinha@habana.ai>
* fix formatting in model runner Signed-off-by: Mohit Sinha <msinha@habana.ai>
* base grammar fix for the is_terminated case > 'OutlinesGrammar' object has no attribute 'is_terminated' Signed-off-by: Mohit Sinha <msinha@habana.ai>
---------
Signed-off-by: Kebe <mail@kebe7jun.com>
Signed-off-by: congcongke <zhanweidu@163.com>
Signed-off-by: JiangJiaWei1103 <waynechuang97@gmail.com>
Signed-off-by: Lifu Huang <lifu.hlf@gmail.com>
Signed-off-by: Song Zhang <gepin.zs@antgroup.com>
Signed-off-by: Xinyuan Tong <justinning0323@outlook.com>
Signed-off-by: Emmanuel Ferdman <emmanuelferdman@gmail.com>
Signed-off-by: Shangming Cai <caishangming@linux.alibaba.com>
Signed-off-by: Mohit Sinha <msinha@habana.ai>
Co-authored-by: Wenxuan Tan <wtan45@wisc.edu>
Co-authored-by: JieXin Liang <Alcanderian@users.noreply.github.com>
Co-authored-by: Yuhong Guo <yuhong.gyh@antgroup.com>
Co-authored-by: saltyfish66 <38240284+saltyfish66@users.noreply.github.com>
Co-authored-by: vzed <207368749+vincentzed@users.noreply.github.com>
Co-authored-by: Mick <mickjagger19@icloud.com>
Co-authored-by: Ke Bao <ISPObaoke@163.com>
Co-authored-by: saienduri <saimanas.enduri@amd.com>
Co-authored-by: DavidBao <121073073+DavidBao03@users.noreply.github.com>
Co-authored-by: Frankey_8080 <32973306+Frank-Jie@users.noreply.github.com>
Co-authored-by: Stefan He <hebiaobuaa@gmail.com>
Co-authored-by: yan97ao <580776+yan97ao@users.noreply.github.com>
Co-authored-by: aoshen524 <aoshen524@gmail.com>
Co-authored-by: Michał Moskal
<michal@moskal.me> Co-authored-by: lambert0312 <lambert80.ios@gmail.com> Co-authored-by: Kebe <mail@kebe7jun.com> Co-authored-by: zhanweidu <zhanweidu@163.com> Co-authored-by: Lianmin Zheng <lianminzheng@gmail.com> Co-authored-by: Baizhou Zhang <sobereddiezhang@gmail.com> Co-authored-by: Liangsheng Yin <hnyls2002@gmail.com> Co-authored-by: Huapeng Zhou <73010314+PopSoda2002@users.noreply.github.com> Co-authored-by: NoahM <88418672+zhudianGG@users.noreply.github.com> Co-authored-by: Simo Lin <linsimo.mark@gmail.com> Co-authored-by: Trevor Morris <tmorris@nvidia.com> Co-authored-by: Yineng Zhang <me@zhyncs.com> Co-authored-by: Xiaoyu Zhang <35585791+BBuf@users.noreply.github.com> Co-authored-by: fzyzcjy <5236035+fzyzcjy@users.noreply.github.com> Co-authored-by: Michael Yao <haifeng.yao@daocloud.io> Co-authored-by: mlmz <54172054+minleminzui@users.noreply.github.com> Co-authored-by: shuaills <shishuaiuoe@gmail.com> Co-authored-by: Chayenne <zhaochen20@outlook.com> Co-authored-by: XinyuanTong <115166877+JustinTong0323@users.noreply.github.com> Co-authored-by: yhyang201 <47235274+yhyang201@users.noreply.github.com> Co-authored-by: ybyang <10629930+whybeyoung@users.noreply.github.com> Co-authored-by: JiLi <leege233@gmail.com> Co-authored-by: HAI <hixiao@gmail.com> Co-authored-by: PGFLMG <1106310035@qq.com> Co-authored-by: sighingnow <sighingnow@gmail.com> Co-authored-by: XTY <xutianyi1999@live.com> Co-authored-by: Yi Zhang <1109276519@qq.com> Co-authored-by: Chang Su <chang.s.su@oracle.com> Co-authored-by: woodx <124784234+woodx9@users.noreply.github.com> Co-authored-by: Qiaolin Yu <qy254@cornell.edu> Co-authored-by: Beichen Ma <mabeichen12@gmail.com> Co-authored-by: pengcuo <pengcbupt@163.com> Co-authored-by: pengcuo <dgpengcuo@gmail.com> Co-authored-by: Adarsh Shirawalmath <114558126+adarshxs@users.noreply.github.com> Co-authored-by: simveit <69345428+simveit@users.noreply.github.com> Co-authored-by: Johnny <johnnync13@gmail.com> Co-authored-by: alcanerian 
<alcanerian@gmail.com> Co-authored-by: Yuhao Chen <yxckeis8@gmail.com> Co-authored-by: zhjunqin <zhjunqin@users.noreply.github.com> Co-authored-by: liwenju0 <like4hub@gmail.com> Co-authored-by: wenju.li <wenju.li@deepctr.cn> Co-authored-by: laixin <xielx@shanghaitech.edu.cn> Co-authored-by: sleepcoo <sleepcoo@gmail.com> Co-authored-by: Ying Sheng <sqy1415@gmail.com> Co-authored-by: ryang <38470282+ryang-max@users.noreply.github.com> Co-authored-by: Yuan Luo <yuan.luo@hotmail.com> Co-authored-by: luoyuan.luo <luoyuan.luo@antgroup.com> Co-authored-by: 江家瑋 <36886416+JiangJiaWei1103@users.noreply.github.com> Co-authored-by: KCFindstr <shimakaze@google.com> Co-authored-by: xm:D <38322020+xiaomin-D@users.noreply.github.com> Co-authored-by: Lifu Huang <lifu.hlf@gmail.com> Co-authored-by: Yongtong Wu <914554688@qq.com> Co-authored-by: Junrong Lin <33685709+ocss884@users.noreply.github.com> Co-authored-by: shangmingc <caishangming@linux.alibaba.com> Co-authored-by: DefTruth <31974251+DefTruth@users.noreply.github.com> Co-authored-by: Zhiqiang Xie <xiezhq@stanford.edu> Co-authored-by: Hank Han <54751605+HanHan009527@users.noreply.github.com> Co-authored-by: Qiaolin Yu <liin1211@outlook.com> Co-authored-by: Jinyan Chen <93358689+liz-badada@users.noreply.github.com> Co-authored-by: Jinyan Chen <jinyanc@nvidia.com> Co-authored-by: Johnny <johnnynuca14@gmail.com> Co-authored-by: Song Zhang <70674731+botieking98@users.noreply.github.com> Co-authored-by: 22dimensions <waitingwind@foxmail.com> Co-authored-by: ishandhanani <82981111+ishandhanani@users.noreply.github.com> Co-authored-by: Cheng Wan <54331508+ch-wan@users.noreply.github.com> Co-authored-by: Minglei Zhu <mingleizhu1122@gmail.com> Co-authored-by: lukec <118525388+sleepcoo@users.noreply.github.com> Co-authored-by: Yingyi Huang <yingyihuang2000@outlook.com> Co-authored-by: Xuting Zhou <xutingz@nvidia.com> Co-authored-by: ZhengHSI <zhenghsi@qq.com> Co-authored-by: Zhu Chen <51010608+Othame@users.noreply.github.com> 
Co-authored-by: othame <chenzhu_912@zju.edu.cn> Co-authored-by: Hubert Lu <55214931+hubertlu-tw@users.noreply.github.com> Co-authored-by: Yixin Dong <ubospica@gmail.com> Co-authored-by: xu-yfei <xu_yfei@qq.com> Co-authored-by: xuyongfei.xyf <xuyongfei.xyf@antgroup.com> Co-authored-by: thyecust <tienhoayu@gmail.com> Co-authored-by: huangtingwei <141888744+huangtingwei9988@users.noreply.github.com> Co-authored-by: Simon (Jiyou) Li <Simon-Li@users.noreply.github.com> Co-authored-by: 继优 <jiyou.ljy@alibaba-inc.com> Co-authored-by: chus-chus <chus-chus@users.noreply.github.com> Co-authored-by: Ximingwang-09 <72070413+Ximingwang-09@users.noreply.github.com> Co-authored-by: ximing.wxm <ximing.wxm@antgroup.com> Co-authored-by: Steven Shimizu <shimizust@gmail.com> Co-authored-by: applesaucethebun <113181361+applesaucethebun@users.noreply.github.com> Co-authored-by: Brayden Zhong <b8zhong@uwaterloo.ca> Co-authored-by: Emmanuel Ferdman <emmanuelferdman@gmail.com> Co-authored-by: Yusong Gao <yusong.gao@gmail.com> Co-authored-by: alcanderian <alcanderian@gmail.com> Co-authored-by: Ravi Theja <ravi03071991@gmail.com> Co-authored-by: Ravi Theja Desetty <ravitheja@Ravis-MacBook-Pro.local> Co-authored-by: liusy58 <liusy58@linux.alibaba.com> Co-authored-by: SangBin Cho <rkooo567@gmail.com> Co-authored-by: 颉沆 <xiehang.lsy@alibaba-inc.com> Co-authored-by: Kiv Chen <34561254+KivenChen@users.noreply.github.com>
Co-authored-by: Mick <mickjagger19@icloud.com>
Co-authored-by: Chayenne <zhaochen20@outlook.com>
Motivation
Support InternVL3
Modifications
Based on PRs #3351 and #4433.
Support both InternLM2ForCausalLM and Qwen2ForCausalLM as the language model.
Support InternVL2.5 and InternVL3.
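The dispatch implied by the list above — InternVL2.5/InternVL3 checkpoints may carry either an InternLM2 or a Qwen2 backbone, yet both are served with the same `internvl-2-5` chat template (the name used in the test command) — can be sketched as follows. `select_chat_template` and `SUPPORTED_LM_ARCHS` are illustrative names for this sketch, not part of the sglang API:

```python
# Hypothetical sketch: map the language-model architecture reported in an
# InternVL checkpoint config to a conversation-template name. Both supported
# backbones share the single "internvl-2-5" template registered by this PR.

SUPPORTED_LM_ARCHS = {"InternLM2ForCausalLM", "Qwen2ForCausalLM"}

def select_chat_template(lm_arch: str) -> str:
    """Return the conversation-template name for a supported LM backbone."""
    if lm_arch not in SUPPORTED_LM_ARCHS:
        raise ValueError(f"unsupported language model: {lm_arch}")
    # InternVL2.5 and InternVL3 use the same template regardless of backbone.
    return "internvl-2-5"

for arch in ("InternLM2ForCausalLM", "Qwen2ForCausalLM"):
    print(arch, "->", select_chat_template(arch))
```

Failing fast on an unknown backbone keeps a misconfigured checkpoint from silently falling through to a default template that would garble the image-token layout.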
Checklist