Conversation
@yizhang2077 @mickqian could you help review this?
python/sglang/srt/conversation.py
This modification also applies to qwen2.5-vl, and perhaps it could be adjusted to make it compatible with more models.
I tried your test, but it didn't work. Do you have any advice to help me fix it? Thanks @xiaomin-D
root@71b970322e07:/sgl-workspace/sglang_1/test/srt# python3 test_vision_openai_server.py TestInternVL2_5Server
command=python3 -m sglang.launch_server --model-path OpenGVLab/InternVL2_5-2B --trust-remote-code --chat-template internvl-2-5 --host 127.0.0.1 --port 8000
INFO 04-20 14:10:02 [__init__.py:239] Automatically detected platform cuda.
INFO 04-20 14:10:02 [__init__.py:239] Automatically detected platform cuda.
INFO 04-20 14:10:02 [__init__.py:239] Automatically detected platform cuda.
INFO 04-20 14:10:02 [__init__.py:239] Automatically detected platform cuda.
INFO 04-20 14:10:03 [__init__.py:239] Automatically detected platform cuda.
[2025-04-20 14:10:04] server_args=ServerArgs(model_path='OpenGVLab/InternVL2_5-2B', tokenizer_path='OpenGVLab/InternVL2_5-2B', tokenizer_mode='auto', skip_tokenizer_init=False, load_format='auto', trust_remote_code=True, dtype='auto', kv_cache_dtype='auto', quantization=None, quantization_param_path=None, context_length=None, device='cuda', served_model_name='OpenGVLab/InternVL2_5-2B', chat_template='internvl-2-5', completion_template=None, is_embedding=False, revision=None, host='127.0.0.1', port=8000, mem_fraction_static=0.88, max_running_requests=None, max_total_tokens=None, chunked_prefill_size=8192, max_prefill_tokens=16384, schedule_policy='fcfs', schedule_conservativeness=1.0, cpu_offload_gb=0, page_size=1, tp_size=1, stream_interval=1, stream_output=False, random_seed=723275796, constrained_json_whitespace_pattern=None, watchdog_timeout=300, dist_timeout=None, download_dir=None, base_gpu_id=0, gpu_id_step=1, log_level='info', log_level_http=None, log_requests=False, log_requests_level=0, show_time_cost=False, enable_metrics=False, decode_log_interval=40, api_key=None, file_storage_path='sglang_storage', enable_cache_report=False, reasoning_parser=None, dp_size=1, load_balance_method='round_robin', ep_size=1, dist_init_addr=None, nnodes=1, node_rank=0, json_model_override_args='{}', lora_paths=None, max_loras_per_batch=8, lora_backend='triton', attention_backend=None, sampling_backend='flashinfer', grammar_backend='xgrammar', speculative_algorithm=None, speculative_draft_model_path=None, speculative_num_steps=None, speculative_eagle_topk=None, speculative_num_draft_tokens=None, speculative_accept_threshold_single=1.0, speculative_accept_threshold_acc=1.0, speculative_token_map=None, enable_double_sparsity=False, ds_channel_config_path=None, ds_heavy_channel_num=32, ds_heavy_token_num=256, ds_heavy_channel_type='qk', ds_sparse_decode_threshold=4096, disable_radix_cache=False, disable_cuda_graph=False, disable_cuda_graph_padding=False, enable_nccl_nvls=False, 
disable_outlines_disk_cache=False, disable_custom_all_reduce=False, enable_llama4_multimodal=None, disable_overlap_schedule=False, enable_mixed_chunk=False, enable_dp_attention=False, enable_ep_moe=False, enable_deepep_moe=False, deepep_mode='auto', enable_torch_compile=False, torch_compile_max_bs=32, cuda_graph_max_bs=160, cuda_graph_bs=None, torchao_config='', enable_nan_detection=False, enable_p2p_check=False, triton_attention_reduce_in_fp32=False, triton_attention_num_kv_splits=8, num_continuous_decode_steps=1, delete_ckpt_after_loading=False, enable_memory_saver=False, allow_auto_truncate=False, enable_custom_logit_processor=False, tool_call_parser=None, enable_hierarchical_cache=False, hicache_ratio=2.0, flashinfer_mla_disable_ragged=False, warmups=None, n_share_experts_fusion=0, disable_shared_experts_fusion=False, disable_chunked_prefix_cache=False, disable_fast_image_processor=False, debug_tensor_dump_output_folder=None, debug_tensor_dump_input_file=None, debug_tensor_dump_inject=False, disaggregation_mode='null', disaggregation_bootstrap_port=8998, disaggregation_transfer_backend='mooncake')
config.json: 100% 4.05k/4.05k [00:00<00:00, 25.2MB/s]
configuration_internvl_chat.py: 100% 4.09k/4.09k [00:00<00:00, 38.0MB/s]
configuration_internlm2.py: 100% 7.00k/7.00k [00:00<00:00, 19.1MB/s]
A new version of the following files was downloaded from https://huggingface.co/OpenGVLab/InternVL2_5-2B:
- configuration_internlm2.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.
configuration_intern_vit.py: 100% 5.55k/5.55k [00:00<00:00, 15.1MB/s]
A new version of the following files was downloaded from https://huggingface.co/OpenGVLab/InternVL2_5-2B:
- configuration_intern_vit.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.
A new version of the following files was downloaded from https://huggingface.co/OpenGVLab/InternVL2_5-2B:
- configuration_internvl_chat.py
- configuration_internlm2.py
- configuration_intern_vit.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.
[2025-04-20 14:10:04] vision_select_layer: -1
[2025-04-20 14:10:04] ps_version: v2
[2025-04-20 14:10:04] min_dynamic_patch: 1
[2025-04-20 14:10:04] max_dynamic_patch: 12
[2025-04-20 14:10:04] Ignore import error when loading sglang.srt.managers.multimodal_processors.gemma3: Failed to import transformers.modeling_utils because of the following error (look up to see its traceback):
cannot import name 'Int4WeightOnlyConfig' from 'torchao.quantization' (/usr/local/lib/python3.10/dist-packages/torchao/quantization/__init__.py)
[2025-04-20 14:10:05] Ignore import error when loading sglang.srt.managers.multimodal_processors.internvl: Failed to import transformers.modeling_utils because of the following error (look up to see its traceback):
cannot import name 'Int4WeightOnlyConfig' from 'torchao.quantization' (/usr/local/lib/python3.10/dist-packages/torchao/quantization/__init__.py)
[2025-04-20 14:10:05] Ignore import error when loading sglang.srt.managers.multimodal_processors.janus_pro: Failed to import transformers.modeling_utils because of the following error (look up to see its traceback):
cannot import name 'Int4WeightOnlyConfig' from 'torchao.quantization' (/usr/local/lib/python3.10/dist-packages/torchao/quantization/__init__.py)
[2025-04-20 14:10:05] Ignore import error when loading sglang.srt.managers.multimodal_processors.llava: Failed to import transformers.models.clip.modeling_clip because of the following error (look up to see its traceback):
/usr/local/lib/python3.10/dist-packages/flash_attn_2_cuda.cpython-310-x86_64-linux-gnu.so: undefined symbol: _ZN3c105ErrorC2ENS_14SourceLocationENSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE
[2025-04-20 14:10:05] Ignore import error when loading sglang.srt.managers.multimodal_processors.minicpm: Failed to import transformers.models.llama.modeling_llama because of the following error (look up to see its traceback):
/usr/local/lib/python3.10/dist-packages/flash_attn_2_cuda.cpython-310-x86_64-linux-gnu.so: undefined symbol: _ZN3c105ErrorC2ENS_14SourceLocationENSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE
[2025-04-20 14:10:05] Ignore import error when loading sglang.srt.managers.multimodal_processors.mlama: Failed to import transformers.modeling_utils because of the following error (look up to see its traceback):
cannot import name 'Int4WeightOnlyConfig' from 'torchao.quantization' (/usr/local/lib/python3.10/dist-packages/torchao/quantization/__init__.py)
[2025-04-20 14:10:05] Ignore import error when loading sglang.srt.managers.multimodal_processors.mllama4: Failed to import transformers.models.llama4.modeling_llama4 because of the following error (look up to see its traceback):
/usr/local/lib/python3.10/dist-packages/flash_attn_2_cuda.cpython-310-x86_64-linux-gnu.so: undefined symbol: _ZN3c105ErrorC2ENS_14SourceLocationENSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE
[2025-04-20 14:10:05] Ignore import error when loading sglang.srt.managers.multimodal_processors.qwen_vl: /usr/local/lib/python3.10/dist-packages/flash_attn_2_cuda.cpython-310-x86_64-linux-gnu.so: undefined symbol: _ZN3c105ErrorC2ENS_14SourceLocationENSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE
preprocessor_config.json: 100% 287/287 [00:00<00:00, 1.01MB/s]
tokenizer_config.json: 100% 4.01k/4.01k [00:00<00:00, 14.0MB/s]
tokenization_internlm2.py: 100% 8.79k/8.79k [00:00<00:00, 24.1MB/s]
A new version of the following files was downloaded from https://huggingface.co/OpenGVLab/InternVL2_5-2B:
- tokenization_internlm2.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.
tokenizer.model: 100% 1.48M/1.48M [00:00<00:00, 18.0MB/s]
added_tokens.json: 100% 179/179 [00:00<00:00, 668kB/s]
special_tokens_map.json: 100% 844/844 [00:00<00:00, 3.22MB/s]
Traceback (most recent call last):
File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/sgl-workspace/sglang_1/python/sglang/launch_server.py", line 14, in <module>
launch_server(server_args)
File "/sgl-workspace/sglang_1/python/sglang/srt/entrypoints/http_server.py", line 704, in launch_server
tokenizer_manager, scheduler_info = _launch_subprocesses(server_args=server_args)
File "/sgl-workspace/sglang_1/python/sglang/srt/entrypoints/engine.py", line 569, in _launch_subprocesses
tokenizer_manager = TokenizerManager(server_args, port_args)
File "/sgl-workspace/sglang_1/python/sglang/srt/managers/tokenizer_manager.py", line 197, in __init__
self.mm_processor = get_mm_processor(
File "/sgl-workspace/sglang_1/python/sglang/srt/managers/multimodal_processor.py", line 61, in get_mm_processor
raise ValueError(
ValueError: No processor registered for architecture: ['InternVLChatModel'].
Registered architectures: ['CLIPModel', 'DeepseekVL2ForCausalLM']
/usr/lib/python3.10/multiprocessing/resource_tracker.py:104: UserWarning: resource_tracker: process died unexpectedly, relaunching. Some resources might leak.
warnings.warn('resource_tracker: process died unexpectedly, '
Traceback (most recent call last):
File "/usr/lib/python3.10/multiprocessing/resource_tracker.py", line 209, in main
cache[rtype].remove(name)
KeyError: '/mp-vo_juru8'
E
======================================================================
ERROR: setUpClass (__main__.TestInternVL2_5Server)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/sgl-workspace/sglang_1/test/srt/test_vision_openai_server.py", line 615, in setUpClass
cls.process = popen_launch_server(
File "/sgl-workspace/sglang_1/python/sglang/test/test_utils.py", line 453, in popen_launch_server
raise Exception(f"Server unexpectedly exits ({return_code=}).")
Exception: Server unexpectedly exits (return_code=1).
----------------------------------------------------------------------
Ran 0 tests in 20.022s
FAILED (errors=1)
root@71b970322e07:/sgl-workspace/sglang_1/test/srt# git branch
client
doc
doc_timeout
extrabody
* internVL
main
nest_asyncio
separate_reasoning
structural_tag
unbalance
verl
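The `ValueError` in the log above is explained by the earlier "Ignore import error" lines: the internvl multimodal processor module failed to import (the torchao `Int4WeightOnlyConfig` error), so `InternVLChatModel` was never registered, and the later lookup fails. A minimal sketch of that register-then-lookup pattern, with hypothetical names rather than the actual sglang code:

```python
# Sketch of a lazy processor registry: modules that fail to import are
# logged and skipped, so their architectures never appear in the registry,
# and a later lookup by architecture raises ValueError.
import importlib

PROCESSOR_REGISTRY: dict[str, str] = {}  # architecture -> processor name


def try_register(module_name: str, architecture: str, processor: str) -> None:
    """Register a processor for an architecture; skip on import failure."""
    try:
        importlib.import_module(module_name)
    except ImportError as e:
        # Mirrors the "Ignore import error when loading ..." log lines.
        print(f"Ignore import error when loading {module_name}: {e}")
        return
    PROCESSOR_REGISTRY[architecture] = processor


def get_mm_processor(architectures: list[str]) -> str:
    """Look up a processor by model architecture, as the traceback does."""
    for arch in architectures:
        if arch in PROCESSOR_REGISTRY:
            return PROCESSOR_REGISTRY[arch]
    raise ValueError(
        f"No processor registered for architecture: {architectures}. "
        f"Registered architectures: {list(PROCESSOR_REGISTRY)}"
    )


# The "json" stdlib module imports fine; the second module does not exist,
# so its architecture is skipped, reproducing the failure mode in the log.
try_register("json", "CLIPModel", "CLIPProcessor")
try_register("definitely_missing_module_xyz", "InternVLChatModel", "InternVLProcessor")
```

Under this pattern, fixing the environment (torchao / flash-attn versions) so the processor module imports cleanly is what makes the architecture show up in the registry.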
@xiaomin-D please rebase with main; we're going to merge it pretty quickly.
docker run \
-itd \
--gpus all \
--privileged --cap-add=IPC_LOCK \
--ulimit memlock=-1 --ulimit stack=67108864 \
-v /data0:/data \
-v /cfs:/cfs \
--net=host \
--ipc=host \
--name=sgldev lmsysorg/sglang:dev
docker exec -it sgldev bash
cd /sgl-workspace && mv sglang sglang_back
cd /sgl-workspace && git clone --branch internVL https://github.com/xiaomin-D/sglang.git
cd /sgl-workspace/sglang/python && pip install -e .
Then, launch the sglang server and post a request.
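The "post a request" step can be sketched as an OpenAI-compatible chat request with one image. This is a hedged example: the port matches the launch command earlier in the thread, but the prompt and image URL are placeholders, not from the PR itself:

```python
# Build an OpenAI-compatible chat payload with one image and show how it
# would be posted to the server launched above (assumed reachable on :8000).
import json


def build_chat_payload(model: str, text: str, image_url: str) -> dict:
    """Assemble a /v1/chat/completions payload with text + image content."""
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": text},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }
        ],
    }


payload = build_chat_payload(
    "OpenGVLab/InternVL2_5-2B",
    "Describe this image.",
    "https://example.com/cat.jpg",  # placeholder image URL
)
print(json.dumps(payload, indent=2))

# To actually post (requires the server from the launch command above):
#   import requests
#   resp = requests.post("http://127.0.0.1:8000/v1/chat/completions", json=payload)
#   print(resp.json()["choices"][0]["message"]["content"])
```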
LGTM. Can we paste a benchmark result here? MMMU or an lmms benchmark would work.
FYI: InternVL3 currently implements its own flash attention for the vision model, which does not support TP and may be a bottleneck; we should replace it in a future PR.
Co-authored-by: Mick <mickjagger19@icloud.com> Co-authored-by: Chayenne <zhaochen20@outlook.com>
* Use device_id in dist init to reduce NCCL communicator warmup & creation overhead (sgl-project#5728)
* [fix] fix potential bumpy throughtput with deepgemm (sgl-project#5722)
* Resolves the `404 Not Found` error when running `compile_deep_gemm.py` in multi-node setups (sgl-project#5720)
* perf: update H20 fused_moe_triton kernel config to get higher throughput during prefilling (sgl-project#5716)
* we fix the non existent access of `decrypted_config_file` (sgl-project#5685)
* CI: rewrite test_vision_chunked_prefill to speedup (sgl-project#5682)
* Fuse MLA set kv cache kernel (sgl-project#5748)
* Update amd docker image to `sglang:v0.4.5.post3-rocm630`. (sgl-project#5697)
* [feature] support for roberta embedding models (sgl-project#5730)
* [fix] fix bench_one_batch_server (sgl-project#5607)
* support for the DeepSeek model by enabling streaming response parsing (sgl-project#5592)
* fix: Use `is not None` instead of `!= None` for None checks. (sgl-project#5687)
* Add Llama 4 to FA3 test (sgl-project#5509)
* [misc] more decode step log for batch_one_batch (sgl-project#5565)
* Handle JSONDecodeError while processing request data (sgl-project#5599)
* fix(srt): check if sample_indices is not None before usage. (sgl-project#5633)
* update llguidance to 0.7.11; adds StructTag (sgl-project#4870)
* Use sgl-kernel sgl_per_token_group_quant_int8 (sgl-project#4971)
* Add memory_saver check (sgl-project#4986) Signed-off-by: Kebe <mail@kebe7jun.com>
* add switch to disable open api doc (sgl-project#3744) Signed-off-by: congcongke <zhanweidu@163.com>
* Revert "fix: import vllm_rotary_embedding error when head_size not in 64, 128, 256, 512" (sgl-project#5772)
* Fix eagle test case (sgl-project#5776)
* Split local attention test from fa3 test (sgl-project#5774)
* Revert "Revert "fix: import vllm_rotary_embedding error when head_size not in 64, 128, 256, 512"" (sgl-project#5777)
* Simplify FA3 tests (sgl-project#5779)
* Revert "[fix] fix bench_one_batch_server" (sgl-project#5785)
* Revert "Use device_id in dist init to reduce NCCL communicator warmup & creation overhead" (sgl-project#5786)
* [CI] Tune threshold (sgl-project#5787)
* [CI] fix port conflicts (sgl-project#5789)
* [CI] Fix ci tests (sgl-project#5769)
* [PD]Reduce kv transfer threads (sgl-project#5791)
* [CI] Fix test case (sgl-project#5790)
* Add 8-GPU Test for Deepseek-V3 (sgl-project#5691) Co-authored-by: Lianmin Zheng <lianminzheng@gmail.com>
* Release v0.4.6 (sgl-project#5795)
* Update nightly-test.yml (sgl-project#5797)
* [CI] Improve github summary & enable fa3 for more models (sgl-project#5796)
* [Docs] update grafana setup guide in production metrics (sgl-project#5643) Co-authored-by: NoahM <88418672+zhudianGG@users.noreply.github.com>
* [Misc] add structure logging, write to file and log tracing for SGL Router
* Improve overlap scheduling (sgl-project#5788)
* Add Cutlass MLA attention backend (sgl-project#5390)
* chore: upgrade sgl-kernel 0.1.0 (sgl-project#5690)
* Dockerfile.dev pip scikit_build_core (sgl-project#5807)
* Add a doc to fix sgl-kernel build link error in py39 with ccache (sgl-project#5809)
* Turn on overlap scheduler for multimodal models (sgl-project#5771)
* Tiny refactor DefaultModelLoader.Source (sgl-project#5482)
* [Docs] Replace lists with tables for cleanup and readability in server_arguments (sgl-project#5276)
* Revert "Tiny refactor DefaultModelLoader.Source" (sgl-project#5825)
* Feat: add support for thinking mode via chat_template_kwargs.enable_t… (sgl-project#5551) Co-authored-by: shuaills <shishuaiuoe@gmail.com> Co-authored-by: Chayenne <zhaochen20@outlook.com> Co-authored-by: Lianmin Zheng <lianminzheng@gmail.com> Co-authored-by: Yineng Zhang <me@zhyncs.com>
* fix: fix the error where the content is None when reasoning and tool … (sgl-project#5838)
* feat: Add fused moe triton config for qwen3 moe on h100 (sgl-project#5833)
* fused moe triton tuning script support qwen3 (sgl-project#5842)
* feat: Add fused moe triton config for qwen3bf16 moe on h20 (sgl-project#5839)
* [PD] support pd fake transfer for warmup (sgl-project#5726)
* [config] qwen3moe_tune_h20 fp8 tp4 (sgl-project#5846)
* [Doc] Recover history of server_arguments.md (sgl-project#5851)
* feat: Add fused moe triton config for qwen3-30b-fp8 moe on h20 (sgl-project#5850)
* [CI] test chunked prefill more (sgl-project#5798)
* ROCm: update AITER (sgl-project#5816)
* [Feat] QWen-1M context support[1/2]: Update block sparse attention backend utils kernel (sgl-project#5847) Co-authored-by: sighingnow <sighingnow@gmail.com>
* [Fix] Missing bootstrap_port field (sgl-project#5823)
* feat: update is_fa3_default_architecture (sgl-project#5854)
* add fused moe config for qwen3moe fp8/bf16 (sgl-project#5849)
* chore: bump v0.4.6.post1 (sgl-project#5845)
* Support `max_completion_tokens` for OpenAIChatCompletions (sgl-project#5857)
* simplify fused_moe config logging (sgl-project#5801)
* [CI] tune the test order to warmup the server (sgl-project#5860)
* Cutlass MLA decode - fix dtype error (sgl-project#5868)
* cutlass 3.9 supported to improve fp8_blockwise_gemm (sgl-project#5820)
* [Feature] support auto chat template (sgl-project#4949)
* Feat: support cuda graph for LoRA (sgl-project#4115) Co-authored-by: Beichen Ma <mabeichen12@gmail.com>
* Add qwen3 30b fused moe config (sgl-project#5859)
* [Fix] Fix a bug for flashmla to run R1 model (sgl-project#5875) Co-authored-by: pengcuo <dgpengcuo@gmail.com>
* Add A800 fused moe config for qwen3 30b (sgl-project#5880)
* [Misc] add service discovery for sgl router
* [fix]: PyO3 macOS linking and consolidate on tracing for logging
* chore: update Dockerfile (sgl-project#5894)
* [Docs] Update docs for Qwen3 and Qwen3MoE (sgl-project#5836)
* [Doc] Tables instead of bulletpoints for sampling doc (sgl-project#5841)
* chore: update CODEOWNERS (sgl-project#5895)
* [FEATURE] Enhance platform compatibility for ARM (sgl-project#5746)
* [CI] Add test_function_calling.py to run_suite.py (sgl-project#5896)
* Auto set draft model path for MTP (sgl-project#5793)
* [fix] relax mem_fraction_static for h200 (sgl-project#5893) Co-authored-by: alcanerian <alcanerian@gmail.com>
* feat: support pythonic tool call and index in tool call streaming (sgl-project#5725)
* [Bugfix]: fix missing queue_time_start for requests from grammar_queue (sgl-project#5696)
* Add AMD MI300x Nightly Testing. (sgl-project#5861)
* chore: use torch 2.6 for sgl-kernel build (sgl-project#5898)
* Fix check_env script (sgl-project#5901)
* [PD] Fix Assertion failed: /DeepEP/csrc/kernels/internode.cu:483, condition: ibgda_get_state()->num_rc_per_pe >= num_channels sgl-project#134 (sgl-project#5830)
* Bump Flashinfer to 0.2.5 (sgl-project#5870) Co-authored-by: Yuhao Chen <yxckeis8@gmail.com>
* [Fix] Unload lora in HF_Runner if needed (sgl-project#5899)
* Add A800 fused moe config for qwen3 235b (sgl-project#5900)
* Add sm_120 for blackwell (sgl-project#5903)
* [Feature] add support kimi vl model (sgl-project#5383) Co-authored-by: wenju.li <wenju.li@deepctr.cn>
* support vlm benchmark profile (sgl-project#5905)
* [fix] kimi-vl test in test_vision_openai_server.py (sgl-project#5910)
* [Misc] use parallel build for cmake in sgl-kernel (sgl-project#5919)
* [qwen3] support qwen3 ep moe (sgl-project#5917) Co-authored-by: sleepcoo <sleepcoo@gmail.com>
* Add TP2 MOE benchmarks for AMD. (sgl-project#5909)
* [Feat] Scale up fa3 kernel to sm8x arch (sgl-project#5912) Co-authored-by: zhyncs <me@zhyncs.com>
* chore: bump sgl-kernel 0.1.1 (sgl-project#5932)
* chore: upgrade sgl-kernel 0.1.1 (sgl-project#5933)
* Remove unused method `calculate_num_image_tokens` from qwen2_vl.py (sgl-project#5783)
* [PP] Add pipeline parallelism (sgl-project#5724)
* Fix lora batch processing when input lora_path contains None (sgl-project#5930)
* add Thor & Spark (sgl-project#5915)
* fix: correct stream response when enable_thinking is set to false (sgl-project#5881)
* fix: update model runner (sgl-project#5934)
* chore: bump v0.4.6.post2 (sgl-project#5939)
* Support XiaomiMiMo/MiMo model inference (sgl-project#5921)
* [PD] Vectorise group_concurrent_contiguous in NumPy (sgl-project#5834) Co-authored-by: luoyuan.luo <luoyuan.luo@antgroup.com>
* Remove extra contiguous (sgl-project#5953)
* Update ci test and doc for MTP api change (sgl-project#5952)
* docs: Fix Qwen model typo (sgl-project#5944) Signed-off-by: JiangJiaWei1103 <waynechuang97@gmail.com>
* Optimize a pad operation to accelerate 25us (sgl-project#5945)
* Properly return error response in vertex_generate HTTP endpoint (sgl-project#5956)
* feat: add concurrency evaluation logic in mmmu benchmark (sgl-project#5782)
* Add 1 gpu perf and 2 gpu accuracy tests for AMD MI300x CI. (sgl-project#5960)
* feat: Refactor DeepSeekV3 function call (sgl-project#5908)
* Remove token in token out in Native API (sgl-project#5967)
* Support InternVL3 (sgl-project#5350) Co-authored-by: Mick <mickjagger19@icloud.com> Co-authored-by: Chayenne <zhaochen20@outlook.com>
* Support MMMU benchmark for InternVL (sgl-project#5968)
* FA3 speed up: skip len operation and get batch size directly from forward batch (sgl-project#5969) Signed-off-by: Lifu Huang <lifu.hlf@gmail.com>
* [PD] NIXL backend Prefill TP & Decode TP+DP (sgl-project#5681)
* Fix set kv cache multi-stream (sgl-project#5975)
* Overlap qk norm with two streams (sgl-project#5977)
* fix: only upgrade nccl for cu128 (sgl-project#5986)
* Fix Phi3 serving which was broke by earlier change (sgl-project#5991) Co-authored-by: Lifu Huang <lifu.hlf@gmail.com>
* [perf] H100 DeepSeek-V3 fused moe tuned config (sgl-project#5998)
* [Fix] Suppress dynamo logging when using flashinfer backend with torch compile (sgl-project#5992)
* [Minor] Fix duplicate method definitions in conversation.py (sgl-project#6012) Signed-off-by: Lifu Huang <lifu.hlf@gmail.com>
* Fix flaky issues of lora and add multi batch tests (sgl-project#5957)
* Tool Call: Add `chat_template_kwargs` documentation (sgl-project#5679)
* fix: fix broadcast_pyobj breaking VerlEngine (sgl-project#5997)
* [PD] Allow customizing reserved tokens to avoid KV cache waste (sgl-project#6002)
* Update dev container config to support live code sync and improve docker setup guide (sgl-project#6018) Signed-off-by: Lifu Huang <lifu.hlf@gmail.com>
* [PD] Optimize disaggregation ib device help info (sgl-project#5781)
* [Test] Add flashmla attention backend test (sgl-project#5587)
* Fix "Avoid computing lse in Ragged Prefill when there's no prefix match" (sgl-project#5555)
* feat: Add a unified merge_state API (sgl-project#5428)
* feat: append more comprehensive fields in messages instead of merely role and content (sgl-project#5996)
* [Security][Bug] Prevent binding to all TCP interfaces (sgl-project#5752)
* Fix prefill OOM error in the case of large page size (sgl-project#5081)
* Fix problem of large page size with chunked prefill (sgl-project#6046)
* docs: add Google Cloud Vertex AI in Adoption and Sponsorship (sgl-project#6047)
* docs: add new blog (sgl-project#6048)
* Fix not "import os" (sgl-project#6057)
* Better PD initialization (sgl-project#5751)
* fix: deepep dockerfile, use pip install deepep. (sgl-project#5885)
* [Fix] Fix and rename flashmla CI test (sgl-project#6045)
* chore: upgrade cutlass 3.9.2 (sgl-project#6004) Co-authored-by: yizhang2077 <1109276519@qq.com>
* Fix sgl-kernel build on aarch64 platforms (sgl-project#6062)
* Add DeepEP to CI PR Test (sgl-project#5655) Co-authored-by: Jinyan Chen <jinyanc@nvidia.com>
* fix custom_allreduce namespace (sgl-project#6039)
* feat: add release workflow for SGLang kernels on aarch64 (sgl-project#6010) Co-authored-by: Qiaolin-Yu <liin1211@outlook.com> Co-authored-by: Yineng Zhang <me@zhyncs.com>
* [Feature] Support for Ascend NPU backend (sgl-project#3853) Signed-off-by: Song Zhang <gepin.zs@antgroup.com> Co-authored-by: 22dimensions <waitingwind@foxmail.com>
* Fix the timeout for 8 gpu tests (sgl-project#6084)
* Hint users DeepEP normal mode is incompatible with CUDA Graph (sgl-project#5014)
* Super tiny fix doc (sgl-project#5233)
* [Doc]Fix description for dp_size argument (sgl-project#6063)
* feat(engine): add bootstrap parameters to generate methods (dynamo) (sgl-project#6075)
* [refactor] slightly tidy fp8 module (sgl-project#5993)
* Clean up fa3 test from 8 gpus (sgl-project#6105)
* Deferring 8 GPU test (sgl-project#6102)
* Update doc for MLA attention backends (sgl-project#6034)
* Clean logs for DeepSeek-V3 launching (sgl-project#6079)
* [CI]Add performance CI for VLM (sgl-project#6038) Signed-off-by: Xinyuan Tong <justinning0323@outlook.com>
* adding Triton configs for DeepSeekV3 FusedMoE kernel on Blackwell (sgl-project#6111)
* optimize pad operations in fa3 to accelarate 100+us (sgl-project#6077)
* Overlap shared expert and routed expert computations (sgl-project#5121)
* Tiny refactor ModelConfig.from_server_args (sgl-project#5219)
* Tiny refactor weight loading logic (sgl-project#5232)
* [PD] Add control to slow down a server (sgl-project#5572)
* Change AMD test threshold (sgl-project#6091)
* DeepEP normal support deepgemm-contiguous (sgl-project#5626) Co-authored-by: Yingyi Huang <yingyihuang2000@outlook.com> Co-authored-by: Cheng Wan <54331508+ch-wan@users.noreply.github.com> Co-authored-by: Xuting Zhou <xutingz@nvidia.com> Co-authored-by: ZhengHSI <zhenghsi@qq.com>
* [fix] fix pyproject.toml dependencies (sgl-project#6119)
* [Feature] Add FlashAttention3 as a backend for VisionAttention (sgl-project#5764) Co-authored-by: othame <chenzhu_912@zju.edu.cn> Co-authored-by: Mick <mickjagger19@icloud.com> Co-authored-by: Yi Zhang <1109276519@qq.com>
* [perf] dsv3 bmm fallback to bf16 (sgl-project#5662)
* [AMD] switch to custom allreduce regardless of MSCCL setting on ROCm (sgl-project#6097)
* [sgl-kernel] fix: fix cu118 compile error (sgl-project#6123) Co-authored-by: zhyncs <me@zhyncs.com>
* upgrade xgrammar to 0.1.19 (sgl-project#6129)
* Remove unecessary is_fa3_supported check (sgl-project#6112)
* chore: bump sgl-kernel 0.1.2 (sgl-project#6131)
* docs: update README (sgl-project#6132)
* [Fix] Incorrect Memory Allocation on CUDA:0 by Non-Zero CUDA Processes in TP/DP (sgl-project#5745)
* Cutlass MLA: Disable split kv due to NVIDIA/cutlass#2274 (sgl-project#6101)
* opt flashinfer mla cat (sgl-project#5822) Co-authored-by: xuyongfei.xyf <xuyongfei.xyf@antgroup.com>
* Update amd nightly concurrency. (sgl-project#6141)
* feat: add thinking_budget (sgl-project#6089)
* [Bugfix] Fix Llama4 gibberish output with long context and CUDA graph (sgl-project#6162)
* fix bug that gpu0 occupies more memory when hicache is turned on (sgl-project#5778) Co-authored-by: Zhiqiang Xie <xiezhq@stanford.edu>
* chore: bump v0.4.6.post3 (sgl-project#6165)
* KV‑Cache (MHA, MLA): add missing start_layer / end_layer fields to MHATokenToKVPoolHost and MLATokenToKVPoolHost (sgl-project#6016) Co-authored-by: 继优 <jiyou.ljy@alibaba-inc.com> Co-authored-by: chus-chus <chus-chus@users.noreply.github.com> Co-authored-by: Zhiqiang Xie <xiezhq@stanford.edu>
* [fix] fix determine_n_share_experts_fusion (sgl-project#6118)
* Fix and Clean up chat-template requirement for VLM (sgl-project#6114) Signed-off-by: Xinyuan Tong <justinning0323@outlook.com>
* [Docs]Delete duplicate content (sgl-project#6146) Co-authored-by: ximing.wxm <ximing.wxm@antgroup.com>
* Revert "feat: add thinking_budget (sgl-project#6089)" (sgl-project#6181)
* Added async_encode method to Engine (sgl-project#4701)
* Fix data parallel perf regression (sgl-project#6183)
* Fix request abortion (sgl-project#6184)
* Add typo checker in pre-commit (sgl-project#6179) Co-authored-by: Brayden Zhong <b8zhong@uwaterloo.ca>
* Remove duplicate IO Struct test (sgl-project#6180) Signed-off-by: Emmanuel Ferdman <emmanuelferdman@gmail.com>
* [PD] Add simple unit test for disaggregation feature (sgl-project#5654) Signed-off-by: Shangming Cai <caishangming@linux.alibaba.com>
* [CI] Disabled deepep tests temporarily because it takes too much time. (sgl-project#6186)
* feat: support loogle eval (sgl-project#6190)
* [fix] remove mixtral from is_fa3_default_architecture (sgl-project#6191)
* fix: handle None multimodal_inputs during merging and filtering batches in disaggregation decode mode (sgl-project#6169)
* chore: upgrade deepgemm (sgl-project#6073)
* chore: bump sgl-kernel v0.1.2.post1 (sgl-project#6195)
* chore: upgrade sgl-kernel v0.1.2.post1 (sgl-project#6196) Co-authored-by: alcanderian <alcanderian@gmail.com>
* Handle empty input string for embedding models (sgl-project#5621) Co-authored-by: Ravi Theja Desetty <ravitheja@Ravis-MacBook-Pro.local>
* doc: fix the erroneous documents and example codes about Alibaba-NLP/gme-Qwen2-VL-2B-Instruct (sgl-project#6199)
* [Docs] minor Qwen3 and reasoning parser docs fix (sgl-project#6032)
* Improve structured outputs: fix race condition, server crash, metrics and style (sgl-project#6188)
* [CI] Reorganize the 8 gpu tests (sgl-project#6192)
* Add dev-deepep docker image (sgl-project#6198)
* Replace time.time() to time.perf_counter() for benchmarking. (sgl-project#6178) Signed-off-by: Lifu Huang <lifu.hlf@gmail.com>
* Update README.md (sgl-project#6202)
* Fix release-docs.yml to not use python 3.9 (sgl-project#6204)
* Fix start_profile does not support with_stack and record_shapes (sgl-project#6043)
* [doc] add a note for --n-share-experts-fusion args (sgl-project#6154)
* Performing Vocabulary Parallelism for LM Head across Attention TP Groups (sgl-project#5558) Co-authored-by: liusy58 <liusy58@linux.alibaba.com>
* Update AMD CI docker to v0.4.6.post3-rocm630. (sgl-project#6213)
* Log if cuda graph is used & extend cuda graph capture to cuda-graph-max-bs (sgl-project#6201) Co-authored-by: SangBin Cho <rkooo567@gmail.com>
* [CI] Fix PD mooncake dependency error (sgl-project#6212) Signed-off-by: Shangming Cai <caishangming@linux.alibaba.com>
* [CI] Re-enable pd disaggregation test (sgl-project#6231) Signed-off-by: Shangming Cai <caishangming@linux.alibaba.com>
* fix some typos (sgl-project#6209) Co-authored-by: Brayden Zhong <b8zhong@uwaterloo.ca>
* [Docs] Add docs for `SGLANG_` and `SGL_` environment variables (sgl-project#6206)
* [PP] Fix init_memory_pool desync & add PP for mixtral (sgl-project#6223)
* Revert "fix some typos" (sgl-project#6244)
* chore: add hf_xet dep (sgl-project#6243)
* Update AMD nightly deps. (sgl-project#6241)
* [PD] Add support for different TP sizes per DP rank (sgl-project#5922) Signed-off-by: Shangming Cai <caishangming@linux.alibaba.com>
* Support incremental streaming of logprob/token_ids between scheduler and detokenizer (sgl-project#6225) Co-authored-by: SangBin Cho <rkooo567@gmail.com>
* fix typo (sgl-project#6248)
* Support tuning moe for llama 4 model (sgl-project#6042)
* Skip the flaky test_stateful_custom_logit_processor (sgl-project#6251)
* [Llama4] Add docs note about enable multimodal (sgl-project#6235)
* [VERL Use Case] Add torch_memory_saver into deps (sgl-project#6247)
* Fix two issues related to `--moe-dense-tp-size=1` (sgl-project#5657) Co-authored-by: liusy58 <liusy58@linux.alibaba.com> Co-authored-by: 颉沆 <xiehang.lsy@alibaba-inc.com>
* model(vlm): pixtral (sgl-project#5084)
* [misc] deep_gemm fallback to NVRTC when NVCC not found (sgl-project#6252)
* Enable MI325X AMD CI. (sgl-project#6259)
* chore: bump v0.4.6.post4 (sgl-project#6245)
* formatting fix for the rebased commit for 4.6.0_post4 Signed-off-by: Mohit Sinha <msinha@habana.ai>
* fix issues in model runner and python packages; fixes for the following issues: > vLLM dependency for xgrammar==0.1.17 > 'Scheduler' object has no attribute 'device > 'pp_proxy_tensors' unexpected arg in HPUGraphRunner > TODO: Add pipeline parallelism support in HPUGraphRunner Signed-off-by: Mohit Sinha <msinha@habana.ai>
* fix formatting in model runner Signed-off-by: Mohit Sinha <msinha@habana.ai>
* base grammar fix for the is_terminated case > 'OutlinesGrammar' object has no attribute 'is_terminated' Signed-off-by: Mohit Sinha <msinha@habana.ai>
---------
Signed-off-by: Kebe <mail@kebe7jun.com>
Signed-off-by: congcongke <zhanweidu@163.com>
Signed-off-by: JiangJiaWei1103 <waynechuang97@gmail.com>
Signed-off-by: Lifu Huang <lifu.hlf@gmail.com>
Signed-off-by: Song Zhang <gepin.zs@antgroup.com>
Signed-off-by: Xinyuan Tong <justinning0323@outlook.com>
Signed-off-by: Emmanuel Ferdman <emmanuelferdman@gmail.com>
Signed-off-by: Shangming Cai <caishangming@linux.alibaba.com>
Signed-off-by: Mohit Sinha <msinha@habana.ai>
Co-authored-by: Wenxuan Tan <wtan45@wisc.edu>
Co-authored-by: JieXin Liang <Alcanderian@users.noreply.github.com>
Co-authored-by: Yuhong Guo <yuhong.gyh@antgroup.com>
Co-authored-by: saltyfish66 <38240284+saltyfish66@users.noreply.github.com>
Co-authored-by: vzed <207368749+vincentzed@users.noreply.github.com>
Co-authored-by: Mick <mickjagger19@icloud.com>
Co-authored-by: Ke Bao <ISPObaoke@163.com>
Co-authored-by: saienduri <saimanas.enduri@amd.com>
Co-authored-by: DavidBao <121073073+DavidBao03@users.noreply.github.com>
Co-authored-by: Frankey_8080 <32973306+Frank-Jie@users.noreply.github.com>
Co-authored-by: Stefan He <hebiaobuaa@gmail.com>
Co-authored-by: yan97ao <580776+yan97ao@users.noreply.github.com>
Co-authored-by: aoshen524 <aoshen524@gmail.com>
Co-authored-by: Michał Moskal
<michal@moskal.me> Co-authored-by: lambert0312 <lambert80.ios@gmail.com> Co-authored-by: Kebe <mail@kebe7jun.com> Co-authored-by: zhanweidu <zhanweidu@163.com> Co-authored-by: Lianmin Zheng <lianminzheng@gmail.com> Co-authored-by: Baizhou Zhang <sobereddiezhang@gmail.com> Co-authored-by: Liangsheng Yin <hnyls2002@gmail.com> Co-authored-by: Huapeng Zhou <73010314+PopSoda2002@users.noreply.github.com> Co-authored-by: NoahM <88418672+zhudianGG@users.noreply.github.com> Co-authored-by: Simo Lin <linsimo.mark@gmail.com> Co-authored-by: Trevor Morris <tmorris@nvidia.com> Co-authored-by: Yineng Zhang <me@zhyncs.com> Co-authored-by: Xiaoyu Zhang <35585791+BBuf@users.noreply.github.com> Co-authored-by: fzyzcjy <5236035+fzyzcjy@users.noreply.github.com> Co-authored-by: Michael Yao <haifeng.yao@daocloud.io> Co-authored-by: mlmz <54172054+minleminzui@users.noreply.github.com> Co-authored-by: shuaills <shishuaiuoe@gmail.com> Co-authored-by: Chayenne <zhaochen20@outlook.com> Co-authored-by: XinyuanTong <115166877+JustinTong0323@users.noreply.github.com> Co-authored-by: yhyang201 <47235274+yhyang201@users.noreply.github.com> Co-authored-by: ybyang <10629930+whybeyoung@users.noreply.github.com> Co-authored-by: JiLi <leege233@gmail.com> Co-authored-by: HAI <hixiao@gmail.com> Co-authored-by: PGFLMG <1106310035@qq.com> Co-authored-by: sighingnow <sighingnow@gmail.com> Co-authored-by: XTY <xutianyi1999@live.com> Co-authored-by: Yi Zhang <1109276519@qq.com> Co-authored-by: Chang Su <chang.s.su@oracle.com> Co-authored-by: woodx <124784234+woodx9@users.noreply.github.com> Co-authored-by: Qiaolin Yu <qy254@cornell.edu> Co-authored-by: Beichen Ma <mabeichen12@gmail.com> Co-authored-by: pengcuo <pengcbupt@163.com> Co-authored-by: pengcuo <dgpengcuo@gmail.com> Co-authored-by: Adarsh Shirawalmath <114558126+adarshxs@users.noreply.github.com> Co-authored-by: simveit <69345428+simveit@users.noreply.github.com> Co-authored-by: Johnny <johnnync13@gmail.com> Co-authored-by: alcanerian 
<alcanerian@gmail.com> Co-authored-by: Yuhao Chen <yxckeis8@gmail.com> Co-authored-by: zhjunqin <zhjunqin@users.noreply.github.com> Co-authored-by: liwenju0 <like4hub@gmail.com> Co-authored-by: wenju.li <wenju.li@deepctr.cn> Co-authored-by: laixin <xielx@shanghaitech.edu.cn> Co-authored-by: sleepcoo <sleepcoo@gmail.com> Co-authored-by: Ying Sheng <sqy1415@gmail.com> Co-authored-by: ryang <38470282+ryang-max@users.noreply.github.com> Co-authored-by: Yuan Luo <yuan.luo@hotmail.com> Co-authored-by: luoyuan.luo <luoyuan.luo@antgroup.com> Co-authored-by: 江家瑋 <36886416+JiangJiaWei1103@users.noreply.github.com> Co-authored-by: KCFindstr <shimakaze@google.com> Co-authored-by: xm:D <38322020+xiaomin-D@users.noreply.github.com> Co-authored-by: Lifu Huang <lifu.hlf@gmail.com> Co-authored-by: Yongtong Wu <914554688@qq.com> Co-authored-by: Junrong Lin <33685709+ocss884@users.noreply.github.com> Co-authored-by: shangmingc <caishangming@linux.alibaba.com> Co-authored-by: DefTruth <31974251+DefTruth@users.noreply.github.com> Co-authored-by: Zhiqiang Xie <xiezhq@stanford.edu> Co-authored-by: Hank Han <54751605+HanHan009527@users.noreply.github.com> Co-authored-by: Qiaolin Yu <liin1211@outlook.com> Co-authored-by: Jinyan Chen <93358689+liz-badada@users.noreply.github.com> Co-authored-by: Jinyan Chen <jinyanc@nvidia.com> Co-authored-by: Johnny <johnnynuca14@gmail.com> Co-authored-by: Song Zhang <70674731+botieking98@users.noreply.github.com> Co-authored-by: 22dimensions <waitingwind@foxmail.com> Co-authored-by: ishandhanani <82981111+ishandhanani@users.noreply.github.com> Co-authored-by: Cheng Wan <54331508+ch-wan@users.noreply.github.com> Co-authored-by: Minglei Zhu <mingleizhu1122@gmail.com> Co-authored-by: lukec <118525388+sleepcoo@users.noreply.github.com> Co-authored-by: Yingyi Huang <yingyihuang2000@outlook.com> Co-authored-by: Xuting Zhou <xutingz@nvidia.com> Co-authored-by: ZhengHSI <zhenghsi@qq.com> Co-authored-by: Zhu Chen <51010608+Othame@users.noreply.github.com> 
Co-authored-by: othame <chenzhu_912@zju.edu.cn> Co-authored-by: Hubert Lu <55214931+hubertlu-tw@users.noreply.github.com> Co-authored-by: Yixin Dong <ubospica@gmail.com> Co-authored-by: xu-yfei <xu_yfei@qq.com> Co-authored-by: xuyongfei.xyf <xuyongfei.xyf@antgroup.com> Co-authored-by: thyecust <tienhoayu@gmail.com> Co-authored-by: huangtingwei <141888744+huangtingwei9988@users.noreply.github.com> Co-authored-by: Simon (Jiyou) Li <Simon-Li@users.noreply.github.com> Co-authored-by: 继优 <jiyou.ljy@alibaba-inc.com> Co-authored-by: chus-chus <chus-chus@users.noreply.github.com> Co-authored-by: Ximingwang-09 <72070413+Ximingwang-09@users.noreply.github.com> Co-authored-by: ximing.wxm <ximing.wxm@antgroup.com> Co-authored-by: Steven Shimizu <shimizust@gmail.com> Co-authored-by: applesaucethebun <113181361+applesaucethebun@users.noreply.github.com> Co-authored-by: Brayden Zhong <b8zhong@uwaterloo.ca> Co-authored-by: Emmanuel Ferdman <emmanuelferdman@gmail.com> Co-authored-by: Yusong Gao <yusong.gao@gmail.com> Co-authored-by: alcanderian <alcanderian@gmail.com> Co-authored-by: Ravi Theja <ravi03071991@gmail.com> Co-authored-by: Ravi Theja Desetty <ravitheja@Ravis-MacBook-Pro.local> Co-authored-by: liusy58 <liusy58@linux.alibaba.com> Co-authored-by: SangBin Cho <rkooo567@gmail.com> Co-authored-by: 颉沆 <xiehang.lsy@alibaba-inc.com> Co-authored-by: Kiv Chen <34561254+KivenChen@users.noreply.github.com>
Co-authored-by: Mick <mickjagger19@icloud.com>
Co-authored-by: Chayenne <zhaochen20@outlook.com>
Motivation
Support InternVL3
Modifications
Based on PRs #3351 and #4433.
Support both InternLM2ForCausalLM and Qwen2ForCausalLM as the language model.
Support InternVL2.5 and InternVL3.
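The dispatch implied by the list above — InternVL2.5/InternVL3 checkpoints may carry either an InternLM2 or a Qwen2 backbone, yet both are served with the same `internvl-2-5` chat template (the name used in the test command) — can be sketched as follows. `select_chat_template` and `SUPPORTED_LM_ARCHS` are illustrative names for this sketch, not part of the sglang API:

```python
# Hypothetical sketch: map the language-model architecture reported in an
# InternVL checkpoint config to a conversation-template name. Both supported
# backbones share the single "internvl-2-5" template registered by this PR.

SUPPORTED_LM_ARCHS = {"InternLM2ForCausalLM", "Qwen2ForCausalLM"}

def select_chat_template(lm_arch: str) -> str:
    """Return the conversation-template name for a supported LM backbone."""
    if lm_arch not in SUPPORTED_LM_ARCHS:
        raise ValueError(f"unsupported language model: {lm_arch}")
    # InternVL2.5 and InternVL3 use the same template regardless of backbone.
    return "internvl-2-5"

for arch in ("InternLM2ForCausalLM", "Qwen2ForCausalLM"):
    print(arch, "->", select_chat_template(arch))
```

Failing fast on an unknown backbone keeps a misconfigured checkpoint from silently falling through to a default template that would garble the image-token layout.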
Checklist