
Conversation

@yuan-luo
Collaborator

Motivation

This PR adds pipeline parallelism (PP) support for the Qwen2.5-VL model.

[root  /root] 二 11月 11 20:23:51 
$python3 -m sglang.launch_server --model /home/admin/Qwen2.5-VL-7B-Instruct --tp 2 --pp-size=2
INFO 11-11 20:29:05 [__init__.py:216] Automatically detected platform cuda.
[2025-11-11 20:29:05] WARNING server_args.py:1183: Attention backend not explicitly specified. Use flashinfer backend by default.
[2025-11-11 20:29:05] WARNING server_args.py:1454: Pipeline parallelism is incompatible with overlap schedule.
[2025-11-11 20:29:05] INFO trace.py:52: opentelemetry package is not installed, tracing disabled
[2025-11-11 20:29:06] server_args=ServerArgs(model_path='/home/admin/Qwen2.5-VL-7B-Instruct', tokenizer_path='/home/admin/Qwen2.5-VL-7B-Instruct', tokenizer_mode='auto', tokenizer_worker_num=1, skip_tokenizer_init=False, load_format='auto', model_loader_extra_config='{}', trust_remote_code=False, context_length=None, is_embedding=False, enable_multimodal=None, revision=None, model_impl='auto', host='127.0.0.1', port=30000, grpc_mode=False, skip_server_warmup=False, warmups=None, nccl_port=None, checkpoint_engine_wait_weights_before_ready=False, dtype='auto', quantization=None, quantization_param_path=None, kv_cache_dtype='auto', enable_fp32_lm_head=False, modelopt_quant=None, modelopt_checkpoint_restore_path=None, modelopt_checkpoint_save_path=None, modelopt_export_path=None, quantize_and_serve=False, mem_fraction_static=0.7701471874999999, max_running_requests=None, max_queued_requests=None, max_total_tokens=None, chunked_prefill_size=8192, max_prefill_tokens=16384, schedule_policy='fcfs', enable_priority_scheduling=False, abort_on_priority_when_disabled=False, schedule_low_priority_values_first=False, priority_scheduling_preemption_threshold=10, schedule_conservativeness=1.0, page_size=1, hybrid_kvcache_ratio=None, swa_full_tokens_ratio=0.8, disable_hybrid_swa_memory=False, radix_eviction_policy='lru', device='cuda', tp_size=2, pp_size=2, pp_max_micro_batch_size=None, stream_interval=1, stream_output=False, random_seed=132408838, constrained_json_whitespace_pattern=None, constrained_json_disable_any_whitespace=False, watchdog_timeout=300, dist_timeout=None, download_dir=None, base_gpu_id=0, gpu_id_step=1, sleep_on_idle=False, log_level='info', log_level_http=None, log_requests=False, log_requests_level=2, crash_dump_folder=None, show_time_cost=False, enable_metrics=False, enable_metrics_for_all_schedulers=False, tokenizer_metrics_custom_labels_header='x-custom-labels', tokenizer_metrics_allowed_custom_labels=None, bucket_time_to_first_token=None, bucket_inter_token_latency=None, bucket_e2e_request_latency=None, collect_tokens_histogram=False, prompt_tokens_buckets=None, generation_tokens_buckets=None, gc_warning_threshold_secs=0.0, decode_log_interval=40, enable_request_time_stats_logging=False, kv_events_config=None, enable_trace=False, otlp_traces_endpoint='localhost:4317', api_key=None, served_model_name='/home/admin/Qwen2.5-VL-7B-Instruct', weight_version='default', chat_template=None, completion_template=None, file_storage_path='sglang_storage', enable_cache_report=False, reasoning_parser=None, tool_call_parser=None, tool_server=None, sampling_defaults='model', dp_size=1, load_balance_method='round_robin', load_watch_interval=0.1, prefill_round_robin_balance=False, dist_init_addr=None, nnodes=1, node_rank=0, json_model_override_args='{}', preferred_sampling_params=None, enable_lora=None, max_lora_rank=None, lora_target_modules=None, lora_paths=None, max_loaded_loras=None, max_loras_per_batch=8, lora_eviction_policy='lru', lora_backend='csgmv', max_lora_chunk_size=16, attention_backend='flashinfer', decode_attention_backend=None, prefill_attention_backend=None, sampling_backend='flashinfer', grammar_backend='xgrammar', mm_attention_backend=None, nsa_prefill_backend='flashmla_sparse', nsa_decode_backend='fa3', speculative_algorithm=None, speculative_draft_model_path=None, speculative_draft_model_revision=None, speculative_draft_load_format=None, speculative_num_steps=None, speculative_eagle_topk=None, speculative_num_draft_tokens=None, speculative_accept_threshold_single=1.0, 
speculative_accept_threshold_acc=1.0, speculative_token_map=None, speculative_attention_mode='prefill', speculative_moe_runner_backend=None, speculative_ngram_min_match_window_size=1, speculative_ngram_max_match_window_size=12, speculative_ngram_min_bfs_breadth=1, speculative_ngram_max_bfs_breadth=10, speculative_ngram_match_type='BFS', speculative_ngram_branch_length=18, speculative_ngram_capacity=10000000, ep_size=1, moe_a2a_backend='none', moe_runner_backend='auto', flashinfer_mxfp4_moe_precision='default', enable_flashinfer_allreduce_fusion=False, deepep_mode='auto', ep_num_redundant_experts=0, ep_dispatch_algorithm='static', init_expert_location='trivial', enable_eplb=False, eplb_algorithm='auto', eplb_rebalance_num_iterations=1000, eplb_rebalance_layers_per_chunk=None, eplb_min_rebalancing_utilization_threshold=1.0, expert_distribution_recorder_mode=None, expert_distribution_recorder_buffer_size=1000, enable_expert_distribution_metrics=False, deepep_config=None, moe_dense_tp_size=None, elastic_ep_backend=None, mooncake_ib_device=None, max_mamba_cache_size=None, mamba_ssm_dtype='float32', mamba_full_memory_ratio=0.9, enable_hierarchical_cache=False, hicache_ratio=2.0, hicache_size=0, hicache_write_policy='write_through', hicache_io_backend='kernel', hicache_mem_layout='layer_first', hicache_storage_backend=None, hicache_storage_prefetch_policy='best_effort', hicache_storage_backend_extra_config=None, enable_lmcache=False, kt_weight_path=None, kt_method='AMXINT4', kt_cpuinfer=None, kt_threadpool_count=2, kt_num_gpu_experts=None, kt_max_deferred_experts_per_token=None, enable_double_sparsity=False, ds_channel_config_path=None, ds_heavy_channel_num=32, ds_heavy_token_num=256, ds_heavy_channel_type='qk', ds_sparse_decode_threshold=4096, cpu_offload_gb=0, offload_group_size=-1, offload_num_in_group=1, offload_prefetch_step=1, offload_mode='cpu', multi_item_scoring_delimiter=None, disable_radix_cache=False, cuda_graph_max_bs=256, cuda_graph_bs=[1, 2, 4, 8, 12, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256], disable_cuda_graph=False, disable_cuda_graph_padding=False, enable_profile_cuda_graph=False, enable_cudagraph_gc=False, enable_nccl_nvls=False, enable_symm_mem=False, disable_flashinfer_cutlass_moe_fp4_allgather=False, enable_tokenizer_batch_encode=False, disable_tokenizer_batch_decode=False, disable_outlines_disk_cache=False, disable_custom_all_reduce=False, enable_mscclpp=False, enable_torch_symm_mem=False, disable_overlap_schedule=True, enable_mixed_chunk=False, enable_dp_attention=False, enable_dp_lm_head=False, enable_two_batch_overlap=False, enable_single_batch_overlap=False, tbo_token_distribution_threshold=0.48, enable_torch_compile=False, enable_piecewise_cuda_graph=False, torch_compile_max_bs=32, piecewise_cuda_graph_max_tokens=4096, piecewise_cuda_graph_tokens=[4, 8, 12, 16, 20, 24, 28, 32, 48, 64, 80, 96, 112, 128, 144, 160, 176, 192, 208, 224, 240, 256, 288, 320, 352, 384, 416, 448, 480, 512, 640, 768, 896, 1024, 1152, 1280, 1408, 1536, 1664, 1792, 1920, 2048, 2176, 2304, 2432, 2560, 2688, 2816, 2944, 3072, 3200, 3328, 3456, 3584, 3712, 3840, 3968, 4096], piecewise_cuda_graph_compiler='eager', torchao_config='', enable_nan_detection=False, enable_p2p_check=False, triton_attention_reduce_in_fp32=False, triton_attention_num_kv_splits=8, triton_attention_split_tile_size=None, num_continuous_decode_steps=1, delete_ckpt_after_loading=False, enable_memory_saver=False, 
enable_weights_cpu_backup=False, allow_auto_truncate=False, enable_custom_logit_processor=False, flashinfer_mla_disable_ragged=False, disable_shared_experts_fusion=False, disable_chunked_prefix_cache=False, disable_fast_image_processor=False, keep_mm_feature_on_device=False, enable_return_hidden_states=False, scheduler_recv_interval=1, numa_node=None, enable_deterministic_inference=False, rl_on_policy_target=None, enable_dynamic_batch_tokenizer=False, dynamic_batch_tokenizer_batch_size=32, dynamic_batch_tokenizer_batch_timeout=0.002, debug_tensor_dump_output_folder=None, debug_tensor_dump_layers=None, debug_tensor_dump_input_file=None, debug_tensor_dump_inject=False, disaggregation_mode='null', disaggregation_transfer_backend='mooncake', disaggregation_bootstrap_port=8998, disaggregation_decode_tp=None, disaggregation_decode_dp=None, disaggregation_prefill_pp=1, disaggregation_ib_device=None, disaggregation_decode_enable_offload_kvcache=False, num_reserved_decode_tokens=512, disaggregation_decode_polling_interval=1, custom_weight_loader=[], weight_loader_disable_mmap=False, remote_instance_weight_loader_seed_instance_ip=None, remote_instance_weight_loader_seed_instance_service_port=None, remote_instance_weight_loader_send_weights_group_ports=None, enable_pdmux=False, pdmux_config_path=None, sm_group_num=8, mm_max_concurrent_calls=32, mm_per_request_timeout=10.0, decrypted_config_file=None, decrypted_draft_config_file=None)
[2025-11-11 20:29:07] Using default HuggingFace chat template with detected content format: openai
INFO 11-11 20:29:14 [__init__.py:216] Automatically detected platform cuda.
INFO 11-11 20:29:14 [__init__.py:216] Automatically detected platform cuda.
INFO 11-11 20:29:14 [__init__.py:216] Automatically detected platform cuda.
INFO 11-11 20:29:14 [__init__.py:216] Automatically detected platform cuda.
INFO 11-11 20:29:15 [__init__.py:216] Automatically detected platform cuda.
[2025-11-11 20:29:15] INFO trace.py:52: opentelemetry package is not installed, tracing disabled
[2025-11-11 20:29:15] INFO trace.py:52: opentelemetry package is not installed, tracing disabled
[2025-11-11 20:29:15] INFO trace.py:52: opentelemetry package is not installed, tracing disabled
[2025-11-11 20:29:15] INFO trace.py:52: opentelemetry package is not installed, tracing disabled
[2025-11-11 20:29:15] INFO trace.py:52: opentelemetry package is not installed, tracing disabled
[2025-11-11 20:29:15 PP0 TP0] Init torch distributed begin.
[2025-11-11 20:29:15 PP0 TP1] Init torch distributed begin.
[2025-11-11 20:29:15 PP1 TP1] Init torch distributed begin.
[2025-11-11 20:29:16 PP1 TP0] Init torch distributed begin.
[Gloo] Rank 3 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 1 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 2 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 0 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
[Gloo] Rank 1 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
[2025-11-11 20:29:17 PP1 TP0] sglang is using nccl==2.27.3
[Gloo] Rank 0 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 1 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
[Gloo] Rank 0 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
[2025-11-11 20:29:17 PP0 TP0] sglang is using nccl==2.27.3
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
[Gloo] Rank 1 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
[Gloo] Rank 0 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
[Gloo] Rank 1 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
[2025-11-11 20:29:20 PP0 TP0] sglang is using nccl==2.27.3
[2025-11-11 20:29:20 PP0 TP1] sglang is using nccl==2.27.3
[Gloo] Rank 0 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
[Gloo] Rank 1 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
[Gloo] Rank 0 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
[Gloo] Rank 1 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
[2025-11-11 20:29:21 PP0 TP0] Init torch distributed ends. mem usage=1.43 GB
[2025-11-11 20:29:21 PP1 TP0] Init torch distributed ends. mem usage=1.43 GB
[2025-11-11 20:29:21 PP0 TP1] Init torch distributed ends. mem usage=1.43 GB
[2025-11-11 20:29:21 PP1 TP1] Init torch distributed ends. mem usage=1.43 GB
[2025-11-11 20:29:21 PP0 TP0] MOE_RUNNER_BACKEND is not initialized, the backend will be automatically selected
[2025-11-11 20:29:21 PP1 TP0] MOE_RUNNER_BACKEND is not initialized, the backend will be automatically selected
[2025-11-11 20:29:22 PP1 TP0] Load weight begin. avail mem=93.25 GB
[2025-11-11 20:29:22 PP0 TP0] Load weight begin. avail mem=93.16 GB
[2025-11-11 20:29:22 PP0 TP1] Load weight begin. avail mem=93.25 GB
[2025-11-11 20:29:22 PP1 TP0] Multimodal attention backend not set. Use fa3.
[2025-11-11 20:29:22 PP1 TP0] Using fa3 as multimodal attention backend.
[2025-11-11 20:29:22 PP0 TP1] Multimodal attention backend not set. Use fa3.
[2025-11-11 20:29:22 PP0 TP1] Using fa3 as multimodal attention backend.
[2025-11-11 20:29:22 PP0 TP0] Multimodal attention backend not set. Use fa3.
[2025-11-11 20:29:22 PP0 TP0] Using fa3 as multimodal attention backend.
[2025-11-11 20:29:22 PP1 TP1] Load weight begin. avail mem=93.25 GB
[2025-11-11 20:29:22 PP1 TP1] Multimodal attention backend not set. Use fa3.
[2025-11-11 20:29:22 PP1 TP1] Using fa3 as multimodal attention backend.
Loading safetensors checkpoint shards:   0% Completed | 0/5 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  40% Completed | 2/5 [00:00<00:00, 10.20it/s]
Loading safetensors checkpoint shards:  80% Completed | 4/5 [00:00<00:00,  4.49it/s]
Loading safetensors checkpoint shards: 100% Completed | 5/5 [00:01<00:00,  3.11it/s]
Loading safetensors checkpoint shards: 100% Completed | 5/5 [00:01<00:00,  3.64it/s]

[2025-11-11 20:29:24 PP1 TP0] Load weight end. type=Qwen2_5_VLForConditionalGeneration, dtype=torch.bfloat16, avail mem=88.92 GB, mem usage=4.34 GB.
[2025-11-11 20:29:24 PP0 TP0] Load weight end. type=Qwen2_5_VLForConditionalGeneration, dtype=torch.bfloat16, avail mem=88.83 GB, mem usage=4.34 GB.
[2025-11-11 20:29:24 PP1 TP1] Load weight end. type=Qwen2_5_VLForConditionalGeneration, dtype=torch.bfloat16, avail mem=88.92 GB, mem usage=4.34 GB.
[2025-11-11 20:29:24 PP1 TP0] Using KV cache dtype: torch.bfloat16
[2025-11-11 20:29:24 PP0 TP1] Load weight end. type=Qwen2_5_VLForConditionalGeneration, dtype=torch.bfloat16, avail mem=88.92 GB, mem usage=4.34 GB.
[2025-11-11 20:29:24 PP0 TP0] Using KV cache dtype: torch.bfloat16
[2025-11-11 20:29:24 PP0 TP0] KV Cache is allocated. #tokens: 2524524, K size: 33.71 GB, V size: 33.71 GB
[2025-11-11 20:29:24 PP0 TP0] Memory pool end. avail mem=19.37 GB
[2025-11-11 20:29:24 PP0 TP1] KV Cache is allocated. #tokens: 2524524, K size: 33.71 GB, V size: 33.71 GB
[2025-11-11 20:29:24 PP0 TP1] Memory pool end. avail mem=19.46 GB
[2025-11-11 20:29:24 PP1 TP1] KV Cache is allocated. #tokens: 2524524, K size: 33.71 GB, V size: 33.71 GB
[2025-11-11 20:29:24 PP1 TP1] Memory pool end. avail mem=19.46 GB
[2025-11-11 20:29:24 PP1 TP0] KV Cache is allocated. #tokens: 2524524, K size: 33.71 GB, V size: 33.71 GB
[2025-11-11 20:29:24 PP1 TP0] Memory pool end. avail mem=19.46 GB
[2025-11-11 20:29:24 PP0 TP0] Capture cuda graph begin. This can take up to several minutes. avail mem=18.80 GB
[2025-11-11 20:29:24 PP0 TP0] Capture cuda graph bs [1, 2, 4, 8, 12, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256]
[2025-11-11 20:29:24 PP0 TP1] Capture cuda graph begin. This can take up to several minutes. avail mem=18.89 GB
[2025-11-11 20:29:24 PP1 TP1] Capture cuda graph begin. This can take up to several minutes. avail mem=18.89 GB
[2025-11-11 20:29:24 PP1 TP0] Capture cuda graph begin. This can take up to several minutes. avail mem=18.89 GB
[2025-11-11 20:29:24 PP1 TP0] Capture cuda graph bs [1, 2, 4, 8, 12, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256]
Capturing batches (bs=1 avail_mem=17.90 GB): 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 36/36 [00:04<00:00,  7.52it/s]
[2025-11-11 20:29:29 PP0 TP0] Registering 1044 cuda graph addresses
Capturing batches (bs=1 avail_mem=17.83 GB): 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 36/36 [00:05<00:00,  7.19it/s]
[2025-11-11 20:29:29 PP1 TP0] Registering 1008 cuda graph addresses
[2025-11-11 20:29:29 PP0 TP1] Capture cuda graph end. Time elapsed: 5.44 s. mem usage=0.90 GB. avail mem=17.99 GB.
[2025-11-11 20:29:29 PP0 TP0] Capture cuda graph end. Time elapsed: 5.47 s. mem usage=0.90 GB. avail mem=17.90 GB.
[2025-11-11 20:29:30 PP1 TP1] Capture cuda graph end. Time elapsed: 5.70 s. mem usage=1.06 GB. avail mem=17.83 GB.
[2025-11-11 20:29:30 PP1 TP0] Capture cuda graph end. Time elapsed: 5.74 s. mem usage=1.06 GB. avail mem=17.83 GB.
[2025-11-11 20:29:30 PP1 TP0] max_total_num_tokens=2524524, chunked_prefill_size=8192, max_prefill_tokens=16384, max_running_requests=4096, context_len=128000, available_gpu_mem=17.83 GB
[2025-11-11 20:29:30 PP0 TP0] max_total_num_tokens=2524524, chunked_prefill_size=8192, max_prefill_tokens=16384, max_running_requests=4096, context_len=128000, available_gpu_mem=17.90 GB
[2025-11-11 20:29:31] INFO:     Started server process [376596]
[2025-11-11 20:29:31] INFO:     Waiting for application startup.
[2025-11-11 20:29:31] Using default chat sampling params from model generation config: {'repetition_penalty': 1.05, 'temperature': 1e-06, 'top_k': 50, 'top_p': 1.0}
[2025-11-11 20:29:31] Using default chat sampling params from model generation config: {'repetition_penalty': 1.05, 'temperature': 1e-06, 'top_k': 50, 'top_p': 1.0}
[2025-11-11 20:29:31] INFO:     Application startup complete.
[2025-11-11 20:29:31] INFO:     Uvicorn running on http://127.0.0.1:30000 (Press CTRL+C to quit)
[2025-11-11 20:29:32] INFO:     127.0.0.1:48306 - "GET /get_model_info HTTP/1.1" 200 OK
[2025-11-11 20:29:32 PP0 TP0] Prefill batch, #new-seq: 1, #new-token: 29, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0, 
/opt/conda/lib/python3.10/site-packages/sglang/srt/distributed/parallel_state.py:998: UserWarning: The given buffer is not writable, and PyTorch does not support non-writable tensors. This means you can write to the underlying (supposedly non-writable) buffer using the tensor. You may want to copy the buffer to protect its data or make it writable before converting it to a tensor. This type of warning will be suppressed for the rest of this program. (Triggered internally at /pytorch/torch/csrc/utils/tensor_new.cpp:1578.)
  object_tensor = torch.frombuffer(pickle.dumps(obj), dtype=torch.uint8)
/opt/conda/lib/python3.10/site-packages/sglang/srt/distributed/parallel_state.py:998: UserWarning: The given buffer is not writable, and PyTorch does not support non-writable tensors. This means you can write to the underlying (supposedly non-writable) buffer using the tensor. You may want to copy the buffer to protect its data or make it writable before converting it to a tensor. This type of warning will be suppressed for the rest of this program. (Triggered internally at /pytorch/torch/csrc/utils/tensor_new.cpp:1578.)
  object_tensor = torch.frombuffer(pickle.dumps(obj), dtype=torch.uint8)
[2025-11-11 20:29:34 PP1 TP0] Prefill batch, #new-seq: 1, #new-token: 29, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0, 
/opt/conda/lib/python3.10/site-packages/sglang/srt/distributed/parallel_state.py:998: UserWarning: The given buffer is not writable, and PyTorch does not support non-writable tensors. This means you can write to the underlying (supposedly non-writable) buffer using the tensor. You may want to copy the buffer to protect its data or make it writable before converting it to a tensor. This type of warning will be suppressed for the rest of this program. (Triggered internally at /pytorch/torch/csrc/utils/tensor_new.cpp:1578.)
  object_tensor = torch.frombuffer(pickle.dumps(obj), dtype=torch.uint8)
/opt/conda/lib/python3.10/site-packages/sglang/srt/distributed/parallel_state.py:998: UserWarning: The given buffer is not writable, and PyTorch does not support non-writable tensors. This means you can write to the underlying (supposedly non-writable) buffer using the tensor. You may want to copy the buffer to protect its data or make it writable before converting it to a tensor. This type of warning will be suppressed for the rest of this program. (Triggered internally at /pytorch/torch/csrc/utils/tensor_new.cpp:1578.)
  object_tensor = torch.frombuffer(pickle.dumps(obj), dtype=torch.uint8)
[2025-11-11 20:29:39] INFO:     127.0.0.1:48320 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-11-11 20:29:39] The server is fired up and ready to roll!
[2025-11-11 20:29:50 PP0 TP0] Prefill batch, #new-seq: 1, #new-token: 955, #cached-token: 15, token usage: 0.00, #running-req: 0, #queue-req: 0, 
[2025-11-11 20:29:51 PP1 TP0] Prefill batch, #new-seq: 1, #new-token: 955, #cached-token: 15, token usage: 0.00, #running-req: 0, #queue-req: 0, 
[2025-11-11 20:29:52 PP0 TP0] Decode batch, #running-req: 1, #token: 1003, token usage: 0.00, cuda graph: True, gen throughput (token/s): 1.93, #queue-req: 0, 
[2025-11-11 20:29:52 PP1 TP0] Decode batch, #running-req: 1, #token: 1003, token usage: 0.00, cuda graph: True, gen throughput (token/s): 1.93, #queue-req: 0, 
[2025-11-11 20:29:52 PP0 TP0] Decode batch, #running-req: 1, #token: 1043, token usage: 0.00, cuda graph: True, gen throughput (token/s): 144.49, #queue-req: 0, 
[2025-11-11 20:29:52 PP1 TP0] Decode batch, #running-req: 1, #token: 1043, token usage: 0.00, cuda graph: True, gen throughput (token/s): 144.69, #queue-req: 0, 
[2025-11-11 20:29:52] INFO:     127.0.0.1:55692 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-11-11 20:30:00 PP0 TP0] Prefill batch, #new-seq: 1, #new-token: 1, #cached-token: 969, token usage: 0.00, #running-req: 0, #queue-req: 0, 
[2025-11-11 20:30:00 PP1 TP0] Prefill batch, #new-seq: 1, #new-token: 1, #cached-token: 969, token usage: 0.00, #running-req: 0, #queue-req: 0, 
[2025-11-11 20:30:00 PP0 TP0] Decode batch, #running-req: 1, #token: 995, token usage: 0.00, cuda graph: True, gen throughput (token/s): 4.74, #queue-req: 0, 
[2025-11-11 20:30:00 PP1 TP0] Decode batch, #running-req: 1, #token: 995, token usage: 0.00, cuda graph: True, gen throughput (token/s): 4.74, #queue-req: 0, 
[2025-11-11 20:30:00 PP0 TP0] Decode batch, #running-req: 1, #token: 1035, token usage: 0.00, cuda graph: True, gen throughput (token/s): 156.34, #queue-req: 0, 
[2025-11-11 20:30:00 PP1 TP0] Decode batch, #running-req: 1, #token: 1035, token usage: 0.00, cuda graph: True, gen throughput (token/s): 156.41, #queue-req: 0, 
[2025-11-11 20:30:01] INFO:     127.0.0.1:39012 - "POST /v1/chat/completions HTTP/1.1" 200 OK
$bash bench_local_video.sh 
{"id":"48e575c6d45a484fa6d0f1a2b0c2be75","object":"chat.completion","created":1762864259,"model":"auto","choices":[{"index":0,"message":{"role":"assistant","content":"视频里的招牌上写着“小鞋匠洗鞋”,并且旁边还有一些中文文字,可能是店铺的联系方式或其他相关信息。招牌的颜色主要是黄色和红色,背景是店铺的外墙。","reasoning_content":null,"tool_calls":null},"logprobs":null,"finish_reason":"stop","matched_stop":151645}],"usage":{"prompt_tokens":20906,"total_tokens":20945,"completion_tokens":39,"prompt_tokens_details":null,"reasoning_tokens":0},"metadata":{"weight_version":"default"}}
real    0m5.617s
user    0m0.001s
sys     0m0.003s
{"id":"b04b117dd06f46e381c6a0b0dff0557b","object":"chat.completion","created":1762864264,"model":"auto","choices":[{"index":0,"message":{"role":"assistant","content":"视频里的招牌上写着“小鞋匠洗鞋”,并且旁边还有一些中文文字,可能是店铺的联系方式或其他相关信息。招牌的颜色主要是黄色和红色,背景是店铺的外墙。","reasoning_content":null,"tool_calls":null},"logprobs":null,"finish_reason":"stop","matched_stop":151645}],"usage":{"prompt_tokens":20906,"total_tokens":20945,"completion_tokens":39,"prompt_tokens_details":null,"reasoning_tokens":0},"metadata":{"weight_version":"default"}}
real    0m5.500s
user    0m0.001s
sys     0m0.003s
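
For reference, a request like the ones logged above can be sent through the server's OpenAI-compatible endpoint. The snippet below is a minimal sketch assuming the launch command shown earlier; the image URL is a hypothetical placeholder, and the contents of bench_local_video.sh are not reproduced here.

```python
# Minimal sketch: query the PP-enabled server via its OpenAI-compatible API.
# The image URL is a hypothetical placeholder; any reachable image works.
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:30000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="/home/admin/Qwen2.5-VL-7B-Instruct",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What does the sign in this image say?"},
            {"type": "image_url", "image_url": {"url": "https://example.com/sign.jpg"}},
        ],
    }],
    max_tokens=64,
)
print(response.choices[0].message.content)
```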

Modifications

Accuracy Tests

Benchmarking and Profiling

Checklist

@gemini-code-assist
Contributor

Summary of Changes

Hello @yuan-luo, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces comprehensive support for Pipeline Parallelism (PP) to the Qwen2.5-VL model. The changes enable efficient distribution of the model's layers across multiple devices, optimizing memory and computational resources for large-scale multimodal inference. It ensures that multimodal inputs are processed correctly at the initial stage, and model weights are loaded judiciously per pipeline stage, leading to improved scalability and performance.

Highlights

  • Pipeline Parallelism (PP) Support: Implemented comprehensive support for Pipeline Parallelism (PP) for the Qwen2.5-VL model, enabling efficient distribution of model layers across multiple devices.
  • Multimodal Input Handling: Modified multimodal input embedding logic to execute exclusively on the first pipeline parallelism rank, preventing redundant computations and optimizing resource usage.
  • Distributed Broadcasting Correction: Adjusted distributed broadcasting for multimodal inputs to correctly utilize the designated entry rank within the pipeline parallelism group, ensuring proper data flow.
  • Layer-specific Weight Loading and LM Head Initialization: Updated the language model head (lm_head) initialization to use a placeholder on non-last PP ranks and the actual head only on the final rank. Additionally, enhanced weight loading to selectively load model weights based on the current pipeline parallelism stage, reducing memory footprint for individual ranks.
  • Forward Pass Adaptation: Adapted the model's forward pass to correctly handle data flow and return values across different pipeline parallelism stages, passing hidden states between intermediate ranks and producing final output only from the last rank (see the sketch after this list).
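
A minimal, single-process sketch of the stage pattern described in these highlights is shown below. It is illustrative only and does not use sglang's actual classes or distributed plumbing; the two-layer body, the nn.Identity placeholder, and the additive feature merge are assumptions made for brevity.

```python
# Hedged illustration of the PP stage pattern described above; not sglang code.
import torch.nn as nn

class PPStageSketch(nn.Module):
    """Single-process stand-in for one pipeline stage of a VL model."""

    def __init__(self, hidden: int, vocab: int, is_first: bool, is_last: bool):
        super().__init__()
        self.is_first, self.is_last = is_first, is_last
        # Token embedding (and multimodal merging) lives only on the first PP rank.
        self.embed = nn.Embedding(vocab, hidden) if is_first else None
        self.layers = nn.ModuleList([nn.Linear(hidden, hidden) for _ in range(2)])
        # Non-last ranks hold a placeholder instead of a real lm_head.
        self.lm_head = nn.Linear(hidden, vocab) if is_last else nn.Identity()

    def forward(self, input_ids=None, hidden_states=None, mm_features=None):
        if self.is_first:
            hidden_states = self.embed(input_ids)
        if self.is_first and mm_features is not None:
            # Merge vision features into text embeddings (first rank only).
            hidden_states = hidden_states + mm_features
        for layer in self.layers:
            hidden_states = layer(hidden_states)
        if self.is_last:
            return self.lm_head(hidden_states)  # final logits, last rank only
        return hidden_states  # intermediate hidden states, passed to the next rank
```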

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request adds support for pipeline parallelism (PP) to the Qwen2.5-VL model. The changes correctly adapt the model's embedding, forward pass, and weight loading logic to work across multiple pipeline stages. For instance, embeddings are now computed only on the first rank, and the language model head is only initialized on the last rank. I've found one potential bug where auxiliary hidden states are computed but not passed to the logits processor, which could affect features like speculative decoding. Overall, the changes are well-structured for enabling pipeline parallelism.
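
The per-stage initialization and loading the review describes can be pictured with a small name filter like the sketch below. The parameter-name prefixes, the start_layer/end_layer bounds, and the helper's signature are illustrative assumptions, not the actual interface used in this PR.

```python
# Hedged sketch of stage-aware weight loading; prefixes and arguments are
# illustrative assumptions, not sglang's real loader interface.
def stage_owns_param(name: str, start_layer: int, end_layer: int,
                     is_first: bool, is_last: bool) -> bool:
    """Return True if the current pipeline stage should load this parameter."""
    if name.startswith("visual.") or name.startswith("model.embed_tokens"):
        return is_first                      # vision tower / embeddings: first stage
    if name.startswith("lm_head") or name.startswith("model.norm"):
        return is_last                       # lm_head and final norm: last stage
    if name.startswith("model.layers."):
        layer_idx = int(name.split(".")[2])  # e.g. "model.layers.17.mlp.down_proj"
        return start_layer <= layer_idx < end_layer
    return True                              # anything shared: load on every stage

# Example: on the first of two stages covering layers [0, 14),
# stage_owns_param("lm_head.weight", 0, 14, True, False) -> False.
```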

Co-authored-by: Yuan Luo <[email protected]>
Co-authored-by: Tianyu Guo <[email protected]>
@yuan-luo yuan-luo force-pushed the support_qwen_2_5_vl_pp branch from db18a33 to 8b16cb0 Compare November 12, 2025 02:39
@yuan-luo
Copy link
Collaborator Author

Added the logic from #12762, which is needed here. Adding @gty111 as a co-author.

@JustinTong0323
Collaborator

nit: for code readability, it's better to avoid nested if statements.
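
As a hypothetical illustration of the nit (not the PR's actual code), the flattened form reads more directly than the nested one:

```python
# Illustrative only; the names here are hypothetical, not the PR's actual code.
def merge_if_first_rank(is_first_rank: bool, mm_features, hidden):
    # Nested form the comment discourages:
    #   if is_first_rank:
    #       if mm_features is not None:
    #           hidden = hidden + mm_features
    # Flattened single condition:
    if is_first_rank and mm_features is not None:
        hidden = hidden + mm_features
    return hidden
```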

@yhyang201 yhyang201 self-requested a review November 12, 2025 12:37
@hnyls2002 hnyls2002 merged commit 706502f into sgl-project:main Nov 12, 2025
121 of 127 checks passed