
Conversation

@yuan-luo
Collaborator

Motivation

This PR adds pipeline parallelism (PP) support for the Qwen2.5-VL model.

[root  /root] 二 11月 11 20:23:51 
$python3 -m sglang.launch_server --model /home/admin/Qwen2.5-VL-7B-Instruct --tp 2 --pp-size=2
INFO 11-11 20:29:05 [__init__.py:216] Automatically detected platform cuda.
[2025-11-11 20:29:05] WARNING server_args.py:1183: Attention backend not explicitly specified. Use flashinfer backend by default.
[2025-11-11 20:29:05] WARNING server_args.py:1454: Pipeline parallelism is incompatible with overlap schedule.
[2025-11-11 20:29:05] INFO trace.py:52: opentelemetry package is not installed, tracing disabled
[2025-11-11 20:29:06] server_args=ServerArgs(model_path='/home/admin/Qwen2.5-VL-7B-Instruct', tokenizer_path='/home/admin/Qwen2.5-VL-7B-Instruct', tokenizer_mode='auto', tokenizer_worker_num=1, skip_tokenizer_init=False, load_format='auto', model_loader_extra_config='{}', trust_remote_code=False, context_length=None, is_embedding=False, enable_multimodal=None, revision=None, model_impl='auto', host='127.0.0.1', port=30000, grpc_mode=False, skip_server_warmup=False, warmups=None, nccl_port=None, checkpoint_engine_wait_weights_before_ready=False, dtype='auto', quantization=None, quantization_param_path=None, kv_cache_dtype='auto', enable_fp32_lm_head=False, modelopt_quant=None, modelopt_checkpoint_restore_path=None, modelopt_checkpoint_save_path=None, modelopt_export_path=None, quantize_and_serve=False, mem_fraction_static=0.7701471874999999, max_running_requests=None, max_queued_requests=None, max_total_tokens=None, chunked_prefill_size=8192, max_prefill_tokens=16384, schedule_policy='fcfs', enable_priority_scheduling=False, abort_on_priority_when_disabled=False, schedule_low_priority_values_first=False, priority_scheduling_preemption_threshold=10, schedule_conservativeness=1.0, page_size=1, hybrid_kvcache_ratio=None, swa_full_tokens_ratio=0.8, disable_hybrid_swa_memory=False, radix_eviction_policy='lru', device='cuda', tp_size=2, pp_size=2, pp_max_micro_batch_size=None, stream_interval=1, stream_output=False, random_seed=132408838, constrained_json_whitespace_pattern=None, constrained_json_disable_any_whitespace=False, watchdog_timeout=300, dist_timeout=None, download_dir=None, base_gpu_id=0, gpu_id_step=1, sleep_on_idle=False, log_level='info', log_level_http=None, log_requests=False, log_requests_level=2, crash_dump_folder=None, show_time_cost=False, enable_metrics=False, enable_metrics_for_all_schedulers=False, tokenizer_metrics_custom_labels_header='x-custom-labels', tokenizer_metrics_allowed_custom_labels=None, bucket_time_to_first_token=None, bucket_inter_token_latency=None, bucket_e2e_request_latency=None, collect_tokens_histogram=False, prompt_tokens_buckets=None, generation_tokens_buckets=None, gc_warning_threshold_secs=0.0, decode_log_interval=40, enable_request_time_stats_logging=False, kv_events_config=None, enable_trace=False, otlp_traces_endpoint='localhost:4317', api_key=None, served_model_name='/home/admin/Qwen2.5-VL-7B-Instruct', weight_version='default', chat_template=None, completion_template=None, file_storage_path='sglang_storage', enable_cache_report=False, reasoning_parser=None, tool_call_parser=None, tool_server=None, sampling_defaults='model', dp_size=1, load_balance_method='round_robin', load_watch_interval=0.1, prefill_round_robin_balance=False, dist_init_addr=None, nnodes=1, node_rank=0, json_model_override_args='{}', preferred_sampling_params=None, enable_lora=None, max_lora_rank=None, lora_target_modules=None, lora_paths=None, max_loaded_loras=None, max_loras_per_batch=8, lora_eviction_policy='lru', lora_backend='csgmv', max_lora_chunk_size=16, attention_backend='flashinfer', decode_attention_backend=None, prefill_attention_backend=None, sampling_backend='flashinfer', grammar_backend='xgrammar', mm_attention_backend=None, nsa_prefill_backend='flashmla_sparse', nsa_decode_backend='fa3', speculative_algorithm=None, speculative_draft_model_path=None, speculative_draft_model_revision=None, speculative_draft_load_format=None, speculative_num_steps=None, speculative_eagle_topk=None, speculative_num_draft_tokens=None, speculative_accept_threshold_single=1.0, 
speculative_accept_threshold_acc=1.0, speculative_token_map=None, speculative_attention_mode='prefill', speculative_moe_runner_backend=None, speculative_ngram_min_match_window_size=1, speculative_ngram_max_match_window_size=12, speculative_ngram_min_bfs_breadth=1, speculative_ngram_max_bfs_breadth=10, speculative_ngram_match_type='BFS', speculative_ngram_branch_length=18, speculative_ngram_capacity=10000000, ep_size=1, moe_a2a_backend='none', moe_runner_backend='auto', flashinfer_mxfp4_moe_precision='default', enable_flashinfer_allreduce_fusion=False, deepep_mode='auto', ep_num_redundant_experts=0, ep_dispatch_algorithm='static', init_expert_location='trivial', enable_eplb=False, eplb_algorithm='auto', eplb_rebalance_num_iterations=1000, eplb_rebalance_layers_per_chunk=None, eplb_min_rebalancing_utilization_threshold=1.0, expert_distribution_recorder_mode=None, expert_distribution_recorder_buffer_size=1000, enable_expert_distribution_metrics=False, deepep_config=None, moe_dense_tp_size=None, elastic_ep_backend=None, mooncake_ib_device=None, max_mamba_cache_size=None, mamba_ssm_dtype='float32', mamba_full_memory_ratio=0.9, enable_hierarchical_cache=False, hicache_ratio=2.0, hicache_size=0, hicache_write_policy='write_through', hicache_io_backend='kernel', hicache_mem_layout='layer_first', hicache_storage_backend=None, hicache_storage_prefetch_policy='best_effort', hicache_storage_backend_extra_config=None, enable_lmcache=False, kt_weight_path=None, kt_method='AMXINT4', kt_cpuinfer=None, kt_threadpool_count=2, kt_num_gpu_experts=None, kt_max_deferred_experts_per_token=None, enable_double_sparsity=False, ds_channel_config_path=None, ds_heavy_channel_num=32, ds_heavy_token_num=256, ds_heavy_channel_type='qk', ds_sparse_decode_threshold=4096, cpu_offload_gb=0, offload_group_size=-1, offload_num_in_group=1, offload_prefetch_step=1, offload_mode='cpu', multi_item_scoring_delimiter=None, disable_radix_cache=False, cuda_graph_max_bs=256, cuda_graph_bs=[1, 2, 4, 8, 12, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256], disable_cuda_graph=False, disable_cuda_graph_padding=False, enable_profile_cuda_graph=False, enable_cudagraph_gc=False, enable_nccl_nvls=False, enable_symm_mem=False, disable_flashinfer_cutlass_moe_fp4_allgather=False, enable_tokenizer_batch_encode=False, disable_tokenizer_batch_decode=False, disable_outlines_disk_cache=False, disable_custom_all_reduce=False, enable_mscclpp=False, enable_torch_symm_mem=False, disable_overlap_schedule=True, enable_mixed_chunk=False, enable_dp_attention=False, enable_dp_lm_head=False, enable_two_batch_overlap=False, enable_single_batch_overlap=False, tbo_token_distribution_threshold=0.48, enable_torch_compile=False, enable_piecewise_cuda_graph=False, torch_compile_max_bs=32, piecewise_cuda_graph_max_tokens=4096, piecewise_cuda_graph_tokens=[4, 8, 12, 16, 20, 24, 28, 32, 48, 64, 80, 96, 112, 128, 144, 160, 176, 192, 208, 224, 240, 256, 288, 320, 352, 384, 416, 448, 480, 512, 640, 768, 896, 1024, 1152, 1280, 1408, 1536, 1664, 1792, 1920, 2048, 2176, 2304, 2432, 2560, 2688, 2816, 2944, 3072, 3200, 3328, 3456, 3584, 3712, 3840, 3968, 4096], piecewise_cuda_graph_compiler='eager', torchao_config='', enable_nan_detection=False, enable_p2p_check=False, triton_attention_reduce_in_fp32=False, triton_attention_num_kv_splits=8, triton_attention_split_tile_size=None, num_continuous_decode_steps=1, delete_ckpt_after_loading=False, enable_memory_saver=False, 
enable_weights_cpu_backup=False, allow_auto_truncate=False, enable_custom_logit_processor=False, flashinfer_mla_disable_ragged=False, disable_shared_experts_fusion=False, disable_chunked_prefix_cache=False, disable_fast_image_processor=False, keep_mm_feature_on_device=False, enable_return_hidden_states=False, scheduler_recv_interval=1, numa_node=None, enable_deterministic_inference=False, rl_on_policy_target=None, enable_dynamic_batch_tokenizer=False, dynamic_batch_tokenizer_batch_size=32, dynamic_batch_tokenizer_batch_timeout=0.002, debug_tensor_dump_output_folder=None, debug_tensor_dump_layers=None, debug_tensor_dump_input_file=None, debug_tensor_dump_inject=False, disaggregation_mode='null', disaggregation_transfer_backend='mooncake', disaggregation_bootstrap_port=8998, disaggregation_decode_tp=None, disaggregation_decode_dp=None, disaggregation_prefill_pp=1, disaggregation_ib_device=None, disaggregation_decode_enable_offload_kvcache=False, num_reserved_decode_tokens=512, disaggregation_decode_polling_interval=1, custom_weight_loader=[], weight_loader_disable_mmap=False, remote_instance_weight_loader_seed_instance_ip=None, remote_instance_weight_loader_seed_instance_service_port=None, remote_instance_weight_loader_send_weights_group_ports=None, enable_pdmux=False, pdmux_config_path=None, sm_group_num=8, mm_max_concurrent_calls=32, mm_per_request_timeout=10.0, decrypted_config_file=None, decrypted_draft_config_file=None)
[2025-11-11 20:29:07] Using default HuggingFace chat template with detected content format: openai
INFO 11-11 20:29:14 [__init__.py:216] Automatically detected platform cuda.
INFO 11-11 20:29:14 [__init__.py:216] Automatically detected platform cuda.
INFO 11-11 20:29:14 [__init__.py:216] Automatically detected platform cuda.
INFO 11-11 20:29:14 [__init__.py:216] Automatically detected platform cuda.
INFO 11-11 20:29:15 [__init__.py:216] Automatically detected platform cuda.
[2025-11-11 20:29:15] INFO trace.py:52: opentelemetry package is not installed, tracing disabled
[2025-11-11 20:29:15] INFO trace.py:52: opentelemetry package is not installed, tracing disabled
[2025-11-11 20:29:15] INFO trace.py:52: opentelemetry package is not installed, tracing disabled
[2025-11-11 20:29:15] INFO trace.py:52: opentelemetry package is not installed, tracing disabled
[2025-11-11 20:29:15] INFO trace.py:52: opentelemetry package is not installed, tracing disabled
[2025-11-11 20:29:15 PP0 TP0] Init torch distributed begin.
[2025-11-11 20:29:15 PP0 TP1] Init torch distributed begin.
[2025-11-11 20:29:15 PP1 TP1] Init torch distributed begin.
[2025-11-11 20:29:16 PP1 TP0] Init torch distributed begin.
[Gloo] Rank 3 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 1 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 2 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 0 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
[Gloo] Rank 1 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
[2025-11-11 20:29:17 PP1 TP0] sglang is using nccl==2.27.3
[Gloo] Rank 0 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 1 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
[Gloo] Rank 0 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
[2025-11-11 20:29:17 PP0 TP0] sglang is using nccl==2.27.3
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
[Gloo] Rank 1 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
[Gloo] Rank 0 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
[Gloo] Rank 1 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
[2025-11-11 20:29:20 PP0 TP0] sglang is using nccl==2.27.3
[2025-11-11 20:29:20 PP0 TP1] sglang is using nccl==2.27.3
[Gloo] Rank 0 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
[Gloo] Rank 1 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
[Gloo] Rank 0 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
[Gloo] Rank 1 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
[2025-11-11 20:29:21 PP0 TP0] Init torch distributed ends. mem usage=1.43 GB
[2025-11-11 20:29:21 PP1 TP0] Init torch distributed ends. mem usage=1.43 GB
[2025-11-11 20:29:21 PP0 TP1] Init torch distributed ends. mem usage=1.43 GB
[2025-11-11 20:29:21 PP1 TP1] Init torch distributed ends. mem usage=1.43 GB
[2025-11-11 20:29:21 PP0 TP0] MOE_RUNNER_BACKEND is not initialized, the backend will be automatically selected
[2025-11-11 20:29:21 PP1 TP0] MOE_RUNNER_BACKEND is not initialized, the backend will be automatically selected
[2025-11-11 20:29:22 PP1 TP0] Load weight begin. avail mem=93.25 GB
[2025-11-11 20:29:22 PP0 TP0] Load weight begin. avail mem=93.16 GB
[2025-11-11 20:29:22 PP0 TP1] Load weight begin. avail mem=93.25 GB
[2025-11-11 20:29:22 PP1 TP0] Multimodal attention backend not set. Use fa3.
[2025-11-11 20:29:22 PP1 TP0] Using fa3 as multimodal attention backend.
[2025-11-11 20:29:22 PP0 TP1] Multimodal attention backend not set. Use fa3.
[2025-11-11 20:29:22 PP0 TP1] Using fa3 as multimodal attention backend.
[2025-11-11 20:29:22 PP0 TP0] Multimodal attention backend not set. Use fa3.
[2025-11-11 20:29:22 PP0 TP0] Using fa3 as multimodal attention backend.
[2025-11-11 20:29:22 PP1 TP1] Load weight begin. avail mem=93.25 GB
[2025-11-11 20:29:22 PP1 TP1] Multimodal attention backend not set. Use fa3.
[2025-11-11 20:29:22 PP1 TP1] Using fa3 as multimodal attention backend.
Loading safetensors checkpoint shards:   0% Completed | 0/5 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  40% Completed | 2/5 [00:00<00:00, 10.20it/s]
Loading safetensors checkpoint shards:  80% Completed | 4/5 [00:00<00:00,  4.49it/s]
Loading safetensors checkpoint shards: 100% Completed | 5/5 [00:01<00:00,  3.11it/s]
Loading safetensors checkpoint shards: 100% Completed | 5/5 [00:01<00:00,  3.64it/s]

[2025-11-11 20:29:24 PP1 TP0] Load weight end. type=Qwen2_5_VLForConditionalGeneration, dtype=torch.bfloat16, avail mem=88.92 GB, mem usage=4.34 GB.
[2025-11-11 20:29:24 PP0 TP0] Load weight end. type=Qwen2_5_VLForConditionalGeneration, dtype=torch.bfloat16, avail mem=88.83 GB, mem usage=4.34 GB.
[2025-11-11 20:29:24 PP1 TP1] Load weight end. type=Qwen2_5_VLForConditionalGeneration, dtype=torch.bfloat16, avail mem=88.92 GB, mem usage=4.34 GB.
[2025-11-11 20:29:24 PP1 TP0] Using KV cache dtype: torch.bfloat16
[2025-11-11 20:29:24 PP0 TP1] Load weight end. type=Qwen2_5_VLForConditionalGeneration, dtype=torch.bfloat16, avail mem=88.92 GB, mem usage=4.34 GB.
[2025-11-11 20:29:24 PP0 TP0] Using KV cache dtype: torch.bfloat16
[2025-11-11 20:29:24 PP0 TP0] KV Cache is allocated. #tokens: 2524524, K size: 33.71 GB, V size: 33.71 GB
[2025-11-11 20:29:24 PP0 TP0] Memory pool end. avail mem=19.37 GB
[2025-11-11 20:29:24 PP0 TP1] KV Cache is allocated. #tokens: 2524524, K size: 33.71 GB, V size: 33.71 GB
[2025-11-11 20:29:24 PP0 TP1] Memory pool end. avail mem=19.46 GB
[2025-11-11 20:29:24 PP1 TP1] KV Cache is allocated. #tokens: 2524524, K size: 33.71 GB, V size: 33.71 GB
[2025-11-11 20:29:24 PP1 TP1] Memory pool end. avail mem=19.46 GB
[2025-11-11 20:29:24 PP1 TP0] KV Cache is allocated. #tokens: 2524524, K size: 33.71 GB, V size: 33.71 GB
[2025-11-11 20:29:24 PP1 TP0] Memory pool end. avail mem=19.46 GB
[2025-11-11 20:29:24 PP0 TP0] Capture cuda graph begin. This can take up to several minutes. avail mem=18.80 GB
[2025-11-11 20:29:24 PP0 TP0] Capture cuda graph bs [1, 2, 4, 8, 12, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256]
[2025-11-11 20:29:24 PP0 TP1] Capture cuda graph begin. This can take up to several minutes. avail mem=18.89 GB
[2025-11-11 20:29:24 PP1 TP1] Capture cuda graph begin. This can take up to several minutes. avail mem=18.89 GB
[2025-11-11 20:29:24 PP1 TP0] Capture cuda graph begin. This can take up to several minutes. avail mem=18.89 GB
[2025-11-11 20:29:24 PP1 TP0] Capture cuda graph bs [1, 2, 4, 8, 12, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256]
Capturing batches (bs=1 avail_mem=17.90 GB): 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 36/36 [00:04<00:00,  7.52it/s]
[2025-11-11 20:29:29 PP0 TP0] Registering 1044 cuda graph addresses
Capturing batches (bs=1 avail_mem=17.83 GB): 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 36/36 [00:05<00:00,  7.19it/s]
[2025-11-11 20:29:29 PP1 TP0] Registering 1008 cuda graph addresses
[2025-11-11 20:29:29 PP0 TP1] Capture cuda graph end. Time elapsed: 5.44 s. mem usage=0.90 GB. avail mem=17.99 GB.
[2025-11-11 20:29:29 PP0 TP0] Capture cuda graph end. Time elapsed: 5.47 s. mem usage=0.90 GB. avail mem=17.90 GB.
[2025-11-11 20:29:30 PP1 TP1] Capture cuda graph end. Time elapsed: 5.70 s. mem usage=1.06 GB. avail mem=17.83 GB.
[2025-11-11 20:29:30 PP1 TP0] Capture cuda graph end. Time elapsed: 5.74 s. mem usage=1.06 GB. avail mem=17.83 GB.
[2025-11-11 20:29:30 PP1 TP0] max_total_num_tokens=2524524, chunked_prefill_size=8192, max_prefill_tokens=16384, max_running_requests=4096, context_len=128000, available_gpu_mem=17.83 GB
[2025-11-11 20:29:30 PP0 TP0] max_total_num_tokens=2524524, chunked_prefill_size=8192, max_prefill_tokens=16384, max_running_requests=4096, context_len=128000, available_gpu_mem=17.90 GB
[2025-11-11 20:29:31] INFO:     Started server process [376596]
[2025-11-11 20:29:31] INFO:     Waiting for application startup.
[2025-11-11 20:29:31] Using default chat sampling params from model generation config: {'repetition_penalty': 1.05, 'temperature': 1e-06, 'top_k': 50, 'top_p': 1.0}
[2025-11-11 20:29:31] Using default chat sampling params from model generation config: {'repetition_penalty': 1.05, 'temperature': 1e-06, 'top_k': 50, 'top_p': 1.0}
[2025-11-11 20:29:31] INFO:     Application startup complete.
[2025-11-11 20:29:31] INFO:     Uvicorn running on http://127.0.0.1:30000 (Press CTRL+C to quit)
[2025-11-11 20:29:32] INFO:     127.0.0.1:48306 - "GET /get_model_info HTTP/1.1" 200 OK
[2025-11-11 20:29:32 PP0 TP0] Prefill batch, #new-seq: 1, #new-token: 29, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0, 
/opt/conda/lib/python3.10/site-packages/sglang/srt/distributed/parallel_state.py:998: UserWarning: The given buffer is not writable, and PyTorch does not support non-writable tensors. This means you can write to the underlying (supposedly non-writable) buffer using the tensor. You may want to copy the buffer to protect its data or make it writable before converting it to a tensor. This type of warning will be suppressed for the rest of this program. (Triggered internally at /pytorch/torch/csrc/utils/tensor_new.cpp:1578.)
  object_tensor = torch.frombuffer(pickle.dumps(obj), dtype=torch.uint8)
/opt/conda/lib/python3.10/site-packages/sglang/srt/distributed/parallel_state.py:998: UserWarning: The given buffer is not writable, and PyTorch does not support non-writable tensors. This means you can write to the underlying (supposedly non-writable) buffer using the tensor. You may want to copy the buffer to protect its data or make it writable before converting it to a tensor. This type of warning will be suppressed for the rest of this program. (Triggered internally at /pytorch/torch/csrc/utils/tensor_new.cpp:1578.)
  object_tensor = torch.frombuffer(pickle.dumps(obj), dtype=torch.uint8)
[2025-11-11 20:29:34 PP1 TP0] Prefill batch, #new-seq: 1, #new-token: 29, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0, 
/opt/conda/lib/python3.10/site-packages/sglang/srt/distributed/parallel_state.py:998: UserWarning: The given buffer is not writable, and PyTorch does not support non-writable tensors. This means you can write to the underlying (supposedly non-writable) buffer using the tensor. You may want to copy the buffer to protect its data or make it writable before converting it to a tensor. This type of warning will be suppressed for the rest of this program. (Triggered internally at /pytorch/torch/csrc/utils/tensor_new.cpp:1578.)
  object_tensor = torch.frombuffer(pickle.dumps(obj), dtype=torch.uint8)
/opt/conda/lib/python3.10/site-packages/sglang/srt/distributed/parallel_state.py:998: UserWarning: The given buffer is not writable, and PyTorch does not support non-writable tensors. This means you can write to the underlying (supposedly non-writable) buffer using the tensor. You may want to copy the buffer to protect its data or make it writable before converting it to a tensor. This type of warning will be suppressed for the rest of this program. (Triggered internally at /pytorch/torch/csrc/utils/tensor_new.cpp:1578.)
  object_tensor = torch.frombuffer(pickle.dumps(obj), dtype=torch.uint8)
[2025-11-11 20:29:39] INFO:     127.0.0.1:48320 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-11-11 20:29:39] The server is fired up and ready to roll!
[2025-11-11 20:29:50 PP0 TP0] Prefill batch, #new-seq: 1, #new-token: 955, #cached-token: 15, token usage: 0.00, #running-req: 0, #queue-req: 0, 
[2025-11-11 20:29:51 PP1 TP0] Prefill batch, #new-seq: 1, #new-token: 955, #cached-token: 15, token usage: 0.00, #running-req: 0, #queue-req: 0, 
[2025-11-11 20:29:52 PP0 TP0] Decode batch, #running-req: 1, #token: 1003, token usage: 0.00, cuda graph: True, gen throughput (token/s): 1.93, #queue-req: 0, 
[2025-11-11 20:29:52 PP1 TP0] Decode batch, #running-req: 1, #token: 1003, token usage: 0.00, cuda graph: True, gen throughput (token/s): 1.93, #queue-req: 0, 
[2025-11-11 20:29:52 PP0 TP0] Decode batch, #running-req: 1, #token: 1043, token usage: 0.00, cuda graph: True, gen throughput (token/s): 144.49, #queue-req: 0, 
[2025-11-11 20:29:52 PP1 TP0] Decode batch, #running-req: 1, #token: 1043, token usage: 0.00, cuda graph: True, gen throughput (token/s): 144.69, #queue-req: 0, 
[2025-11-11 20:29:52] INFO:     127.0.0.1:55692 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-11-11 20:30:00 PP0 TP0] Prefill batch, #new-seq: 1, #new-token: 1, #cached-token: 969, token usage: 0.00, #running-req: 0, #queue-req: 0, 
[2025-11-11 20:30:00 PP1 TP0] Prefill batch, #new-seq: 1, #new-token: 1, #cached-token: 969, token usage: 0.00, #running-req: 0, #queue-req: 0, 
[2025-11-11 20:30:00 PP0 TP0] Decode batch, #running-req: 1, #token: 995, token usage: 0.00, cuda graph: True, gen throughput (token/s): 4.74, #queue-req: 0, 
[2025-11-11 20:30:00 PP1 TP0] Decode batch, #running-req: 1, #token: 995, token usage: 0.00, cuda graph: True, gen throughput (token/s): 4.74, #queue-req: 0, 
[2025-11-11 20:30:00 PP0 TP0] Decode batch, #running-req: 1, #token: 1035, token usage: 0.00, cuda graph: True, gen throughput (token/s): 156.34, #queue-req: 0, 
[2025-11-11 20:30:00 PP1 TP0] Decode batch, #running-req: 1, #token: 1035, token usage: 0.00, cuda graph: True, gen throughput (token/s): 156.41, #queue-req: 0, 
[2025-11-11 20:30:01] INFO:     127.0.0.1:39012 - "POST /v1/chat/completions HTTP/1.1" 200 OK
$bash bench_local_video.sh 
{"id":"48e575c6d45a484fa6d0f1a2b0c2be75","object":"chat.completion","created":1762864259,"model":"auto","choices":[{"index":0,"message":{"role":"assistant","content":"视频里的招牌上写着“小鞋匠洗鞋”,并且旁边还有一些中文文字,可能是店铺的联系方式或其他相关信息。招牌的颜色主要是黄色和红色,背景是店铺的外墙。","reasoning_content":null,"tool_calls":null},"logprobs":null,"finish_reason":"stop","matched_stop":151645}],"usage":{"prompt_tokens":20906,"total_tokens":20945,"completion_tokens":39,"prompt_tokens_details":null,"reasoning_tokens":0},"metadata":{"weight_version":"default"}}
real    0m5.617s
user    0m0.001s
sys     0m0.003s
{"id":"b04b117dd06f46e381c6a0b0dff0557b","object":"chat.completion","created":1762864264,"model":"auto","choices":[{"index":0,"message":{"role":"assistant","content":"视频里的招牌上写着“小鞋匠洗鞋”,并且旁边还有一些中文文字,可能是店铺的联系方式或其他相关信息。招牌的颜色主要是黄色和红色,背景是店铺的外墙。","reasoning_content":null,"tool_calls":null},"logprobs":null,"finish_reason":"stop","matched_stop":151645}],"usage":{"prompt_tokens":20906,"total_tokens":20945,"completion_tokens":39,"prompt_tokens_details":null,"reasoning_tokens":0},"metadata":{"weight_version":"default"}}
real    0m5.500s
user    0m0.001s
sys     0m0.003s
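
For reference, a request like the ones logged above can be sent through the server's OpenAI-compatible endpoint. The snippet below is a minimal sketch assuming the launch command shown earlier; the image URL is a hypothetical placeholder, and the contents of bench_local_video.sh are not reproduced here.

```python
# Minimal sketch: query the PP-enabled server via its OpenAI-compatible API.
# The image URL is a hypothetical placeholder; any reachable image works.
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:30000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="/home/admin/Qwen2.5-VL-7B-Instruct",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What does the sign in this image say?"},
            {"type": "image_url", "image_url": {"url": "https://example.com/sign.jpg"}},
        ],
    }],
    max_tokens=64,
)
print(response.choices[0].message.content)
```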

Modifications

Accuracy Tests

Benchmarking and Profiling

Checklist

@gemini-code-assist
Contributor

Summary of Changes

Hello @yuan-luo, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces comprehensive support for Pipeline Parallelism (PP) to the Qwen2.5-VL model. The changes enable efficient distribution of the model's layers across multiple devices, optimizing memory and computational resources for large-scale multimodal inference. It ensures that multimodal inputs are processed correctly at the initial stage, and model weights are loaded judiciously per pipeline stage, leading to improved scalability and performance.

Highlights

  • Pipeline Parallelism (PP) Support: Implemented comprehensive support for Pipeline Parallelism (PP) for the Qwen2.5-VL model, enabling efficient distribution of model layers across multiple devices.
  • Multimodal Input Handling: Modified multimodal input embedding logic to execute exclusively on the first pipeline parallelism rank, preventing redundant computations and optimizing resource usage.
  • Distributed Broadcasting Correction: Adjusted distributed broadcasting for multimodal inputs to correctly utilize the designated entry rank within the pipeline parallelism group, ensuring proper data flow.
  • Layer-specific Weight Loading and LM Head Initialization: Updated the language model head (lm_head) initialization to use a placeholder on non-last PP ranks and the actual head only on the final rank. Additionally, enhanced weight loading to selectively load model weights based on the current pipeline parallelism stage, reducing memory footprint for individual ranks.
  • Forward Pass Adaptation: Adapted the model's forward pass to correctly handle data flow and return values across different pipeline parallelism stages, passing hidden states between intermediate ranks and producing final output only from the last rank (see the sketch after this list).
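
A minimal, single-process sketch of the stage pattern described in these highlights is shown below. It is illustrative only and does not use sglang's actual classes or distributed plumbing; the two-layer body, the nn.Identity placeholder, and the additive feature merge are assumptions made for brevity.

```python
# Hedged illustration of the PP stage pattern described above; not sglang code.
import torch.nn as nn

class PPStageSketch(nn.Module):
    """Single-process stand-in for one pipeline stage of a VL model."""

    def __init__(self, hidden: int, vocab: int, is_first: bool, is_last: bool):
        super().__init__()
        self.is_first, self.is_last = is_first, is_last
        # Token embedding (and multimodal merging) lives only on the first PP rank.
        self.embed = nn.Embedding(vocab, hidden) if is_first else None
        self.layers = nn.ModuleList([nn.Linear(hidden, hidden) for _ in range(2)])
        # Non-last ranks hold a placeholder instead of a real lm_head.
        self.lm_head = nn.Linear(hidden, vocab) if is_last else nn.Identity()

    def forward(self, input_ids=None, hidden_states=None, mm_features=None):
        if self.is_first:
            hidden_states = self.embed(input_ids)
        if self.is_first and mm_features is not None:
            # Merge vision features into text embeddings (first rank only).
            hidden_states = hidden_states + mm_features
        for layer in self.layers:
            hidden_states = layer(hidden_states)
        if self.is_last:
            return self.lm_head(hidden_states)  # final logits, last rank only
        return hidden_states  # intermediate hidden states, passed to the next rank
```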

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request adds support for pipeline parallelism (PP) to the Qwen2.5-VL model. The changes correctly adapt the model's embedding, forward pass, and weight loading logic to work across multiple pipeline stages. For instance, embeddings are now computed only on the first rank, and the language model head is only initialized on the last rank. I've found one potential bug where auxiliary hidden states are computed but not passed to the logits processor, which could affect features like speculative decoding. Overall, the changes are well-structured for enabling pipeline parallelism.
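
The per-stage initialization and loading the review describes can be pictured with a small name filter like the sketch below. The parameter-name prefixes, the start_layer/end_layer bounds, and the helper's signature are illustrative assumptions, not the actual interface used in this PR.

```python
# Hedged sketch of stage-aware weight loading; prefixes and arguments are
# illustrative assumptions, not sglang's real loader interface.
def stage_owns_param(name: str, start_layer: int, end_layer: int,
                     is_first: bool, is_last: bool) -> bool:
    """Return True if the current pipeline stage should load this parameter."""
    if name.startswith("visual.") or name.startswith("model.embed_tokens"):
        return is_first                      # vision tower / embeddings: first stage
    if name.startswith("lm_head") or name.startswith("model.norm"):
        return is_last                       # lm_head and final norm: last stage
    if name.startswith("model.layers."):
        layer_idx = int(name.split(".")[2])  # e.g. "model.layers.17.mlp.down_proj"
        return start_layer <= layer_idx < end_layer
    return True                              # anything shared: load on every stage

# Example: on the first of two stages covering layers [0, 14),
# stage_owns_param("lm_head.weight", 0, 14, True, False) -> False.
```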

Co-authored-by: Yuan Luo <[email protected]>
Co-authored-by: Tianyu Guo <[email protected]>
@yuan-luo yuan-luo force-pushed the support_qwen_2_5_vl_pp branch from db18a33 to 8b16cb0 Compare November 12, 2025 02:39
@yuan-luo
Copy link
Collaborator Author

Added the logic from #12762, which is needed here. Adding @gty111 as a co-author.

@JustinTong0323
Collaborator

nit: for code readability, it's better to avoid nested if statements.
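
As a hypothetical illustration of the nit (not the PR's actual code), the flattened form reads more directly than the nested one:

```python
# Illustrative only; the names here are hypothetical, not the PR's actual code.
def merge_if_first_rank(is_first_rank: bool, mm_features, hidden):
    # Nested form the comment discourages:
    #   if is_first_rank:
    #       if mm_features is not None:
    #           hidden = hidden + mm_features
    # Flattened single condition:
    if is_first_rank and mm_features is not None:
        hidden = hidden + mm_features
    return hidden
```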

@yhyang201 yhyang201 self-requested a review November 12, 2025 12:37
@hnyls2002 hnyls2002 merged commit 706502f into sgl-project:main Nov 12, 2025
121 of 127 checks passed