Conversation

@BBuf (Collaborator) commented Dec 1, 2025

@sglang 
➜  sglang git:(main) ✗ CUDA_VISIBLE_DEVICES=7 python -m sglang.launch_server \
    --model Qwen/Qwen3-VL-32B-Instruct-FP8 \
    --tp 1 \
    --quantization fp8 \
    --trust-remote-code
/usr/local/lib/python3.12/dist-packages/torch/cuda/__init__.py:63: FutureWarning: The pynvml package is deprecated. Please install nvidia-ml-py instead. If you did not install pynvml directly, please report this to the maintainers of the package that installed pynvml for you.
  import pynvml  # type: ignore[import]
[2025-12-01 01:06:59] WARNING server_args.py:1305: Attention backend not explicitly specified. Use flashinfer backend by default.
[2025-12-01 01:07:00] server_args=ServerArgs(model_path='Qwen/Qwen3-VL-32B-Instruct-FP8', tokenizer_path='Qwen/Qwen3-VL-32B-Instruct-FP8', tokenizer_mode='auto', tokenizer_worker_num=1, skip_tokenizer_init=False, load_format='auto', model_loader_extra_config='{}', trust_remote_code=True, context_length=None, is_embedding=False, enable_multimodal=None, revision=None, model_impl='auto', host='127.0.0.1', port=30000, fastapi_root_path='', grpc_mode=False, skip_server_warmup=False, warmups=None, nccl_port=None, checkpoint_engine_wait_weights_before_ready=False, dtype='auto', quantization='fp8', quantization_param_path=None, kv_cache_dtype='auto', enable_fp32_lm_head=False, modelopt_quant=None, modelopt_checkpoint_restore_path=None, modelopt_checkpoint_save_path=None, modelopt_export_path=None, quantize_and_serve=False, mem_fraction_static=0.7925241406249999, max_running_requests=None, max_queued_requests=None, max_total_tokens=None, chunked_prefill_size=16384, max_prefill_tokens=16384, schedule_policy='fcfs', enable_priority_scheduling=False, abort_on_priority_when_disabled=False, schedule_low_priority_values_first=False, priority_scheduling_preemption_threshold=10, schedule_conservativeness=1.0, page_size=1, hybrid_kvcache_ratio=None, swa_full_tokens_ratio=0.8, disable_hybrid_swa_memory=False, radix_eviction_policy='lru', device='cuda', tp_size=1, pp_size=1, pp_max_micro_batch_size=None, stream_interval=1, stream_output=False, random_seed=495890181, constrained_json_whitespace_pattern=None, constrained_json_disable_any_whitespace=False, watchdog_timeout=300, dist_timeout=None, download_dir=None, base_gpu_id=0, gpu_id_step=1, sleep_on_idle=False, mm_process_config={}, log_level='info', log_level_http=None, log_requests=False, log_requests_level=2, crash_dump_folder=None, show_time_cost=False, enable_metrics=False, enable_metrics_for_all_schedulers=False, tokenizer_metrics_custom_labels_header='x-custom-labels', tokenizer_metrics_allowed_custom_labels=None, bucket_time_to_first_token=None, bucket_inter_token_latency=None, bucket_e2e_request_latency=None, collect_tokens_histogram=False, prompt_tokens_buckets=None, generation_tokens_buckets=None, gc_warning_threshold_secs=0.0, decode_log_interval=40, enable_request_time_stats_logging=False, kv_events_config=None, enable_trace=False, otlp_traces_endpoint='localhost:4317', export_metrics_to_file=False, export_metrics_to_file_dir=None, api_key=None, served_model_name='Qwen/Qwen3-VL-32B-Instruct-FP8', weight_version='default', chat_template=None, completion_template=None, file_storage_path='sglang_storage', enable_cache_report=False, reasoning_parser=None, tool_call_parser=None, tool_server=None, sampling_defaults='model', dp_size=1, load_balance_method='round_robin', load_watch_interval=0.1, prefill_round_robin_balance=False, dist_init_addr=None, nnodes=1, node_rank=0, json_model_override_args='{}', preferred_sampling_params=None, enable_lora=None, max_lora_rank=None, lora_target_modules=None, lora_paths=None, max_loaded_loras=None, max_loras_per_batch=8, lora_eviction_policy='lru', lora_backend='csgmv', max_lora_chunk_size=16, attention_backend='flashinfer', decode_attention_backend=None, prefill_attention_backend=None, sampling_backend='flashinfer', grammar_backend='xgrammar', mm_attention_backend=None, nsa_prefill_backend='flashmla_sparse', nsa_decode_backend='fa3', enable_flashinfer_autotune=False, speculative_algorithm=None, speculative_draft_model_path=None, speculative_draft_model_revision=None, speculative_draft_load_format=None, 
speculative_num_steps=None, speculative_eagle_topk=None, speculative_num_draft_tokens=None, speculative_accept_threshold_single=1.0, speculative_accept_threshold_acc=1.0, speculative_token_map=None, speculative_attention_mode='prefill', speculative_moe_runner_backend=None, speculative_ngram_min_match_window_size=1, speculative_ngram_max_match_window_size=12, speculative_ngram_min_bfs_breadth=1, speculative_ngram_max_bfs_breadth=10, speculative_ngram_match_type='BFS', speculative_ngram_branch_length=18, speculative_ngram_capacity=10000000, ep_size=1, moe_a2a_backend='none', moe_runner_backend='auto', flashinfer_mxfp4_moe_precision='default', enable_flashinfer_allreduce_fusion=False, deepep_mode='auto', ep_num_redundant_experts=0, ep_dispatch_algorithm=None, init_expert_location='trivial', enable_eplb=False, eplb_algorithm='auto', eplb_rebalance_num_iterations=1000, eplb_rebalance_layers_per_chunk=None, eplb_min_rebalancing_utilization_threshold=1.0, expert_distribution_recorder_mode=None, expert_distribution_recorder_buffer_size=1000, enable_expert_distribution_metrics=False, deepep_config=None, moe_dense_tp_size=None, elastic_ep_backend=None, mooncake_ib_device=None, max_mamba_cache_size=None, mamba_ssm_dtype='float32', mamba_full_memory_ratio=0.9, enable_hierarchical_cache=False, hicache_ratio=2.0, hicache_size=0, hicache_write_policy='write_through', hicache_io_backend='kernel', hicache_mem_layout='layer_first', hicache_storage_backend=None, hicache_storage_prefetch_policy='best_effort', hicache_storage_backend_extra_config=None, enable_lmcache=False, kt_weight_path=None, kt_method='AMXINT4', kt_cpuinfer=None, kt_threadpool_count=2, kt_num_gpu_experts=None, kt_max_deferred_experts_per_token=None, dllm_algorithm=None, dllm_block_size=None, enable_double_sparsity=False, ds_channel_config_path=None, ds_heavy_channel_num=32, ds_heavy_token_num=256, ds_heavy_channel_type='qk', ds_sparse_decode_threshold=4096, cpu_offload_gb=0, offload_group_size=-1, offload_num_in_group=1, offload_prefetch_step=1, offload_mode='cpu', multi_item_scoring_delimiter=None, disable_radix_cache=False, cuda_graph_max_bs=512, cuda_graph_bs=[1, 2, 4, 8, 12, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480, 496, 512], disable_cuda_graph=False, disable_cuda_graph_padding=False, enable_profile_cuda_graph=False, enable_cudagraph_gc=False, enable_layerwise_nvtx_marker=False, enable_nccl_nvls=False, enable_symm_mem=False, disable_flashinfer_cutlass_moe_fp4_allgather=False, enable_tokenizer_batch_encode=False, disable_tokenizer_batch_decode=False, disable_outlines_disk_cache=False, disable_custom_all_reduce=False, enable_mscclpp=False, enable_torch_symm_mem=False, disable_overlap_schedule=False, enable_mixed_chunk=False, enable_dp_attention=False, enable_dp_lm_head=False, enable_two_batch_overlap=False, enable_single_batch_overlap=False, tbo_token_distribution_threshold=0.48, enable_torch_compile=False, enable_piecewise_cuda_graph=False, enable_torch_compile_debug_mode=False, torch_compile_max_bs=32, piecewise_cuda_graph_max_tokens=4096, piecewise_cuda_graph_tokens=[4, 8, 12, 16, 20, 24, 28, 32, 48, 64, 80, 96, 112, 128, 144, 160, 176, 192, 208, 224, 240, 256, 288, 320, 352, 384, 416, 448, 480, 512, 640, 768, 896, 1024, 1152, 1280, 1408, 1536, 1664, 1792, 1920, 2048, 2176, 2304, 2432, 2560, 2688, 2816, 2944, 3072, 3200, 3328, 3456, 3584, 3712, 3840, 3968, 4096], 
piecewise_cuda_graph_compiler='eager', torchao_config='', enable_nan_detection=False, enable_p2p_check=False, triton_attention_reduce_in_fp32=False, triton_attention_num_kv_splits=8, triton_attention_split_tile_size=None, num_continuous_decode_steps=1, delete_ckpt_after_loading=False, enable_memory_saver=False, enable_weights_cpu_backup=False, enable_draft_weights_cpu_backup=False, allow_auto_truncate=False, enable_custom_logit_processor=False, flashinfer_mla_disable_ragged=False, disable_shared_experts_fusion=False, disable_chunked_prefix_cache=False, disable_fast_image_processor=False, keep_mm_feature_on_device=False, enable_return_hidden_states=False, scheduler_recv_interval=1, numa_node=None, enable_deterministic_inference=False, rl_on_policy_target=None, enable_attn_tp_input_scattered=False, enable_nsa_prefill_context_parallel=False, enable_dynamic_batch_tokenizer=False, dynamic_batch_tokenizer_batch_size=32, dynamic_batch_tokenizer_batch_timeout=0.002, debug_tensor_dump_output_folder=None, debug_tensor_dump_layers=None, debug_tensor_dump_input_file=None, debug_tensor_dump_inject=False, disaggregation_mode='null', disaggregation_transfer_backend='mooncake', disaggregation_bootstrap_port=8998, disaggregation_decode_tp=None, disaggregation_decode_dp=None, disaggregation_prefill_pp=1, disaggregation_ib_device=None, disaggregation_decode_enable_offload_kvcache=False, num_reserved_decode_tokens=512, disaggregation_decode_polling_interval=1, custom_weight_loader=[], weight_loader_disable_mmap=False, remote_instance_weight_loader_seed_instance_ip=None, remote_instance_weight_loader_seed_instance_service_port=None, remote_instance_weight_loader_send_weights_group_ports=None, enable_pdmux=False, pdmux_config_path=None, sm_group_num=8, mm_max_concurrent_calls=32, mm_per_request_timeout=10.0, enable_broadcast_mm_inputs_process=False, decrypted_config_file=None, decrypted_draft_config_file=None, mm_enable_dp_encoder=False, forward_hooks=None)
/usr/local/lib/python3.12/dist-packages/torch/cuda/__init__.py:63: FutureWarning: The pynvml package is deprecated. Please install nvidia-ml-py instead. If you did not install pynvml directly, please report this to the maintainers of the package that installed pynvml for you.
  import pynvml  # type: ignore[import]
/usr/local/lib/python3.12/dist-packages/torch/cuda/__init__.py:63: FutureWarning: The pynvml package is deprecated. Please install nvidia-ml-py instead. If you did not install pynvml directly, please report this to the maintainers of the package that installed pynvml for you.
  import pynvml  # type: ignore[import]
[2025-12-01 01:07:04] Using default HuggingFace chat template with detected content format: openai
[2025-12-01 01:07:11] Init torch distributed begin.
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[2025-12-01 01:07:12] Init torch distributed ends. mem usage=0.00 GB
[2025-12-01 01:07:12] MOE_RUNNER_BACKEND is not initialized, the backend will be automatically selected
[2025-12-01 01:07:12] Ignore import error when loading sglang.srt.models.mindspore: name 'ms' is not defined
[2025-12-01 01:07:13] Load weight begin. avail mem=177.74 GB
[2025-12-01 01:07:13] Detected fp8 checkpoint.
[2025-12-01 01:07:13] Multimodal attention backend not set. Use triton_attn.
[2025-12-01 01:07:13] Using triton_attn as multimodal attention backend.
[2025-12-01 01:07:14] Using model weights format ['*.safetensors']
Loading safetensors checkpoint shards:   0% Completed | 0/7 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  14% Completed | 1/7 [00:01<00:06,  1.15s/it]
Loading safetensors checkpoint shards:  29% Completed | 2/7 [00:02<00:06,  1.23s/it]
Loading safetensors checkpoint shards:  43% Completed | 3/7 [00:03<00:05,  1.25s/it]
Loading safetensors checkpoint shards:  57% Completed | 4/7 [00:04<00:03,  1.09s/it]
Loading safetensors checkpoint shards:  71% Completed | 5/7 [00:05<00:02,  1.14s/it]
Loading safetensors checkpoint shards:  86% Completed | 6/7 [00:07<00:01,  1.19s/it]
Loading safetensors checkpoint shards: 100% Completed | 7/7 [00:08<00:00,  1.23s/it]
Loading safetensors checkpoint shards: 100% Completed | 7/7 [00:08<00:00,  1.20s/it]

[2025-12-01 01:07:23] Load weight end. type=Qwen3VLForConditionalGeneration, dtype=torch.bfloat16, avail mem=143.29 GB, mem usage=34.45 GB.
[2025-12-01 01:07:23] Using KV cache dtype: torch.bfloat16
[2025-12-01 01:07:23] KV Cache is allocated. #tokens: 436380, K size: 53.27 GB, V size: 53.27 GB
[2025-12-01 01:07:23] Memory pool end. avail mem=34.66 GB
[2025-12-01 01:07:23] Capture cuda graph begin. This can take up to several minutes. avail mem=34.22 GB
[2025-12-01 01:07:23] Capture cuda graph bs [1, 2, 4, 8, 12, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480, 496, 512]
Capturing batches (bs=512 avail_mem=33.43 GB):   0%|                                                                                                                                 | 0/52 [00:00<?, ?it/s]
[2025-12-01 01:07:25] Scheduler hit an exception: Traceback (most recent call last):
  File "/home/yineng/bbuf/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 344, in __init__
    self.capture()
  File "/home/yineng/bbuf/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 502, in capture
    _capture_one_stream()
  File "/home/yineng/bbuf/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 486, in _capture_one_stream
    ) = self.capture_one_batch_size(bs, forward, stream_idx)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/yineng/bbuf/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 650, in capture_one_batch_size
    attn_backend.init_forward_metadata_capture_cuda_graph(
  File "/home/yineng/bbuf/sglang/python/sglang/srt/layers/attention/flashinfer_backend.py", line 561, in init_forward_metadata_capture_cuda_graph
    self.indices_updater_decode.update(
  File "/home/yineng/bbuf/sglang/python/sglang/srt/layers/attention/flashinfer_backend.py", line 909, in update_single_wrapper
    self.call_begin_forward(
  File "/home/yineng/bbuf/sglang/python/sglang/srt/layers/attention/flashinfer_backend.py", line 1090, in call_begin_forward
    wrapper.begin_forward(
  File "/usr/local/lib/python3.12/dist-packages/flashinfer/decode.py", line 1045, in plan
    self._plan_info = self._cached_module.plan(
                      ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "python/tvm_ffi/cython/function.pxi", line 814, in core.Function.__call__
RuntimeError: Error in function 'aligned_alloc' at /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/allocator.h:49: Buffer overflow when allocating memory for batch_prefill_tmp_s with size 3141632 and alignment 16, but only 524288 bytes available in AlignedAllocator. Increase the workspace buffer size.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/yineng/bbuf/sglang/python/sglang/srt/managers/scheduler.py", line 2649, in run_scheduler_process
    scheduler = Scheduler(
                ^^^^^^^^^^
  File "/home/yineng/bbuf/sglang/python/sglang/srt/managers/scheduler.py", line 316, in __init__
    self.tp_worker = TpModelWorker(
                     ^^^^^^^^^^^^^^
  File "/home/yineng/bbuf/sglang/python/sglang/srt/managers/tp_worker.py", line 245, in __init__
    self._model_runner = ModelRunner(
                         ^^^^^^^^^^^^
  File "/home/yineng/bbuf/sglang/python/sglang/srt/model_executor/model_runner.py", line 359, in __init__
    self.initialize(min_per_gpu_memory)
  File "/home/yineng/bbuf/sglang/python/sglang/srt/model_executor/model_runner.py", line 550, in initialize
    self.init_device_graphs()
  File "/home/yineng/bbuf/sglang/python/sglang/srt/model_executor/model_runner.py", line 2469, in init_device_graphs
    self.graph_runner = graph_runners[self.device](self)
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/yineng/bbuf/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 346, in __init__
    raise Exception(
Exception: Capture cuda graph failed: Error in function 'aligned_alloc' at /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/allocator.h:49: Buffer overflow when allocating memory for batch_prefill_tmp_s with size 3141632 and alignment 16, but only 524288 bytes available in AlignedAllocator. Increase the workspace buffer size.
Possible solutions:
1. set --mem-fraction-static to a smaller value (e.g., 0.8 or 0.7)
2. set --cuda-graph-max-bs to a smaller value (e.g., 16)
3. disable torch compile by not using --enable-torch-compile
4. disable CUDA graph by --disable-cuda-graph. (Not recommended. Huge performance loss)
Open an issue on GitHub https://github.com/sgl-project/sglang/issues/new/choose 


[2025-12-01 01:07:25] Received sigquit from a child process. It usually means the child failed.
[1]    457177 killed     CUDA_VISIBLE_DEVICES=7 python -m sglang.launch_server --model  --tp 1  fp8  

With the fix:

[screenshot] [screenshot]
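
For context on the failure above: FlashInfer does its planning scratch allocations (such as batch_prefill_tmp_s) inside a workspace buffer that the caller allocates up front and passes to the wrapper, and plan() aborts with the aligned_alloc "Buffer overflow" error once that buffer is too small for the batch being planned. A minimal sketch of that allocation pattern, with an illustrative 128 MB size rather than sglang's actual default:

    import torch
    import flashinfer

    # FlashInfer carves all planning scratch space out of this caller-provided
    # buffer; if it is smaller than what plan() needs for the current batch
    # (here, CUDA-graph capture at bs=512), the aligned_alloc error above fires.
    # 128 MB is an illustrative size, not the value sglang configures.
    workspace_buffer = torch.empty(128 * 1024 * 1024, dtype=torch.uint8, device="cuda")
    decode_wrapper = flashinfer.BatchDecodeWithPagedKVCacheWrapper(workspace_buffer, "NHD")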

@gemini-code-assist (Contributor)

Summary of Changes

Hello @BBuf, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request resolves a critical buffer overflow issue that occurred when attempting to run Qwen3VL models with FlashInfer. By adjusting the allocated workspace memory for these specific model architectures, the system can now successfully capture CUDA graphs and operate without memory allocation failures, improving stability and compatibility for Qwen3VL models.

Highlights

  • FlashInfer Workspace Size: Increased the FlashInfer workspace size for specific Qwen3VL model architectures to prevent buffer overflow errors during CUDA graph capture.
  • Model Compatibility: Extended the list of models requiring an increased FlashInfer workspace to include Qwen3VLForConditionalGeneration and Qwen3VLMoeForConditionalGeneration (see the sketch below this list).
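
A minimal sketch of the kind of change these highlights describe, assuming the workspace bump sits behind an architecture check in flashinfer_backend.py; the identifier names and the 512 MB figure are assumptions, not the literal diff:

    def maybe_enlarge_flashinfer_workspace(model_config, global_config):
        # Hypothetical helper: enlarge the FlashInfer workspace for the Qwen3-VL
        # architectures named in this PR so that CUDA-graph capture at large batch
        # sizes has room for planning buffers such as batch_prefill_tmp_s.
        architectures = model_config.hf_config.architectures
        if (
            "Qwen3VLForConditionalGeneration" in architectures
            or "Qwen3VLMoeForConditionalGeneration" in architectures
        ):
            # 512 MB is an assumed value for illustration.
            global_config.flashinfer_workspace_size = 512 * 1024 * 1024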

@gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request correctly fixes a buffer overflow issue with FlashInfer for Qwen3-VL models by increasing the workspace size. The change is straightforward and effective. I've added a suggestion to refactor the condition for checking model architectures to improve code readability and maintainability.
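
The refactor suggestion presumably targets the chained or-condition over architecture names; a hypothetical before/after sketch, with identifiers assumed rather than taken from the actual diff:

    # Before (hypothetical): the condition grows by one "or" clause per new model.
    #   if ("Qwen3VLForConditionalGeneration" in architectures
    #           or "Qwen3VLMoeForConditionalGeneration" in architectures):
    #       ...

    # After (hypothetical): a named constant plus a single membership test.
    LARGE_WORKSPACE_ARCHITECTURES = frozenset(
        {
            "Qwen3VLForConditionalGeneration",
            "Qwen3VLMoeForConditionalGeneration",
        }
    )

    def needs_large_workspace(architectures: list[str]) -> bool:
        # True if any of the model's declared architectures needs the larger buffer.
        return not LARGE_WORKSPACE_ARCHITECTURES.isdisjoint(architectures)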

@BBuf requested a review from yuan-luo on December 1, 2025 at 09:48
@BBuf merged commit fa9021b into main on Dec 1, 2025
51 of 58 checks passed
@BBuf deleted the fix/qwen3vl-flashinfer-workspace-size branch on December 1, 2025 at 09:54
harvenstar pushed a commit to harvenstar/sglang that referenced this pull request Dec 4, 2025
tonyluj pushed a commit to openanolis/sglang that referenced this pull request Dec 5, 2025