Conversation

@BBuf (Collaborator) commented Dec 1, 2025

@sglang 
➜  sglang git:(main) ✗ CUDA_VISIBLE_DEVICES=7 python -m sglang.launch_server \
    --model Qwen/Qwen3-VL-32B-Instruct-FP8 \
    --tp 1 \
    --quantization fp8 \
    --trust-remote-code
/usr/local/lib/python3.12/dist-packages/torch/cuda/__init__.py:63: FutureWarning: The pynvml package is deprecated. Please install nvidia-ml-py instead. If you did not install pynvml directly, please report this to the maintainers of the package that installed pynvml for you.
  import pynvml  # type: ignore[import]
[2025-12-01 01:06:59] WARNING server_args.py:1305: Attention backend not explicitly specified. Use flashinfer backend by default.
[2025-12-01 01:07:00] server_args=ServerArgs(model_path='Qwen/Qwen3-VL-32B-Instruct-FP8', tokenizer_path='Qwen/Qwen3-VL-32B-Instruct-FP8', tokenizer_mode='auto', tokenizer_worker_num=1, skip_tokenizer_init=False, load_format='auto', model_loader_extra_config='{}', trust_remote_code=True, context_length=None, is_embedding=False, enable_multimodal=None, revision=None, model_impl='auto', host='127.0.0.1', port=30000, fastapi_root_path='', grpc_mode=False, skip_server_warmup=False, warmups=None, nccl_port=None, checkpoint_engine_wait_weights_before_ready=False, dtype='auto', quantization='fp8', quantization_param_path=None, kv_cache_dtype='auto', enable_fp32_lm_head=False, modelopt_quant=None, modelopt_checkpoint_restore_path=None, modelopt_checkpoint_save_path=None, modelopt_export_path=None, quantize_and_serve=False, mem_fraction_static=0.7925241406249999, max_running_requests=None, max_queued_requests=None, max_total_tokens=None, chunked_prefill_size=16384, max_prefill_tokens=16384, schedule_policy='fcfs', enable_priority_scheduling=False, abort_on_priority_when_disabled=False, schedule_low_priority_values_first=False, priority_scheduling_preemption_threshold=10, schedule_conservativeness=1.0, page_size=1, hybrid_kvcache_ratio=None, swa_full_tokens_ratio=0.8, disable_hybrid_swa_memory=False, radix_eviction_policy='lru', device='cuda', tp_size=1, pp_size=1, pp_max_micro_batch_size=None, stream_interval=1, stream_output=False, random_seed=495890181, constrained_json_whitespace_pattern=None, constrained_json_disable_any_whitespace=False, watchdog_timeout=300, dist_timeout=None, download_dir=None, base_gpu_id=0, gpu_id_step=1, sleep_on_idle=False, mm_process_config={}, log_level='info', log_level_http=None, log_requests=False, log_requests_level=2, crash_dump_folder=None, show_time_cost=False, enable_metrics=False, enable_metrics_for_all_schedulers=False, tokenizer_metrics_custom_labels_header='x-custom-labels', tokenizer_metrics_allowed_custom_labels=None, bucket_time_to_first_token=None, bucket_inter_token_latency=None, bucket_e2e_request_latency=None, collect_tokens_histogram=False, prompt_tokens_buckets=None, generation_tokens_buckets=None, gc_warning_threshold_secs=0.0, decode_log_interval=40, enable_request_time_stats_logging=False, kv_events_config=None, enable_trace=False, otlp_traces_endpoint='localhost:4317', export_metrics_to_file=False, export_metrics_to_file_dir=None, api_key=None, served_model_name='Qwen/Qwen3-VL-32B-Instruct-FP8', weight_version='default', chat_template=None, completion_template=None, file_storage_path='sglang_storage', enable_cache_report=False, reasoning_parser=None, tool_call_parser=None, tool_server=None, sampling_defaults='model', dp_size=1, load_balance_method='round_robin', load_watch_interval=0.1, prefill_round_robin_balance=False, dist_init_addr=None, nnodes=1, node_rank=0, json_model_override_args='{}', preferred_sampling_params=None, enable_lora=None, max_lora_rank=None, lora_target_modules=None, lora_paths=None, max_loaded_loras=None, max_loras_per_batch=8, lora_eviction_policy='lru', lora_backend='csgmv', max_lora_chunk_size=16, attention_backend='flashinfer', decode_attention_backend=None, prefill_attention_backend=None, sampling_backend='flashinfer', grammar_backend='xgrammar', mm_attention_backend=None, nsa_prefill_backend='flashmla_sparse', nsa_decode_backend='fa3', enable_flashinfer_autotune=False, speculative_algorithm=None, speculative_draft_model_path=None, speculative_draft_model_revision=None, speculative_draft_load_format=None, 
speculative_num_steps=None, speculative_eagle_topk=None, speculative_num_draft_tokens=None, speculative_accept_threshold_single=1.0, speculative_accept_threshold_acc=1.0, speculative_token_map=None, speculative_attention_mode='prefill', speculative_moe_runner_backend=None, speculative_ngram_min_match_window_size=1, speculative_ngram_max_match_window_size=12, speculative_ngram_min_bfs_breadth=1, speculative_ngram_max_bfs_breadth=10, speculative_ngram_match_type='BFS', speculative_ngram_branch_length=18, speculative_ngram_capacity=10000000, ep_size=1, moe_a2a_backend='none', moe_runner_backend='auto', flashinfer_mxfp4_moe_precision='default', enable_flashinfer_allreduce_fusion=False, deepep_mode='auto', ep_num_redundant_experts=0, ep_dispatch_algorithm=None, init_expert_location='trivial', enable_eplb=False, eplb_algorithm='auto', eplb_rebalance_num_iterations=1000, eplb_rebalance_layers_per_chunk=None, eplb_min_rebalancing_utilization_threshold=1.0, expert_distribution_recorder_mode=None, expert_distribution_recorder_buffer_size=1000, enable_expert_distribution_metrics=False, deepep_config=None, moe_dense_tp_size=None, elastic_ep_backend=None, mooncake_ib_device=None, max_mamba_cache_size=None, mamba_ssm_dtype='float32', mamba_full_memory_ratio=0.9, enable_hierarchical_cache=False, hicache_ratio=2.0, hicache_size=0, hicache_write_policy='write_through', hicache_io_backend='kernel', hicache_mem_layout='layer_first', hicache_storage_backend=None, hicache_storage_prefetch_policy='best_effort', hicache_storage_backend_extra_config=None, enable_lmcache=False, kt_weight_path=None, kt_method='AMXINT4', kt_cpuinfer=None, kt_threadpool_count=2, kt_num_gpu_experts=None, kt_max_deferred_experts_per_token=None, dllm_algorithm=None, dllm_block_size=None, enable_double_sparsity=False, ds_channel_config_path=None, ds_heavy_channel_num=32, ds_heavy_token_num=256, ds_heavy_channel_type='qk', ds_sparse_decode_threshold=4096, cpu_offload_gb=0, offload_group_size=-1, offload_num_in_group=1, offload_prefetch_step=1, offload_mode='cpu', multi_item_scoring_delimiter=None, disable_radix_cache=False, cuda_graph_max_bs=512, cuda_graph_bs=[1, 2, 4, 8, 12, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480, 496, 512], disable_cuda_graph=False, disable_cuda_graph_padding=False, enable_profile_cuda_graph=False, enable_cudagraph_gc=False, enable_layerwise_nvtx_marker=False, enable_nccl_nvls=False, enable_symm_mem=False, disable_flashinfer_cutlass_moe_fp4_allgather=False, enable_tokenizer_batch_encode=False, disable_tokenizer_batch_decode=False, disable_outlines_disk_cache=False, disable_custom_all_reduce=False, enable_mscclpp=False, enable_torch_symm_mem=False, disable_overlap_schedule=False, enable_mixed_chunk=False, enable_dp_attention=False, enable_dp_lm_head=False, enable_two_batch_overlap=False, enable_single_batch_overlap=False, tbo_token_distribution_threshold=0.48, enable_torch_compile=False, enable_piecewise_cuda_graph=False, enable_torch_compile_debug_mode=False, torch_compile_max_bs=32, piecewise_cuda_graph_max_tokens=4096, piecewise_cuda_graph_tokens=[4, 8, 12, 16, 20, 24, 28, 32, 48, 64, 80, 96, 112, 128, 144, 160, 176, 192, 208, 224, 240, 256, 288, 320, 352, 384, 416, 448, 480, 512, 640, 768, 896, 1024, 1152, 1280, 1408, 1536, 1664, 1792, 1920, 2048, 2176, 2304, 2432, 2560, 2688, 2816, 2944, 3072, 3200, 3328, 3456, 3584, 3712, 3840, 3968, 4096], 
piecewise_cuda_graph_compiler='eager', torchao_config='', enable_nan_detection=False, enable_p2p_check=False, triton_attention_reduce_in_fp32=False, triton_attention_num_kv_splits=8, triton_attention_split_tile_size=None, num_continuous_decode_steps=1, delete_ckpt_after_loading=False, enable_memory_saver=False, enable_weights_cpu_backup=False, enable_draft_weights_cpu_backup=False, allow_auto_truncate=False, enable_custom_logit_processor=False, flashinfer_mla_disable_ragged=False, disable_shared_experts_fusion=False, disable_chunked_prefix_cache=False, disable_fast_image_processor=False, keep_mm_feature_on_device=False, enable_return_hidden_states=False, scheduler_recv_interval=1, numa_node=None, enable_deterministic_inference=False, rl_on_policy_target=None, enable_attn_tp_input_scattered=False, enable_nsa_prefill_context_parallel=False, enable_dynamic_batch_tokenizer=False, dynamic_batch_tokenizer_batch_size=32, dynamic_batch_tokenizer_batch_timeout=0.002, debug_tensor_dump_output_folder=None, debug_tensor_dump_layers=None, debug_tensor_dump_input_file=None, debug_tensor_dump_inject=False, disaggregation_mode='null', disaggregation_transfer_backend='mooncake', disaggregation_bootstrap_port=8998, disaggregation_decode_tp=None, disaggregation_decode_dp=None, disaggregation_prefill_pp=1, disaggregation_ib_device=None, disaggregation_decode_enable_offload_kvcache=False, num_reserved_decode_tokens=512, disaggregation_decode_polling_interval=1, custom_weight_loader=[], weight_loader_disable_mmap=False, remote_instance_weight_loader_seed_instance_ip=None, remote_instance_weight_loader_seed_instance_service_port=None, remote_instance_weight_loader_send_weights_group_ports=None, enable_pdmux=False, pdmux_config_path=None, sm_group_num=8, mm_max_concurrent_calls=32, mm_per_request_timeout=10.0, enable_broadcast_mm_inputs_process=False, decrypted_config_file=None, decrypted_draft_config_file=None, mm_enable_dp_encoder=False, forward_hooks=None)
/usr/local/lib/python3.12/dist-packages/torch/cuda/__init__.py:63: FutureWarning: The pynvml package is deprecated. Please install nvidia-ml-py instead. If you did not install pynvml directly, please report this to the maintainers of the package that installed pynvml for you.
  import pynvml  # type: ignore[import]
/usr/local/lib/python3.12/dist-packages/torch/cuda/__init__.py:63: FutureWarning: The pynvml package is deprecated. Please install nvidia-ml-py instead. If you did not install pynvml directly, please report this to the maintainers of the package that installed pynvml for you.
  import pynvml  # type: ignore[import]
[2025-12-01 01:07:04] Using default HuggingFace chat template with detected content format: openai
[2025-12-01 01:07:11] Init torch distributed begin.
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[2025-12-01 01:07:12] Init torch distributed ends. mem usage=0.00 GB
[2025-12-01 01:07:12] MOE_RUNNER_BACKEND is not initialized, the backend will be automatically selected
[2025-12-01 01:07:12] Ignore import error when loading sglang.srt.models.mindspore: name 'ms' is not defined
[2025-12-01 01:07:13] Load weight begin. avail mem=177.74 GB
[2025-12-01 01:07:13] Detected fp8 checkpoint.
[2025-12-01 01:07:13] Multimodal attention backend not set. Use triton_attn.
[2025-12-01 01:07:13] Using triton_attn as multimodal attention backend.
[2025-12-01 01:07:14] Using model weights format ['*.safetensors']
Loading safetensors checkpoint shards:   0% Completed | 0/7 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  14% Completed | 1/7 [00:01<00:06,  1.15s/it]
Loading safetensors checkpoint shards:  29% Completed | 2/7 [00:02<00:06,  1.23s/it]
Loading safetensors checkpoint shards:  43% Completed | 3/7 [00:03<00:05,  1.25s/it]
Loading safetensors checkpoint shards:  57% Completed | 4/7 [00:04<00:03,  1.09s/it]
Loading safetensors checkpoint shards:  71% Completed | 5/7 [00:05<00:02,  1.14s/it]
Loading safetensors checkpoint shards:  86% Completed | 6/7 [00:07<00:01,  1.19s/it]
Loading safetensors checkpoint shards: 100% Completed | 7/7 [00:08<00:00,  1.23s/it]
Loading safetensors checkpoint shards: 100% Completed | 7/7 [00:08<00:00,  1.20s/it]

[2025-12-01 01:07:23] Load weight end. type=Qwen3VLForConditionalGeneration, dtype=torch.bfloat16, avail mem=143.29 GB, mem usage=34.45 GB.
[2025-12-01 01:07:23] Using KV cache dtype: torch.bfloat16
[2025-12-01 01:07:23] KV Cache is allocated. #tokens: 436380, K size: 53.27 GB, V size: 53.27 GB
[2025-12-01 01:07:23] Memory pool end. avail mem=34.66 GB
[2025-12-01 01:07:23] Capture cuda graph begin. This can take up to several minutes. avail mem=34.22 GB
[2025-12-01 01:07:23] Capture cuda graph bs [1, 2, 4, 8, 12, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480, 496, 512]
Capturing batches (bs=512 avail_mem=33.43 GB):   0%|                                                                                                                                 | 0/52 [00:00<?, ?it/s]
[2025-12-01 01:07:25] Scheduler hit an exception: Traceback (most recent call last):
  File "/home/yineng/bbuf/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 344, in __init__
    self.capture()
  File "/home/yineng/bbuf/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 502, in capture
    _capture_one_stream()
  File "/home/yineng/bbuf/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 486, in _capture_one_stream
    ) = self.capture_one_batch_size(bs, forward, stream_idx)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/yineng/bbuf/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 650, in capture_one_batch_size
    attn_backend.init_forward_metadata_capture_cuda_graph(
  File "/home/yineng/bbuf/sglang/python/sglang/srt/layers/attention/flashinfer_backend.py", line 561, in init_forward_metadata_capture_cuda_graph
    self.indices_updater_decode.update(
  File "/home/yineng/bbuf/sglang/python/sglang/srt/layers/attention/flashinfer_backend.py", line 909, in update_single_wrapper
    self.call_begin_forward(
  File "/home/yineng/bbuf/sglang/python/sglang/srt/layers/attention/flashinfer_backend.py", line 1090, in call_begin_forward
    wrapper.begin_forward(
  File "/usr/local/lib/python3.12/dist-packages/flashinfer/decode.py", line 1045, in plan
    self._plan_info = self._cached_module.plan(
                      ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "python/tvm_ffi/cython/function.pxi", line 814, in core.Function.__call__
RuntimeError: Error in function 'aligned_alloc' at /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/allocator.h:49: Buffer overflow when allocating memory for batch_prefill_tmp_s with size 3141632 and alignment 16, but only 524288 bytes available in AlignedAllocator. Increase the workspace buffer size.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/yineng/bbuf/sglang/python/sglang/srt/managers/scheduler.py", line 2649, in run_scheduler_process
    scheduler = Scheduler(
                ^^^^^^^^^^
  File "/home/yineng/bbuf/sglang/python/sglang/srt/managers/scheduler.py", line 316, in __init__
    self.tp_worker = TpModelWorker(
                     ^^^^^^^^^^^^^^
  File "/home/yineng/bbuf/sglang/python/sglang/srt/managers/tp_worker.py", line 245, in __init__
    self._model_runner = ModelRunner(
                         ^^^^^^^^^^^^
  File "/home/yineng/bbuf/sglang/python/sglang/srt/model_executor/model_runner.py", line 359, in __init__
    self.initialize(min_per_gpu_memory)
  File "/home/yineng/bbuf/sglang/python/sglang/srt/model_executor/model_runner.py", line 550, in initialize
    self.init_device_graphs()
  File "/home/yineng/bbuf/sglang/python/sglang/srt/model_executor/model_runner.py", line 2469, in init_device_graphs
    self.graph_runner = graph_runners[self.device](self)
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/yineng/bbuf/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 346, in __init__
    raise Exception(
Exception: Capture cuda graph failed: Error in function 'aligned_alloc' at /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/allocator.h:49: Buffer overflow when allocating memory for batch_prefill_tmp_s with size 3141632 and alignment 16, but only 524288 bytes available in AlignedAllocator. Increase the workspace buffer size.
Possible solutions:
1. set --mem-fraction-static to a smaller value (e.g., 0.8 or 0.7)
2. set --cuda-graph-max-bs to a smaller value (e.g., 16)
3. disable torch compile by not using --enable-torch-compile
4. disable CUDA graph by --disable-cuda-graph. (Not recommended. Huge performance loss)
Open an issue on GitHub https://github.com/sgl-project/sglang/issues/new/choose 


[2025-12-01 01:07:25] Received sigquit from a child process. It usually means the child failed.
[1]    457177 killed     CUDA_VISIBLE_DEVICES=7 python -m sglang.launch_server --model  --tp 1  fp8  

With the fix:

[screenshot] [screenshot]
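
For context on the failure above: FlashInfer does its planning scratch allocations (such as batch_prefill_tmp_s) inside a workspace buffer that the caller allocates up front and passes to the wrapper, and plan() aborts with the aligned_alloc "Buffer overflow" error once that buffer is too small for the batch being planned. A minimal sketch of that allocation pattern, with an illustrative 128 MB size rather than sglang's actual default:

    import torch
    import flashinfer

    # FlashInfer carves all planning scratch space out of this caller-provided
    # buffer; if it is smaller than what plan() needs for the current batch
    # (here, CUDA-graph capture at bs=512), the aligned_alloc error above fires.
    # 128 MB is an illustrative size, not the value sglang configures.
    workspace_buffer = torch.empty(128 * 1024 * 1024, dtype=torch.uint8, device="cuda")
    decode_wrapper = flashinfer.BatchDecodeWithPagedKVCacheWrapper(workspace_buffer, "NHD")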

@gemini-code-assist (Contributor)

Summary of Changes

Hello @BBuf, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request resolves a critical buffer overflow issue that occurred when attempting to run Qwen3VL models with FlashInfer. By adjusting the allocated workspace memory for these specific model architectures, the system can now successfully capture CUDA graphs and operate without memory allocation failures, improving stability and compatibility for Qwen3VL models.

Highlights

  • FlashInfer Workspace Size: Increased the FlashInfer workspace size for specific Qwen3VL model architectures to prevent buffer overflow errors during CUDA graph capture.
  • Model Compatibility: Extended the list of models requiring an increased FlashInfer workspace to include Qwen3VLForConditionalGeneration and Qwen3VLMoeForConditionalGeneration (see the sketch below this list).
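
A minimal sketch of the kind of change these highlights describe, assuming the workspace bump sits behind an architecture check in flashinfer_backend.py; the identifier names and the 512 MB figure are assumptions, not the literal diff:

    def maybe_enlarge_flashinfer_workspace(model_config, global_config):
        # Hypothetical helper: enlarge the FlashInfer workspace for the Qwen3-VL
        # architectures named in this PR so that CUDA-graph capture at large batch
        # sizes has room for planning buffers such as batch_prefill_tmp_s.
        architectures = model_config.hf_config.architectures
        if (
            "Qwen3VLForConditionalGeneration" in architectures
            or "Qwen3VLMoeForConditionalGeneration" in architectures
        ):
            # 512 MB is an assumed value for illustration.
            global_config.flashinfer_workspace_size = 512 * 1024 * 1024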

@gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request correctly fixes a buffer overflow issue with FlashInfer for Qwen3-VL models by increasing the workspace size. The change is straightforward and effective. I've added a suggestion to refactor the condition for checking model architectures to improve code readability and maintainability.
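
The refactor suggestion presumably targets the chained or-condition over architecture names; a hypothetical before/after sketch, with identifiers assumed rather than taken from the actual diff:

    # Before (hypothetical): the condition grows by one "or" clause per new model.
    #   if ("Qwen3VLForConditionalGeneration" in architectures
    #           or "Qwen3VLMoeForConditionalGeneration" in architectures):
    #       ...

    # After (hypothetical): a named constant plus a single membership test.
    LARGE_WORKSPACE_ARCHITECTURES = frozenset(
        {
            "Qwen3VLForConditionalGeneration",
            "Qwen3VLMoeForConditionalGeneration",
        }
    )

    def needs_large_workspace(architectures: list[str]) -> bool:
        # True if any of the model's declared architectures needs the larger buffer.
        return not LARGE_WORKSPACE_ARCHITECTURES.isdisjoint(architectures)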

@BBuf requested a review from yuan-luo on December 1, 2025 at 09:48
@BBuf merged commit fa9021b into main on Dec 1, 2025
51 of 58 checks passed
@BBuf deleted the fix/qwen3vl-flashinfer-workspace-size branch on December 1, 2025 at 09:54
harvenstar pushed a commit to harvenstar/sglang that referenced this pull request Dec 4, 2025
tonyluj pushed a commit to openanolis/sglang that referenced this pull request Dec 5, 2025