vllm serve pytorch/gemma-3-12b-it-FP8 failed with fbgemm-gpu-genai 1.4.2 on B200 #5221

@huydhn

Description

When running the vLLM benchmark for pytorch/gemma-3-12b-it-FP8 on B200, I noticed that it started failing with fbgemm-gpu-genai=1.4.2: https://github.com/pytorch/pytorch-integration-testing/actions/runs/20177479851/job/57929181213#step:19:2183. Running vllm serve pytorch/gemma-3-12b-it-FP8 produces the following error:

TMA benchmarks will be running without grid constant TMA descriptor.
(APIServer pid=7998) INFO 12-13 00:56:05 [api_server.py:1351] vLLM API server version 0.13.0rc2.dev105+gfdc135d76
(APIServer pid=7998) INFO 12-13 00:56:05 [utils.py:253] non-default args: {'model_tag': 'pytorch/gemma-3-12b-it-FP8', 'model': 'pytorch/gemma-3-12b-it-FP8'}
(APIServer pid=7998) INFO 12-13 00:56:05 [model.py:514] Resolved architecture: Gemma3ForConditionalGeneration
(APIServer pid=7998) INFO 12-13 00:56:05 [model.py:1636] Using max model len 131072
(APIServer pid=7998) INFO 12-13 00:56:05 [scheduler.py:228] Chunked prefill is enabled with max_num_batched_tokens=8192.
(APIServer pid=7998) WARNING 12-13 00:56:05 [cuda.py:244] Forcing --disable_chunked_mm_input for models with multimodal-bidirectional attention.
TMA benchmarks will be running without grid constant TMA descriptor.
(EngineCore_DP0 pid=8153) INFO 12-13 00:56:13 [core.py:93] Initializing a V1 LLM engine (v0.13.0rc2.dev105+gfdc135d76) with config: model='pytorch/gemma-3-12b-it-FP8', speculative_config=None, tokenizer='pytorch/gemma-3-12b-it-FP8', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=131072, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=torchao, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False), seed=0, served_model_name=pytorch/gemma-3-12b-it-FP8, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'level': None, 'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['none'], 'splitting_ops': ['vllm::unified_attention', 'vllm::unified_attention_with_output', 'vllm::unified_mla_attention', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::kda_attention', 'vllm::sparse_attn_indexer'], 'compile_mm_encoder': False, 'compile_sizes': [], 'compile_ranges_split_points': [8192], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480, 496, 512], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': False, 'fuse_attn_quant': False, 'eliminate_noops': True, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 512, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False}, 'local_cache_dir': None}
(EngineCore_DP0 pid=8153) INFO 12-13 00:56:15 [parallel_state.py:1203] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://10.0.31.154:56647 backend=nccl
(EngineCore_DP0 pid=8153) INFO 12-13 00:56:15 [parallel_state.py:1411] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank 0
(EngineCore_DP0 pid=8153) Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.52, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.
(EngineCore_DP0 pid=8153) INFO 12-13 00:56:22 [gpu_model_runner.py:3562] Starting to load model pytorch/gemma-3-12b-it-FP8...
(EngineCore_DP0 pid=8153) /usr/local/lib/python3.12/dist-packages/torch/__init__.py:1617: UserWarning: Please use the new API settings to control TF32 behavior, such as torch.backends.cudnn.conv.fp32_precision = 'tf32' or torch.backends.cuda.matmul.fp32_precision = 'ieee'. Old settings, e.g, torch.backends.cuda.matmul.allow_tf32 = True, torch.backends.cudnn.allow_tf32 = True, allowTF32CuDNN() and allowTF32CuBLAS() will be deprecated after Pytorch 2.9. Please see https://pytorch.org/docs/main/notes/cuda.html#tensorfloat-32-tf32-on-ampere-and-later-devices (Triggered internally at /pytorch/aten/src/ATen/Context.cpp:80.)
(EngineCore_DP0 pid=8153)   _C._set_float32_matmul_precision(precision)
(EngineCore_DP0 pid=8153) INFO 12-13 00:56:24 [layer.py:537] Using AttentionBackendEnum.FLASH_ATTN for MultiHeadAttention in multimodal encoder.
(EngineCore_DP0 pid=8153) INFO 12-13 00:56:26 [cuda.py:412] Using FLEX_ATTENTION attention backend out of potential backends: ['FLEX_ATTENTION']
Loading pt checkpoint shards:   0% Completed | 0/3 [00:00<?, ?it/s]
Loading pt checkpoint shards:  33% Completed | 1/3 [00:02<00:05,  2.80s/it]
Loading pt checkpoint shards:  67% Completed | 2/3 [00:05<00:02,  2.73s/it]
Loading pt checkpoint shards: 100% Completed | 3/3 [00:07<00:00,  2.25s/it]
Loading pt checkpoint shards: 100% Completed | 3/3 [00:07<00:00,  2.39s/it]
(EngineCore_DP0 pid=8153)
(EngineCore_DP0 pid=8153) INFO 12-13 00:56:34 [default_loader.py:308] Loading weights took 7.16 seconds
(EngineCore_DP0 pid=8153) INFO 12-13 00:56:34 [gpu_model_runner.py:3659] Model loading took 12.8828 GiB memory and 11.215734 seconds
(EngineCore_DP0 pid=8153) INFO 12-13 00:56:34 [gpu_model_runner.py:4446] Encoder cache will be initialized with a budget of 8192 tokens, and profiled with 31 image items of the maximum feature size.
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866] EngineCore failed to start.
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866] Traceback (most recent call last):
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 857, in run_engine_core
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866]     engine_core = EngineCoreProc(*args, **kwargs)
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 637, in __init__
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866]     super().__init__(
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 109, in __init__
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866]     num_gpu_blocks, num_cpu_blocks, kv_cache_config = self._initialize_kv_caches(
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866]                                                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 240, in _initialize_kv_caches
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866]     available_gpu_memory = self.model_executor.determine_available_memory()
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866]                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/abstract.py", line 126, in determine_available_memory
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866]     return self.collective_rpc("determine_available_memory")
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/uniproc_executor.py", line 75, in collective_rpc
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866]     result = run_method(self.driver_worker, method, args, kwargs)
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/serial_utils.py", line 461, in run_method
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866]     return func(*args, **kwargs)
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866]   File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 120, in decorate_context
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866]     return func(*args, **kwargs)
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 340, in determine_available_memory
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866]     self.model_runner.profile_run()
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 4462, in profile_run
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866]     dummy_encoder_outputs = self.model.embed_multimodal(
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866]                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/gemma3_mm.py", line 603, in embed_multimodal
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866]     return self._process_image_input(image_input)
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/gemma3_mm.py", line 587, in _process_image_input
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866]     image_features = self._image_pixels_to_features(
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866]                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/gemma3_mm.py", line 576, in _image_pixels_to_features
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866]     return vision_tower(pixel_values)
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866]            ^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866]     return self._call_impl(*args, **kwargs)
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1786, in _call_impl
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866]     return forward_call(*args, **kwargs)
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/siglip.py", line 856, in forward
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866]     return self.vision_model(
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866]            ^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866]     return self._call_impl(*args, **kwargs)
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1786, in _call_impl
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866]     return forward_call(*args, **kwargs)
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/siglip.py", line 754, in forward
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866]     encoder_outputs = self.encoder(
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866]                       ^^^^^^^^^^^^^
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866]     return self._call_impl(*args, **kwargs)
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1786, in _call_impl
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866]     return forward_call(*args, **kwargs)
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/siglip.py", line 562, in forward
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866]     hidden_states, _ = encoder_layer(hidden_states)
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866]                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866]     return self._call_impl(*args, **kwargs)
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1786, in _call_impl
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866]     return forward_call(*args, **kwargs)
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/siglip.py", line 511, in forward
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866]     hidden_states, _ = self.self_attn(hidden_states=hidden_states)
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866]                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866]     return self._call_impl(*args, **kwargs)
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1786, in _call_impl
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866]     return forward_call(*args, **kwargs)
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/siglip.py", line 429, in forward
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866]     qkv_states, _ = self.qkv_proj(hidden_states)
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866]                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866]     return self._call_impl(*args, **kwargs)
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1786, in _call_impl
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866]     return forward_call(*args, **kwargs)
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/linear.py", line 565, in forward
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866]     output_parallel = self.quant_method.apply(self, input_, bias)
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866]                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/quantization/torchao.py", line 348, in apply
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866]     return F.linear(x, layer.weight, bias)
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866]   File "/usr/local/lib/python3.12/dist-packages/torchao/utils.py", line 655, in _dispatch__torch_function__
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866]     return cls._ATEN_OP_OR_TORCH_FN_TABLE[cls][func](func, types, args, kwargs)
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866]   File "/usr/local/lib/python3.12/dist-packages/torchao/utils.py", line 491, in wrapper
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866]     return func(f, types, args, kwargs)
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866]   File "/usr/local/lib/python3.12/dist-packages/torchao/quantization/quantize_/workflows/float8/float8_tensor.py", line 300, in _
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866]     res = torch.ops.fbgemm.f8f8bf16_rowwise(
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866]           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866]   File "/usr/local/lib/python3.12/dist-packages/torch/_ops.py", line 1255, in __call__
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866]     return self._op(*args, **kwargs)
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866]            ^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866] RuntimeError: cutlass cannot initialize
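
The traceback above is hard to scan because every line carries the process and logger prefix. The following is a minimal sketch for stripping that prefix and pulling out the final error message. It assumes the prefix always has the shape seen in this log, `(Name pid=N) LEVEL MM-DD HH:MM:SS [file.py:line]`; the helper names `strip_prefix` and `final_error` are hypothetical and not part of vLLM.

```python
import re
from typing import Optional

# Assumed prefix shape, copied from the log lines in this report, e.g.
# "(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866] "
PREFIX = re.compile(
    r"^\(\w+ pid=\d+\) (?:INFO|WARNING|ERROR) "
    r"\d{2}-\d{2} \d{2}:\d{2}:\d{2} \[[\w.]+:\d+\] ?"
)


def strip_prefix(line: str) -> str:
    """Remove the process/logger prefix, keeping traceback indentation."""
    return PREFIX.sub("", line)


def final_error(log_text: str) -> Optional[str]:
    """Return the message of the last non-empty ERROR line, or None."""
    last = None
    for line in log_text.splitlines():
        if PREFIX.match(line) and " ERROR " in line:
            message = strip_prefix(line).strip()
            if message:
                last = message
    return last
```

Applied to the log above, `final_error` would return the last ERROR message, which here is the `RuntimeError` raised from `torch.ops.fbgemm.f8f8bf16_rowwise`.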

The same command, vllm serve pytorch/gemma-3-12b-it-FP8, works fine with the previous version, fbgemm-gpu-genai=1.4.1. The same error also occurs with pytorch/gemma-3-27b-it-fp8.
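
A small guard in CI could flag the affected version before the server starts. The sketch below is only an illustration: it assumes the distribution name is `fbgemm-gpu-genai` as it appears in the package list, it treats only the two versions named in this report as known, and the function names `classify` and `installed_version` are hypothetical.

```python
from importlib import metadata
from typing import Optional

# Versions taken from this report only; every other version is unverified.
REPORTED_FAILING = {"1.4.2"}
REPORTED_WORKING = {"1.4.1"}


def classify(version: Optional[str]) -> str:
    """Map a version string to what this report says about it."""
    if version is None:
        return "not installed"
    if version in REPORTED_FAILING:
        return "reported failing"
    if version in REPORTED_WORKING:
        return "reported working"
    return "unknown"


def installed_version(dist: str = "fbgemm-gpu-genai") -> Optional[str]:
    """Return the installed version of dist, or None if it is absent."""
    try:
        return metadata.version(dist)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version()
    print(f"fbgemm-gpu-genai {found}: {classify(found)}")
```

If the guard prints "reported failing", pinning with `pip install fbgemm-gpu-genai==1.4.1` is a possible interim workaround, based solely on the observation above that 1.4.1 works.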

Here is the full list of pip packages:

Using Python 3.12.12 environment at: /usr
Package                            Version                     Editable project location
---------------------------------- --------------------------- -------------------------------------
absl-py                            2.1.0
accelerate                         1.0.1
aenum                              3.1.16
affine                             2.4.0
aiohappyeyeballs                   2.6.1
aiohttp                            3.13.0
aiohttp-cors                       0.8.1
aiosignal                          1.4.0
albucore                           0.0.16
albumentations                     1.4.6
alembic                            1.16.4
annotated-doc                      0.0.4
annotated-types                    0.7.0
anthropic                          0.71.0
antlr4-python3-runtime             4.9.3
anyio                              4.6.2.post1
apache-tvm-ffi                     0.1.5
arctic-inference                   0.1.1
argcomplete                        3.5.1
arrow                              1.3.0
astor                              0.8.1
attrs                              24.2.0
audioread                          3.0.1
backoff                            2.2.1
bitsandbytes                       0.46.1
black                              24.10.0
blake3                             1.0.8
blinker                            1.9.0
blobfile                           3.0.0
bm25s                              0.2.13
boto3                              1.35.57
botocore                           1.35.57
bounded-pool-executor              0.0.3
buildkite-test-collector           0.1.9
cachetools                         5.5.2
cbor2                              5.7.1
certifi                            2024.8.30
cffi                               1.17.1
cfgv                               3.5.0
chardet                            5.2.0
charset-normalizer                 3.4.0
chz                                0.3.0
click                              8.1.7
click-plugins                      1.1.1.2
cligj                              0.7.2
cloudpickle                        3.1.1
colorama                           0.4.6
colorful                           0.5.6
compressed-tensors                 0.12.2
contourpy                          1.3.0
coverage                           7.10.6
cramjam                            2.9.0
cryptography                       46.0.3
cuda-bindings                      13.1.1
cuda-pathfinder                    1.3.3
cuda-python                        13.1.1
cupy-cuda12x                       13.6.0
cycler                             0.12.1
databricks-sdk                     0.59.0
datamodel-code-generator           0.26.3
dataproperty                       1.0.1
datasets                           3.0.2
decorator                          5.1.1
decord                             0.6.0
deep-ep                            1.2.1+73b6ea4
deep-gemm                          2.1.0+594953a
depyf                              0.20.0
dill                               0.3.8
diskcache                          5.6.3
distlib                            0.3.9
distro                             1.9.0
dnspython                          2.7.0
docker                             7.1.0
docopt                             0.6.2
docstring-parser                   0.17.0
efficientnet-pytorch               0.7.1
einops                             0.8.1
einx                               0.3.0
email-validator                    2.2.0
encodec                            0.1.1
evaluate                           0.4.3
fastapi                            0.116.1
fastapi-cli                        0.0.16
fastapi-cloud-cli                  0.6.0
fastar                             0.8.0
fastparquet                        2024.11.0
fastrlock                          0.8.2
fastsafetensors                    0.1.10
fbgemm-gpu-genai                   1.4.2
filelock                           3.16.1
fiona                              1.10.1
flashinfer-cubin                   0.5.3
flashinfer-jit-cache               0.5.3+cu129
flashinfer-python                  0.5.3
flask                              3.1.1
fonttools                          4.55.0
fqdn                               1.5.1
frozendict                         2.4.6
frozenlist                         1.5.0
fsspec                             2024.9.0
ftfy                               6.3.1
genai-perf                         0.0.8
genson                             1.3.0
geopandas                          1.0.1
gguf                               0.17.1
gitdb                              4.0.12
gitpython                          3.1.44
google-api-core                    2.24.2
google-auth                        2.40.2
google-cloud-core                  2.4.3
google-cloud-storage               3.4.0
google-crc32c                      1.7.1
google-resumable-media             2.7.2
googleapis-common-protos           1.70.0
gpt-oss                            0.0.8
graphene                           3.4.3
graphql-core                       3.2.6
graphql-relay                      3.2.0
greenlet                           3.2.3
grpcio                             1.71.0
gunicorn                           23.0.0
h11                                0.14.0
h5py                               3.13.0
harfile                            0.3.0
hf-transfer                        0.1.9
hf-xet                             1.1.7
hiredis                            3.0.0
html2text                          2025.4.15
httpcore                           1.0.6
httptools                          0.7.1
httpx                              0.27.2
httpx-sse                          0.4.3
huggingface-hub                    0.34.3
humanize                           4.11.0
hydra-core                         1.3.2
hypothesis                         6.131.0
hypothesis-graphql                 0.11.1
hypothesis-jsonschema              0.23.1
identify                           2.6.15
idna                               3.10
ijson                              3.4.0.post0
imageio                            2.37.0
importlib-metadata                 8.7.0
importlib-resources                6.5.2
inflect                            5.6.2
iniconfig                          2.0.0
interegular                        0.3.3
isoduration                        20.11.0
isort                              5.13.2
itsdangerous                       2.2.0
jinja2                             3.1.6
jiter                              0.12.0
jiwer                              3.0.5
jmespath                           1.0.1
joblib                             1.4.2
jsonargparse                       4.35.0
jsonlines                          4.0.0
jsonpointer                        3.0.0
jsonschema                         4.23.0
jsonschema-specifications          2024.10.1
junit-xml                          1.9
kaleido                            0.2.1
kiwisolver                         1.4.7
kornia                             0.8.1
kornia-rs                          0.1.9
lark                               1.2.2
lazy-loader                        0.4
libnacl                            2.1.0
librosa                            0.10.2.post1
lightly                            1.5.20
lightly-utils                      0.0.2
lightning                          2.5.1.post0
lightning-utilities                0.14.3
llguidance                         1.3.0
llvmlite                           0.44.0
lm-eval                            0.4.9.1
lm-format-enforcer                 0.11.3
loguru                             0.7.3
lxml                               5.3.0
mako                               1.3.10
markdown                           3.8.2
markdown-it-py                     3.0.0
markupsafe                         3.0.1
matplotlib                         3.9.2
mbstrdecoder                       1.1.3
mcp                                1.24.0
mdurl                              0.1.2
mistral-common                     1.8.5
mlflow                             2.22.0
mlflow-skinny                      2.22.0
model-hosting-container-standards  0.1.11
more-itertools                     10.5.0
mpmath                             1.3.0
msgpack                            1.1.0
msgspec                            0.20.0
mteb                               2.1.2
multidict                          6.1.0
multiprocess                       0.70.16
munch                              4.0.0
mypy-extensions                    1.0.0
networkx                           3.2.1
ninja                              1.13.0
nltk                               3.9.1
nodeenv                            1.9.1
num2words                          0.5.14
numba                              0.61.2
numexpr                            2.10.1
numpy                              1.26.4
nvidia-cublas-cu12                 12.9.1.4
nvidia-cuda-cupti-cu12             12.9.79
nvidia-cuda-nvrtc-cu12             12.9.86
nvidia-cuda-runtime-cu12           12.9.79
nvidia-cudnn-cu12                  9.10.2.21
nvidia-cudnn-frontend              1.16.0
nvidia-cufft-cu12                  11.4.1.4
nvidia-cufile-cu12                 1.14.1.1
nvidia-curand-cu12                 10.3.10.19
nvidia-cusolver-cu12               11.7.5.82
nvidia-cusparse-cu12               12.5.10.65
nvidia-cusparselt-cu12             0.7.1
nvidia-cutlass-dsl                 4.3.3
nvidia-ml-py                       13.590.44
nvidia-nccl-cu12                   2.27.5
nvidia-nvjitlink-cu12              12.9.86
nvidia-nvshmem-cu12                3.3.20
nvidia-nvtx-cu12                   12.9.79
omegaconf                          2.3.0
open-clip-torch                    2.32.0
openai                             2.11.0
openai-harmony                     0.0.4
opencensus                         0.11.4
opencensus-context                 0.1.3
opencv-python-headless             4.11.0.86
opentelemetry-api                  1.35.0
opentelemetry-exporter-prometheus  0.56b0
opentelemetry-proto                1.36.0
opentelemetry-sdk                  1.35.0
opentelemetry-semantic-conventions 0.56b0
outlines-core                      0.2.11
packaging                          24.2
pandas                             2.2.3
partial-json-parser                0.2.1.1.post7
pathspec                           0.12.1
pathvalidate                       3.2.1
patsy                              1.0.1
peft                               0.16.0
pillow                             10.4.0
pip                                25.3
platformdirs                       4.3.6
plotly                             5.24.1
pluggy                             1.5.0
polars                             1.29.0
pooch                              1.8.2
portalocker                        2.10.1
pplx-kernels                       0.0.1
pqdm                               0.2.0
pre-commit                         4.0.1
pretrainedmodels                   0.7.4
prometheus-client                  0.22.0
prometheus-fastapi-instrumentator  7.1.0
propcache                          0.2.0
proto-plus                         1.26.1
protobuf                           5.28.3
psutil                             6.1.0
py                                 1.11.0
py-cpuinfo                         9.0.0
py-spy                             0.4.0
pyarrow                            18.0.0
pyasn1                             0.6.1
pyasn1-modules                     0.4.2
pybase64                           1.4.3
pybind11                           2.13.6
pycocotools                        2.0.8
pycountry                          24.6.1
pycparser                          2.22
pycryptodomex                      3.22.0
pydantic                           2.12.0
pydantic-core                      2.41.1
pydantic-extra-types               2.10.5
pydantic-settings                  2.12.0
pygments                           2.18.0
pyjwt                              2.10.1
pyogrio                            0.11.0
pyparsing                          3.2.0
pyproj                             3.7.1
pyrate-limiter                     3.7.0
pystemmer                          3.0.0
pytablewriter                      1.2.0
pytest                             8.3.5
pytest-asyncio                     0.24.0
pytest-cov                         6.3.0
pytest-forked                      1.6.0
pytest-mock                        3.14.0
pytest-rerunfailures               14.0
pytest-shard                       0.1.2
pytest-subtests                    0.14.1
pytest-timeout                     2.3.1
python-box                         7.3.2
python-dateutil                    2.9.0.post0
python-dotenv                      1.2.1
python-json-logger                 4.0.0
python-multipart                   0.0.20
python-rapidjson                   1.20
pytorch-lightning                  2.5.2
pytrec-eval-terrier                0.5.7
pytz                               2024.2
pyyaml                             6.0.2
pyzmq                              27.1.0
rapidfuzz                          3.12.1
rasterio                           1.4.3
ray                                2.48.0
redis                              5.2.0
referencing                        0.35.1
regex                              2024.9.11
requests                           2.32.3
responses                          0.25.3
rfc3339-validator                  0.1.4
rfc3987                            1.3.8
rich                               13.9.4
rich-toolkit                       0.17.0
rignore                            0.7.6
rioxarray                          0.19.0
rouge-score                        0.1.2
rpds-py                            0.20.1
rsa                                4.9.1
rtree                              1.4.0
runai-model-streamer               0.15.3
runai-model-streamer-gcs           0.15.3
runai-model-streamer-s3            0.15.3
s3transfer                         0.10.3
sacrebleu                          2.4.3
safetensors                        0.4.5
schemathesis                       3.39.15
scikit-image                       0.25.2
scikit-learn                       1.5.2
scipy                              1.13.1
segmentation-models-pytorch        0.4.0
sentence-transformers              3.2.1
sentencepiece                      0.2.1
sentry-sdk                         2.47.0
setproctitle                       1.3.7
setuptools                         77.0.3
shapely                            2.1.1
shellingham                        1.5.4
six                                1.16.0
smart-open                         7.1.0
smmap                              5.0.2
sniffio                            1.3.1
sortedcontainers                   2.4.0
soundfile                          0.12.1
soxr                               0.5.0.post1
sqlalchemy                         2.0.41
sqlitedict                         2.1.0
sqlparse                           0.5.3
sse-starlette                      3.0.3
starlette                          0.46.2
starlette-testclient               0.4.1
statsmodels                        0.14.4
structlog                          25.4.0
supervisor                         4.3.0
sympy                              1.13.3
tabledata                          1.3.3
tabulate                           0.9.0
tblib                              3.1.0
tcolorpy                           0.1.6
tenacity                           9.1.2
tensorboardx                       2.6.4
tensorizer                         2.10.1
termcolor                          3.1.0
terratorch                         1.0.2
threadpoolctl                      3.5.0
tifffile                           2025.3.30
tiktoken                           0.12.0
timm                               1.0.17
tokenizers                         0.22.0
tomli                              2.2.1
tomli-w                            1.2.0
torch                              2.9.0+cu129
torchao                            0.14.1
torchaudio                         2.9.0+cu129
torchgeo                           0.7.0
torchmetrics                       1.7.4
torchvision                        0.24.0+cu129
tqdm                               4.66.6
tqdm-multiprocess                  0.0.11
transformers                       4.57.3
transformers-stream-generator      0.0.5
triton                             3.5.0
tritonclient                       2.51.0
typepy                             1.3.2
typer                              0.15.2
types-python-dateutil              2.9.0.20241206
typeshed-client                    2.8.2
typing-extensions                  4.15.0
typing-inspection                  0.4.2
tzdata                             2024.2
uri-template                       1.3.0
urllib3                            2.2.3
uv                                 0.9.17
uvicorn                            0.35.0
uvloop                             0.22.1
vector-quantize-pytorch            1.21.2
virtualenv                         20.31.2
vllm                               0.13.0rc2.dev105+gfdc135d76
vllm-test-utils                    0.1                         /vllm-workspace/tests/vllm_test_utils
vocos                              0.1.0
watchfiles                         1.1.1
wcwidth                            0.2.13
webcolors                          24.11.1
websockets                         15.0.1
werkzeug                           3.1.3
word2number                        1.1
wrapt                              1.17.2
xarray                             2025.7.1
xgrammar                           0.1.27
xxhash                             3.5.0
yarl                               1.17.1
zipp                               3.23.0
zstandard                          0.23.0

cc @jainapurva
