When running the vLLM benchmark of pytorch/gemma-3-12b-it-FP8 on B200, I noticed that the benchmark started failing with fbgemm-gpu-genai==1.4.2: https://github.com/pytorch/pytorch-integration-testing/actions/runs/20177479851/job/57929181213#step:19:2183. The error when running `vllm serve pytorch/gemma-3-12b-it-FP8` is as follows:
TMA benchmarks will be running without grid constant TMA descriptor.
(APIServer pid=7998) INFO 12-13 00:56:05 [api_server.py:1351] vLLM API server version 0.13.0rc2.dev105+gfdc135d76
(APIServer pid=7998) INFO 12-13 00:56:05 [utils.py:253] non-default args: {'model_tag': 'pytorch/gemma-3-12b-it-FP8', 'model': 'pytorch/gemma-3-12b-it-FP8'}
(APIServer pid=7998) INFO 12-13 00:56:05 [model.py:514] Resolved architecture: Gemma3ForConditionalGeneration
(APIServer pid=7998) INFO 12-13 00:56:05 [model.py:1636] Using max model len 131072
(APIServer pid=7998) INFO 12-13 00:56:05 [scheduler.py:228] Chunked prefill is enabled with max_num_batched_tokens=8192.
(APIServer pid=7998) WARNING 12-13 00:56:05 [cuda.py:244] Forcing --disable_chunked_mm_input for models with multimodal-bidirectional attention.
TMA benchmarks will be running without grid constant TMA descriptor.
(EngineCore_DP0 pid=8153) INFO 12-13 00:56:13 [core.py:93] Initializing a V1 LLM engine (v0.13.0rc2.dev105+gfdc135d76) with config: model='pytorch/gemma-3-12b-it-FP8', speculative_config=None, tokenizer='pytorch/gemma-3-12b-it-FP8', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=131072, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=torchao, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False), seed=0, served_model_name=pytorch/gemma-3-12b-it-FP8, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'level': None, 'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['none'], 'splitting_ops': ['vllm::unified_attention', 'vllm::unified_attention_with_output', 'vllm::unified_mla_attention', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::kda_attention', 'vllm::sparse_attn_indexer'], 'compile_mm_encoder': False, 'compile_sizes': [], 'compile_ranges_split_points': [8192], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': 
True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480, 496, 512], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': False, 'fuse_attn_quant': False, 'eliminate_noops': True, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 512, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False}, 'local_cache_dir': None}
(EngineCore_DP0 pid=8153) INFO 12-13 00:56:15 [parallel_state.py:1203] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://10.0.31.154:56647 backend=nccl
(EngineCore_DP0 pid=8153) INFO 12-13 00:56:15 [parallel_state.py:1411] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank 0
(EngineCore_DP0 pid=8153) Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.52, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.
(EngineCore_DP0 pid=8153) INFO 12-13 00:56:22 [gpu_model_runner.py:3562] Starting to load model pytorch/gemma-3-12b-it-FP8...
(EngineCore_DP0 pid=8153) /usr/local/lib/python3.12/dist-packages/torch/__init__.py:1617: UserWarning: Please use the new API settings to control TF32 behavior, such as torch.backends.cudnn.conv.fp32_precision = 'tf32' or torch.backends.cuda.matmul.fp32_precision = 'ieee'. Old settings, e.g, torch.backends.cuda.matmul.allow_tf32 = True, torch.backends.cudnn.allow_tf32 = True, allowTF32CuDNN() and allowTF32CuBLAS() will be deprecated after Pytorch 2.9. Please see https://pytorch.org/docs/main/notes/cuda.html#tensorfloat-32-tf32-on-ampere-and-later-devices (Triggered internally at /pytorch/aten/src/ATen/Context.cpp:80.)
(EngineCore_DP0 pid=8153) _C._set_float32_matmul_precision(precision)
(EngineCore_DP0 pid=8153) INFO 12-13 00:56:24 [layer.py:537] Using AttentionBackendEnum.FLASH_ATTN for MultiHeadAttention in multimodal encoder.
(EngineCore_DP0 pid=8153) INFO 12-13 00:56:26 [cuda.py:412] Using FLEX_ATTENTION attention backend out of potential backends: ['FLEX_ATTENTION']
Loading pt checkpoint shards: 0% Completed | 0/3 [00:00<?, ?it/s]
Loading pt checkpoint shards: 33% Completed | 1/3 [00:02<00:05, 2.80s/it]
Loading pt checkpoint shards: 67% Completed | 2/3 [00:05<00:02, 2.73s/it]
Loading pt checkpoint shards: 100% Completed | 3/3 [00:07<00:00, 2.25s/it]
Loading pt checkpoint shards: 100% Completed | 3/3 [00:07<00:00, 2.39s/it]
(EngineCore_DP0 pid=8153)
(EngineCore_DP0 pid=8153) INFO 12-13 00:56:34 [default_loader.py:308] Loading weights took 7.16 seconds
(EngineCore_DP0 pid=8153) INFO 12-13 00:56:34 [gpu_model_runner.py:3659] Model loading took 12.8828 GiB memory and 11.215734 seconds
(EngineCore_DP0 pid=8153) INFO 12-13 00:56:34 [gpu_model_runner.py:4446] Encoder cache will be initialized with a budget of 8192 tokens, and profiled with 31 image items of the maximum feature size.
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866] EngineCore failed to start.
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866] Traceback (most recent call last):
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 857, in run_engine_core
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866] engine_core = EngineCoreProc(*args, **kwargs)
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 637, in __init__
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866] super().__init__(
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 109, in __init__
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866] num_gpu_blocks, num_cpu_blocks, kv_cache_config = self._initialize_kv_caches(
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866] ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 240, in _initialize_kv_caches
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866] available_gpu_memory = self.model_executor.determine_available_memory()
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/abstract.py", line 126, in determine_available_memory
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866] return self.collective_rpc("determine_available_memory")
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/uniproc_executor.py", line 75, in collective_rpc
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866] result = run_method(self.driver_worker, method, args, kwargs)
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/serial_utils.py", line 461, in run_method
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866] return func(*args, **kwargs)
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866] ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866] File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 120, in decorate_context
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866] return func(*args, **kwargs)
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866] ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 340, in determine_available_memory
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866] self.model_runner.profile_run()
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 4462, in profile_run
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866] dummy_encoder_outputs = self.model.embed_multimodal(
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/gemma3_mm.py", line 603, in embed_multimodal
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866] return self._process_image_input(image_input)
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/gemma3_mm.py", line 587, in _process_image_input
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866] image_features = self._image_pixels_to_features(
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/gemma3_mm.py", line 576, in _image_pixels_to_features
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866] return vision_tower(pixel_values)
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866] ^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866] return self._call_impl(*args, **kwargs)
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1786, in _call_impl
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866] return forward_call(*args, **kwargs)
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/siglip.py", line 856, in forward
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866] return self.vision_model(
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866] ^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866] return self._call_impl(*args, **kwargs)
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1786, in _call_impl
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866] return forward_call(*args, **kwargs)
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/siglip.py", line 754, in forward
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866] encoder_outputs = self.encoder(
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866] ^^^^^^^^^^^^^
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866] return self._call_impl(*args, **kwargs)
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1786, in _call_impl
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866] return forward_call(*args, **kwargs)
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/siglip.py", line 562, in forward
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866] hidden_states, _ = encoder_layer(hidden_states)
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866] return self._call_impl(*args, **kwargs)
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1786, in _call_impl
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866] return forward_call(*args, **kwargs)
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/siglip.py", line 511, in forward
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866] hidden_states, _ = self.self_attn(hidden_states=hidden_states)
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866] return self._call_impl(*args, **kwargs)
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1786, in _call_impl
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866] return forward_call(*args, **kwargs)
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/siglip.py", line 429, in forward
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866] qkv_states, _ = self.qkv_proj(hidden_states)
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866] return self._call_impl(*args, **kwargs)
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1786, in _call_impl
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866] return forward_call(*args, **kwargs)
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/linear.py", line 565, in forward
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866] output_parallel = self.quant_method.apply(self, input_, bias)
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/quantization/torchao.py", line 348, in apply
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866] return F.linear(x, layer.weight, bias)
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866] File "/usr/local/lib/python3.12/dist-packages/torchao/utils.py", line 655, in _dispatch__torch_function__
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866] return cls._ATEN_OP_OR_TORCH_FN_TABLE[cls][func](func, types, args, kwargs)
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866] File "/usr/local/lib/python3.12/dist-packages/torchao/utils.py", line 491, in wrapper
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866] return func(f, types, args, kwargs)
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866] File "/usr/local/lib/python3.12/dist-packages/torchao/quantization/quantize_/workflows/float8/float8_tensor.py", line 300, in _
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866] res = torch.ops.fbgemm.f8f8bf16_rowwise(
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866] File "/usr/local/lib/python3.12/dist-packages/torch/_ops.py", line 1255, in __call__
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866] return self._op(*args, **kwargs)
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866] ^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866] RuntimeError: cutlass cannot initialize
The same command `vllm serve pytorch/gemma-3-12b-it-FP8` works fine with the previous version, fbgemm-gpu-genai==1.4.1. The same error also happens with pytorch/gemma-3-27b-it-FP8.
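For anyone hitting the same regression in the meantime, pinning the package back to the last known-good version sidesteps it (a sketch, assuming nothing else in the image requires 1.4.2):

```shell
# Temporary workaround, not a fix: downgrade to the last version that works on B200
pip install "fbgemm-gpu-genai==1.4.1"

# Then re-run the command that previously failed
vllm serve pytorch/gemma-3-12b-it-FP8
```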
Here is the full list of pip packages:
Using Python 3.12.12 environment at: /usr
Package Version Editable project location
---------------------------------- --------------------------- -------------------------------------
absl-py 2.1.0
accelerate 1.0.1
aenum 3.1.16
affine 2.4.0
aiohappyeyeballs 2.6.1
aiohttp 3.13.0
aiohttp-cors 0.8.1
aiosignal 1.4.0
albucore 0.0.16
albumentations 1.4.6
alembic 1.16.4
annotated-doc 0.0.4
annotated-types 0.7.0
anthropic 0.71.0
antlr4-python3-runtime 4.9.3
anyio 4.6.2.post1
apache-tvm-ffi 0.1.5
arctic-inference 0.1.1
argcomplete 3.5.1
arrow 1.3.0
astor 0.8.1
attrs 24.2.0
audioread 3.0.1
backoff 2.2.1
bitsandbytes 0.46.1
black 24.10.0
blake3 1.0.8
blinker 1.9.0
blobfile 3.0.0
bm25s 0.2.13
boto3 1.35.57
botocore 1.35.57
bounded-pool-executor 0.0.3
buildkite-test-collector 0.1.9
cachetools 5.5.2
cbor2 5.7.1
certifi 2024.8.30
cffi 1.17.1
cfgv 3.5.0
chardet 5.2.0
charset-normalizer 3.4.0
chz 0.3.0
click 8.1.7
click-plugins 1.1.1.2
cligj 0.7.2
cloudpickle 3.1.1
colorama 0.4.6
colorful 0.5.6
compressed-tensors 0.12.2
contourpy 1.3.0
coverage 7.10.6
cramjam 2.9.0
cryptography 46.0.3
cuda-bindings 13.1.1
cuda-pathfinder 1.3.3
cuda-python 13.1.1
cupy-cuda12x 13.6.0
cycler 0.12.1
databricks-sdk 0.59.0
datamodel-code-generator 0.26.3
dataproperty 1.0.1
datasets 3.0.2
decorator 5.1.1
decord 0.6.0
deep-ep 1.2.1+73b6ea4
deep-gemm 2.1.0+594953a
depyf 0.20.0
dill 0.3.8
diskcache 5.6.3
distlib 0.3.9
distro 1.9.0
dnspython 2.7.0
docker 7.1.0
docopt 0.6.2
docstring-parser 0.17.0
efficientnet-pytorch 0.7.1
einops 0.8.1
einx 0.3.0
email-validator 2.2.0
encodec 0.1.1
evaluate 0.4.3
fastapi 0.116.1
fastapi-cli 0.0.16
fastapi-cloud-cli 0.6.0
fastar 0.8.0
fastparquet 2024.11.0
fastrlock 0.8.2
fastsafetensors 0.1.10
fbgemm-gpu-genai 1.4.2
filelock 3.16.1
fiona 1.10.1
flashinfer-cubin 0.5.3
flashinfer-jit-cache 0.5.3+cu129
flashinfer-python 0.5.3
flask 3.1.1
fonttools 4.55.0
fqdn 1.5.1
frozendict 2.4.6
frozenlist 1.5.0
fsspec 2024.9.0
ftfy 6.3.1
genai-perf 0.0.8
genson 1.3.0
geopandas 1.0.1
gguf 0.17.1
gitdb 4.0.12
gitpython 3.1.44
google-api-core 2.24.2
google-auth 2.40.2
google-cloud-core 2.4.3
google-cloud-storage 3.4.0
google-crc32c 1.7.1
google-resumable-media 2.7.2
googleapis-common-protos 1.70.0
gpt-oss 0.0.8
graphene 3.4.3
graphql-core 3.2.6
graphql-relay 3.2.0
greenlet 3.2.3
grpcio 1.71.0
gunicorn 23.0.0
h11 0.14.0
h5py 3.13.0
harfile 0.3.0
hf-transfer 0.1.9
hf-xet 1.1.7
hiredis 3.0.0
html2text 2025.4.15
httpcore 1.0.6
httptools 0.7.1
httpx 0.27.2
httpx-sse 0.4.3
huggingface-hub 0.34.3
humanize 4.11.0
hydra-core 1.3.2
hypothesis 6.131.0
hypothesis-graphql 0.11.1
hypothesis-jsonschema 0.23.1
identify 2.6.15
idna 3.10
ijson 3.4.0.post0
imageio 2.37.0
importlib-metadata 8.7.0
importlib-resources 6.5.2
inflect 5.6.2
iniconfig 2.0.0
interegular 0.3.3
isoduration 20.11.0
isort 5.13.2
itsdangerous 2.2.0
jinja2 3.1.6
jiter 0.12.0
jiwer 3.0.5
jmespath 1.0.1
joblib 1.4.2
jsonargparse 4.35.0
jsonlines 4.0.0
jsonpointer 3.0.0
jsonschema 4.23.0
jsonschema-specifications 2024.10.1
junit-xml 1.9
kaleido 0.2.1
kiwisolver 1.4.7
kornia 0.8.1
kornia-rs 0.1.9
lark 1.2.2
lazy-loader 0.4
libnacl 2.1.0
librosa 0.10.2.post1
lightly 1.5.20
lightly-utils 0.0.2
lightning 2.5.1.post0
lightning-utilities 0.14.3
llguidance 1.3.0
llvmlite 0.44.0
lm-eval 0.4.9.1
lm-format-enforcer 0.11.3
loguru 0.7.3
lxml 5.3.0
mako 1.3.10
markdown 3.8.2
markdown-it-py 3.0.0
markupsafe 3.0.1
matplotlib 3.9.2
mbstrdecoder 1.1.3
mcp 1.24.0
mdurl 0.1.2
mistral-common 1.8.5
mlflow 2.22.0
mlflow-skinny 2.22.0
model-hosting-container-standards 0.1.11
more-itertools 10.5.0
mpmath 1.3.0
msgpack 1.1.0
msgspec 0.20.0
mteb 2.1.2
multidict 6.1.0
multiprocess 0.70.16
munch 4.0.0
mypy-extensions 1.0.0
networkx 3.2.1
ninja 1.13.0
nltk 3.9.1
nodeenv 1.9.1
num2words 0.5.14
numba 0.61.2
numexpr 2.10.1
numpy 1.26.4
nvidia-cublas-cu12 12.9.1.4
nvidia-cuda-cupti-cu12 12.9.79
nvidia-cuda-nvrtc-cu12 12.9.86
nvidia-cuda-runtime-cu12 12.9.79
nvidia-cudnn-cu12 9.10.2.21
nvidia-cudnn-frontend 1.16.0
nvidia-cufft-cu12 11.4.1.4
nvidia-cufile-cu12 1.14.1.1
nvidia-curand-cu12 10.3.10.19
nvidia-cusolver-cu12 11.7.5.82
nvidia-cusparse-cu12 12.5.10.65
nvidia-cusparselt-cu12 0.7.1
nvidia-cutlass-dsl 4.3.3
nvidia-ml-py 13.590.44
nvidia-nccl-cu12 2.27.5
nvidia-nvjitlink-cu12 12.9.86
nvidia-nvshmem-cu12 3.3.20
nvidia-nvtx-cu12 12.9.79
omegaconf 2.3.0
open-clip-torch 2.32.0
openai 2.11.0
openai-harmony 0.0.4
opencensus 0.11.4
opencensus-context 0.1.3
opencv-python-headless 4.11.0.86
opentelemetry-api 1.35.0
opentelemetry-exporter-prometheus 0.56b0
opentelemetry-proto 1.36.0
opentelemetry-sdk 1.35.0
opentelemetry-semantic-conventions 0.56b0
outlines-core 0.2.11
packaging 24.2
pandas 2.2.3
partial-json-parser 0.2.1.1.post7
pathspec 0.12.1
pathvalidate 3.2.1
patsy 1.0.1
peft 0.16.0
pillow 10.4.0
pip 25.3
platformdirs 4.3.6
plotly 5.24.1
pluggy 1.5.0
polars 1.29.0
pooch 1.8.2
portalocker 2.10.1
pplx-kernels 0.0.1
pqdm 0.2.0
pre-commit 4.0.1
pretrainedmodels 0.7.4
prometheus-client 0.22.0
prometheus-fastapi-instrumentator 7.1.0
propcache 0.2.0
proto-plus 1.26.1
protobuf 5.28.3
psutil 6.1.0
py 1.11.0
py-cpuinfo 9.0.0
py-spy 0.4.0
pyarrow 18.0.0
pyasn1 0.6.1
pyasn1-modules 0.4.2
pybase64 1.4.3
pybind11 2.13.6
pycocotools 2.0.8
pycountry 24.6.1
pycparser 2.22
pycryptodomex 3.22.0
pydantic 2.12.0
pydantic-core 2.41.1
pydantic-extra-types 2.10.5
pydantic-settings 2.12.0
pygments 2.18.0
pyjwt 2.10.1
pyogrio 0.11.0
pyparsing 3.2.0
pyproj 3.7.1
pyrate-limiter 3.7.0
pystemmer 3.0.0
pytablewriter 1.2.0
pytest 8.3.5
pytest-asyncio 0.24.0
pytest-cov 6.3.0
pytest-forked 1.6.0
pytest-mock 3.14.0
pytest-rerunfailures 14.0
pytest-shard 0.1.2
pytest-subtests 0.14.1
pytest-timeout 2.3.1
python-box 7.3.2
python-dateutil 2.9.0.post0
python-dotenv 1.2.1
python-json-logger 4.0.0
python-multipart 0.0.20
python-rapidjson 1.20
pytorch-lightning 2.5.2
pytrec-eval-terrier 0.5.7
pytz 2024.2
pyyaml 6.0.2
pyzmq 27.1.0
rapidfuzz 3.12.1
rasterio 1.4.3
ray 2.48.0
redis 5.2.0
referencing 0.35.1
regex 2024.9.11
requests 2.32.3
responses 0.25.3
rfc3339-validator 0.1.4
rfc3987 1.3.8
rich 13.9.4
rich-toolkit 0.17.0
rignore 0.7.6
rioxarray 0.19.0
rouge-score 0.1.2
rpds-py 0.20.1
rsa 4.9.1
rtree 1.4.0
runai-model-streamer 0.15.3
runai-model-streamer-gcs 0.15.3
runai-model-streamer-s3 0.15.3
s3transfer 0.10.3
sacrebleu 2.4.3
safetensors 0.4.5
schemathesis 3.39.15
scikit-image 0.25.2
scikit-learn 1.5.2
scipy 1.13.1
segmentation-models-pytorch 0.4.0
sentence-transformers 3.2.1
sentencepiece 0.2.1
sentry-sdk 2.47.0
setproctitle 1.3.7
setuptools 77.0.3
shapely 2.1.1
shellingham 1.5.4
six 1.16.0
smart-open 7.1.0
smmap 5.0.2
sniffio 1.3.1
sortedcontainers 2.4.0
soundfile 0.12.1
soxr 0.5.0.post1
sqlalchemy 2.0.41
sqlitedict 2.1.0
sqlparse 0.5.3
sse-starlette 3.0.3
starlette 0.46.2
starlette-testclient 0.4.1
statsmodels 0.14.4
structlog 25.4.0
supervisor 4.3.0
sympy 1.13.3
tabledata 1.3.3
tabulate 0.9.0
tblib 3.1.0
tcolorpy 0.1.6
tenacity 9.1.2
tensorboardx 2.6.4
tensorizer 2.10.1
termcolor 3.1.0
terratorch 1.0.2
threadpoolctl 3.5.0
tifffile 2025.3.30
tiktoken 0.12.0
timm 1.0.17
tokenizers 0.22.0
tomli 2.2.1
tomli-w 1.2.0
torch 2.9.0+cu129
torchao 0.14.1
torchaudio 2.9.0+cu129
torchgeo 0.7.0
torchmetrics 1.7.4
torchvision 0.24.0+cu129
tqdm 4.66.6
tqdm-multiprocess 0.0.11
transformers 4.57.3
transformers-stream-generator 0.0.5
triton 3.5.0
tritonclient 2.51.0
typepy 1.3.2
typer 0.15.2
types-python-dateutil 2.9.0.20241206
typeshed-client 2.8.2
typing-extensions 4.15.0
typing-inspection 0.4.2
tzdata 2024.2
uri-template 1.3.0
urllib3 2.2.3
uv 0.9.17
uvicorn 0.35.0
uvloop 0.22.1
vector-quantize-pytorch 1.21.2
virtualenv 20.31.2
vllm 0.13.0rc2.dev105+gfdc135d76
vllm-test-utils 0.1 /vllm-workspace/tests/vllm_test_utils
vocos 0.1.0
watchfiles 1.1.1
wcwidth 0.2.13
webcolors 24.11.1
websockets 15.0.1
werkzeug 3.1.3
word2number 1.1
wrapt 1.17.2
xarray 2025.7.1
xgrammar 0.1.27
xxhash 3.5.0
yarl 1.17.1
zipp 3.23.0
zstandard 0.23.0
cc @jainapurva