
[BUG] [Python DSL] BlockScaledMmaOp restricts FP4 operations to sm_100a only, blocks sm_120/sm_121 #2800

@huangyucbr-hub

Description


Which component has the problem?

CuTe DSL

Bug Report

Summary

CUTLASS 4.2+ added SM120 and SM121 kernel support for Blackwell GeForce (RTX 50-series) and DGX Spark (GB10) GPUs, according to the CUTLASS 4.2.1 changelog (https://docs.nvidia.com/cutlass/4.2.1/CHANGELOG.html). However, the Python DSL BlockScaledMmaOp class restricts FP4 operations to sm_100a only, preventing use on sm_120 and sm_121 hardware.

Environment

  • Hardware: NVIDIA DGX Spark GB10 (Compute Capability 12.1, sm_121)
  • Package: nvidia-cutlass-dsl version 4.3.0 (latest from PyPI)
  • Python: 3.13
  • CUDA: 13.0.1
  • PyTorch: 2.10.0.dev20251118+cu130

Bug Location

File: python/CuTeDSL/cutlass/cute/nvgpu/tcgen05/mma.py
Class: BlockScaledMmaOp
Lines: 303-305

@dataclass(frozen=True)
class BlockScaledMmaOp(Tcgen05MmaOp):
    # ... other fields ...

    admissible_archs = [
        Arch.sm_100a,  # ← only sm_100a allowed
    ]

    def __post_init__(self) -> None:
        arch = CuTeDSL._get_dsl().get_arch_enum()
        if arch not in self.admissible_archs:
            raise OpError(
                self,
                f"expects arch to be one of {self.admissible_archs}, but got {arch}",
                suggestion="Ensure env CUTE_DSL_ARCH matches your GPU architecture",
            )
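For readers without Blackwell hardware, the gating behavior can be reproduced with a minimal self-contained mock. The `Arch` enum and the class below are stand-ins for the real cutlass types (with `ValueError` standing in for `OpError`), not the installed package:

```python
from dataclasses import dataclass
from enum import Enum


class Arch(Enum):  # stand-in for the real cutlass Arch enum
    sm_100a = "sm_100a"
    sm_121a = "sm_121a"


@dataclass(frozen=True)
class MockBlockScaledMmaOp:
    arch: Arch
    # Unannotated, so dataclass treats it as a class attribute, not a field,
    # mirroring the allow-list in the snippet above.
    admissible_archs = [Arch.sm_100a]

    def __post_init__(self) -> None:
        if self.arch not in self.admissible_archs:
            raise ValueError(
                f"expects arch to be one of {self.admissible_archs}, "
                f"but got {self.arch}"
            )


MockBlockScaledMmaOp(arch=Arch.sm_100a)  # accepted
try:
    MockBlockScaledMmaOp(arch=Arch.sm_121a)  # rejected, as on GB10
except ValueError as e:
    print(e)
```

The check runs in `__post_init__`, so construction itself raises; there is no way to bypass it from user code short of patching the class attribute.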

Reproduction

Minimal Test Case

On sm_121 (GB10) or sm_120 (RTX 5090) hardware:

from cutlass import Float8E8M0FNU, Arch
from cutlass.cute.nvgpu.tcgen05 import MmaMXF4NVF4Op, CtaGroup, OperandSource

mma_op = MmaMXF4NVF4Op(
    sf_dtype=Float8E8M0FNU,
    instruction_shape=(16, 16, 64),
    cta_group=CtaGroup.ONE,
    a_src=OperandSource.TMEM,
)

Error Output

OpError: expects arch to be one of [Arch.sm_100a], but got Arch.sm_121a

Traceback (most recent call last):
  File "test_fp4.py", line 6, in <module>
    mma_op = MmaMXF4NVF4Op(...)
  File ".../mma.py", line 311, in __post_init__
    raise OpError(
        self,
        f"expects arch to be one of {self.admissible_archs}, but got {arch}",
        suggestion="Ensure env CUTE_DSL_ARCH matches your GPU architecture",
    )

Expected Behavior

Based on CUTLASS 4.2 changelog:
"Support for Blackwell SM121 kernels for DGX Spark GPUs. Share the major codes with Blackwell SM120 kernels."

The Python DSL should allow FP4 operations on sm_120 (RTX 5090) and sm_121 (GB10) architectures, consistent with C++ API support.

Proposed Fix

admissible_archs = [
    Arch.sm_100a,  # B200/B100 datacenter Blackwell
    Arch.sm_120a,  # RTX 5090 GeForce Blackwell (compute 12.0)
    Arch.sm_121a,  # GB10 DGX Spark Blackwell (compute 12.1)
]

Validation

  • ✅ Hardware verified: GB10 has 5th-generation Tensor Cores with FP4 support (1 PFLOPS peak)
  • ✅ C++ API works: vLLM successfully uses CUTLASS FP4 on sm_120 via the C++ API (vllm-project/vllm#21309, "Support CUTLASS NVFP4 (w4a4) for Blackwell GeForce GPUs (SM120)")
  • ✅ Patch tested: Applying the proposed fix eliminates the architecture error on GB10
  • ⚠️ Kernel availability: Pre-compiled sm_121 kernels may still be unavailable (separate issue)

Impact

Affected Users:

  • ❌ All RTX 5090 users (sm_120)
  • ❌ All DGX Spark GB10 users (sm_121)
  • ❌ Potentially RTX 5080/5070/5060 users (also sm_120)

Workaround:
Users can manually patch the installed package, but this:

  • Requires editing system packages (not ideal)
  • Gets overwritten on package upgrades
  • Isn't discoverable to most users
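An in-process alternative to editing files on disk is a monkey-patch applied before constructing any ops. The sketch below is self-contained: the `Arch` enum and `BlockScaledMmaOp` class are mocks mirroring the gating logic quoted above, not the installed package. Against the real package, the same assignment would target the class in cutlass.cute.nvgpu.tcgen05.mma (untested here):

```python
from dataclasses import dataclass
from enum import Enum


class Arch(Enum):  # stand-in for the real cutlass Arch enum
    sm_100a = "sm_100a"
    sm_120a = "sm_120a"
    sm_121a = "sm_121a"


@dataclass(frozen=True)
class BlockScaledMmaOp:  # stand-in mirroring the gating logic above
    arch: Arch
    # Class-level attribute: instances are frozen, but the class object
    # itself can still be given a new allow-list at runtime.
    admissible_archs = [Arch.sm_100a]

    def __post_init__(self) -> None:
        if self.arch not in self.admissible_archs:
            raise ValueError(f"arch {self.arch} not admissible")


# Apply the patch before constructing any ops.
BlockScaledMmaOp.admissible_archs = [Arch.sm_100a, Arch.sm_120a, Arch.sm_121a]

op = BlockScaledMmaOp(arch=Arch.sm_121a)  # no longer raises
print(op.arch.value)
```

Because the allow-list is a class attribute rather than a frozen instance field, the assignment succeeds despite `frozen=True`. The patch lasts only for the current process and avoids editing site-packages, though it still relies on an internal attribute that may change between releases.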

Additional Context

Related Work

  • vLLM PR #21309: Demonstrates CUTLASS NVFP4 working on sm_120 via C++ API
  • cuBLAS 12.9: Supports FP4 block-scaled operations on Blackwell

Hardware Specifications

  • sm_100: B200/B100 datacenter GPUs
  • sm_120: RTX 5090/5080/5070/5060 consumer GPUs
  • sm_121: GB10 superchip (DGX Spark, Project DIGITS)

All have 5th-generation Tensor Cores with hardware FP4 support.

Request

Please update BlockScaledMmaOp.admissible_archs to include Arch.sm_120a and Arch.sm_121a to match the C++ API's architecture support.

This will enable the Python DSL FP4 functionality on the full Blackwell GPU family, not just datacenter variants.

Thank you for maintaining this excellent library!
