### Checklist

### Describe the bug

**Summary**
SGLang's compressed attention backend for DeepSeek-V4 calls `get_paged_mqa_logits_metadata` directly from DeepGEMM without a fallback path. On SM120 (RTX Pro 6000 Blackwell, RTX 5090), this kernel has no implementation in DeepGEMM, so V4 cannot complete a forward pass.

In contrast, SGLang's mHC implementation (`layers/mhc.py`) routes through TileLang and works correctly on SM120. If a similar TileLang or Triton path were available for the paged MQA logits metadata, V4 would be fully functional on SM120.
**Behavior**

- Server starts successfully, loads the model (88.05 GiB per GPU), and accepts HTTP requests
- On the first forward pass:

```
RuntimeError: Assertion error (csrc/apis/attention.hpp:211): Unsupported architecture
```

Full stack trace:

```
sglang/srt/layers/attention/deepseek_v4_backend_radix.py:491 in init_forward_metadata_prefill
→ sglang/srt/layers/attention/deepseek_v4_backend_radix.py:410 in init_forward_metadata_indexer
→ sglang/srt/layers/attention/compressed/metadata.py:142 in __post_init__
→ self.deep_gemm_metadata = get_paged_mqa_logits_metadata(...)
RuntimeError: Assertion error (csrc/apis/attention.hpp:211): Unsupported architecture
```
**Request**

Would the team consider adding an SM120 fallback path (TileLang or Triton) for `get_paged_mqa_logits_metadata` / `paged_mqa_logits` in the compressed attention backend, similar to how mHC is handled via TileLang? This would unblock V4 on all consumer/workstation Blackwell GPUs.

Upstream DeepGEMM issue: https://github.com/deepseek-ai/DeepGEMM/issues/317
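To illustrate the shape of the fallback being requested, here is a minimal, hypothetical dispatch sketch. The function names (`select_mqa_metadata_backend`, `get_paged_mqa_logits_metadata_dispatched`) and the `"tilelang"` return value are assumptions for illustration, not existing SGLang or DeepGEMM APIs; only `get_paged_mqa_logits_metadata` itself is the real DeepGEMM entry point referenced above.

```python
# Hypothetical sketch: route the paged MQA logits metadata kernel by SM
# compute capability instead of calling DeepGEMM unconditionally.
# All names here are illustrative, not real SGLang APIs.

def select_mqa_metadata_backend(major: int, minor: int) -> str:
    """Pick a backend for the paged MQA logits metadata kernel by SM version."""
    if (major, minor) >= (12, 0):
        # SM120 (RTX Pro 6000 Blackwell, RTX 5090): DeepGEMM has no kernel,
        # so route to a TileLang/Triton implementation (the ask in this issue).
        return "tilelang"
    # Architectures where the DeepGEMM kernel is implemented today.
    return "deepgemm"


def get_paged_mqa_logits_metadata_dispatched(*args, **kwargs):
    """Hypothetical wrapper around the DeepGEMM call in metadata.py."""
    import torch

    backend = select_mqa_metadata_backend(*torch.cuda.get_device_capability())
    if backend == "deepgemm":
        from deep_gemm import get_paged_mqa_logits_metadata
        return get_paged_mqa_logits_metadata(*args, **kwargs)
    # Placeholder: the TileLang/Triton fallback requested here does not exist yet.
    raise NotImplementedError("SM120 TileLang/Triton fallback not implemented")
```

This mirrors how the mHC path avoids the DeepGEMM dependency on SM120: the capability check happens once, before the kernel call, rather than letting DeepGEMM hit its `Unsupported architecture` assertion.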
### Reproduction

```shell
SGLANG_DISABLE_DEEP_GEMM=1 \
python -m sglang.launch_server \
  --model-path /models/DeepSeek-V4-Flash \
  --tp 2 --trust-remote-code \
  --fp8-gemm-backend triton --moe-runner-backend triton \
  --kv-cache-dtype fp8_e4m3 \
  --mem-fraction-static 0.96 --context-length 65536 \
  --disable-cuda-graph --max-running-requests 8 \
  --reasoning-parser deepseek-v4 --tool-call-parser deepseekv4
```
Setting `--nsa-prefill-backend tilelang` and `--nsa-decode-backend tilelang` has no effect because V4 auto-selects the compressed attention backend, not `nsa`.
### Environment

- GPU: 2× NVIDIA RTX Pro 6000 Blackwell (96 GB, SM120)
- Image: `lmsysorg/sglang:deepseek-v4-blackwell` (container CUDA 12.9.1)
- Model: `deepseek-ai/DeepSeek-V4-Flash`