
[Bug] DeepSeek-V4 compressed attention backend: no SM120 fallback for Lightning Indexer #23657

@rs-ipps

Description


Checklist

  • I searched related issues but found no solution.
  • The bug persists in the latest version.
  • Issues without environment info and a minimal reproducible demo are hard to resolve and may receive no feedback.
  • If this is not a bug report but a general question, please start a discussion at https://github.com/sgl-project/sglang/discussions. Otherwise, it will be closed.
  • Please use English. Otherwise, it will be closed.

Describe the bug

Summary

SGLang's compressed attention backend for DeepSeek-V4 calls get_paged_mqa_logits_metadata directly from DeepGEMM without a fallback path. On SM120 (RTX Pro 6000 Blackwell, RTX 5090), this kernel has no implementation in DeepGEMM, so V4 cannot complete a forward pass.

In contrast, SGLang's mHC implementation (layers/mhc.py) routes through TileLang and works correctly on SM120. If a similar TileLang or Triton path were available for the paged MQA logits metadata, V4 would be fully functional on SM120.

Behavior

  • Server starts successfully, loads the model (88.05 GiB per GPU), and accepts HTTP requests
  • On the first forward pass: RuntimeError: Assertion error (csrc/apis/attention.hpp:211): Unsupported architecture

Full stack trace:

sglang/srt/layers/attention/deepseek_v4_backend_radix.py:491 in init_forward_metadata_prefill
→ sglang/srt/layers/attention/deepseek_v4_backend_radix.py:410 in init_forward_metadata_indexer
→ sglang/srt/layers/attention/compressed/metadata.py:142 in __post_init__
→ self.deep_gemm_metadata = get_paged_mqa_logits_metadata(...)
RuntimeError: Assertion error (csrc/apis/attention.hpp:211): Unsupported architecture

Request

Would the team consider adding an SM120 fallback path (TileLang or Triton) for get_paged_mqa_logits_metadata / paged_mqa_logits in the compressed attention backend, similar to how mHC is handled via TileLang? This would unblock V4 on all consumer/workstation Blackwell GPUs.
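To make the request concrete, here is a minimal sketch of the kind of capability-based dispatch this could use. Apart from get_paged_mqa_logits_metadata (the name from the traceback), every function and parameter below is an illustrative placeholder, not a real SGLang or DeepGEMM API:

```python
# Hypothetical sketch only: the helper names and the sm_version parameter are
# assumptions for illustration, not actual SGLang/DeepGEMM interfaces.

def _deepgemm_metadata(*args):
    # Stands in for DeepGEMM's CUDA kernel, which currently raises
    # "Unsupported architecture" on SM120.
    raise RuntimeError("Assertion error: Unsupported architecture")

def _portable_metadata(*args):
    # Stands in for a Triton/TileLang reimplementation, analogous to the
    # TileLang path already used for mHC in layers/mhc.py.
    return {"backend": "triton"}

def get_paged_mqa_logits_metadata(*args, sm_version=120):
    """Dispatch to DeepGEMM where supported, else to a portable fallback."""
    if sm_version in (90, 100):  # architectures assumed covered by DeepGEMM
        return _deepgemm_metadata(*args)
    # SM120 (and anything else DeepGEMM lacks) takes the fallback path
    # instead of asserting.
    return _portable_metadata(*args)
```

With dispatch like this, an SM120 GPU would reach the fallback kernel rather than failing on the first forward pass.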

Upstream DeepGEMM issue: https://github.com/deepseek-ai/DeepGEMM/issues/317

Reproduction

SGLANG_DISABLE_DEEP_GEMM=1 \
python -m sglang.launch_server \
  --model-path /models/DeepSeek-V4-Flash \
  --tp 2 --trust-remote-code \
  --fp8-gemm-backend triton --moe-runner-backend triton \
  --kv-cache-dtype fp8_e4m3 \
  --mem-fraction-static 0.96 --context-length 65536 \
  --disable-cuda-graph --max-running-requests 8 \
  --reasoning-parser deepseek-v4 --tool-call-parser deepseekv4

Setting --nsa-prefill-backend tilelang and --nsa-decode-backend tilelang has no effect, because V4 auto-selects the compressed attention backend rather than the nsa backend.

Environment

  • GPU: 2× NVIDIA RTX Pro 6000 Blackwell (96 GB, SM120)
  • Image: lmsysorg/sglang:deepseek-v4-blackwell (container CUDA 12.9.1)
  • Model: deepseek-ai/DeepSeek-V4-Flash
