
[Bug] DeepSeek-V4 compressed attention backend: no SM120 fallback for Lightning Indexer #23657

@rs-ipps

Description


Checklist

  • I searched related issues but found no solution.
  • The bug persists in the latest version.
  • Issues without environment info and a minimal reproducible demo are hard to resolve and may receive no feedback.
  • If this is not a bug report but a general question, please start a discussion at https://github.com/sgl-project/sglang/discussions. Otherwise, it will be closed.
  • Please use English. Otherwise, it will be closed.

Describe the bug

Summary

SGLang's compressed attention backend for DeepSeek-V4 calls get_paged_mqa_logits_metadata directly from DeepGEMM without a fallback path. On SM120 (RTX Pro 6000 Blackwell, RTX 5090), this kernel has no implementation in DeepGEMM, so V4 cannot complete a forward pass.

In contrast, SGLang's mHC implementation (layers/mhc.py) routes through TileLang and works correctly on SM120. If a similar TileLang or Triton path were available for the paged MQA logits metadata, V4 would be fully functional on SM120.

Behavior

  • Server starts successfully, loads the model (88.05 GiB per GPU), and accepts HTTP requests
  • On the first forward pass: RuntimeError: Assertion error (csrc/apis/attention.hpp:211): Unsupported architecture

Full stack trace:

sglang/srt/layers/attention/deepseek_v4_backend_radix.py:491 in init_forward_metadata_prefill
→ sglang/srt/layers/attention/deepseek_v4_backend_radix.py:410 in init_forward_metadata_indexer
→ sglang/srt/layers/attention/compressed/metadata.py:142 in __post_init__
→ self.deep_gemm_metadata = get_paged_mqa_logits_metadata(...)
RuntimeError: Assertion error (csrc/apis/attention.hpp:211): Unsupported architecture

Request

Would the team consider adding an SM120 fallback path (TileLang or Triton) for get_paged_mqa_logits_metadata / paged_mqa_logits in the compressed attention backend, similar to how mHC is handled via TileLang? This would unblock V4 on all consumer/workstation Blackwell GPUs.
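To make the request concrete, here is a minimal sketch of the kind of capability-based dispatch this could use. Apart from get_paged_mqa_logits_metadata (the name from the traceback), every function and parameter below is an illustrative placeholder, not a real SGLang or DeepGEMM API:

```python
# Hypothetical sketch only: the helper names and the sm_version parameter are
# assumptions for illustration, not actual SGLang/DeepGEMM interfaces.

def _deepgemm_metadata(*args):
    # Stands in for DeepGEMM's CUDA kernel, which currently raises
    # "Unsupported architecture" on SM120.
    raise RuntimeError("Assertion error: Unsupported architecture")

def _portable_metadata(*args):
    # Stands in for a Triton/TileLang reimplementation, analogous to the
    # TileLang path already used for mHC in layers/mhc.py.
    return {"backend": "triton"}

def get_paged_mqa_logits_metadata(*args, sm_version=120):
    """Dispatch to DeepGEMM where supported, else to a portable fallback."""
    if sm_version in (90, 100):  # architectures assumed covered by DeepGEMM
        return _deepgemm_metadata(*args)
    # SM120 (and anything else DeepGEMM lacks) takes the fallback path
    # instead of asserting.
    return _portable_metadata(*args)
```

With dispatch like this, an SM120 GPU would reach the fallback kernel rather than failing on the first forward pass.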

Upstream DeepGEMM issue: https://github.com/deepseek-ai/DeepGEMM/issues/317

Reproduction

SGLANG_DISABLE_DEEP_GEMM=1 \
python -m sglang.launch_server \
  --model-path /models/DeepSeek-V4-Flash \
  --tp 2 --trust-remote-code \
  --fp8-gemm-backend triton --moe-runner-backend triton \
  --kv-cache-dtype fp8_e4m3 \
  --mem-fraction-static 0.96 --context-length 65536 \
  --disable-cuda-graph --max-running-requests 8 \
  --reasoning-parser deepseek-v4 --tool-call-parser deepseekv4

Setting --nsa-prefill-backend tilelang and --nsa-decode-backend tilelang has no effect, because V4 auto-selects the compressed attention backend rather than the nsa backend.

Environment

  • GPU: 2× NVIDIA RTX Pro 6000 Blackwell (96 GB, SM120)
  • Image: lmsysorg/sglang:deepseek-v4-blackwell (container CUDA 12.9.1)
  • Model: deepseek-ai/DeepSeek-V4-Flash
