
[Bug] Qwen3.6-27B-FP8 (dense): FP8 weight_scale_inv silently dropped → garbage output (gate_gate_up_proj loop bug in qwen3_5.py) #23687

@gucasbrg

Description


Checklist

  • I searched related issues but found no solution.
  • The bug persists in the latest version.
  • Issues without environment info and a minimal reproducible demo are hard to resolve and may receive no feedback.
  • If this is not a bug report but a general question, please start a discussion at https://github.com/sgl-project/sglang/discussions. Otherwise, it will be closed.
  • Please use English. Otherwise, it will be closed.

Describe the bug

Loading Qwen/Qwen3.6-27B-FP8 (dense) with the official command from the model card produces pure garbage output (token salad, repeated single-char tokens, no coherent text).

The same SGLang version + same hardware + same flags loads Qwen/Qwen3.5-27B-FP8 (also dense, also FP8 block-128) cleanly with 0 loader warnings and works correctly. So the SGLang env is fine — only the 3.6 checkpoint triggers the bug.

Symptom 1 — 256 loader warnings during weight load

Two warnings per dense MLP × 64 layers × TP=2 = 256 total:

Parameter model.layers.14.mlp.gate_gate_up_proj.weight_scale_inv not found in params_dict
Parameter model.layers.14.mlp.gate_up_proj.weight_scale_inv not found in params_dict

Note gate_gate_up_proj (double gate_) — strong hint at a string-mutation bug in the loader.

Symptom 2 — garbage generation after server is "ready"

prompt:  "一句话介绍你自己" ("introduce yourself in one sentence")  (max_tokens=100, enable_thinking=False)
output:  " **:**落花ingoELITgetImage检查工作栩-ca促促促促促促促促/prom/prom/prom Prom/prom..."

prompt:  "17*23 等于多少?" ("what does 17*23 equal?")  (enable_thinking=True)
output:  reasoning_content non-empty, content="" — no '391' anywhere

prompt:  tool_call (Beijing weather), enable_thinking=False
output:  no tool_calls; content="atherscobrauda之急急急急急急急急..."

prompt:  8K-token long-context anchor-word recall
output:  "­t Hornyalph�arkingamarca有福裾本站..."  — fails to recall anchor

What's different between Qwen3.5-27B-FP8 (works) and Qwen3.6-27B-FP8 (broken)

|  | 3.5-27B-FP8 (0 warnings, works) | 3.6-27B-FP8 (256 warnings, garbage) |
|---|---|---|
| sharding | standard model-NNNNN-of-MMMMM.safetensors | per-layer layers-N.safetensors |
| MTP heads in checkpoint | none | mtp.layers.0.* |
| linear_attn keys | in_proj_a / in_proj_b | new fused in_proj_ba + A_log + dt_bias + norm |

Direct safetensors inspection confirms 3.6 has separate mlp.gate_proj.weight_scale_inv + mlp.up_proj.weight_scale_inv for every dense MLP layer (no fused gate_up_proj in checkpoint, by design).
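For reference, the inspection above can be reproduced with nothing but the stdlib: a safetensors shard is an 8-byte little-endian header length followed by a JSON header that maps tensor names to metadata, so listing keys never requires loading tensor data. This is a minimal sketch that builds a fake in-memory shard (key names mirror the layer-14 warnings above; shapes/offsets are placeholders) and parses it back the same way one would scan a real file:

```python
import json
import struct

# Minimal in-memory "shard": 8-byte LE header length + JSON header.
# Key names mirror the layer-14 warnings above; shapes/offsets are placeholders.
header = {
    "model.layers.14.mlp.gate_proj.weight_scale_inv": {
        "dtype": "F32", "shape": [58, 16], "data_offsets": [0, 0]},
    "model.layers.14.mlp.up_proj.weight_scale_inv": {
        "dtype": "F32", "shape": [58, 16], "data_offsets": [0, 0]},
}
blob = json.dumps(header).encode()
fake_file = struct.pack("<Q", len(blob)) + blob  # no tensor data needed to list keys

# Parse it back exactly as one would inspect a real shard on disk.
(hlen,) = struct.unpack("<Q", fake_file[:8])
keys = sorted(json.loads(fake_file[8 : 8 + hlen]))
scale_keys = [k for k in keys if k.endswith("weight_scale_inv")]
print(scale_keys)
```

Running the same key scan over every 3.6 shard is what confirms the separate gate_proj/up_proj scales with no fused gate_up_proj key.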

Suspected root cause

In python/sglang/srt/models/qwen3_5.py, Qwen3_5ForConditionalGeneration.load_weights (around line 1380):

stacked_params_mapping = [
    ("qkv_proj", "q_proj", "q"),
    ("qkv_proj", "k_proj", "k"),
    ("qkv_proj", "v_proj", "v"),
    ("gate_up_proj", "gate_proj", 0),
    ("gate_up_proj", "up_proj",   1),
    ("in_proj_qkvz.", "in_proj_qkv.", (0, 1, 2)),
    ("in_proj_qkvz.", "in_proj_z.", 3),
    ("in_proj_ba.", "in_proj_b.", 0),
    ("in_proj_ba.", "in_proj_a.", 1),
]
for param_name, weight_name, shard_id in stacked_params_mapping:
    if weight_name not in name:
        continue
    if "visual" in name or "mlp.experts" in name:
        continue
    name = name.replace(weight_name, param_name)   # mutates `name`
    if name.endswith(".bias") and name not in params_dict:
        continue
    if name not in params_dict:
        continue            # ← `name` is now mutated; next iter re-matches against poisoned name
    ...
    break

Trace for input weight model.layers.14.mlp.gate_proj.weight_scale_inv:

  1. iter ("gate_up_proj", "gate_proj", 0) — "gate_proj" matches → replace → model.layers.14.mlp.gate_up_proj.weight_scale_inv → not in params_dict → continue (name now mutated)
  2. iter ("gate_up_proj", "up_proj", 1) — "up_proj" is a substring of the mutated "gate_up_proj" → replace up_proj → gate_up_proj → model.layers.14.mlp.gate_gate_up_proj.weight_scale_inv → warning printed, weight dropped
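The trace above can be reproduced standalone. This sketch mirrors the quoted load_weights loop with params_dict reduced to the one case that matters (the fused scale was never registered); it is an illustration of the string-mutation mechanism, not the actual SGLang code path:

```python
# Standalone reproduction of the name-mutation bug.  The loop body mirrors
# the load_weights logic quoted above; params_dict is empty to model the
# fused gate_up_proj.weight_scale_inv never having been registered.
stacked_params_mapping = [
    ("gate_up_proj", "gate_proj", 0),
    ("gate_up_proj", "up_proj", 1),
]
params_dict = {}

name = "model.layers.14.mlp.gate_proj.weight_scale_inv"
for param_name, weight_name, shard_id in stacked_params_mapping:
    if weight_name not in name:
        continue
    name = name.replace(weight_name, param_name)  # mutates `name` in place
    if name not in params_dict:
        continue  # next iteration re-matches against the mutated name
    break

print(name)  # model.layers.14.mlp.gate_gate_up_proj.weight_scale_inv
```

The second iteration matches "up_proj" inside the already-substituted "gate_up_proj", producing exactly the gate_gate_up_proj name seen in the warnings.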

The first iteration should have succeeded — the upstream cause is that the model's MergedColumnParallelLinear + FP8-block-quant doesn't expose gate_up_proj.weight_scale_inv as a registered parameter when Qwen3_5ForConditionalGeneration is instantiated for this dense variant. 3.5's checkpoint provides weights in an order/format that bypasses this; 3.6's per-layer + MTP layout doesn't.

Net effect: every dense MLP's FP8 scales fail to load → dequantization runs with default scale (1.0) → garbage output.
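Why a dropped scale produces garbage rather than a crash can be shown with a toy example (illustrative numbers only, not real FP8 arithmetic): block quantization stores weights divided by a per-block scale, and dequantization multiplies the scale back in; if the scale silently defaults to 1.0, every weight comes out wrong by the true scale factor.

```python
# Toy illustration (not real FP8): block-quantized weights are stored along
# with a per-block scale; dequantization multiplies the scale back in.
true_scale = 0.0078125           # hypothetical per-block scale (2**-7)
original = [0.5, -1.25, 2.0]     # hypothetical original weights
quantized = [w / true_scale for w in original]  # what the checkpoint stores

good = [q * true_scale for q in quantized]  # correct dequant → original values
bad = [q * 1.0 for q in quantized]          # dropped scale → values 128x too large

print(good)  # [0.5, -1.25, 2.0]
print(bad)   # [64.0, -160.0, 256.0]
```

Every dense MLP being off by a large constant factor is consistent with the observed token salad: the model still runs, but its activations are meaningless.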

Workarounds tried (all fail with same warnings + garbage)

Proposed fix (two layers)

  1. (cosmetic) make the inner continue after name = name.replace(...) either restore the original name or break out of the stacked_params_mapping loop, so subsequent iterations don't poison the mutated name string.
  2. (real fix) ensure MergedColumnParallelLinear for the Qwen3.6 dense MLP exposes weight_scale_inv as a registered parameter at the time load_weights runs. Likely an init-order issue where the FP8 quant method registers the scale parameter only after named_parameters() has been captured.
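Fix (1) can be sketched as follows. This is only an illustration of the intended control flow, not a patch against qwen3_5.py: matching is done on a local candidate so a failed lookup never poisons later iterations, and the helper name resolve_stacked_name is invented here for the sketch.

```python
# Sketch of fix (1): build the substituted name into a local `candidate`
# so a failed params_dict lookup leaves the original `name` untouched.
def resolve_stacked_name(name, stacked_params_mapping, params_dict):
    for param_name, weight_name, shard_id in stacked_params_mapping:
        if weight_name not in name:
            continue
        candidate = name.replace(weight_name, param_name)  # original intact
        if candidate not in params_dict:
            continue  # next iteration still sees the pristine `name`
        return candidate, shard_id
    return name, None  # fall through to the non-stacked load path

# With the fused scale registered (fix 2), the first mapping entry resolves:
mapping = [("gate_up_proj", "gate_proj", 0), ("gate_up_proj", "up_proj", 1)]
params = {"model.layers.14.mlp.gate_up_proj.weight_scale_inv": object()}
resolved, shard = resolve_stacked_name(
    "model.layers.14.mlp.gate_proj.weight_scale_inv", mapping, params)
print(resolved, shard)
```

Even with an empty params dict (fix 2 absent), this version returns the original name unchanged instead of a doubled gate_gate_up_proj, so the weight would at least fall through to the normal lookup path.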

Happy to test patches against a local checkout. Full server logs and a TP=1 minimal-config repro available on request.

Reproduction

Verbatim from the model card https://huggingface.co/Qwen/Qwen3.6-27B-FP8 (only --tp-size and --port adjusted to local hardware):

docker run --gpus '"device=0,1"' --shm-size 32g --ipc host -p 8000:8080 \
  -v /path/to/Qwen3.6-27B-FP8:/models/Qwen3.6-27B-FP8:ro \
  lmsysorg/sglang:v0.5.10.post1-cu130 \
  python3 -m sglang.launch_server \
    --model-path /models/Qwen3.6-27B-FP8 \
    --tp-size 2 \
    --mem-fraction-static 0.8 \
    --context-length 262144 \
    --reasoning-parser qwen3 \
    --tool-call-parser qwen3_coder \
    --host 0.0.0.0 --port 8080

Server starts (after ~256 warnings). Any chat completion returns garbage:

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"...","messages":[{"role":"user","content":"hi"}],
       "max_tokens":40,"chat_template_kwargs":{"enable_thinking":false}}'

Replacing the model with Qwen/Qwen3.5-27B-FP8 and re-running the identical command produces 0 warnings and coherent output.

Environment

CUDA Version 13.0.1
Python: 3.12.3 (main, Mar  3 2026, 12:15:18) [GCC 13.3.0]
CUDA available: True
GPU 0: NVIDIA GeForce RTX 5090
GPU 0 Compute Capability: 12.0
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 13.0, V13.0.88
CUDA Driver Version: 580.126.20
PyTorch: 2.9.1+cu130
sglang: 0.5.10.post1
sglang-kernel: 0.4.1+cu130
flashinfer_python: 0.6.7.post3
flashinfer_cubin: 0.6.7.post3
flashinfer_jit_cache: 0.6.7.post3+cu130
triton: 3.5.1
transformers: 5.3.0
torchao: 0.9.0
numpy: 2.3.5
fastapi: 0.135.3
huggingface_hub: 1.9.2
pydantic: 2.12.5
xgrammar: 0.1.32
torchcodec: 0.9.1+cu130

NVIDIA Topology: 2× RTX 5090, no NVLink (peer access not supported between devices — handled by SGLang via NCCL fallback, expected)

Hardware: 2× RTX 5090 (Blackwell sm_120, 32 GB each)
Container image: lmsysorg/sglang:v0.5.10.post1-cu130
