
[Bug] Qwen3.6-27B-FP8 (dense): FP8 weight_scale_inv silently dropped → garbage output (gate_gate_up_proj loop bug in qwen3_5.py) #23687

@gucasbrg

Description


Checklist

  • I searched related issues but found no solution.
  • The bug persists in the latest version.
  • Issues without environment info and a minimal reproducible demo are hard to resolve and may receive no feedback.
  • If this is not a bug report but a general question, please start a discussion at https://github.com/sgl-project/sglang/discussions. Otherwise, it will be closed.
  • Please use English. Otherwise, it will be closed.

Describe the bug

Loading Qwen/Qwen3.6-27B-FP8 (dense) with the official command from the model card produces pure garbage output (token salad, repeated single-char tokens, no coherent text).

The same SGLang version + same hardware + same flags loads Qwen/Qwen3.5-27B-FP8 (also dense, also FP8 block-128) cleanly with 0 loader warnings and works correctly. So the SGLang env is fine — only the 3.6 checkpoint triggers the bug.

Symptom 1 — 256 loader warnings during weight load

Two warnings per dense MLP × 64 layers × TP=2 = 256 total:

Parameter model.layers.14.mlp.gate_gate_up_proj.weight_scale_inv not found in params_dict
Parameter model.layers.14.mlp.gate_up_proj.weight_scale_inv not found in params_dict

Note gate_gate_up_proj (double gate_) — strong hint at a string-mutation bug in the loader.

Symptom 2 — garbage generation after server is "ready"

prompt:  "一句话介绍你自己" ("introduce yourself in one sentence")  (max_tokens=100, enable_thinking=False)
output:  " **:**落花ingoELITgetImage检查工作栩-ca促促促促促促促促/prom/prom/prom Prom/prom..."

prompt:  "17*23 等于多少?" ("what does 17*23 equal?")  (enable_thinking=True)
output:  reasoning_content non-empty, content="" — no '391' anywhere

prompt:  tool_call (Beijing weather), enable_thinking=False
output:  no tool_calls; content="atherscobrauda之急急急急急急急急..."

prompt:  8K-token long-context anchor-word recall
output:  "­t Hornyalph�arkingamarca有福裾本站..."  — fails to recall anchor

What's different between Qwen3.5-27B-FP8 (works) and Qwen3.6-27B-FP8 (broken)

|  | 3.5-27B-FP8 (0 warnings, works) | 3.6-27B-FP8 (256 warnings, garbage) |
|---|---|---|
| sharding | standard model-NNNNN-of-MMMMM.safetensors | per-layer layers-N.safetensors |
| MTP heads in checkpoint | none | mtp.layers.0.* |
| linear_attn keys | in_proj_a / in_proj_b | new fused in_proj_ba + A_log + dt_bias + norm |

Direct safetensors inspection confirms 3.6 has separate mlp.gate_proj.weight_scale_inv + mlp.up_proj.weight_scale_inv for every dense MLP layer (no fused gate_up_proj in checkpoint, by design).
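For reference, the inspection above can be reproduced with nothing but the stdlib: a safetensors shard is an 8-byte little-endian header length followed by a JSON header that maps tensor names to metadata, so listing keys never requires loading tensor data. This is a minimal sketch that builds a fake in-memory shard (key names mirror the layer-14 warnings above; shapes/offsets are placeholders) and parses it back the same way one would scan a real file:

```python
import json
import struct

# Minimal in-memory "shard": 8-byte LE header length + JSON header.
# Key names mirror the layer-14 warnings above; shapes/offsets are placeholders.
header = {
    "model.layers.14.mlp.gate_proj.weight_scale_inv": {
        "dtype": "F32", "shape": [58, 16], "data_offsets": [0, 0]},
    "model.layers.14.mlp.up_proj.weight_scale_inv": {
        "dtype": "F32", "shape": [58, 16], "data_offsets": [0, 0]},
}
blob = json.dumps(header).encode()
fake_file = struct.pack("<Q", len(blob)) + blob  # no tensor data needed to list keys

# Parse it back exactly as one would inspect a real shard on disk.
(hlen,) = struct.unpack("<Q", fake_file[:8])
keys = sorted(json.loads(fake_file[8 : 8 + hlen]))
scale_keys = [k for k in keys if k.endswith("weight_scale_inv")]
print(scale_keys)
```

Running the same key scan over every 3.6 shard is what confirms the separate gate_proj/up_proj scales with no fused gate_up_proj key.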

Suspected root cause

In python/sglang/srt/models/qwen3_5.py, Qwen3_5ForConditionalGeneration.load_weights (around line 1380):

stacked_params_mapping = [
    ("qkv_proj", "q_proj", "q"),
    ("qkv_proj", "k_proj", "k"),
    ("qkv_proj", "v_proj", "v"),
    ("gate_up_proj", "gate_proj", 0),
    ("gate_up_proj", "up_proj",   1),
    ("in_proj_qkvz.", "in_proj_qkv.", (0, 1, 2)),
    ("in_proj_qkvz.", "in_proj_z.", 3),
    ("in_proj_ba.", "in_proj_b.", 0),
    ("in_proj_ba.", "in_proj_a.", 1),
]
for param_name, weight_name, shard_id in stacked_params_mapping:
    if weight_name not in name:
        continue
    if "visual" in name or "mlp.experts" in name:
        continue
    name = name.replace(weight_name, param_name)   # mutates `name`
    if name.endswith(".bias") and name not in params_dict:
        continue
    if name not in params_dict:
        continue            # ← `name` is now mutated; next iter re-matches against poisoned name
    ...
    break

Trace for input weight model.layers.14.mlp.gate_proj.weight_scale_inv:

  1. iter ("gate_up_proj", "gate_proj", 0) — "gate_proj" matches → replace → model.layers.14.mlp.gate_up_proj.weight_scale_inv → not in params_dict → continue (name now mutated)
  2. iter ("gate_up_proj", "up_proj", 1) — "up_proj" is a substring of the mutated "gate_up_proj" → replace up_proj → gate_up_proj → model.layers.14.mlp.gate_gate_up_proj.weight_scale_inv → warning printed, weight dropped
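The trace above can be reproduced standalone. This sketch mirrors the quoted load_weights loop with params_dict reduced to the one case that matters (the fused scale was never registered); it is an illustration of the string-mutation mechanism, not the actual SGLang code path:

```python
# Standalone reproduction of the name-mutation bug.  The loop body mirrors
# the load_weights logic quoted above; params_dict is empty to model the
# fused gate_up_proj.weight_scale_inv never having been registered.
stacked_params_mapping = [
    ("gate_up_proj", "gate_proj", 0),
    ("gate_up_proj", "up_proj", 1),
]
params_dict = {}

name = "model.layers.14.mlp.gate_proj.weight_scale_inv"
for param_name, weight_name, shard_id in stacked_params_mapping:
    if weight_name not in name:
        continue
    name = name.replace(weight_name, param_name)  # mutates `name` in place
    if name not in params_dict:
        continue  # next iteration re-matches against the mutated name
    break

print(name)  # model.layers.14.mlp.gate_gate_up_proj.weight_scale_inv
```

The second iteration matches "up_proj" inside the already-substituted "gate_up_proj", producing exactly the gate_gate_up_proj name seen in the warnings.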

The first iteration should have succeeded — the upstream cause is that the model's MergedColumnParallelLinear + FP8-block-quant doesn't expose gate_up_proj.weight_scale_inv as a registered parameter when Qwen3_5ForConditionalGeneration is instantiated for this dense variant. 3.5's checkpoint provides weights in an order/format that bypasses this; 3.6's per-layer + MTP layout doesn't.

Net effect: every dense MLP's FP8 scales fail to load → dequantization runs with default scale (1.0) → garbage output.
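Why a dropped scale produces garbage rather than a crash can be shown with a toy example (illustrative numbers only, not real FP8 arithmetic): block quantization stores weights divided by a per-block scale, and dequantization multiplies the scale back in; if the scale silently defaults to 1.0, every weight comes out wrong by the true scale factor.

```python
# Toy illustration (not real FP8): block-quantized weights are stored along
# with a per-block scale; dequantization multiplies the scale back in.
true_scale = 0.0078125           # hypothetical per-block scale (2**-7)
original = [0.5, -1.25, 2.0]     # hypothetical original weights
quantized = [w / true_scale for w in original]  # what the checkpoint stores

good = [q * true_scale for q in quantized]  # correct dequant → original values
bad = [q * 1.0 for q in quantized]          # dropped scale → values 128x too large

print(good)  # [0.5, -1.25, 2.0]
print(bad)   # [64.0, -160.0, 256.0]
```

Every dense MLP being off by a large constant factor is consistent with the observed token salad: the model still runs, but its activations are meaningless.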

Workarounds tried (all fail with same warnings + garbage)

Proposed fix (two layers)

  1. (cosmetic) make the inner continue after name = name.replace(...) either restore the original name or break out of the stacked_params_mapping loop, so subsequent iterations don't poison the mutated name string.
  2. (real fix) ensure MergedColumnParallelLinear for the Qwen3.6 dense MLP exposes weight_scale_inv as a registered parameter at the time load_weights runs. Likely an init-order issue where the FP8 quant method registers the scale parameter only after named_parameters() has been captured.
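Fix (1) can be sketched as follows. This is only an illustration of the intended control flow, not a patch against qwen3_5.py: matching is done on a local candidate so a failed lookup never poisons later iterations, and the helper name resolve_stacked_name is invented here for the sketch.

```python
# Sketch of fix (1): build the substituted name into a local `candidate`
# so a failed params_dict lookup leaves the original `name` untouched.
def resolve_stacked_name(name, stacked_params_mapping, params_dict):
    for param_name, weight_name, shard_id in stacked_params_mapping:
        if weight_name not in name:
            continue
        candidate = name.replace(weight_name, param_name)  # original intact
        if candidate not in params_dict:
            continue  # next iteration still sees the pristine `name`
        return candidate, shard_id
    return name, None  # fall through to the non-stacked load path

# With the fused scale registered (fix 2), the first mapping entry resolves:
mapping = [("gate_up_proj", "gate_proj", 0), ("gate_up_proj", "up_proj", 1)]
params = {"model.layers.14.mlp.gate_up_proj.weight_scale_inv": object()}
resolved, shard = resolve_stacked_name(
    "model.layers.14.mlp.gate_proj.weight_scale_inv", mapping, params)
print(resolved, shard)
```

Even with an empty params dict (fix 2 absent), this version returns the original name unchanged instead of a doubled gate_gate_up_proj, so the weight would at least fall through to the normal lookup path.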

Happy to test patches against a local checkout. Full server logs and a TP=1 minimal-config repro available on request.

Reproduction

Verbatim from the model card https://huggingface.co/Qwen/Qwen3.6-27B-FP8 (only --tp-size and --port adjusted to local hardware):

docker run --gpus '"device=0,1"' --shm-size 32g --ipc host -p 8000:8080 \
  -v /path/to/Qwen3.6-27B-FP8:/models/Qwen3.6-27B-FP8:ro \
  lmsysorg/sglang:v0.5.10.post1-cu130 \
  python3 -m sglang.launch_server \
    --model-path /models/Qwen3.6-27B-FP8 \
    --tp-size 2 \
    --mem-fraction-static 0.8 \
    --context-length 262144 \
    --reasoning-parser qwen3 \
    --tool-call-parser qwen3_coder \
    --host 0.0.0.0 --port 8080

Server starts (after ~256 warnings). Any chat completion returns garbage:

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"...","messages":[{"role":"user","content":"hi"}],
       "max_tokens":40,"chat_template_kwargs":{"enable_thinking":false}}'

Replacing the model with Qwen/Qwen3.5-27B-FP8 and re-running the identical command produces 0 warnings and coherent output.

Environment

CUDA Version 13.0.1
Python: 3.12.3 (main, Mar  3 2026, 12:15:18) [GCC 13.3.0]
CUDA available: True
GPU 0: NVIDIA GeForce RTX 5090
GPU 0 Compute Capability: 12.0
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 13.0, V13.0.88
CUDA Driver Version: 580.126.20
PyTorch: 2.9.1+cu130
sglang: 0.5.10.post1
sglang-kernel: 0.4.1+cu130
flashinfer_python: 0.6.7.post3
flashinfer_cubin: 0.6.7.post3
flashinfer_jit_cache: 0.6.7.post3+cu130
triton: 3.5.1
transformers: 5.3.0
torchao: 0.9.0
numpy: 2.3.5
fastapi: 0.135.3
huggingface_hub: 1.9.2
pydantic: 2.12.5
xgrammar: 0.1.32
torchcodec: 0.9.1+cu130

NVIDIA Topology: 2× RTX 5090, no NVLink (peer access not supported between devices — handled by SGLang via NCCL fallback, expected)

Hardware: 2× RTX 5090 (Blackwell sm_120, 32 GB each)
Container image: lmsysorg/sglang:v0.5.10.post1-cu130
