Checklist
Describe the bug
Loading Qwen/Qwen3.6-27B-FP8 (dense) with the official command from the model card produces pure garbage output (token salad, repeated single-char tokens, no coherent text).
The same SGLang version + same hardware + same flags loads Qwen/Qwen3.5-27B-FP8 (also dense, also FP8 block-128) cleanly with 0 loader warnings and works correctly. So the SGLang env is fine — only the 3.6 checkpoint triggers the bug.
Symptom 1 — 256 loader warnings during weight load
Two warnings per dense MLP × 64 layers × TP=2 = 256 total:
```
Parameter model.layers.14.mlp.gate_gate_up_proj.weight_scale_inv not found in params_dict
Parameter model.layers.14.mlp.gate_up_proj.weight_scale_inv not found in params_dict
```
Note `gate_gate_up_proj` (double `gate_`) — a strong hint at a string-mutation bug in the loader.
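The doubled prefix falls straight out of naive in-place replacement. A standalone sketch (plain Python, not SGLang code) reproduces the exact warning name:

```python
# Toy repro of the loader's string mutation: iteration 1 maps
# gate_proj -> gate_up_proj; iteration 2 then re-matches "up_proj"
# inside the already-mutated name.
name = "model.layers.14.mlp.gate_proj.weight_scale_inv"

name = name.replace("gate_proj", "gate_up_proj")  # iter 1
assert name == "model.layers.14.mlp.gate_up_proj.weight_scale_inv"

name = name.replace("up_proj", "gate_up_proj")    # iter 2: substring re-match
assert name == "model.layers.14.mlp.gate_gate_up_proj.weight_scale_inv"
```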
Symptom 2 — garbage generation after server is "ready"
prompt: "一句话介绍你自己" ("introduce yourself in one sentence"), max_tokens=100, enable_thinking=False
output: " **:**落花ingoELITgetImage检查工作栩-ca促促促促促促促促/prom/prom/prom Prom/prom..."
prompt: "17*23 等于多少?" ("what is 17*23?"), enable_thinking=True
output: `reasoning_content` non-empty, `content=""` — no "391" anywhere
prompt: tool call asking for Beijing weather (北京天气), enable_thinking=False
output: no `tool_calls`; `content="atherscobrauda之急急急急急急急急..."`
prompt: 8K-token long-context anchor-word recall
output: "t Hornyalph�arkingamarca有福裾本站..." — fails to recall the anchor word
What's different between Qwen3.5-27B-FP8 (works) and Qwen3.6-27B-FP8 (broken)
| | 3.5-27B-FP8 (0 warnings, works) | 3.6-27B-FP8 (256 warnings, garbage) |
|---|---|---|
| sharding | standard `model-NNNNN-of-MMMMM.safetensors` | per-layer `layers-N.safetensors` |
| MTP heads in checkpoint | none | `mtp.layers.0.*` |
| linear_attn keys | `in_proj_a` / `in_proj_b` | + new fused `in_proj_ba` + `A_log` + `dt_bias` + `norm` |
Direct safetensors inspection confirms 3.6 has separate `mlp.gate_proj.weight_scale_inv` + `mlp.up_proj.weight_scale_inv` for every dense MLP layer (no fused `gate_up_proj` in the checkpoint, by design).
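That layout check reduces to scanning tensor names. A sketch (the helper and sample keys below are illustrative stand-ins, not the actual inspection script; with the `safetensors` package the real key list comes from `safe_open(path, framework="pt").keys()`):

```python
# Illustrative key scan; sample_36_keys is a hypothetical stand-in for
# the real checkpoint's tensor names.
def dense_mlp_scale_layout(keys):
    # Separate per-projection FP8 scales vs. a fused gate_up_proj scale.
    separate = [k for k in keys
                if k.endswith("mlp.gate_proj.weight_scale_inv")
                or k.endswith("mlp.up_proj.weight_scale_inv")]
    fused = [k for k in keys if k.endswith("mlp.gate_up_proj.weight_scale_inv")]
    return separate, fused

sample_36_keys = [
    "model.layers.14.mlp.gate_proj.weight_scale_inv",
    "model.layers.14.mlp.up_proj.weight_scale_inv",
    "model.layers.14.mlp.down_proj.weight_scale_inv",
]
separate, fused = dense_mlp_scale_layout(sample_36_keys)
assert len(separate) == 2 and fused == []  # separate scales, no fused key
```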
Suspected root cause
In `python/sglang/srt/models/qwen3_5.py`, `Qwen3_5ForConditionalGeneration.load_weights` (around line 1380):

```python
stacked_params_mapping = [
    ("qkv_proj", "q_proj", "q"),
    ("qkv_proj", "k_proj", "k"),
    ("qkv_proj", "v_proj", "v"),
    ("gate_up_proj", "gate_proj", 0),
    ("gate_up_proj", "up_proj", 1),
    ("in_proj_qkvz.", "in_proj_qkv.", (0, 1, 2)),
    ("in_proj_qkvz.", "in_proj_z.", 3),
    ("in_proj_ba.", "in_proj_b.", 0),
    ("in_proj_ba.", "in_proj_a.", 1),
]
for param_name, weight_name, shard_id in stacked_params_mapping:
    if weight_name not in name:
        continue
    if "visual" in name or "mlp.experts" in name:
        continue
    name = name.replace(weight_name, param_name)  # mutates `name`
    if name.endswith(".bias") and name not in params_dict:
        continue
    if name not in params_dict:
        continue  # ← `name` is now mutated; next iter re-matches against poisoned name
    ...
    break
```
Trace for input weight `model.layers.14.mlp.gate_proj.weight_scale_inv`:

- iter 1: `("gate_up_proj", "gate_proj", 0)` matches → replace → `model.layers.14.mlp.gate_up_proj.weight_scale_inv` → not in `params_dict` → `continue` (`name` now mutated)
- iter 2: `("gate_up_proj", "up_proj", 1)` — `up_proj` is a substring of the mutated `gate_up_proj` → replace `up_proj` → `gate_up_proj` → `model.layers.14.mlp.gate_gate_up_proj.weight_scale_inv` → warning printed, weight dropped
The first iteration should have succeeded — the upstream cause is that the model's `MergedColumnParallelLinear` + FP8-block-quant doesn't expose `gate_up_proj.weight_scale_inv` as a registered parameter when `Qwen3_5ForConditionalGeneration` is instantiated for this dense variant. 3.5's checkpoint provides weights in an order/format that bypasses this; 3.6's per-layer + MTP layout doesn't.
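The init-order hypothesis can be illustrated with a toy snapshot (plain dicts standing in for `named_parameters()`; not SGLang internals):

```python
# Toy illustration of the suspected init-order issue: if the loader
# snapshots named parameters before the quant method registers the
# scale, the scale can never be matched during weight loading.
registered = {}
registered["mlp.gate_up_proj.weight"] = "fp8 weight"

params_dict = dict(registered)  # snapshot taken when load_weights starts

# FP8 block-quant registers its scale only afterwards:
registered["mlp.gate_up_proj.weight_scale_inv"] = "block scales"

# The snapshot never sees the scale -> "not found in params_dict" warning.
assert "mlp.gate_up_proj.weight_scale_inv" not in params_dict
assert "mlp.gate_up_proj.weight_scale_inv" in registered
```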
Net effect: every dense MLP's FP8 scales fail to load → dequantization runs with default scale (1.0) → garbage output.
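To see why a defaulted scale yields token salad rather than mild degradation, toy numbers help (illustrative only; real FP8 block-128 applies one `weight_scale_inv` entry per 128×128 weight block):

```python
# Toy per-block quantization: w is stored as q = round(w / s) and
# recovered as q * s. If weight_scale_inv never loads, dequantization
# effectively runs with s = 1.0.
true_scale = 0.004
w = 0.0123
q = round(w / true_scale)   # integer code actually stored in the checkpoint
w_good = q * true_scale     # close to the original weight
w_bad = q * 1.0             # scale missing: weight inflated by ~1/true_scale
assert abs(w_good - w) < true_scale
assert w_bad / w > 100      # hundreds of times too large -> activations explode
```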
Workarounds tried (all fail with same warnings + garbage)
- `--attention-backend triton --linear-attn-decode-backend triton --mamba-scheduler-strategy extra_buffer` (per #20791, "[Bug] [GDN] Accuracy degradation with flashinfer `gated_delta_rule_decode_pretranspose` under `no_buffer` scheduling")
- `--kv-cache-dtype fp8_e4m3` (per #19603, "[Benchmark] Qwen3.5-122B-A10B FP8 weights / bf16 KV on 8x RTX PRO 6000 (SM120): 1,985 tok/s burst, MTP 2.75x, fp8 KV silent corruption finding" — also tried adding it back, no change)
- `0.5.10.post1` GA → identical on `0.5.10rc0` (`lmsysorg/sglang:latest`)
- `tp-size=2` with extra `--max-mamba-cache-size` / `--max-running-requests` tuning
- `tp-size=1` (OOMs at full 262144 ctx on RTX 5090, 32 GB — context length does not change the loader behavior)
Proposed fix (two layers)
- (cosmetic) Make the inner `continue` after `name = name.replace(...)` either restore the original `name` or `break` out of the `stacked_params_mapping` loop, so subsequent iterations don't poison the mutated name string.
- (real fix) Ensure `MergedColumnParallelLinear` for the Qwen3.6 dense MLP exposes `weight_scale_inv` as a registered parameter by the time `load_weights` runs. Likely an init-order issue where the FP8 quant method registers the scale parameter only after `named_parameters()` has been captured.
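A minimal sketch of the cosmetic half (assumed shape of the loop; the real `load_weights` has more branches): match against a fresh candidate string instead of mutating `name`:

```python
# Sketch: with the scale still unregistered (as in this bug), the loop
# fails cleanly instead of producing a double-prefixed name.
stacked_params_mapping = [
    ("gate_up_proj", "gate_proj", 0),
    ("gate_up_proj", "up_proj", 1),
]
params_dict = {}  # weight_scale_inv not registered yet
name = "model.layers.14.mlp.gate_proj.weight_scale_inv"

matched = None
for param_name, weight_name, shard_id in stacked_params_mapping:
    if weight_name not in name:
        continue
    candidate = name.replace(weight_name, param_name)  # leave `name` untouched
    if candidate not in params_dict:
        continue  # later iterations still compare against the original `name`
    matched = candidate
    break

# No "gate_gate_up_proj" name can ever be produced:
assert matched is None
assert name == "model.layers.14.mlp.gate_proj.weight_scale_inv"
```

This doesn't make the scale load (that needs the real fix above), but it confines the failure to one accurate warning per weight.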
Happy to test patches against a local checkout. Full server logs and a TP=1 minimal-config repro available on request.
Reproduction
Verbatim from the model card https://huggingface.co/Qwen/Qwen3.6-27B-FP8 (only `--tp-size` and `--port` adjusted to local hardware):
```shell
docker run --gpus '"device=0,1"' --shm-size 32g --ipc host -p 8000:8080 \
  -v /path/to/Qwen3.6-27B-FP8:/models/Qwen3.6-27B-FP8:ro \
  lmsysorg/sglang:v0.5.10.post1-cu130 \
  python3 -m sglang.launch_server \
    --model-path /models/Qwen3.6-27B-FP8 \
    --tp-size 2 \
    --mem-fraction-static 0.8 \
    --context-length 262144 \
    --reasoning-parser qwen3 \
    --tool-call-parser qwen3_coder \
    --host 0.0.0.0 --port 8080
```
Server starts (after ~256 warnings). Any chat completion returns garbage:
```shell
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"...","messages":[{"role":"user","content":"hi"}],
       "max_tokens":40,"chat_template_kwargs":{"enable_thinking":false}}'
```
Replacing the model with Qwen/Qwen3.5-27B-FP8 and re-running the identical command produces 0 warnings and coherent output.
Environment
CUDA Version 13.0.1
Python: 3.12.3 (main, Mar 3 2026, 12:15:18) [GCC 13.3.0]
CUDA available: True
GPU 0: NVIDIA GeForce RTX 5090
GPU 0 Compute Capability: 12.0
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 13.0, V13.0.88
CUDA Driver Version: 580.126.20
PyTorch: 2.9.1+cu130
sglang: 0.5.10.post1
sglang-kernel: 0.4.1+cu130
flashinfer_python: 0.6.7.post3
flashinfer_cubin: 0.6.7.post3
flashinfer_jit_cache: 0.6.7.post3+cu130
triton: 3.5.1
transformers: 5.3.0
torchao: 0.9.0
numpy: 2.3.5
fastapi: 0.135.3
huggingface_hub: 1.9.2
pydantic: 2.12.5
xgrammar: 0.1.32
torchcodec: 0.9.1+cu130
NVIDIA Topology: 2× RTX 5090, no NVLink (peer access not supported between devices — handled by SGLang via NCCL fallback, expected)
Hardware: 2× RTX 5090 (Blackwell sm_120, 32 GB each)
Container image: lmsysorg/sglang:v0.5.10.post1-cu130