Dflash working launch parameters #22532

GerLinuxEnthusiast · 2026-04-10T17:06:22Z

GerLinuxEnthusiast
Apr 10, 2026

Hey! I am generally new to Sglang and wanted to try out the Dflash speculative decoding algorithm, but I don't get it to work.

My setup entails two nvidia rtx 3090s (2x24gb/ 48gb total) and I tried running it with the official gptq quantized Qwen3.5 27B version from Qwen, but no matter what I do, it always runs OOM.

Is there a setting/ parameter I miss in the picture or is it just not working at all?

My current setup:
Version: 0.5.6.post2

python -m sglang.launch_server
--model-path Qwen/Qwen3.5-27B-GPTQ-Int4
--speculative-algorithm DFLASH
--speculative-draft-model-path z-lab/Qwen3.5-27B-DFlash
--speculative-num-draft-tokens 16
--tp-size 2
--attention-backend flashinfer
--speculative-draft-attention-backend fa4
--mem-fraction-static 0.8
--mamba-scheduler-strategy extra_buffer
--trust-remote-code
--context-length 4000

The context length is deliberately low as I initially thought this might be the problem, but it is not.

The output
`...
[2026-04-10 19:02:29 TP0] Load weight end. elapsed=19.18 s, type=Qwen3_5ForConditionalGeneration, quant=gptq, bits=4, avail mem=9.40 GB, mem usage=13.85 GB.
[2026-04-10 19:02:29 TP1] Load weight end. elapsed=19.20 s, type=Qwen3_5ForConditionalGeneration, quant=gptq, bits=4, avail mem=8.95 GB, mem usage=13.87 GB.
[2026-04-10 19:02:29 TP0] Using KV cache dtype: torch.bfloat16
[2026-04-10 19:02:29 TP1] Scheduler hit an exception: Traceback (most recent call last):
File "/home/user1/sglang_env/lib/python3.12/site-packages/sglang/srt/managers/scheduler.py", line 3277, in run_scheduler_process
scheduler = Scheduler(
^^^^^^^^^^
File "/home/user1/sglang_env/lib/python3.12/site-packages/sglang/srt/managers/scheduler.py", line 382, in init
self.init_model_worker()
File "/home/user1/sglang_env/lib/python3.12/site-packages/sglang/srt/managers/scheduler.py", line 578, in init_model_worker
self.init_tp_model_worker()
File "/home/user1/sglang_env/lib/python3.12/site-packages/sglang/srt/managers/scheduler.py", line 536, in init_tp_model_worker
self.tp_worker = TpModelWorker(
^^^^^^^^^^^^^^
File "/home/user1/sglang_env/lib/python3.12/site-packages/sglang/srt/managers/tp_worker.py", line 258, in init
self._init_model_runner()
File "/home/user1/sglang_env/lib/python3.12/site-packages/sglang/srt/managers/tp_worker.py", line 341, in _init_model_runner
self._model_runner = ModelRunner(
^^^^^^^^^^^^
File "/home/user1/sglang_env/lib/python3.12/site-packages/sglang/srt/model_executor/model_runner.py", line 464, in init
self.initialize(pre_model_load_memory)
File "/home/user1/sglang_env/lib/python3.12/site-packages/sglang/srt/model_executor/model_runner.py", line 651, in initialize
self.init_memory_pool(pre_model_load_memory)
File "/home/user1/sglang_env/lib/python3.12/site-packages/sglang/srt/model_executor/model_runner_kv_cache_mixin.py", line 801, in init_memory_pool
raise RuntimeError(
RuntimeError: Not enough memory. Please try to increase --mem-fraction-static. Current value: self.server_args.mem_fraction_static=0.8

[2026-04-10 19:02:29 TP0] Scheduler hit an exception: Traceback (most recent call last):
File "/home/user1/sglang_env/lib/python3.12/site-packages/sglang/srt/managers/scheduler.py", line 3277, in run_scheduler_process
scheduler = Scheduler(
^^^^^^^^^^
File "/home/user1/sglang_env/lib/python3.12/site-packages/sglang/srt/managers/scheduler.py", line 382, in init
self.init_model_worker()
File "/home/user1/sglang_env/lib/python3.12/site-packages/sglang/srt/managers/scheduler.py", line 578, in init_model_worker
self.init_tp_model_worker()
File "/home/user1/sglang_env/lib/python3.12/site-packages/sglang/srt/managers/scheduler.py", line 536, in init_tp_model_worker
self.tp_worker = TpModelWorker(
^^^^^^^^^^^^^^
File "/home/user1/sglang_env/lib/python3.12/site-packages/sglang/srt/managers/tp_worker.py", line 258, in init
self._init_model_runner()
File "/home/user1/sglang_env/lib/python3.12/site-packages/sglang/srt/managers/tp_worker.py", line 341, in _init_model_runner
self._model_runner = ModelRunner(
^^^^^^^^^^^^
File "/home/user1/sglang_env/lib/python3.12/site-packages/sglang/srt/model_executor/model_runner.py", line 464, in init
self.initialize(pre_model_load_memory)
File "/home/user1/sglang_env/lib/python3.12/site-packages/sglang/srt/model_executor/model_runner.py", line 651, in initialize
self.init_memory_pool(pre_model_load_memory)
File "/home/user1/sglang_env/lib/python3.12/site-packages/sglang/srt/model_executor/model_runner_kv_cache_mixin.py", line 801, in init_memory_pool
raise RuntimeError(
RuntimeError: Not enough memory. Please try to increase --mem-fraction-static. Current value: self.server_args.mem_fraction_static=0.8

[2026-04-10 19:02:29] Received sigquit from a child process. It usually means the child failed.
Process terminated (OOM)`

lviy · 2026-04-21T07:46:37Z

lviy
Apr 21, 2026

These setting look a bit aggressive for 2x3090s.
You could try increasing --mem-fraction-static to 0.90-0.95 and reducing --speculative-num-draft-tokens to 4-8 first.

0 replies

mratsim · 2026-05-01T06:33:42Z

mratsim
May 1, 2026

--speculative-draft-attention-backend fa4, if FA4 refers to flash attention, FA3 and FA4 only supports H100 and B200 / B300 class of GPUs.

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Dflash working launch parameters #22532

Uh oh!

{{title}}

Uh oh!

Replies: 2 comments

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Select a reply

Uh oh!

Dflash working launch parameters #22532

Uh oh!

GerLinuxEnthusiast Apr 10, 2026

Replies: 2 comments

Uh oh!

lviy Apr 21, 2026

Uh oh!

mratsim May 1, 2026

GerLinuxEnthusiast
Apr 10, 2026

lviy
Apr 21, 2026

mratsim
May 1, 2026