Dflash working launch parameters #22532
Unanswered
GerLinuxEnthusiast
asked this question in Q&A
Replies: 2 comments
These settings look a bit aggressive for 2x3090s.
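Back-of-envelope, and assuming `--mem-fraction-static` simply sizes the static pool (weights + KV cache) as a fraction of total VRAM, the per-GPU headroom here is thin:

```python
# Rough per-GPU budget, using the numbers from the log below.
# Assumption: the static pool is mem_fraction_static * total VRAM and has to
# hold the target-model weights plus the KV cache.
total_vram = 24.0           # RTX 3090, GB
mem_fraction_static = 0.8
weights = 13.9              # "mem usage=13.85/13.87 GB" in the log

static_pool = mem_fraction_static * total_vram   # 19.2 GB
headroom = static_pool - weights                 # ~5.3 GB left for the KV cache

print(f"static pool: {static_pool:.1f} GB, headroom after weights: {headroom:.1f} GB")
```

And the DFlash draft model, CUDA graphs, and activations still need to fit on the same cards.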
Hey! I am fairly new to SGLang and wanted to try out the DFlash speculative decoding algorithm, but I can't get it to work.
My setup is two NVIDIA RTX 3090s (2x 24 GB, 48 GB total). I tried running the official GPTQ-quantized Qwen3.5 27B release from Qwen, but no matter what I do, it always runs OOM.
Is there a setting/parameter I am missing, or is it just not working at all?
My current setup:
Version: 0.5.6.post2
```shell
python -m sglang.launch_server \
  --model-path Qwen/Qwen3.5-27B-GPTQ-Int4 \
  --speculative-algorithm DFLASH \
  --speculative-draft-model-path z-lab/Qwen3.5-27B-DFlash \
  --speculative-num-draft-tokens 16 \
  --tp-size 2 \
  --attention-backend flashinfer \
  --speculative-draft-attention-backend fa4 \
  --mem-fraction-static 0.8 \
  --mamba-scheduler-strategy extra_buffer \
  --trust-remote-code \
  --context-length 4000
```
The context length is deliberately low; I initially thought it might be the cause, but it is not.
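For example, variations like this one (smaller `--speculative-num-draft-tokens`, larger `--mem-fraction-static`, everything else identical to the command above) hit the same OOM:

```shell
# Same launch as above, with the speculative budget shrunk and the static
# memory pool enlarged; this still OOMs in the same place for me.
python -m sglang.launch_server \
  --model-path Qwen/Qwen3.5-27B-GPTQ-Int4 \
  --speculative-algorithm DFLASH \
  --speculative-draft-model-path z-lab/Qwen3.5-27B-DFlash \
  --speculative-num-draft-tokens 4 \
  --tp-size 2 \
  --attention-backend flashinfer \
  --speculative-draft-attention-backend fa4 \
  --mem-fraction-static 0.9 \
  --mamba-scheduler-strategy extra_buffer \
  --trust-remote-code \
  --context-length 4000
```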
The output:
```
...
[2026-04-10 19:02:29 TP0] Load weight end. elapsed=19.18 s, type=Qwen3_5ForConditionalGeneration, quant=gptq, bits=4, avail mem=9.40 GB, mem usage=13.85 GB.
[2026-04-10 19:02:29 TP1] Load weight end. elapsed=19.20 s, type=Qwen3_5ForConditionalGeneration, quant=gptq, bits=4, avail mem=8.95 GB, mem usage=13.87 GB.
[2026-04-10 19:02:29 TP0] Using KV cache dtype: torch.bfloat16
[2026-04-10 19:02:29 TP1] Scheduler hit an exception: Traceback (most recent call last):
File "/home/user1/sglang_env/lib/python3.12/site-packages/sglang/srt/managers/scheduler.py", line 3277, in run_scheduler_process
scheduler = Scheduler(
^^^^^^^^^^
File "/home/user1/sglang_env/lib/python3.12/site-packages/sglang/srt/managers/scheduler.py", line 382, in __init__
self.init_model_worker()
File "/home/user1/sglang_env/lib/python3.12/site-packages/sglang/srt/managers/scheduler.py", line 578, in init_model_worker
self.init_tp_model_worker()
File "/home/user1/sglang_env/lib/python3.12/site-packages/sglang/srt/managers/scheduler.py", line 536, in init_tp_model_worker
self.tp_worker = TpModelWorker(
^^^^^^^^^^^^^^
File "/home/user1/sglang_env/lib/python3.12/site-packages/sglang/srt/managers/tp_worker.py", line 258, in __init__
self._init_model_runner()
File "/home/user1/sglang_env/lib/python3.12/site-packages/sglang/srt/managers/tp_worker.py", line 341, in _init_model_runner
self._model_runner = ModelRunner(
^^^^^^^^^^^^
File "/home/user1/sglang_env/lib/python3.12/site-packages/sglang/srt/model_executor/model_runner.py", line 464, in __init__
self.initialize(pre_model_load_memory)
File "/home/user1/sglang_env/lib/python3.12/site-packages/sglang/srt/model_executor/model_runner.py", line 651, in initialize
self.init_memory_pool(pre_model_load_memory)
File "/home/user1/sglang_env/lib/python3.12/site-packages/sglang/srt/model_executor/model_runner_kv_cache_mixin.py", line 801, in init_memory_pool
raise RuntimeError(
RuntimeError: Not enough memory. Please try to increase --mem-fraction-static. Current value: self.server_args.mem_fraction_static=0.8
[2026-04-10 19:02:29 TP0] Scheduler hit an exception: Traceback (most recent call last):
File "/home/user1/sglang_env/lib/python3.12/site-packages/sglang/srt/managers/scheduler.py", line 3277, in run_scheduler_process
scheduler = Scheduler(
^^^^^^^^^^
File "/home/user1/sglang_env/lib/python3.12/site-packages/sglang/srt/managers/scheduler.py", line 382, in __init__
self.init_model_worker()
File "/home/user1/sglang_env/lib/python3.12/site-packages/sglang/srt/managers/scheduler.py", line 578, in init_model_worker
self.init_tp_model_worker()
File "/home/user1/sglang_env/lib/python3.12/site-packages/sglang/srt/managers/scheduler.py", line 536, in init_tp_model_worker
self.tp_worker = TpModelWorker(
^^^^^^^^^^^^^^
File "/home/user1/sglang_env/lib/python3.12/site-packages/sglang/srt/managers/tp_worker.py", line 258, in __init__
self._init_model_runner()
File "/home/user1/sglang_env/lib/python3.12/site-packages/sglang/srt/managers/tp_worker.py", line 341, in _init_model_runner
self._model_runner = ModelRunner(
^^^^^^^^^^^^
File "/home/user1/sglang_env/lib/python3.12/site-packages/sglang/srt/model_executor/model_runner.py", line 464, in __init__
self.initialize(pre_model_load_memory)
File "/home/user1/sglang_env/lib/python3.12/site-packages/sglang/srt/model_executor/model_runner.py", line 651, in initialize
self.init_memory_pool(pre_model_load_memory)
File "/home/user1/sglang_env/lib/python3.12/site-packages/sglang/srt/model_executor/model_runner_kv_cache_mixin.py", line 801, in init_memory_pool
raise RuntimeError(
RuntimeError: Not enough memory. Please try to increase --mem-fraction-static. Current value: self.server_args.mem_fraction_static=0.8
[2026-04-10 19:02:29] Received sigquit from a child process. It usually means the child failed.
Process terminated (OOM)
```