Checklist
Describe the bug
I'd like to report a severe CPU memory allocation bug in the v0.5.10 feature Hisparse (DSA CPU-offload decoding).
During Hisparse decoding, each newly generated token's KV cache is backed up to CPU, causing additional CPU memory usage. This differs from non-Hisparse backends, which do not back up newly generated KV cache to CPU unless a sequence is retracted.
However, the memory pool system still uses the non-Hisparse admission check, which accepts a request as long as `input_length + <a small headroom> < available CPU memory`.
So when the CPU memory pool is nearly full and the generation is long, the CPU pool can overflow as the decode length grows.
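To make the mismatch concrete, here is a minimal sketch in Python. All names (`can_admit_current`, `can_admit_hisparse`, `HEADROOM`, the slot counts) are illustrative assumptions, not sglang's actual API; the point is only that the current check ignores the per-token CPU backup that Hisparse adds during decode.

```python
# Hypothetical sketch of the admission-check mismatch; names and numbers
# are illustrative, not sglang internals.

HEADROOM = 64  # the "small number" of spare slots the current check assumes


def can_admit_current(input_len: int, free_cpu_slots: int) -> bool:
    """Current (non-Hisparse) check: only the prompt length is counted."""
    return input_len + HEADROOM < free_cpu_slots


def can_admit_hisparse(input_len: int, max_new_tokens: int,
                       free_cpu_slots: int) -> bool:
    """Hisparse-aware check: every decoded token is also backed up to CPU,
    so the worst case must reserve input plus the full generation budget."""
    return input_len + max_new_tokens + HEADROOM < free_cpu_slots


# Example: a 60k-token prompt with an 8k generation budget in a pool with
# 64k free slots. The current check admits it, but peak CPU usage would
# exceed 68k slots, overflowing the pool mid-decode.
assert can_admit_current(60_000, 64_000) is True
assert can_admit_hisparse(60_000, 8_000, 64_000) is False
```

Under this reading, the fix would be for the scheduler to reserve `input_length + max_new_tokens` (rather than `input_length` alone) against the CPU pool whenever Hisparse is active.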
Reproduction
Though the overflow can happen under any setting with low probability, the following steps increase the chance of reproducing it:
1. Use a DSA model (Deepseek-V3.2-Speciale/GLM-5.1) with PD disaggregation, and set a small `host_to_device_ratio` in the Hisparse config, for example `--hisparse-config "{'top_k':2048,'device_buffer_size':4096,'host_to_device_ratio':1}"`.
2. Prepare long, hard prompts that make the model think for many tokens (inputs in the 50-100k range work best; each generation should exceed 4k tokens).
3. Run many prefill servers against a single decode server to put more pressure on the decode side.
The decode server is then likely to crash from CPU memory exhaustion because of the bug described above.
Environment
sglang-0.5.10