
[Bug] Memory Leak in token_to_kv_pool_allocator since v0.5.3.post3 (not present in v0.5.3) #11970

Description

@yd-oom

Checklist

  • 1. I have searched related issues but cannot get the expected help.
  • 2. The bug has not been fixed in the latest version.
  • 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
  • 4. If the issue you raised is not a bug but a question, please raise a discussion at https://github.com/sgl-project/sglang/discussions/new/choose Otherwise, it will be closed.
  • 5. Please use English, otherwise it will be closed.

Describe the bug

ValueError: token_to_kv_pool_allocator memory leak detected! self.max_total_num_tokens=270331, available_size=644, evictable_size=269628, protected_size=0
A clear and reproducible memory leak is detected in token_to_kv_pool_allocator when using sglang==0.5.3.post3. It happens with both the flashinfer and fa3 attention backends, and it does not happen with v0.5.3 on the same machine.
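
For context, my reading of the numbers in the error (an assumption about what the check compares, not the actual sglang source): the allocator's accounting should cover every KV-cache token, but here 59 tokens end up unaccounted for:

# Values copied from the error message above.
max_total_num_tokens = 270331
available_size = 644
evictable_size = 269628
protected_size = 0

# The leak check apparently expects these to add up to max_total_num_tokens.
accounted = available_size + evictable_size + protected_size
print(max_total_num_tokens - accounted)  # 59 tokens missing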

Reproduction

The minimal reproduction:

python3 -m sglang.launch_server --model-path /model/Qwen3-30B-A3B --served-model-name Qwen3-8B --tp 1 --attention-backend flashinfer

I use evalscope to run the benchmark:

evalscope perf --parallel 256 --model Qwen3-8B --url http://127.0.0.1:30000/v1/chat/completions --api openai --dataset random --min-tokens 1024 --max-tokens 1024 --min-prompt-length 1024 --max-prompt-length 1024 --prefix-length 0 --number 256 --tokenizer-path /model/Qwen3-8B
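
If evalscope is not available, a roughly equivalent load can be generated with the plain openai Python client (a sketch only; the repeated prompt is a crude stand-in for evalscope's random ~1024-token prompts, and the model name and port come from the launch command above):

# Rough equivalent of the evalscope run: 256 concurrent chat requests,
# each with a long prompt and up to 1024 generated tokens.
from concurrent.futures import ThreadPoolExecutor
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:30000/v1", api_key="EMPTY")

def one_request(i: int) -> int:
    prompt = "tell me a very long story about a mountain village. " * 120
    resp = client.chat.completions.create(
        model="Qwen3-8B",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=1024,
        temperature=0.6,
    )
    return resp.usage.completion_tokens

with ThreadPoolExecutor(max_workers=256) as pool:
    totals = list(pool.map(one_request, range(256)))

print(f"finished {len(totals)} requests, {sum(totals)} completion tokens")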

The error log:


:~# python3 -m sglang.launch_server --model-path /model/Qwen3-30B-A3B --served-model-name Qwen3-8B --tp 1 --attention-backend flashinfer
/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py:63: FutureWarning: The pynvml package is deprecated. Please install nvidia-ml-py instead. If you did not install pynvml directly, please report this to the maintainers of the package that installed pynvml for you.
  import pynvml  # type: ignore[import]
`torch_dtype` is deprecated! Use `dtype` instead!
[2025-10-22 20:21:32] server_args=ServerArgs(model_path='/model/Qwen3-30B-A3B', tokenizer_path='/model/Qwen3-30B-A3B', tokenizer_mode='auto', tokenizer_worker_num=1, skip_tokenizer_init=False, load_format='auto', model_loader_extra_config='{}', trust_remote_code=False, modelopt_quant=None, modelopt_checkpoint_restore_path=None, modelopt_checkpoint_save_path=None, context_length=None, is_embedding=False, enable_multimodal=None, revision=None, model_impl='auto', host='127.0.0.1', port=30000, grpc_mode=False, skip_server_warmup=False, warmups=None, nccl_port=None, dtype='auto', quantization=None, quantization_param_path=None, kv_cache_dtype='auto', enable_fp32_lm_head=False, mem_fraction_static=0.863, max_running_requests=None, max_queued_requests=None, max_total_tokens=None, chunked_prefill_size=8192, max_prefill_tokens=16384, schedule_policy='fcfs', enable_priority_scheduling=False, schedule_low_priority_values_first=False, priority_scheduling_preemption_threshold=10, schedule_conservativeness=1.0, page_size=1, hybrid_kvcache_ratio=None, swa_full_tokens_ratio=0.8, disable_hybrid_swa_memory=False, radix_eviction_policy='lru', device='cuda', elastic_ep_backend=None, mooncake_ib_device=None, tp_size=1, pp_size=1, pp_max_micro_batch_size=None, stream_interval=1, stream_output=False, random_seed=397916904, constrained_json_whitespace_pattern=None, constrained_json_disable_any_whitespace=False, watchdog_timeout=300, dist_timeout=None, download_dir=None, base_gpu_id=0, gpu_id_step=1, sleep_on_idle=False, log_level='info', log_level_http=None, log_requests=False, log_requests_level=2, crash_dump_folder=None, crash_on_nan=False, show_time_cost=False, enable_metrics=False, enable_metrics_for_all_schedulers=False, tokenizer_metrics_custom_labels_header='x-custom-labels', tokenizer_metrics_allowed_custom_labels=None, bucket_time_to_first_token=None, bucket_inter_token_latency=None, bucket_e2e_request_latency=None, collect_tokens_histogram=False, prompt_tokens_buckets=None, generation_tokens_buckets=None, decode_log_interval=40, enable_request_time_stats_logging=False, kv_events_config=None, gc_warning_threshold_secs=0.0, enable_trace=False, oltp_traces_endpoint='localhost:4317', api_key=None, served_model_name='Qwen3-8B', weight_version='default', chat_template=None, completion_template=None, file_storage_path='sglang_storage', enable_cache_report=False, reasoning_parser=None, tool_call_parser=None, tool_server=None, sampling_defaults='model', dp_size=1, load_balance_method='round_robin', load_watch_interval=0.1, prefill_round_robin_balance=False, dist_init_addr=None, nnodes=1, node_rank=0, json_model_override_args='{}', preferred_sampling_params=None, enable_lora=None, max_lora_rank=None, lora_target_modules=None, lora_paths=None, max_loaded_loras=None, max_loras_per_batch=8, lora_backend='csgmv', lora_eviction_policy='lru', max_lora_chunk_size=16, attention_backend='flashinfer', decode_attention_backend=None, prefill_attention_backend=None, sampling_backend='flashinfer', grammar_backend='xgrammar', mm_attention_backend=None, nsa_prefill='flashmla_prefill', nsa_decode='fa3', enable_beta_spec=False, speculative_algorithm=None, speculative_draft_model_path=None, speculative_draft_model_revision=None, speculative_draft_load_format=None, speculative_num_steps=None, speculative_eagle_topk=None, speculative_num_draft_tokens=None, speculative_accept_threshold_single=1.0, speculative_accept_threshold_acc=1.0, speculative_token_map=None, speculative_attention_mode='prefill', 
speculative_ngram_min_match_window_size=1, speculative_ngram_max_match_window_size=12, speculative_ngram_min_bfs_breadth=1, speculative_ngram_max_bfs_breadth=10, speculative_ngram_match_type='BFS', speculative_ngram_branch_length=18, speculative_ngram_capacity=10000000, ep_size=1, moe_a2a_backend='none', moe_runner_backend='auto', flashinfer_mxfp4_moe_precision='default', enable_flashinfer_allreduce_fusion=False, deepep_mode='auto', ep_num_redundant_experts=0, ep_dispatch_algorithm='static', init_expert_location='trivial', enable_eplb=False, eplb_algorithm='auto', eplb_rebalance_num_iterations=1000, eplb_rebalance_layers_per_chunk=None, eplb_min_rebalancing_utilization_threshold=1.0, expert_distribution_recorder_mode=None, expert_distribution_recorder_buffer_size=1000, enable_expert_distribution_metrics=False, deepep_config=None, moe_dense_tp_size=None, max_mamba_cache_size=None, mamba_ssm_dtype='float32', mamba_full_memory_ratio=0.9, enable_hierarchical_cache=False, hicache_ratio=2.0, hicache_size=0, hicache_write_policy='write_through', hicache_io_backend='kernel', hicache_mem_layout='layer_first', hicache_storage_backend=None, hicache_storage_prefetch_policy='best_effort', hicache_storage_backend_extra_config=None, enable_lmcache=False, enable_double_sparsity=False, ds_channel_config_path=None, ds_heavy_channel_num=32, ds_heavy_token_num=256, ds_heavy_channel_type='qk', ds_sparse_decode_threshold=4096, cpu_offload_gb=0, offload_group_size=-1, offload_num_in_group=1, offload_prefetch_step=1, offload_mode='cpu', multi_item_scoring_delimiter=None, disable_radix_cache=False, cuda_graph_max_bs=256, cuda_graph_bs=[1, 2, 4, 8, 12, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256], disable_cuda_graph=False, disable_cuda_graph_padding=False, enable_profile_cuda_graph=False, enable_cudagraph_gc=False, enable_nccl_nvls=False, enable_symm_mem=False, disable_flashinfer_cutlass_moe_fp4_allgather=False, enable_tokenizer_batch_encode=False, disable_outlines_disk_cache=False, disable_custom_all_reduce=False, enable_mscclpp=False, enable_torch_symm_mem=False, disable_overlap_schedule=False, enable_mixed_chunk=False, enable_dp_attention=False, enable_dp_lm_head=False, enable_two_batch_overlap=False, enable_single_batch_overlap=False, tbo_token_distribution_threshold=0.48, enable_torch_compile=False, enable_piecewise_cuda_graph=False, torch_compile_max_bs=32, piecewise_cuda_graph_max_tokens=4096, piecewise_cuda_graph_tokens=[4, 8, 12, 16, 20, 24, 28, 32, 48, 64, 80, 96, 112, 128, 144, 160, 176, 192, 208, 224, 240, 256, 288, 320, 352, 384, 416, 448, 480, 512, 640, 768, 896, 1024, 1152, 1280, 1408, 1536, 1664, 1792, 1920, 2048, 2176, 2304, 2432, 2560, 2688, 2816, 2944, 3072, 3200, 3328, 3456, 3584, 3712, 3840, 3968, 4096], torchao_config='', enable_nan_detection=False, enable_p2p_check=False, triton_attention_reduce_in_fp32=False, triton_attention_num_kv_splits=8, triton_attention_split_tile_size=None, num_continuous_decode_steps=1, delete_ckpt_after_loading=False, enable_memory_saver=False, enable_weights_cpu_backup=False, allow_auto_truncate=False, enable_custom_logit_processor=False, flashinfer_mla_disable_ragged=False, disable_shared_experts_fusion=False, disable_chunked_prefix_cache=False, disable_fast_image_processor=False, keep_mm_feature_on_device=False, enable_return_hidden_states=False, scheduler_recv_interval=1, numa_node=None, enable_deterministic_inference=False, enable_dynamic_batch_tokenizer=False, 
dynamic_batch_tokenizer_batch_size=32, dynamic_batch_tokenizer_batch_timeout=0.002, debug_tensor_dump_output_folder=None, debug_tensor_dump_input_file=None, debug_tensor_dump_inject=False, disaggregation_mode='null', disaggregation_transfer_backend='mooncake', disaggregation_bootstrap_port=8998, disaggregation_decode_tp=None, disaggregation_decode_dp=None, disaggregation_prefill_pp=1, disaggregation_ib_device=None, disaggregation_decode_enable_offload_kvcache=False, num_reserved_decode_tokens=512, disaggregation_decode_polling_interval=1, custom_weight_loader=[], weight_loader_disable_mmap=False, remote_instance_weight_loader_seed_instance_ip=None, remote_instance_weight_loader_seed_instance_service_port=None, remote_instance_weight_loader_send_weights_group_ports=None, enable_pdmux=False, pdmux_config_path=None, sm_group_num=8)
[2025-10-22 20:21:33] Using default HuggingFace chat template with detected content format: string
/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py:63: FutureWarning: The pynvml package is deprecated. Please install nvidia-ml-py instead. If you did not install pynvml directly, please report this to the maintainers of the package that installed pynvml for you.
  import pynvml  # type: ignore[import]
/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py:63: FutureWarning: The pynvml package is deprecated. Please install nvidia-ml-py instead. If you did not install pynvml directly, please report this to the maintainers of the package that installed pynvml for you.
  import pynvml  # type: ignore[import]
`torch_dtype` is deprecated! Use `dtype` instead!
[2025-10-22 20:21:38] Init torch distributed begin.
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[2025-10-22 20:21:38] Init torch distributed ends. mem usage=0.00 GB
[2025-10-22 20:21:39] Load weight begin. avail mem=94.69 GB
Loading safetensors checkpoint shards:   0% Completed | 0/16 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:   6% Completed | 1/16 [00:00<00:09,  1.55it/s]
Loading safetensors checkpoint shards:  12% Completed | 2/16 [00:01<00:09,  1.46it/s]
Loading safetensors checkpoint shards:  19% Completed | 3/16 [00:02<00:09,  1.43it/s]
Loading safetensors checkpoint shards:  25% Completed | 4/16 [00:02<00:08,  1.42it/s]
Loading safetensors checkpoint shards:  31% Completed | 5/16 [00:03<00:07,  1.41it/s]
Loading safetensors checkpoint shards:  38% Completed | 6/16 [00:04<00:07,  1.41it/s]
Loading safetensors checkpoint shards:  44% Completed | 7/16 [00:04<00:06,  1.41it/s]
Loading safetensors checkpoint shards:  50% Completed | 8/16 [00:05<00:05,  1.41it/s]
Loading safetensors checkpoint shards:  56% Completed | 9/16 [00:06<00:04,  1.41it/s]
Loading safetensors checkpoint shards:  62% Completed | 10/16 [00:07<00:04,  1.41it/s]
Loading safetensors checkpoint shards:  69% Completed | 11/16 [00:07<00:03,  1.41it/s]
Loading safetensors checkpoint shards:  75% Completed | 12/16 [00:08<00:02,  1.41it/s]
Loading safetensors checkpoint shards:  81% Completed | 13/16 [00:09<00:02,  1.41it/s]
Loading safetensors checkpoint shards:  88% Completed | 14/16 [00:09<00:01,  1.41it/s]
Loading safetensors checkpoint shards:  94% Completed | 15/16 [00:10<00:00,  1.40it/s]
Loading safetensors checkpoint shards: 100% Completed | 16/16 [00:10<00:00,  1.80it/s]
Loading safetensors checkpoint shards: 100% Completed | 16/16 [00:10<00:00,  1.48it/s]

[2025-10-22 20:21:50] Load weight end. type=Qwen3MoeForCausalLM, dtype=torch.bfloat16, avail mem=37.73 GB, mem usage=56.96 GB.
[2025-10-22 20:21:50] Using KV cache dtype: torch.bfloat16
[2025-10-22 20:21:50] KV Cache is allocated. #tokens: 270331, K size: 12.37 GB, V size: 12.37 GB
[2025-10-22 20:21:50] Memory pool end. avail mem=12.47 GB
[2025-10-22 20:21:51] Capture cuda graph begin. This can take up to several minutes. avail mem=11.89 GB
[2025-10-22 20:21:51] Capture cuda graph bs [1, 2, 4, 8, 12, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256]
Capturing batches (bs=256 avail_mem=11.71 GB):   0%|                                                                                                                                                                         | 0/36 [00:00<?, ?it/s][2025-10-22 20:21:51] Config file not found at /usr/local/lib/python3.10/dist-packages/sglang/srt/layers/moe/fused_moe_triton/configs/triton_3_4_0/E=128,N=768,device_name=NVIDIA_H20.json. Fallback to triton version 3.3.1 and use MoE kernel config from /usr/local/lib/python3.10/dist-packages/sglang/srt/layers/moe/fused_moe_triton/configs/triton_3_3_1/E=128,N=768,device_name=NVIDIA_H20.json. Performance might be sub-optimal!
Capturing batches (bs=1 avail_mem=10.88 GB): 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 36/36 [00:05<00:00,  6.77it/s]
[2025-10-22 20:21:56] Capture cuda graph end. Time elapsed: 5.80 s. mem usage=1.03 GB. avail mem=10.87 GB.
[2025-10-22 20:21:57] max_total_num_tokens=270331, chunked_prefill_size=8192, max_prefill_tokens=16384, max_running_requests=3379, context_len=40960, available_gpu_mem=10.87 GB
[2025-10-22 20:21:57] INFO:     Started server process [50633]
[2025-10-22 20:21:57] INFO:     Waiting for application startup.
[2025-10-22 20:21:57] Using default chat sampling params from model generation config: {'repetition_penalty': 1.0, 'temperature': 0.6, 'top_k': 20, 'top_p': 0.95}
[2025-10-22 20:21:57] Using default chat sampling params from model generation config: {'repetition_penalty': 1.0, 'temperature': 0.6, 'top_k': 20, 'top_p': 0.95}
[2025-10-22 20:21:57] INFO:     Application startup complete.
[2025-10-22 20:21:57] INFO:     Uvicorn running on http://127.0.0.1:30000 (Press CTRL+C to quit)
......

[2025-10-22 20:23:04] KV cache pool is full. Retract requests. #retracted_reqs: 1, #aborted_retracted_reqs: 0, #new_token_ratio: 0.8993 -> 0.9261
[2025-10-22 20:23:05] KV cache pool is full. Retract requests. #retracted_reqs: 1, #aborted_retracted_reqs: 0, #new_token_ratio: 0.9130 -> 0.9397
[2025-10-22 20:23:05] Decode batch. #running-req: 137, #token: 269123, token usage: 1.00, cuda graph: True, gen throughput (token/s): 3882.74, #queue-req: 119, 
[2025-10-22 20:23:05] KV cache pool is full. Retract requests. #retracted_reqs: 1, #aborted_retracted_reqs: 0, #new_token_ratio: 0.9267 -> 0.9534
[2025-10-22 20:23:06] KV cache pool is full. Retract requests. #retracted_reqs: 1, #aborted_retracted_reqs: 0, #new_token_ratio: 0.9394 -> 0.9680
[2025-10-22 20:23:07] Decode batch. #running-req: 134, #token: 268540, token usage: 0.99, cuda graph: True, gen throughput (token/s): 3879.49, #queue-req: 121, 
[2025-10-22 20:23:07] KV cache pool is full. Retract requests. #retracted_reqs: 1, #aborted_retracted_reqs: 0, #new_token_ratio: 0.9390 -> 0.9971
[2025-10-22 20:23:08] KV cache pool is full. Retract requests. #retracted_reqs: 1, #aborted_retracted_reqs: 0, #new_token_ratio: 0.9830 -> 1.0000
[2025-10-22 20:23:08] Prefill batch. #new-seq: 9, #new-token: 8192, #cached-token: 27, token usage: 0.00, #running-req: 0, #queue-req: 114, 
[2025-10-22 20:23:08] Prefill batch. #new-seq: 9, #new-token: 8192, #cached-token: 24, token usage: 0.03, #running-req: 8, #queue-req: 106, 
[2025-10-22 20:23:08] Prefill batch. #new-seq: 9, #new-token: 8192, #cached-token: 24, token usage: 0.06, #running-req: 16, #queue-req: 98, 
[2025-10-22 20:23:09] Prefill batch. #new-seq: 9, #new-token: 8192, #cached-token: 24, token usage: 0.09, #running-req: 24, #queue-req: 90, 
[2025-10-22 20:23:09] Prefill batch. #new-seq: 9, #new-token: 8192, #cached-token: 24, token usage: 0.12, #running-req: 32, #queue-req: 82, 
[2025-10-22 20:23:10] Prefill batch. #new-seq: 9, #new-token: 8192, #cached-token: 24, token usage: 0.15, #running-req: 40, #queue-req: 74, 
[2025-10-22 20:23:10] Prefill batch. #new-seq: 9, #new-token: 8192, #cached-token: 24, token usage: 0.18, #running-req: 48, #queue-req: 66, 
[2025-10-22 20:23:10] Prefill batch. #new-seq: 9, #new-token: 8192, #cached-token: 24, token usage: 0.21, #running-req: 56, #queue-req: 58, 
[2025-10-22 20:23:11] Prefill batch. #new-seq: 6, #new-token: 8192, #cached-token: 15, token usage: 0.24, #running-req: 64, #queue-req: 53, 
[2025-10-22 20:23:11] Prefill batch. #new-seq: 7, #new-token: 8192, #cached-token: 18, token usage: 0.27, #running-req: 69, #queue-req: 47, 
[2025-10-22 20:23:12] Prefill batch. #new-seq: 6, #new-token: 8192, #cached-token: 15, token usage: 0.30, #running-req: 75, #queue-req: 42, 
[2025-10-22 20:23:12] Prefill batch. #new-seq: 7, #new-token: 8192, #cached-token: 18, token usage: 0.33, #running-req: 80, #queue-req: 36, 
[2025-10-22 20:23:13] Prefill batch. #new-seq: 6, #new-token: 8192, #cached-token: 15, token usage: 0.36, #running-req: 86, #queue-req: 31, 
[2025-10-22 20:23:13] Prefill batch. #new-seq: 6, #new-token: 8192, #cached-token: 15, token usage: 0.39, #running-req: 91, #queue-req: 26, 
[2025-10-22 20:23:14] Prefill batch. #new-seq: 5, #new-token: 8192, #cached-token: 12, token usage: 0.42, #running-req: 96, #queue-req: 22, 
[2025-10-22 20:23:14] Prefill batch. #new-seq: 6, #new-token: 8192, #cached-token: 15, token usage: 0.45, #running-req: 100, #queue-req: 17, 
[2025-10-22 20:23:14] Prefill batch. #new-seq: 6, #new-token: 8192, #cached-token: 15, token usage: 0.49, #running-req: 105, #queue-req: 12, 
[2025-10-22 20:23:15] Prefill batch. #new-seq: 5, #new-token: 8192, #cached-token: 12, token usage: 0.52, #running-req: 110, #queue-req: 8, 
[2025-10-22 20:23:15] Prefill batch. #new-seq: 5, #new-token: 8192, #cached-token: 12, token usage: 0.55, #running-req: 114, #queue-req: 4, 
[2025-10-22 20:23:16] Prefill batch. #new-seq: 5, #new-token: 8192, #cached-token: 12, token usage: 0.58, #running-req: 118, #queue-req: 0, 
[2025-10-22 20:23:16] Prefill batch. #new-seq: 1, #new-token: 529, #cached-token: 0, token usage: 0.61, #running-req: 122, #queue-req: 0, 
[2025-10-22 20:23:17] Decode batch. #running-req: 123, #token: 164923, token usage: 0.61, cuda graph: True, gen throughput (token/s): 500.00, #queue-req: 0, 
[2025-10-22 20:23:18] Decode batch. #running-req: 121, #token: 165696, token usage: 0.61, cuda graph: True, gen throughput (token/s): 4094.45, #queue-req: 0, 
[2025-10-22 20:23:20] Decode batch. #running-req: 118, #token: 164352, token usage: 0.61, cuda graph: True, gen throughput (token/s): 4056.96, #queue-req: 0, 
[2025-10-22 20:23:21] Decode batch. #running-req: 116, #token: 162896, token usage: 0.60, cuda graph: True, gen throughput (token/s): 3968.94, #queue-req: 0, 
[2025-10-22 20:23:22] Decode batch. #running-req: 113, #token: 161321, token usage: 0.60, cuda graph: True, gen throughput (token/s): 3860.62, #queue-req: 0, 
[2025-10-22 20:23:23] Decode batch. #running-req: 109, #token: 159622, token usage: 0.59, cuda graph: True, gen throughput (token/s): 3842.54, #queue-req: 0, 
[2025-10-22 20:23:24] Decode batch. #running-req: 106, #token: 157796, token usage: 0.58, cuda graph: True, gen throughput (token/s): 3744.57, #queue-req: 0, 
[2025-10-22 20:23:25] Decode batch. #running-req: 103, #token: 155837, token usage: 0.58, cuda graph: True, gen throughput (token/s): 3692.80, #queue-req: 0, 
[2025-10-22 20:23:26] Decode batch. #running-req: 99, #token: 151692, token usage: 0.56, cuda graph: True, gen throughput (token/s): 3599.84, #queue-req: 0, 
[2025-10-22 20:23:28] Decode batch. #running-req: 95, #token: 147401, token usage: 0.55, cuda graph: True, gen throughput (token/s): 3529.46, #queue-req: 0, 
[2025-10-22 20:23:29] Decode batch. #running-req: 92, #token: 142957, token usage: 0.53, cuda graph: True, gen throughput (token/s): 3410.78, #queue-req: 0, 
[2025-10-22 20:23:30] Decode batch. #running-req: 87, #token: 138351, token usage: 0.51, cuda graph: True, gen throughput (token/s): 3304.70, #queue-req: 0, 
[2025-10-22 20:23:31] Decode batch. #running-req: 83, #token: 133576, token usage: 0.49, cuda graph: True, gen throughput (token/s): 3220.77, #queue-req: 0, 
[2025-10-22 20:23:32] Decode batch. #running-req: 79, #token: 126579, token usage: 0.47, cuda graph: True, gen throughput (token/s): 3119.00, #queue-req: 0, 
[2025-10-22 20:23:33] Decode batch. #running-req: 74, #token: 121439, token usage: 0.45, cuda graph: True, gen throughput (token/s): 3050.09, #queue-req: 0, 
[2025-10-22 20:23:34] Decode batch. #running-req: 69, #token: 112013, token usage: 0.41, cuda graph: True, gen throughput (token/s): 2958.13, #queue-req: 0, 
[2025-10-22 20:23:35] Decode batch. #running-req: 64, #token: 106540, token usage: 0.39, cuda graph: True, gen throughput (token/s): 2841.08, #queue-req: 0, 
[2025-10-22 20:23:36] Decode batch. #running-req: 64, #token: 109100, token usage: 0.40, cuda graph: True, gen throughput (token/s): 2818.36, #queue-req: 0, 
[2025-10-22 20:23:37] Decode batch. #running-req: 64, #token: 111660, token usage: 0.41, cuda graph: True, gen throughput (token/s): 2819.91, #queue-req: 0, 
[2025-10-22 20:23:37] Decode batch. #running-req: 64, #token: 114220, token usage: 0.42, cuda graph: True, gen throughput (token/s): 2813.07, #queue-req: 0, 
[2025-10-22 20:23:38] Decode batch. #running-req: 64, #token: 116780, token usage: 0.43, cuda graph: True, gen throughput (token/s): 2813.02, #queue-req: 0, 
[2025-10-22 20:23:39] Decode batch. #running-req: 64, #token: 119340, token usage: 0.44, cuda graph: True, gen throughput (token/s): 2792.02, #queue-req: 0, 
[2025-10-22 20:23:40] Decode batch. #running-req: 64, #token: 121900, token usage: 0.45, cuda graph: True, gen throughput (token/s): 2785.32, #queue-req: 0, 
[2025-10-22 20:23:41] Decode batch. #running-req: 64, #token: 124460, token usage: 0.46, cuda graph: True, gen throughput (token/s): 2763.89, #queue-req: 0, 
[2025-10-22 20:23:42] Decode batch. #running-req: 64, #token: 127020, token usage: 0.47, cuda graph: True, gen throughput (token/s): 2749.54, #queue-req: 0, 
[2025-10-22 20:23:43] Decode batch. #running-req: 63, #token: 127555, token usage: 0.47, cuda graph: True, gen throughput (token/s): 2709.10, #queue-req: 0, 
[2025-10-22 20:23:44] Scheduler hit an exception: Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/sglang/srt/managers/scheduler.py", line 3056, in run_scheduler_process
    scheduler.event_loop_overlap()
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 120, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/sglang/srt/managers/scheduler.py", line 1053, in event_loop_overlap
    self.self_check_during_idle()
  File "/usr/local/lib/python3.10/dist-packages/sglang/srt/managers/scheduler.py", line 1677, in self_check_during_idle
    self.check_memory()
  File "/usr/local/lib/python3.10/dist-packages/sglang/srt/managers/scheduler.py", line 1732, in check_memory
    raise ValueError(msg)
ValueError: token_to_kv_pool_allocator memory leak detected! self.max_total_num_tokens=270331, available_size=644, evictable_size=269628, protected_size=0
[2025-10-22 20:30:53] SIGQUIT received. signum=None, frame=None. It usually means one child failed.
Killed

Environment

root@iZbp19uogy8wkqfqtk61nzZ:~# python3 -m sglang.check_env
/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py:63: FutureWarning: The pynvml package is deprecated. Please install nvidia-ml-py instead. If you did not install pynvml directly, please report this to the maintainers of the package that installed pynvml for you.
  import pynvml  # type: ignore[import]
Python: 3.10.12 (main, Aug 15 2025, 14:32:43) [GCC 11.4.0]
CUDA available: True
GPU 0,1: NVIDIA H20
GPU 0,1 Compute Capability: 9.0
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 12.8, V12.8.93
CUDA Driver Version: 570.133.20
PyTorch: 2.8.0+cu128
sglang: 0.5.3.post3
sgl_kernel: 0.3.15
flashinfer_python: 0.4.0
triton: 3.4.0
transformers: 4.57.1
torchao: 0.9.0
numpy: 1.26.4
aiohttp: 3.13.1
fastapi: 0.119.1
hf_transfer: 0.1.9
huggingface_hub: 0.35.3
interegular: 0.3.3
modelscope: 1.31.0
orjson: 3.11.3
outlines: 0.1.11
packaging: 25.0
psutil: 7.1.1
pydantic: 2.12.3
python-multipart: 0.0.12
pyzmq: 27.1.0
uvicorn: 0.38.0
uvloop: 0.22.1
vllm: Module Not Found
xgrammar: 0.1.25
openai: 1.99.1
tiktoken: 0.12.0
anthropic: 0.71.0
litellm: Module Not Found
decord: 0.6.0
NVIDIA Topology: 
        GPU0    GPU1    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      NV18    0-31    0               N/A
GPU1    NV18     X      0-31    0               N/A

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

Hypervisor vendor: KVM
ulimit soft: 65535
