Description
Checklist
- 1. I have searched related issues but cannot get the expected help.
- 2. The bug has not been fixed in the latest version.
- 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
- 4. If the issue you raised is not a bug but a question, please start a discussion at https://github.com/sgl-project/sglang/discussions/new/choose. Otherwise, it will be closed.
- 5. Please use English, otherwise it will be closed.
Describe the bug
ValueError: token_to_kv_pool_allocator memory leak detected! self.max_total_num_tokens=270331, available_size=644, evictable_size=269628, protected_size=0
A clear and reproducible memory leak in the token_to_kv_pool_allocator is detected when using sglang==0.5.3.post3. It happens with both the flashinfer and fa3 attention backends, and it does not happen with v0.5.3 on the same machine. Note that available_size + evictable_size = 644 + 269628 = 270272, which is 59 tokens short of max_total_num_tokens=270331, i.e. 59 KV-cache slots are unaccounted for once the scheduler goes idle.
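For context, the traceback below points at Scheduler.check_memory, which appears to enforce an idle-time invariant that every KV slot is either free or evictable. The following is only a simplified sketch of that invariant (the parameter and helper names are my assumptions, not the exact sglang code), but it matches the numbers in the error message:

```python
# Simplified sketch of the idle-time consistency check that raises this error.
# NOT the exact sglang implementation; names and structure are assumptions
# inferred from the traceback and the error message.
def check_memory(allocator, tree_cache, max_total_num_tokens: int) -> None:
    available_size = allocator.available_size()   # free KV slots
    evictable_size = tree_cache.evictable_size()  # slots held by finished, cacheable requests
    protected_size = tree_cache.protected_size()  # slots pinned by running requests

    # When the scheduler is idle, every KV slot should be either free or evictable.
    # Here: 644 + 269628 = 270272 != 270331, so 59 slots are leaked and this raises.
    if available_size + evictable_size != max_total_num_tokens:
        raise ValueError(
            "token_to_kv_pool_allocator memory leak detected! "
            f"{max_total_num_tokens=}, {available_size=}, "
            f"{evictable_size=}, {protected_size=}"
        )
```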
Reproduction
The minimal reproduction:
python3 -m sglang.launch_server --model-path /model/Qwen3-30B-A3B --served-model-name Qwen3-8B --tp 1 --attention-backend flashinfer
I use evalscope to run the benchmark:
evalscope perf --parallel 256 --model Qwen3-8B --url http://127.0.0.1:30000/v1/chat/completions --api openai --dataset random --min-tokens 1024 --max-tokens 1024 --min-prompt-length 1024 --max-prompt-length 1024 --prefix-length 0 --number 256 --tokenizer-path /model/Qwen3-8B
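If evalscope is not available, the load shape can be approximated with a plain OpenAI-compatible client. This is only a rough sketch of the benchmark traffic (256 parallel chat requests, ~1024-token prompts, max_tokens=1024); the prompt construction, the `one_request` helper, and the dummy API key are my assumptions, and the `--min-tokens 1024` pinning from the evalscope command is not reproduced here:

```python
# Rough stand-in for the evalscope run above: 256 parallel chat completions with
# ~1024-token prompts and max_tokens=1024 against the local sglang server.
from concurrent.futures import ThreadPoolExecutor

from openai import OpenAI  # pip install openai

client = OpenAI(base_url="http://127.0.0.1:30000/v1", api_key="EMPTY")
prompt = "repeat this sentence " * 256  # crude stand-in for a ~1024-token random prompt

def one_request(_):
    return client.chat.completions.create(
        model="Qwen3-8B",  # served-model-name from the launch command
        messages=[{"role": "user", "content": prompt}],
        max_tokens=1024,
    )

with ThreadPoolExecutor(max_workers=256) as pool:
    list(pool.map(one_request, range(256)))
```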
The error log:
:~# python3 -m sglang.launch_server --model-path /model/Qwen3-30B-A3B --served-model-name Qwen3-8B --tp 1 --attention-backend flashinfer
/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py:63: FutureWarning: The pynvml package is deprecated. Please install nvidia-ml-py instead. If you did not install pynvml directly, please report this to the maintainers of the package that installed pynvml for you.
import pynvml # type: ignore[import]
`torch_dtype` is deprecated! Use `dtype` instead!
[2025-10-22 20:21:32] server_args=ServerArgs(model_path='/model/Qwen3-30B-A3B', tokenizer_path='/model/Qwen3-30B-A3B', tokenizer_mode='auto', tokenizer_worker_num=1, skip_tokenizer_init=False, load_format='auto', model_loader_extra_config='{}', trust_remote_code=False, modelopt_quant=None, modelopt_checkpoint_restore_path=None, modelopt_checkpoint_save_path=None, context_length=None, is_embedding=False, enable_multimodal=None, revision=None, model_impl='auto', host='127.0.0.1', port=30000, grpc_mode=False, skip_server_warmup=False, warmups=None, nccl_port=None, dtype='auto', quantization=None, quantization_param_path=None, kv_cache_dtype='auto', enable_fp32_lm_head=False, mem_fraction_static=0.863, max_running_requests=None, max_queued_requests=None, max_total_tokens=None, chunked_prefill_size=8192, max_prefill_tokens=16384, schedule_policy='fcfs', enable_priority_scheduling=False, schedule_low_priority_values_first=False, priority_scheduling_preemption_threshold=10, schedule_conservativeness=1.0, page_size=1, hybrid_kvcache_ratio=None, swa_full_tokens_ratio=0.8, disable_hybrid_swa_memory=False, radix_eviction_policy='lru', device='cuda', elastic_ep_backend=None, mooncake_ib_device=None, tp_size=1, pp_size=1, pp_max_micro_batch_size=None, stream_interval=1, stream_output=False, random_seed=397916904, constrained_json_whitespace_pattern=None, constrained_json_disable_any_whitespace=False, watchdog_timeout=300, dist_timeout=None, download_dir=None, base_gpu_id=0, gpu_id_step=1, sleep_on_idle=False, log_level='info', log_level_http=None, log_requests=False, log_requests_level=2, crash_dump_folder=None, crash_on_nan=False, show_time_cost=False, enable_metrics=False, enable_metrics_for_all_schedulers=False, tokenizer_metrics_custom_labels_header='x-custom-labels', tokenizer_metrics_allowed_custom_labels=None, bucket_time_to_first_token=None, bucket_inter_token_latency=None, bucket_e2e_request_latency=None, collect_tokens_histogram=False, prompt_tokens_buckets=None, generation_tokens_buckets=None, decode_log_interval=40, enable_request_time_stats_logging=False, kv_events_config=None, gc_warning_threshold_secs=0.0, enable_trace=False, oltp_traces_endpoint='localhost:4317', api_key=None, served_model_name='Qwen3-8B', weight_version='default', chat_template=None, completion_template=None, file_storage_path='sglang_storage', enable_cache_report=False, reasoning_parser=None, tool_call_parser=None, tool_server=None, sampling_defaults='model', dp_size=1, load_balance_method='round_robin', load_watch_interval=0.1, prefill_round_robin_balance=False, dist_init_addr=None, nnodes=1, node_rank=0, json_model_override_args='{}', preferred_sampling_params=None, enable_lora=None, max_lora_rank=None, lora_target_modules=None, lora_paths=None, max_loaded_loras=None, max_loras_per_batch=8, lora_backend='csgmv', lora_eviction_policy='lru', max_lora_chunk_size=16, attention_backend='flashinfer', decode_attention_backend=None, prefill_attention_backend=None, sampling_backend='flashinfer', grammar_backend='xgrammar', mm_attention_backend=None, nsa_prefill='flashmla_prefill', nsa_decode='fa3', enable_beta_spec=False, speculative_algorithm=None, speculative_draft_model_path=None, speculative_draft_model_revision=None, speculative_draft_load_format=None, speculative_num_steps=None, speculative_eagle_topk=None, speculative_num_draft_tokens=None, speculative_accept_threshold_single=1.0, speculative_accept_threshold_acc=1.0, speculative_token_map=None, speculative_attention_mode='prefill', 
speculative_ngram_min_match_window_size=1, speculative_ngram_max_match_window_size=12, speculative_ngram_min_bfs_breadth=1, speculative_ngram_max_bfs_breadth=10, speculative_ngram_match_type='BFS', speculative_ngram_branch_length=18, speculative_ngram_capacity=10000000, ep_size=1, moe_a2a_backend='none', moe_runner_backend='auto', flashinfer_mxfp4_moe_precision='default', enable_flashinfer_allreduce_fusion=False, deepep_mode='auto', ep_num_redundant_experts=0, ep_dispatch_algorithm='static', init_expert_location='trivial', enable_eplb=False, eplb_algorithm='auto', eplb_rebalance_num_iterations=1000, eplb_rebalance_layers_per_chunk=None, eplb_min_rebalancing_utilization_threshold=1.0, expert_distribution_recorder_mode=None, expert_distribution_recorder_buffer_size=1000, enable_expert_distribution_metrics=False, deepep_config=None, moe_dense_tp_size=None, max_mamba_cache_size=None, mamba_ssm_dtype='float32', mamba_full_memory_ratio=0.9, enable_hierarchical_cache=False, hicache_ratio=2.0, hicache_size=0, hicache_write_policy='write_through', hicache_io_backend='kernel', hicache_mem_layout='layer_first', hicache_storage_backend=None, hicache_storage_prefetch_policy='best_effort', hicache_storage_backend_extra_config=None, enable_lmcache=False, enable_double_sparsity=False, ds_channel_config_path=None, ds_heavy_channel_num=32, ds_heavy_token_num=256, ds_heavy_channel_type='qk', ds_sparse_decode_threshold=4096, cpu_offload_gb=0, offload_group_size=-1, offload_num_in_group=1, offload_prefetch_step=1, offload_mode='cpu', multi_item_scoring_delimiter=None, disable_radix_cache=False, cuda_graph_max_bs=256, cuda_graph_bs=[1, 2, 4, 8, 12, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256], disable_cuda_graph=False, disable_cuda_graph_padding=False, enable_profile_cuda_graph=False, enable_cudagraph_gc=False, enable_nccl_nvls=False, enable_symm_mem=False, disable_flashinfer_cutlass_moe_fp4_allgather=False, enable_tokenizer_batch_encode=False, disable_outlines_disk_cache=False, disable_custom_all_reduce=False, enable_mscclpp=False, enable_torch_symm_mem=False, disable_overlap_schedule=False, enable_mixed_chunk=False, enable_dp_attention=False, enable_dp_lm_head=False, enable_two_batch_overlap=False, enable_single_batch_overlap=False, tbo_token_distribution_threshold=0.48, enable_torch_compile=False, enable_piecewise_cuda_graph=False, torch_compile_max_bs=32, piecewise_cuda_graph_max_tokens=4096, piecewise_cuda_graph_tokens=[4, 8, 12, 16, 20, 24, 28, 32, 48, 64, 80, 96, 112, 128, 144, 160, 176, 192, 208, 224, 240, 256, 288, 320, 352, 384, 416, 448, 480, 512, 640, 768, 896, 1024, 1152, 1280, 1408, 1536, 1664, 1792, 1920, 2048, 2176, 2304, 2432, 2560, 2688, 2816, 2944, 3072, 3200, 3328, 3456, 3584, 3712, 3840, 3968, 4096], torchao_config='', enable_nan_detection=False, enable_p2p_check=False, triton_attention_reduce_in_fp32=False, triton_attention_num_kv_splits=8, triton_attention_split_tile_size=None, num_continuous_decode_steps=1, delete_ckpt_after_loading=False, enable_memory_saver=False, enable_weights_cpu_backup=False, allow_auto_truncate=False, enable_custom_logit_processor=False, flashinfer_mla_disable_ragged=False, disable_shared_experts_fusion=False, disable_chunked_prefix_cache=False, disable_fast_image_processor=False, keep_mm_feature_on_device=False, enable_return_hidden_states=False, scheduler_recv_interval=1, numa_node=None, enable_deterministic_inference=False, enable_dynamic_batch_tokenizer=False, 
dynamic_batch_tokenizer_batch_size=32, dynamic_batch_tokenizer_batch_timeout=0.002, debug_tensor_dump_output_folder=None, debug_tensor_dump_input_file=None, debug_tensor_dump_inject=False, disaggregation_mode='null', disaggregation_transfer_backend='mooncake', disaggregation_bootstrap_port=8998, disaggregation_decode_tp=None, disaggregation_decode_dp=None, disaggregation_prefill_pp=1, disaggregation_ib_device=None, disaggregation_decode_enable_offload_kvcache=False, num_reserved_decode_tokens=512, disaggregation_decode_polling_interval=1, custom_weight_loader=[], weight_loader_disable_mmap=False, remote_instance_weight_loader_seed_instance_ip=None, remote_instance_weight_loader_seed_instance_service_port=None, remote_instance_weight_loader_send_weights_group_ports=None, enable_pdmux=False, pdmux_config_path=None, sm_group_num=8)
[2025-10-22 20:21:33] Using default HuggingFace chat template with detected content format: string
/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py:63: FutureWarning: The pynvml package is deprecated. Please install nvidia-ml-py instead. If you did not install pynvml directly, please report this to the maintainers of the package that installed pynvml for you.
import pynvml # type: ignore[import]
/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py:63: FutureWarning: The pynvml package is deprecated. Please install nvidia-ml-py instead. If you did not install pynvml directly, please report this to the maintainers of the package that installed pynvml for you.
import pynvml # type: ignore[import]
`torch_dtype` is deprecated! Use `dtype` instead!
[2025-10-22 20:21:38] Init torch distributed begin.
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[2025-10-22 20:21:38] Init torch distributed ends. mem usage=0.00 GB
[2025-10-22 20:21:39] Load weight begin. avail mem=94.69 GB
Loading safetensors checkpoint shards: 0% Completed | 0/16 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 6% Completed | 1/16 [00:00<00:09, 1.55it/s]
Loading safetensors checkpoint shards: 12% Completed | 2/16 [00:01<00:09, 1.46it/s]
Loading safetensors checkpoint shards: 19% Completed | 3/16 [00:02<00:09, 1.43it/s]
Loading safetensors checkpoint shards: 25% Completed | 4/16 [00:02<00:08, 1.42it/s]
Loading safetensors checkpoint shards: 31% Completed | 5/16 [00:03<00:07, 1.41it/s]
Loading safetensors checkpoint shards: 38% Completed | 6/16 [00:04<00:07, 1.41it/s]
Loading safetensors checkpoint shards: 44% Completed | 7/16 [00:04<00:06, 1.41it/s]
Loading safetensors checkpoint shards: 50% Completed | 8/16 [00:05<00:05, 1.41it/s]
Loading safetensors checkpoint shards: 56% Completed | 9/16 [00:06<00:04, 1.41it/s]
Loading safetensors checkpoint shards: 62% Completed | 10/16 [00:07<00:04, 1.41it/s]
Loading safetensors checkpoint shards: 69% Completed | 11/16 [00:07<00:03, 1.41it/s]
Loading safetensors checkpoint shards: 75% Completed | 12/16 [00:08<00:02, 1.41it/s]
Loading safetensors checkpoint shards: 81% Completed | 13/16 [00:09<00:02, 1.41it/s]
Loading safetensors checkpoint shards: 88% Completed | 14/16 [00:09<00:01, 1.41it/s]
Loading safetensors checkpoint shards: 94% Completed | 15/16 [00:10<00:00, 1.40it/s]
Loading safetensors checkpoint shards: 100% Completed | 16/16 [00:10<00:00, 1.80it/s]
Loading safetensors checkpoint shards: 100% Completed | 16/16 [00:10<00:00, 1.48it/s]
[2025-10-22 20:21:50] Load weight end. type=Qwen3MoeForCausalLM, dtype=torch.bfloat16, avail mem=37.73 GB, mem usage=56.96 GB.
[2025-10-22 20:21:50] Using KV cache dtype: torch.bfloat16
[2025-10-22 20:21:50] KV Cache is allocated. #tokens: 270331, K size: 12.37 GB, V size: 12.37 GB
[2025-10-22 20:21:50] Memory pool end. avail mem=12.47 GB
[2025-10-22 20:21:51] Capture cuda graph begin. This can take up to several minutes. avail mem=11.89 GB
[2025-10-22 20:21:51] Capture cuda graph bs [1, 2, 4, 8, 12, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256]
Capturing batches (bs=256 avail_mem=11.71 GB): 0%| | 0/36 [00:00<?, ?it/s][2025-10-22 20:21:51] Config file not found at /usr/local/lib/python3.10/dist-packages/sglang/srt/layers/moe/fused_moe_triton/configs/triton_3_4_0/E=128,N=768,device_name=NVIDIA_H20.json. Fallback to triton version 3.3.1 and use MoE kernel config from /usr/local/lib/python3.10/dist-packages/sglang/srt/layers/moe/fused_moe_triton/configs/triton_3_3_1/E=128,N=768,device_name=NVIDIA_H20.json. Performance might be sub-optimal!
Capturing batches (bs=1 avail_mem=10.88 GB): 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 36/36 [00:05<00:00, 6.77it/s]
[2025-10-22 20:21:56] Capture cuda graph end. Time elapsed: 5.80 s. mem usage=1.03 GB. avail mem=10.87 GB.
[2025-10-22 20:21:57] max_total_num_tokens=270331, chunked_prefill_size=8192, max_prefill_tokens=16384, max_running_requests=3379, context_len=40960, available_gpu_mem=10.87 GB
[2025-10-22 20:21:57] INFO: Started server process [50633]
[2025-10-22 20:21:57] INFO: Waiting for application startup.
[2025-10-22 20:21:57] Using default chat sampling params from model generation config: {'repetition_penalty': 1.0, 'temperature': 0.6, 'top_k': 20, 'top_p': 0.95}
[2025-10-22 20:21:57] Using default chat sampling params from model generation config: {'repetition_penalty': 1.0, 'temperature': 0.6, 'top_k': 20, 'top_p': 0.95}
[2025-10-22 20:21:57] INFO: Application startup complete.
[2025-10-22 20:21:57] INFO: Uvicorn running on http://127.0.0.1:30000 (Press CTRL+C to quit)
......
[2025-10-22 20:23:04] KV cache pool is full. Retract requests. #retracted_reqs: 1, #aborted_retracted_reqs: 0, #new_token_ratio: 0.8993 -> 0.9261
[2025-10-22 20:23:05] KV cache pool is full. Retract requests. #retracted_reqs: 1, #aborted_retracted_reqs: 0, #new_token_ratio: 0.9130 -> 0.9397
[2025-10-22 20:23:05] Decode batch. #running-req: 137, #token: 269123, token usage: 1.00, cuda graph: True, gen throughput (token/s): 3882.74, #queue-req: 119,
[2025-10-22 20:23:05] KV cache pool is full. Retract requests. #retracted_reqs: 1, #aborted_retracted_reqs: 0, #new_token_ratio: 0.9267 -> 0.9534
[2025-10-22 20:23:06] KV cache pool is full. Retract requests. #retracted_reqs: 1, #aborted_retracted_reqs: 0, #new_token_ratio: 0.9394 -> 0.9680
[2025-10-22 20:23:07] Decode batch. #running-req: 134, #token: 268540, token usage: 0.99, cuda graph: True, gen throughput (token/s): 3879.49, #queue-req: 121,
[2025-10-22 20:23:07] KV cache pool is full. Retract requests. #retracted_reqs: 1, #aborted_retracted_reqs: 0, #new_token_ratio: 0.9390 -> 0.9971
[2025-10-22 20:23:08] KV cache pool is full. Retract requests. #retracted_reqs: 1, #aborted_retracted_reqs: 0, #new_token_ratio: 0.9830 -> 1.0000
[2025-10-22 20:23:08] Prefill batch. #new-seq: 9, #new-token: 8192, #cached-token: 27, token usage: 0.00, #running-req: 0, #queue-req: 114,
[2025-10-22 20:23:08] Prefill batch. #new-seq: 9, #new-token: 8192, #cached-token: 24, token usage: 0.03, #running-req: 8, #queue-req: 106,
[2025-10-22 20:23:08] Prefill batch. #new-seq: 9, #new-token: 8192, #cached-token: 24, token usage: 0.06, #running-req: 16, #queue-req: 98,
[2025-10-22 20:23:09] Prefill batch. #new-seq: 9, #new-token: 8192, #cached-token: 24, token usage: 0.09, #running-req: 24, #queue-req: 90,
[2025-10-22 20:23:09] Prefill batch. #new-seq: 9, #new-token: 8192, #cached-token: 24, token usage: 0.12, #running-req: 32, #queue-req: 82,
[2025-10-22 20:23:10] Prefill batch. #new-seq: 9, #new-token: 8192, #cached-token: 24, token usage: 0.15, #running-req: 40, #queue-req: 74,
[2025-10-22 20:23:10] Prefill batch. #new-seq: 9, #new-token: 8192, #cached-token: 24, token usage: 0.18, #running-req: 48, #queue-req: 66,
[2025-10-22 20:23:10] Prefill batch. #new-seq: 9, #new-token: 8192, #cached-token: 24, token usage: 0.21, #running-req: 56, #queue-req: 58,
[2025-10-22 20:23:11] Prefill batch. #new-seq: 6, #new-token: 8192, #cached-token: 15, token usage: 0.24, #running-req: 64, #queue-req: 53,
[2025-10-22 20:23:11] Prefill batch. #new-seq: 7, #new-token: 8192, #cached-token: 18, token usage: 0.27, #running-req: 69, #queue-req: 47,
[2025-10-22 20:23:12] Prefill batch. #new-seq: 6, #new-token: 8192, #cached-token: 15, token usage: 0.30, #running-req: 75, #queue-req: 42,
[2025-10-22 20:23:12] Prefill batch. #new-seq: 7, #new-token: 8192, #cached-token: 18, token usage: 0.33, #running-req: 80, #queue-req: 36,
[2025-10-22 20:23:13] Prefill batch. #new-seq: 6, #new-token: 8192, #cached-token: 15, token usage: 0.36, #running-req: 86, #queue-req: 31,
[2025-10-22 20:23:13] Prefill batch. #new-seq: 6, #new-token: 8192, #cached-token: 15, token usage: 0.39, #running-req: 91, #queue-req: 26,
[2025-10-22 20:23:14] Prefill batch. #new-seq: 5, #new-token: 8192, #cached-token: 12, token usage: 0.42, #running-req: 96, #queue-req: 22,
[2025-10-22 20:23:14] Prefill batch. #new-seq: 6, #new-token: 8192, #cached-token: 15, token usage: 0.45, #running-req: 100, #queue-req: 17,
[2025-10-22 20:23:14] Prefill batch. #new-seq: 6, #new-token: 8192, #cached-token: 15, token usage: 0.49, #running-req: 105, #queue-req: 12,
[2025-10-22 20:23:15] Prefill batch. #new-seq: 5, #new-token: 8192, #cached-token: 12, token usage: 0.52, #running-req: 110, #queue-req: 8,
[2025-10-22 20:23:15] Prefill batch. #new-seq: 5, #new-token: 8192, #cached-token: 12, token usage: 0.55, #running-req: 114, #queue-req: 4,
[2025-10-22 20:23:16] Prefill batch. #new-seq: 5, #new-token: 8192, #cached-token: 12, token usage: 0.58, #running-req: 118, #queue-req: 0,
[2025-10-22 20:23:16] Prefill batch. #new-seq: 1, #new-token: 529, #cached-token: 0, token usage: 0.61, #running-req: 122, #queue-req: 0,
[2025-10-22 20:23:17] Decode batch. #running-req: 123, #token: 164923, token usage: 0.61, cuda graph: True, gen throughput (token/s): 500.00, #queue-req: 0,
[2025-10-22 20:23:18] Decode batch. #running-req: 121, #token: 165696, token usage: 0.61, cuda graph: True, gen throughput (token/s): 4094.45, #queue-req: 0,
[2025-10-22 20:23:20] Decode batch. #running-req: 118, #token: 164352, token usage: 0.61, cuda graph: True, gen throughput (token/s): 4056.96, #queue-req: 0,
[2025-10-22 20:23:21] Decode batch. #running-req: 116, #token: 162896, token usage: 0.60, cuda graph: True, gen throughput (token/s): 3968.94, #queue-req: 0,
[2025-10-22 20:23:22] Decode batch. #running-req: 113, #token: 161321, token usage: 0.60, cuda graph: True, gen throughput (token/s): 3860.62, #queue-req: 0,
[2025-10-22 20:23:23] Decode batch. #running-req: 109, #token: 159622, token usage: 0.59, cuda graph: True, gen throughput (token/s): 3842.54, #queue-req: 0,
[2025-10-22 20:23:24] Decode batch. #running-req: 106, #token: 157796, token usage: 0.58, cuda graph: True, gen throughput (token/s): 3744.57, #queue-req: 0,
[2025-10-22 20:23:25] Decode batch. #running-req: 103, #token: 155837, token usage: 0.58, cuda graph: True, gen throughput (token/s): 3692.80, #queue-req: 0,
[2025-10-22 20:23:26] Decode batch. #running-req: 99, #token: 151692, token usage: 0.56, cuda graph: True, gen throughput (token/s): 3599.84, #queue-req: 0,
[2025-10-22 20:23:28] Decode batch. #running-req: 95, #token: 147401, token usage: 0.55, cuda graph: True, gen throughput (token/s): 3529.46, #queue-req: 0,
[2025-10-22 20:23:29] Decode batch. #running-req: 92, #token: 142957, token usage: 0.53, cuda graph: True, gen throughput (token/s): 3410.78, #queue-req: 0,
[2025-10-22 20:23:30] Decode batch. #running-req: 87, #token: 138351, token usage: 0.51, cuda graph: True, gen throughput (token/s): 3304.70, #queue-req: 0,
[2025-10-22 20:23:31] Decode batch. #running-req: 83, #token: 133576, token usage: 0.49, cuda graph: True, gen throughput (token/s): 3220.77, #queue-req: 0,
[2025-10-22 20:23:32] Decode batch. #running-req: 79, #token: 126579, token usage: 0.47, cuda graph: True, gen throughput (token/s): 3119.00, #queue-req: 0,
[2025-10-22 20:23:33] Decode batch. #running-req: 74, #token: 121439, token usage: 0.45, cuda graph: True, gen throughput (token/s): 3050.09, #queue-req: 0,
[2025-10-22 20:23:34] Decode batch. #running-req: 69, #token: 112013, token usage: 0.41, cuda graph: True, gen throughput (token/s): 2958.13, #queue-req: 0,
[2025-10-22 20:23:35] Decode batch. #running-req: 64, #token: 106540, token usage: 0.39, cuda graph: True, gen throughput (token/s): 2841.08, #queue-req: 0,
[2025-10-22 20:23:36] Decode batch. #running-req: 64, #token: 109100, token usage: 0.40, cuda graph: True, gen throughput (token/s): 2818.36, #queue-req: 0,
[2025-10-22 20:23:37] Decode batch. #running-req: 64, #token: 111660, token usage: 0.41, cuda graph: True, gen throughput (token/s): 2819.91, #queue-req: 0,
[2025-10-22 20:23:37] Decode batch. #running-req: 64, #token: 114220, token usage: 0.42, cuda graph: True, gen throughput (token/s): 2813.07, #queue-req: 0,
[2025-10-22 20:23:38] Decode batch. #running-req: 64, #token: 116780, token usage: 0.43, cuda graph: True, gen throughput (token/s): 2813.02, #queue-req: 0,
[2025-10-22 20:23:39] Decode batch. #running-req: 64, #token: 119340, token usage: 0.44, cuda graph: True, gen throughput (token/s): 2792.02, #queue-req: 0,
[2025-10-22 20:23:40] Decode batch. #running-req: 64, #token: 121900, token usage: 0.45, cuda graph: True, gen throughput (token/s): 2785.32, #queue-req: 0,
[2025-10-22 20:23:41] Decode batch. #running-req: 64, #token: 124460, token usage: 0.46, cuda graph: True, gen throughput (token/s): 2763.89, #queue-req: 0,
[2025-10-22 20:23:42] Decode batch. #running-req: 64, #token: 127020, token usage: 0.47, cuda graph: True, gen throughput (token/s): 2749.54, #queue-req: 0,
[2025-10-22 20:23:43] Decode batch. #running-req: 63, #token: 127555, token usage: 0.47, cuda graph: True, gen throughput (token/s): 2709.10, #queue-req: 0,
[2025-10-22 20:23:44] Scheduler hit an exception: Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/sglang/srt/managers/scheduler.py", line 3056, in run_scheduler_process
scheduler.event_loop_overlap()
File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 120, in decorate_context
return func(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/sglang/srt/managers/scheduler.py", line 1053, in event_loop_overlap
self.self_check_during_idle()
File "/usr/local/lib/python3.10/dist-packages/sglang/srt/managers/scheduler.py", line 1677, in self_check_during_idle
self.check_memory()
File "/usr/local/lib/python3.10/dist-packages/sglang/srt/managers/scheduler.py", line 1732, in check_memory
raise ValueError(msg)
ValueError: token_to_kv_pool_allocator memory leak detected! self.max_total_num_tokens=270331, available_size=644, evictable_size=269628, protected_size=0
[2025-10-22 20:30:53] SIGQUIT received. signum=None, frame=None. It usually means one child failed.
Killed
Environment
root@iZbp19uogy8wkqfqtk61nzZ:~# python3 -m sglang.check_env
/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py:63: FutureWarning: The pynvml package is deprecated. Please install nvidia-ml-py instead. If you did not install pynvml directly, please report this to the maintainers of the package that installed pynvml for you.
import pynvml # type: ignore[import]
Python: 3.10.12 (main, Aug 15 2025, 14:32:43) [GCC 11.4.0]
CUDA available: True
GPU 0,1: NVIDIA H20
GPU 0,1 Compute Capability: 9.0
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 12.8, V12.8.93
CUDA Driver Version: 570.133.20
PyTorch: 2.8.0+cu128
sglang: 0.5.3.post3
sgl_kernel: 0.3.15
flashinfer_python: 0.4.0
triton: 3.4.0
transformers: 4.57.1
torchao: 0.9.0
numpy: 1.26.4
aiohttp: 3.13.1
fastapi: 0.119.1
hf_transfer: 0.1.9
huggingface_hub: 0.35.3
interegular: 0.3.3
modelscope: 1.31.0
orjson: 3.11.3
outlines: 0.1.11
packaging: 25.0
psutil: 7.1.1
pydantic: 2.12.3
python-multipart: 0.0.12
pyzmq: 27.1.0
uvicorn: 0.38.0
uvloop: 0.22.1
vllm: Module Not Found
xgrammar: 0.1.25
openai: 1.99.1
tiktoken: 0.12.0
anthropic: 0.71.0
litellm: Module Not Found
decord: 0.6.0
NVIDIA Topology:
GPU0 GPU1 CPU Affinity NUMA Affinity GPU NUMA ID
GPU0 X NV18 0-31 0 N/A
GPU1 NV18 X 0-31 0 N/A
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
Hypervisor vendor: KVM
ulimit soft: 65535