Your current environment
The output of `python collect_env.py`
Collecting environment information...
uv is set
==============================
System Info
==============================
OS : Ubuntu 24.04.2 LTS (x86_64)
GCC version : (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0
Clang version : Could not collect
CMake version : version 3.28.3
Libc version : glibc-2.39
==============================
PyTorch Info
==============================
PyTorch version : 2.8.0+cu126
Is debug build : False
CUDA used to build PyTorch : 12.6
ROCM used to build PyTorch : N/A
==============================
Python Environment
==============================
Python version : 3.12.3 (main, Jun 18 2025, 17:59:45) [GCC 13.3.0] (64-bit runtime)
Python platform : Linux-5.10.238-234.956.amzn2.x86_64-x86_64-with-glibc2.39
==============================
CUDA / GPU Info
==============================
Is CUDA available : True
CUDA runtime version : Could not collect
CUDA_MODULE_LOADING set to : LAZY
GPU models and configuration :
GPU 0: NVIDIA H200
GPU 1: NVIDIA H200
GPU 2: NVIDIA H200
GPU 3: NVIDIA H200
GPU 4: NVIDIA H200
GPU 5: NVIDIA H200
GPU 6: NVIDIA H200
GPU 7: NVIDIA H200
Nvidia driver version : 550.163.01
cuDNN version : Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.9.5.1
/usr/lib/x86_64-linux-gnu/libcudnn_adv.so.9.5.1
/usr/lib/x86_64-linux-gnu/libcudnn_cnn.so.9.5.1
/usr/lib/x86_64-linux-gnu/libcudnn_engines_precompiled.so.9.5.1
/usr/lib/x86_64-linux-gnu/libcudnn_engines_runtime_compiled.so.9.5.1
/usr/lib/x86_64-linux-gnu/libcudnn_graph.so.9.5.1
/usr/lib/x86_64-linux-gnu/libcudnn_heuristic.so.9.5.1
/usr/lib/x86_64-linux-gnu/libcudnn_ops.so.9.5.1
HIP runtime version : N/A
MIOpen runtime version : N/A
Is XNNPACK available : True
==============================
CPU Info
==============================
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 46 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 192
On-line CPU(s) list: 0-191
Vendor ID: GenuineIntel
Model name: Intel(R) Xeon(R) Platinum 8488C
CPU family: 6
Model: 143
Thread(s) per core: 2
Core(s) per socket: 48
Socket(s): 2
Stepping: 8
BogoMIPS: 4800.00
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq monitor ssse3 fma cx16 pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves avx512_bf16 wbnoinvd ida arat avx512vbmi umip pku ospke waitpkg avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg tme avx512_vpopcntdq rdpid cldemote movdiri movdir64b md_clear serialize flush_l1d arch_capabilities
Hypervisor vendor: KVM
Virtualization type: full
L1d cache: 4.5 MiB (96 instances)
L1i cache: 3 MiB (96 instances)
L2 cache: 192 MiB (96 instances)
L3 cache: 210 MiB (2 instances)
NUMA node(s): 2
NUMA node0 CPU(s): 0-47,96-143
NUMA node1 CPU(s): 48-95,144-191
Vulnerability Gather data sampling: Not affected
Vulnerability Indirect target selection: Not affected
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Not affected
Vulnerability Reg file data sampling: Not affected
Vulnerability Retbleed: Not affected
Vulnerability Spec rstack overflow: Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Enhanced / Automatic IBRS, IBPB conditional, RSB filling, PBRSB-eIBRS SW sequence
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected
==============================
Versions of relevant libraries
==============================
[pip3] numpy==2.2.6
[pip3] nvidia-cublas-cu12==12.6.4.1
[pip3] nvidia-cuda-cupti-cu12==12.6.80
[pip3] nvidia-cuda-nvrtc-cu12==12.6.77
[pip3] nvidia-cuda-runtime-cu12==12.6.77
[pip3] nvidia-cudnn-cu12==9.10.2.21
[pip3] nvidia-cufft-cu12==11.3.0.4
[pip3] nvidia-cufile-cu12==1.11.1.6
[pip3] nvidia-curand-cu12==10.3.7.77
[pip3] nvidia-cusolver-cu12==11.7.1.2
[pip3] nvidia-cusparse-cu12==12.5.4.2
[pip3] nvidia-cusparselt-cu12==0.7.1
[pip3] nvidia-nccl-cu12==2.27.3
[pip3] nvidia-nvjitlink-cu12==12.6.85
[pip3] nvidia-nvtx-cu12==12.6.77
[pip3] pyzmq==27.1.0
[pip3] torch==2.8.0+cu126
[pip3] torchaudio==2.8.0+cu126
[pip3] torchvision==0.23.0+cu126
[pip3] transformers==4.57.1
[pip3] triton==3.4.0
[conda] Could not collect
==============================
vLLM Info
==============================
ROCM Version : Could not collect
vLLM Version : 0.11.0
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled
GPU Topology:
GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7 CPU Affinity NUMA Affinity GPU NUMA ID
GPU0 X NV18 NV18 NV18 NV18 NV18 NV18 NV18 0-47,96-143 0 N/A
GPU1 NV18 X NV18 NV18 NV18 NV18 NV18 NV18 0-47,96-143 0 N/A
GPU2 NV18 NV18 X NV18 NV18 NV18 NV18 NV18 0-47,96-143 0 N/A
GPU3 NV18 NV18 NV18 X NV18 NV18 NV18 NV18 0-47,96-143 0 N/A
GPU4 NV18 NV18 NV18 NV18 X NV18 NV18 NV18 48-95,144-191 1 N/A
GPU5 NV18 NV18 NV18 NV18 NV18 X NV18 NV18 48-95,144-191 1 N/A
GPU6 NV18 NV18 NV18 NV18 NV18 NV18 X NV18 48-95,144-191 1 N/A
GPU7 NV18 NV18 NV18 NV18 NV18 NV18 NV18 X 48-95,144-191 1 N/A
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
==============================
Environment Variables
==============================
PYTORCH_NVML_BASED_CUDA_CHECK=1
TORCHINDUCTOR_COMPILE_THREADS=1
CUDA_MODULE_LOADING=LAZY
🐛 Describe the bug
Setting SamplingParams.truncate_prompt_tokens has no effect in LLM.chat: judging from the output, the prompt is not truncated to the specified length. In particular, passing a prompt longer than max_model_len raises an error even when truncate_prompt_tokens + max_tokens is set smaller than max_model_len.
This behavior is mentioned in other issues (#17324, #3144 (comment)), but it does not appear to have been addressed. I believe this is a bug, since the docstring of truncate_prompt_tokens (https://docs.vllm.ai/en/latest/api/vllm/index.html#vllm.SamplingParams.truncate_prompt_tokens) mentions no such limitation, and it is natural to expect it to work in LLM.chat as well.
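For reference, the docstring describes truncate_prompt_tokens=k as left truncation, keeping only the last k tokens of the prompt. A minimal sketch of the behavior I expected from LLM.chat (the helper below is my own illustration, not vLLM code):

```python
def expected_truncation(prompt_token_ids: list[int], k: int) -> list[int]:
    # Documented semantics of truncate_prompt_tokens=k: keep only the
    # last k tokens of the (chat-templated) prompt, i.e. left truncation.
    return prompt_token_ids[-k:]

# With truncate_prompt_tokens=10, at most 10 prompt tokens should reach
# the model, so len(RequestOutput.prompt_token_ids) should be <= 10.
```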
Case 1: max_model_len is large enough
RequestOutput.prompt_token_ids does not appear to be truncated to 10 tokens.
```python
import vllm

model = vllm.LLM(model="Qwen/Qwen3-0.6B", max_model_len=1000)
ret = model.chat([{"role": "user", "content": "<think>" * 200}], vllm.SamplingParams(
    truncate_prompt_tokens=10,
    max_tokens=10,
))
print(ret[0])
```

Output:
INFO 10-28 16:13:52 [__init__.py:216] Automatically detected platform cuda.
INFO 10-28 16:14:04 [utils.py:233] non-default args: {'max_model_len': 1000, 'disable_log_stats': True}
INFO 10-28 16:14:05 [model.py:547] Resolved architecture: Qwen3ForCausalLM
`torch_dtype` is deprecated! Use `dtype` instead!
INFO 10-28 16:14:05 [model.py:1510] Using max model len 1000
INFO 10-28 16:14:07 [scheduler.py:205] Chunked prefill is enabled with max_num_batched_tokens=16384.
(EngineCore_DP0 pid=2333224) INFO 10-28 16:14:08 [core.py:644] Waiting for init message from front-end.
(EngineCore_DP0 pid=2333224) INFO 10-28 16:14:08 [core.py:77] Initializing a V1 LLM engine (v0.11.0) with config: model='Qwen/Qwen3-0.6B', speculative_config=None, tokenizer='Qwen/Qwen3-0.6B', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=1000, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=Qwen/Qwen3-0.6B, enable_prefix_caching=True, chunked_prefill_enabled=True, pooler_config=None, compilation_config={"level":3,"debug_dump_path":"","cache_dir":"","backend":"","custom_ops":[],"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output","vllm.mamba_mixer2","vllm.mamba_mixer","vllm.short_conv","vllm.linear_attention","vllm.plamo2_mamba_mixer","vllm.gdn_attention","vllm.sparse_attn_indexer"],"use_inductor":true,"compile_sizes":[],"inductor_compile_config":{"enable_auto_functionalized_v2":false},"inductor_passes":{},"cudagraph_mode":[2,1],"use_cudagraph":true,"cudagraph_num_of_warmups":1,"cudagraph_capture_sizes":[512,504,496,488,480,472,464,456,448,440,432,424,416,408,400,392,384,376,368,360,352,344,336,328,320,312,304,296,288,280,272,264,256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"cudagraph_copy_inputs":false,"full_cuda_graph":false,"use_inductor_graph_partition":false,"pass_config":{},"max_capture_size":512,"local_cache_dir":null}
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
(EngineCore_DP0 pid=2333224) INFO 10-28 16:14:19 [parallel_state.py:1208] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
(EngineCore_DP0 pid=2333224) WARNING 10-28 16:14:19 [topk_topp_sampler.py:66] FlashInfer is not available. Falling back to the PyTorch-native implementation of top-p & top-k sampling. For the best performance, please install FlashInfer.
(EngineCore_DP0 pid=2333224) INFO 10-28 16:14:19 [gpu_model_runner.py:2602] Starting to load model Qwen/Qwen3-0.6B...
(EngineCore_DP0 pid=2333224) INFO 10-28 16:14:20 [gpu_model_runner.py:2634] Loading model from scratch...
(EngineCore_DP0 pid=2333224) INFO 10-28 16:14:20 [cuda.py:366] Using Flash Attention backend on V1 engine.
(EngineCore_DP0 pid=2333224) INFO 10-28 16:14:20 [weight_utils.py:392] Using model weights format ['*.safetensors']
(EngineCore_DP0 pid=2333224) INFO 10-28 16:14:20 [weight_utils.py:450] No model.safetensors.index.json found in remote.
Loading safetensors checkpoint shards: 0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:02<00:00, 2.05s/it]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:02<00:00, 2.05s/it]
(EngineCore_DP0 pid=2333224)
(EngineCore_DP0 pid=2333224) INFO 10-28 16:14:22 [default_loader.py:267] Loading weights took 2.07 seconds
(EngineCore_DP0 pid=2333224) INFO 10-28 16:14:23 [gpu_model_runner.py:2653] Model loading took 1.1201 GiB and 2.911120 seconds
(EngineCore_DP0 pid=2333224) INFO 10-28 16:14:34 [backends.py:548] Using cache directory: /root/.cache/vllm/torch_compile_cache/9fe2c87a72/rank_0_0/backbone for vLLM's torch.compile
(EngineCore_DP0 pid=2333224) INFO 10-28 16:14:34 [backends.py:559] Dynamo bytecode transform time: 10.89 s
(EngineCore_DP0 pid=2333224) INFO 10-28 16:14:36 [backends.py:164] Directly load the compiled graph(s) for dynamic shape from the cache, took 1.485 s
(EngineCore_DP0 pid=2333224) INFO 10-28 16:14:37 [monitor.py:34] torch.compile takes 10.89 s in total
(EngineCore_DP0 pid=2333224) INFO 10-28 16:14:37 [gpu_worker.py:298] Available KV cache memory: 118.92 GiB
(EngineCore_DP0 pid=2333224) INFO 10-28 16:14:38 [kv_cache_utils.py:1087] GPU KV cache size: 1,113,328 tokens
(EngineCore_DP0 pid=2333224) INFO 10-28 16:14:38 [kv_cache_utils.py:1091] Maximum concurrency for 1,000 tokens per request: 1104.49x
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE): 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 67/67 [00:01<00:00, 34.31it/s]
Capturing CUDA graphs (decode, FULL): 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 67/67 [00:01<00:00, 46.87it/s]
(EngineCore_DP0 pid=2333224) INFO 10-28 16:14:41 [gpu_model_runner.py:3480] Graph capturing finished in 4 secs, took 0.98 GiB
(EngineCore_DP0 pid=2333224) INFO 10-28 16:14:41 [core.py:210] init engine (profile, create kv cache, warmup model) took 18.61 seconds
INFO 10-28 16:14:42 [llm.py:306] Supported_tasks: ['generate']
INFO 10-28 16:14:43 [chat_utils.py:560] Detected the chat template content format to be 'string'. You can set `--chat-template-content-format` to override this.
Adding requests: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 1573.85it/s]
Processed prompts: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 27.77it/s, est. speed input: 5802.45 toks/s, output: 278.85 toks/s]
RequestOutput(request_id=0, prompt=None, prompt_token_ids=[151644, 872, 198, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151645, 198, 151644, 77091, 198], encoder_prompt=None, encoder_prompt_token_ids=None, prompt_logprobs=None, outputs=[CompletionOutput(index=0, text="<think>\n<think>\nOkay, let's see.", token_ids=[151667, 198, 151667, 198, 32313, 11, 1077, 594, 1490, 13], cumulative_logprob=None, logprobs=None, finish_reason=length, stop_reason=None)], finished=True, metrics=None, lora_request=None, num_cached_tokens=0, multi_modal_placeholders={})
ERROR 10-28 16:14:44 [core_client.py:564] Engine core proc EngineCore_DP0 died unexpectedly, shutting down client.
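A quick check on the returned object makes the mismatch explicit (the 208-token length comes from the run in case 2, which uses the same prompt):

```python
# truncate_prompt_tokens=10 was requested, but the full chat-templated
# prompt (the chat template tokens plus 200 "<think>" tokens) came back.
print(len(ret[0].prompt_token_ids))  # 208, not <= 10
```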
Case 2: max_model_len is not large enough
This raises an error, which indicates that the prompt was not truncated before the length check.
```python
import vllm

model = vllm.LLM(model="Qwen/Qwen3-0.6B", max_model_len=100)
ret = model.chat([{"role": "user", "content": "<think>" * 200}], vllm.SamplingParams(
    truncate_prompt_tokens=10,
    max_tokens=10,
))
print(ret[0])
```

Output:
INFO 10-28 16:20:10 [__init__.py:216] Automatically detected platform cuda.
INFO 10-28 16:20:19 [utils.py:233] non-default args: {'max_model_len': 100, 'disable_log_stats': True}
INFO 10-28 16:20:19 [model.py:547] Resolved architecture: Qwen3ForCausalLM
`torch_dtype` is deprecated! Use `dtype` instead!
INFO 10-28 16:20:20 [model.py:1510] Using max model len 100
INFO 10-28 16:20:21 [scheduler.py:205] Chunked prefill is enabled with max_num_batched_tokens=16384.
(EngineCore_DP0 pid=2341366) INFO 10-28 16:20:22 [core.py:644] Waiting for init message from front-end.
(EngineCore_DP0 pid=2341366) INFO 10-28 16:20:22 [core.py:77] Initializing a V1 LLM engine (v0.11.0) with config: model='Qwen/Qwen3-0.6B', speculative_config=None, tokenizer='Qwen/Qwen3-0.6B', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=100, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=Qwen/Qwen3-0.6B, enable_prefix_caching=True, chunked_prefill_enabled=True, pooler_config=None, compilation_config={"level":3,"debug_dump_path":"","cache_dir":"","backend":"","custom_ops":[],"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output","vllm.mamba_mixer2","vllm.mamba_mixer","vllm.short_conv","vllm.linear_attention","vllm.plamo2_mamba_mixer","vllm.gdn_attention","vllm.sparse_attn_indexer"],"use_inductor":true,"compile_sizes":[],"inductor_compile_config":{"enable_auto_functionalized_v2":false},"inductor_passes":{},"cudagraph_mode":[2,1],"use_cudagraph":true,"cudagraph_num_of_warmups":1,"cudagraph_capture_sizes":[512,504,496,488,480,472,464,456,448,440,432,424,416,408,400,392,384,376,368,360,352,344,336,328,320,312,304,296,288,280,272,264,256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"cudagraph_copy_inputs":false,"full_cuda_graph":false,"use_inductor_graph_partition":false,"pass_config":{},"max_capture_size":512,"local_cache_dir":null}
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
(EngineCore_DP0 pid=2341366) INFO 10-28 16:20:31 [parallel_state.py:1208] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
(EngineCore_DP0 pid=2341366) WARNING 10-28 16:20:32 [topk_topp_sampler.py:66] FlashInfer is not available. Falling back to the PyTorch-native implementation of top-p & top-k sampling. For the best performance, please install FlashInfer.
(EngineCore_DP0 pid=2341366) INFO 10-28 16:20:32 [gpu_model_runner.py:2602] Starting to load model Qwen/Qwen3-0.6B...
(EngineCore_DP0 pid=2341366) INFO 10-28 16:20:32 [gpu_model_runner.py:2634] Loading model from scratch...
(EngineCore_DP0 pid=2341366) INFO 10-28 16:20:32 [cuda.py:366] Using Flash Attention backend on V1 engine.
(EngineCore_DP0 pid=2341366) INFO 10-28 16:20:32 [weight_utils.py:392] Using model weights format ['*.safetensors']
(EngineCore_DP0 pid=2341366) INFO 10-28 16:20:33 [weight_utils.py:450] No model.safetensors.index.json found in remote.
Loading safetensors checkpoint shards: 0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00, 2.08it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00, 2.08it/s]
(EngineCore_DP0 pid=2341366)
(EngineCore_DP0 pid=2341366) INFO 10-28 16:20:33 [default_loader.py:267] Loading weights took 0.50 seconds
(EngineCore_DP0 pid=2341366) INFO 10-28 16:20:33 [gpu_model_runner.py:2653] Model loading took 1.1201 GiB and 1.289675 seconds
(EngineCore_DP0 pid=2341366) INFO 10-28 16:20:43 [backends.py:548] Using cache directory: /root/.cache/vllm/torch_compile_cache/9501096389/rank_0_0/backbone for vLLM's torch.compile
(EngineCore_DP0 pid=2341366) INFO 10-28 16:20:43 [backends.py:559] Dynamo bytecode transform time: 9.14 s
(EngineCore_DP0 pid=2341366) INFO 10-28 16:20:46 [backends.py:197] Cache the graph for dynamic shape for later use
(EngineCore_DP0 pid=2341366) INFO 10-28 16:21:00 [backends.py:218] Compiling a graph for dynamic shape takes 16.60 s
(EngineCore_DP0 pid=2341366) INFO 10-28 16:21:06 [monitor.py:34] torch.compile takes 25.74 s in total
(EngineCore_DP0 pid=2341366) INFO 10-28 16:21:07 [gpu_worker.py:298] Available KV cache memory: 118.91 GiB
(EngineCore_DP0 pid=2341366) INFO 10-28 16:21:07 [kv_cache_utils.py:1087] GPU KV cache size: 1,113,296 tokens
(EngineCore_DP0 pid=2341366) INFO 10-28 16:21:07 [kv_cache_utils.py:1091] Maximum concurrency for 100 tokens per request: 9940.14x
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE): 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 67/67 [00:02<00:00, 32.13it/s]
Capturing CUDA graphs (decode, FULL): 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 67/67 [00:04<00:00, 16.42it/s]
(EngineCore_DP0 pid=2341366) INFO 10-28 16:21:14 [gpu_model_runner.py:3480] Graph capturing finished in 7 secs, took 0.98 GiB
(EngineCore_DP0 pid=2341366) INFO 10-28 16:21:14 [core.py:210] init engine (profile, create kv cache, warmup model) took 40.31 seconds
INFO 10-28 16:21:15 [llm.py:306] Supported_tasks: ['generate']
INFO 10-28 16:21:16 [chat_utils.py:560] Detected the chat template content format to be 'string'. You can set `--chat-template-content-format` to override this.
Adding requests: 0%| | 0/1 [00:00<?, ?it/s]
Traceback (most recent call last):
File "/mnt/shared/fujita/debug_vllm_truncate_prompt_tokens/debug_vllm_truncate_prompt_tokens.py", line 6, in <module>
ret = model.chat([{"role": "user", "content": "<think>" * 200}], vllm.SamplingParams(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/mnt/shared/fujita/debug_vllm_truncate_prompt_tokens/.venv/lib/python3.12/site-packages/vllm/entrypoints/llm.py", line 893, in chat
return self.generate(
^^^^^^^^^^^^^^
File "/mnt/shared/fujita/debug_vllm_truncate_prompt_tokens/.venv/lib/python3.12/site-packages/vllm/entrypoints/llm.py", line 393, in generate
self._validate_and_add_requests(
File "/mnt/shared/fujita/debug_vllm_truncate_prompt_tokens/.venv/lib/python3.12/site-packages/vllm/entrypoints/llm.py", line 1516, in _validate_and_add_requests
self._add_request(
File "/mnt/shared/fujita/debug_vllm_truncate_prompt_tokens/.venv/lib/python3.12/site-packages/vllm/entrypoints/llm.py", line 1569, in _add_request
self.llm_engine.add_request(
File "/mnt/shared/fujita/debug_vllm_truncate_prompt_tokens/.venv/lib/python3.12/site-packages/vllm/v1/engine/llm_engine.py", line 230, in add_request
prompt_str, request = self.processor.process_inputs(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/mnt/shared/fujita/debug_vllm_truncate_prompt_tokens/.venv/lib/python3.12/site-packages/vllm/v1/engine/processor.py", line 392, in process_inputs
self._validate_model_inputs(encoder_inputs, decoder_inputs)
File "/mnt/shared/fujita/debug_vllm_truncate_prompt_tokens/.venv/lib/python3.12/site-packages/vllm/v1/engine/processor.py", line 466, in _validate_model_inputs
self._validate_model_input(decoder_inputs, prompt_type="decoder")
File "/mnt/shared/fujita/debug_vllm_truncate_prompt_tokens/.venv/lib/python3.12/site-packages/vllm/v1/engine/processor.py", line 535, in _validate_model_input
raise ValueError(
ValueError: The decoder prompt (length 208) is longer than the maximum model length of 100. Make sure that `max_model_len` is no smaller than the number of text tokens.
ERROR 10-28 16:21:16 [core_client.py:564] Engine core proc EngineCore_DP0 died unexpectedly, shutting down client.
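As a workaround, applying the chat template and truncating manually before calling generate avoids the error. A minimal sketch, assuming left truncation (keeping the last k tokens) is the intended semantics per the docstring:

```python
import vllm

model = vllm.LLM(model="Qwen/Qwen3-0.6B", max_model_len=100)
tokenizer = model.get_tokenizer()

# Apply the chat template ourselves, then left-truncate to the last 10
# tokens, mimicking what truncate_prompt_tokens=10 is documented to do.
token_ids = tokenizer.apply_chat_template(
    [{"role": "user", "content": "<think>" * 200}],
    add_generation_prompt=True,
)
token_ids = token_ids[-10:]

ret = model.generate(
    vllm.TokensPrompt(prompt_token_ids=token_ids),
    vllm.SamplingParams(max_tokens=10),
)
print(ret[0])
```

This sidesteps LLM.chat entirely, so it is only a stopgap; ideally truncate_prompt_tokens would be honored after chat templating.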
Before submitting a new issue...
- Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.