[Bug]: SamplingParams.truncate_prompt_tokens has no effect in LLM.chat

### Your current environment

<details>
<summary>The output of <code>python collect_env.py</code></summary>

```text
Collecting environment information...
uv is set
==============================
        System Info
==============================
OS                           : Ubuntu 24.04.2 LTS (x86_64)
GCC version                  : (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0
Clang version                : Could not collect
CMake version                : version 3.28.3
Libc version                 : glibc-2.39

==============================
       PyTorch Info
==============================
PyTorch version              : 2.8.0+cu126
Is debug build               : False
CUDA used to build PyTorch   : 12.6
ROCM used to build PyTorch   : N/A

==============================
      Python Environment
==============================
Python version               : 3.12.3 (main, Jun 18 2025, 17:59:45) [GCC 13.3.0] (64-bit runtime)
Python platform              : Linux-5.10.238-234.956.amzn2.x86_64-x86_64-with-glibc2.39

==============================
       CUDA / GPU Info
==============================
Is CUDA available            : True
CUDA runtime version         : Could not collect
CUDA_MODULE_LOADING set to   : LAZY
GPU models and configuration :
GPU 0: NVIDIA H200
GPU 1: NVIDIA H200
GPU 2: NVIDIA H200
GPU 3: NVIDIA H200
GPU 4: NVIDIA H200
GPU 5: NVIDIA H200
GPU 6: NVIDIA H200
GPU 7: NVIDIA H200

Nvidia driver version        : 550.163.01
cuDNN version                : Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.9.5.1
/usr/lib/x86_64-linux-gnu/libcudnn_adv.so.9.5.1
/usr/lib/x86_64-linux-gnu/libcudnn_cnn.so.9.5.1
/usr/lib/x86_64-linux-gnu/libcudnn_engines_precompiled.so.9.5.1
/usr/lib/x86_64-linux-gnu/libcudnn_engines_runtime_compiled.so.9.5.1
/usr/lib/x86_64-linux-gnu/libcudnn_graph.so.9.5.1
/usr/lib/x86_64-linux-gnu/libcudnn_heuristic.so.9.5.1
/usr/lib/x86_64-linux-gnu/libcudnn_ops.so.9.5.1
HIP runtime version          : N/A
MIOpen runtime version       : N/A
Is XNNPACK available         : True

==============================
          CPU Info
==============================
Architecture:                            x86_64
CPU op-mode(s):                          32-bit, 64-bit
Address sizes:                           46 bits physical, 48 bits virtual
Byte Order:                              Little Endian
CPU(s):                                  192
On-line CPU(s) list:                     0-191
Vendor ID:                               GenuineIntel
Model name:                              Intel(R) Xeon(R) Platinum 8488C
CPU family:                              6
Model:                                   143
Thread(s) per core:                      2
Core(s) per socket:                      48
Socket(s):                               2
Stepping:                                8
BogoMIPS:                                4800.00
Flags:                                   fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq monitor ssse3 fma cx16 pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves avx512_bf16 wbnoinvd ida arat avx512vbmi umip pku ospke waitpkg avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg tme avx512_vpopcntdq rdpid cldemote movdiri movdir64b md_clear serialize flush_l1d arch_capabilities
Hypervisor vendor:                       KVM
Virtualization type:                     full
L1d cache:                               4.5 MiB (96 instances)
L1i cache:                               3 MiB (96 instances)
L2 cache:                                192 MiB (96 instances)
L3 cache:                                210 MiB (2 instances)
NUMA node(s):                            2
NUMA node0 CPU(s):                       0-47,96-143
NUMA node1 CPU(s):                       48-95,144-191
Vulnerability Gather data sampling:      Not affected
Vulnerability Indirect target selection: Not affected
Vulnerability Itlb multihit:             Not affected
Vulnerability L1tf:                      Not affected
Vulnerability Mds:                       Not affected
Vulnerability Meltdown:                  Not affected
Vulnerability Mmio stale data:           Not affected
Vulnerability Reg file data sampling:    Not affected
Vulnerability Retbleed:                  Not affected
Vulnerability Spec rstack overflow:      Not affected
Vulnerability Spec store bypass:         Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:                Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:                Mitigation; Enhanced / Automatic IBRS, IBPB conditional, RSB filling, PBRSB-eIBRS SW sequence
Vulnerability Srbds:                     Not affected
Vulnerability Tsx async abort:           Not affected

==============================
Versions of relevant libraries
==============================
[pip3] numpy==2.2.6
[pip3] nvidia-cublas-cu12==12.6.4.1
[pip3] nvidia-cuda-cupti-cu12==12.6.80
[pip3] nvidia-cuda-nvrtc-cu12==12.6.77
[pip3] nvidia-cuda-runtime-cu12==12.6.77
[pip3] nvidia-cudnn-cu12==9.10.2.21
[pip3] nvidia-cufft-cu12==11.3.0.4
[pip3] nvidia-cufile-cu12==1.11.1.6
[pip3] nvidia-curand-cu12==10.3.7.77
[pip3] nvidia-cusolver-cu12==11.7.1.2
[pip3] nvidia-cusparse-cu12==12.5.4.2
[pip3] nvidia-cusparselt-cu12==0.7.1
[pip3] nvidia-nccl-cu12==2.27.3
[pip3] nvidia-nvjitlink-cu12==12.6.85
[pip3] nvidia-nvtx-cu12==12.6.77
[pip3] pyzmq==27.1.0
[pip3] torch==2.8.0+cu126
[pip3] torchaudio==2.8.0+cu126
[pip3] torchvision==0.23.0+cu126
[pip3] transformers==4.57.1
[pip3] triton==3.4.0
[conda] Could not collect

==============================
         vLLM Info
==============================
ROCM Version                 : Could not collect
vLLM Version                 : 0.11.0
vLLM Build Flags:
  CUDA Archs: Not Set; ROCm: Disabled
GPU Topology:
        GPU0    GPU1    GPU2    GPU3    GPU4    GPU5    GPU6    GPU7    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      NV18    NV18    NV18    NV18    NV18    NV18    NV18    0-47,96-143     0               N/A
GPU1    NV18     X      NV18    NV18    NV18    NV18    NV18    NV18    0-47,96-143     0               N/A
GPU2    NV18    NV18     X      NV18    NV18    NV18    NV18    NV18    0-47,96-143     0               N/A
GPU3    NV18    NV18    NV18     X      NV18    NV18    NV18    NV18    0-47,96-143     0               N/A
GPU4    NV18    NV18    NV18    NV18     X      NV18    NV18    NV18    48-95,144-191   1               N/A
GPU5    NV18    NV18    NV18    NV18    NV18     X      NV18    NV18    48-95,144-191   1               N/A
GPU6    NV18    NV18    NV18    NV18    NV18    NV18     X      NV18    48-95,144-191   1               N/A
GPU7    NV18    NV18    NV18    NV18    NV18    NV18    NV18     X      48-95,144-191   1               N/A

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

==============================
     Environment Variables
==============================
PYTORCH_NVML_BASED_CUDA_CHECK=1
TORCHINDUCTOR_COMPILE_THREADS=1
CUDA_MODULE_LOADING=LAZY

```

</details>


### 🐛 Describe the bug

Setting `SamplingParams.truncate_prompt_tokens` has no effect in `LLM.chat`: the output shows that the prompt is not truncated to the specified length. In particular, passing a prompt longer than `max_model_len` raises an error even when `truncate_prompt_tokens + max_tokens` is smaller than `max_model_len`.

This behavior has been mentioned in other issues (https://github.com/vllm-project/vllm/issues/17324, https://github.com/vllm-project/vllm/pull/3144#issuecomment-2044121238), but it does not appear to have been addressed. I believe it is a bug: the docstring of `truncate_prompt_tokens` (https://docs.vllm.ai/en/latest/api/vllm/index.html#vllm.SamplingParams.truncate_prompt_tokens) does not mention such a limitation, and it is natural to expect the option to work in `LLM.chat` as well.
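For reference, here is a minimal sketch of the behavior I expected. It assumes that truncation keeps only the last `k` prompt tokens (left truncation), which is how I read the docstring. `truncate_left` is a hypothetical helper written for illustration, not a vLLM function.

```python
from typing import Optional, Sequence


def truncate_left(token_ids: Sequence[int], k: Optional[int]) -> list[int]:
    """Keep only the last k tokens (hypothetical helper, not part of vLLM).

    Assumption: truncate_prompt_tokens=k means left truncation, i.e. the
    oldest tokens are dropped and the most recent k tokens are kept.
    """
    if k is None:
        # No truncation requested: return the prompt unchanged.
        return list(token_ids)
    if k < 1:
        raise ValueError("k must be a positive integer")
    return list(token_ids[-k:])


# A 208-token prompt truncated to 10 tokens keeps the final 10 ids.
prompt = list(range(208))
truncated = truncate_left(prompt, 10)
print(len(truncated))
```

With this reading, a 208-token chat prompt and `truncate_prompt_tokens=10` should leave 10 prompt tokens before generation starts.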

### case 1: `max_model_len` is large enough

`RequestOutput.prompt_token_ids` does not appear to be truncated to 10 tokens.

```python
import vllm
model = vllm.LLM(model="Qwen/Qwen3-0.6B", max_model_len=1000)
ret = model.chat([{"role": "user", "content": "<think>" * 200}], vllm.SamplingParams(
    truncate_prompt_tokens=10,
    max_tokens=10,
))
print(ret[0])
```

```
INFO 10-28 16:13:52 [__init__.py:216] Automatically detected platform cuda.
INFO 10-28 16:14:04 [utils.py:233] non-default args: {'max_model_len': 1000, 'disable_log_stats': True}
INFO 10-28 16:14:05 [model.py:547] Resolved architecture: Qwen3ForCausalLM
`torch_dtype` is deprecated! Use `dtype` instead!
INFO 10-28 16:14:05 [model.py:1510] Using max model len 1000
INFO 10-28 16:14:07 [scheduler.py:205] Chunked prefill is enabled with max_num_batched_tokens=16384.
(EngineCore_DP0 pid=2333224) INFO 10-28 16:14:08 [core.py:644] Waiting for init message from front-end.
(EngineCore_DP0 pid=2333224) INFO 10-28 16:14:08 [core.py:77] Initializing a V1 LLM engine (v0.11.0) with config: model='Qwen/Qwen3-0.6B', speculative_config=None, tokenizer='Qwen/Qwen3-0.6B', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=1000, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=Qwen/Qwen3-0.6B, enable_prefix_caching=True, chunked_prefill_enabled=True, pooler_config=None, compilation_config={"level":3,"debug_dump_path":"","cache_dir":"","backend":"","custom_ops":[],"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output","vllm.mamba_mixer2","vllm.mamba_mixer","vllm.short_conv","vllm.linear_attention","vllm.plamo2_mamba_mixer","vllm.gdn_attention","vllm.sparse_attn_indexer"],"use_inductor":true,"compile_sizes":[],"inductor_compile_config":{"enable_auto_functionalized_v2":false},"inductor_passes":{},"cudagraph_mode":[2,1],"use_cudagraph":true,"cudagraph_num_of_warmups":1,"cudagraph_capture_sizes":[512,504,496,488,480,472,464,456,448,440,432,424,416,408,400,392,384,376,368,360,352,344,336,328,320,312,304,296,288,280,272,264,256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"cudagraph_copy_inputs":false,"full_cuda_graph":false,"use_inductor_graph_partition":false,"pass_config":{},"max_capture_size":512,"local_cache_dir":null}
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
(EngineCore_DP0 pid=2333224) INFO 10-28 16:14:19 [parallel_state.py:1208] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
(EngineCore_DP0 pid=2333224) WARNING 10-28 16:14:19 [topk_topp_sampler.py:66] FlashInfer is not available. Falling back to the PyTorch-native implementation of top-p & top-k sampling. For the best performance, please install FlashInfer.
(EngineCore_DP0 pid=2333224) INFO 10-28 16:14:19 [gpu_model_runner.py:2602] Starting to load model Qwen/Qwen3-0.6B...
(EngineCore_DP0 pid=2333224) INFO 10-28 16:14:20 [gpu_model_runner.py:2634] Loading model from scratch...
(EngineCore_DP0 pid=2333224) INFO 10-28 16:14:20 [cuda.py:366] Using Flash Attention backend on V1 engine.
(EngineCore_DP0 pid=2333224) INFO 10-28 16:14:20 [weight_utils.py:392] Using model weights format ['*.safetensors']
(EngineCore_DP0 pid=2333224) INFO 10-28 16:14:20 [weight_utils.py:450] No model.safetensors.index.json found in remote.
Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:02<00:00,  2.05s/it]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:02<00:00,  2.05s/it]
(EngineCore_DP0 pid=2333224)
(EngineCore_DP0 pid=2333224) INFO 10-28 16:14:22 [default_loader.py:267] Loading weights took 2.07 seconds
(EngineCore_DP0 pid=2333224) INFO 10-28 16:14:23 [gpu_model_runner.py:2653] Model loading took 1.1201 GiB and 2.911120 seconds
(EngineCore_DP0 pid=2333224) INFO 10-28 16:14:34 [backends.py:548] Using cache directory: /root/.cache/vllm/torch_compile_cache/9fe2c87a72/rank_0_0/backbone for vLLM's torch.compile
(EngineCore_DP0 pid=2333224) INFO 10-28 16:14:34 [backends.py:559] Dynamo bytecode transform time: 10.89 s
(EngineCore_DP0 pid=2333224) INFO 10-28 16:14:36 [backends.py:164] Directly load the compiled graph(s) for dynamic shape from the cache, took 1.485 s
(EngineCore_DP0 pid=2333224) INFO 10-28 16:14:37 [monitor.py:34] torch.compile takes 10.89 s in total
(EngineCore_DP0 pid=2333224) INFO 10-28 16:14:37 [gpu_worker.py:298] Available KV cache memory: 118.92 GiB
(EngineCore_DP0 pid=2333224) INFO 10-28 16:14:38 [kv_cache_utils.py:1087] GPU KV cache size: 1,113,328 tokens
(EngineCore_DP0 pid=2333224) INFO 10-28 16:14:38 [kv_cache_utils.py:1091] Maximum concurrency for 1,000 tokens per request: 1104.49x
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE): 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 67/67 [00:01<00:00, 34.31it/s]
Capturing CUDA graphs (decode, FULL): 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 67/67 [00:01<00:00, 46.87it/s]
(EngineCore_DP0 pid=2333224) INFO 10-28 16:14:41 [gpu_model_runner.py:3480] Graph capturing finished in 4 secs, took 0.98 GiB
(EngineCore_DP0 pid=2333224) INFO 10-28 16:14:41 [core.py:210] init engine (profile, create kv cache, warmup model) took 18.61 seconds
INFO 10-28 16:14:42 [llm.py:306] Supported_tasks: ['generate']
INFO 10-28 16:14:43 [chat_utils.py:560] Detected the chat template content format to be 'string'. You can set `--chat-template-content-format` to override this.
Adding requests: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 1573.85it/s]
Processed prompts: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 27.77it/s, est. speed input: 5802.45 toks/s, output: 278.85 toks/s]
RequestOutput(request_id=0, prompt=None, prompt_token_ids=[151644, 872, 198, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151667, 151645, 198, 151644, 77091, 198], encoder_prompt=None, encoder_prompt_token_ids=None, prompt_logprobs=None, outputs=[CompletionOutput(index=0, text="<think>\n<think>\nOkay, let's see.", token_ids=[151667, 198, 151667, 198, 32313, 11, 1077, 594, 1490, 13], cumulative_logprob=None, logprobs=None, finish_reason=length, 
stop_reason=None)], finished=True, metrics=None, lora_request=None, num_cached_tokens=0, multi_modal_placeholders={})
ERROR 10-28 16:14:44 [core_client.py:564] Engine core proc EngineCore_DP0 died unexpectedly, shutting down client.
```
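To make the mismatch explicit, the sketch below rebuilds the `prompt_token_ids` printed above and compares its length against the requested limit. The list is reconstructed under the assumption that the repeated id `151667` (the `<think>` token, per the generated text) occurs 200 times, which matches the prompt and the length of 208 reported in case 2. The variable names are illustrative only.

```python
# Token ids copied from the RequestOutput above; the middle run is
# assumed to be 200 repetitions of the "<think>" token id.
TEMPLATE_PREFIX = [151644, 872, 198]
TEMPLATE_SUFFIX = [151645, 198, 151644, 77091, 198]
THINK_ID = 151667

prompt_token_ids = TEMPLATE_PREFIX + [THINK_ID] * 200 + TEMPLATE_SUFFIX

requested_limit = 10  # value passed as truncate_prompt_tokens
observed_len = len(prompt_token_ids)

# If truncation had been applied, observed_len would be at most 10.
was_truncated = observed_len <= requested_limit
print(observed_len, was_truncated)
```

In a live session the same check can be run directly on `ret[0].prompt_token_ids` instead of the reconstructed list.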

### case 2: `max_model_len` is not large enough

This raises an error, which suggests that the prompt is not truncated.

```python
import vllm
model = vllm.LLM(model="Qwen/Qwen3-0.6B", max_model_len=100)
ret = model.chat([{"role": "user", "content": "<think>" * 200}], vllm.SamplingParams(
    truncate_prompt_tokens=10,
    max_tokens=10,
))
print(ret[0])
```

```
INFO 10-28 16:20:10 [__init__.py:216] Automatically detected platform cuda.
INFO 10-28 16:20:19 [utils.py:233] non-default args: {'max_model_len': 100, 'disable_log_stats': True}
INFO 10-28 16:20:19 [model.py:547] Resolved architecture: Qwen3ForCausalLM
`torch_dtype` is deprecated! Use `dtype` instead!
INFO 10-28 16:20:20 [model.py:1510] Using max model len 100
INFO 10-28 16:20:21 [scheduler.py:205] Chunked prefill is enabled with max_num_batched_tokens=16384.
(EngineCore_DP0 pid=2341366) INFO 10-28 16:20:22 [core.py:644] Waiting for init message from front-end.
(EngineCore_DP0 pid=2341366) INFO 10-28 16:20:22 [core.py:77] Initializing a V1 LLM engine (v0.11.0) with config: model='Qwen/Qwen3-0.6B', speculative_config=None, tokenizer='Qwen/Qwen3-0.6B', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=100, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=Qwen/Qwen3-0.6B, enable_prefix_caching=True, chunked_prefill_enabled=True, pooler_config=None, compilation_config={"level":3,"debug_dump_path":"","cache_dir":"","backend":"","custom_ops":[],"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output","vllm.mamba_mixer2","vllm.mamba_mixer","vllm.short_conv","vllm.linear_attention","vllm.plamo2_mamba_mixer","vllm.gdn_attention","vllm.sparse_attn_indexer"],"use_inductor":true,"compile_sizes":[],"inductor_compile_config":{"enable_auto_functionalized_v2":false},"inductor_passes":{},"cudagraph_mode":[2,1],"use_cudagraph":true,"cudagraph_num_of_warmups":1,"cudagraph_capture_sizes":[512,504,496,488,480,472,464,456,448,440,432,424,416,408,400,392,384,376,368,360,352,344,336,328,320,312,304,296,288,280,272,264,256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"cudagraph_copy_inputs":false,"full_cuda_graph":false,"use_inductor_graph_partition":false,"pass_config":{},"max_capture_size":512,"local_cache_dir":null}
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
(EngineCore_DP0 pid=2341366) INFO 10-28 16:20:31 [parallel_state.py:1208] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
(EngineCore_DP0 pid=2341366) WARNING 10-28 16:20:32 [topk_topp_sampler.py:66] FlashInfer is not available. Falling back to the PyTorch-native implementation of top-p & top-k sampling. For the best performance, please install FlashInfer.
(EngineCore_DP0 pid=2341366) INFO 10-28 16:20:32 [gpu_model_runner.py:2602] Starting to load model Qwen/Qwen3-0.6B...
(EngineCore_DP0 pid=2341366) INFO 10-28 16:20:32 [gpu_model_runner.py:2634] Loading model from scratch...
(EngineCore_DP0 pid=2341366) INFO 10-28 16:20:32 [cuda.py:366] Using Flash Attention backend on V1 engine.
(EngineCore_DP0 pid=2341366) INFO 10-28 16:20:32 [weight_utils.py:392] Using model weights format ['*.safetensors']
(EngineCore_DP0 pid=2341366) INFO 10-28 16:20:33 [weight_utils.py:450] No model.safetensors.index.json found in remote.
Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  2.08it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  2.08it/s]
(EngineCore_DP0 pid=2341366)
(EngineCore_DP0 pid=2341366) INFO 10-28 16:20:33 [default_loader.py:267] Loading weights took 0.50 seconds
(EngineCore_DP0 pid=2341366) INFO 10-28 16:20:33 [gpu_model_runner.py:2653] Model loading took 1.1201 GiB and 1.289675 seconds
(EngineCore_DP0 pid=2341366) INFO 10-28 16:20:43 [backends.py:548] Using cache directory: /root/.cache/vllm/torch_compile_cache/9501096389/rank_0_0/backbone for vLLM's torch.compile
(EngineCore_DP0 pid=2341366) INFO 10-28 16:20:43 [backends.py:559] Dynamo bytecode transform time: 9.14 s
(EngineCore_DP0 pid=2341366) INFO 10-28 16:20:46 [backends.py:197] Cache the graph for dynamic shape for later use
(EngineCore_DP0 pid=2341366) INFO 10-28 16:21:00 [backends.py:218] Compiling a graph for dynamic shape takes 16.60 s
(EngineCore_DP0 pid=2341366) INFO 10-28 16:21:06 [monitor.py:34] torch.compile takes 25.74 s in total
(EngineCore_DP0 pid=2341366) INFO 10-28 16:21:07 [gpu_worker.py:298] Available KV cache memory: 118.91 GiB
(EngineCore_DP0 pid=2341366) INFO 10-28 16:21:07 [kv_cache_utils.py:1087] GPU KV cache size: 1,113,296 tokens
(EngineCore_DP0 pid=2341366) INFO 10-28 16:21:07 [kv_cache_utils.py:1091] Maximum concurrency for 100 tokens per request: 9940.14x
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE): 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 67/67 [00:02<00:00, 32.13it/s]
Capturing CUDA graphs (decode, FULL): 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 67/67 [00:04<00:00, 16.42it/s]
(EngineCore_DP0 pid=2341366) INFO 10-28 16:21:14 [gpu_model_runner.py:3480] Graph capturing finished in 7 secs, took 0.98 GiB
(EngineCore_DP0 pid=2341366) INFO 10-28 16:21:14 [core.py:210] init engine (profile, create kv cache, warmup model) took 40.31 seconds
INFO 10-28 16:21:15 [llm.py:306] Supported_tasks: ['generate']
INFO 10-28 16:21:16 [chat_utils.py:560] Detected the chat template content format to be 'string'. You can set `--chat-template-content-format` to override this.
Adding requests:   0%|                                                                                                                                                                                                                                                                                                                                     | 0/1 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/mnt/shared/fujita/debug_vllm_truncate_prompt_tokens/debug_vllm_truncate_prompt_tokens.py", line 6, in <module>
    ret = model.chat([{"role": "user", "content": "<think>" * 200}], vllm.SamplingParams(
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/shared/fujita/debug_vllm_truncate_prompt_tokens/.venv/lib/python3.12/site-packages/vllm/entrypoints/llm.py", line 893, in chat
    return self.generate(
           ^^^^^^^^^^^^^^
  File "/mnt/shared/fujita/debug_vllm_truncate_prompt_tokens/.venv/lib/python3.12/site-packages/vllm/entrypoints/llm.py", line 393, in generate
    self._validate_and_add_requests(
  File "/mnt/shared/fujita/debug_vllm_truncate_prompt_tokens/.venv/lib/python3.12/site-packages/vllm/entrypoints/llm.py", line 1516, in _validate_and_add_requests
    self._add_request(
  File "/mnt/shared/fujita/debug_vllm_truncate_prompt_tokens/.venv/lib/python3.12/site-packages/vllm/entrypoints/llm.py", line 1569, in _add_request
    self.llm_engine.add_request(
  File "/mnt/shared/fujita/debug_vllm_truncate_prompt_tokens/.venv/lib/python3.12/site-packages/vllm/v1/engine/llm_engine.py", line 230, in add_request
    prompt_str, request = self.processor.process_inputs(
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/shared/fujita/debug_vllm_truncate_prompt_tokens/.venv/lib/python3.12/site-packages/vllm/v1/engine/processor.py", line 392, in process_inputs
    self._validate_model_inputs(encoder_inputs, decoder_inputs)
  File "/mnt/shared/fujita/debug_vllm_truncate_prompt_tokens/.venv/lib/python3.12/site-packages/vllm/v1/engine/processor.py", line 466, in _validate_model_inputs
    self._validate_model_input(decoder_inputs, prompt_type="decoder")
  File "/mnt/shared/fujita/debug_vllm_truncate_prompt_tokens/.venv/lib/python3.12/site-packages/vllm/v1/engine/processor.py", line 535, in _validate_model_input
    raise ValueError(
ValueError: The decoder prompt (length 208) is longer than the maximum model length of 100. Make sure that `max_model_len` is no smaller than the number of text tokens.
ERROR 10-28 16:21:16 [core_client.py:564] Engine core proc EngineCore_DP0 died unexpectedly, shutting down client.
```
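The length arithmetic behind this case can be sketched as follows. This is a simplified check written for this report and is not vLLM's actual validation logic; `fits_context` and its parameters are hypothetical names.

```python
from typing import Optional


def fits_context(
    prompt_len: int,
    max_tokens: int,
    max_model_len: int,
    truncate_prompt_tokens: Optional[int] = None,
) -> bool:
    """Simplified length budget check (hypothetical, not vLLM code).

    Assumption: if truncation were applied, the effective prompt length
    would be min(prompt_len, truncate_prompt_tokens).
    """
    if truncate_prompt_tokens is None:
        effective_prompt_len = prompt_len
    else:
        effective_prompt_len = min(prompt_len, truncate_prompt_tokens)
    return effective_prompt_len + max_tokens <= max_model_len


# Values from case 2: 208 prompt tokens, max_tokens=10, max_model_len=100.
without_truncation = fits_context(208, 10, 100)
with_truncation = fits_context(208, 10, 100, truncate_prompt_tokens=10)
print(without_truncation, with_truncation)
```

With truncation the request would need only 10 + 10 = 20 tokens, well under the limit of 100, so the `ValueError` above indicates that the untruncated 208-token prompt reached validation.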

### Before submitting a new issue...

- [x] Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the [documentation page](https://docs.vllm.ai/en/latest/), which can answer lots of frequently asked questions.

[Bug]: SamplingParams.truncate_prompt_tokens has no effect in LLM.chat #27642
