[Usage]: How to perform right truncation on the input prompt if it is too long? #17324

@liyongkang123

Description

Your current environment

My env is fine, so I did not put anything here.

How would you like to use vllm

I want to run inference with meta-llama/Meta-Llama-3.1-8B-Instruct (Llama 3.1, which has a 128k context length). I use the following code for chat:

    from vllm import LLM, SamplingParams

    sampling_params = SamplingParams(temperature=0.8,
                                     top_p=0.95,
                                     max_tokens=512)

    # Create an LLM.
    llm = LLM(model='meta-llama/Meta-Llama-3.1-8B-Instruct',
              quantization="fp8",
              task='generate',
              tensor_parallel_size=1,
              enforce_eager=True,
              enable_expert_parallel=False)
    outputs = llm.chat(rank_prompts, sampling_params, use_tqdm=True)

Sometimes my prompt is too long, causing an error. Is it possible to keep only the first k tokens of the prompt?

I found that the current vLLM settings for limiting the maximum input tokens seem to keep only the last k tokens, not the first k tokens. I noticed the following settings:

1. The `--max-model-len` parameter in the vLLM engine:
   Model context length. If unspecified, will be automatically derived from the model config. Supports k/m/g/K/M/G in human-readable format. Examples:
   1k → 1000
   1K → 1024

I'm not entirely sure how this method performs truncation.
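To illustrate the suffix convention quoted above (lowercase suffixes are decimal, uppercase are binary), here is a small sketch; this is only an illustration of the documented examples, not vLLM's actual parser:

```python
# Illustrative sketch of the human-readable length convention
# (NOT vLLM's actual implementation): lowercase k/m/g are decimal
# multipliers, uppercase K/M/G are binary multipliers.
def parse_model_len(value: str) -> int:
    suffixes = {"k": 1000, "m": 1000**2, "g": 1000**3,
                "K": 1024, "M": 1024**2, "G": 1024**3}
    if value and value[-1] in suffixes:
        return int(value[:-1]) * suffixes[value[-1]]
    return int(value)

print(parse_model_len("1k"))  # 1000
print(parse_model_len("1K"))  # 1024
```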

2. The `truncate_prompt_tokens` parameter in Sampling Parameters:
   If set to an integer k, will use only the last k tokens from the prompt (i.e., left truncation). Defaults to None (i.e., no truncation).
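In plain list terms, the difference between the documented behavior and what I want looks like this (token ids stand in for a tokenized prompt):

```python
token_ids = list(range(10))  # stand-in for a tokenized prompt
k = 4

# What truncate_prompt_tokens does: keep the LAST k tokens.
left_truncated = token_ids[-k:]   # [6, 7, 8, 9]

# What I am asking for: keep the FIRST k tokens.
right_truncated = token_ids[:k]   # [0, 1, 2, 3]
```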

Neither of these options seems to keep the first k tokens. I understand that left truncation (keeping the last k tokens) is the more common setting in LLMs. However, since my task description is at the beginning of the prompt, I would like to know if there is a parameter for right truncation, i.e., keeping the first k tokens. Thank you very much!
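In the meantime, a possible workaround (not a built-in vLLM option, just a sketch assuming a Hugging Face-style tokenizer with `encode()`/`decode()`, such as the one returned by `llm.get_tokenizer()`) is to right-truncate each prompt before passing it to `llm.chat`:

```python
# Sketch of manual right truncation (keep the FIRST k tokens) applied
# before handing prompts to vLLM. Assumes a Hugging Face-style
# tokenizer exposing encode() and decode().
def right_truncate(prompt: str, k: int, tokenizer) -> str:
    token_ids = tokenizer.encode(prompt, add_special_tokens=False)
    if len(token_ids) <= k:
        return prompt  # already short enough, keep as-is
    return tokenizer.decode(token_ids[:k])
```

This loses the tail of the prompt, but preserves the task description at the beginning, which is the behavior asked about here.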

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
