Description
Your current environment
My env is fine, so I did not put anything here.
How would you like to use vllm
I want to run inference with meta-llama/Meta-Llama-3.1-8B-Instruct (Llama 3.1, which has a 128k context length). I use the following code for chat:
```python
from vllm import LLM, SamplingParams

sampling_params = SamplingParams(temperature=0.8,
                                 top_p=0.95,
                                 max_tokens=512)
# Create an LLM.
llm = LLM(model="meta-llama/Meta-Llama-3.1-8B-Instruct",
          quantization="fp8",
          task="generate",
          tensor_parallel_size=1,
          enforce_eager=True,
          enable_expert_parallel=False)
# rank_prompts holds the chat messages to run.
outputs = llm.chat(rank_prompts, sampling_params, use_tqdm=True)
```
Sometimes my prompt is too long, which causes an error. Is it possible to configure vLLM to keep only the first k tokens of the prompt?
As far as I can tell, the current vLLM options for limiting the number of input tokens keep only the last k tokens, not the first k. I found the following settings:
1. The `--max-model-len` parameter in the vLLM engine:
Model context length. If unspecified, will be automatically derived from the model config. Supports k/m/g/K/M/G in human-readable format. Examples: `1k` → 1000, `1K` → 1024.
I'm not entirely sure how this method performs truncation.
2. The `truncate_prompt_tokens` parameter in Sampling Parameters:
If set to an integer k, will use only the last k tokens from the prompt (i.e., left truncation). Defaults to None (i.e., no truncation).
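To illustrate the documented behavior, here is a toy sketch of left truncation, with integer token ids standing in for a real tokenized prompt (this is not vLLM's actual implementation, just the slicing it describes):

```python
def left_truncate(token_ids, k):
    """Mimic truncate_prompt_tokens=k: keep only the LAST k tokens."""
    if k is None or len(token_ids) <= k:
        return token_ids
    return token_ids[-k:]

# A 10-token "prompt": left truncation drops the beginning,
# which is exactly where my task description lives.
prompt_ids = list(range(10))
print(left_truncate(prompt_ids, 4))  # → [6, 7, 8, 9]
```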
Neither of these options seems to keep the first k tokens. I understand that left truncation (keeping the last k tokens) is the more common convention for LLMs. However, since my task description is at the beginning of the prompt, I would like to know whether there is a setting for right truncation, i.e., keeping the first k tokens. Thank you very much!
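In the meantime, a possible workaround is to right-truncate the prompt manually before calling `llm.chat()`. The sketch below uses a toy whitespace tokenizer as a stand-in for the model's real tokenizer; in practice one would encode and decode with the Llama 3.1 tokenizer (e.g. loaded via `transformers.AutoTokenizer`) so that k counts real model tokens:

```python
def right_truncate(text, k, tokenize=str.split, detokenize=" ".join):
    """Keep only the FIRST k tokens of the prompt (right truncation).

    tokenize/detokenize default to a whitespace toy; swap in the real
    tokenizer's encode/decode for actual use.
    """
    tokens = tokenize(text)
    if len(tokens) <= k:
        return text
    return detokenize(tokens[:k])

prompt = "Summarize the following report in detail please now"
print(right_truncate(prompt, 3))  # → "Summarize the following"
```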
Before submitting a new issue...
- Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.