Update hyperparameter_tuning.md #7454
Merged
```diff
@@ -11,15 +11,15 @@ When the server is running at full load in a steady state, look for the followin

 `#queue-req` indicates the number of requests in the queue.
 If you frequently see `#queue-req: 0`, it suggests that your client code is submitting requests too slowly.
-A healthy range for `#queue-req` is `100 - 1000`.
+A healthy range for `#queue-req` is `100 - 2000`.
 However, avoid making `#queue-req` too large, as this will increase the scheduling overhead on the server.

-### Tune `--schedule-conservativeness` to achieve a high `token usage`.
+### Achieve a high `token usage`

 `token usage` indicates the KV cache memory utilization of the server. `token usage > 0.9` means good utilization.

 If you frequently see `token usage < 0.9` and `#queue-req > 0`, it means the server is too conservative about taking in new requests. You can decrease `--schedule-conservativeness` to a value like 0.3.
-The case of server being too conservative can happen when users send many requests with a large `max_new_tokens` but the requests stop very early due to EOS or stop strings.
+The case of a server being too conservative can happen when users send many requests with a large `max_new_tokens` but the requests stop very early due to EOS or stop strings.

 On the other hand, if you see `token usage` very high and you frequently see warnings like
 `KV cache pool is full. Retract requests. #retracted_reqs: 1, #new_token_ratio: 0.9998 -> 1.0000`, you can increase `--schedule-conservativeness` to a value like 1.3.
```
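The tuning guidance in this hunk can be condensed into a small decision helper. The function below is a hypothetical illustration, not an SGLang API: given the server's reported `token usage` and `#queue-req`, it suggests which way to move `--schedule-conservativeness`, using the thresholds stated in the document.

```python
def suggest_schedule_conservativeness(token_usage: float, queue_req: int,
                                      kv_cache_full_warnings: bool = False) -> str:
    """Suggest a direction for --schedule-conservativeness based on the
    indicators described above. Thresholds mirror the document's guidance;
    this helper itself is a hypothetical sketch, not part of SGLang."""
    if kv_cache_full_warnings and token_usage > 0.9:
        # KV cache pool is full and requests are being retracted:
        # make the scheduler more conservative.
        return "increase --schedule-conservativeness (e.g. to 1.3)"
    if token_usage < 0.9 and queue_req > 0:
        # Requests are waiting but the KV cache is under-utilized:
        # make the scheduler less conservative.
        return "decrease --schedule-conservativeness (e.g. to 0.3)"
    return "no change needed"

print(suggest_schedule_conservativeness(token_usage=0.75, queue_req=200))
```

In practice you would read `token usage` and `#queue-req` off the server's periodic log lines while the load test runs at a steady state.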
```diff
@@ -36,7 +36,7 @@ for activations and CUDA graph buffers.

 A simple strategy is to increase `--mem-fraction-static` by 0.01 each time until you encounter out-of-memory errors.

-## Avoid out-of-memory errors by tuning `--chunked-prefill-size`, `--mem-fraction-static`, and `--max-running-requests`
+### Avoid out-of-memory errors by tuning `--chunked-prefill-size`, `--mem-fraction-static`, and `--max-running-requests`

 If you encounter out-of-memory (OOM) errors, you can adjust the following parameters:
```
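The 0.01-step strategy above can be sketched as a loop over candidate launch commands. The starting value `0.88`, the step count, and the bare command shape are assumptions for illustration; a real launch also needs `--model-path` and other flags.

```python
def mem_fraction_sweep(start: float = 0.88, step: float = 0.01, steps: int = 5) -> list[str]:
    """Generate candidate launch commands that raise --mem-fraction-static
    by `step` each trial, per the strategy above. In practice you run each
    command and keep the last value that does not hit out-of-memory errors.
    The start value and command shape are illustrative assumptions."""
    cmds = []
    for i in range(steps):
        frac = round(start + i * step, 2)
        cmds.append(f"python -m sglang.launch_server --mem-fraction-static {frac}")
    return cmds

for cmd in mem_fraction_sweep():
    print(cmd)
```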
```diff
@@ -57,5 +57,6 @@ Data parallelism is better for throughput. When there is enough GPU memory, alwa
 ### Try other options

 - `torch.compile` accelerates small models on small batch sizes. You can enable it with `--enable-torch-compile`.
-- Try other quantization (e.g. FP8 quantizatioin) or other parallelism strategies (e.g. expert parallelism)
+- Try other quantization (e.g. FP8 quantization with `--quantization fp8`)
+- Try other parallelism strategies (e.g. expert parallelism) or DP attention for deepseek models (with `--enable-dp-attention --dp-size 8`).
 - If the workload has many shared prefixes, try `--schedule-policy lpm`. Here, `lpm` stands for longest prefix match. It reorders requests to encourage more cache hits but introduces more scheduling overhead.
```
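As a rough intuition for `--schedule-policy lpm`, the sketch below reorders pending requests so that those sharing the longest prefix with already-cached prefixes run first. This is a toy illustration of the idea, not SGLang's actual scheduler; the request strings and cache contents are made up for the example.

```python
def shared_prefix_len(a: str, b: str) -> int:
    """Length of the common character prefix of two strings."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def lpm_order(pending: list[str], cached_prefixes: list[str]) -> list[str]:
    """Toy longest-prefix-match scheduling: run the requests with the
    longest match against cached prefixes first, to encourage cache hits."""
    def best_match(req: str) -> int:
        return max((shared_prefix_len(req, c) for c in cached_prefixes), default=0)
    return sorted(pending, key=best_match, reverse=True)

cached = ["You are a helpful assistant."]
pending = ["Translate: hello",
           "You are a helpful assistant. Summarize:",
           "You are a help desk bot."]
print(lpm_order(pending, cached))
```

The extra sort is where the "more scheduling overhead" mentioned in the document comes from: the scheduler must compare every pending request against the cache on each scheduling step.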
Review comment (Contributor): Consider providing context or a rationale for the increased range for `#queue-req`.