Is your feature request related to a problem? Please describe.
Currently, the --fit-target (or -fitt) parameter accepts a single integer value (in MiB), which sets a uniform memory margin target across all available devices.
```
-fitt, --fit-target MiB    target margin per device for --fit option, default: 1024
                           (env: LLAMA_ARG_FIT_TARGET)
```
In heterogeneous multi-GPU environments, this "one size fits all" approach is suboptimal. For example:
- Primary Display GPU: Often needs a larger overhead (e.g., 2048 MiB) to handle desktop compositing, browser tabs, and other display tasks without crashing or stuttering.
- Secondary/Dedicated Compute GPUs: Can safely run with a much smaller margin (e.g., 256 MiB or even less) to maximize VRAM utilization for larger models or context.
Using a single large value (e.g., 2048) wastes valuable VRAM on the dedicated compute cards. Using a single small value (e.g., 256) risks OOM or instability on the primary display adapter.
Describe the solution you'd like
I would like --fit-target to support a comma-separated list of values, similar to how --tensor-split or --device works.
Proposed Syntax:
```
--fit-target 2048,256,256
```
- Device 0: Uses 2048 MiB margin.
- Device 1: Uses 256 MiB margin.
- Device 2: Uses 256 MiB margin.
If a single value is provided (e.g., --fit-target 1024), it should retain the current behavior of applying that margin to all devices.
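To make the proposed semantics concrete, here is a minimal parsing sketch. The function name `parse_fit_targets` and the fallback behavior (broadcasting the last value to any remaining devices) are illustrative assumptions, not existing llama.cpp API; only the comma-separated syntax and the 1024 MiB default come from this request.

```cpp
// Illustrative sketch only: parse_fit_targets is a hypothetical helper,
// not part of llama.cpp. It shows one way "--fit-target 2048,256,256"
// could map to one margin (in MiB) per device.
#include <cstdint>
#include <sstream>
#include <string>
#include <vector>

std::vector<int64_t> parse_fit_targets(const std::string & arg, size_t n_devices) {
    std::vector<int64_t> targets;
    std::stringstream ss(arg);
    std::string item;
    while (std::getline(ss, item, ',')) {
        targets.push_back(std::stoll(item));
    }
    if (targets.empty()) {
        targets.push_back(1024); // current default margin
    }
    // A single value (or a short list) is broadcast to the remaining
    // devices, preserving today's uniform-margin behavior.
    while (targets.size() < n_devices) {
        targets.push_back(targets.back());
    }
    targets.resize(n_devices); // ignore surplus entries
    return targets;
}
```

With this shape, `--fit-target 1024` on three devices yields `{1024, 1024, 1024}`, unchanged from current behavior, while `--fit-target 2048,256,256` yields the per-device margins described above.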
Describe alternatives you've considered
- Manual Tuning: Disabling --fit and manually tuning context size or layer offloading. This is tedious and fragile, breaking whenever other VRAM usage changes or models are swapped.
- Lowest Common Denominator: Setting --fit-target to the max required by any single card (e.g., the display GPU). This results in wasted VRAM on all other cards, potentially preventing a model from fitting that otherwise would.
Additional context
This would greatly improve quality of life for "homelab" or workstation setups where mixed-use GPUs are common (e.g., an RTX 3090 for display/inference + an RTX 3060 for dedicated inference).
Specific Use Case:
On my secondary GPU, I run TTS (Text-to-Speech) and STT (Speech-to-Text) workloads for Home Assistant. These consume a specific, relatively static amount of VRAM. I want to reserve a tight margin for these services while letting llama.cpp utilize the remaining VRAM on that card, without imposing the same strict margin on my primary GPU or vice-versa.