Feature Request: Allow per-device memory margin for --fit-target #18390

@kirel

Description

Is your feature request related to a problem? Please describe.

Currently, the --fit-target (or -fitt) parameter accepts a single integer value (in MiB), which sets a uniform memory margin target across all available devices.

-fitt, --fit-target MiB                 target margin per device for --fit option, default: 1024
                                        (env: LLAMA_ARG_FIT_TARGET)

In heterogeneous multi-GPU environments, this one-size-fits-all approach is suboptimal. For example:

  • Primary Display GPU: Often needs a larger overhead (e.g., 2048 MiB) to handle desktop compositing, browser tabs, and other display tasks without crashing or stuttering.
  • Secondary/Dedicated Compute GPUs: Can safely run with a much smaller margin (e.g., 256 MiB or even less) to maximize VRAM utilization for larger models or context.

Using a single large value (e.g., 2048) wastes valuable VRAM on the dedicated compute cards. Using a single small value (e.g., 256) risks OOM or instability on the primary display adapter.

Describe the solution you'd like

I would like --fit-target to support a comma-separated list of values, similar to how --tensor-split or --device works.

Proposed Syntax:
--fit-target 2048,256,256

  • Device 0: Uses 2048 MiB margin.
  • Device 1: Uses 256 MiB margin.
  • Device 2: Uses 256 MiB margin.

If a single value is provided (e.g., --fit-target 1024), it should retain the current behavior of applying that margin to all devices.
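A minimal sketch of how such an argument could be parsed, assuming a hypothetical helper (`parse_fit_targets` is not an existing llama.cpp function): a single value is broadcast to all devices to preserve current behavior, a comma-separated list maps one entry per device, and a mismatched count is rejected.

```cpp
#include <cstdint>
#include <sstream>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper: parse a --fit-target argument into per-device
// margins (MiB). A single value is broadcast to all devices (current
// behavior); a comma-separated list must have one entry per device.
static std::vector<int64_t> parse_fit_targets(const std::string & arg, size_t n_devices) {
    std::vector<int64_t> margins;
    std::stringstream ss(arg);
    std::string item;
    while (std::getline(ss, item, ',')) {
        margins.push_back(std::stoll(item)); // throws on non-numeric input
    }
    if (margins.size() == 1) {
        // broadcast the single value across all devices
        margins.assign(n_devices, margins[0]);
    } else if (margins.size() != n_devices) {
        throw std::invalid_argument(
            "--fit-target: expected 1 or " + std::to_string(n_devices) +
            " values, got " + std::to_string(margins.size()));
    }
    return margins;
}
```

With this shape, `--fit-target 1024` on a 3-GPU box yields `{1024, 1024, 1024}`, while `--fit-target 2048,256,256` yields the per-device margins from the example above. This mirrors how `--tensor-split` already accepts a comma-separated list.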

Describe alternatives you've considered

  • Manual Tuning: Disabling --fit and manually tuning context size or layer offloading. This is tedious and fragile, breaking whenever other VRAM usage changes or models are swapped.
  • Lowest Common Denominator: Setting --fit-target to the max required by any single card (e.g., the display GPU). This results in wasted VRAM on all other cards, potentially preventing a model from fitting that otherwise would.

Additional context

This would greatly improve quality of life for "homelab" or workstation setups where mixed-use GPUs are common (e.g., an RTX 3090 for display/inference + an RTX 3060 for dedicated inference).

Specific Use Case:
On my secondary GPU, I run TTS (Text-to-Speech) and STT (Speech-to-Text) workloads for Home Assistant. These consume a specific, relatively static amount of VRAM. I want to reserve a tight margin for these services while letting llama.cpp utilize the remaining VRAM on that card, without imposing the same strict margin on my primary GPU or vice-versa.

Labels: enhancement (New feature or request)