Feature Request: Allow per-device memory margin for --fit-target #18390

@kirel

Description

Is your feature request related to a problem? Please describe.

Currently, the --fit-target (or -fitt) parameter accepts a single integer value (in MiB), which sets a uniform memory margin target across all available devices.

-fitt, --fit-target MiB                 target margin per device for --fit option, default: 1024
                                        (env: LLAMA_ARG_FIT_TARGET)

In heterogeneous multi-GPU environments, this one-size-fits-all approach is suboptimal. For example:

  • Primary Display GPU: Often needs a larger overhead (e.g., 2048 MiB) to handle desktop compositing, browser tabs, and other display tasks without crashing or stuttering.
  • Secondary/Dedicated Compute GPUs: Can safely run with a much smaller margin (e.g., 256 MiB or even less) to maximize VRAM utilization for larger models or context.

Using a single large value (e.g., 2048) wastes valuable VRAM on the dedicated compute cards. Using a single small value (e.g., 256) risks OOM or instability on the primary display adapter.

Describe the solution you'd like

I would like --fit-target to support a comma-separated list of values, similar to how --tensor-split or --device works.

Proposed Syntax:
--fit-target 2048,256,256

  • Device 0: Uses 2048 MiB margin.
  • Device 1: Uses 256 MiB margin.
  • Device 2: Uses 256 MiB margin.

If a single value is provided (e.g., --fit-target 1024), it should retain the current behavior of applying that margin to all devices.
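A minimal sketch of how such an argument could be parsed, assuming a hypothetical helper (`parse_fit_targets` is not an existing llama.cpp function): a single value is broadcast to all devices to preserve current behavior, a comma-separated list maps one entry per device, and a mismatched count is rejected.

```cpp
#include <cstdint>
#include <sstream>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper: parse a --fit-target argument into per-device
// margins (MiB). A single value is broadcast to all devices (current
// behavior); a comma-separated list must have one entry per device.
static std::vector<int64_t> parse_fit_targets(const std::string & arg, size_t n_devices) {
    std::vector<int64_t> margins;
    std::stringstream ss(arg);
    std::string item;
    while (std::getline(ss, item, ',')) {
        margins.push_back(std::stoll(item)); // throws on non-numeric input
    }
    if (margins.size() == 1) {
        // broadcast the single value across all devices
        margins.assign(n_devices, margins[0]);
    } else if (margins.size() != n_devices) {
        throw std::invalid_argument(
            "--fit-target: expected 1 or " + std::to_string(n_devices) +
            " values, got " + std::to_string(margins.size()));
    }
    return margins;
}
```

With this shape, `--fit-target 1024` on a 3-GPU box yields `{1024, 1024, 1024}`, while `--fit-target 2048,256,256` yields the per-device margins from the example above. This mirrors how `--tensor-split` already accepts a comma-separated list.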

Describe alternatives you've considered

  • Manual Tuning: Disabling --fit and manually tuning context size or layer offloading. This is tedious and fragile, breaking whenever other VRAM usage changes or models are swapped.
  • Lowest Common Denominator: Setting --fit-target to the max required by any single card (e.g., the display GPU). This results in wasted VRAM on all other cards, potentially preventing a model from fitting that otherwise would.

Additional context

This would greatly improve quality of life for "homelab" or workstation setups where mixed-use GPUs are common (e.g., an RTX 3090 for display/inference + an RTX 3060 for dedicated inference).

Specific Use Case:
On my secondary GPU, I run TTS (Text-to-Speech) and STT (Speech-to-Text) workloads for Home Assistant. These consume a specific, relatively static amount of VRAM. I want to reserve a tight margin for these services while letting llama.cpp utilize the remaining VRAM on that card, without imposing the same strict margin on my primary GPU or vice-versa.

Labels: enhancement (New feature or request)