There has been a steady stream of requests for improved group functionality. In August I wrote that the groups design already felt close to the right balance between complexity and what groups were meant to do.
Since then there have been requests for more complex swapping logic to fit all sorts of model combinations people want to run across various numbers of GPUs. These requests revealed the flaws and limitations of the existing groups design. llama-swap has also grown to support generative image and audio endpoints. For a while now I have been thinking about how to enable more complex swap strategies while keeping things easy to understand. Additionally, coding agents are good enough now that I can be more ambitious about the features I'm comfortable maintaining and supporting for this project.
For the new design, llama-swap will introduce a basic DSL for expressing sets of models that can run together. When a request comes in, the solver decides what needs to be unloaded to make space for the new model. llama-swap will still expect the admin to provide configurations that work.
(edit) Final configuration design:
# =============================================================================
# matrix: run concurrent models with a solver-based swap DSL
# =============================================================================
#
# Note:
# A config must use either groups or matrix, not both. A configuration
# error will be raised if both are defined.
#
# The matrix declares valid combinations of models that can run concurrently.
# When a model is requested, the solver finds the cheapest way to make it
# available by evicting as few (and least costly) running models as possible.
#
# Solver behavior:
# 1. Request arrives for model X
# 2. If X is already running, forward immediately. Done.
# 3. Find all sets containing X
# 4. For each candidate set, compute cost: sum of evict_costs for
# every running model NOT in that set
# 5. Pick lowest cost candidate. Ties broken by definition order.
# 6. Evict what needs to stop. Start X. Forward request.
#
# Subset semantics: a set [a, b, c] means any subset is valid.
# Only the requested model is started — others are not preloaded.
#
# A model not appearing in any set can only run alone.
#
matrix:
# vars: short names for models (alphanumeric, 1-8 chars)
# - required for sets and evict_costs settings
# - each entry is a short name to a real model ID. Do not use an alias
# - used to keep set DSL logic short and easier to read
# - sets and evict_costs only use identifiers defined in vars
vars:
g: gemma-model
q: qwen-model
m: mistral-model
v: voxtral-model
e: reranker-model
L: llama-70B
sd: stable-diffusion
# evict_costs: relative cost of losing a running model (default: 1)
evict_costs:
v: 50 # vllm backend, slow cold start
L: 30 # 70B weights, slow to load
# sets: named sets of concurrent model combinations
# Values are DSL strings with operators:
# & AND (models run together)
# | OR (alternatives)
# () grouping
# +ref inline another set's expression
#
# Expansion examples:
# "L" → [L]
# "a & b" → [a, b]
# "a | b" → [a], [b]
# "(a | b) & c" → [a, c], [b, c]
# "(a | b) & (c | d)" → [a,c], [a,d], [b,c], [b,d]
# "+llms & v" → expands llms inline, then applies & v
sets:
# LLM + TTS: switching between g/q/m won't evict v
# expands to: [g,v], [q,v], [m,v]
standard: "(g | q | m) & v"
# LLM + TTS + reranker
# expands to: [g,v,e], [q,v,e]
with_rerank: "(g | q) & v & e"
# LLM + image generation, no TTS
# expands to: [g,sd], [q,sd]
creative: "(g | q) & sd"
# 70B model uses all GPUs, can only run alone
# expands to: [L]
full: "L"
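To make the expansion rules concrete, here is a rough sketch in Python of how a set-DSL string could be expanded into its list of valid combinations. This is illustrative only, not llama-swap's actual parser; the `+ref` inlining operator is omitted for brevity, and the grammar is simply: `|` alternates, `&` combines, `()` groups.

```python
import re

def expand(expr):
    """Expand a set-DSL string into an ordered list of frozensets.

    '|' produces alternatives, '&' merges every left combination with
    every right combination, '()' groups. Definition order is preserved,
    which matters for the solver's tie-breaking.
    """
    tokens = re.findall(r'[A-Za-z0-9]+|[&|()]', expr)
    pos = 0

    def parse_or():
        nonlocal pos
        alts = parse_and()
        while pos < len(tokens) and tokens[pos] == '|':
            pos += 1
            alts = alts + parse_and()
        return alts

    def parse_and():
        nonlocal pos
        combos = parse_atom()
        while pos < len(tokens) and tokens[pos] == '&':
            pos += 1
            right = parse_atom()
            # '&' distributes over '|': pairwise union of combinations
            combos = [a | b for a in combos for b in right]
        return combos

    def parse_atom():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        if tok == '(':
            inner = parse_or()
            pos += 1  # consume ')'
            return inner
        return [frozenset([tok])]

    return parse_or()
```

With this sketch, `expand("(g | q | m) & v")` yields the three combinations `[{g,v}, {q,v}, {m,v}]`, matching the expansion comments in the config above.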
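The solver's cost step (steps 4 and 5 above) can also be sketched in a few lines. Again this is a hypothetical illustration, not llama-swap's internals: `expanded_sets` is assumed to be the list of combinations produced by expanding every `sets` entry, in definition order.

```python
def pick_candidate(requested, expanded_sets, running, evict_costs):
    """Pick the cheapest expanded set containing the requested model.

    expanded_sets: sets of short names, in definition order
    running: short names of currently running models
    evict_costs: short name -> relative cost (default 1)
    Returns (candidate, cost), or (None, None) if no set contains
    the requested model (it can then only run alone).
    """
    best, best_cost = None, None
    for candidate in expanded_sets:
        if requested not in candidate:
            continue
        # cost = sum of evict_costs for running models NOT in this set
        cost = sum(evict_costs.get(m, 1) for m in running if m not in candidate)
        # strict '<' keeps the earlier candidate, breaking ties
        # by definition order
        if best is None or cost < best_cost:
            best, best_cost = candidate, cost
    return best, best_cost
```

For example, with `g` and `v` running and a request for `q`, the candidate `{q, v}` costs only 1 (evict `g`), while `{q, sd}` would cost 51 (evict `g` and the expensive `v`), so the solver keeps `v` loaded.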