Packages HuggingFace model weights into immutable Nix store paths for serving by vLLM, Triton Inference Server (vLLM backend), and SGLang.
LLM inference runtimes typically download model weights at startup — from HuggingFace Hub, S3, or a shared filesystem. This is slow, non-reproducible, and hard to version. By packaging weights into the Nix store, every deployment gets an identical, content-addressed directory of model files. The same Nix hash always contains exactly the same weights. Runtimes mount the store path directly — no downloads, no caching surprises, no "works on my machine."
Before diving into the layouts, here are the HuggingFace and Triton terms used throughout:
- **HuggingFace Hub** — the public registry where model authors publish weights. Each model lives at a path like `microsoft/Phi-4-mini-instruct`.
- **Slug** — a flattened version of the HuggingFace model path with `/` replaced by `--`. For example, `microsoft/Phi-4-mini-instruct` becomes `microsoft--Phi-4-mini-instruct`. HuggingFace's download tools use this format for local cache directories.
- **Snapshot ID** — a commit hash from HuggingFace Hub that pins a specific version of the model weights. Think of it as a git commit SHA for model files.
- **HF cache layout** — the directory structure that HuggingFace's `huggingface_hub` Python library creates when it downloads a model. It stores files under `hub/models--<slug>/snapshots/<hash>/`. Tools like vLLM and SGLang use this library internally, so they know how to find models in this layout.
- **Triton model repository** — Triton Inference Server discovers models by scanning a directory for subdirectories that contain a `config.pbtxt` file. Each subdirectory is one servable model.
- **`config.pbtxt`** — a Protobuf-text configuration file that tells Triton how to serve a model: which backend to use (vLLM in our case), input/output tensor shapes, and streaming configuration.
- **`model-defaults.json`** — a custom file this repo creates alongside `config.pbtxt`. It holds vLLM engine parameters (GPU memory fraction, max sequence length, data type, etc.) that the runtime reads when launching the model.
- **`$out`** — in Nix build expressions, `$out` is the output path in the Nix store (e.g. `/nix/store/abc123-vllm-qwen3.5-2b-1.0.0/`). Everything the build produces goes under this path.
The shared builder `mkHfModel.nix` produces one of two directory layouts, depending on which parameters are provided.
The simpler of the two, used when only Triton needs to serve the model. Weight files are copied directly into a `weights/` directory:
```
$out/share/models/<tritonModelName>/
    config.pbtxt          # tells Triton to use the vLLM backend
    model-defaults.json   # vLLM engine parameters (GPU memory, max length, dtype, etc.)
    weights/              # model files (tensors, tokenizer, config) copied here
```
Triton scans `$out/share/models/`, finds the `<tritonModelName>/` directory with its `config.pbtxt`, and serves the model.
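The discovery step can be sketched with a throwaway mock of the single layout (the temp directory, model name, and file contents here are illustrative, not the real package):

```shell
# Mock a single-layout package in a temp dir (paths and names are illustrative).
out=$(mktemp -d)
mkdir -p "$out/share/models/qwen3_5_2b/weights"
touch "$out/share/models/qwen3_5_2b/config.pbtxt"
echo '{"max_model_len": 4096}' > "$out/share/models/qwen3_5_2b/model-defaults.json"

# Triton-style discovery: any subdirectory with a config.pbtxt is a servable model.
for dir in "$out"/share/models/*/; do
  if [ -f "$dir/config.pbtxt" ]; then
    echo "servable model: $(basename "$dir")"   # prints: servable model: qwen3_5_2b
  fi
done
```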
Used when both standalone vLLM and Triton need to serve the same packaged weights. The builder creates a full HuggingFace cache tree and points Triton at it via symlink — so there's only one copy of the (often multi-gigabyte) weight files:
```
$out/share/models/hub/models--<slug>/
    refs/main                 # text file containing the snapshot ID
    snapshots/<snapshotId>/   # the actual model files (single copy)
$out/share/models/<tritonModelName>/
    config.pbtxt              # Triton config (same as single layout)
    model-defaults.json       # vLLM engine parameters
    weights -> ../hub/models--<slug>/snapshots/<snapshotId>   # symlink, not a copy
```
How this works for each runtime:
- **vLLM / SGLang** — set the environment variable `HF_HUB_CACHE=$out/share/models/hub`. These runtimes use HuggingFace's library internally, which follows the `refs/main` → `snapshots/<hash>` chain to find model files, exactly as if the model had been downloaded normally.
- **Triton** — set `--model-repository=$out/share/models`. Triton sees `<tritonModelName>/config.pbtxt` and follows the `weights/` symlink to reach the same files. It doesn't know or care about the HF cache structure.
Zero file duplication — both runtimes read from the same snapshot directory.
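Both access routes can be traced on a mock of the dual layout (the slug, snapshot ID, and file contents below are illustrative):

```shell
# Mock the dual layout in a temp dir.
out=$(mktemp -d)
slug="Qwen--Qwen3.5-2B"
snap="15852e8c16360a2fea060d615a32b45270f8a8fc"
mkdir -p "$out/share/models/hub/models--$slug/refs" \
         "$out/share/models/hub/models--$slug/snapshots/$snap" \
         "$out/share/models/qwen3_5_2b"
echo "$snap" > "$out/share/models/hub/models--$slug/refs/main"
echo "weights-bytes" > "$out/share/models/hub/models--$slug/snapshots/$snap/model.safetensors"
ln -s "../hub/models--$slug/snapshots/$snap" "$out/share/models/qwen3_5_2b/weights"

# vLLM/SGLang route: follow refs/main to the pinned snapshot, as huggingface_hub does.
ref=$(cat "$out/share/models/hub/models--$slug/refs/main")
vllm_file="$out/share/models/hub/models--$slug/snapshots/$ref/model.safetensors"

# Triton route: follow the weights/ symlink from the model repository.
triton_file="$out/share/models/qwen3_5_2b/weights/model.safetensors"

# Both routes land on the same file, so nothing is duplicated.
[ "$(cat "$vllm_file")" = "$(cat "$triton_file")" ] && echo "same bytes, zero duplication"
```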
All standard model packages are built by calling `mkHfModel`. Here's a real example (Qwen3.5-2B, dual layout):

```nix
{ pkgs, mkHfModel ? pkgs.callPackage ./mkHfModel.nix {} }:

mkHfModel {
  pname = "vllm-qwen3.5-2b";
  baseVersion = "1.0.0";
  buildMeta = builtins.fromJSON (builtins.readFile ../../build-meta/vllm-qwen3-5-2b.json);
  srcPath = /mnt/scratch/models/inferencing/hub/models--Qwen--Qwen3.5-2B;
  tritonModelName = "qwen3_5_2b";

  # These two enable dual layout (omit both for single layout):
  slug = "Qwen--Qwen3.5-2B";
  snapshotId = "15852e8c16360a2fea060d615a32b45270f8a8fc";

  vllmDefaults = { gpu_memory_utilization = 0.85; max_model_len = 4096; dtype = "auto"; };
}
```

| Parameter | Required | Description |
|---|---|---|
| `pname` | yes | Nix package name (e.g. `"vllm-qwen3.5-2b"`) |
| `baseVersion` | yes | Semantic version (e.g. `"1.0.0"`) |
| `buildMeta` | yes | Parsed JSON from `build-meta/<name>.json` — provides `build_version` and `git_rev_short` |
| `srcPath` | yes | Absolute path to model weights on the local build machine |
| `tritonModelName` | yes | Directory name Triton will see under its model repository (e.g. `"qwen3_5_2b"`) |
| `vllmDefaults` | no | vLLM engine parameters written to `model-defaults.json` (memory budget, dtype, quantization, etc.) |
| `slug` | no | HuggingFace slug with `--` separator (e.g. `"Qwen--Qwen3.5-2B"`); providing this enables the dual layout |
| `snapshotId` | no | HuggingFace snapshot commit hash; required when `slug` is set |
**Layout selection:** if both `slug` and `snapshotId` are provided → dual layout; otherwise → single layout.
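The selection rule can be mirrored in a tiny shell function (a sketch of the logic only; the actual decision lives in `mkHfModel.nix`):

```shell
# Dual layout only when BOTH slug and snapshotId are non-empty.
select_layout() {
  local slug="$1" snapshot_id="$2"
  if [ -n "$slug" ] && [ -n "$snapshot_id" ]; then
    echo "dual"
  else
    echo "single"
  fi
}

select_layout "Qwen--Qwen3.5-2B" "15852e8c16360a2fea060d615a32b45270f8a8fc"  # dual
select_layout "" ""                                                          # single
```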
| Package | Model | Layout | Quantization | GPU req | Notes |
|---|---|---|---|---|---|
| `phi-4-mini-instruct-fp8-hf` | Phi-4-mini-instruct | single | FP8 (PyTorch native) | SM89+ | Triton-only |
| `phi-4-mini-instruct-fp8-torchao` | Phi-4-mini-instruct | single | FP8 (TorchAO) | SM89+ | Triton-only |
| `phi-4-mini-instruct-fp8-sglang` | Phi-4-mini-instruct | HF-only | FP8 (TorchAO) | SM89+ | Custom build (see below) |
| `vllm-phi-3-5-mini-instruct-awq` | Phi-3.5-mini-instruct | dual | AWQ 4-bit | SM75+ | |
| `vllm-qwen3-5-2b` | Qwen3.5-2B | dual | none (full precision) | SM75+ | |
| `vllm-qwen3-5-4b-fp8` | Qwen3.5-4B | single | FP8 (TorchAO) | SM89+ | |
| `vllm-smollm3-3b-fp8` | SmolLM3-3B | single | FP8 (TorchAO) | SM89+ | |
Quantization types in this repo:
- FP8 (W8A8) — 8-bit float weights and activations. Two toolchains: PyTorch native and TorchAO.
- AWQ (INT4) — 4-bit integer weights, FP16 activations. Preserves accuracy via salient-weight protection.
- none (BF16/FP16) — unquantized.
`dtype = "auto"` resolves to BF16 on SM80+, FP16 on older GPUs.
| SM | Generation | Representative GPUs |
|---|---|---|
| SM75 | Turing (2018) | T4, RTX 2070/2080 |
| SM80 | Ampere (2020) | A100 |
| SM86 | Ampere (2020) | A10, RTX 3090 |
| SM89 | Ada Lovelace (2022) | L4, L40, L40S, RTX 4090 |
| SM90 | Hopper (2022) | H100, H200 |
| SM100 | Blackwell (2024) | B200 |
| SM120 | Blackwell (2025) | RTX 5090, RTX 5080 |
This covers the three methods used in this repo plus common alternatives you may encounter:
| Method | SM75 | SM80/86 | SM89 | SM90 | SM100/120 |
|---|---|---|---|---|---|
| FP8 W8A8 | W8A16 only (Marlin) | W8A16 only (Marlin) | native | native | native |
| AWQ (INT4) | supported | Marlin-optimized | Marlin-optimized | Marlin-optimized | Marlin-optimized |
| GPTQ (INT4) | supported | Marlin-optimized | Marlin-optimized | Marlin-optimized | Marlin-optimized |
| INT8 (W8A8) | supported | supported | supported | supported | supported |
| FP16 / BF16 | FP16 only | native (both) | native (both) | native (both) | native (both) |
| FP4 (NVFP4) | — | — | — | — | native |
Legend:
- native — hardware tensor-core support, best performance
- Marlin-optimized — uses Marlin kernels for high-throughput dequantization
- W8A16 only (Marlin) — weights stored in FP8, dequantized to FP16 for compute via Marlin; memory savings but no FP8 compute speedup
- supported — runs correctly, standard CUDA kernels (native AWQ GEMM on SM75, ExLlamaV2 for GPTQ on SM75)
- — — not available on this architecture
- SM89+ (L4, L40, RTX 4090, H100): FP8 is the best default — native tensor-core support, half the memory of FP16.
- SM80/86 (A100, A10, RTX 3090): AWQ or GPTQ for best memory efficiency. FP8 loads but runs as W8A16 via Marlin.
- SM75 (T4, RTX 2070/2080): AWQ or GPTQ recommended. FP8 loads as W8A16 (memory savings, no FP8 compute speedup). Small unquantized models fit in FP16.
See the vLLM quantization docs for the full story.
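The recommendations above collapse into a simple default-picker keyed on compute capability (a sketch mirroring the matrix, not anything this repo ships):

```shell
# Pick a default quantization method from the SM version (per the matrix above).
recommend_quant() {
  local sm="$1"
  if [ "$sm" -ge 89 ]; then
    echo "fp8"          # native FP8 tensor cores (Ada, Hopper, Blackwell)
  elif [ "$sm" -ge 75 ]; then
    echo "awq"          # 4-bit weights; FP8 would run as W8A16 only
  else
    echo "unsupported"  # pre-Turing GPUs are out of scope here
  fi
}

recommend_quant 90   # fp8  (H100)
recommend_quant 86   # awq  (A10 / RTX 3090)
```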
This package is a custom derivation that does not use `mkHfModel`. It produces only the HF cache layout (no Triton `config.pbtxt`) and patches `tokenizer_config.json`, replacing `"TokenizersBackend"` with `"PreTrainedTokenizerFast"`, for compatibility with SGLang 0.5.x (which ships transformers 4.57.x, predating the `TokenizersBackend` class added in 4.58). This package can be removed once SGLang ships transformers >= 4.58.
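The patch itself is a one-line substitution; here is a sketch of the idea on a mock `tokenizer_config.json` (the real derivation's build script may differ):

```shell
cd "$(mktemp -d)"
# Mock tokenizer_config.json using the tokenizer class from transformers >= 4.58.
cat > tokenizer_config.json <<'EOF'
{"tokenizer_class": "TokenizersBackend", "model_max_length": 4096}
EOF

# Rewrite the class name so transformers 4.57.x can load the tokenizer.
sed -i 's/"TokenizersBackend"/"PreTrainedTokenizerFast"/' tokenizer_config.json
cat tokenizer_config.json
# prints: {"tokenizer_class": "PreTrainedTokenizerFast", "model_max_length": 4096}
```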
Each package has a corresponding JSON file in build-meta/ that tracks its version independently of git:
```json
{
  "build_version": 2,
  "force_increment": 0,
  "git_rev": "ca88e1655077dab1a0e8bc3583d98bb2575be501",
  "git_rev_short": "ca88e16",
  "changelog": "Dual layout: Triton + vanilla vLLM via mkHfModel with slug/snapshotId."
}
```

- `build_version` increments independently of git commits. Bump it when re-packaging weights without changing `.nix` files (e.g. updated quantization of the same model).
- Final version string: `<baseVersion>+<git_rev_short>` (e.g. `1.0.0+ca88e16`). The `baseVersion` comes from the `.nix` file; the git rev is recorded at build time.
- A marker file is written to `$out/share/<pname>/flox-build-version-<build_version>` so you can identify which build produced a given store path.
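The final version string can be reproduced from a build-meta file with plain shell (a sketch; the real builder assembles it in Nix):

```shell
cd "$(mktemp -d)"
# Mock build-meta file (same shape as build-meta/<name>.json).
cat > build-meta.json <<'EOF'
{
  "build_version": 2,
  "git_rev_short": "ca88e16"
}
EOF

base_version="1.0.0"   # would come from the package's .nix file
git_rev_short=$(sed -n 's/.*"git_rev_short": *"\([^"]*\)".*/\1/p' build-meta.json)
echo "${base_version}+${git_rev_short}"   # prints: 1.0.0+ca88e16
```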
Each runtime needs just one environment variable or flag to find the packaged model:
| Runtime | Configuration | What it does |
|---|---|---|
| vLLM | `HF_HUB_CACHE=$out/share/models/hub` | vLLM's HuggingFace integration follows `refs/main` → `snapshots/<hash>` to locate model files |
| Triton | `--model-repository=$out/share/models` | Triton scans for `<tritonModelName>/config.pbtxt` and loads weights from the `weights/` subdirectory |
| SGLang | `HF_HUB_CACHE=$out/share/models/hub` | Same mechanism as vLLM — both use HuggingFace's cache resolution internally |
Dual-layout packages work for all three runtimes simultaneously from a single store path, with no file duplication.
1. Download or quantize weights to `/mnt/scratch/models/inferencing/`.
   - For dual layout: use the HF cache structure (`hub/models--<org>--<model>/` with `snapshots/` and optionally `blobs/`)
   - For single layout: use a flat directory (`resolved/<org>--<model>/`) with all symlinks resolved
2. Create `build-meta/<name>.json`:

   ```json
   {
     "build_version": 1,
     "force_increment": 0,
     "git_rev": "<current git rev>",
     "git_rev_short": "<short rev>",
     "changelog": "Initial package."
   }
   ```

3. Create `.flox/pkgs/<name>.nix` calling `mkHfModel` with appropriate parameters. Include `slug` + `snapshotId` for dual layout; omit both for single layout.
4. Build: `flox build -A <name>`
5. Verify the output layout: `ls -la result-<name>/share/models/`
6. Publish: `flox publish`
Weights live on local scratch storage before packaging:
- **HF cache layout**: `/mnt/scratch/models/inferencing/hub/models--<org>--<model>/` — contains `blobs/` and `snapshots/` directories as created by `huggingface-cli download`. Used for dual-layout packages.
- **Pre-resolved**: `/mnt/scratch/models/inferencing/resolved/<org>--<model>/` — a flat directory where HF blob symlinks have been resolved to regular files. Used for single-layout packages.
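Before using a directory as a single-layout source, it's worth confirming that every HF blob symlink was actually resolved; a quick check on a mock layout (paths illustrative):

```shell
cd "$(mktemp -d)"
mkdir -p resolved/Qwen--Qwen3.5-2B
echo data > blob
# Simulate a leftover HF blob symlink that resolution missed.
ln -s ../../blob resolved/Qwen--Qwen3.5-2B/model.safetensors

# A properly resolved directory should contain no symlinks at all.
leftovers=$(find resolved/Qwen--Qwen3.5-2B -type l | wc -l)
echo "unresolved symlinks: $leftovers"   # prints: unresolved symlinks: 1
```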
```shell
# Build a single package
flox build -A vllm-qwen3-5-2b

# Build all packages
flox build

# Inspect the result
ls -la result-vllm-qwen3-5-2b/share/models/

# Publish to FloxHub
flox publish
```