Packages HuggingFace model weights into immutable Nix store paths for serving by vLLM, Triton Inference Server (vLLM backend), and SGLang.
LLM inference runtimes typically download model weights at startup — from HuggingFace Hub, S3, or a shared filesystem. This is slow, non-reproducible, and hard to version. By packaging weights into the Nix store, every deployment gets an identical, content-addressed directory of model files. The same Nix hash always contains exactly the same weights. Runtimes mount the store path directly — no downloads, no caching surprises, no "works on my machine."
Before diving into the layouts, here are the HuggingFace and Triton terms used throughout:
- **HuggingFace Hub** — the public registry where model authors publish weights. Each model lives at a path like `microsoft/Phi-4-mini-instruct`.
- **Slug** — a flattened version of the HuggingFace model path with `/` replaced by `--`. For example, `microsoft/Phi-4-mini-instruct` becomes `microsoft--Phi-4-mini-instruct`. HuggingFace's download tools use this format for local cache directories.
- **Snapshot ID** — a commit hash from HuggingFace Hub that pins a specific version of the model weights. Think of it as a git commit SHA for model files.
- **HF cache layout** — the directory structure that HuggingFace's `huggingface_hub` Python library creates when it downloads a model. It stores files under `hub/models--<slug>/snapshots/<hash>/`. Tools like vLLM and SGLang use this library internally, so they know how to find models in this layout.
- **Triton model repository** — Triton Inference Server discovers models by scanning a directory for subdirectories that contain a `config.pbtxt` file. Each subdirectory is one servable model.
- **`config.pbtxt`** — a Protobuf-text configuration file that tells Triton how to serve a model: which backend to use (vLLM in our case), input/output tensor shapes, and streaming configuration.
- **`model-defaults.json`** — a custom file this repo creates alongside `config.pbtxt`. It holds vLLM engine parameters (GPU memory fraction, max sequence length, data type, etc.) that the runtime reads when launching the model.
- **`$out`** — in Nix build expressions, `$out` is the output path in the Nix store (e.g. `/nix/store/abc123-vllm-qwen3.5-2b-1.0.0/`). Everything the build produces goes under this path.
The shared builder `mkHfModel.nix` produces one of two directory layouts, depending on which parameters are provided.
The simpler of the two, used when only Triton needs to serve the model. Weight files are copied directly into a `weights/` directory:
```
$out/share/models/<tritonModelName>/
    config.pbtxt          # tells Triton to use the vLLM backend
    model-defaults.json   # vLLM engine parameters (GPU memory, max length, dtype, etc.)
    weights/              # model files (tensors, tokenizer, config) copied here
```
Triton scans `$out/share/models/`, finds the `<tritonModelName>/` directory with its `config.pbtxt`, and serves the model.
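The discovery step can be sketched with a throwaway mock of the single layout (the temp directory, model name, and file contents here are illustrative, not the real package):

```shell
# Mock a single-layout package in a temp dir (paths and names are illustrative).
out=$(mktemp -d)
mkdir -p "$out/share/models/qwen3_5_2b/weights"
touch "$out/share/models/qwen3_5_2b/config.pbtxt"
echo '{"max_model_len": 4096}' > "$out/share/models/qwen3_5_2b/model-defaults.json"

# Triton-style discovery: any subdirectory with a config.pbtxt is a servable model.
for dir in "$out"/share/models/*/; do
  if [ -f "$dir/config.pbtxt" ]; then
    echo "servable model: $(basename "$dir")"   # prints: servable model: qwen3_5_2b
  fi
done
```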
Used when both standalone vLLM and Triton need to serve the same packaged weights. The builder creates a full HuggingFace cache tree and points Triton at it via symlink — so there's only one copy of the (often multi-gigabyte) weight files:
```
$out/share/models/hub/models--<slug>/
    refs/main                 # text file containing the snapshot ID
    snapshots/<snapshotId>/   # the actual model files (single copy)
$out/share/models/<tritonModelName>/
    config.pbtxt              # Triton config (same as single layout)
    model-defaults.json       # vLLM engine parameters
    weights -> ../hub/models--<slug>/snapshots/<snapshotId>   # symlink, not a copy
```
How this works for each runtime:
- **vLLM / SGLang** — set the environment variable `HF_HUB_CACHE=$out/share/models/hub`. These runtimes use HuggingFace's library internally, which follows the `refs/main` → `snapshots/<hash>` chain to find model files, exactly as if the model had been downloaded normally.
- **Triton** — set `--model-repository=$out/share/models`. Triton sees `<tritonModelName>/config.pbtxt` and follows the `weights/` symlink to reach the same files. It doesn't know or care about the HF cache structure.
Zero file duplication — both runtimes read from the same snapshot directory.
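Both access routes can be traced on a mock of the dual layout (the slug, snapshot ID, and file contents below are illustrative):

```shell
# Mock the dual layout in a temp dir.
out=$(mktemp -d)
slug="Qwen--Qwen3.5-2B"
snap="15852e8c16360a2fea060d615a32b45270f8a8fc"
mkdir -p "$out/share/models/hub/models--$slug/refs" \
         "$out/share/models/hub/models--$slug/snapshots/$snap" \
         "$out/share/models/qwen3_5_2b"
echo "$snap" > "$out/share/models/hub/models--$slug/refs/main"
echo "weights-bytes" > "$out/share/models/hub/models--$slug/snapshots/$snap/model.safetensors"
ln -s "../hub/models--$slug/snapshots/$snap" "$out/share/models/qwen3_5_2b/weights"

# vLLM/SGLang route: follow refs/main to the pinned snapshot, as huggingface_hub does.
ref=$(cat "$out/share/models/hub/models--$slug/refs/main")
vllm_file="$out/share/models/hub/models--$slug/snapshots/$ref/model.safetensors"

# Triton route: follow the weights/ symlink from the model repository.
triton_file="$out/share/models/qwen3_5_2b/weights/model.safetensors"

# Both routes land on the same file, so nothing is duplicated.
[ "$(cat "$vllm_file")" = "$(cat "$triton_file")" ] && echo "same bytes, zero duplication"
```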
All standard model packages are built by calling `mkHfModel`. Here's a real example (Qwen3.5-2B, dual layout):

```nix
{ pkgs, mkHfModel ? pkgs.callPackage ./mkHfModel.nix {} }:

mkHfModel {
  pname = "vllm-qwen3.5-2b";
  baseVersion = "1.0.0";
  buildMeta = builtins.fromJSON (builtins.readFile ../../build-meta/vllm-qwen3-5-2b.json);
  srcPath = /mnt/scratch/models/inferencing/hub/models--Qwen--Qwen3.5-2B;
  tritonModelName = "qwen3_5_2b";

  # These two enable dual layout (omit both for single layout):
  slug = "Qwen--Qwen3.5-2B";
  snapshotId = "15852e8c16360a2fea060d615a32b45270f8a8fc";

  vllmDefaults = { gpu_memory_utilization = 0.85; max_model_len = 4096; dtype = "auto"; };
}
```

| Parameter | Required | Description |
|---|---|---|
| `pname` | yes | Nix package name (e.g. `"vllm-qwen3.5-2b"`) |
| `baseVersion` | yes | Semantic version (e.g. `"1.0.0"`) |
| `buildMeta` | yes | Parsed JSON from `build-meta/<name>.json` — provides `build_version` and `git_rev_short` |
| `srcPath` | yes | Absolute path to model weights on the local build machine |
| `tritonModelName` | yes | Directory name Triton will see under its model repository (e.g. `"qwen3_5_2b"`) |
| `vllmDefaults` | no | vLLM engine parameters written to `model-defaults.json` (memory budget, dtype, quantization, etc.) |
| `slug` | no | HuggingFace slug with `--` separator (e.g. `"Qwen--Qwen3.5-2B"`); providing this enables the dual layout |
| `snapshotId` | no | HuggingFace snapshot commit hash; required when `slug` is set |
**Layout selection:** if both `slug` and `snapshotId` are provided → dual layout; otherwise → single layout.
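The selection rule can be mirrored in a tiny shell function (a sketch of the logic only; the actual decision lives in `mkHfModel.nix`):

```shell
# Dual layout only when BOTH slug and snapshotId are non-empty.
select_layout() {
  local slug="$1" snapshot_id="$2"
  if [ -n "$slug" ] && [ -n "$snapshot_id" ]; then
    echo "dual"
  else
    echo "single"
  fi
}

select_layout "Qwen--Qwen3.5-2B" "15852e8c16360a2fea060d615a32b45270f8a8fc"  # dual
select_layout "" ""                                                          # single
```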
| Package | Model | Layout | Quantization | GPU req | Notes |
|---|---|---|---|---|---|
| `phi-4-mini-instruct-fp8-hf` | Phi-4-mini-instruct | single | FP8 (PyTorch native) | SM89+ | Triton-only |
| `phi-4-mini-instruct-fp8-torchao` | Phi-4-mini-instruct | single | FP8 (TorchAO) | SM89+ | Triton-only |
| `phi-4-mini-instruct-fp8-sglang` | Phi-4-mini-instruct | HF-only | FP8 (TorchAO) | SM89+ | Custom build (see below) |
| `vllm-phi-3-5-mini-instruct-awq` | Phi-3.5-mini-instruct | dual | AWQ 4-bit | SM75+ | |
| `vllm-qwen3-5-2b` | Qwen3.5-2B | dual | none (full precision) | SM75+ | |
| `vllm-qwen3-5-4b-fp8` | Qwen3.5-4B | single | FP8 (TorchAO) | SM89+ | |
| `vllm-smollm3-3b-fp8` | SmolLM3-3B | single | FP8 (TorchAO) | SM89+ | |
Quantization types in this repo:
- FP8 (W8A8) — 8-bit float weights and activations. Two toolchains: PyTorch native and TorchAO.
- AWQ (INT4) — 4-bit integer weights, FP16 activations. Preserves accuracy via salient-weight protection.
- none (BF16/FP16) — unquantized.
`dtype = "auto"` resolves to BF16 on SM80+, FP16 on older GPUs.
| SM | Generation | Representative GPUs |
|---|---|---|
| SM75 | Turing (2018) | T4, RTX 2070/2080 |
| SM80 | Ampere (2020) | A100 |
| SM86 | Ampere (2020) | A10, RTX 3090 |
| SM89 | Ada Lovelace (2022) | L4, L40, L40S, RTX 4090 |
| SM90 | Hopper (2022) | H100, H200 |
| SM100 | Blackwell (2024) | B200 |
| SM120 | Blackwell (2025) | RTX 5090, RTX 5080 |
This covers the three methods used in this repo plus common alternatives you may encounter:
| Method | SM75 | SM80/86 | SM89 | SM90 | SM100/120 |
|---|---|---|---|---|---|
| FP8 W8A8 | W8A16 only (Marlin) | W8A16 only (Marlin) | native | native | native |
| AWQ (INT4) | supported | Marlin-optimized | Marlin-optimized | Marlin-optimized | Marlin-optimized |
| GPTQ (INT4) | supported | Marlin-optimized | Marlin-optimized | Marlin-optimized | Marlin-optimized |
| INT8 (W8A8) | supported | supported | supported | supported | supported |
| FP16 / BF16 | FP16 only | native (both) | native (both) | native (both) | native (both) |
| FP4 (NVFP4) | — | — | — | — | native |
Legend:
- native — hardware tensor-core support, best performance
- Marlin-optimized — uses Marlin kernels for high-throughput dequantization
- W8A16 only (Marlin) — weights stored in FP8, dequantized to FP16 for compute via Marlin; memory savings but no FP8 compute speedup
- supported — runs correctly, standard CUDA kernels (native AWQ GEMM on SM75, ExLlamaV2 for GPTQ on SM75)
- — — not available on this architecture
- SM89+ (L4, L40, RTX 4090, H100): FP8 is the best default — native tensor-core support, half the memory of FP16.
- SM80/86 (A100, A10, RTX 3090): AWQ or GPTQ for best memory efficiency. FP8 loads but runs as W8A16 via Marlin.
- SM75 (T4, RTX 2070/2080): AWQ or GPTQ recommended. FP8 loads as W8A16 (memory savings, no FP8 compute speedup). Small unquantized models fit in FP16.
See the vLLM quantization docs for the full story.
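The recommendations above collapse into a simple default-picker keyed on compute capability (a sketch mirroring the matrix, not anything this repo ships):

```shell
# Pick a default quantization method from the SM version (per the matrix above).
recommend_quant() {
  local sm="$1"
  if [ "$sm" -ge 89 ]; then
    echo "fp8"          # native FP8 tensor cores (Ada, Hopper, Blackwell)
  elif [ "$sm" -ge 75 ]; then
    echo "awq"          # 4-bit weights; FP8 would run as W8A16 only
  else
    echo "unsupported"  # pre-Turing GPUs are out of scope here
  fi
}

recommend_quant 90   # fp8  (H100)
recommend_quant 86   # awq  (A10 / RTX 3090)
```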
This package is a custom derivation that does not use `mkHfModel`. It produces only the HF cache layout (no Triton `config.pbtxt`) and patches `tokenizer_config.json`, replacing `"TokenizersBackend"` with `"PreTrainedTokenizerFast"`, for compatibility with SGLang 0.5.x (which ships transformers 4.57.x, predating the `TokenizersBackend` class added in 4.58). This package can be removed once SGLang ships transformers >= 4.58.
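The patch itself is a one-line substitution; here is a sketch of the idea on a mock `tokenizer_config.json` (the real derivation's build script may differ):

```shell
cd "$(mktemp -d)"
# Mock tokenizer_config.json using the tokenizer class from transformers >= 4.58.
cat > tokenizer_config.json <<'EOF'
{"tokenizer_class": "TokenizersBackend", "model_max_length": 4096}
EOF

# Rewrite the class name so transformers 4.57.x can load the tokenizer.
sed -i 's/"TokenizersBackend"/"PreTrainedTokenizerFast"/' tokenizer_config.json
cat tokenizer_config.json
# prints: {"tokenizer_class": "PreTrainedTokenizerFast", "model_max_length": 4096}
```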
Each package has a corresponding JSON file in build-meta/ that tracks its version independently of git:
```json
{
  "build_version": 2,
  "force_increment": 0,
  "git_rev": "ca88e1655077dab1a0e8bc3583d98bb2575be501",
  "git_rev_short": "ca88e16",
  "changelog": "Dual layout: Triton + vanilla vLLM via mkHfModel with slug/snapshotId."
}
```

- `build_version` increments independently of git commits. Bump it when re-packaging weights without changing `.nix` files (e.g. updated quantization of the same model).
- Final version string: `<baseVersion>+<git_rev_short>` (e.g. `1.0.0+ca88e16`). The `baseVersion` comes from the `.nix` file; the git rev is recorded at build time.
- A marker file is written to `$out/share/<pname>/flox-build-version-<build_version>` so you can identify which build produced a given store path.
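The final version string can be reproduced from a build-meta file with plain shell (a sketch; the real builder assembles it in Nix):

```shell
cd "$(mktemp -d)"
# Mock build-meta file (same shape as build-meta/<name>.json).
cat > build-meta.json <<'EOF'
{
  "build_version": 2,
  "git_rev_short": "ca88e16"
}
EOF

base_version="1.0.0"   # would come from the package's .nix file
git_rev_short=$(sed -n 's/.*"git_rev_short": *"\([^"]*\)".*/\1/p' build-meta.json)
echo "${base_version}+${git_rev_short}"   # prints: 1.0.0+ca88e16
```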
Each runtime needs just one environment variable or flag to find the packaged model:
| Runtime | Configuration | What it does |
|---|---|---|
| vLLM | `HF_HUB_CACHE=$out/share/models/hub` | vLLM's HuggingFace integration follows `refs/main` → `snapshots/<hash>` to locate model files |
| Triton | `--model-repository=$out/share/models` | Triton scans for `<tritonModelName>/config.pbtxt` and loads weights from the `weights/` subdirectory |
| SGLang | `HF_HUB_CACHE=$out/share/models/hub` | Same mechanism as vLLM — both use HuggingFace's cache resolution internally |
Dual-layout packages work for all three runtimes simultaneously from a single store path, with no file duplication.
1. Download or quantize weights to `/mnt/scratch/models/inferencing/`.
   - For dual layout: use the HF cache structure (`hub/models--<org>--<model>/` with `snapshots/` and optionally `blobs/`)
   - For single layout: use a flat directory (`resolved/<org>--<model>/`) with all symlinks resolved
2. Create `build-meta/<name>.json`:

   ```json
   {
     "build_version": 1,
     "force_increment": 0,
     "git_rev": "<current git rev>",
     "git_rev_short": "<short rev>",
     "changelog": "Initial package."
   }
   ```

3. Create `.flox/pkgs/<name>.nix` calling `mkHfModel` with appropriate parameters. Include `slug` + `snapshotId` for dual layout; omit both for single layout.
4. Build: `flox build -A <name>`
5. Verify the output layout: `ls -la result-<name>/share/models/`
6. Publish: `flox publish`
Weights live on local scratch storage before packaging:
- **HF cache layout**: `/mnt/scratch/models/inferencing/hub/models--<org>--<model>/` — contains `blobs/` and `snapshots/` directories as created by `huggingface-cli download`. Used for dual-layout packages.
- **Pre-resolved**: `/mnt/scratch/models/inferencing/resolved/<org>--<model>/` — a flat directory where HF blob symlinks have been resolved to regular files. Used for single-layout packages.
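Before using a directory as a single-layout source, it's worth confirming that every HF blob symlink was actually resolved; a quick check on a mock layout (paths illustrative):

```shell
cd "$(mktemp -d)"
mkdir -p resolved/Qwen--Qwen3.5-2B
echo data > blob
# Simulate a leftover HF blob symlink that resolution missed.
ln -s ../../blob resolved/Qwen--Qwen3.5-2B/model.safetensors

# A properly resolved directory should contain no symlinks at all.
leftovers=$(find resolved/Qwen--Qwen3.5-2B -type l | wc -l)
echo "unresolved symlinks: $leftovers"   # prints: unresolved symlinks: 1
```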
```shell
# Build a single package
flox build -A vllm-qwen3-5-2b

# Build all packages
flox build

# Inspect the result
ls -la result-vllm-qwen3-5-2b/share/models/

# Publish to FloxHub
flox publish
```