ggml: add env var GGML_OP_OFFLOAD_MIN_BATCH by DocShotgun · Pull Request #18535 · ggml-org/llama.cpp

DocShotgun · 2026-01-02T06:33:49Z

Makes the min_batch_size for triggering op offload configurable via env var, defaulting to the prior hardcoded value of 32 if not specified to keep current behavior intact.

This is helpful when running large MoE models with a large share of their weights stored in host buffers on the CPU. In that setup, op offloading becomes a bottleneck for batches that are small but still larger than the default of 32. The optimal value, or "break-even point", depends on the characteristics of the hardware and the model, and is best determined empirically (ref: #17026 (comment)).
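For readers who want the gist without opening the diff, here is a minimal sketch of the lookup described above. The helper names parse_min_batch and op_offload_min_batch_from_env are hypothetical and not taken from the PR; only the variable name GGML_OP_OFFLOAD_MIN_BATCH and the default of 32 come from the description. Parsing with std::atoi is an assumption.

```cpp
#include <cstdlib>

// Hypothetical helper (not from the PR): turn the raw env var value into a
// threshold. A null pointer means the variable is unset, so the fallback
// (the previously hardcoded value) is used.
int parse_min_batch(const char * value, int fallback) {
    if (value == nullptr) {
        return fallback;
    }
    // Assumption: plain atoi parsing; non-numeric input yields 0.
    return std::atoi(value);
}

// Hypothetical wrapper: look up the variable named in this PR and fall back
// to 32 when it is not set.
int op_offload_min_batch_from_env() {
    return parse_min_batch(std::getenv("GGML_OP_OFFLOAD_MIN_BATCH"), 32);
}
```

Note that with atoi-style parsing a malformed value silently becomes 0; a stricter parser is a reasonable alternative if that matters.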


* makes the min_batch_size for triggering op offload configurable via env var, defaulting to the prior hardcoded value of 32

ggerganov

The env var should be read only once upon device initialization and then queried from the device context.

DocShotgun · 2026-01-02T22:34:56Z

Took a crack at this, let me know if you'd recommend doing anything differently.

AI-assisted with searching for the relevant code in the Metal backend and with debugging compile failures.

* For CUDA, CANN, SYCL, and Vulkan, added op_offload_min_batch_size to the device context struct. We read the env var once prior to the loop that creates the device context(s), and then assign this value to the context for each device.
* For Metal, we instead add the field to the device props, which we can then fetch from the offload op check.
* dev is no longer flagged as unused in the backend offload op checks. In Metal, op was also previously flagged as unused.
* CANN had an issue where ggml_backend_cann_offload_op was declared before ggml_backend_cann_device_context. This didn't cause any problems while the device context was unused. I moved it down to roughly match the other backends.
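The read-once pattern described above can be sketched roughly as follows. This is a simplified illustration, not the code from the PR: device_context, read_min_batch_once, create_device_contexts, and device_offload_op are hypothetical stand-ins for the per-backend structs and functions, and the real offload check takes a device handle and an op tensor rather than a plain integer batch size.

```cpp
#include <cstdlib>
#include <vector>

// Hypothetical stand-in for a backend's per-device context struct.
struct device_context {
    int device_id;
    int op_offload_min_batch_size;
};

// Read the env var a single time, before any device context is created.
// Assumption: atoi parsing with a default of 32 when the variable is unset.
int read_min_batch_once() {
    const char * value = std::getenv("GGML_OP_OFFLOAD_MIN_BATCH");
    return value ? std::atoi(value) : 32;
}

// Create one context per device and copy the same threshold into each.
std::vector<device_context> create_device_contexts(int n_devices, int min_batch) {
    std::vector<device_context> contexts;
    for (int i = 0; i < n_devices; ++i) {
        contexts.push_back({i, min_batch});
    }
    return contexts;
}

// Offload check: consults the device context, never the environment.
bool device_offload_op(const device_context & ctx, int op_batch_size) {
    return op_batch_size >= ctx.op_offload_min_batch_size;
}
```

A caller would invoke read_min_batch_once() once during backend registration and pass the result to create_device_contexts(), so later changes to the environment have no effect on the offload decision.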

I tested CUDA locally on a Qwen3-Coder-30B-A3B-Instruct-IQ4_XS.gguf on my 7950X + 4090 Windows machine with -b 4096 -ub 4096 and --cpu-moe and it seems to work as expected:

GGML_OP_OFFLOAD_MIN_BATCH=50000 with 1532 tokens prompt, op offload not triggered -> PP 146.81 T/s
GGML_OP_OFFLOAD_MIN_BATCH=64 with 1532 tokens prompt, op offload triggered -> PP 2320.88 T/s
No env var set with 1532 tokens prompt, defaults to 32, op offload triggered -> PP 2322.76 T/s
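As a sanity check on the three runs above, the following sketch replays the threshold comparison and computes the prefill speedup from the reported numbers. The struct and function names are hypothetical, and it assumes the batch size seen by the offload check equals the 1532-token prompt, which is plausible with -ub 4096 but not stated explicitly.

```cpp
// Hypothetical record of one benchmark run reported in this comment.
struct run {
    long   min_batch;      // value of GGML_OP_OFFLOAD_MIN_BATCH (32 if unset)
    int    prompt_tokens;  // prompt length in tokens
    double pp_tps;         // reported prompt processing speed in T/s
};

// Offload is expected when the batch reaches the threshold.
// Assumption: the whole prompt is processed as a single batch.
bool offload_expected(const run & r) {
    return r.prompt_tokens >= r.min_batch;
}

// Ratio of prompt processing throughput between two runs.
double speedup(const run & offloaded, const run & baseline) {
    return offloaded.pp_tps / baseline.pp_tps;
}

// The three configurations reported above.
const run runs[] = {
    {50000, 1532,  146.81},  // offload not triggered
    {   64, 1532, 2320.88},  // offload triggered
    {   32, 1532, 2322.76},  // default threshold, offload triggered
};
```

With these numbers, speedup(runs[1], runs[0]) comes out at roughly 15.8x, while the 64 and 32 thresholds differ by well under 1%.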

NeoZhangJianyu

Looks good to me for SYCL backend part.

0cc4m · 2026-01-05T11:00:22Z

The Vulkan changes are fine.

Co-authored-by: Aman Gupta <amangupta052@gmail.com>

am17an · 2026-01-08T08:44:11Z

@ggerganov merge?

* ggml: add env var GGML_OP_OFFLOAD_MIN_BATCH
* makes the min_batch_size for triggering op offload configurable via env var, defaulting to the prior hardcoded value of 32
* ggml: read GGML_OP_OFFLOAD_MIN_BATCH once and store to dev ctx
* cann: forward declaration of device context struct
* cann: move offload op check after device context declaration
* cuda: fix whitespace

Co-authored-by: Aman Gupta <amangupta052@gmail.com>

ggml: add env var GGML_OP_OFFLOAD_MIN_BATCH

3c1bcf2

* makes the min_batch_size for triggering op offload configurable via env var, defaulting to the prior hardcoded value of 32

DocShotgun requested review from 0cc4m and ggerganov as code owners January 2, 2026 06:33

loci-dev mentioned this pull request Jan 2, 2026

UPSTREAM PR #18535: ggml: add env var GGML_OP_OFFLOAD_MIN_BATCH auroralabs-loci/llama.cpp#781

Open

ggerganov reviewed Jan 2, 2026


DocShotgun added 3 commits January 2, 2026 13:26

ggml: read GGML_OP_OFFLOAD_MIN_BATCH once and store to dev ctx

fa46774

cann: forward declaration of device context struct

a449358

cann: move offload op check after device context declaration

7a838e7

taronaeo linked an issue Jan 3, 2026 that may be closed by this pull request

Feature Request: Add configurable op offload min batch size #18530

Closed

4 tasks

NeoZhangJianyu reviewed Jan 4, 2026


am17an approved these changes Jan 6, 2026


Comment thread ggml/src/ggml-cuda/ggml-cuda.cu Outdated

cuda: fix whitespace

919aa4f

Co-authored-by: Aman Gupta <amangupta052@gmail.com>

ggerganov merged commit 9a5724d into ggml-org:master Jan 8, 2026
79 of 80 checks passed

jukofyork mentioned this pull request Jan 10, 2026

Adding --direct-io flag for model loading #18166

Merged

DocShotgun mentioned this pull request Feb 2, 2026

Refactor batch size handling for offloading host operations to device ikawrakow/ik_llama.cpp#1214

Closed

5 tasks

ggerganov merged 5 commits into ggml-org:master from DocShotgun:op-offload-min-batch

5 participants
