@loci-dev loci-dev commented Dec 3, 2025

Mirrored from ggml-org/llama.cpp#17711

The MoE models have a mul_mat_vec with very small m (32, 64, 128) right before the topk_moe selection. Running multiple rows per wg doesn't utilize the SMs well. I think even for larger m, f32 is so bandwidth-limited that running multiple rows doesn't help.
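
For a rough sense of the dispatch math behind this, here is a minimal sketch. The SM count and all names below (`num_sms`, `rows_per_wg`) are assumptions for illustration, not values taken from ggml-vulkan.cpp.

```cpp
// Minimal sketch (not from ggml-vulkan.cpp): how rows-per-workgroup affects
// the number of workgroups launched for the small-m mul_mat_vec before the
// topk_moe selection. The SM count is an assumption for an RTX 5090-class GPU.
#include <cstdio>

int main() {
    const int num_sms            = 170;           // assumed SM count, for scale only
    const int small_m[]          = {32, 64, 128}; // row counts mentioned above
    const int rows_per_wg_opts[] = {2, 1};        // before vs. after

    for (int rows_per_wg : rows_per_wg_opts) {
        for (int m : small_m) {
            // each workgroup computes rows_per_wg output rows
            const int workgroups = (m + rows_per_wg - 1) / rows_per_wg;
            std::printf("m=%3d  rows/wg=%d  -> %3d workgroups (vs ~%d SMs)\n",
                        m, rows_per_wg, workgroups, num_sms);
        }
    }
    return 0;
}
```

Even at m = 128 with one row per workgroup the dispatch doesn't fill the GPU, which is roughly the underutilization described above; two rows per workgroup halves the workgroup count again.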

before:

Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench.exe -fa 1 -n 128,128,128 -p 0 -r 10 --prio 1 -m c:\models\Qwen_Qwen3-30B-A3B-Q2_K.gguf -m c:\models\\deepseek-v2-lite-safetensors\deepseek-v2-lite-Q4_K_M.gguf -m c:\models\gpt-oss-20b-mxfp4.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 5090 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
load_backend: loaded Vulkan backend from Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo\ggml-vulkan.dll
load_backend: loaded CPU backend from Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo\ggml-cpu.dll
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| qwen3moe 30B.A3B Q2_K - Medium |  10.15 GiB |    30.53 B | Vulkan     |  99 |  1 |           tg128 |       298.13 ± 28.95 |
| qwen3moe 30B.A3B Q2_K - Medium |  10.15 GiB |    30.53 B | Vulkan     |  99 |  1 |           tg128 |        305.39 ± 0.52 |
| qwen3moe 30B.A3B Q2_K - Medium |  10.15 GiB |    30.53 B | Vulkan     |  99 |  1 |           tg128 |        305.81 ± 0.46 |
| deepseek2 16B Q4_K - Medium    |   9.65 GiB |    15.71 B | Vulkan     |  99 |  1 |           tg128 |        355.81 ± 7.76 |
| deepseek2 16B Q4_K - Medium    |   9.65 GiB |    15.71 B | Vulkan     |  99 |  1 |           tg128 |        358.94 ± 1.05 |
| deepseek2 16B Q4_K - Medium    |   9.65 GiB |    15.71 B | Vulkan     |  99 |  1 |           tg128 |        355.68 ± 0.97 |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | Vulkan     |  99 |  1 |           tg128 |       360.95 ± 14.97 |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | Vulkan     |  99 |  1 |           tg128 |        363.50 ± 0.72 |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | Vulkan     |  99 |  1 |           tg128 |        365.32 ± 0.99 |

after:

Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench.exe -fa 1 -n 128,128,128 -p 0 -r 10 --prio 1 -m c:\models\Qwen_Qwen3-30B-A3B-Q2_K.gguf -m c:\models\\deepseek-v2-lite-safetensors\deepseek-v2-lite-Q4_K_M.gguf -m c:\models\gpt-oss-20b-mxfp4.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 5090 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
load_backend: loaded Vulkan backend from Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo\ggml-vulkan.dll
load_backend: loaded CPU backend from Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo\ggml-cpu.dll
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| qwen3moe 30B.A3B Q2_K - Medium |  10.15 GiB |    30.53 B | Vulkan     |  99 |  1 |           tg128 |       309.22 ± 11.26 |
| qwen3moe 30B.A3B Q2_K - Medium |  10.15 GiB |    30.53 B | Vulkan     |  99 |  1 |           tg128 |        310.99 ± 0.41 |
| qwen3moe 30B.A3B Q2_K - Medium |  10.15 GiB |    30.53 B | Vulkan     |  99 |  1 |           tg128 |        311.00 ± 0.64 |
| deepseek2 16B Q4_K - Medium    |   9.65 GiB |    15.71 B | Vulkan     |  99 |  1 |           tg128 |       358.48 ± 13.14 |
| deepseek2 16B Q4_K - Medium    |   9.65 GiB |    15.71 B | Vulkan     |  99 |  1 |           tg128 |        362.42 ± 0.81 |
| deepseek2 16B Q4_K - Medium    |   9.65 GiB |    15.71 B | Vulkan     |  99 |  1 |           tg128 |        359.07 ± 0.94 |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | Vulkan     |  99 |  1 |           tg128 |       363.85 ± 13.37 |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | Vulkan     |  99 |  1 |           tg128 |        366.16 ± 0.79 |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | Vulkan     |  99 |  1 |           tg128 |        367.89 ± 0.70 |

@loci-agentic-ai

Explore the complete analysis in Version Insights.

Pull Request Performance Summary: PR #411

Modified File: ggml/src/ggml-vulkan/ggml-vulkan.cpp (3 additions, 3 deletions)

This PR adjusts Vulkan workgroup configuration for f32 matrix-vector multiplication pipelines, reducing from 2 rows per workgroup to 1 row per workgroup. The changes target three specific pipeline variants in ggml_vk_load_shaders: standard f32→f32, f32→f16, and expert-indexed MoE operations.
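
As a rough illustration of the configuration choice involved (the struct and function below are invented for this sketch; they are not the actual pipeline-creation call or its arguments in ggml-vulkan.cpp):

```cpp
#include <cstdint>

// Hypothetical sketch of the rows-per-workgroup choice for the f32
// mul_mat_vec pipelines; names and structure are assumptions, not the real code.
struct mmv_f32_config {
    uint32_t rows_per_workgroup; // output rows computed by one workgroup
    uint32_t workgroup_size;     // invocations per workgroup
};

mmv_f32_config pick_f32_mmv_config(uint32_t subgroup_size) {
    // Before this change: 2 rows per workgroup, i.e. half as many workgroups,
    // which underutilizes the SMs when m is only 32/64/128.
    // After: 1 row per workgroup; f32 is bandwidth-limited, so computing
    // multiple rows per workgroup was not buying anything even for larger m.
    return mmv_f32_config{ /*rows_per_workgroup=*/1u,
                           /*workgroup_size=*/subgroup_size };
}
```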

Key Findings

Performance-Critical Area Impact:

The modifications occur in initialization code (ggml_vk_load_shaders) rather than the inference hot path. Static analysis shows negligible changes in CPU-side function metrics. The performance improvements manifest during GPU kernel execution at runtime, which is outside the scope of CPU binary throughput measurements.

Inference and Tokens Per Second:

The changes do not directly impact llama_decode, llama_encode, or llama_tokenize functions. These core inference functions remain unmodified in both source code and binary analysis. Benchmark data from the PR description reports 0.7-1.8% throughput improvements for MoE models during token generation, but these gains occur in GPU-side Vulkan shader execution, not in the CPU inference path measured by static analysis.
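
As a rough cross-check against the tables above, comparing the best stable runs per model: Qwen3 311.00 / 305.81 ≈ 1.017 (about +1.7%), DeepSeek-V2-Lite 362.42 / 358.94 ≈ 1.010 (about +1.0%), and gpt-oss 367.89 / 365.32 ≈ 1.007 (about +0.7%), consistent with the reported 0.7-1.8% range.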

Power Consumption Analysis:

Power consumption changes are negligible across all binaries. The modifications affect pipeline configuration at initialization time, not the runtime execution patterns captured by static analysis. GPU-side optimizations do not alter CPU binary power consumption profiles.

Impacted Functions:

No measurable changes in response time or throughput for CPU-side functions. The optimization targets GPU workload distribution for small matrix dimensions in MoE expert routing operations.

@loci-dev loci-dev force-pushed the main branch 26 times, most recently from 3e4b499 to e81a7eb on December 5, 2025 at 13:17
@loci-dev loci-dev force-pushed the main branch 30 times, most recently from c481809 to 92b887d on December 10, 2025 at 12:15