CUDA: reduce MMQ stream-k overhead #22298
Conversation
Quantization sweep NVIDIA dense
Bigger MoE
I'll have to re-take some numbers for the AMD sweep. This PR is not rebased on top of the PR that enabled HIP graphs by default, so they are not being enabled for these tests.
Tensor parallelism
Nice, excited to see this merged. There is still more juice to squeeze out of this, so I'm working on the next performance bump now.
Quantization sweep AMD dense
Quantization sweep AMD MoE
Besides MI100 this path is disabled on all of those anyhow. The question would be whether, with these improvements, it's now also faster outside of CDNA.
Though the impact should be negligible, there is one tiny part that is affected by the
good, thanks for checking!
```diff
-        kbc      -= (kbc      % blocks_per_ne00) % blocks_per_iter;
-        kbc_stop -= (kbc_stop % blocks_per_ne00) % blocks_per_iter;
+        kbc      -= fastmodulo(kbc,      blocks_per_ne00) % blocks_per_iter;
+        kbc_stop -= fastmodulo(kbc_stop, blocks_per_ne00) % blocks_per_iter;
```
Afaik fastmodulo/fastdiv are currently only implemented for uint32_t data types. Does this mean kbc & kbc_stop, as the outputs of their respective 64-bit signed int calculations, can actually be fully represented in uint32_t without wrap-around? If so, let's cast them to uint32_t to make this clear.
Sadly, if the respective expressions for computing kbc and kbc_stop require 64-bit precision, we will still have register spills on NVIDIA GPUs due to how 64-bit int division is currently handled here (I can elaborate if needed; I could also add fastdiv/fastmodulo for uint64_t, which was fairly easy compared to 64-bit signed int).
I think we can pass a template parameter for when we know we can use 32-bit fastmodulo vs. when we can't?
ORippler left a comment
Would love some more context so I can get clarity on the required precision, types & conversions for kbc & kbc_stop, as I hit the register spill caused by 64-bit kbc & kbc_stop and this code touches on it (it's still present, and removing the spill would bring a bit of perf):
Also, some more numbers on NVIDIA GPUs:
build: 5f1074e (8871)
- `./scripts/compare-llama-bench.py -b 82209ef -c cuda-mmq-fastdiv-8 --tool llama-bench -i llama-bench.sqlite`
| Model | Microbatch size | Test | t/s 82209ef | t/s cuda-mmq-fastdiv-8 | Speedup |
|---|---|---|---|---|---|
| gemma4 26B.A4B Q4_K_M | 16 | pp2048 | 914.79 | 1044.69 | 1.14 |
| gemma4 26B.A4B Q4_K_M | 16 | pp2048@d32768 | 878.58 | 992.74 | 1.13 |
| gemma4 26B.A4B Q4_K_M | 32 | pp2048 | 1401.23 | 1567.44 | 1.12 |
| gemma4 26B.A4B Q4_K_M | 32 | pp2048@d32768 | 1323.25 | 1472.42 | 1.11 |
| gemma4 26B.A4B Q4_K_M | 64 | pp2048 | 1952.74 | 2202.41 | 1.13 |
| gemma4 26B.A4B Q4_K_M | 64 | pp2048@d32768 | 1791.50 | 2000.43 | 1.12 |
| gemma4 26B.A4B Q4_K_M | 128 | pp2048 | 2332.70 | 2597.84 | 1.11 |
| gemma4 26B.A4B Q4_K_M | 128 | pp2048@d32768 | 1974.79 | 2174.40 | 1.10 |
| gemma4 26B.A4B Q4_K_M | 256 | pp2048 | 3721.95 | 4051.54 | 1.09 |
| gemma4 26B.A4B Q4_K_M | 256 | pp2048@d32768 | 2730.86 | 2930.24 | 1.07 |
| gemma4 26B.A4B Q4_K_M | 512 | pp2048 | 5274.42 | 5724.11 | 1.09 |
| gemma4 26B.A4B Q4_K_M | 512 | pp2048@d32768 | 3442.34 | 3666.19 | 1.07 |
| gemma4 26B.A4B Q4_K_M | 1024 | pp2048 | 6510.57 | 7088.02 | 1.09 |
| gemma4 26B.A4B Q4_K_M | 1024 | pp2048@d32768 | 4060.76 | 4310.83 | 1.06 |
| gemma4 26B.A4B Q4_K_M | 2048 | pp2048 | 7040.79 | 7629.48 | 1.08 |
| gemma4 26B.A4B Q4_K_M | 2048 | pp2048@d32768 | 4170.70 | 4394.05 | 1.05 |
| gemma4 31B Q4_K_M | 16 | pp2048 | 477.72 | 480.22 | 1.01 |
| gemma4 31B Q4_K_M | 16 | pp2048@d32768 | 409.16 | 411.12 | 1.00 |
| gemma4 31B Q4_K_M | 32 | pp2048 | 791.78 | 795.84 | 1.01 |
| gemma4 31B Q4_K_M | 32 | pp2048@d32768 | 633.03 | 634.10 | 1.00 |
| gemma4 31B Q4_K_M | 64 | pp2048 | 1156.28 | 1165.77 | 1.01 |
| gemma4 31B Q4_K_M | 64 | pp2048@d32768 | 774.43 | 775.35 | 1.00 |
| gemma4 31B Q4_K_M | 128 | pp2048 | 1398.67 | 1386.47 | 0.99 |
| gemma4 31B Q4_K_M | 128 | pp2048@d32768 | 872.87 | 870.31 | 1.00 |
| gemma4 31B Q4_K_M | 256 | pp2048 | 1537.61 | 1531.11 | 1.00 |
| gemma4 31B Q4_K_M | 256 | pp2048@d32768 | 932.54 | 929.32 | 1.00 |
| gemma4 31B Q4_K_M | 512 | pp2048 | 1612.27 | 1599.30 | 0.99 |
| gemma4 31B Q4_K_M | 512 | pp2048@d32768 | 974.72 | 969.81 | 0.99 |
| gemma4 31B Q4_K_M | 1024 | pp2048 | 1632.93 | 1630.52 | 1.00 |
| gemma4 31B Q4_K_M | 1024 | pp2048@d32768 | 992.45 | 991.43 | 1.00 |
| gemma4 31B Q4_K_M | 2048 | pp2048 | 1585.85 | 1621.86 | 1.02 |
| gemma4 31B Q4_K_M | 2048 | pp2048@d32768 | 981.80 | 994.26 | 1.01 |
| gemma4 E4B Q4_K_M | 16 | pp2048 | 1985.43 | 2080.39 | 1.05 |
| gemma4 E4B Q4_K_M | 16 | pp2048@d32768 | 1703.79 | 1774.21 | 1.04 |
| gemma4 E4B Q4_K_M | 32 | pp2048 | 3154.90 | 3468.88 | 1.10 |
| gemma4 E4B Q4_K_M | 32 | pp2048@d32768 | 2754.09 | 2795.73 | 1.02 |
| gemma4 E4B Q4_K_M | 64 | pp2048 | 4687.11 | 5024.77 | 1.07 |
| gemma4 E4B Q4_K_M | 64 | pp2048@d32768 | 3789.76 | 3894.56 | 1.03 |
| gemma4 E4B Q4_K_M | 128 | pp2048 | 6135.13 | 6489.89 | 1.06 |
| gemma4 E4B Q4_K_M | 128 | pp2048@d32768 | 4581.40 | 4675.19 | 1.02 |
| gemma4 E4B Q4_K_M | 256 | pp2048 | 8239.29 | 8470.00 | 1.03 |
| gemma4 E4B Q4_K_M | 256 | pp2048@d32768 | 5023.51 | 5121.48 | 1.02 |
| gemma4 E4B Q4_K_M | 512 | pp2048 | 9361.88 | 9703.76 | 1.04 |
| gemma4 E4B Q4_K_M | 512 | pp2048@d32768 | 5424.37 | 5513.46 | 1.02 |
| gemma4 E4B Q4_K_M | 1024 | pp2048 | 9788.44 | 9893.58 | 1.01 |
| gemma4 E4B Q4_K_M | 1024 | pp2048@d32768 | 5907.11 | 5935.78 | 1.00 |
| gemma4 E4B Q4_K_M | 2048 | pp2048 | 9257.66 | 9284.09 | 1.00 |
| gemma4 E4B Q4_K_M | 2048 | pp2048@d32768 | 5538.49 | 5538.96 | 1.00 |
| gpt-oss 20B MXFP4 MoE | 16 | pp2048 | 1266.06 | 1521.89 | 1.20 |
| gpt-oss 20B MXFP4 MoE | 16 | pp2048@d32768 | 1153.43 | 1353.75 | 1.17 |
| gpt-oss 20B MXFP4 MoE | 32 | pp2048 | 2029.82 | 2448.66 | 1.21 |
| gpt-oss 20B MXFP4 MoE | 32 | pp2048@d32768 | 1852.03 | 2091.67 | 1.13 |
| gpt-oss 20B MXFP4 MoE | 64 | pp2048 | 3208.61 | 3706.54 | 1.16 |
| gpt-oss 20B MXFP4 MoE | 64 | pp2048@d32768 | 2539.52 | 2883.99 | 1.14 |
| gpt-oss 20B MXFP4 MoE | 128 | pp2048 | 4327.98 | 5168.11 | 1.19 |
| gpt-oss 20B MXFP4 MoE | 128 | pp2048@d32768 | 3313.67 | 3778.38 | 1.14 |
| gpt-oss 20B MXFP4 MoE | 256 | pp2048 | 6651.97 | 7946.76 | 1.19 |
| gpt-oss 20B MXFP4 MoE | 256 | pp2048@d32768 | 4444.43 | 4982.64 | 1.12 |
| gpt-oss 20B MXFP4 MoE | 512 | pp2048 | 8905.91 | 10501.62 | 1.18 |
| gpt-oss 20B MXFP4 MoE | 512 | pp2048@d32768 | 5507.18 | 6091.97 | 1.11 |
| gpt-oss 20B MXFP4 MoE | 1024 | pp2048 | 10225.28 | 12091.25 | 1.18 |
| gpt-oss 20B MXFP4 MoE | 1024 | pp2048@d32768 | 5945.34 | 6543.82 | 1.10 |
| gpt-oss 20B MXFP4 MoE | 2048 | pp2048 | 9961.50 | 11529.04 | 1.16 |
| gpt-oss 20B MXFP4 MoE | 2048 | pp2048@d32768 | 5682.62 | 6187.17 | 1.09 |
| nemotron_h_moe 31B.A3.5B NVFP4 | 16 | pp2048 | 791.58 | 923.37 | 1.17 |
| nemotron_h_moe 31B.A3.5B NVFP4 | 16 | pp2048@d32768 | 774.97 | 905.00 | 1.17 |
| nemotron_h_moe 31B.A3.5B NVFP4 | 32 | pp2048 | 1048.88 | 1201.36 | 1.15 |
| nemotron_h_moe 31B.A3.5B NVFP4 | 32 | pp2048@d32768 | 1030.68 | 1175.21 | 1.14 |
| nemotron_h_moe 31B.A3.5B NVFP4 | 64 | pp2048 | 1322.33 | 1473.22 | 1.11 |
| nemotron_h_moe 31B.A3.5B NVFP4 | 64 | pp2048@d32768 | 1301.95 | 1453.62 | 1.12 |
| nemotron_h_moe 31B.A3.5B NVFP4 | 128 | pp2048 | 1497.22 | 1628.06 | 1.09 |
| nemotron_h_moe 31B.A3.5B NVFP4 | 128 | pp2048@d32768 | 1478.76 | 1608.97 | 1.09 |
| nemotron_h_moe 31B.A3.5B NVFP4 | 256 | pp2048 | 2462.64 | 2613.81 | 1.06 |
| nemotron_h_moe 31B.A3.5B NVFP4 | 256 | pp2048@d32768 | 2343.09 | 2494.88 | 1.06 |
| nemotron_h_moe 31B.A3.5B NVFP4 | 512 | pp2048 | 3765.66 | 3986.95 | 1.06 |
| nemotron_h_moe 31B.A3.5B NVFP4 | 512 | pp2048@d32768 | 3437.19 | 3623.00 | 1.05 |
| nemotron_h_moe 31B.A3.5B NVFP4 | 1024 | pp2048 | 5083.19 | 5357.53 | 1.05 |
| nemotron_h_moe 31B.A3.5B NVFP4 | 1024 | pp2048@d32768 | 4373.03 | 4588.62 | 1.05 |
| nemotron_h_moe 31B.A3.5B NVFP4 | 2048 | pp2048 | 5732.08 | 6068.46 | 1.06 |
| nemotron_h_moe 31B.A3.5B NVFP4 | 2048 | pp2048@d32768 | 4688.64 | 4946.34 | 1.05 |
| nemotron_h_moe 31B.A3.5B Q4_K_M | 16 | pp2048 | 766.28 | 891.92 | 1.16 |
| nemotron_h_moe 31B.A3.5B Q4_K_M | 16 | pp2048@d32768 | 737.00 | 849.94 | 1.15 |
| nemotron_h_moe 31B.A3.5B Q4_K_M | 32 | pp2048 | 1112.57 | 1236.99 | 1.11 |
| nemotron_h_moe 31B.A3.5B Q4_K_M | 32 | pp2048@d32768 | 1062.44 | 1177.25 | 1.11 |
| nemotron_h_moe 31B.A3.5B Q4_K_M | 64 | pp2048 | 1491.11 | 1634.99 | 1.10 |
| nemotron_h_moe 31B.A3.5B Q4_K_M | 64 | pp2048@d32768 | 1414.13 | 1534.08 | 1.08 |
| nemotron_h_moe 31B.A3.5B Q4_K_M | 128 | pp2048 | 1810.81 | 1934.42 | 1.07 |
| nemotron_h_moe 31B.A3.5B Q4_K_M | 128 | pp2048@d32768 | 1702.64 | 1807.84 | 1.06 |
| nemotron_h_moe 31B.A3.5B Q4_K_M | 256 | pp2048 | 2956.06 | 3098.15 | 1.05 |
| nemotron_h_moe 31B.A3.5B Q4_K_M | 256 | pp2048@d32768 | 2719.20 | 2833.47 | 1.04 |
| nemotron_h_moe 31B.A3.5B Q4_K_M | 512 | pp2048 | 4481.30 | 4659.46 | 1.04 |
| nemotron_h_moe 31B.A3.5B Q4_K_M | 512 | pp2048@d32768 | 3981.66 | 4105.71 | 1.03 |
| nemotron_h_moe 31B.A3.5B Q4_K_M | 1024 | pp2048 | 5971.13 | 6208.51 | 1.04 |
| nemotron_h_moe 31B.A3.5B Q4_K_M | 1024 | pp2048@d32768 | 5032.85 | 5188.73 | 1.03 |
| nemotron_h_moe 31B.A3.5B Q4_K_M | 2048 | pp2048 | 6783.78 | 7013.52 | 1.03 |
| nemotron_h_moe 31B.A3.5B Q4_K_M | 2048 | pp2048@d32768 | 5402.65 | 5550.75 | 1.03 |
| qwen35 27B Q4_K_M | 16 | pp2048 | 512.54 | 515.14 | 1.01 |
| qwen35 27B Q4_K_M | 16 | pp2048@d32768 | 461.18 | 466.58 | 1.01 |
| qwen35 27B Q4_K_M | 32 | pp2048 | 836.91 | 850.53 | 1.02 |
| qwen35 27B Q4_K_M | 32 | pp2048@d32768 | 734.27 | 742.34 | 1.01 |
| qwen35 27B Q4_K_M | 64 | pp2048 | 1214.65 | 1213.17 | 1.00 |
| qwen35 27B Q4_K_M | 64 | pp2048@d32768 | 1006.92 | 1015.35 | 1.01 |
| qwen35 27B Q4_K_M | 128 | pp2048 | 1471.08 | 1472.33 | 1.00 |
| qwen35 27B Q4_K_M | 128 | pp2048@d32768 | 1022.19 | 1025.24 | 1.00 |
| qwen35 27B Q4_K_M | 256 | pp2048 | 1633.20 | 1646.32 | 1.01 |
| qwen35 27B Q4_K_M | 256 | pp2048@d32768 | 1124.95 | 1134.36 | 1.01 |
| qwen35 27B Q4_K_M | 512 | pp2048 | 1712.19 | 1730.45 | 1.01 |
| qwen35 27B Q4_K_M | 512 | pp2048@d32768 | 1223.96 | 1234.40 | 1.01 |
| qwen35 27B Q4_K_M | 1024 | pp2048 | 1724.11 | 1735.87 | 1.01 |
| qwen35 27B Q4_K_M | 1024 | pp2048@d32768 | 1287.23 | 1298.57 | 1.01 |
| qwen35 27B Q4_K_M | 2048 | pp2048 | 1694.41 | 1742.45 | 1.03 |
| qwen35 27B Q4_K_M | 2048 | pp2048@d32768 | 1266.60 | 1300.48 | 1.03 |
| qwen35moe 35B.A3B Q4_K_M | 16 | pp2048 | 751.25 | 883.56 | 1.18 |
| qwen35moe 35B.A3B Q4_K_M | 16 | pp2048@d32768 | 704.92 | 826.94 | 1.17 |
| qwen35moe 35B.A3B Q4_K_M | 32 | pp2048 | 1164.01 | 1382.06 | 1.19 |
| qwen35moe 35B.A3B Q4_K_M | 32 | pp2048@d32768 | 1103.32 | 1282.73 | 1.16 |
| qwen35moe 35B.A3B Q4_K_M | 64 | pp2048 | 1624.36 | 1898.11 | 1.17 |
| qwen35moe 35B.A3B Q4_K_M | 64 | pp2048@d32768 | 1483.76 | 1743.78 | 1.18 |
| qwen35moe 35B.A3B Q4_K_M | 128 | pp2048 | 1923.22 | 2262.29 | 1.18 |
| qwen35moe 35B.A3B Q4_K_M | 128 | pp2048@d32768 | 1726.89 | 1994.00 | 1.15 |
| qwen35moe 35B.A3B Q4_K_M | 256 | pp2048 | 2912.44 | 3262.79 | 1.12 |
| qwen35moe 35B.A3B Q4_K_M | 256 | pp2048@d32768 | 2501.22 | 2758.12 | 1.10 |
| qwen35moe 35B.A3B Q4_K_M | 512 | pp2048 | 4109.02 | 4521.77 | 1.10 |
| qwen35moe 35B.A3B Q4_K_M | 512 | pp2048@d32768 | 3352.94 | 3625.34 | 1.08 |
| qwen35moe 35B.A3B Q4_K_M | 1024 | pp2048 | 5001.70 | 5532.55 | 1.11 |
| qwen35moe 35B.A3B Q4_K_M | 1024 | pp2048@d32768 | 3928.51 | 4255.15 | 1.08 |
| qwen35moe 35B.A3B Q4_K_M | 2048 | pp2048 | 5495.72 | 6036.81 | 1.10 |
| qwen35moe 35B.A3B Q4_K_M | 2048 | pp2048@d32768 | 4163.90 | 4455.37 | 1.07 |
```diff
         kb0_start = 0;
-        kb0_stop  = min(blocks_per_ne00, kbc_stop - kbc);
+        kb0_stop  = min(blocks_per_ne00.z, uint32_t(kbc_stop - kbc));
```
cf. above w.r.t. wrap-around during int -> uint32_t conversions
@ORippler during development I had experimented with a version where …
Don't have a lot to add, but if you want a fastdiv64 for reference, I had an agent write one up:
The maximum value of a 32-bit signed integer is ~2 billion, and the minimum ggml block size we have is 32. So in terms of the continuous k space that we are trying to represent, we can go up to at least 68.7 billion. I think that for typical tensor sizes we should be well below this limit, so I've just changed the data type.

Performance changes
In a quick test I am seeing a slight performance impact from the change; I think only the measurements for LLaMA 3 8b are sufficiently precise to be meaningful.
* CUDA: reduce MMQ stream-k overhead
* use 32 bit integers for kbc

(cherry picked from commit 9725a31)
The patched rocWMMA FA path (commit 030e290, lhl PR ggml-org#16827 port) silently regressed at D=256 between 2026-04-19 (when rocwmma-tuned.md benched it flat ±1.5% and concluded "kept anyway") and 2026-04-27. Re-benching against the same source/host with only GGML_HIP_ROCWMMA_FATTN toggled showed pp512@d=16k 244 → 853 t/s. server-configs flipped the flag back OFF; the source is carried as #if-gated dead code, harmless when off.

- rocwmma-tuned.md: status block at the top + a new "Re-bench 2026-04-27" section at the bottom with the systematic bisect (eliminated container ROCm version, llama.cpp upstream delta, source patches, host ROCm package set, modprobe.d state, KFD userptr eviction, memory pressure, GPU clocks, firmware versions) and candidate mechanisms for the post-landing regression.
- qwen3.6-baseline.md: rewrite the same-day regression note that initially blamed ROCm 7.13 nightly progression. Retract the amdgpu_amdkfd_restore_userptr_worker hypothesis (zero firings observed since boot).
- tg-at-depth-regression.md: mark resolved by upstream PR ggml-org#22298 (CUDA: reduce MMQ stream-k overhead, merged 2026-04-26). Recovery numbers; the original investigation is kept below for posterity.
- README.md (root): findings table row 6 reflects the flag flip, watching-upstream gains a ggml-org#22298 row, and a re-bench checklist for future syncs/ROCm bumps so the next "flat at landing" doesn't go five weeks unvalidated.
- strix-halo/README.md: status annotations on the rocwmma-tuned and tg-at-depth-regression rows.
This PR reduces the stream-k overhead in the MMQ kernel by using `fastdiv`, which precomputes some values on the CPU to speed up integer divisions. Also, as originally suggested by @nisparks in #22170 and #22252, it optionally uses tiling rather than a stream-k decomposition. The implementation in this PR differs from the linked ones: in those, an extra variant of the kernel is compiled with the tiling hard-coded (as is done for relatively old GPUs); in this PR the number of CUDA blocks is scaled dynamically to the number of tiles so that each CUDA block works on exactly one tile. If it turns out that there is a meaningful performance difference, it may still make sense to compile the extra kernels. The choice of whether or not to use stream-k does not explicitly depend on MoE in this PR; instead it is determined from the efficiency loss that would be incurred by tiling: if it is <= 10%, tiling is used in order to skip the stream-k fixup.

Requirements