Name and Version
TheTom/llama-cpp-turboquant
Both builds include PR #82 (fix(cuda): allow f16/bf16 + q8_0 KV without GGML_CUDA_FA_ALL_QUANTS) cherry-picked and were configured with -DGGML_CUDA_FA_ALL_QUANTS=ON.
Operating systems
Linux
GGML backends
CUDA
Hardware
Reproduced on two hardware configurations:
- NVIDIA GB10 Grace Blackwell (sm_121, 128GB unified memory, CUDA 13.0.2, aarch64 Ubuntu 24.04)
- NVIDIA RTX 4090 (sm_89, 24GB discrete VRAM, CUDA 12.4, x86_64 Ubuntu 22.04, RunPod hosts in Iowa US and Timisoara Romania)
Models
Qwen3-30B-A3B Q4_K_M and UD-Q4_K_XL variants from https://huggingface.co/unsloth/Qwen3-30B-A3B-GGUF
Problem description & steps to reproduce
Running llama-bench with flash attention enabled and any turbo V-cache type paired with f16 K-cache crashes consistently across both architectures.
Reproduce on any CUDA hardware (sm_89 or sm_121 both affected):
```
./build/bin/llama-bench \
  -m Qwen3-30B-A3B-Q4_K_M.gguf \
  -fa 1 -t 1 -ngl 99 \
  -p 0 -n 128 -pg 8192,128 \
  -ctk f16 -ctv turbo3
```
Expected: benchmark runs on GPU.
Actual: crash at fattn.cu:339 with ggml_cuda_flash_attn_ext then GGML_ABORT("fatal error").
Root cause analysis:
Line 339 in ggml/src/ggml-cuda/fattn.cu is the GGML_ABORT that fires when no K/V type case matches in the dispatch. Checking the cases above it, the F16-K pairings exercised in the log below (f16/f16, f16/q8_0) are handled, but the following are missing:
- F16 + TURBO2_0
- F16 + TURBO3_0
- F16 + TURBO4_0
Symmetric turbo (turbo3/turbo3, turbo4/turbo4) cases exist. Turbo/q8_0 asymmetric exists. But F16-K + turbo-V asymmetric has no dispatch case, so it hits the abort.
Same class of missing-case bug that PR #82 fixed for F16+Q8_0, just not extended to the turbo V-cache variants.
Proposed fix:
Add FATTN_VEC_CASES_ALL_D entries for F16 + TURBO2_0, F16 + TURBO3_0, F16 + TURBO4_0 in fattn.cu around line 284 where the existing F16-K asymmetric cases live. Happy to submit a PR if the fix is as simple as it looks and there is no deeper reason these were intentionally excluded.
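If the exclusion is not intentional, the additions would presumably mirror the existing F16-K asymmetric entries. A sketch only: the TURBO type names are this fork's, and the exact macro name and arity should be checked against the neighbouring cases before submitting.

```cpp
// in ggml/src/ggml-cuda/fattn.cu, next to the existing F16-K asymmetric cases
// (sketch; verify macro arity against the surrounding entries)
FATTN_VEC_CASES_ALL_D(GGML_TYPE_F16, GGML_TYPE_TURBO2_0)
FATTN_VEC_CASES_ALL_D(GGML_TYPE_F16, GGML_TYPE_TURBO3_0)
FATTN_VEC_CASES_ALL_D(GGML_TYPE_F16, GGML_TYPE_TURBO4_0)
```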
Context: This was found during validation of PR #82 on DGX Spark GB10 unified memory, posted in Discussion ggml-org#20969.
First Bad Commit
Not a regression. These dispatch cases appear to have never been added.
Relevant log output
Logs
```
ggml_cuda_init: found 1 CUDA devices (Total VRAM: 24090 MiB):
Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes, VRAM: 24090 MiB
| qwen3moe 30B.A3B Q4_K - Medium | 17.28 GiB | ... | f16 | f16 | 1 | pp8192+tg128 | 4872.72 ± 15.83 |
| qwen3moe 30B.A3B Q4_K - Medium | 17.28 GiB | ... | f16 | q8_0 | 1 | pp32768+tg128 | 4306.00 ± 5.17 |
/workspace/llama-cpp-turboquant/ggml/src/ggml-cuda/fattn.cu:339: fatal error
/workspace/llama-cpp-turboquant/build/bin/libggml-base.so.0(ggml_print_backtrace+0x21f)
/workspace/llama-cpp-turboquant/build/bin/libggml-base.so.0(ggml_abort+0x152)
/workspace/llama-cpp-turboquant/build/bin/libggml-cuda.so.0(_Z24ggml_cuda_flash_attn_extR25ggml_backend_cuda_contextP11ggml_tensor+0x6fe)
/workspace/llama-cpp-turboquant/build/bin/libggml-base.so.0(ggml_backend_sched_graph_compute_async+0x81f)
/workspace/llama-cpp-turboquant/build/bin/libllama.so.0(llama_decode+0x10)
```
Same signature reproduced on sm_121 (DGX Spark GB10) with HEAD 107362298.