
Eval bug: F16-K + TURBO-V missing from fattn dispatch, crashes at fattn.cu:339 on sm_89 and sm_121 #83

@dentity007

Description


Name and Version

TheTom/llama-cpp-turboquant

Operating systems

Linux

GGML backends

CUDA

Hardware

Reproduced on two hardware configurations:

  • NVIDIA GB10 Grace Blackwell (sm_121, 128GB unified memory, CUDA 13.0.2, aarch64 Ubuntu 24.04)
  • NVIDIA RTX 4090 (sm_89, 24GB discrete VRAM, CUDA 12.4, x86_64 Ubuntu 22.04, RunPod hosts in Iowa US and Timisoara Romania)

Models

Qwen3-30B-A3B Q4_K_M and UD-Q4_K_XL variants from https://huggingface.co/unsloth/Qwen3-30B-A3B-GGUF

Problem description & steps to reproduce

Running llama-bench with flash attention enabled and any turbo V-cache type paired with f16 K-cache crashes consistently across both architectures.

Reproduce on any CUDA hardware (sm_89 or sm_121 both affected):

./build/bin/llama-bench \
  -m Qwen3-30B-A3B-Q4_K_M.gguf \
  -fa 1 -t 1 -ngl 99 \
  -p 0 -n 128 -pg 8192,128 \
  -ctk f16 -ctv turbo3

Expected: benchmark runs on GPU.
Actual: crash at fattn.cu:339 with ggml_cuda_flash_attn_ext then GGML_ABORT("fatal error").

Root cause analysis:

Line 339 in ggml/src/ggml-cuda/fattn.cu is the GGML_ABORT that fires when no K/V type combination matches in the dispatch. Checking the cases above it, the F16 K-cache has dispatch cases for the non-turbo V-cache types, but the following combinations are missing:

  • F16 + TURBO2_0
  • F16 + TURBO3_0
  • F16 + TURBO4_0

Symmetric turbo cases (turbo3/turbo3, turbo4/turbo4) exist, and the asymmetric turbo/q8_0 case exists, but F16-K + turbo-V has no asymmetric dispatch case, so it falls through to the abort.

Same class of missing-case bug that PR #82 fixed for F16+Q8_0, just not extended to the turbo V-cache variants.

Proposed fix:
Add FATTN_VEC_CASES_ALL_D entries for F16 + TURBO2_0, F16 + TURBO3_0, F16 + TURBO4_0 in fattn.cu around line 284 where the existing F16-K asymmetric cases live. Happy to submit a PR if the fix is as simple as it looks and there is no deeper reason these were intentionally excluded.

Context: This was found during validation of PR #82 on DGX Spark GB10 unified memory, posted in Discussion ggml-org#20969.

First Bad Commit

Not a regression. These dispatch cases appear to have never been added.

Relevant log output

Logs
ggml_cuda_init: found 1 CUDA devices (Total VRAM: 24090 MiB):
  Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes, VRAM: 24090 MiB
| qwen3moe 30B.A3B Q4_K - Medium |  17.28 GiB | ... |    f16 |    f16 |  1 |    pp8192+tg128 |       4872.72 ± 15.83 |
| qwen3moe 30B.A3B Q4_K - Medium |  17.28 GiB | ... |    f16 |   q8_0 |  1 |   pp32768+tg128 |        4306.00 ± 5.17 |
/workspace/llama-cpp-turboquant/ggml/src/ggml-cuda/fattn.cu:339: fatal error
/workspace/llama-cpp-turboquant/build/bin/libggml-base.so.0(ggml_print_backtrace+0x21f)
/workspace/llama-cpp-turboquant/build/bin/libggml-base.so.0(ggml_abort+0x152)
/workspace/llama-cpp-turboquant/build/bin/libggml-cuda.so.0(_Z24ggml_cuda_flash_attn_extR25ggml_backend_cuda_contextP11ggml_tensor+0x6fe)
/workspace/llama-cpp-turboquant/build/bin/libggml-base.so.0(ggml_backend_sched_graph_compute_async+0x81f)
/workspace/llama-cpp-turboquant/build/bin/libllama.so.0(llama_decode+0x10)

Same signature reproduced on sm_121 (DGX Spark GB10) with HEAD 107362298.
