main: model loaded
main: server is listening on http://0.0.0.0:8080
main: starting the main loop...
srv update_slots: all slots are idle
srv params_from_: Chat format: peg-native
slot get_availabl: id 3 | task -1 | selected slot by LRU, t_last = -1
srv get_availabl: updating prompt cache
srv load: - looking for better prompt, base f_keep = -1.000, sim = 0.000
srv update: - cache state: 0 prompts, 0.000 MiB (limits: 8192.000 MiB, 262144 tokens, 8589934592 est)
srv get_availabl: prompt cache update took 0.01 ms
slot launch_slot_: id 3 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> ?top-p -> min-p -> ?xtc -> ?temp-ext -> dist
slot launch_slot_: id 3 | task 0 | processing task, is_child = 0
slot update_slots: id 3 | task 0 | new prompt, n_ctx_slot = 262144, n_keep = 0, task.n_tokens = 551
slot update_slots: id 3 | task 0 | n_tokens = 0, memory_seq_rm [0, end)
slot update_slots: id 3 | task 0 | prompt processing progress, n_tokens = 35, batch.n_tokens = 35, progress = 0.063521
srv params_from_: Chat format: peg-native
slot get_availabl: id 2 | task -1 | selected slot by LRU, t_last = -1
srv get_availabl: updating prompt cache
srv load: - looking for better prompt, base f_keep = -1.000, sim = 0.000
srv update: - cache state: 0 prompts, 0.000 MiB (limits: 8192.000 MiB, 262144 tokens, 8589934592 est)
srv get_availabl: prompt cache update took 0.01 ms
slot launch_slot_: id 2 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> ?top-p -> min-p -> ?xtc -> ?temp-ext -> dist
slot launch_slot_: id 2 | task 2 | processing task, is_child = 0
slot update_slots: id 2 | task 2 | new prompt, n_ctx_slot = 262144, n_keep = 0, task.n_tokens = 10738
slot update_slots: id 2 | task 2 | n_tokens = 0, memory_seq_rm [0, end)
slot update_slots: id 2 | task 2 | prompt processing progress, n_tokens = 2048, batch.n_tokens = 2048, progress = 0.190725
slot update_slots: id 2 | task 2 | n_tokens = 2048, memory_seq_rm [2048, end)
slot update_slots: id 2 | task 2 | prompt processing progress, n_tokens = 4096, batch.n_tokens = 2048, progress = 0.381449
slot update_slots: id 2 | task 2 | n_tokens = 4096, memory_seq_rm [4096, end)
slot update_slots: id 2 | task 2 | prompt processing progress, n_tokens = 6144, batch.n_tokens = 2048, progress = 0.572174
slot update_slots: id 2 | task 2 | n_tokens = 6144, memory_seq_rm [6144, end)
slot update_slots: id 2 | task 2 | prompt processing progress, n_tokens = 8192, batch.n_tokens = 2048, progress = 0.762898
slot update_slots: id 2 | task 2 | n_tokens = 8192, memory_seq_rm [8192, end)
slot update_slots: id 2 | task 2 | 8192 tokens since last checkpoint at 0, creating new checkpoint during processing at position 10222
slot update_slots: id 2 | task 2 | prompt processing progress, n_tokens = 10222, batch.n_tokens = 2030, progress = 0.951946
/llm_backends/llama.cpp/ggml/src/ggml-backend-meta.cpp:1298: GGML_ASSERT(offset == 0) failed
/llm_backends/llama.cpp/build/bin/libggml-base.so.0(+0x18b98)[0x7e149175eb98]
/llm_backends/llama.cpp/build/bin/libggml-base.so.0(ggml_print_backtrace+0x1e4)[0x7e149175ef74]
/llm_backends/llama.cpp/build/bin/libggml-base.so.0(ggml_abort+0x11e)[0x7e149175f0fe]
/llm_backends/llama.cpp/build/bin/libggml-base.so.0(+0x3b0fe)[0x7e14917810fe]
/llm_backends/llama.cpp/build/bin/libllama.so.0(_ZN21llama_io_write_buffer12write_tensorEPK11ggml_tensormm+0x2b)[0x7e14918b005b]
/llm_backends/llama.cpp/build/bin/libllama.so.0(_ZNK22llama_memory_recurrent16state_write_dataER16llama_io_write_iRKSt6vectorISt4pairIjjESaIS4_EE+0x11d)[0x7e149190a3fd]
/llm_backends/llama.cpp/build/bin/libllama.so.0(_ZNK22llama_memory_recurrent11state_writeER16llama_io_write_iij+0x218)[0x7e149190a718]
/llm_backends/llama.cpp/build/bin/libllama.so.0(_ZN13llama_context20state_seq_write_dataER16llama_io_write_iij+0x16)[0x7e14918a5ab6]
/llm_backends/llama.cpp/build/bin/libllama.so.0(_ZN13llama_context18state_seq_get_dataEiPhmj+0x3a)[0x7e14918a5b4a]
./llama-server(+0xcb5e6)[0x5a3d467ca5e6]
./llama-server(+0xfed6e)[0x5a3d467fdd6e]
./llama-server(+0x185739)[0x5a3d46884739]
./llama-server(+0x6ae55)[0x5a3d46769e55]
/lib/x86_64-linux-gnu/libc.so.6(+0x2724a)[0x7e149124524a]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x85)[0x7e1491245305]
./llama-server(+0x6b4f1)[0x5a3d4676a4f1]
Aborted
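For readability, the key mangled frames in the backtrace can be decoded with c++filt, e.g.:

echo _ZN21llama_io_write_buffer12write_tensorEPK11ggml_tensormm | c++filt

which prints llama_io_write_buffer::write_tensor(ggml_tensor const*, unsigned long, unsigned long); the frame below it demangles to llama_memory_recurrent::state_write_data(...). Together with the "creating new checkpoint during processing at position 10222" line, the abort seems to happen while the server serializes a recurrent-memory checkpoint rather than during prompt decoding itself.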
Name and Version
./llama-server --version
ggml_cuda_init: found 3 CUDA devices (Total VRAM: 145624 MiB):
Device 0: NVIDIA RTX A6000, compute capability 8.6, VMM: yes, VRAM: 48541 MiB
Device 1: NVIDIA RTX A6000, compute capability 8.6, VMM: yes, VRAM: 48541 MiB
Device 2: NVIDIA RTX A6000, compute capability 8.6, VMM: yes, VRAM: 48541 MiB
version: 8892 (0d0764d)
built with GNU 12.2.0 for Linux x86_64
Operating systems
Linux
GGML backends
CUDA
Hardware
3x A6000 (Ampere generation)
Models
Qwen 3.6 35B MoE (Q8_0)
https://huggingface.co/unsloth/Qwen3.6-35B-A3B-GGUF
Problem description & steps to reproduce
With "-sm tensor" enabled, the model appears to get through prompt processing, but then fails to generate any output and immediately aborts with the assertion failure shown in the log above.
Command used:
./llama-server --model ~/models/qwen3.6_35b/Qwen3.6-35B-A3B-Q8_0.gguf --ctx-size 262144 --alias "qwen35b" -ngl 999 -sm tensor --jinja --mmproj ~/models/qwen3.6_35b/mmproj-Qwen_Qwen3.6-35B-A3B-bf16.gguf
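The exact client requests are not included here; for reference, a request of roughly this shape against the server's standard /v1/chat/completions endpoint (placeholder prompt content; the alias and port come from the command and log above) is the kind of traffic that drives the slot activity shown in the log, and the crash hits the second, ~10k-token request:

curl http://0.0.0.0:8080/v1/chat/completions -H "Content-Type: application/json" \
  -d '{"model": "qwen35b", "messages": [{"role": "user", "content": "<long prompt, roughly 10k tokens>"}]}'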
First Bad Commit
No response
Relevant log output