Eval bug: OOM Error for Vulkan Backend on models that don't strictly fit in VRAM #18642

@jhemmond

Description


Name and Version

b7502

Operating systems

Windows

GGML backends

Vulkan

Hardware

Ryzen AI 370HX with 64GB LPDDR5X, and 890m iGPU

Models

Command: ${llamasvr} -m ${mpath}\Qwen3-Next-80B-A3B-Instruct-UD-Q4_K_XL.gguf --jinja --no-mmap --ctx-size 20000 --temp 0.7 --top-p 0.8 --top-k 20 --min-p 0.00

The issue also happens with GLM-4.5-Air and Llama-3.3-70B.

Problem description & steps to reproduce

Starting with version b7502, and up through the current b7642 today, loading any of the affected models throws:

.........srv  log_server_r: request: GET /health 127.0.0.1 503

................srv  log_server_r: request: GET /health 127.0.0.1 503

..............srv  log_server_r: request: GET /health 127.0.0.1 503

..........llama_model_load: error loading model: vk::Queue::submit: ErrorOutOfDeviceMemory

llama_model_load_from_file_impl: failed to load model

srv  log_server_r: request: GET /health 127.0.0.1 503

common_init_from_params: failed to load model 'C:\Users\PC\llama-cpp\models\\Qwen3-Next-80B-A3B-Instruct-UD-Q4_K_XL.gguf'

srv    load_model: failed to load model, 'C:\Users\PC\llama-cpp\models\\Qwen3-Next-80B-A3B-Instruct-UD-Q4_K_XL.gguf'

srv    operator(): operator(): cleaning up before exit...

main: exiting due to model loading error

I have 32 GB of system RAM allocated in the BIOS to the 890M iGPU. Models that fit inside the 32 GB run fine; models larger than that fail. If I remove --no-mmap, the model loads, but for whatever reason it falls back to reading from disk, which takes a really long time. Prior versions do not have this issue.
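As a rough sanity check of why this model trips the 32 GB carve-out, here is a hedged back-of-envelope sketch. The bits-per-weight figure (~4.5 for Q4_K-style quants) and the 80e9 parameter count are assumptions for illustration, not values taken from the report or from the GGUF file itself:

```python
GIB = 1024 ** 3

def quantized_size_bytes(n_params: float, bits_per_weight: float) -> float:
    """Approximate in-memory size of a quantized model (ignores KV cache
    and activation buffers, which only make the situation worse)."""
    return n_params * bits_per_weight / 8

# Assumed: ~80e9 params, ~4.5 bits/weight for a Q4_K_XL-style quant.
model_bytes = quantized_size_bytes(80e9, 4.5)
vram_budget = 32 * GIB  # BIOS allocation to the iGPU

print(f"model ~= {model_bytes / GIB:.1f} GiB, budget = 32 GiB, "
      f"fits: {model_bytes <= vram_budget}")
```

Under these assumptions the weights alone come to roughly 42 GiB, well over the 32 GiB budget, which matches the observation that only models under the carve-out size load cleanly.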

First Bad Commit

b7502
