Name and Version
ggml_cuda_init: found 1 CUDA devices (Total VRAM: 16310 MiB):
Device 0: NVIDIA GeForce RTX 5060 Ti, compute capability 12.0, VMM: yes, VRAM: 16310 MiB
load_backend: loaded CUDA backend from C:\Chu\LLM\Llamacpp\ggml-cuda.dll
load_backend: loaded RPC backend from C:\Chu\LLM\Llamacpp\ggml-rpc.dll
load_backend: loaded CPU backend from C:\Chu\LLM\Llamacpp\ggml-cpu-zen4.dll
version: 8883 (134d6e5)
built with Clang 19.1.5 for Windows x86_64
Operating systems
Windows
GGML backends
CUDA
Hardware
AMD Ryzen 9 9900X
NVIDIA GeForce RTX 5060 Ti
Models
Reproduced with any two LLMs with image capability, even within the same family.
Try Qwen3.5-35b and Qwen3.5-27b for example.
Problem description & steps to reproduce
- Load llama-server with any vision model (e.g. Qwen3.5-35b), any gguf.
- Send a prompt with an image, e.g. "Describe this image".
- Close the llama-server back-end to unload the model.
- Load a different vision model in llama-server; any model works as long as it is not the same one (try Gemma or Qwen3.5-27b, for example).
- Refresh the llama-server webui in the browser (to make sure the samplers get updated based on the launch params of the new model).
- Click the "regenerate prompt" button: the new model will initially say it doesn't see any image, and the llama-server logs confirm no image file was sent.
- Click the "regenerate prompt" button again; this time the image will be sent.
So when I use the webui to test different models against a prompt containing an image, the image is not sent the first time "regenerate prompt" is clicked after the new model is loaded, but it is sent on the second click. This behavior started a few months ago; I'm not sure with which version. It does not happen when using the API from Python, so it appears to be a front-end quirk.
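For comparison, this is a minimal sketch of the kind of request the Python/API path sends, which works reliably: an OpenAI-style chat completion with the image inlined as a base64 data URL, as accepted by llama-server's /v1/chat/completions endpoint. The model name, port, and helper function here are just illustrative assumptions, not taken from my actual setup.

```python
import base64
import json

def build_image_chat_request(image_bytes: bytes, prompt: str,
                             model: str = "qwen") -> dict:
    # Build an OpenAI-style chat request with the image embedded inline
    # as a base64 data URL (the format llama-server's multimodal endpoint
    # accepts). Model name and mime type are placeholder assumptions.
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {
                        "type": "image_url",
                        "image_url": {"url": f"data:image/png;base64,{b64}"},
                    },
                ],
            }
        ],
    }

# POST this as JSON to the server, e.g. http://localhost:8080/v1/chat/completions
payload = build_image_chat_request(b"\x89PNG...", "Describe this image")
print(json.dumps(payload)[:40])
```

When the request is built explicitly like this, the image is always present in the payload, so the server-side logs always show image processing; only the webui's "regenerate prompt" path drops it on the first click.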
First Bad Commit
Unfortunately it's been a few months and I did not take note of which release started it.
Relevant log output
Not sure if the logs are useful here: they simply contain no image-processing entries when "regenerate prompt" is clicked the first time (after loading a different VLM), and the entries appear on the second attempt.