Describe the bug
llama-swap interacts badly with NGINX when streaming responses.
Without setting proxy_buffering off; in NGINX, responses served directly from llama-server (llama.cpp) are proxied correctly, without buffering.
However, when the response passes through llama-swap on its way back to NGINX, NGINX buffers the SSE stream. This causes stream=true requests to stutter.
There are two possible fixes. Either the NGINX deployment sets proxy_buffering off;,
or llama-swap sends the response header X-Accel-Buffering: no to stop NGINX from buffering.
I propose adding w.Header().Set("X-Accel-Buffering", "no") to ProxyRequest in process.go.
Operating system and version
- OS: linux
- GPUs: not applicable