fix(ai-proxy): yield to scheduler in streaming SSE loop to avoid worker CPU starvation #13255
nic-6443 wants to merge 2 commits into apache:master
Conversation
…er CPU starvation

When an upstream LLM emits SSE chunks in a tight burst (e.g. a model hallucinating and producing tokens at 100+ per second), the streaming loop in `parse_streaming_response` can run for an extended period without yielding to the nginx scheduler. `body_reader()` (cosocket recv) only yields when the recv buffer is empty; if the kernel has already buffered several chunks, successive calls return immediately. `ngx.flush(true)` only yields when the downstream send buffer is full; a fast client drains it immediately. So neither end of the loop guarantees a yield, and the SSE coroutine ends up monopolizing the worker, starving health checks, concurrent requests, and timer callbacks on the same worker.

Add an explicit `ngx.sleep(0)` at the end of each loop iteration. This is a no-op timer that just yields the current coroutine, allowing other ready coroutines to run. The cost is negligible: in normal AI traffic, chunks already arrive with inter-chunk gaps, so an extra yield per chunk is invisible; in burst scenarios it caps per-coroutine runtime to one chunk's worth of work.
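The loop shape described above can be sketched as follows. This is a minimal illustration, not the actual code from `apisix/plugins/ai-providers/base.lua`; `body_reader` and `send_chunk` are hypothetical stand-ins for the real cosocket reader and downstream writer:

```lua
-- Illustrative sketch of an SSE streaming loop with the added yield.
-- Assumes an OpenResty request context (ngx.* APIs are available).
local function stream_sse(body_reader, send_chunk)
    while true do
        -- May return instantly if the kernel has already buffered data,
        -- i.e. no yield happens here in a burst scenario.
        local chunk, err = body_reader()
        if err then
            return nil, err
        end
        if not chunk then
            break  -- upstream stream finished
        end

        send_chunk(chunk)
        -- May also return instantly if the downstream client drains the
        -- send buffer quickly, so no yield is guaranteed here either.
        ngx.flush(true)

        -- Explicit cooperative yield: queues a 0-second timer and lets
        -- other ready coroutines on this worker (health checks, timers,
        -- concurrent requests) run before this stream continues.
        ngx.sleep(0)
    end
    return true
end
```

Because `ngx.sleep(0)` is per-iteration, a burst of already-buffered chunks can no longer keep the coroutine on the CPU for more than one chunk's worth of work at a time.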
Pull request overview
This PR mitigates OpenResty worker CPU starvation during bursty AI SSE streaming by forcing a cooperative yield in the AI provider streaming loop, ensuring other coroutines (health checks, concurrent requests, timers) can run even when both upstream reads and downstream flushes return immediately.
Changes:
- Add an explicit `ngx.sleep(0)` yield at the end of each iteration of the SSE streaming loop.
- Document in-code why the yield is necessary under bursty upstream + fast downstream conditions.
```lua
        plugin.lua_response_filter(ctx, res.headers, chunk)
    end

    -- Yield to the nginx scheduler so other coroutines on this worker
```
It doesn't completely prevent CPU exhaustion; it only prevents the current request from monopolizing CPU resources.
It's a temporary, quick fix, but it doesn't truly solve or avoid the problem.
Create a todo issue and update the current code comments to clarify that the current fix is only a workaround, not a real solution.
Good point — agreed it is a workaround, not a real fix.
Filed #13256 to track the proper solution (per-stream CPU/time bounds, backpressure, fairness for SSE streaming) and updated the inline comment in 401a7ae to be explicit that this yield only prevents single-request monopolization and points to the tracking issue.
Proposing we keep the workaround for now since it does materially improve worker behavior under bursty providers, with the real fix tracked separately.
…e#13256

Per review feedback, the comment now states explicitly that the yield prevents one request from monopolizing the worker but does not bound per-stream CPU time, add backpressure, or time out stalled streams. A real fix is tracked in apache#13256.
What
Add an explicit `ngx.sleep(0)` at the end of each iteration of the streaming SSE loop in `apisix/plugins/ai-providers/base.lua::parse_streaming_response`. This guarantees the coroutine yields to the nginx scheduler at least once per upstream chunk.

Why
In production we observed worker processes pinned at 100% CPU during AI proxy traffic. Root cause: when an upstream LLM emits SSE chunks in a tight burst (e.g. a model hallucinating and producing tokens at 100+ per second, or upstreams that batch multiple SSE events into a single TCP segment), the streaming loop runs for an extended period without yielding.
Specifically:
- `body_reader()` (cosocket `socket:receive()`) only yields when the recv buffer is empty. If the kernel has already buffered several chunks, successive calls return immediately without yielding.
- `ngx.flush(true)` (used downstream) only yields when the send buffer is full. A fast downstream client drains immediately, so flush returns without yielding.

Neither end of the loop guarantees a yield. The result: the SSE coroutine monopolizes the worker, starving health checks, concurrent requests on the same worker, and timer callbacks. Even modest traffic can saturate a single core because Lua coroutines on the same OpenResty worker share one OS thread.
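One way to observe this kind of starvation in a live worker (a hypothetical diagnostic sketch, not part of this PR) is a self-rescheduling timer that measures its own scheduling latency; when a streaming coroutine monopolizes the worker, the gap between ticks grows far beyond the requested interval:

```lua
-- Hypothetical starvation probe for an OpenResty worker: reschedules
-- itself every 100 ms and logs a warning when the observed gap between
-- runs is much larger than requested, which indicates some coroutine
-- is hogging the worker without yielding.
local interval = 0.1  -- seconds between probe ticks

local function probe(premature, last)
    if premature then
        return  -- worker is shutting down
    end
    local now = ngx.now()
    if last and (now - last) > interval * 5 then
        ngx.log(ngx.WARN, "timer starved: ", now - last,
                "s between ticks (expected ~", interval, "s)")
    end
    -- ngx.timer.at(delay, callback, ...) passes extra args after the
    -- `premature` flag, so the next tick receives this tick's timestamp.
    local ok, err = ngx.timer.at(interval, probe, now)
    if not ok then
        ngx.log(ngx.ERR, "failed to reschedule probe: ", err)
    end
end

assert(ngx.timer.at(interval, probe, nil))
```

Run from `init_worker_by_lua*`, a probe like this would have flagged the bursty-SSE scenario, since timer callbacks are among the coroutines the streaming loop starves.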
`ngx.sleep(0)` is the canonical OpenResty primitive for this: it queues a 0-second timer and yields the current coroutine, letting the scheduler pick up any other ready coroutines, then resumes.

Cost
In normal traffic, `body_reader()` already yields naturally between chunks. The extra `ngx.sleep(0)` is invisible.

Test plan
This is a concurrency / scheduling fix where deterministic reproduction in test-nginx is difficult: burst behavior depends on TCP buffering between the mock upstream and the proxy, both of which run in the same nginx instance during tests, so timing rarely matches the real-world scenario. Existing streaming correctness tests (`t/plugin/ai-proxy*.t`, `t/plugin/ai-proxy-client-disconnect.t`) cover that the loop still produces correct output and that the new yield doesn't break the disconnect-detection or limit-enforcement paths. Per the project's testing exception for "concurrency issues that are hard to simulate", I'm relying on existing tests for correctness regression coverage.