[Security] Denial of Service via Blocking Event Loop in Model Workers (Incomplete Fix for ff66426) #3793

@YLChen-007

Description

Exploitability Summary

| Aspect | Status |
| --- | --- |
| External Attack Path | ✅ Verified - HTTP POST to /worker_generate and /worker_get_embeddings |
| Runtime Protections Bypassed | ✅ Yes - no async wrapping, no request timeout on the event loop |
| Requires Other Vulnerabilities | ✅ None - direct exploitation via the standard worker API |
| Real-World Exploitability | ✅ Confirmed via live exploit (9,008x amplification) - see Exploit-RealWorld-EventLoop-Blocking-DoS.md |

Vulnerability Overview

CWE-400: Uncontrolled Resource Consumption / CWE-834: Excessive Iteration

Three unpatched instances of the same blocking event loop vulnerability exist in FastChat's model worker implementations. These are direct variants of the bug fixed in commit ff66426.

Background: The Original Fix

Commit ff66426 fixed a blocking event loop issue in base_model_worker.py where worker.generate_gate(params) was called synchronously in an async FastAPI handler. The fix wrapped the call with asyncio.to_thread():

# BEFORE (VULNERABLE - ff66426 fix target):
@app.post("/worker_generate")
async def api_generate(request: Request):
    params = await request.json()
    await acquire_worker_semaphore()
    output = worker.generate_gate(params)  # ← BLOCKS event loop!
    release_worker_semaphore()
    return JSONResponse(output)

# AFTER (FIXED):
@app.post("/worker_generate")
async def api_generate(request: Request):
    params = await request.json()
    await acquire_worker_semaphore()
    output = await asyncio.to_thread(worker.generate_gate, params)  # ← Non-blocking!
    release_worker_semaphore()
    return JSONResponse(output)

Unpatched Variant 1: multi_model_worker.py api_generate() (HIGH)

File: fastchat/serve/multi_model_worker.py, line 112

@app.post("/worker_generate")
async def api_generate(request: Request):
    params = await request.json()
    await acquire_worker_semaphore()
    worker = worker_map[params["model"]]
    output = worker.generate_gate(params)  # ← BLOCKS event loop! (not wrapped in asyncio.to_thread)
    release_worker_semaphore()
    return JSONResponse(output)

worker.generate_gate() calls generate_stream_gate(), which runs GPU inference via PyTorch. This can take seconds to minutes depending on the model and input size. During this time, the FastAPI async event loop is completely blocked, meaning:

  • No other HTTP requests are processed
  • Heartbeat messages stop being sent
  • The controller may deregister the worker
  • Other users' requests are stalled
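
The effect is easy to reproduce in isolation. The following minimal sketch uses plain asyncio, with time.sleep standing in for synchronous GPU inference (no FastChat code involved): a synchronous call inside one coroutine stalls every other coroutine on the same event loop.

```python
import asyncio
import time

async def blocking_generate():
    # Stands in for worker.generate_gate(): synchronous work that never
    # yields control back to the event loop while it runs
    time.sleep(0.5)
    return "generate done"

async def status_check():
    # Stands in for /worker_get_status or a heartbeat coroutine
    return "status ok"

async def main():
    start = time.monotonic()
    # status_check is trivially fast, yet it cannot run until the
    # synchronous time.sleep inside blocking_generate returns
    await asyncio.gather(blocking_generate(), status_check())
    return time.monotonic() - start

elapsed = asyncio.run(main())
print(f"status check was stalled for {elapsed:.2f}s")
```

With real inference taking 30-120 seconds instead of 0.5, every coroutine on the worker's loop (status checks, heartbeats, other requests) is stalled for that entire duration.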

Unpatched Variant 2: base_model_worker.py api_get_embeddings() (HIGH)

File: fastchat/serve/base_model_worker.py, line 218

@app.post("/worker_get_embeddings")
async def api_get_embeddings(request: Request):
    params = await request.json()
    await acquire_worker_semaphore()
    embedding = worker.get_embeddings(params)  # ← BLOCKS event loop!
    release_worker_semaphore()
    return JSONResponse(content=embedding)

worker.get_embeddings() runs @torch.inference_mode() decorated GPU inference with tokenization, model forward pass, and embedding normalization. This is a synchronous blocking call in an async handler.

Note: The fix for generate_gate in base_model_worker.py line 209 was applied correctly via asyncio.to_thread(), but the developer missed applying the same fix to get_embeddings in the same file.

Unpatched Variant 3: huggingface_api_worker.py api_generate() (MEDIUM)

File: fastchat/serve/huggingface_api_worker.py, line 236

@app.post("/worker_generate")
async def api_generate(request: Request):
    params = await request.json()
    worker = worker_map[params["model"]]
    await acquire_worker_semaphore(worker)
    output = worker.generate_gate(params)  # ← BLOCKS event loop!
    release_worker_semaphore(worker)
    return JSONResponse(output)

HuggingfaceApiWorker.generate_gate() calls generate_stream_gate() which makes HTTP calls to the HuggingFace Inference API. While these are I/O-bound rather than GPU-bound, they are still synchronous blocking calls in an async handler. The InferenceClient.text_generation() call is a synchronous HTTP request.
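
A minimal sketch (time.sleep standing in for the synchronous HTTP round-trip to the HuggingFace Inference API; no FastChat code involved) shows why asyncio.to_thread is the right shape of fix here too - the loop stays responsive while the blocking call runs on a worker thread:

```python
import asyncio
import time

def sync_inference_call():
    # Stand-in for the synchronous InferenceClient.text_generation() HTTP call
    time.sleep(0.3)
    return "generation result"

heartbeat_latency = {}

async def heartbeat(start):
    # A short coroutine that completes promptly only if the loop is free
    await asyncio.sleep(0.05)
    heartbeat_latency["t"] = time.monotonic() - start

async def fixed_handler():
    # Offload the blocking call to a thread; the event loop keeps running
    return await asyncio.to_thread(sync_inference_call)

async def main():
    start = time.monotonic()
    output, _ = await asyncio.gather(fixed_handler(), heartbeat(start))
    return output

result = asyncio.run(main())
print(f"heartbeat completed after {heartbeat_latency['t']:.3f}s")
```

With the unpatched synchronous call, the heartbeat would not complete until the full 0.3s call returned; with to_thread it completes at roughly 0.05s, concurrently with the blocking work.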

Attack Chain

# Variant 1: multi_model_worker
[Attacker]
    → POST /v1/chat/completions (stream=false, max_tokens=4096)
    → [OpenAI API Server] calls /worker_generate on multi_model_worker
    → multi_model_worker.api_generate():
        → worker.generate_gate(params)  # GPU inference, blocks event loop
        → Duration: 30-120 seconds for large outputs
    → During this time: ALL other requests to this worker STALLED
        → Heartbeat fails → Controller deregisters worker
        → Other users' requests timeout

# Variant 2: base_model_worker (embeddings)
[Attacker]
    → POST /v1/embeddings (large batch of text inputs)
    → [OpenAI API Server] calls /worker_get_embeddings on worker
    → base_model_worker.api_get_embeddings():
        → worker.get_embeddings(params)  # GPU inference, blocks event loop
        → Duration: seconds to minutes for large batches
    → During this time: ALL other requests to this worker STALLED

Impact

  1. Complete Worker DoS: A single request can stall all other requests to the model worker for the duration of GPU inference (potentially minutes)
  2. Heartbeat Failure: Blocked heartbeats cause the controller to deregister the worker, making the model unavailable to all users
  3. Cascading Failure: In a multi-model setup with multi_model_worker, ALL models served by that worker become unavailable
  4. No Authentication Required: Worker endpoints are not authenticated (only the OpenAI API server optionally uses API keys)
  5. Amplification: An attacker can chain multiple blocking requests to keep the worker perpetually unavailable

Reproduction Steps

Prerequisites

  1. A running FastChat deployment with model worker(s)
  2. Python 3 with aiohttp installed

Step 1: Static Validation (no running server needed)

cd /root/llm-project/FastChat-huntr
python3 llm-enhance/cve-finding/DoS/poc-blocking-event-loop.py --validation-only

Expected output: Confirms 3 unpatched vulnerable code locations.

Step 2: Start FastChat (for active testing)

# Terminal 1: Controller
python3 -m fastchat.serve.controller --host 0.0.0.0 --port 21001

# Terminal 2: Multi-Model Worker
python3 -m fastchat.serve.multi_model_worker \
    --model-path lmsys/vicuna-7b-v1.5 \
    --controller-address http://localhost:21001 \
    --worker-address http://localhost:21002 \
    --port 21002

# Terminal 3: OpenAI API Server
python3 -m fastchat.serve.openai_api_server --host 0.0.0.0 --port 8000 \
    --controller-address http://localhost:21001

Step 3: Exploit - Direct Worker DoS (Variant 1)

# Send a blocking generate request directly to the multi_model_worker
python3 llm-enhance/cve-finding/DoS/poc-blocking-event-loop.py \
    --target http://localhost:21002 --variant generate --model vicuna-7b-v1.5

Step 4: Exploit - Embeddings DoS (Variant 2)

# Send a blocking embeddings request to any model worker
python3 llm-enhance/cve-finding/DoS/poc-blocking-event-loop.py \
    --target http://localhost:21002 --variant embedding --model vicuna-7b-v1.5

Step 5: Exploit - Via OpenAI API (indirect)

# Non-streaming requests go through /worker_generate which blocks the worker
python3 llm-enhance/cve-finding/DoS/poc-blocking-event-loop.py \
    --target http://localhost:8000 --variant api --model vicuna-7b-v1.5

Step 6: Manual Verification with curl

# In terminal A: Send a blocking request (non-streaming mode hits /worker_generate)
curl -X POST http://localhost:21002/worker_generate \
  -H "Content-Type: application/json" \
  -d '{"model":"vicuna-7b-v1.5","prompt":"Write a very long essay about history","temperature":0.7,"max_new_tokens":2048}' &

# Immediately in terminal B: Try to get worker status (will be blocked!)
time curl -X POST http://localhost:21002/worker_get_status -H "Content-Type: application/json" -d '{}'

# Expected: The status check takes many seconds (blocked by generate_gate)

Root Cause Files

| File | Line | Issue |
| --- | --- | --- |
| fastchat/serve/multi_model_worker.py | 112 | output = worker.generate_gate(params) - synchronous GPU inference in an async handler |
| fastchat/serve/base_model_worker.py | 218 | embedding = worker.get_embeddings(params) - synchronous GPU inference in an async handler |
| fastchat/serve/huggingface_api_worker.py | 236 | output = worker.generate_gate(params) - synchronous API call in an async handler |

Suggested Fix

Apply the same asyncio.to_thread() pattern used in the fix for base_model_worker.py api_generate():

# Fix for multi_model_worker.py api_generate():
@app.post("/worker_generate")
async def api_generate(request: Request):
    params = await request.json()
    await acquire_worker_semaphore()
    worker = worker_map[params["model"]]
    output = await asyncio.to_thread(worker.generate_gate, params)  # ← Fix
    release_worker_semaphore()
    return JSONResponse(output)

# Fix for base_model_worker.py api_get_embeddings():
@app.post("/worker_get_embeddings")
async def api_get_embeddings(request: Request):
    params = await request.json()
    await acquire_worker_semaphore()
    embedding = await asyncio.to_thread(worker.get_embeddings, params)  # ← Fix
    release_worker_semaphore()
    return JSONResponse(content=embedding)

# Fix for huggingface_api_worker.py api_generate():
@app.post("/worker_generate")
async def api_generate(request: Request):
    params = await request.json()
    worker = worker_map[params["model"]]
    await acquire_worker_semaphore(worker)
    output = await asyncio.to_thread(worker.generate_gate, params)  # ← Fix
    release_worker_semaphore(worker)
    return JSONResponse(output)
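
A further hardening worth considering alongside the fix (hypothetical sketch, not part of the upstream patch): release the semaphore in a finally block, so an exception raised inside generate_gate cannot leak the concurrency slot and leave the worker wedged.

```python
import asyncio

def generate_gate(params):
    # Stand-in for the worker's synchronous inference call
    return {"text": "echo: " + params["prompt"], "error_code": 0}

async def api_generate(semaphore, params):
    await semaphore.acquire()
    try:
        # Blocking work runs on a thread; the slot is released even if
        # generate_gate raises, so one failed request cannot wedge the worker
        return await asyncio.to_thread(generate_gate, params)
    finally:
        semaphore.release()

async def main():
    # One in-flight request, mirroring the worker's concurrency limit
    semaphore = asyncio.Semaphore(1)
    return await api_generate(semaphore, {"prompt": "hello"})

output = asyncio.run(main())
```

The current acquire/release pattern already works on the happy path; the try/finally only changes behavior when generate_gate raises.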

CVSS Assessment

  • Attack Vector: Network (AV:N)
  • Attack Complexity: Low (AC:L)
  • Privileges Required: None (PR:N) — worker endpoints are unauthenticated
  • User Interaction: None (UI:N)
  • Scope: Unchanged (S:U)
  • Confidentiality: None (C:N)
  • Integrity: None (I:N)
  • Availability: High (A:H)

CVSS 3.1 Score: 7.5 (High)
Vector: CVSS:3.1/AV:N/AC:L/PR:N/UI:N/S:U/C:N/I:N/A:H

Classification

  • CWE-400: Uncontrolled Resource Consumption
  • CWE-834: Excessive Iteration (event loop starvation)
  • Vulnerability Type: Denial of Service (DoS) — Event Loop Blocking
  • Severity: HIGH
  • Source Patch: ff664260a5c99d29b57de6489bb0fee1f04b11ca (Fixed model_worker generate_gate may blocked main thread)
