Description
Exploitability Summary
| Aspect | Status |
|---|---|
| External Attack Path | ✅ Verified - HTTP POST to /worker_generate and /worker_get_embeddings |
| Runtime Protections Bypassed | ✅ Yes - No async wrapping, no request timeout on event loop |
| Requires Other Vulnerabilities | ✅ None - Direct exploitation via standard worker API |
| Real-World Exploitability | ✅ CONFIRMED via live exploit (9,008x amplification) — see Exploit-RealWorld-EventLoop-Blocking-DoS.md |
Vulnerability Overview
CWE-400: Uncontrolled Resource Consumption / CWE-834: Excessive Iteration
Three unpatched instances of the same blocking event loop vulnerability exist in FastChat's model worker implementations. These are direct variants of the bug fixed in commit ff66426.
Background: The Original Fix
Commit ff66426 fixed a blocking event loop issue in base_model_worker.py where worker.generate_gate(params) was called synchronously in an async FastAPI handler. The fix wrapped the call with asyncio.to_thread():
```python
# BEFORE (VULNERABLE - ff66426 fix target):
@app.post("/worker_generate")
async def api_generate(request: Request):
    params = await request.json()
    await acquire_worker_semaphore()
    output = worker.generate_gate(params)  # ← BLOCKS event loop!
    release_worker_semaphore()
    return JSONResponse(output)

# AFTER (FIXED):
@app.post("/worker_generate")
async def api_generate(request: Request):
    params = await request.json()
    await acquire_worker_semaphore()
    output = await asyncio.to_thread(worker.generate_gate, params)  # ← Non-blocking!
    release_worker_semaphore()
    return JSONResponse(output)
```
Unpatched Variant 1: multi_model_worker.py — api_generate() (HIGH)
File: fastchat/serve/multi_model_worker.py, line 112
```python
@app.post("/worker_generate")
async def api_generate(request: Request):
    params = await request.json()
    await acquire_worker_semaphore()
    worker = worker_map[params["model"]]
    output = worker.generate_gate(params)  # ← BLOCKS event loop! (not wrapped in asyncio.to_thread)
    release_worker_semaphore()
    return JSONResponse(output)
```
worker.generate_gate() calls generate_stream_gate(), which runs GPU inference via PyTorch. This can take seconds to minutes depending on the model and input. During this time, the FastAPI async event loop is completely blocked, meaning:
- No other HTTP requests are processed
- Heartbeat messages stop being sent
- The controller may deregister the worker
- Other users' requests are stalled
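The starvation effect listed above is easy to reproduce outside FastChat. Below is a minimal sketch (all names illustrative; time.sleep() stands in for synchronous GPU inference) that times how long a lightweight peer coroutine is delayed by a blocking handler versus an asyncio.to_thread()-wrapped one:

```python
import asyncio
import time

async def blocking_handler():
    # Synchronous work in an async handler, as in the unpatched variants:
    # time.sleep() freezes the whole event loop, like generate_gate() does.
    time.sleep(0.5)

async def fixed_handler():
    # The patched pattern: run the blocking call in a worker thread.
    await asyncio.to_thread(time.sleep, 0.5)

async def measure(handler):
    # A 50 ms timer stands in for a concurrent lightweight request
    # (e.g. /worker_get_status). A blocked loop stretches it to ~500 ms.
    start = time.monotonic()
    task = asyncio.create_task(handler())
    await asyncio.sleep(0.05)
    elapsed = time.monotonic() - start
    await task
    return elapsed

blocked = asyncio.run(measure(blocking_handler))
fixed = asyncio.run(measure(fixed_handler))
print(f"peer delayed by blocking handler:  {blocked:.2f}s")
print(f"peer delayed by to_thread handler: {fixed:.2f}s")
```

The first measurement comes out near the full 0.5 s of "inference" time, the second near the nominal 50 ms, which is exactly the gap the ff66426 patch closes.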
Unpatched Variant 2: base_model_worker.py — api_get_embeddings() (HIGH)
File: fastchat/serve/base_model_worker.py, line 218
```python
@app.post("/worker_get_embeddings")
async def api_get_embeddings(request: Request):
    params = await request.json()
    await acquire_worker_semaphore()
    embedding = worker.get_embeddings(params)  # ← BLOCKS event loop!
    release_worker_semaphore()
    return JSONResponse(content=embedding)
```
worker.get_embeddings() runs @torch.inference_mode()-decorated GPU inference: tokenization, a model forward pass, and embedding normalization. It is a synchronous blocking call in an async handler.
Note: The fix for generate_gate in base_model_worker.py line 209 was applied correctly via asyncio.to_thread(), but the developer missed applying the same fix to get_embeddings in the same file.
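Misses like this one can be caught mechanically. A rough sketch follows (the BLOCKING name set is an assumption based on this report, not existing FastChat tooling) that walks the AST and flags direct, un-awaited calls to known blocking methods inside async def handlers:

```python
import ast

# Method names considered blocking -- an assumption drawn from this report.
BLOCKING = {"generate_gate", "get_embeddings"}

def find_blocking_calls(source: str):
    """Return (handler, method, lineno) for blocking calls in async defs."""
    findings = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.AsyncFunctionDef):
            # Collect the expressions that are directly awaited.
            awaited = {id(n.value) for n in ast.walk(node)
                       if isinstance(n, ast.Await)}
            for call in ast.walk(node):
                if (isinstance(call, ast.Call)
                        and isinstance(call.func, ast.Attribute)
                        and call.func.attr in BLOCKING
                        and id(call) not in awaited):
                    findings.append((node.name, call.func.attr, call.lineno))
    return findings

sample = '''
async def api_generate(request):
    params = await request.json()
    output = worker.generate_gate(params)   # blocking, not awaited
    return output

async def api_fixed(request):
    params = await request.json()
    output = await asyncio.to_thread(worker.generate_gate, params)
    return output
'''
# Reports only the un-awaited generate_gate call in api_generate;
# the to_thread version passes the method as an argument, not a call.
print(find_blocking_calls(sample))
```

A scanner like this would have flagged all three variants in this report while passing the already-patched api_generate in base_model_worker.py.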
Unpatched Variant 3: huggingface_api_worker.py — api_generate() (MEDIUM)
File: fastchat/serve/huggingface_api_worker.py, line 236
```python
@app.post("/worker_generate")
async def api_generate(request: Request):
    params = await request.json()
    worker = worker_map[params["model"]]
    await acquire_worker_semaphore(worker)
    output = worker.generate_gate(params)  # ← BLOCKS event loop!
    release_worker_semaphore(worker)
    return JSONResponse(output)
```
HuggingfaceApiWorker.generate_gate() calls generate_stream_gate(), which makes HTTP calls to the HuggingFace Inference API. While these are I/O-bound rather than GPU-bound, they are still synchronous blocking calls in an async handler: InferenceClient.text_generation() issues a synchronous HTTP request.
Attack Chain
```
# Variant 1: multi_model_worker
[Attacker]
  → POST /v1/chat/completions (stream=false, max_tokens=4096)
  → [OpenAI API Server] calls /worker_generate on multi_model_worker
  → multi_model_worker.api_generate():
      → worker.generate_gate(params)  # GPU inference, blocks event loop
      → Duration: 30-120 seconds for large outputs
  → During this time: ALL other requests to this worker STALLED
  → Heartbeat fails → Controller deregisters worker
  → Other users' requests time out

# Variant 2: base_model_worker (embeddings)
[Attacker]
  → POST /v1/embeddings (large batch of text inputs)
  → [OpenAI API Server] calls /worker_get_embeddings on worker
  → base_model_worker.api_get_embeddings():
      → worker.get_embeddings(params)  # GPU inference, blocks event loop
      → Duration: seconds to minutes for large batches
  → During this time: ALL other requests to this worker STALLED
```
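The heartbeat failure in these chains is a direct consequence of the frozen loop and can be shown in isolation (names illustrative; time.sleep() stands in for the blocking inference call):

```python
import asyncio
import time

beats = []

async def heartbeat(interval=0.05):
    # Stands in for the worker's periodic heartbeat to the controller.
    while True:
        beats.append(time.monotonic())
        await asyncio.sleep(interval)

async def main():
    hb = asyncio.create_task(heartbeat())
    await asyncio.sleep(0.2)   # heartbeats flow normally (~every 50 ms)
    time.sleep(0.5)            # synchronous "inference" freezes the loop
    await asyncio.sleep(0.2)   # heartbeats resume
    hb.cancel()
    gaps = [b - a for a, b in zip(beats, beats[1:])]
    return max(gaps)

worst_gap = asyncio.run(main())
print(f"worst heartbeat gap: {worst_gap:.2f}s (nominal 0.05s)")
```

The worst gap is roughly the full duration of the blocking call; scale that to a 30-120 s inference and the controller's heartbeat timeout is easily exceeded.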
Impact
- Complete Worker DoS: A single request can stall all other requests to the model worker for the duration of GPU inference (potentially minutes)
- Heartbeat Failure: Blocked heartbeats cause the controller to deregister the worker, making the model unavailable to all users
- Cascading Failure: In a multi-model setup with multi_model_worker, ALL models served by that worker become unavailable
- No Authentication Required: Worker endpoints are not authenticated (only the OpenAI API server optionally uses API keys)
- Amplification: An attacker can chain multiple blocking requests to keep the worker perpetually unavailable
Reproduction Steps
Prerequisites
- A running FastChat deployment with model worker(s)
- Python 3 with aiohttp installed
Step 1: Static Validation (no running server needed)
```shell
cd /root/llm-project/FastChat-huntr
python3 llm-enhance/cve-finding/DoS/poc-blocking-event-loop.py --validation-only
```
Expected output: Confirms 3 unpatched vulnerable code locations.
Step 2: Start FastChat (for active testing)
```shell
# Terminal 1: Controller
python3 -m fastchat.serve.controller --host 0.0.0.0 --port 21001

# Terminal 2: Multi-Model Worker
python3 -m fastchat.serve.multi_model_worker \
    --model-path lmsys/vicuna-7b-v1.5 \
    --controller-address http://localhost:21001 \
    --worker-address http://localhost:21002 \
    --port 21002

# Terminal 3: OpenAI API Server
python3 -m fastchat.serve.openai_api_server --host 0.0.0.0 --port 8000 \
    --controller-address http://localhost:21001
```
Step 3: Exploit - Direct Worker DoS (Variant 1)
```shell
# Send a blocking generate request directly to the multi_model_worker
python3 llm-enhance/cve-finding/DoS/poc-blocking-event-loop.py \
    --target http://localhost:21002 --variant generate --model vicuna-7b-v1.5
```
Step 4: Exploit - Embeddings DoS (Variant 2)
```shell
# Send a blocking embeddings request to any model worker
python3 llm-enhance/cve-finding/DoS/poc-blocking-event-loop.py \
    --target http://localhost:21002 --variant embedding --model vicuna-7b-v1.5
```
Step 5: Exploit - Via OpenAI API (indirect)
```shell
# Non-streaming requests go through /worker_generate which blocks the worker
python3 llm-enhance/cve-finding/DoS/poc-blocking-event-loop.py \
    --target http://localhost:8000 --variant api --model vicuna-7b-v1.5
```
Step 6: Manual Verification with curl
```shell
# In terminal A: Send a blocking request (non-streaming mode hits /worker_generate)
curl -X POST http://localhost:21002/worker_generate \
    -H "Content-Type: application/json" \
    -d '{"model":"vicuna-7b-v1.5","prompt":"Write a very long essay about history","temperature":0.7,"max_new_tokens":2048}' &

# Immediately in terminal B: Try to get worker status (will be blocked!)
time curl -X POST http://localhost:21002/worker_get_status -H "Content-Type: application/json" -d '{}'
# Expected: The status check takes many seconds (blocked by generate_gate)
```
Root Cause Files
| File | Line | Issue |
|---|---|---|
| fastchat/serve/multi_model_worker.py | 112 | output = worker.generate_gate(params) — synchronous GPU inference in async handler |
| fastchat/serve/base_model_worker.py | 218 | embedding = worker.get_embeddings(params) — synchronous GPU inference in async handler |
| fastchat/serve/huggingface_api_worker.py | 236 | output = worker.generate_gate(params) — synchronous API call in async handler |
Suggested Fix
Apply the same asyncio.to_thread() pattern used in the fix for base_model_worker.py api_generate():
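As a self-contained illustration of why the pattern is correct (illustrative names; the try/finally guard is an addition beyond the original ff66426 patch, preventing a semaphore leak if the threaded call raises):

```python
import asyncio

semaphore = asyncio.Semaphore(1)

def generate_gate(params):
    # Stand-in for the worker's synchronous inference call.
    if params.get("boom"):
        raise RuntimeError("inference failed")
    return {"text": "ok"}

async def api_generate(params):
    await semaphore.acquire()
    try:
        # Off-loop execution keeps the event loop responsive.
        return await asyncio.to_thread(generate_gate, params)
    finally:
        semaphore.release()  # runs even if generate_gate raises

async def demo():
    try:
        await api_generate({"boom": True})
    except RuntimeError:
        pass
    # The permit was released despite the exception, so this call
    # succeeds instead of deadlocking on the exhausted semaphore.
    return await api_generate({})

result = asyncio.run(demo())
print(result)  # → {'text': 'ok'}
```

The per-file patches below apply the same asyncio.to_thread() relocation to the three unpatched handlers.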
```python
# Fix for multi_model_worker.py api_generate():
@app.post("/worker_generate")
async def api_generate(request: Request):
    params = await request.json()
    await acquire_worker_semaphore()
    worker = worker_map[params["model"]]
    output = await asyncio.to_thread(worker.generate_gate, params)  # ← Fix
    release_worker_semaphore()
    return JSONResponse(output)

# Fix for base_model_worker.py api_get_embeddings():
@app.post("/worker_get_embeddings")
async def api_get_embeddings(request: Request):
    params = await request.json()
    await acquire_worker_semaphore()
    embedding = await asyncio.to_thread(worker.get_embeddings, params)  # ← Fix
    release_worker_semaphore()
    return JSONResponse(content=embedding)

# Fix for huggingface_api_worker.py api_generate():
@app.post("/worker_generate")
async def api_generate(request: Request):
    params = await request.json()
    worker = worker_map[params["model"]]
    await acquire_worker_semaphore(worker)
    output = await asyncio.to_thread(worker.generate_gate, params)  # ← Fix
    release_worker_semaphore(worker)
    return JSONResponse(output)
```
CVSS Assessment
- Attack Vector: Network (AV:N)
- Attack Complexity: Low (AC:L)
- Privileges Required: None (PR:N) — worker endpoints are unauthenticated
- User Interaction: None (UI:N)
- Scope: Unchanged (S:U)
- Confidentiality: None (C:N)
- Integrity: None (I:N)
- Availability: High (A:H)
CVSS 3.1 Score: 7.5 (High) — CVSS:3.1/AV:N/AC:L/PR:N/UI:N/S:U/C:N/I:N/A:H
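The 7.5 figure can be checked against the CVSS 3.1 base-score formula; the metric weights below are taken from the specification for the stated vector:

```python
import math

def roundup(x):
    # CVSS 3.1 "Roundup": smallest number with one decimal place >= x
    return math.ceil(x * 10) / 10

# Metric weights per the CVSS 3.1 specification for this vector
AV, AC, PR, UI = 0.85, 0.77, 0.85, 0.85   # AV:N / AC:L / PR:N / UI:N
C, I, A = 0.0, 0.0, 0.56                  # C:N / I:N / A:H

iss = 1 - (1 - C) * (1 - I) * (1 - A)     # Impact Sub-Score
impact = 6.42 * iss                       # Scope Unchanged coefficient
exploitability = 8.22 * AV * AC * PR * UI
base = 0.0 if impact <= 0 else roundup(min(impact + exploitability, 10))
print(base)  # → 7.5
```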
Classification
- CWE-400: Uncontrolled Resource Consumption
- CWE-834: Excessive Iteration (event loop starvation)
- Vulnerability Type: Denial of Service (DoS) — Event Loop Blocking
- Severity: HIGH
- Source Patch: ff664260a5c99d29b57de6489bb0fee1f04b11ca ("Fixed model_worker generate_gate may blocked main thread")