[Security] Denial of Service via Blocking Event Loop in Model Workers (Incomplete Fix for ff66426) #3793

@YLChen-007

Description

Exploitability Summary

| Aspect | Status |
| --- | --- |
| External Attack Path | ✅ Verified - HTTP POST to /worker_generate and /worker_get_embeddings |
| Runtime Protections Bypassed | ✅ Yes - no async wrapping, no request timeout on the event loop |
| Requires Other Vulnerabilities | ✅ None - direct exploitation via the standard worker API |
| Real-World Exploitability | ✅ Confirmed via live exploit (9,008x amplification) - see Exploit-RealWorld-EventLoop-Blocking-DoS.md |

Vulnerability Overview

CWE-400: Uncontrolled Resource Consumption / CWE-834: Excessive Iteration

Three unpatched instances of the same blocking event loop vulnerability exist in FastChat's model worker implementations. These are direct variants of the bug fixed in commit ff66426.

Background: The Original Fix

Commit ff66426 fixed a blocking event loop issue in base_model_worker.py where worker.generate_gate(params) was called synchronously in an async FastAPI handler. The fix wrapped the call with asyncio.to_thread():

# BEFORE (VULNERABLE - ff66426 fix target):
@app.post("/worker_generate")
async def api_generate(request: Request):
    params = await request.json()
    await acquire_worker_semaphore()
    output = worker.generate_gate(params)  # ← BLOCKS event loop!
    release_worker_semaphore()
    return JSONResponse(output)

# AFTER (FIXED):
@app.post("/worker_generate")
async def api_generate(request: Request):
    params = await request.json()
    await acquire_worker_semaphore()
    output = await asyncio.to_thread(worker.generate_gate, params)  # ← Non-blocking!
    release_worker_semaphore()
    return JSONResponse(output)

Unpatched Variant 1: multi_model_worker.py api_generate() (HIGH)

File: fastchat/serve/multi_model_worker.py, line 112

@app.post("/worker_generate")
async def api_generate(request: Request):
    params = await request.json()
    await acquire_worker_semaphore()
    worker = worker_map[params["model"]]
    output = worker.generate_gate(params)  # ← BLOCKS event loop! (not wrapped in asyncio.to_thread)
    release_worker_semaphore()
    return JSONResponse(output)

worker.generate_gate() calls generate_stream_gate(), which runs GPU inference via PyTorch. This can take seconds to minutes depending on the model and input size. During this time, the FastAPI async event loop is completely blocked, meaning:

  • No other HTTP requests are processed
  • Heartbeat messages stop being sent
  • The controller may deregister the worker
  • Other users' requests are stalled
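
The effect is easy to reproduce in isolation. The following minimal sketch uses plain asyncio, with time.sleep standing in for synchronous GPU inference (no FastChat code involved): a synchronous call inside one coroutine stalls every other coroutine on the same event loop.

```python
import asyncio
import time

async def blocking_generate():
    # Stands in for worker.generate_gate(): synchronous work that never
    # yields control back to the event loop while it runs
    time.sleep(0.5)
    return "generate done"

async def status_check():
    # Stands in for /worker_get_status or a heartbeat coroutine
    return "status ok"

async def main():
    start = time.monotonic()
    # status_check is trivially fast, yet it cannot run until the
    # synchronous time.sleep inside blocking_generate returns
    await asyncio.gather(blocking_generate(), status_check())
    return time.monotonic() - start

elapsed = asyncio.run(main())
print(f"status check was stalled for {elapsed:.2f}s")
```

With real inference taking 30-120 seconds instead of 0.5, every coroutine on the worker's loop (status checks, heartbeats, other requests) is stalled for that entire duration.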

Unpatched Variant 2: base_model_worker.py api_get_embeddings() (HIGH)

File: fastchat/serve/base_model_worker.py, line 218

@app.post("/worker_get_embeddings")
async def api_get_embeddings(request: Request):
    params = await request.json()
    await acquire_worker_semaphore()
    embedding = worker.get_embeddings(params)  # ← BLOCKS event loop!
    release_worker_semaphore()
    return JSONResponse(content=embedding)

worker.get_embeddings() runs @torch.inference_mode() decorated GPU inference with tokenization, model forward pass, and embedding normalization. This is a synchronous blocking call in an async handler.

Note: The fix for generate_gate in base_model_worker.py line 209 was applied correctly via asyncio.to_thread(), but the developer missed applying the same fix to get_embeddings in the same file.

Unpatched Variant 3: huggingface_api_worker.py api_generate() (MEDIUM)

File: fastchat/serve/huggingface_api_worker.py, line 236

@app.post("/worker_generate")
async def api_generate(request: Request):
    params = await request.json()
    worker = worker_map[params["model"]]
    await acquire_worker_semaphore(worker)
    output = worker.generate_gate(params)  # ← BLOCKS event loop!
    release_worker_semaphore(worker)
    return JSONResponse(output)

HuggingfaceApiWorker.generate_gate() calls generate_stream_gate() which makes HTTP calls to the HuggingFace Inference API. While these are I/O-bound rather than GPU-bound, they are still synchronous blocking calls in an async handler. The InferenceClient.text_generation() call is a synchronous HTTP request.
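
A minimal sketch (time.sleep standing in for the synchronous HTTP round-trip to the HuggingFace Inference API; no FastChat code involved) shows why asyncio.to_thread is the right shape of fix here too - the loop stays responsive while the blocking call runs on a worker thread:

```python
import asyncio
import time

def sync_inference_call():
    # Stand-in for the synchronous InferenceClient.text_generation() HTTP call
    time.sleep(0.3)
    return "generation result"

heartbeat_latency = {}

async def heartbeat(start):
    # A short coroutine that completes promptly only if the loop is free
    await asyncio.sleep(0.05)
    heartbeat_latency["t"] = time.monotonic() - start

async def fixed_handler():
    # Offload the blocking call to a thread; the event loop keeps running
    return await asyncio.to_thread(sync_inference_call)

async def main():
    start = time.monotonic()
    output, _ = await asyncio.gather(fixed_handler(), heartbeat(start))
    return output

result = asyncio.run(main())
print(f"heartbeat completed after {heartbeat_latency['t']:.3f}s")
```

With the unpatched synchronous call, the heartbeat would not complete until the full 0.3s call returned; with to_thread it completes at roughly 0.05s, concurrently with the blocking work.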

Attack Chain

# Variant 1: multi_model_worker
[Attacker]
    → POST /v1/chat/completions (stream=false, max_tokens=4096)
    → [OpenAI API Server] calls /worker_generate on multi_model_worker
    → multi_model_worker.api_generate():
        → worker.generate_gate(params)  # GPU inference, blocks event loop
        → Duration: 30-120 seconds for large outputs
    → During this time: ALL other requests to this worker STALLED
        → Heartbeat fails → Controller deregisters worker
        → Other users' requests timeout

# Variant 2: base_model_worker (embeddings)
[Attacker]
    → POST /v1/embeddings (large batch of text inputs)
    → [OpenAI API Server] calls /worker_get_embeddings on worker
    → base_model_worker.api_get_embeddings():
        → worker.get_embeddings(params)  # GPU inference, blocks event loop
        → Duration: seconds to minutes for large batches
    → During this time: ALL other requests to this worker STALLED

Impact

  1. Complete Worker DoS: A single request can stall all other requests to the model worker for the duration of GPU inference (potentially minutes)
  2. Heartbeat Failure: Blocked heartbeats cause the controller to deregister the worker, making the model unavailable to all users
  3. Cascading Failure: In a multi-model setup with multi_model_worker, ALL models served by that worker become unavailable
  4. No Authentication Required: Worker endpoints are not authenticated (only the OpenAI API server optionally uses API keys)
  5. Amplification: An attacker can chain multiple blocking requests to keep the worker perpetually unavailable

Reproduction Steps

Prerequisites

  1. A running FastChat deployment with model worker(s)
  2. Python 3 with aiohttp installed

Step 1: Static Validation (no running server needed)

cd /root/llm-project/FastChat-huntr
python3 llm-enhance/cve-finding/DoS/poc-blocking-event-loop.py --validation-only

Expected output: Confirms 3 unpatched vulnerable code locations.

Step 2: Start FastChat (for active testing)

# Terminal 1: Controller
python3 -m fastchat.serve.controller --host 0.0.0.0 --port 21001

# Terminal 2: Multi-Model Worker
python3 -m fastchat.serve.multi_model_worker \
    --model-path lmsys/vicuna-7b-v1.5 \
    --controller-address http://localhost:21001 \
    --worker-address http://localhost:21002 \
    --port 21002

# Terminal 3: OpenAI API Server
python3 -m fastchat.serve.openai_api_server --host 0.0.0.0 --port 8000 \
    --controller-address http://localhost:21001

Step 3: Exploit - Direct Worker DoS (Variant 1)

# Send a blocking generate request directly to the multi_model_worker
python3 llm-enhance/cve-finding/DoS/poc-blocking-event-loop.py \
    --target http://localhost:21002 --variant generate --model vicuna-7b-v1.5

Step 4: Exploit - Embeddings DoS (Variant 2)

# Send a blocking embeddings request to any model worker
python3 llm-enhance/cve-finding/DoS/poc-blocking-event-loop.py \
    --target http://localhost:21002 --variant embedding --model vicuna-7b-v1.5

Step 5: Exploit - Via OpenAI API (indirect)

# Non-streaming requests go through /worker_generate which blocks the worker
python3 llm-enhance/cve-finding/DoS/poc-blocking-event-loop.py \
    --target http://localhost:8000 --variant api --model vicuna-7b-v1.5

Step 6: Manual Verification with curl

# In terminal A: Send a blocking request (non-streaming mode hits /worker_generate)
curl -X POST http://localhost:21002/worker_generate \
  -H "Content-Type: application/json" \
  -d '{"model":"vicuna-7b-v1.5","prompt":"Write a very long essay about history","temperature":0.7,"max_new_tokens":2048}' &

# Immediately in terminal B: Try to get worker status (will be blocked!)
time curl -X POST http://localhost:21002/worker_get_status -H "Content-Type: application/json" -d '{}'

# Expected: The status check takes many seconds (blocked by generate_gate)

Root Cause Files

| File | Line | Issue |
| --- | --- | --- |
| fastchat/serve/multi_model_worker.py | 112 | output = worker.generate_gate(params) - synchronous GPU inference in an async handler |
| fastchat/serve/base_model_worker.py | 218 | embedding = worker.get_embeddings(params) - synchronous GPU inference in an async handler |
| fastchat/serve/huggingface_api_worker.py | 236 | output = worker.generate_gate(params) - synchronous API call in an async handler |

Suggested Fix

Apply the same asyncio.to_thread() pattern used in the fix for base_model_worker.py api_generate():

# Fix for multi_model_worker.py api_generate():
@app.post("/worker_generate")
async def api_generate(request: Request):
    params = await request.json()
    await acquire_worker_semaphore()
    worker = worker_map[params["model"]]
    output = await asyncio.to_thread(worker.generate_gate, params)  # ← Fix
    release_worker_semaphore()
    return JSONResponse(output)

# Fix for base_model_worker.py api_get_embeddings():
@app.post("/worker_get_embeddings")
async def api_get_embeddings(request: Request):
    params = await request.json()
    await acquire_worker_semaphore()
    embedding = await asyncio.to_thread(worker.get_embeddings, params)  # ← Fix
    release_worker_semaphore()
    return JSONResponse(content=embedding)

# Fix for huggingface_api_worker.py api_generate():
@app.post("/worker_generate")
async def api_generate(request: Request):
    params = await request.json()
    worker = worker_map[params["model"]]
    await acquire_worker_semaphore(worker)
    output = await asyncio.to_thread(worker.generate_gate, params)  # ← Fix
    release_worker_semaphore(worker)
    return JSONResponse(output)
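
A further hardening worth considering alongside the fix (hypothetical sketch, not part of the upstream patch): release the semaphore in a finally block, so an exception raised inside generate_gate cannot leak the concurrency slot and leave the worker wedged.

```python
import asyncio

def generate_gate(params):
    # Stand-in for the worker's synchronous inference call
    return {"text": "echo: " + params["prompt"], "error_code": 0}

async def api_generate(semaphore, params):
    await semaphore.acquire()
    try:
        # Blocking work runs on a thread; the slot is released even if
        # generate_gate raises, so one failed request cannot wedge the worker
        return await asyncio.to_thread(generate_gate, params)
    finally:
        semaphore.release()

async def main():
    # One in-flight request, mirroring the worker's concurrency limit
    semaphore = asyncio.Semaphore(1)
    return await api_generate(semaphore, {"prompt": "hello"})

output = asyncio.run(main())
```

The current acquire/release pattern already works on the happy path; the try/finally only changes behavior when generate_gate raises.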

CVSS Assessment

  • Attack Vector: Network (AV:N)
  • Attack Complexity: Low (AC:L)
  • Privileges Required: None (PR:N) — worker endpoints are unauthenticated
  • User Interaction: None (UI:N)
  • Scope: Unchanged (S:U)
  • Confidentiality: None (C:N)
  • Integrity: None (I:N)
  • Availability: High (A:H)

CVSS 3.1 Score: 7.5 (High)
Vector: CVSS:3.1/AV:N/AC:L/PR:N/UI:N/S:U/C:N/I:N/A:H

Classification

  • CWE-400: Uncontrolled Resource Consumption
  • CWE-834: Excessive Iteration (event loop starvation)
  • Vulnerability Type: Denial of Service (DoS) — Event Loop Blocking
  • Severity: HIGH
  • Source Patch: ff664260a5c99d29b57de6489bb0fee1f04b11ca (Fixed model_worker generate_gate may blocked main thread)
