This repository was archived by the owner on Jun 5, 2025. It is now read-only.

Respond with JSON if the request is non-stream #149

Merged
aponcedeleonch merged 1 commit into main from normalize-vllm-output on Dec 2, 2024

@aponcedeleonch aponcedeleonch commented Dec 2, 2024

We are currently not handling non-streaming requests, e.g.

% curl -SsX POST "http://localhost:8989/vllm/chat/completions" \
     -H "Content-Type: application/json" \
     -H "Authorization: Bearer $token" \
     -d '{
           "model": "Qwen/Qwen2.5-Coder-14B-Instruct",
           "messages": [{"role": "user", "content": "hello."}],
           "stream": false
        }'

This PR makes the server respond with the entire JSON body when the request is non-streaming.

Response before this PR:

data: 'async for' requires an object with __aiter__ method, got ModelResponse

data: [DONE]

Response after this PR:

{
  "id": "chatcmpl-AZyYAAOFMsHdx2rabESYcHjsObIB2",
  "created": 1733137598,
  "model": "gpt-4o-mini-2024-07-18",
  "object": "chat.completion",
  "system_fingerprint": "fp_0705bf87c0",
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "message": {
        "content": "Hello! How can I assist you today?",
        "role": "assistant",
        "tool_calls": null,
        "function_call": null
      }
    }
  ],
  "usage": {
    "completion_tokens": 9,
    "prompt_tokens": 9,
    "total_tokens": 18,
    "completion_tokens_details": {
      "accepted_prediction_tokens": 0,
      "audio_tokens": 0,
      "reasoning_tokens": 0,
      "rejected_prediction_tokens": 0
    },
    "prompt_tokens_details": {
      "audio_tokens": 0,
      "cached_tokens": 0
    }
  },
  "service_tier": null
}
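
The "before" error suggests that the handler ran `async for` over a single, non-streaming `ModelResponse`. Below is a minimal sketch of a dispatch that avoids this. It is not taken from this PR's diff: the names `to_sse` and `render_response` are hypothetical, and plain dicts stand in for the real response objects.

```python
import asyncio
import json
from collections.abc import AsyncIterator
from typing import Any, Union


async def to_sse(chunks: AsyncIterator[dict]) -> AsyncIterator[str]:
    """Format each streamed chunk as a server-sent event, then signal the end."""
    async for chunk in chunks:
        yield f"data: {json.dumps(chunk)}\n\n"
    yield "data: [DONE]\n\n"


def render_response(result: Any) -> Union[str, AsyncIterator[str]]:
    """Return an SSE iterator for streams, or one JSON document otherwise.

    Hypothetical helper: `result` is either an async iterator of chunks
    (stream=true) or a single completed response (stream=false).
    """
    if isinstance(result, AsyncIterator):
        return to_sse(result)
    # Non-streaming: serialize the whole response once, with no SSE framing.
    return json.dumps(result)


async def fake_stream() -> AsyncIterator[dict]:
    """Stand-in for a streaming completion (assumption for illustration)."""
    yield {"delta": "Hello"}
    yield {"delta": "!"}


async def main() -> None:
    # Streaming path: iterate over SSE-formatted events.
    async for event in render_response(fake_stream()):
        print(event, end="")
    # Non-streaming path: a single JSON string.
    print(render_response({"object": "chat.completion"}))


if __name__ == "__main__":
    asyncio.run(main())
```

Checking for an async iterator, rather than trusting the request's `stream` flag alone, also covers the case where the backend returns a complete response regardless of what was requested.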

@aponcedeleonch aponcedeleonch changed the title from "Respond with JSON if the request is non-async" to "Respond with JSON if the request is non-stream" on Dec 2, 2024
@aponcedeleonch aponcedeleonch merged commit d4f1ab8 into main Dec 2, 2024