
[rollout] perf: replace AsyncOpenAI with aiohttp client in ChatCompletionScheduler #1588

Merged
vermouth1992 merged 1 commit into main from wuxibin/async_vllm_perf on May 20, 2025
Conversation

@wuxibin89 (Collaborator)

Checklist Before Starting

  • Search for similar PR(s).

What does this PR do?

AsyncOpenAI has a severe performance issue due to httpx, so replace it with an aiohttp client. For train_batch_size=1024, AsyncOpenAI adds ~25s per generation phase.
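The replacement can be sketched roughly as below. This is a hedged illustration, not the PR's actual code: `build_chat_request`, the address format, and the `token-abc123` api key are assumptions based on the snippet quoted later in the thread.

```python
# Sketch: calling an OpenAI-compatible /v1/chat/completions endpoint
# with aiohttp instead of the AsyncOpenAI client.

def build_chat_request(address: str, api_key: str, **chat_complete_request):
    """Build the URL, headers, and JSON payload for the request."""
    url = f"http://{address}/v1/chat/completions"
    headers = {"Authorization": f"Bearer {api_key}"}
    return url, headers, chat_complete_request


async def chat_completions_aiohttp(address: str, **chat_complete_request):
    import aiohttp  # imported lazily so the sketch has no hard dependency

    url, headers, payload = build_chat_request(address, "token-abc123", **chat_complete_request)
    async with aiohttp.ClientSession() as session:
        async with session.post(url, headers=headers, json=payload) as resp:
            return await resp.json()
```

Compared to AsyncOpenAI, this skips httpx's per-request overhead, which is where the reported ~25s per generation phase was going.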

High-Level Design

Demonstrate the high-level design if this PR is complex.

Specific Changes

List the specific changes.

API

Demonstrate how the API changes if any.

Usage Example

Provide usage example(s) for easier usage.

# Add code snippet or script demonstrating how to use this 

Test

For changes that cannot be tested by CI (e.g., algorithm implementation, new model support), validate by experiment(s) and show results like training curve plots, evaluation results, etc.

Additional Info.

  • Issue Number: Fixes issue # or discussion # if any.
  • Training: [Note which backend this PR will affect: FSDP, Megatron, both, or none]
  • Inference: [Note which backend this PR will affect: vLLM, SGLang, both, or none]

Checklist Before Submitting

  • Read the Contribute Guide.
  • Apply pre-commit checks.
  • Add [BREAKING] to the PR title if it breaks any API.
  • Update the documentation about your changes in the docs.
  • Add CI test(s) if necessary.

@hongpeng-guo (Collaborator) left a comment


LGTM, just with one small nit question.

client = AsyncOpenAI(base_url=f"http://{address}/v1", api_key="token-abc123", timeout=None, max_retries=0)
@hongpeng-guo (Collaborator) May 20, 2025


It seems this line is the same as before, only the formatting changed. Just want to double-check that the current form passes lint with the pre-commit hook :)

@wuxibin89 (Collaborator, Author)


Yes, it was auto-formatted by the pre-commit hook.

@vermouth1992 vermouth1992 merged commit 3eaaf24 into main May 20, 2025
40 of 43 checks passed
@vermouth1992 vermouth1992 deleted the wuxibin/async_vllm_perf branch May 20, 2025 03:31
@casper-hansen (Contributor)

@wuxibin89 I found that this PR reintroduced the problem fixed in #1483, because we switched from httpx to aiohttp, which has a default timeout of 5 minutes. Would you mind having a look at this error and fixing it? CC @U-rara.

Traceback (most recent call last):
  File "/opt/conda/envs/py_3.11/lib/python3.11/site-packages/verl/workers/rollout/async_server.py", line 188, in submit_chat_completions
    completions = await self._chat_completions_aiohttp(address, **chat_complete_request)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/envs/py_3.11/lib/python3.11/site-packages/verl/workers/rollout/async_server.py", line 203, in _chat_completions_aiohttp
    async with session.post(
  File "/opt/conda/envs/py_3.11/lib/python3.11/site-packages/aiohttp/client.py", line 1425, in __aenter__
    self._resp: _RetType = await self._coro
                           ^^^^^^^^^^^^^^^^
  File "/opt/conda/envs/py_3.11/lib/python3.11/site-packages/aiohttp/client.py", line 730, in _request
    await resp.start(conn)
  File "/opt/conda/envs/py_3.11/lib/python3.11/site-packages/aiohttp/client_reqrep.py", line 1054, in start
    with self._timer:
  File "/opt/conda/envs/py_3.11/lib/python3.11/site-packages/aiohttp/helpers.py", line 685, in __exit__
    raise asyncio.TimeoutError from exc_val
TimeoutError
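The traceback comes from aiohttp's session-level default timeout of 5 minutes total per request, which long generation phases can exceed. A minimal, hedged sketch of how the deadline could be disabled; `post_without_deadline` is an illustrative helper, not necessarily the fix that was ultimately merged.

```python
import aiohttp

# aiohttp's ClientSession applies a default timeout of 5 minutes total
# per request, so any generation that runs longer raises
# asyncio.TimeoutError. ClientTimeout(total=None) disables the overall
# deadline, matching the old AsyncOpenAI setting of timeout=None.
no_deadline = aiohttp.ClientTimeout(total=None)


async def post_without_deadline(url: str, payload: dict) -> dict:
    # The timeout is set once on the session and applies to every
    # request made with it.
    async with aiohttp.ClientSession(timeout=no_deadline) as session:
        async with session.post(url, json=payload) as resp:
            return await resp.json()
```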

@casper-hansen casper-hansen mentioned this pull request May 26, 2025
@casper-hansen (Contributor)

I ended up creating PR #1702 @U-rara @wuxibin89. Please take a look.

chenjiaoAngel added a commit to chenjiaoAngel/verl that referenced this pull request Nov 14, 2025
…onScheduler (verl-project#1588)

TimurTaepov pushed a commit to giorgossideris/verl that referenced this pull request Dec 20, 2025
…onScheduler (verl-project#1588)

vyomakesh0728 added a commit to vyomakesh0728/verl that referenced this pull request Jan 22, 2026
…onScheduler (verl-project#1588)


Labels: none yet. Projects: none yet.

4 participants