
[data.llm] Allow vLLM engine to be shared by sequential processors with serve deployments #55179

Merged
kouroshHakha merged 11 commits into ray-project:master from jeffreyjeffreywang:llm-multi-turn-2 on Aug 28, 2025
Conversation

@jeffreyjeffreywang (Contributor) commented Aug 3, 2025

Why are these changes needed?

Sequential batch inference with the Ray Data LLM API requires creating a separate processor for each step, and each processor creates its own vLLM engine when its UDF is dispatched to actors. This leads to inefficient resource usage, since each engine needs dedicated resources. Enabling engine sharing across sequential processors reduces these resource requirements. Refer to #52277 for more details on the motivation.

In this PR, we allow different processors to share the same ray.serve deployment instances with the new, generic ServeDeploymentProcessorConfig. The deployment's actors are kept alive until the Ray process shuts down. This addresses the core issue where each processor step was creating its own dedicated actor pool.

  • User experience: To share LLMs across processors, users create Ray LLM serve deployments and reference the app_name and deployment_name when creating ServeDeploymentProcessorConfig.
  • Data flow: Once data batches reach ServeDeploymentStage, Ray Data workers redirect them to the underlying serve deployment. The results are retrieved asynchronously from the serve deployment and sent to downstream stages within the processor.
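The fan-out/gather pattern described above can be sketched without Ray at all: a stub handle stands in for the shared Serve deployment, and each row in a batch is redirected to it and gathered asynchronously. All names here (FakeDeploymentHandle, forward_batch) are illustrative, not the PR's actual API.

```python
import asyncio
from typing import Any, Dict, List


class FakeDeploymentHandle:
    """Stands in for a shared Serve deployment handle; echoes one completion per row."""

    async def remote(self, row: Dict[str, Any]) -> Dict[str, Any]:
        await asyncio.sleep(0)  # simulate the asynchronous round-trip
        return {**row, "generated_text": f"completion for {row['prompt']}"}


async def forward_batch(
    handle: FakeDeploymentHandle, batch: List[Dict[str, Any]]
) -> List[Dict[str, Any]]:
    # Redirect every row in the batch to the shared deployment and gather
    # the responses asynchronously before passing them downstream.
    return await asyncio.gather(*(handle.remote(row) for row in batch))


results = asyncio.run(
    forward_batch(FakeDeploymentHandle(), [{"prompt": "hi"}, {"prompt": "bye"}])
)
print([r["generated_text"] for r in results])
```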

Benchmarks

Concurrency vs. Throughput

The following benchmark was performed on A10G GPUs (g5.48xlarge instance) with batch_size=64, drawing 10,000 samples from https://huggingface.co/datasets/Crystalcareai/Code-feedback-sharegpt-renamed. GPU utilization is higher and more stable with vLLMEngineStage, while ServeDeploymentStage exhibits increasing fluctuations as concurrency grows.

Stage type            Concurrency  Throughput (samples/sec)
ServeDeploymentStage  8            145.62
                      6            112.44
                      4            81.17
                      2            40.23
                      1            29.36
vLLMEngineStage       8            105.69
                      6            93.78
                      4            78.27
                      2            50.14
                      1            29.02
(Figure: concurrency vs. throughput comparison)

Multi-turn Usage

In this benchmark, we evaluate multi-turn conversation scenarios where the output of the first processor serves as the input to the second. The benchmark was conducted on an A10G GPU with batch_size=64, concurrency=1, and 10,000 samples drawn from https://huggingface.co/datasets/Crystalcareai/Code-feedback-sharegpt-renamed.

With vLLMEngineStage (processor built by vLLMEngineProcessorConfig), two GPUs are required since GPUs cannot be shared across processors. In contrast, ServeDeploymentStage (processor built by ServeDeploymentProcessorConfig) only needs a single GPU, as multiple stages can share the same underlying deployment and therefore the same GPU.

In the table below, throughput refers to the number of samples processed per second, while normalized throughput is the throughput divided by the number of GPUs used.

Stage type                                    Throughput  Normalized throughput
vLLMEngineStage -> vLLMEngineStage            28.64       14.32
ServeDeploymentStage -> ServeDeploymentStage  22.46       22.46
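The normalized-throughput column follows directly from the GPU counts above (two GPUs for the vLLMEngineStage pipeline, one shared GPU for the ServeDeploymentStage pipeline):

```python
# Normalized throughput = throughput / number of GPUs the pipeline uses.
vllm_normalized = 28.64 / 2    # two vLLMEngineStage processors, one GPU each
serve_normalized = 22.46 / 1   # both ServeDeploymentStage processors share one GPU
print(vllm_normalized, serve_normalized)
```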

Benchmark script: https://github.com/jeffreyjeffreywang/ray/blob/benchmark-serve-deployment-stage/python/ray/llm/examples/benchmark.py.

Related issue number

Resolves #52277

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@jeffreyjeffreywang jeffreyjeffreywang requested review from a team as code owners August 3, 2025 06:48
@gemini-code-assist (bot) left a comment:

Summary of Changes

Hello @jeffreyjeffreywang, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

I've implemented a significant enhancement to the Ray Data LLM API by enabling the sharing of vLLM engine instances across sequential processors. Previously, each step in a multi-stage LLM pipeline would create its own dedicated vLLM engine, leading to substantial resource overhead. My changes introduce a new configuration that leverages Ray Serve to deploy a single, shared vLLM engine, allowing multiple processing stages to utilize the same engine. This dramatically improves resource efficiency and is particularly beneficial for complex, multi-turn conversational AI applications.

Highlights

  • Shared vLLM Engine Configuration: I've introduced a new configuration class, vLLMSharedEngineProcessorConfig, which allows multiple sequential processors to share a single vLLM engine instance. This is crucial for optimizing resource utilization, especially in multi-turn conversational AI workflows where each step previously instantiated its own engine.
  • Ray Serve Integration for Engine Sharing: To enable this sharing, I've integrated Ray Serve deployments. The vLLMSharedEngineProcessorConfig now wraps an LLMConfig to define the vLLM engine and its deployment settings, ensuring that the engine is deployed as a persistent Ray Serve application.
  • Centralized Shared Engine Registry: I've implemented a _SharedEngineRegistry to manage and track these shared vLLM engine deployments. This registry ensures that if the same vLLMSharedEngineProcessorConfig instance is used across different processors, they correctly connect to and reuse the same underlying Serve deployment, preventing redundant engine instantiations.
  • Dedicated Shared Engine Processing Stage: A new processing stage, vLLMSharedEngineStage, has been added. This stage is specifically designed to interact with the shared vLLM engine via its Ray Serve handle, abstracting away the complexities of managing the shared deployment for the data processing pipeline.
  • Dynamic Processor Building Logic: I've updated the build_llm_processor function to intelligently detect whether a shared engine configuration is being used and to route the processing through the appropriate vLLM engine stage (either the traditional vLLMEngineStage or the new vLLMSharedEngineStage).
  • New Unit Tests for Shared Engine: Comprehensive unit tests have been added to validate the shared engine functionality, including scenarios for multi-turn conversations and verifying that only a single Serve deployment is instantiated for a given shared configuration instance.

@gemini-code-assist (bot) left a comment:

Code Review

This pull request introduces the capability to share a vLLM engine across multiple sequential processors using Ray Serve, which is a great feature for optimizing resource usage in multi-turn inference scenarios. The implementation introduces a new vLLMSharedEngineProcessorConfig and a registry to manage shared engine deployments. The code is well-structured and includes good tests that verify the new functionality.

My review focuses on a few key areas for improvement:

  1. Performance: The current implementation for the shared engine stage sends requests individually instead of in batches, which could be a significant performance bottleneck.
  2. Correctness: There's a minor issue with how a metric is calculated and a potential typo in how deployment names are generated.
  3. Clarity: The documentation for vLLMSharedEngineProcessorConfig could be improved to make a crucial usage detail more explicit to users.

Overall, this is a solid contribution that addresses an important use case. Addressing the feedback will make it more robust and performant.

Comment on lines +148 to +171
    async def udf(self, batch: List[Dict[str, Any]]) -> AsyncIterator[Dict[str, Any]]:
        """Run the shared vLLM engine through serve deployment.

        Args:
            batch: A list of rows to run the vLLM engine on.

        Returns:
            The response of the vLLM engine.
        """
        requests = [self._prepare_llm_request(row) for row in batch]

        batch_uuid = uuid.uuid4()
        t = time.perf_counter()

        tasks = []
        for i, req in enumerate(requests):

            async def process_with_index(request, idx):
                result = await self._process_request(request)
                return idx, request, result

            task = asyncio.create_task(process_with_index(req, i))
            tasks.append(task)

gemini-code-assist (bot), severity: high

The current implementation of udf processes a batch of requests by sending them to the Serve deployment one by one. This is inefficient and undermines the benefits of batch processing in Ray Data. The vLLM Serve deployment supports batching for both completion and embedding requests via the OpenAI-compatible API. To improve performance, this method should be refactored to send a single batched request to the deployment instead of iterating and sending individual requests.

For example, for GENERATE tasks, you could collect all prompts into a list and send a single CompletionRequest.

This change is critical for achieving good performance with the shared engine.
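The difference the reviewer is pointing at can be sketched with a stub deployment that counts round-trips: the per-row path makes N calls while the batched path makes one. The stub and its method names are illustrative, not the actual Serve API.

```python
import asyncio
from typing import List


class StubDeployment:
    """Stand-in for the Serve deployment; the call counter shows the difference."""

    def __init__(self) -> None:
        self.calls = 0

    async def complete(self, prompts: List[str]) -> List[str]:
        self.calls += 1  # one network round-trip per call
        return [f"out:{p}" for p in prompts]


async def per_row(dep: StubDeployment, prompts: List[str]) -> List[str]:
    # Current approach: one request per row (N round-trips).
    results = await asyncio.gather(*(dep.complete([p]) for p in prompts))
    return [r[0] for r in results]


async def batched(dep: StubDeployment, prompts: List[str]) -> List[str]:
    # Suggested approach: a single batched request (1 round-trip).
    return await dep.complete(prompts)


prompts = ["a", "b", "c"]
d1, d2 = StubDeployment(), StubDeployment()
r1 = asyncio.run(per_row(d1, prompts))
r2 = asyncio.run(batched(d2, prompts))
print(d1.calls, d2.calls)
```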

Comment on lines +256 to +258
Use vLLMSharedEngineProcessorConfig only when you want multiple processors to share
the same vLLM engine (e.g. save compute resources for multi-turn conversations). For most
use cases where engine sharing is not needed, use vLLMEngineProcessorConfig instead.
gemini-code-assist (bot), severity: medium

To share a vLLM engine across multiple processors, users must pass the exact same instance of the vLLMSharedEngineProcessorConfig object when building them. Creating two separate config objects, even with identical parameters, will result in two separate engine deployments. This is a crucial detail that can be easily missed and should be explicitly mentioned in the docstring to avoid confusion and unexpected resource usage.

I suggest adding a note to clarify this behavior.
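The instance-identity behavior described here can be illustrated with a simplified stand-in for the PR's _SharedEngineRegistry, keyed by id() rather than by value; the names and structure are illustrative:

```python
import uuid


class SharedEngineRegistry:
    """Maps a config *instance* to its deployment name (simplified sketch)."""

    def __init__(self) -> None:
        # Keep the configs alive alongside their names so id() keys stay valid.
        self._deployments = {}
        self._configs = []

    def get_or_create(self, config: object) -> str:
        # Keyed by identity, not by value: equal-but-distinct configs
        # get distinct deployments.
        key = id(config)
        if key not in self._deployments:
            self._configs.append(config)
            self._deployments[key] = f"shared_llm_engine_{uuid.uuid4().hex}"
        return self._deployments[key]


registry = SharedEngineRegistry()
cfg_a = {"model": "m"}
cfg_b = {"model": "m"}  # identical parameters, but a different instance
print(registry.get_or_create(cfg_a) == registry.get_or_create(cfg_a))  # same instance: shared
print(registry.get_or_create(cfg_a) == registry.get_or_create(cfg_b))  # equal values: separate
```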

    shared_processor_config: vLLMSharedEngineProcessorConfig,
) -> str:
    """Create a Ray Serve deployment for the shared engine configuration."""
    deployment_name = f"shared_llm_engine_{uuid.uuid4().hex}:"
gemini-code-assist (bot), severity: medium

The deployment name is constructed with a trailing colon: f"shared_llm_engine_{uuid.uuid4().hex}:". While colons are sometimes used in Serve application names for versioning (e.g., my_app:v1), a trailing colon here seems unintentional and may result in an oddly named application in Ray Serve. This is likely a typo and should be removed.

Suggested change:
-    deployment_name = f"shared_llm_engine_{uuid.uuid4().hex}:"
+    deployment_name = f"shared_llm_engine_{uuid.uuid4().hex}"

@ray-gardener (bot) added labels: community-contribution, serve, performance, data, llm (Aug 3, 2025)
@jeffreyjeffreywang (Contributor, Author):

As discussed in #55179, my first attempt to enable ActorPool sharing across ActorPoolMapOperators violated a Ray Data assumption: each stage's worker pool is independent. After discussing with Kourosh, we think that leveraging Ray Serve for actor pool management to enable engine sharing across processors is the most promising approach.

Here are some open questions:

  • Should Ray LLM explicitly manage the lifecycle of deployments? To keep things simple, the initial version relies on Ray's automatic cleanup when the program shuts down (e.g. ray.shutdown()). @kouroshHakha, what risks do you see with relying on automatic cleanup?
  • I'm uncertain about the conversions between OpenAI-native request/responses (e.g. CompletionRequest, CompletionResponse) and Ray LLM's input/output format (e.g. vLLMEngineRequest, vLLMOutputData). There is a mismatch between the output formats of vllm.AsyncLLMEngine.generate and LLMDeployment.completions. The former returns rich metadata (e.g. token IDs, token counts), while the latter only provides generated text. vLLMOutputData is tailored to the vLLM engine output, but the current version uses it for both use cases, leaving missing fields when used with Serve deployments. We could create a new data format tailored to LLMDeployment.completions's limited output instead.
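The format mismatch above can be made concrete with a small sketch: a record with optional token-level fields is fully populated from engine-style output but only partially from a completions-style response. OutputData and its fields are illustrative stand-ins, not the actual vLLMOutputData definition.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class OutputData:
    """Sketch of a vLLMOutputData-like record with optional rich metadata."""

    generated_text: str
    generated_tokens: Optional[List[int]] = None   # available from engine output only
    num_generated_tokens: Optional[int] = None     # available from engine output only


# vllm.AsyncLLMEngine.generate-style output carries token-level metadata:
from_engine = OutputData("hello", generated_tokens=[1, 2], num_generated_tokens=2)

# LLMDeployment.completions-style output only carries text, leaving gaps:
from_completions = OutputData("hello")

print(from_engine.num_generated_tokens, from_completions.num_generated_tokens)
```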

@kouroshHakha Please take a look at this PR, and let me know if you have any concerns!

@richardliaw (Contributor):

(@kouroshHakha is currently OOO will get back later this week)

@kouroshHakha (Contributor) left a comment:

Thanks @jeffreyjeffreywang for taking a stab at this. I think the PoC overall is in the right direction. We should hash out the overall architecture and API in a design doc first. I started a draft here: https://docs.google.com/document/d/1QClDPT_iyUYIPg4ybNrnKxsw7igf15UeVtKgw0swzag/edit?tab=t.0 (it is publicly viewable).

The benefits of this design over the PoC:

  • It does not need the introduction of a shared registry and all the mechanics around it. You let the user manage the lifetime of the application and get access via name.
  • We introduce a new generic primitive abstracted over generic Serve concepts (deployment name, method, etc.) rather than vLLM vs. SGLang vs. other engine-specific stage implementations.
  • Preprocess/postprocess functions are generalized to work directly with the native data types rather than the conventions specific to the vLLMProcessor stage. As a result, you do not have to deal with the intricacies of converting data types back and forth between vLLMEngineInput/Output and OpenAI data objects.
  • You don't have to keep dealing with offline-LLM-stage-specific concepts like has_image, apply_chat_template, tokenize, etc., since those do not directly generalize to serve deployment conventions.

Let's see if we can get this design prototyped.
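Under this design, the generic primitive might look roughly like the following; every field name here is an assumption drawn from the discussion above, not the merged API:

```python
from dataclasses import dataclass


@dataclass
class GenericServeStageConfig:
    """Sketch: a stage addressed purely by Serve concepts (hypothetical names).

    No vLLM-specific fields (has_image, apply_chat_template, tokenize, ...):
    the stage only needs to know which deployment to call and which method.
    """

    app_name: str
    deployment_name: str
    method: str = "chat"  # any deployment method; not tied to one engine


cfg = GenericServeStageConfig(app_name="llm_app", deployment_name="LLMDeployment")
print(cfg.app_name, cfg.deployment_name, cfg.method)
```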

num_input_tokens=num_input_tokens,
generated_text=response.choices[0].text,
)
elif self.task_type == vLLMTaskType.EMBED:
Reviewer (Contributor):

Let's add this later and not support this for now?

Author reply:

With the latest design, it seems like this will be natively handled.

Author reply:

chat and completions API are supported in the latest revision. Will leave embedding for follow-ups.

@jeffreyjeffreywang (Author):

Will rebase onto master once we're aligned on the overall approach.

@kouroshHakha (Contributor) left a comment:

Leaving my first-pass review here. We chatted offline about the main points that might not be mentioned in this review feedback. I left a few nits in this round.

@jeffreyjeffreywang (Author):

To compare the performance of the new ServeDeploymentStage and the existing vLLMEngineStage, I ran a benchmark on an A10G (g5.8xlarge instance) with a batch size of 64, concurrency of 1, drawing 10,000 samples from https://huggingface.co/datasets/Crystalcareai/Code-feedback-sharegpt-renamed.

Throughput (samples/second)

  • ServeDeploymentStage: 31.37
  • vLLMEngineStage: 29.16

Based on the numbers, there isn't a significant performance difference between the two processors. GPU utilization remains consistently around 90% for both, though ServeDeploymentStage shows slightly more fluctuation compared to vLLMEngineStage.

cc: @kouroshHakha

@kouroshHakha (Contributor) left a comment:

Reviewed together. Leaving comments until I do a full-pass review.

…equests through serve deployments

Signed-off-by: jeffreyjeffreywang <jeffjeffreywang@gmail.com>
@jeffreyjeffreywang (Author) commented Aug 26, 2025:

Linked my benchmark script in the PR description. Also fixed doc tests identified by CI.

@kouroshHakha (Contributor) commented Aug 26, 2025:

doc lint is broken @jeffreyjeffreywang ?


[2025-08-26T23:54:22Z] python/ray/llm/_internal/batch/stages/serve_deployment_stage.py
[2025-08-26T23:54:22Z]     114: DOC404: Method `ServeDeploymentStageUDF.udf` yield type(s) in docstring not consistent with the return annotation. The yield type (the 0th arg in Generator[...]/Iterator[...]): Dict[str, Any]; docstring "yields" section types:

Signed-off-by: jeffreyjeffreywang <jeffjeffreywang@gmail.com>
@kouroshHakha (Contributor) left a comment:

LGTM. Just one nit. + doc lint is not passing.

@jeffreyjeffreywang jeffreyjeffreywang force-pushed the llm-multi-turn-2 branch 2 times, most recently from 319622e to 0bd2287 Compare August 27, 2025 21:09
Signed-off-by: jeffreyjeffreywang <jeffjeffreywang@gmail.com>
*,
name_prefix: Optional[str] = None,
deployment_kwargs: Optional[dict] = None,
override_serve_options: Optional[dict] = None,
Reviewer (Contributor):

I meant only from the public API. This is the internal API that we are removing it from: basically, keep deployment_kwargs under the _internal application_builders but remove it from serve/llm/__init__.py.

Author reply:

Ah mb, restored.

Signed-off-by: jeffreyjeffreywang <jeffjeffreywang@gmail.com>
@kouroshHakha kouroshHakha enabled auto-merge (squash) August 27, 2025 23:35
@kouroshHakha kouroshHakha disabled auto-merge August 28, 2025 00:19
@kouroshHakha kouroshHakha enabled auto-merge (squash) August 28, 2025 00:20
@kouroshHakha kouroshHakha merged commit a5d032b into ray-project:master Aug 28, 2025
7 checks passed
tohtana pushed a commit to tohtana/ray that referenced this pull request Aug 29, 2025
…th serve deployments (ray-project#55179)

Signed-off-by: jeffreyjeffreywang <jeffjeffreywang@gmail.com>
Co-authored-by: jeffreyjeffreywang <jeffjeffreywang@gmail.com>
Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
tohtana pushed a commit to tohtana/ray that referenced this pull request Aug 29, 2025
…th serve deployments (ray-project#55179)

Signed-off-by: jeffreyjeffreywang <jeffjeffreywang@gmail.com>
Co-authored-by: jeffreyjeffreywang <jeffjeffreywang@gmail.com>
Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
gangsf pushed a commit to gangsf/ray that referenced this pull request Sep 2, 2025
…th serve deployments (ray-project#55179)

Signed-off-by: jeffreyjeffreywang <jeffjeffreywang@gmail.com>
Co-authored-by: jeffreyjeffreywang <jeffjeffreywang@gmail.com>
Signed-off-by: Gang Zhao <gang@gang-JQ62HD2C37.local>
sampan-s-nayak pushed a commit to sampan-s-nayak/ray that referenced this pull request Sep 8, 2025
…th serve deployments (ray-project#55179)

Signed-off-by: jeffreyjeffreywang <jeffjeffreywang@gmail.com>
Co-authored-by: jeffreyjeffreywang <jeffjeffreywang@gmail.com>
Signed-off-by: sampan <sampan@anyscale.com>
jugalshah291 pushed a commit to jugalshah291/ray_fork that referenced this pull request Sep 11, 2025
…th serve deployments (ray-project#55179)

Signed-off-by: jeffreyjeffreywang <jeffjeffreywang@gmail.com>
Co-authored-by: jeffreyjeffreywang <jeffjeffreywang@gmail.com>
Signed-off-by: jugalshah291 <shah.jugal291@gmail.com>
dstrodtman pushed a commit that referenced this pull request Oct 6, 2025
…th serve deployments (#55179)

Signed-off-by: jeffreyjeffreywang <jeffjeffreywang@gmail.com>
Co-authored-by: jeffreyjeffreywang <jeffjeffreywang@gmail.com>
Signed-off-by: Douglas Strodtman <douglas@anyscale.com>
landscapepainter pushed a commit to landscapepainter/ray that referenced this pull request Nov 17, 2025
…th serve deployments (ray-project#55179)

Signed-off-by: jeffreyjeffreywang <jeffjeffreywang@gmail.com>
Co-authored-by: jeffreyjeffreywang <jeffjeffreywang@gmail.com>

Labels

community-contribution (Contributed by the community), data (Ray Data-related issues), go (add ONLY when ready to merge, run all tests), llm, performance, serve (Ray Serve Related Issue)


Development

Successfully merging this pull request may close these issues.

[Data] [LLM] Allow vLLM deployments to be shared by sequential processors

5 participants