[data.llm] Allow vLLM engine to be shared by sequential processors with serve deployments #55179
Conversation
Summary of Changes
Hello @jeffreyjeffreywang, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
I've implemented a significant enhancement to the Ray Data LLM API by enabling the sharing of vLLM engine instances across sequential processors. Previously, each step in a multi-stage LLM pipeline would create its own dedicated vLLM engine, leading to substantial resource overhead. My changes introduce a new configuration that leverages Ray Serve to deploy a single, shared vLLM engine, allowing multiple processing stages to utilize the same engine. This dramatically improves resource efficiency and is particularly beneficial for complex, multi-turn conversational AI applications.
Highlights
- **Shared vLLM Engine Configuration:** Introduced a new configuration class, `vLLMSharedEngineProcessorConfig`, which allows multiple sequential processors to share a single vLLM engine instance. This is crucial for optimizing resource utilization, especially in multi-turn conversational AI workflows where each step previously instantiated its own engine.
- **Ray Serve Integration for Engine Sharing:** To enable this sharing, Ray Serve deployments are integrated. The `vLLMSharedEngineProcessorConfig` now wraps an `LLMConfig` to define the vLLM engine and its deployment settings, ensuring that the engine is deployed as a persistent Ray Serve application.
- **Centralized Shared Engine Registry:** Implemented a `_SharedEngineRegistry` to manage and track these shared vLLM engine deployments. This registry ensures that if the same `vLLMSharedEngineProcessorConfig` instance is used across different processors, they correctly connect to and reuse the same underlying Serve deployment, preventing redundant engine instantiations.
- **Dedicated Shared Engine Processing Stage:** A new processing stage, `vLLMSharedEngineStage`, has been added. It is specifically designed to interact with the shared vLLM engine via its Ray Serve handle, abstracting away the complexities of managing the shared deployment for the data processing pipeline.
- **Dynamic Processor Building Logic:** Updated the `build_llm_processor` function to detect whether a shared engine configuration is being used and to route processing through the appropriate vLLM engine stage (either the traditional `vLLMEngineStage` or the new `vLLMSharedEngineStage`).
- **New Unit Tests for Shared Engine:** Comprehensive unit tests have been added to validate the shared engine functionality, including scenarios for multi-turn conversations and verifying that only a single Serve deployment is instantiated for a given shared configuration instance.
Code Review
This pull request introduces the capability to share a vLLM engine across multiple sequential processors using Ray Serve, which is a great feature for optimizing resource usage in multi-turn inference scenarios. The implementation introduces a new vLLMSharedEngineProcessorConfig and a registry to manage shared engine deployments. The code is well-structured and includes good tests that verify the new functionality.
My review focuses on a few key areas for improvement:
- **Performance:** The current implementation of the shared engine stage sends requests individually instead of in batches, which could be a significant performance bottleneck.
- **Correctness:** There's a minor issue with how a metric is calculated and a potential typo in how deployment names are generated.
- **Clarity:** The documentation for `vLLMSharedEngineProcessorConfig` could be improved to make a crucial usage detail more explicit to users.
Overall, this is a solid contribution that addresses an important use case. Addressing the feedback will make it more robust and performant.
```python
async def udf(self, batch: List[Dict[str, Any]]) -> AsyncIterator[Dict[str, Any]]:
    """Run the shared vLLM engine through serve deployment.

    Args:
        batch: A list of rows to run the vLLM engine on.

    Returns:
        The response of the vLLM engine.
    """
    requests = [self._prepare_llm_request(row) for row in batch]

    batch_uuid = uuid.uuid4()
    t = time.perf_counter()

    tasks = []
    for i, req in enumerate(requests):

        async def process_with_index(request, idx):
            result = await self._process_request(request)
            return idx, request, result

        task = asyncio.create_task(process_with_index(req, i))
        tasks.append(task)
```
The current implementation of udf processes a batch of requests by sending them to the Serve deployment one by one. This is inefficient and undermines the benefits of batch processing in Ray Data. The vLLM Serve deployment supports batching for both completion and embedding requests via the OpenAI-compatible API. To improve performance, this method should be refactored to send a single batched request to the deployment instead of iterating and sending individual requests.
For example, for GENERATE tasks, you could collect all prompts into a list and send a single CompletionRequest.
This change is critical for achieving good performance with the shared engine.
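A minimal sketch of the batched approach, using plain-Python stand-ins for the Serve handle and a hypothetical `completions` method (these names are illustrative, not the actual Ray Serve or vLLM API): all prompts in the Ray Data batch are collected into one request so the deployment sees the whole batch at once instead of one row per round trip.

```python
import asyncio
from typing import Any, Dict, List


class FakeHandle:
    """Stand-in for the Serve deployment handle; the real handle would accept
    an OpenAI-style batched completion request."""

    async def completions(self, prompts: List[str]) -> List[str]:
        # One round trip for the whole batch instead of one per row.
        return [f"echo:{p}" for p in prompts]


async def udf_batched(handle: FakeHandle, batch: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    # Collect all prompts into a single request so the engine's continuous
    # batching sees the entire Ray Data batch at once.
    prompts = [row["prompt"] for row in batch]
    texts = await handle.completions(prompts)
    return [{**row, "generated_text": text} for row, text in zip(batch, texts)]


rows = [{"prompt": "a"}, {"prompt": "b"}]
out = asyncio.run(udf_batched(FakeHandle(), rows))
print(out)  # → [{'prompt': 'a', 'generated_text': 'echo:a'}, {'prompt': 'b', 'generated_text': 'echo:b'}]
```

The key difference from the per-row loop above is that ordering is preserved by construction (`zip`), so no index bookkeeping is needed.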
python/ray/data/llm.py
```python
    Use vLLMSharedEngineProcessorConfig only when you want multiple processors to share
    the same vLLM engine (e.g. save compute resources for multi-turn conversations). For most
    use cases where engine sharing is not needed, use vLLMEngineProcessorConfig instead.
```
To share a vLLM engine across multiple processors, users must pass the exact same instance of the vLLMSharedEngineProcessorConfig object when building them. Creating two separate config objects, even with identical parameters, will result in two separate engine deployments. This is a crucial detail that can be easily missed and should be explicitly mentioned in the docstring to avoid confusion and unexpected resource usage.
I suggest adding a note to clarify this behavior.
```python
    shared_processor_config: vLLMSharedEngineProcessorConfig,
) -> str:
    """Create a Ray Serve deployment for the shared engine configuration."""
    deployment_name = f"shared_llm_engine_{uuid.uuid4().hex}:"
```
The deployment name is constructed with a trailing colon: f"shared_llm_engine_{uuid.uuid4().hex}:". While colons are sometimes used in Serve application names for versioning (e.g., my_app:v1), a trailing colon here seems unintentional and may result in an oddly named application in Ray Serve. This is likely a typo and should be removed.
```diff
-    deployment_name = f"shared_llm_engine_{uuid.uuid4().hex}:"
+    deployment_name = f"shared_llm_engine_{uuid.uuid4().hex}"
```
python/ray/llm/_internal/batch/stages/vllm_shared_engine_stage.py
As discussed in #55179, this is my first attempt to enable engine sharing. Here are some open questions:

@kouroshHakha Please take a look at this PR, and let me know if you have any concerns!
(@kouroshHakha is currently OOO, will get back later this week)
Thanks @jeffreyjeffreywang for taking a stab at this. I think the PoC overall is in the right direction. I think we should hash out the overall architecture and API in a design doc first. I started a draft here: https://docs.google.com/document/d/1QClDPT_iyUYIPg4ybNrnKxsw7igf15UeVtKgw0swzag/edit?tab=t.0 (it is publicly viewable).
The benefits of this design over the PoC:
- It does not need the introduction of a shared registry and all the mechanics around it. You let the user manage the lifetime of the application and get access via name.
- We introduce a new generic primitive abstracted with generic Serve concepts like deployment name, method, etc., rather than vLLM-, SGLang-, or other engine-specific stage implementations.
- Preprocess/postprocess functions should be generalized to work directly with the native data types rather than the conventions specific to the vLLMProcessor stage. As a result, you do not have to deal with the intricacies of converting data types back and forth between vLLMEngineInput/Output and OpenAI data objects.
- You don't have to keep dealing with offline-LLM-stage-specific things like `has_image`, `apply_chat_template`, `tokenize`, etc., since those concepts do not directly generalize to Serve deployment conventions.
Let's see if we can get this design prototyped.
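A rough sketch of the generic primitive the design doc proposes: a stage parameterized only by Serve concepts (application name, deployment name, method). All names below are assumptions inferred from the bullet points above, not the final API.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List


@dataclass
class ServeStageConfig:
    """Generic stage config: only Serve-level concepts, nothing vLLM-specific."""
    app_name: str
    deployment_name: str
    method: str  # e.g. "chat" or "completions"


def build_stage(config: ServeStageConfig, resolve: Callable[[str, str], Any]):
    """Build a stage that forwards rows to the named deployment method.

    `resolve` stands in for looking up a Serve handle by app/deployment name.
    """
    handle = resolve(config.app_name, config.deployment_name)
    method = getattr(handle, config.method)

    def stage(rows: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
        # Rows pass through in their native form; no engine-specific conversion.
        return [{**row, "output": method(row)} for row in rows]

    return stage


class FakeHandle:
    """Stand-in for a resolved Serve deployment handle."""

    def chat(self, row: Dict[str, Any]) -> str:
        return row["prompt"].upper()


cfg = ServeStageConfig("app", "llm", "chat")
stage = build_stage(cfg, lambda app, dep: FakeHandle())
result = stage([{"prompt": "hi"}])
print(result)  # → [{'prompt': 'hi', 'output': 'HI'}]
```

Because the stage is addressed purely by name and method, the same pattern works for any deployment, which is what removes the need for a shared registry.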
python/ray/llm/_internal/batch/stages/vllm_shared_engine_stage.py
```python
            num_input_tokens=num_input_tokens,
            generated_text=response.choices[0].text,
        )
    elif self.task_type == vLLMTaskType.EMBED:
```
Let's add this later and not support this for now?
With the latest design, it seems like this will be natively handled.
The chat and completions APIs are supported in the latest revision. Will leave embedding for follow-ups.
Force-pushed 492a281 to d4f4ceb.
Will rebase onto master once we're aligned on the overall approach.
python/ray/llm/tests/batch/gpu/processor/test_serve_deployment_proc.py
python/ray/llm/_internal/batch/stages/serve_deployment_stage.py
python/ray/llm/_internal/batch/processor/serve_deployment_proc.py
python/ray/llm/tests/batch/gpu/processor/test_serve_deployment_proc.py
Force-pushed d4f4ceb to 34688c4.
To compare the performance of the new stage against `vLLMEngineStage`, I measured throughput (samples/second). Based on the numbers, there isn't a significant performance difference between the two processors. GPU utilization remains consistently around 90% for both, though `ServeDeploymentStage` shows slightly more fluctuation compared to `vLLMEngineStage`. cc: @kouroshHakha
python/ray/llm/_internal/batch/stages/serve_deployment_stage.py
kouroshHakha
left a comment
Reviewed together; leaving comments until I do a full pass review.
…equests through serve deployments

Signed-off-by: jeffreyjeffreywang <jeffjeffreywang@gmail.com>
Linked my benchmark script in the PR description. Also fixed doc tests identified by CI.
Doc lint is broken @jeffreyjeffreywang?
Signed-off-by: jeffreyjeffreywang <jeffjeffreywang@gmail.com>
kouroshHakha
left a comment
LGTM. Just one nit. + doc lint is not passing.
Force-pushed 319622e to 0bd2287.
Signed-off-by: jeffreyjeffreywang <jeffjeffreywang@gmail.com>
Force-pushed 0bd2287 to 14b0765.
```python
    *,
    name_prefix: Optional[str] = None,
    deployment_kwargs: Optional[dict] = None,
    override_serve_options: Optional[dict] = None,
```
I meant only removing it from the public API; this is the internal API. Basically, keep `deployment_kwargs` under the `_internal` application builders but remove it from `serve/llm/__init__.py`.
Ah mb, restored.
Signed-off-by: jeffreyjeffreywang <jeffjeffreywang@gmail.com>
Why are these changes needed?
Sequential batch inference with the Ray Data LLM API requires creating separate processors for each step, and each processor inherently creates its own vLLM engine when its UDF is dispatched to actors, leading to inefficient resource usage since each engine needs dedicated resources. Ideally, engines could be shared across sequential processors to reduce resource requirements. Please refer to #52277 for more details regarding the motivation.
In this PR, we allow different processors to share the same `ray.serve` deployment instances with the new, generic `ServeDeploymentProcessorConfig`. The deployment's actors are kept alive until the Ray process shuts down. This addresses the core issue where each processor step was creating its own dedicated actor pool. Users specify `app_name` and `deployment_name` while creating a `ServeDeploymentProcessorConfig`. When requests reach a `ServeDeploymentStage`, Ray Data workers redirect them to the underlying Serve deployment. The results are retrieved asynchronously from the Serve deployment and sent to downstream stages within the processor.
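The sharing pattern can be sketched with plain-Python stand-ins (no Ray required; `FakeDeployment` and `FakeProcessor` are mocks, not the real classes): two sequential "processors" hold references to the same deployment object, so only one engine is ever constructed.

```python
import asyncio


class FakeDeployment:
    """Stand-in for a ray.serve deployment hosting one vLLM engine."""
    instances = 0

    def __init__(self, app_name: str, deployment_name: str):
        FakeDeployment.instances += 1  # count engine constructions
        self.name = f"{app_name}/{deployment_name}"

    async def generate(self, prompt: str) -> str:
        return f"[{self.name}] {prompt}"


class FakeProcessor:
    """Stand-in for a processor whose stage forwards rows to a shared handle."""

    def __init__(self, handle: FakeDeployment):
        self.handle = handle

    async def run(self, prompts):
        # Results are awaited asynchronously, mirroring the serve stage.
        return await asyncio.gather(*(self.handle.generate(p) for p in prompts))


async def main():
    handle = FakeDeployment("my_app", "llm")   # deployed once
    turn1 = FakeProcessor(handle)              # first conversational turn
    turn2 = FakeProcessor(handle)              # second turn reuses the engine
    mid = await turn1.run(["hello"])           # output of turn 1 feeds turn 2
    return await turn2.run(mid)


out = asyncio.run(main())
print(out, FakeDeployment.instances)
```

Both turns route through the single deployment, which is why the multi-turn benchmark below needs only one GPU instead of two.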
Concurrency vs. Throughput
The following benchmark is performed on A10G (`g5.48xlarge` instance) with `batch_size=64`, drawing 10,000 samples from https://huggingface.co/datasets/Crystalcareai/Code-feedback-sharegpt-renamed. GPU utilization is higher and more stable with `vLLMEngineStage`, while `ServeDeploymentStage` exhibits increasing fluctuations as concurrency grows.

Multi-turn Usage
In this benchmark, we evaluate multi-turn conversation scenarios where the output of the first processor serves as the input to the second. The benchmark was conducted on an A10G GPU with `batch_size=64`, `concurrency=1`, and 10,000 samples drawn from https://huggingface.co/datasets/Crystalcareai/Code-feedback-sharegpt-renamed.
With `vLLMEngineStage` (processor built by `vLLMEngineProcessorConfig`), two GPUs are required since GPUs cannot be shared across processors. In contrast, `ServeDeploymentStage` (processor built by `ServeDeploymentProcessorConfig`) only needs a single GPU, as multiple stages can share the same underlying deployment and therefore the same GPU.

In the table below, throughput refers to the number of samples processed per second, while normalized throughput is the throughput divided by the number of GPUs used.
Benchmark script: https://github.com/jeffreyjeffreywang/ray/blob/benchmark-serve-deployment-stage/python/ray/llm/examples/benchmark.py.
Related issue number
Resolves #52277
Checks
- I've signed off every commit (`git commit -s`) in this PR.
- I've run `scripts/format.sh` to lint the changes in this PR.
- If I've added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.