
[Data][LLM] Add should_continue_on_error support for ServeDeploymentStage (Data <> Serve) #59325

@nrghosh

Description


Follow-up to #59212. The current continue_on_error implementation only handles the case where the vLLM engine runs in the same lifecycle as the Ray Data pipeline (vLLMEngineStage). When using Ray Serve handles (ServeDeploymentStage), error handling requires a separate implementation due to differences in error propagation.

Background

ServeDeploymentStage accesses the LLM engine via DeploymentHandle (RPC calls) rather than in-process. This changes how errors show up (from the POV of a Ray Data user):

Error Type       vLLMEngineStage (in-process)  ServeDeploymentStage (RPC)
Prompt too long  ValueError                    Wrapped in RayTaskError
Engine OOM       EngineDeadError               RayActorError (replica died)
Network issue    N/A                           RayActorError, timeout
Replica crashed  N/A                           RayActorError

Current State

serve_deployment_stage.py has the same vulnerability as the original vllm_engine_stage.py:

tasks = [asyncio.create_task(self.generate_async(row)) for row in batch]

for resp in asyncio.as_completed(tasks):
    request, output, time_taken = await resp  # Exception propagates, kills batch

Proposed Implementation

  1. Add should_continue_on_error parameter to ServeDeploymentStageUDF.__init__

  2. Create _generate_with_error_handling wrapper for generate_async

  3. Define fatal vs non-fatal errors for serve handle case:

    _SERVE_FATAL_ERRORS = (
        ray.exceptions.RayActorError,  # Replica crashed
        # Connection/timeout errors TBD
    )
  4. Handle error unwrapping: RayTaskError wraps the original exception, so the cause chain must be inspected to determine whether the underlying error was fatal

  5. Wire continue_on_error through ServeDeploymentProcessorConfig
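A minimal sketch of steps 1-3 above. This is hedged: `_generate_with_error_handling`, `run_batch`, and the stand-in exception classes are placeholders for the real `ray.exceptions` types and stage internals, not the actual API.

```python
import asyncio

class ReplicaDiedError(Exception):
    """Stand-in for ray.exceptions.RayActorError (fatal: replica crashed)."""

class BadRequestError(Exception):
    """Stand-in for a non-fatal per-request error (e.g. prompt too long)."""

# Fatal errors abort the whole batch; everything else may be skipped.
_SERVE_FATAL_ERRORS = (ReplicaDiedError,)

async def _generate_with_error_handling(generate_async, row, should_continue_on_error):
    """Wrap generate_async so non-fatal errors are returned, not raised."""
    try:
        return await generate_async(row), None
    except _SERVE_FATAL_ERRORS:
        raise  # replica died: no point continuing the batch
    except Exception as e:
        if not should_continue_on_error:
            raise
        return None, e  # caller records the error row and keeps going

async def run_batch(batch, generate_async, should_continue_on_error=True):
    """Dispatch a batch; failed rows become error records instead of killing it."""
    tasks = [
        asyncio.create_task(
            _generate_with_error_handling(generate_async, row, should_continue_on_error)
        )
        for row in batch
    ]
    results = []
    for fut in asyncio.as_completed(tasks):
        output, err = await fut
        results.append(output if err is None else {"error": str(err)})
    return results
```

With `should_continue_on_error=True`, a single bad prompt yields one error record while the rest of the batch completes; a fatal error still propagates and fails the batch.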

Challenges

  • Error serialization: vLLM exception types (EngineDeadError) are wrapped/serialized over RPC, not directly catchable
  • Fatal error detection: Need to distinguish "replica died" (fatal, don't continue) from "request validation failed" (non-fatal, safe to continue)
  • Error unwrapping: May need to parse error messages or inspect RayTaskError.cause to determine root cause
  • Serve middleware: Errors might be converted to HTTP responses rather than exceptions depending on deployment configuration
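The unwrapping challenge above could be handled by walking the exception's cause chain. A hedged sketch, where `WrappedTaskError` stands in for `ray.exceptions.RayTaskError` (whose `cause` attribute holds the user exception) and `ReplicaDiedError` stands in for `ray.exceptions.RayActorError`:

```python
class WrappedTaskError(Exception):
    """Stand-in for ray.exceptions.RayTaskError: wraps the root-cause exception."""
    def __init__(self, cause):
        super().__init__(f"task failed: {cause!r}")
        self.cause = cause

class ReplicaDiedError(Exception):
    """Stand-in for ray.exceptions.RayActorError (fatal)."""

_FATAL_TYPES = (ReplicaDiedError,)

def is_fatal(exc):
    """Unwrap RPC-level wrappers and classify the root cause as fatal or not."""
    seen = set()
    while exc is not None and id(exc) not in seen:
        seen.add(id(exc))  # guard against cycles in the cause chain
        if isinstance(exc, _FATAL_TYPES):
            return True
        # Prefer the wrapper's explicit cause, fall back to PEP 3134 chaining.
        exc = getattr(exc, "cause", None) or exc.__cause__
    return False
```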

Files to Modify

  • python/ray/llm/_internal/batch/stages/serve_deployment_stage.py - Add error handling wrapper
  • python/ray/llm/_internal/batch/processor/serve_deployment_proc.py - Wire config through
  • python/ray/llm/tests/batch/gpu/stages/test_serve_deployment_stage.py - Add tests
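Wiring the flag through the processor config (step 5) might look roughly like the following; the field and helper names here are hypothetical illustrations, not the actual ray.llm API.

```python
from dataclasses import dataclass

@dataclass
class ServeDeploymentProcessorConfig:
    """Hypothetical slice of the processor config carrying the new flag."""
    deployment_name: str
    continue_on_error: bool = False  # propagated down to ServeDeploymentStageUDF

def build_stage_udf_kwargs(config: ServeDeploymentProcessorConfig) -> dict:
    """Translate processor-level config into UDF constructor kwargs."""
    return {"should_continue_on_error": config.continue_on_error}
```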

Related

Use case

No response

Metadata

Labels: data (Ray Data-related issues), enhancement (Request for new feature and/or capability), llm, triage (Needs triage, eg: priority, bug/not-bug, and owning component)
Status: Todo
Milestone: none