[Data][LLM] Add should_continue_on_error for graceful error handling in batch inference#59212
Conversation
…h inference

Add `continue_on_error` parameter to vLLM batch processor config. When enabled, inference failures yield error rows instead of crashing the job. Scoped to Ray Data LLM batch inference only; no changes to Ray Data core.

Addresses: ray-project#52449
Related: ray-project#52457

Problem

When running LLM batch inference at scale, a single bad row (e.g., a prompt exceeding `max_model_len`) can crash the whole batch.

Solution

Add optional `continue_on_error` parameter to processor config. The parameter defaults to `False`, preserving existing fail-fast behavior. When set to `True`:

- Catch exceptions in `vLLMEngineStageUDF`
- Failed rows yield with the `__inference_error__` column set to the error message
- Successful rows have `__inference_error__: None`
- Error rows bypass postprocess (avoids crashes when expected fields are missing)
- Job completes with mixed success/failure outputs
- Users filter downstream: `ds.filter(lambda r: r["__inference_error__"] is None)`

Design

1. **Default behavior unchanged**: `continue_on_error=False` preserves existing fail-fast semantics. This is opt-in only.
2. **Error rows bypass postprocess**: The user's postprocess function likely expects `generated_text` and other output fields. Error rows won't have these, so we skip postprocess to avoid secondary crashes.
3. **Error as `Optional[str]`**: The `__inference_error__` column is `None` on success, or contains the error message (with type) on failure. This provides debuggability while keeping the schema simple.
4. **LLM operator only**: Per feedback, this is scoped to the LLM processor implementation. No changes to Ray Data core primitives.

Tradeoffs

- **Silent vs. visible failures**: Choose visible failures (error column) over silent dropping, for observability.
- **Schema addition**: All outputs now include the `__inference_error__` column. This is necessary for users to distinguish success from failure and debug.
- **No retry mechanism**: Retrying and auto-tuning are outside the scope of this PR.
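The catch-and-yield flow described above can be sketched in plain Python. This is an illustrative stand-in, not the actual Ray internals: `fake_generate`, the wrapper name, and the row shapes are all hypothetical.

```python
import asyncio
from typing import Any, Dict, List

ERROR_COLUMN = "__inference_error__"

async def fake_generate(row: Dict[str, Any]) -> Dict[str, Any]:
    # Stand-in for the real vLLM engine call: fail on over-long prompts.
    if len(row["prompt"]) > 10:
        raise ValueError("prompt exceeds max_model_len")
    return {"prompt": row["prompt"], "generated_text": row["prompt"].upper()}

async def generate_with_error_handling(
    row: Dict[str, Any], should_continue_on_error: bool
) -> Dict[str, Any]:
    """Turn per-row failures into error rows when enabled."""
    try:
        out = await fake_generate(row)
        out[ERROR_COLUMN] = None  # consistent schema on success
        return out
    except Exception as e:
        if not should_continue_on_error:
            raise  # default fail-fast behavior
        # Keep the exception type in the message for debuggability.
        return {"prompt": row.get("prompt"), ERROR_COLUMN: f"{type(e).__name__}: {e}"}

async def run(rows: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    return [await generate_with_error_handling(r, should_continue_on_error=True) for r in rows]

results = asyncio.run(run([{"prompt": "hi"}, {"prompt": "x" * 50}]))
```

A job over both rows completes: the first yields a normal output row with `__inference_error__: None`, the second yields an error row instead of crashing the batch.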
---

Files Changed

- `python/ray/data/llm.py` - Document new parameter in public API
- `python/ray/llm/_internal/batch/processor/base.py` - Add `continue_on_error` to config
- `python/ray/llm/_internal/batch/processor/vllm_engine_proc.py` - Pass config to stage
- `python/ray/llm/_internal/batch/stages/base.py` - Skip postprocess for error rows
- `python/ray/llm/_internal/batch/stages/vllm_engine_stage.py` - Catch errors and yield error rows

Example Usage

```python
import ray
from ray.data.llm import vLLMEngineProcessorConfig, build_llm_processor

config = vLLMEngineProcessorConfig(
    model_source="meta-llama/Meta-Llama-3.1-8B-Instruct",
    continue_on_error=True,  # Enable graceful error handling
)
processor = build_llm_processor(
    config,
    preprocess=lambda row: dict(
        messages=[{"role": "user", "content": row["prompt"]}],
        sampling_params=dict(temperature=0.3, max_tokens=100),
    ),
    postprocess=lambda row: dict(
        response=row["generated_text"],
    ),
)

ds = ray.data.read_json("prompts.json")
result = processor(ds)

successful = result.filter(lambda r: r["__inference_error__"] is None)
successful.write_json("outputs/")

failed = result.filter(lambda r: r["__inference_error__"] is not None)
print(f"Failed: {failed.count()} rows")
failed.show(5)
```

Tests

`python/ray/llm/tests/batch/cpu/stages/test_stage_base.py`:
- `test_wrap_postprocess_bypasses_error_rows`
- `test_wrap_postprocess_success_rows_run_postprocess`

`python/ray/llm/tests/batch/gpu/stages/test_vllm_engine_stage.py`:
- `test_vllm_udf_default_raises_on_error`
- `test_vllm_udf_continue_on_error_yields_error_row`
- `test_vllm_udf_mixed_success_and_error`

Signed-off-by: Nikhil Ghosh <nikhil@anyscale.com>
Code Review
This pull request introduces a continue_on_error parameter for graceful error handling during batch inference, which is a great addition for robustness at scale. The implementation correctly isolates the error handling within the vLLM engine stage and provides a mechanism to flag failed rows using the __inference_error__ column.
My review focuses on improving the debuggability of failed requests and ensuring schema consistency in the output. Specifically, I've suggested:
- Preserving more context in error rows to make it easier to identify the problematic input.
- Ensuring the `__inference_error__` column is present in all output rows (both success and failure) for a consistent schema, as described in the PR description.
- Including request parameters in the error output for more complete debugging information.
Overall, the changes are well-structured and the addition of tests is thorough. Addressing these points will make the feature even more user-friendly and robust.
- Distinguish fatal (`EngineDeadError`) vs. recoverable errors; fatal errors always propagate even when `continue_on_error=True`
- Error rows bypass downstream stage UDFs to prevent crashes when expected fields (e.g., `generated_tokens`) are missing
- Include original prompt in error rows for debuggability
- Add `__inference_error__` column to success rows when `continue_on_error=True` for consistent output schema; no schema change when `False` (backwards compatible)
- Add tests for fatal error propagation, error row bypass, and schema consistency

Signed-off-by: Nikhil Ghosh <nikhil@anyscale.com>
/gemini review
Code Review
This pull request introduces a continue_on_error flag for graceful error handling in vLLM batch inference, which is an excellent addition for improving the robustness of large-scale inference jobs. The implementation is well-structured, creating error rows with an __inference_error__ column instead of crashing. My review identified a high-severity issue where debugging information added to error rows is later stripped out during post-processing. I've provided a suggestion to fix this, along with corresponding updates to the tests to ensure the debugging information is preserved as intended.
Signed-off-by: Nikhil Ghosh <nikhil@anyscale.com>
/gemini review
Code Review
This pull request introduces a continue_on_error parameter for vLLM batch inference in Ray Data. This allows inference jobs to continue even if some rows fail, by yielding error rows with an __inference_error__ column instead of crashing. The changes are well-structured, with error handling logic localized to the vLLM stage and a generic mechanism for bypassing error rows in subsequent stages. The implementation correctly distinguishes between recoverable and fatal vLLM errors. The changes are also well-covered by new unit tests. My main feedback is a minor suggestion to replace a magic number with a constant for better code maintainability.
    if len(prompt) > 500:
        prompt = prompt[:500] + "...[truncated]"
kouroshHakha
left a comment
LGTM. just some minor comments
    if len(prompt) > 500:
        prompt = prompt[:500] + "...[truncated]"
    # Assign the index of the row in the batch to the idx_in_batch_column.
    # This is beacuse the UDF output may be out-of-order (if asyncio.as_completed
    # is used interanlly for example), and we need to carry over unused input
    # This is because the UDF output may be out-of-order (if asyncio.as_completed
Bug: Validation fails for error rows before bypass logic
The validate_inputs call at line 171 runs on all rows before error rows are separated (lines 181-190). Error rows from vLLMEngineStage lack required keys like generated_tokens that downstream stages such as DetokenizeStage expect, so validation fails with "Required input keys not found" before the error rows can be bypassed. The tests avoid this by passing expected_input_keys=None, but in production DetokenizeStage does have required keys. Validation needs to either skip error rows or run after error-row separation.
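One way to express the suggested ordering (separate error rows first, validate only the non-error rows) as a hypothetical sketch; `validate_inputs`, `run_stage`, and the key names are illustrative, not the actual stage code:

```python
from typing import Any, Dict, List

ERROR_COLUMN = "__inference_error__"

def validate_inputs(row: Dict[str, Any], expected_keys: List[str]) -> None:
    # Mirrors the "Required input keys not found" check described above.
    missing = [k for k in expected_keys if k not in row]
    if missing:
        raise ValueError(f"Required input keys not found: {missing}")

def run_stage(rows: List[Dict[str, Any]], expected_keys: List[str]) -> List[Dict[str, Any]]:
    outputs = []
    for row in rows:
        # Separate error rows BEFORE validating, so bypassed rows are
        # never checked for downstream keys like generated_tokens.
        if row.get(ERROR_COLUMN) is not None:
            outputs.append(row)  # pass through untouched
            continue
        validate_inputs(row, expected_keys)
        outputs.append(row)  # a real stage would run its UDF here
    return outputs
```

With this ordering an error row missing `generated_tokens` flows through a detokenize-like stage unharmed, while genuinely malformed non-error rows still fail validation.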
    # to avoid bloating the output.
    prompt = row.get("prompt", "")
    if len(prompt) > 500:
        prompt = prompt[:500] + "...[truncated]"
Bug: Error handler crashes if prompt is None
The error handler at line 619 uses row.get("prompt", "") to retrieve the prompt for debugging. However, Python's dict.get() only returns the default when the key is missing; if the key exists with the value None, it returns None. Then len(prompt) at line 620 raises TypeError: object of type 'NoneType' has no len(). This crash happens inside the except Exception block, so the job fails despite continue_on_error=True, defeating the purpose of graceful error handling.
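The pitfall is easy to reproduce in isolation; the `or ""` coercion at the end is one possible fix, not necessarily the one that was merged:

```python
row = {"prompt": None}  # the key exists, but its value is None

# dict.get only falls back to the default when the key is *missing*:
assert row.get("prompt", "") is None

# So len() on the result raises for None-valued prompts:
try:
    len(row.get("prompt", ""))
except TypeError:
    pass  # object of type 'NoneType' has no len()

# One defensive fix (illustrative): coerce None to "" as well.
prompt = row.get("prompt") or ""
assert len(prompt) == 0  # safe for both missing and None
```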
Signed-off-by: Nikhil Ghosh <nikhil@anyscale.com>
kouroshHakha
left a comment
LGTM just a few non-blockers?
    """
    super().__init__(data_column, expected_input_keys)
    self.model = model
    self.continue_on_error = continue_on_error
can we rename this parameter to `should_continue_on_error` globally?
        "time_taken_llm": time_taken_llm,
        "params": str(request.params),
    }
    output = await resp
Signed-off-by: Nikhil Ghosh <nikhil@anyscale.com>
    # Include snippet of failed prompt
    prompt = row.get("prompt", "")
    if len(prompt) > _MAX_PROMPT_LENGTH_IN_ERROR:
        prompt = prompt[:_MAX_PROMPT_LENGTH_IN_ERROR] + "...[truncated]"
Bug: Prompt lost in error rows after early pop
The error handling in _generate_with_error_handling tries to capture the prompt via row.get("prompt", "") on line 629. However, generate_async internally calls _prepare_llm_request which pops the prompt from the row (prompt = row.pop("prompt") on line 326) before the actual LLM call. When an error occurs after the prompt is popped (e.g., during vLLM engine processing), the error row will have an empty string for prompt instead of the actual prompt content, defeating the purpose of including it for debugging. The prompt needs to be captured before calling generate_async.
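A minimal reproduction of the ordering issue, with `prepare_llm_request` as a hypothetical stand-in for the real `_prepare_llm_request` (which pops the prompt before the engine call):

```python
from typing import Any, Dict

def prepare_llm_request(row: Dict[str, Any]) -> None:
    # Stand-in: pops the prompt off the row, then something downstream fails.
    row.pop("prompt")
    raise ValueError("engine error after prompt was popped")

def generate_buggy(row: Dict[str, Any]) -> Dict[str, Any]:
    try:
        prepare_llm_request(row)
    except Exception as e:
        # Bug: the prompt is already gone, so this is always "".
        return {"prompt": row.get("prompt", ""), "__inference_error__": str(e)}

def generate_fixed(row: Dict[str, Any]) -> Dict[str, Any]:
    prompt = row.get("prompt") or ""  # capture BEFORE any mutation
    try:
        prepare_llm_request(row)
    except Exception as e:
        return {"prompt": prompt, "__inference_error__": str(e)}
```

Capturing the prompt before calling into the request-preparation path keeps the debugging context in the error row.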
nrghosh
left a comment
@kouroshHakha addressed comments, thanks
will add support for serve handles <> data in a separate PR
cc @richardliaw
…tage

Add `continue_on_error` parameter to ServeDeploymentProcessorConfig. When enabled, inference failures yield error rows instead of crashing the job.

Changes:
- Add `should_continue_on_error` to ServeDeploymentProcessorConfig
- Add error handling to ServeDeploymentStageUDF with fatal error detection
- Wire config through build_serve_deployment_processor
- Update base.py: wrap_postprocess bypasses error rows, StatefulStageUDF skips UDF for error rows, include_error_column for schema consistency
- Add tests for error handling in serve deployment stage and base stage

Fatal errors (RayActorError, BackPressureError, DeploymentUnavailableError) always propagate since the replica/deployment is dead. Non-fatal errors (e.g., ValueError from vLLM) yield error rows with `__inference_error__` set.

Review questions:
- Should `should_continue_on_error` move to the base ProcessorConfig?
- BackPressureError could benefit from retry logic (out of scope here)

Addresses: ray-project#59325
Related: ray-project#59212, ray-project#52449

Signed-off-by: Nikhil Ghosh <nikhil@anyscale.com>
…tage

Extends ray-project#59212 to support ServeDeploymentStage. When enabled, inference failures yield error rows with `__inference_error__` set instead of crashing.

Changes:
- Add `should_continue_on_error` to ServeDeploymentProcessorConfig
- Add error handling to ServeDeploymentStageUDF with fatal error detection
- Fatal errors (RayActorError, BackPressureError, DeploymentUnavailableError) always propagate; non-fatal errors yield error rows
- Add tests for error handling behavior

Addresses: ray-project#59325
Related: ray-project#59212

Signed-off-by: Nikhil Ghosh <nikhil@anyscale.com>
…tage

## Why is this change needed?

This is a follow-up to ray-project#59212, which added graceful error handling for vLLMEngineProcessorConfig. ServeDeploymentStage also needs error handling to prevent entire batch jobs from crashing due to single bad rows.

When processing large datasets via Ray Serve deployments, a single malformed request (e.g., invalid parameters, too-long prompt) would crash the entire job. Users want the ability to:
1. Continue processing despite individual row failures
2. Receive error information for failed rows for debugging/retry
3. Distinguish between recoverable errors (single row) and fatal errors (replica/deployment crash)

## Changes

**ServeDeploymentProcessorConfig**:
- Add `should_continue_on_error: bool = False` field

**ServeDeploymentStageUDF**:
- Add `_generate_with_error_handling()` that catches non-fatal exceptions and yields error rows with `__inference_error__` set
- Add `_is_fatal_error()` to distinguish fatal vs. recoverable errors
- Include truncated `request_kwargs` in error rows for debuggability

**Fatal Error Detection**:
- `RayActorError`: Replica crashed or unavailable
- `BackPressureError`: System overloaded (note: could benefit from retry)
- `DeploymentUnavailableError`: Deployment failed to deploy
- Also handles `RayTaskError` wrapping by inspecting `.cause`

Non-fatal errors (e.g., ValueError from invalid params) yield error rows when `should_continue_on_error=True`; fatal errors always propagate.

## Tests

- test_serve_udf_default_raises_on_error
- test_serve_udf_continue_on_error_yields_error_row
- test_serve_udf_mixed_success_and_error
- test_serve_udf_fatal_errors_always_propagate (parametrized for all 3 types)
- test_serve_udf_success_with_continue_on_error_includes_none_error

Related to ray-project#59325
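The fatal-error classification can be sketched as follows. The exception classes here are local stand-ins for the real Ray types (which live in Ray's exception modules), and `is_fatal_error` mirrors the intent of `_is_fatal_error` rather than its exact code:

```python
# Stand-in exception classes, for illustration only.
class RayActorError(Exception): ...
class BackPressureError(Exception): ...
class DeploymentUnavailableError(Exception): ...

class RayTaskError(Exception):
    """Ray wraps exceptions raised in remote tasks; the original is .cause."""
    def __init__(self, cause: Exception):
        super().__init__(str(cause))
        self.cause = cause

FATAL_ERRORS = (RayActorError, BackPressureError, DeploymentUnavailableError)

def is_fatal_error(exc: BaseException) -> bool:
    """Fatal errors always propagate, even with should_continue_on_error=True."""
    if isinstance(exc, FATAL_ERRORS):
        return True
    # Unwrap RayTaskError and inspect the underlying cause.
    if isinstance(exc, RayTaskError):
        return isinstance(exc.cause, FATAL_ERRORS)
    return False
```

Unwrapping `.cause` matters because a replica crash surfaces at the caller as a `RayTaskError` wrapping the real failure; without it, a dead replica would be misclassified as a recoverable per-row error.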
Update test_vllm_udf_fatal_error_always_raises to verify that fatal errors (EngineDeadError) now trigger ray.actor.exit_actor() for recovery instead of simply re-raising.

The original intent (PR ray-project#59212) was that fatal errors should NOT be swallowed by should_continue_on_error. This is preserved: fatal errors still don't yield error rows. The change is that instead of re-raising (which caused infinite retry loops on the same broken actor), we now exit the actor to enable recovery.

Signed-off-by: Nikhil Ghosh <nikhil@anyscale.com>
…in batch inference (ray-project#59212)

Signed-off-by: Nikhil Ghosh <nikhil@anyscale.com>
Signed-off-by: peterxcli <peterxcli@gmail.com>
Description
Add `should_continue_on_error` parameter to vLLM batch processor config. When enabled, even non-fatal inference failures yield error rows instead of crashing the job.
Scoped to Ray Data LLM batch inference only; no changes to Ray Data core.
Addresses: #52449
Related: #52457
Problem
When running LLM batch inference at scale, a single bad row (e.g., a prompt exceeding `max_model_len`) can crash the whole batch.
The vLLM AsyncLLM engine distinguishes two types of errors: `EngineGenerateError` (recoverable, scoped to a single request) and `EngineDeadError` (fatal, the engine itself is unusable).
Solution
Add optional `should_continue_on_error` parameter to processor config. The parameter defaults to `False`, preserving existing fail-fast behavior.
When set to True:
- Failed rows yield with the `__inference_error__` column set to the error message
- Successful rows have `__inference_error__: None`
- Users filter downstream: `ds.filter(lambda r: r["__inference_error__"] is None)`

Design
Default behavior unchanged: `should_continue_on_error=False` preserves existing fail-fast semantics. This is opt-in only.
Error rows bypass postprocess: The user's postprocess function likely expects `generated_text` and other output fields. Error rows won't have these, so we skip postprocess to avoid secondary crashes.
Error as Optional[str]: The `__inference_error__` column is `None` on success, or contains the error message (with type) on failure. This provides debuggability while keeping the schema simple.
LLM operator only: Per feedback, this is scoped to the LLM processor implementation. No changes to Ray Data core primitives.
Questions
Tradeoffs
Files Changed
- `python/ray/data/llm.py` - Document new parameter in public API
- `python/ray/llm/_internal/batch/processor/base.py` - Add `should_continue_on_error` to config
- `python/ray/llm/_internal/batch/processor/vllm_engine_proc.py` - Pass config to stage
- `python/ray/llm/_internal/batch/stages/base.py` - Skip postprocess for error rows
- `python/ray/llm/_internal/batch/stages/vllm_engine_stage.py` - Catch errors and yield error rows
Tests
`python/ray/llm/tests/batch/cpu/stages/test_stage_base.py`:
- `test_wrap_postprocess_bypasses_error_rows`
- `test_wrap_postprocess_success_rows_run_postprocess`

`python/ray/llm/tests/batch/gpu/stages/test_vllm_engine_stage.py`:
- `test_vllm_udf_default_raises_on_error`
- `test_vllm_udf_should_continue_on_error_yields_error_row`
- `test_vllm_udf_mixed_success_and_error`