[trainer] fix: address serialization issues when using async reward function and ray ppo trainer#3769
Conversation
Code Review
This pull request fixes a serialization bug that occurs when using an asynchronous custom reward function with the Ray PPO trainer. The issue stems from ray.remote being unable to serialize a partial function that references a dynamically loaded module (custom_module). The fix correctly addresses this by no longer passing the reward_fn to the remote worker. Instead, it passes the config and tokenizer, allowing the reward function to be reconstructed on the worker, thus avoiding the serialization problem. The change is well-reasoned and directly solves the described issue. My only concern is that this fix relies on a code path that is marked as deprecated, which introduces a maintenance risk.
```python
future_reward = compute_reward_async.remote(
    data=batch, config=self.config, tokenizer=self.tokenizer
)
```
This change correctly resolves the serialization error. However, by passing config and tokenizer, this code now relies on a path in compute_reward_async that is marked as deprecated with a warnings.warn in verl/trainer/ppo/reward.py. This introduces a maintenance risk, as the deprecated path could be removed in the future. Since this path is necessary to fix the bug, the deprecation warning seems incorrect and should probably be removed to avoid future confusion and potential breakage. This design inconsistency should be addressed to ensure long-term maintainability.
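For context, the path in question has roughly this shape (a paraphrased sketch, not the exact source of verl/trainer/ppo/reward.py; compute_reward and load_reward_manager are the helpers defined in that same file):

```python
import warnings

import ray


@ray.remote
def compute_reward_async(data, config=None, tokenizer=None, reward_fn=None):
    # Paraphrased sketch; the real signature and warning text may differ.
    if reward_fn is None:
        # The deprecated branch this comment refers to: instead of receiving a pickled
        # reward_fn, the worker rebuilds it locally from config and tokenizer.
        warnings.warn("Passing config and tokenizer to compute_reward_async is deprecated.")
        reward_fn = load_reward_manager(config, tokenizer, num_examine=0)
    return compute_reward(data, reward_fn)
```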
@yyDing1 Please help review
I checked that the reuse of this path is fine. I also recommend that future implementations deprecate this feature and directly support loading asynchronous reward functions for better efficiency (in progress).
Correct, @yyDing1. It was actually acknowledged originally in this PR conversation, but I guess it fell through the cracks as these things do. Please let me know if there's anything I can do to expedite approval; it looks like the CI is failing due to infra errors?
@benprofessionaledition I've repeatedly tried running these CI workflows, but they still fail. I suspect that the CI on your base branch might be broken, so you may need to merge in the latest main branch first.
Got it! I've merged current main into this; hopefully that solves it, @yyDing1.
@vermouth1992 This PR looks good to me and can be merged.
What does this PR do?
This PR fixes a serialization bug that occurs in async reward functions, arising from relatively recent changes to implement caching of the reward function.
Checklist Before Starting
Format the PR title as [{modules}] {type}: {description} (This will be checked by the CI)
- {modules} include fsdp, megatron, sglang, vllm, rollout, trainer, ci, training_utils, recipe, hardware, deployment, ray, worker, single_controller, misc, perf, model, algo, env, tool, ckpt, doc, data; multiple modules are written like [megatron, fsdp, doc]
- {type} is in feat, fix, refactor, chore, test
- Add [BREAKING] to the beginning of the title if the PR breaks an API, e.g. [BREAKING][fsdp, megatron] feat: dynamic batching

Test
Tested on our internal cluster
API and Usage Example
Design & Code Changes
This addresses a bug in the serialization behavior when using Ray. It seems that you cannot pass reward_fn, a partial function, to compute_reward_async, because @ray.remote will attempt to serialize a partial function whose underlying function is imported under the name custom_module, which is not present on the worker process. A minimal example to reproduce this issue, together with a facsimile of the trainer code, is sketched below:
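(The two snippets from the original description are not reproduced above, so the following is an illustrative sketch of the failure mode. It assumes a file my_reward.py next to the script defining def my_reward_fn(data, extra_info=None): return 0.0; apart from custom_module, all names are stand-ins, not verl's actual code.)

```python
# Illustrative reproduction sketch (my_reward.py / my_reward_fn are hypothetical names).
import importlib.util
import sys
from functools import partial

import ray

# --- Loading a user reward file, mimicking how the trainer imports custom reward functions ---
# The file is loaded under the fixed module name "custom_module" and registered in sys.modules,
# so the resulting function object gets pickled by reference to that module name.
spec = importlib.util.spec_from_file_location("custom_module", "my_reward.py")
custom_module = importlib.util.module_from_spec(spec)
sys.modules["custom_module"] = custom_module
spec.loader.exec_module(custom_module)

# The trainer wraps the loaded function in a partial, e.g. to bind extra kwargs.
reward_fn = partial(custom_module.my_reward_fn, extra_info=None)


# --- Facsimile of the trainer-side async reward call ---
@ray.remote
def compute_reward_async(data, reward_fn):
    return reward_fn(data)


if __name__ == "__main__":
    ray.init()
    batch = {"responses": ["hello world"]}
    # Ray pickles reward_fn to ship it to the worker; during deserialization the worker
    # tries to import "custom_module", which exists only in the driver process, and fails.
    future_reward = compute_reward_async.remote(batch, reward_fn)
    print(ray.get(future_reward))
```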
Running the above example is roughly the equivalent of running a command like this:
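(The original command was not captured above; a hypothetical equivalent using verl's standard PPO entry point is shown below. The custom_reward_function.* and reward_model.launch_reward_fn_async keys are assumed from the stock ppo_trainer config and may differ across verl versions.)

```bash
# Hypothetical launch command; config keys are assumptions, not taken from the PR.
python3 -m verl.trainer.main_ppo \
    custom_reward_function.path=/path/to/my_reward.py \
    custom_reward_function.name=my_reward_fn \
    reward_model.launch_reward_fn_async=True \
    data.train_files=... \
    data.val_files=...
```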
You will get the expected error message (identical to VERL's in real-world training):
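(The original traceback is not reproduced above; it boils down to the standard Python error for the module that is missing on the worker, roughly:)

```
ModuleNotFoundError: No module named 'custom_module'
```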
By passing the config and tokenizer, as is done in the SPPO Ray Trainer recipe, instead of the reward_fn reference, we force re-initialization of this module so that it's available on the worker.
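For reference, a paraphrased sketch of the change in the Ray PPO trainer (not the literal diff; the removed line is reconstructed from the description above):

```diff
-        future_reward = compute_reward_async.remote(data=batch, reward_fn=self.reward_fn)
+        future_reward = compute_reward_async.remote(
+            data=batch, config=self.config, tokenizer=self.tokenizer
+        )
```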
Checklist Before Submitting
Important
Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.
- Apply pre-commit checks: pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always
- Once the PR is ready for CI, request a run in the ci-request channel in the verl Slack workspace. (If not accessible, please try the Feishu group (飞书群).)