Dear authors,
Thanks for your great work on process verification in RL! I have a question regarding the computation of process rewards in your paper.
In Equation (4), the process reward is defined as the sum of the global advantage and local advantage:
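Writing it out in my own symbols (which may not match the paper's notation exactly):

$$
R_{\text{process}} \;=\; A_{\text{global}} \;+\; A_{\text{local}}
$$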
The global advantage in Eq. (2) is computed by subtracting the average reward across all leaf nodes:
```python
leaf.R = leaf.R - mean[i]
```
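To make sure I am reading Eq. (2) correctly, here is a tiny numerical sketch of that mean subtraction (toy rewards and my own variable names, not the repo's code):

```python
# Toy illustration of the global advantage: each leaf's reward minus the
# mean reward over all leaves of the same search tree.
leaf_rewards = [1.0, 0.0, 1.0, 1.0]                    # hypothetical leaf rewards
mean_reward = sum(leaf_rewards) / len(leaf_rewards)    # 0.75
global_advantages = [r - mean_reward for r in leaf_rewards]
print(global_advantages)                               # [0.25, -0.75, 0.25, 0.25]
```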
The local advantage in Eq. (3) is calculated by subtracting the value of the parent node:
```python
path.append({
    'answer': node.answer,
    'token_answer': node.answer_token,
    'reward': node.value,
    "pass_ratio": node.correct_terminal_in_subtree / node.terminal_in_subtree,
    "value": child_value - parent_value,
    "state_value": child_value,
})
```
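So, as I understand the snippet, the local advantage of a node is its own state value minus its parent's state value, e.g. (toy numbers, names are mine):

```python
# Toy illustration of the local advantage: node value minus parent value.
parent_value = 0.50   # e.g. pass ratio of the parent's subtree
child_value = 0.80    # e.g. pass ratio of this node's subtree
local_advantage = child_value - parent_value
print(local_advantage)   # 0.3 (up to floating-point rounding)
```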
However, I noticed that `use_state_value_reward` is enabled in the provided training scripts:

`TreeRL/scripts/treerl-qw14b.sh` (line 78 in 63ea99e):

```bash
--use_state_value_reward \
```
From the implementation, this means the final process reward for each node is computed as the sum of:
- Global advantage
- Local advantage
- The value of the node itself
This can be seen here:
`TreeRL/openrlhf/trainer/ppo_utils/parallel_mcts.py` (lines 1664 to 1670 in 63ea99e):

```python
elif use_state_value_reward:
    print("use state value reward in mcts!!")
    # paths = normalize_all_paths(paths,step_level_norm)
    for path in paths:
        for node in path:
            # node["value"] = (node["value"] + node["state_value"])/2
            node["value"] = (node["value"] + node["state_value"])
```
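Putting the pieces together, my reading is that each node ends up with roughly the following reward (toy numbers continued from above; this is my interpretation, not code from the repo):

```python
# Hypothetical combination when use_state_value_reward is enabled:
# the two advantage terms from Eq. (4) plus the node's own state value.
global_advantage = 0.25
local_advantage = 0.30
state_value = 0.80                               # the node's own value

eq4_reward = global_advantage + local_advantage  # Eq. (4): 0.55
code_reward = eq4_reward + state_value           # what the code seems to compute: 1.35
print(eq4_reward, code_reward)
```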
Could you clarify whether adding the node value to the process reward is intentional, and if so, how it aligns with the formulation in the paper?
Looking forward to your reply!