
Question on Process Reward Computation with use_state_value_reward in TreeRL #2

@RyanLiu112

Description


Dear authors,

Thanks for your great work on process verification in RL! I have a question regarding the computation of process rewards in your paper.

In Equation (4), the process reward is defined as the sum of the global advantage and local advantage:
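If I understand correctly, this is roughly $r(s_t) = A^{\text{global}}(s_t) + A^{\text{local}}(s_t)$ for a node $s_t$ on a sampled path (my notation, not necessarily the paper's).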

The global advantage in Eq. (2) is computed by subtracting the mean reward over all leaf nodes from each leaf node's reward:

leaf.R = leaf.R - mean[i]
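That is, for each leaf node $l_i$ it is something like $A^{\text{global}}(l_i) = R(l_i) - \frac{1}{|L|} \sum_{l_j \in L} R(l_j)$, where $L$ is the set of leaf nodes for the corresponding prompt (again, my own notation).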

The local advantage in Eq. (3) is computed by subtracting the parent node's value from the child node's value:

path.append({
    'answer': node.answer,
    'token_answer': node.answer_token,
    'reward': node.value,
    "pass_ratio": node.correct_terminal_in_subtree / node.terminal_in_subtree,
    "value": child_value - parent_value,
    "state_value": child_value,
})
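So the local advantage stored under "value" is essentially $A^{\text{local}}(s_t) = V(s_t) - V(\mathrm{parent}(s_t))$, i.e. child_value - parent_value.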

However, I noticed that in the provided training scripts, use_state_value_reward is enabled:

--use_state_value_reward \

Judging from the implementation, this means that the final process reward for each node is computed as the sum of:

  • Global advantage
  • Local advantage
  • The value of the node itself

This can be seen here:

elif use_state_value_reward:
    print("use state value reward in mcts!!")
    # paths = normalize_all_paths(paths, step_level_norm)
    for path in paths:
        for node in path:
            # node["value"] = (node["value"] + node["state_value"]) / 2
            node["value"] = (node["value"] + node["state_value"])

Could you clarify whether adding the node value to the process reward is intentional, and if so, how it aligns with the formulation in the paper?

Looking forward to your reply!
