Dear authors,
Thanks for your great work on process verification in RL! I have a question regarding the computation of process rewards in your paper.
In Equation (4), the process reward is defined as the sum of the global advantage and local advantage:
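Writing it out in my own symbols (which may not match the paper's notation exactly):

$$
R_{\text{process}} \;=\; A_{\text{global}} \;+\; A_{\text{local}}
$$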
The global advantage in Eq. (2) is computed by subtracting the average reward across all leaf nodes:
```python
leaf.R = leaf.R - mean[i]
```
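To make sure I am reading Eq. (2) correctly, here is a tiny numerical sketch of that mean subtraction (toy rewards and my own variable names, not the repo's code):

```python
# Toy illustration of the global advantage: each leaf's reward minus the
# mean reward over all leaves of the same search tree.
leaf_rewards = [1.0, 0.0, 1.0, 1.0]                    # hypothetical leaf rewards
mean_reward = sum(leaf_rewards) / len(leaf_rewards)    # 0.75
global_advantages = [r - mean_reward for r in leaf_rewards]
print(global_advantages)                               # [0.25, -0.75, 0.25, 0.25]
```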
The local advantage in Eq. (3) is calculated by subtracting the value of the parent node:
```python
path.append({
    'answer': node.answer,
    'token_answer': node.answer_token,
    'reward': node.value,
    "pass_ratio": node.correct_terminal_in_subtree / node.terminal_in_subtree,
    "value": child_value - parent_value,
    "state_value": child_value,
})
```
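So, as I understand the snippet, the local advantage of a node is its own state value minus its parent's state value, e.g. (toy numbers, names are mine):

```python
# Toy illustration of the local advantage: node value minus parent value.
parent_value = 0.50   # e.g. pass ratio of the parent's subtree
child_value = 0.80    # e.g. pass ratio of this node's subtree
local_advantage = child_value - parent_value
print(local_advantage)   # 0.3 (up to floating-point rounding)
```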
However, I noticed that `use_state_value_reward` is enabled in the provided training scripts:

`TreeRL/scripts/treerl-qw14b.sh` (line 78 in 63ea99e):

```bash
--use_state_value_reward \
```
From the implementation, this means the final process reward for each node is computed as the sum of:
- Global advantage
- Local advantage
- The value of the node itself
This can be seen here:
`TreeRL/openrlhf/trainer/ppo_utils/parallel_mcts.py` (lines 1664 to 1670 in 63ea99e):

```python
elif use_state_value_reward:
    print("use state value reward in mcts!!")
    # paths = normalize_all_paths(paths,step_level_norm)
    for path in paths:
        for node in path:
            # node["value"] = (node["value"] + node["state_value"])/2
            node["value"] = (node["value"] + node["state_value"])
```
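Putting the pieces together, my reading is that each node ends up with roughly the following reward (toy numbers continued from above; this is my interpretation, not code from the repo):

```python
# Hypothetical combination when use_state_value_reward is enabled:
# the two advantage terms from Eq. (4) plus the node's own state value.
global_advantage = 0.25
local_advantage = 0.30
state_value = 0.80                               # the node's own value

eq4_reward = global_advantage + local_advantage  # Eq. (4): 0.55
code_reward = eq4_reward + state_value           # what the code seems to compute: 1.35
print(eq4_reward, code_reward)
```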
Could you clarify whether adding the node value to the process reward is intentional, and if so, how it aligns with the formulation in the paper?
Looking forward to your reply!