The score of the reward model is obtained directly from the final linear layer without undergoing normalization, which feels somewhat peculiar. What is the value range of the reward model's score, and what does it represent?
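To make the question concrete, here is a minimal sketch of what an unnormalized linear-head score looks like. It assumes the reward model is trained with a pairwise (Bradley-Terry style) objective, which this thread does not confirm; the function names `linear_head_score` and `preference_probability` and the toy weights are hypothetical and only for illustration. Under that assumption the raw score can be any real number, and only the difference between two scores has a direct interpretation.

```python
import math

def linear_head_score(hidden, weights, bias=0.0):
    """Scalar score from a final linear layer: a plain dot product.

    No sigmoid/tanh/softmax is applied, so the result is any real number.
    """
    return sum(h * w for h, w in zip(hidden, weights)) + bias

def preference_probability(score_a, score_b):
    """Assumed Bradley-Terry reading: sigmoid of the score difference is the
    modelled probability that response A is preferred over response B."""
    return 1.0 / (1.0 + math.exp(-(score_a - score_b)))

# Toy values, not taken from any real model.
weights = [0.5, -1.0, 2.0]
score_a = linear_head_score([1.0, 0.0, 3.0], weights)
score_b = linear_head_score([-2.0, 4.0, 0.0], weights)

p = preference_probability(score_a, score_b)

# Adding the same constant to both scores leaves the preference unchanged,
# which is why the absolute scale of a single score carries no fixed meaning.
p_shifted = preference_probability(score_a + 10.0, score_b + 10.0)
```

If this assumption holds, the score has no fixed range and is best read as a relative ranking signal between responses to the same prompt, not as a calibrated absolute value.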