Feature request
Add GenRM capabilities to the fully async pipeline.
FullyAsyncRollouter hardcodes self.use_rm = False (code), preventing GenRM from being used with fully async GRPO. The sync pipeline already supports this via the reward loop infrastructure.
Happy to submit a PR.
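A minimal sketch of the change I have in mind, assuming the flag can simply be read from config. `RewardModelConfig`, its `enable` field, and `RollouterSketch` are hypothetical stand-ins for illustration, not the real classes or config keys:

```python
from dataclasses import dataclass


@dataclass
class RewardModelConfig:
    """Hypothetical config section; the real config layout may differ."""

    enable: bool = False


class RollouterSketch:
    """Hypothetical stand-in for FullyAsyncRollouter, reduced to the one flag."""

    def __init__(self, reward_model: RewardModelConfig):
        # Current behaviour (per this issue): self.use_rm = False, hardcoded.
        # Proposed: derive the flag from config so GenRM can be switched on.
        self.use_rm = reward_model.enable


# Default config keeps today's rule-based-only behaviour.
rule_based = RollouterSketch(RewardModelConfig())
# Opting in enables the reward-model path.
genrm = RollouterSketch(RewardModelConfig(enable=True))
```

Defaulting to `False` would keep existing fully async runs unchanged; the real PR would also need to connect the rollouter to the existing reward loop, which this sketch does not cover.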
Motivation
Fully async GRPO currently supports only rule-based rewards. To use a reward model that evaluates reasoning quality (GenRM / LLM-as-a-judge), you have to modify the source, even though the sync pipeline already supports it and the underlying infrastructure (reward loop, reward router, agent loop worker scoring) is already in place. It just isn't wired up.
I would prefer to manage my own judge model and not rely on external APIs.
Your contribution
Happy to submit the PR.