
Add support for MaxRL #5026

Open
catherinelee274 wants to merge 15 commits into huggingface:main from catherinelee274:clee_maxrl

Conversation


@catherinelee274 catherinelee274 commented Feb 9, 2026

What does this PR do?

Adds MaxRL, a variant of GRPO with p-normalization.
Fixes #5025

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a GitHub issue? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@catherinelee274 catherinelee274 changed the title "Add support for MaxRL" → "[WIP] Add support for MaxRL" Feb 17, 2026
@catherinelee274 catherinelee274 marked this pull request as ready for review February 17, 2026 06:25
@LeonEricsson
Collaborator

Doesn't MaxRL reduce to simply changing the advantage normalization denominator from std(r) to mean(r)?

# GRPO
A_i = (r_i - mean(r)) / (std(r) + eps)

# MaxRL
A_i = (r_i - mean(r)) / (mean(r) + eps)

If so, this fits naturally as a flag in the existing GRPO trainer rather than a dedicated experimental module.
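If that reading is right, the entire change can be sketched as a single branch on the denominator. A minimal sketch (the `grouped_advantages` helper and its `scale` argument are illustrative names, not the actual TRL API):

```python
import torch

def grouped_advantages(rewards: torch.Tensor, num_generations: int,
                       scale: str = "std", eps: float = 1e-4) -> torch.Tensor:
    """Per-group advantages: scale='std' matches GRPO, scale='mean' matches MaxRL."""
    grouped = rewards.view(-1, num_generations)        # (num_groups, num_generations)
    mean = grouped.mean(dim=1, keepdim=True)
    if scale == "std":
        denom = grouped.std(dim=1, keepdim=True)       # GRPO: divide by std(r)
    else:
        denom = mean                                   # MaxRL: divide by mean(r)
    return ((grouped - mean) / (denom + eps)).view(-1)
```

With binary rewards `[1, 0, 0]` and `scale="mean"`, this yields `(r_i - 1/3) / (1/3 + eps)`, matching the formula above.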

Collaborator

@LeonEricsson LeonEricsson left a comment


Would you mind writing a paper index section for MaxRL as well?

Comment on lines +1398 to +1424
def test_maxrl_advantage_normalization(self):
    """Unit test: MaxRL uses A_i = (r_i - mean(r)) / (mean(r) + eps), not std(r)."""
    # rewards for two groups of 3
    rewards = torch.tensor([1.0, 0.0, 0.0, 1.0, 1.0, 0.0])
    num_generations = 3

    mean_grouped = rewards.view(-1, num_generations).mean(dim=1)
    mean_grouped = mean_grouped.repeat_interleave(num_generations, dim=0)

    eps = 1e-4
    advantages = (rewards - mean_grouped) / (mean_grouped + eps)

    # group 0: mean=1/3, advantages = (r - 1/3) / (1/3 + eps)
    # group 1: mean=2/3, advantages = (r - 2/3) / (2/3 + eps)
    mean0 = torch.tensor(1.0 / 3.0)
    mean1 = torch.tensor(2.0 / 3.0)
    expected = torch.tensor(
        [
            (1.0 - mean0) / (mean0 + eps),
            (0.0 - mean0) / (mean0 + eps),
            (0.0 - mean0) / (mean0 + eps),
            (1.0 - mean1) / (mean1 + eps),
            (1.0 - mean1) / (mean1 + eps),
            (0.0 - mean1) / (mean1 + eps),
        ]
    )
    torch.testing.assert_close(advantages, expected)
Collaborator

Suggested change: delete this test.

This is fine as an offline test for the PR but doesn't need to be a part of the test suite

Author

Removed.

Comment on lines +1394 to +1396
# ------------------------------------------------------------------
# MaxRL tests (scale_rewards="mean")
# ------------------------------------------------------------------
Collaborator

Suggested change: delete this comment block.

Author

Removed

Comment on lines +1469 to +1493
def test_maxrl_training_with_eval(self):
    dataset = load_dataset("trl-internal-testing/zen", "standard_prompt_only", split="train")

    training_args = GRPOConfig(
        output_dir=self.tmp_dir,
        learning_rate=0.1,
        per_device_train_batch_size=3,
        num_generations=3,
        max_completion_length=8,
        scale_rewards="mean",
        eval_strategy="steps",
        eval_steps=2,
        report_to="none",
    )
    trainer = GRPOTrainer(
        model="trl-internal-testing/tiny-Qwen2ForCausalLM-2.5",
        reward_funcs="trl-internal-testing/tiny-Qwen2ForSequenceClassification-2.5",
        args=training_args,
        train_dataset=dataset,
        eval_dataset=dataset,
    )

    trainer.train()

    assert trainer.state.log_history[-1]["train_loss"] is not None
Collaborator

Suggested change: delete this test.

Author

Removed.

Comment on lines +1426 to +1439
def test_maxrl_advantage_zero_mean(self):
    """When all rewards in a group are 0, advantages should be 0 (not NaN)."""
    rewards = torch.tensor([0.0, 0.0, 0.0])
    num_generations = 3

    mean_grouped = rewards.view(-1, num_generations).mean(dim=1)
    mean_grouped = mean_grouped.repeat_interleave(num_generations, dim=0)

    eps = 1e-4
    advantages = (rewards - mean_grouped) / (mean_grouped + eps)

    # numerator is 0 for all, denominator is eps → advantages all 0
    assert not torch.isnan(advantages).any(), "advantages must not contain NaN"
    torch.testing.assert_close(advantages, torch.zeros(3))
Collaborator

Suggested change: delete this test.

Same for this one. These tests are tautological: they reimplement the behavior inline rather than exercising TRL code.

Author

Removed.

Comment on lines +1495 to +1527
def test_maxrl_training_multiple_reward_funcs(self):
    dataset = load_dataset("trl-internal-testing/zen", "standard_prompt_only", split="train")

    def reward_func1(completions, **kwargs):
        return [1.0] * len(completions)

    def reward_func2(completions, **kwargs):
        return [len(c) * 0.01 for c in completions]

    training_args = GRPOConfig(
        output_dir=self.tmp_dir,
        learning_rate=0.1,
        per_device_train_batch_size=3,
        num_generations=3,
        max_completion_length=8,
        scale_rewards="mean",
        report_to="none",
    )
    trainer = GRPOTrainer(
        model="trl-internal-testing/tiny-Qwen2ForCausalLM-2.5",
        reward_funcs=[reward_func1, reward_func2],
        args=training_args,
        train_dataset=dataset,
    )

    previous_trainable_params = {n: param.clone() for n, param in trainer.model.named_parameters()}

    trainer.train()

    assert trainer.state.log_history[-1]["train_loss"] is not None
    for n, param in previous_trainable_params.items():
        new_param = trainer.model.get_parameter(n)
        assert not torch.equal(param, new_param), f"Parameter {n} has not changed."
Collaborator

This seems like a duplicate of test_maxrl_training, since we're not actually verifying anything related to the multiple reward functions.

Author

Removed.

Comment on lines +1555 to +1581
def test_maxrl_training_conversational(self):
    dataset = load_dataset("trl-internal-testing/zen", "conversational_prompt_only", split="train")

    training_args = GRPOConfig(
        output_dir=self.tmp_dir,
        learning_rate=0.1,
        per_device_train_batch_size=3,
        num_generations=3,
        max_completion_length=8,
        scale_rewards="mean",
        report_to="none",
    )
    trainer = GRPOTrainer(
        model="trl-internal-testing/tiny-Qwen2ForCausalLM-2.5",
        reward_funcs="trl-internal-testing/tiny-Qwen2ForSequenceClassification-2.5",
        args=training_args,
        train_dataset=dataset,
    )

    previous_trainable_params = {n: param.clone() for n, param in trainer.model.named_parameters()}

    trainer.train()

    assert trainer.state.log_history[-1]["train_loss"] is not None
    for n, param in previous_trainable_params.items():
        new_param = trainer.model.get_parameter(n)
        assert not torch.equal(param, new_param), f"Parameter {n} has not changed."
Collaborator

Similar comment to test_maxrl_training_multiple_reward_funcs. I also don't see why the combination of MaxRL and conversational training needs a dedicated test; isn't this already covered by the existing conversational tests?

Author

Removed. Yes, perhaps MaxRL is too small a change to warrant new tests.

Comment on lines +1530 to +1553
def test_maxrl_training_peft(self):
    dataset = load_dataset("trl-internal-testing/zen", "standard_prompt_only", split="train")

    training_args = GRPOConfig(
        output_dir=self.tmp_dir,
        learning_rate=0.1,
        per_device_train_batch_size=3,
        num_generations=3,
        max_completion_length=8,
        scale_rewards="mean",
        report_to="none",
    )
    trainer = GRPOTrainer(
        model="trl-internal-testing/tiny-Qwen2ForCausalLM-2.5",
        reward_funcs="trl-internal-testing/tiny-Qwen2ForSequenceClassification-2.5",
        args=training_args,
        train_dataset=dataset,
        peft_config=LoraConfig(task_type="CAUSAL_LM"),
    )

    trainer.train()

    assert trainer.state.log_history[-1]["train_loss"] is not None
    assert isinstance(trainer.model, PeftModel)
Collaborator

Same thoughts as test_maxrl_training_multiple_reward_funcs and test_maxrl_training_conversational.

Author

Removed.

Remove test_maxrl_advantage_normalization and test_maxrl_advantage_zero_mean as they do not test TRL code
Remove test_maxrl_training_conversational
Collaborator

@LeonEricsson LeonEricsson left a comment


Final comments, then I'm satisfied. Needs a maintainer's approval before merging.

- Remove the # MaxRL: A_i = (r_i - mean(r)) / (mean(r) + eps) comment

- Remove the comment in grpo_trainer, since we already have a paper index


Development

Successfully merging this pull request may close these issues.

Maximum Likelihood Reinforcement Learning
