
Add support for MaxRL #5026

Open
catherinelee274 wants to merge 16 commits into huggingface:main from catherinelee274:clee_maxrl

Conversation


catherinelee274 commented on Feb 9, 2026

What does this PR do?

Adds MaxRL, a variant of GRPO with p-normalization.
Fixes #5025

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a GitHub issue? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.


Note

Medium Risk
Touches core GRPO advantage computation and validation logic; incorrect scaling/edge cases (e.g., low/zero mean rewards) could destabilize training despite added tests.

Overview
Adds an experimental MaxRL-style reward scaling option to GRPO by introducing scale_rewards="mean", which normalizes group advantages by the group mean (supported only with multi_objective_aggregation="sum_then_normalize").

Updates GRPOConfig docs/help text, extends trainer validation/advantage computation to handle the new mode, adds a new examples/scripts/maxrl.py end-to-end training script, and expands GRPO trainer tests to cover the new scaling option (including a dedicated MaxRL training smoke test).

Written by Cursor Bugbot for commit 4dbc89c. This will update automatically on new commits.
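
For illustration, a minimal sketch of how the option described in the overview might be used from a training script. The model, dataset, and reward function below are placeholders, and the GRPOConfig argument names (scale_rewards, multi_objective_aggregation) are taken from the overview rather than verified against the final diff:

from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

dataset = load_dataset("trl-lib/tldr", split="train")  # placeholder dataset

def reward_len(completions, **kwargs):
    # Toy binary-style reward: 1.0 if the completion is short, else 0.0.
    return [1.0 if len(c) < 200 else 0.0 for c in completions]

config = GRPOConfig(
    output_dir="maxrl-demo",
    scale_rewards="mean",                              # MaxRL-style: divide by the group mean
    multi_objective_aggregation="sum_then_normalize",  # required with scale_rewards="mean" per this PR
)

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",  # placeholder model
    reward_funcs=reward_len,
    args=config,
    train_dataset=dataset,
)
trainer.train()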

@catherinelee274 catherinelee274 changed the title Add support for MaxRL [WIP] Add support for MaxRL Feb 17, 2026
@catherinelee274 catherinelee274 marked this pull request as ready for review February 17, 2026 06:25
@LeonEricsson (Collaborator)

Doesn't MaxRL reduce to simply changing the advantage normalization denominator from std(r) to mean(r)?

# GRPO
A_i = (r_i - mean(r)) / (std(r) + eps)

# MaxRL
A_i = (r_i - mean(r)) / (mean(r) + eps)

If so, this fits naturally as a flag in the existing GRPO trainer rather than a dedicated experimental module.
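
To make the difference concrete, a small NumPy sketch of the two normalizations above; the reward values are made up, and the 1e-4 epsilon mirrors the trainer snippet quoted later in this thread:

import numpy as np

rewards = np.array([1.0, 0.0, 0.0, 0.0])  # made-up binary group rewards
eps = 1e-4

mean_r, std_r = rewards.mean(), rewards.std()

adv_grpo = (rewards - mean_r) / (std_r + eps)    # GRPO: normalize by group std
adv_maxrl = (rewards - mean_r) / (mean_r + eps)  # MaxRL: normalize by group mean

print(adv_grpo)   # ≈ [ 1.73, -0.58, -0.58, -0.58]
print(adv_maxrl)  # ≈ [ 3.00, -1.00, -1.00, -1.00]

On hard groups like this one, where the mean (success rate) is low, dividing by the mean scales the advantages up relative to dividing by the std; that amplification is the behavioral change the proposed flag introduces.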

@LeonEricsson (Collaborator) left a comment

Would you mind writing a paper index section for MaxRL as well?

Remove test_test_maxrl_advantage_normalization and test_maxrl_advantage_zero_mean, as they do not test TRL code.
 test_maxrl_training_conversational
@LeonEricsson (Collaborator) left a comment

final comments, then i'm satisfied.
needs a maintainer's approval before merging.

- Remove the # MaxRL: A_i = (r_i - mean(r)) / (mean(r) + eps) comment.

- Remove the comment in grpo_trainer, since we already have a paper index.
@cursor Cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 2 potential issues.


advantages = rewards - mean_grouped_rewards
if self.scale_rewards != "none":
    if self.scale_rewards == "mean":
        advantages = advantages / (mean_grouped_rewards + 1e-4)

Negative mean reward silently inverts advantage signs

Medium Severity

When scale_rewards="mean", advantages are divided by mean_grouped_rewards + 1e-4. Unlike std_rewards (always ≥ 0), mean_grouped_rewards can be negative when rewards are not binary (e.g., from a reward model). If the group mean is below -1e-4, the denominator becomes negative, silently flipping all advantage signs — the model would then reinforce bad completions and penalize good ones. The docs say this is for binary rewards, but no runtime validation enforces that constraint. Using abs(mean_grouped_rewards) or clamping to non-negative in the denominator would prevent this silent inversion.
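
As a sketch of the clamping idea suggested above (the function name is made up and this is not the PR's actual code):

import torch

def scale_by_group_mean(rewards: torch.Tensor, mean_grouped_rewards: torch.Tensor) -> torch.Tensor:
    # Clamp the denominator to be non-negative so a group with a negative mean
    # reward cannot silently flip the sign of every advantage.
    advantages = rewards - mean_grouped_rewards
    denom = torch.clamp(mean_grouped_rewards, min=0.0) + 1e-4
    return advantages / denom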


def extract_boxed(text: str) -> str | None:
    """Return the last \\boxed{...} content from text, or None if absent."""
    matches = re.findall(r"\\boxed\{([^}]*)\}", text)
    return matches[-1].strip() if matches else None

Regex fails to extract boxed content with nested braces

High Severity

The extract_boxed regex r"\\boxed\{([^}]*)\}" uses [^}]* which stops at the first closing brace. Math answers almost always contain nested braces (e.g. \boxed{\frac{1}{3}}), and this regex would extract only \frac{1 instead of \frac{1}{3}. Since the accuracy_reward function relies on extract_boxed for answer matching, correct answers with fractions, exponents, or any nested LaTeX will never receive a reward of 1.0, effectively making the training signal broken for the majority of math problems.
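
For reference, a brace-counting variant that handles nested braces; this is one possible fix sketched here, not the patch from this PR:

def extract_boxed_balanced(text: str) -> str | None:
    """Return the content of the last \\boxed{...}, tracking nested braces."""
    start = text.rfind("\\boxed{")
    if start == -1:
        return None
    i = start + len("\\boxed{")
    depth = 1
    content = []
    while i < len(text):
        ch = text[i]
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(content).strip()
        content.append(ch)
        i += 1
    return None  # unbalanced braces

# e.g. extract_boxed_balanced(r"answer: \boxed{\frac{1}{3}}") returns r"\frac{1}{3}"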



Development

Successfully merging this pull request may close these issues.

Maximum Likelihood Reinforcement Learning
