
Add support for MaxRL #5026

Open
catherinelee274 wants to merge 16 commits into huggingface:main from catherinelee274:clee_maxrl

Conversation


catherinelee274 commented on Feb 9, 2026

What does this PR do?

Adds MaxRL, a variant of GRPO with p-normalization.
Fixes #5025

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a GitHub issue? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.


Note

Medium Risk
Touches core GRPO advantage computation and validation logic; incorrect scaling/edge cases (e.g., low/zero mean rewards) could destabilize training despite added tests.

Overview
Adds an experimental MaxRL-style reward scaling option to GRPO by introducing scale_rewards="mean", which normalizes group advantages by the group mean (supported only with multi_objective_aggregation="sum_then_normalize").

Updates GRPOConfig docs/help text, extends trainer validation/advantage computation to handle the new mode, adds a new examples/scripts/maxrl.py end-to-end training script, and expands GRPO trainer tests to cover the new scaling option (including a dedicated MaxRL training smoke test).

Written by Cursor Bugbot for commit 4dbc89c. This will update automatically on new commits.
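
For illustration, a minimal sketch of how the option described in the overview might be used from a training script. The model, dataset, and reward function below are placeholders, and the GRPOConfig argument names (scale_rewards, multi_objective_aggregation) are taken from the overview rather than verified against the final diff:

from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

dataset = load_dataset("trl-lib/tldr", split="train")  # placeholder dataset

def reward_len(completions, **kwargs):
    # Toy binary-style reward: 1.0 if the completion is short, else 0.0.
    return [1.0 if len(c) < 200 else 0.0 for c in completions]

config = GRPOConfig(
    output_dir="maxrl-demo",
    scale_rewards="mean",                              # MaxRL-style: divide by the group mean
    multi_objective_aggregation="sum_then_normalize",  # required with scale_rewards="mean" per this PR
)

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",  # placeholder model
    reward_funcs=reward_len,
    args=config,
    train_dataset=dataset,
)
trainer.train()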

@catherinelee274 catherinelee274 changed the title Add support for MaxRL [WIP] Add support for MaxRL Feb 17, 2026
@catherinelee274 catherinelee274 marked this pull request as ready for review February 17, 2026 06:25
@LeonEricsson (Collaborator)

Doesn't MaxRL reduce to simply changing the advantage normalization denominator from std(r) to mean(r)?

# GRPO
A_i = (r_i - mean(r)) / (std(r) + eps)

# MaxRL
A_i = (r_i - mean(r)) / (mean(r) + eps)

If so, this fits naturally as a flag in the existing GRPO trainer rather than a dedicated experimental module.
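
To make the difference concrete, a small NumPy sketch of the two normalizations above; the reward values are made up, and the 1e-4 epsilon mirrors the trainer snippet quoted later in this thread:

import numpy as np

rewards = np.array([1.0, 0.0, 0.0, 0.0])  # made-up binary group rewards
eps = 1e-4

mean_r, std_r = rewards.mean(), rewards.std()

adv_grpo = (rewards - mean_r) / (std_r + eps)    # GRPO: normalize by group std
adv_maxrl = (rewards - mean_r) / (mean_r + eps)  # MaxRL: normalize by group mean

print(adv_grpo)   # ≈ [ 1.73, -0.58, -0.58, -0.58]
print(adv_maxrl)  # ≈ [ 3.00, -1.00, -1.00, -1.00]

On hard groups like this one, where the mean (success rate) is low, dividing by the mean scales the advantages up relative to dividing by the std; that amplification is the behavioral change the proposed flag introduces.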

@LeonEricsson (Collaborator) left a comment

Would you mind writing a paper index section for MaxRL as well?

Remove test_test_maxrl_advantage_normalization and test_maxrl_advantage_zero_mean, as they do not test TRL code.
 test_maxrl_training_conversational
@LeonEricsson (Collaborator) left a comment

final comments, then i'm satisfied.
needs a maintainer's approval before merging.

- Remove the # MaxRL: A_i = (r_i - mean(r)) / (mean(r) + eps) comment.

- Remove the comment in grpo_trainer, since we already have a paper index.
@cursor Cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 2 potential issues.


advantages = rewards - mean_grouped_rewards
if self.scale_rewards != "none":
    if self.scale_rewards == "mean":
        advantages = advantages / (mean_grouped_rewards + 1e-4)

Negative mean reward silently inverts advantage signs

Medium Severity

When scale_rewards="mean", advantages are divided by mean_grouped_rewards + 1e-4. Unlike std_rewards (always ≥ 0), mean_grouped_rewards can be negative when rewards are not binary (e.g., from a reward model). If the group mean is below -1e-4, the denominator becomes negative, silently flipping all advantage signs — the model would then reinforce bad completions and penalize good ones. The docs say this is for binary rewards, but no runtime validation enforces that constraint. Using abs(mean_grouped_rewards) or clamping to non-negative in the denominator would prevent this silent inversion.
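
As a sketch of the clamping idea suggested above (the function name is made up and this is not the PR's actual code):

import torch

def scale_by_group_mean(rewards: torch.Tensor, mean_grouped_rewards: torch.Tensor) -> torch.Tensor:
    # Clamp the denominator to be non-negative so a group with a negative mean
    # reward cannot silently flip the sign of every advantage.
    advantages = rewards - mean_grouped_rewards
    denom = torch.clamp(mean_grouped_rewards, min=0.0) + 1e-4
    return advantages / denom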


def extract_boxed(text: str) -> str | None:
    """Return the last \\boxed{...} content from text, or None if absent."""
    matches = re.findall(r"\\boxed\{([^}]*)\}", text)
    return matches[-1].strip() if matches else None

Regex fails to extract boxed content with nested braces

High Severity

The extract_boxed regex r"\\boxed\{([^}]*)\}" uses [^}]* which stops at the first closing brace. Math answers almost always contain nested braces (e.g. \boxed{\frac{1}{3}}), and this regex would extract only \frac{1 instead of \frac{1}{3}. Since the accuracy_reward function relies on extract_boxed for answer matching, correct answers with fractions, exponents, or any nested LaTeX will never receive a reward of 1.0, effectively making the training signal broken for the majority of math problems.
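
For reference, a brace-counting variant that handles nested braces; this is one possible fix sketched here, not the patch from this PR:

def extract_boxed_balanced(text: str) -> str | None:
    """Return the content of the last \\boxed{...}, tracking nested braces."""
    start = text.rfind("\\boxed{")
    if start == -1:
        return None
    i = start + len("\\boxed{")
    depth = 1
    content = []
    while i < len(text):
        ch = text[i]
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(content).strip()
        content.append(ch)
        i += 1
    return None  # unbalanced braces

# e.g. extract_boxed_balanced(r"answer: \boxed{\frac{1}{3}}") returns r"\frac{1}{3}"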



Development

Successfully merging this pull request may close these issues.

Maximum Likelihood Reinforcement Learning
