Conversation
Doesn't MaxRL reduce to simply changing the advantage normalization denominator from the group standard deviation to the group mean? If so, this fits naturally as a flag in the existing GRPO trainer rather than a dedicated experimental module.
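A minimal pure-Python sketch of the difference being asked about (illustrative only; the function and variable names are hypothetical, and TRL's actual implementation is vectorized over tensors):

```python
from statistics import mean, pstdev

def group_advantages(rewards, mode="std", eps=1e-4):
    # Both variants center rewards on the group mean; they differ only in
    # the denominator: group std in standard GRPO, group mean in MaxRL.
    m = mean(rewards)
    denom = (pstdev(rewards) if mode == "std" else m) + eps
    return [(r - m) / denom for r in rewards]

# Binary rewards, group mean 0.75
rewards = [1.0, 0.0, 1.0, 1.0]
```

Under this framing the change is one branch in the denominator, which is why it can live behind a `scale_rewards` flag rather than in a separate trainer.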
LeonEricsson
left a comment
Would you mind writing a paper index section for MaxRL as well?
Remove test_maxrl_advantage_normalization and test_maxrl_advantage_zero_mean, as they do not test TRL code.
test_maxrl_training_conversational
LeonEricsson
left a comment
Final comments, then I'm satisfied.
Needs a maintainer's approval before merging.
- Remove the # MaxRL: A_i = (r_i - mean(r)) / (mean(r) + eps) comment in grpo_trainer, since we already have a paper index.
Cursor Bugbot has reviewed your changes and found 2 potential issues.
advantages = rewards - mean_grouped_rewards
if self.scale_rewards != "none":
    if self.scale_rewards == "mean":
        advantages = advantages / (mean_grouped_rewards + 1e-4)
Negative mean reward silently inverts advantage signs
Medium Severity
When scale_rewards="mean", advantages are divided by mean_grouped_rewards + 1e-4. Unlike std_rewards (always ≥ 0), mean_grouped_rewards can be negative when rewards are not binary (e.g., from a reward model). If the group mean is below -1e-4, the denominator becomes negative, silently flipping all advantage signs — the model would then reinforce bad completions and penalize good ones. The docs say this is for binary rewards, but no runtime validation enforces that constraint. Using abs(mean_grouped_rewards) or clamping to non-negative in the denominator would prevent this silent inversion.
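A minimal sketch of the suggested guard (pure Python for illustration; TRL's code operates on tensors, and the function name here is hypothetical):

```python
def scale_by_mean(rewards, group_mean, eps=1e-4):
    # Hypothetical fix sketch: clamp the group mean to non-negative before
    # using it as a denominator, so a negative mean cannot silently flip
    # the sign of every advantage in the group.
    denom = max(group_mean, 0.0) + eps
    return [(r - group_mean) / denom for r in rewards]
```

With a group mean of -1.0, the unclamped denominator would be negative and invert every sign; the clamp keeps above-mean completions positive and below-mean completions negative.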
def extract_boxed(text: str) -> str | None:
    """Return the last \\boxed{...} content from text, or None if absent."""
    matches = re.findall(r"\\boxed\{([^}]*)\}", text)
    return matches[-1].strip() if matches else None
Regex fails to extract boxed content with nested braces
High Severity
The extract_boxed regex r"\\boxed\{([^}]*)\}" uses [^}]* which stops at the first closing brace. Math answers almost always contain nested braces (e.g. \boxed{\frac{1}{3}}), and this regex would extract only \frac{1 instead of \frac{1}{3}. Since the accuracy_reward function relies on extract_boxed for answer matching, correct answers with fractions, exponents, or any nested LaTeX will never receive a reward of 1.0, effectively making the training signal broken for the majority of math problems.


What does this PR do?
Adds MaxRL, a variant of GRPO with p-normalization.
Fixes #5025
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.
Note
Medium Risk
Touches core GRPO advantage computation and validation logic; incorrect scaling/edge cases (e.g., low/zero mean rewards) could destabilize training despite added tests.
Overview
Adds an experimental MaxRL-style reward scaling option to GRPO by introducing scale_rewards="mean", which normalizes group advantages by the group mean (supported only with multi_objective_aggregation="sum_then_normalize"). Updates GRPOConfig docs/help text, extends trainer validation/advantage computation to handle the new mode, adds a new examples/scripts/maxrl.py end-to-end training script, and expands GRPO trainer tests to cover the new scaling option (including a dedicated MaxRL training smoke test).
Written by Cursor Bugbot for commit 4dbc89c.