Add PSPO trust region method as alternative to clipping in GRPOTrainer#4548
Add PSPO trust region method as alternative to clipping in GRPOTrainer#4548MCDwyer wants to merge 10 commits intohuggingface:mainfrom
Conversation
|
thanks! can you apply the style ( and also, it would be nice to have a test case for this. |
|
I've applied the style, added the paper documentation in docs/source/gr_pspo.md, and added a test in tests/test_grpo_trainer.py. Please let me know if there is anything else I need to do? |
|
Oh, sorry, maybe I wasn't clear. You need to add a section to this part of the documentation: https://github.com/huggingface/trl/blob/main/docs/source/paper_index.md |
|
Sorry, I had misunderstood, thank you for clarifying. I've moved the documentation to a section in the paper_index.md |
|
Hey, just checking in, is there anything else I need to do? |
|
Hi, I found in a backlog of papers PSPO and while checking if it was already implemented, found your PR, nice! When this gets merged I wonder if we could simplify/merge the logic with SAPO, since both are soft trust region methods. |
|
Hi, yes I think one way to do this could be to move SAPO to a trust region method rather than as a loss, and have it that if SAPO is picked for the trust region method it uses the GRPO loss? I think that could disentangle the trust region methods from the loss types, whilst keeping the implementation of SAPO essentially the same? I did this for a comparison as I was using an older trl version which didn't have the SAPO implementation yet, and I changed the _compute_loss to have this: I could add this change into this PR if that would be useful? |
|
When SAPO was implemented, it was the first soft trust region method in TRL GRPO, if I'm not mistaken, hence this choice to keep "per loss" I presume. But if more variants are added, that solely differ from a trust region standpoint (also seen this one #5027), I agree it might be good to re-evaluate at some point. In any case, I don't want to give you extra work without approval. Ultimately, the decision is up to @kashif since he's assigned. It's not a big deal anyway, I can always open a PR later, once this is merged. Btw concerning your snippet, I simplified SAPO in this PR #4956 since. |
What does this PR do?
Adds PSPO (Probability Smoothing Policy Optimisation) as an alternative trust-region method to
GRPOTrainer. PSPO smooths probabilities toward the behaviour policy instead of using ratio clipping.Paper: https://arxiv.org/abs/2509.21282
Changes:
trust_region_methodparameter toGRPOConfig(default:"clip")smoothing_alphaparameter for PSPO (default:0.1)GRPOTrainerBefore submitting
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.
@kashif