feat(grpo_trainer.py): Variational Sequence-Level Soft Policy Optimization (VESPO) #5199
casinca wants to merge 7 commits into huggingface:main
I owe some better explanations to facilitate the review concerning importing. From the original implementation below, the author is recomputing the …

In order to avoid a 2nd log op in TRL, I'm directly clamping in logspace. This is solely to follow the original implementation, otherwise I'm not really sure if reducing from … If keeping the original logic and importing …
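The log-space clamping trick mentioned above can be illustrated with a minimal, framework-free sketch (function names and values here are my own, not from the PR): clamping a probability ratio and then taking its log gives the same result as clamping the log-ratio against `log(clip)` directly, which saves one transcendental op per element.

```python
import math

def log_of_clamped_ratio(log_ratio: float, clip: float) -> float:
    # "Recompute"-style order of ops: exponentiate, clamp, then log again
    # (two transcendental ops per element).
    return math.log(min(math.exp(log_ratio), clip))

def clamp_in_logspace(log_ratio: float, clip: float) -> float:
    # Equivalent form avoiding the 2nd log op: clamp directly in log-space,
    # since log(min(exp(x), c)) == min(x, log(c)) for c > 0.
    return min(log_ratio, math.log(clip))
```

Both forms agree to floating-point precision whether or not the clamp is active, which is the kind of equivalence a unit test against the official implementation would check.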

What does this PR do?
This PR implements the VESPO loss, resolves #5196
Official implementation: https://github.com/FloyedShen/VESPO/blob/main/recipe/vespo/code/core_algos.py
Paper: https://huggingface.co/papers/2602.10693
Note:
- The paper and the official implementation can use different variable names.
- To make things clearer, docstrings/comments are a mix of the official impl and my writing.
Alternative options:
- `k_pos`, `lambda_pos`, `k_neg`, `lambda_neg`: I could reduce these with 2 tuples of 2 floats, e.g. `lambdas` (pos, neg), if it's better.
- `w_seq`: I can include it in metrics, but this would force me to return a tuple in `get_gamma_weights` or remove `@staticmethod`. Not sure here what's the preference.

For efficiency, the TRL VESPO implementation is slightly different than the official one. It's ~25% faster per call on GPU, and tested for equivalence.
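The parameter-grouping alternative could look like the sketch below (hypothetical: field names follow the PR description, but the dataclass names and default values are placeholders I made up, not TRL API):

```python
from dataclasses import dataclass

@dataclass
class VespoArgsFlat:
    # Current shape: four separate floats, as listed in the PR description.
    k_pos: float = 1.0
    lambda_pos: float = 1.0
    k_neg: float = 1.0
    lambda_neg: float = 1.0

@dataclass
class VespoArgsGrouped:
    # Alternative shape: two (pos, neg) tuples, as suggested above.
    ks: tuple[float, float] = (1.0, 1.0)
    lambdas: tuple[float, float] = (1.0, 1.0)

def to_grouped(flat: VespoArgsFlat) -> VespoArgsGrouped:
    # The mapping between the two shapes is lossless either way.
    return VespoArgsGrouped(
        ks=(flat.k_pos, flat.k_neg),
        lambdas=(flat.lambda_pos, flat.lambda_neg),
    )
```

The grouped form keeps the pos/neg pairing explicit at the call site, at the cost of slightly less greppable individual names.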
Before submitting
- Did you read the contributor guideline, Pull Request section?
- Was this discussed/approved via a GitHub issue? Please add a link to it if that's the case.
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.