Add SDPO (Self-Distillation Policy Optimization) trainer #4935
MengAiDev wants to merge 12 commits into huggingface:main
Conversation
Implements the SDPO algorithm from arxiv.org/abs/2601.20802. SDPO augments on-policy optimization with self-distillation from the model's own high-reward trajectories, converting tokenized feedback into a dense learning signal.

- Add SDPOConfig with distillation parameters (alpha, topk, ema_update_rate, etc.)
- Add SDPOTrainer extending GRPOTrainer with a self-distillation loss
- Add comprehensive tests for SDPOConfig and SDPOTrainer
- Add an example script demonstrating SDPO usage
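For context, here is a minimal, framework-free sketch of what such a dense distillation signal can look like: a forward KL between the teacher's distribution truncated to its top-k tokens and the student's distribution, scaled by alpha. This is illustrative only, not the PR's implementation; the names `alpha` and `topk` are borrowed from the SDPOConfig fields, and the real trainer operates on batched PyTorch tensors rather than flat lists.

```python
import math

def self_distillation_loss(student_logits, teacher_logits, alpha=0.5, topk=4):
    """Per-token distillation signal: forward KL(teacher || student) restricted
    to the teacher's top-k tokens, scaled by alpha. Logits are flat lists over
    the vocabulary (a single-token sketch of the batched tensor version)."""
    def log_softmax(xs):
        m = max(xs)
        z = math.log(sum(math.exp(x - m) for x in xs)) + m
        return [x - z for x in xs]

    # Indices of the teacher's top-k logits form the truncated target support.
    top_idx = sorted(range(len(teacher_logits)),
                     key=lambda i: teacher_logits[i], reverse=True)[:topk]
    # Teacher probabilities renormalized over the top-k support.
    t_log = log_softmax([teacher_logits[i] for i in top_idx])
    # Student log-probabilities over the full vocab, gathered at the same indices.
    s_log_full = log_softmax(student_logits)
    s_log = [s_log_full[i] for i in top_idx]
    # Forward KL over the truncated support; zero when the two agree exactly.
    kl = sum(math.exp(t) * (t - s) for t, s in zip(t_log, s_log))
    return alpha * kl
```

With identical student and teacher logits (and topk covering the full vocab) the loss is zero, so the distillation term only pulls the student toward the teacher where they disagree.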
@MengAiDev I have cleaned up the structure, docs, and tests. Next we need to address the main TODOs regarding the teacher logits.
cc @jonhue here is a port of SDPO for TRL
@MengAiDev @kashif Thanks so much for implementing this!! Let's coordinate with @Shekswess and #4941. It might be cleanest to have a single implementation for SDFT & SDPO ("self-distillation"), since the two are algorithmically the same and differ only in whether the data is offline or online.
Agree! Let's try that if it's OK with you @MengAiDev
Wohoo!
Regarding the discussion on how to combine the SDFT/SDPO PRs: this PR inherits from GRPOTrainer, while the SDFT PR modifies it in place. Both approaches carry baggage from GRPOTrainer that isn’t necessarily applicable to SDPO/SDFT, but this also provides a nice playground for experimentation. The tradeoff with inheritance is less control, but I like how it nicely isolates SDPO’s key contributions and exposes the relevant hparams clearly. If future research demands more flexibility, we can revisit and consider breaking SDPO out into its own trainer. If we proceed with this PR’s approach, extending it to cover the offline case should, at first glance, only require modifying the _build_teacher_inputs function.
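To make the "only modify _build_teacher_inputs" idea concrete, here is a hypothetical sketch of what such an override could look like. The real method's signature and return type in the PR may differ; the stub classes below stand in for SDPOTrainer purely to illustrate the online/offline split, and the string concatenation is a placeholder for actual tokenized inputs.

```python
class OnlineSDPOSketch:
    """Stand-in for the trainer's online path (not the PR's actual class)."""

    def _build_teacher_inputs(self, prompts, completions):
        # Online case: the teacher distills from the model's own
        # high-reward rollouts, paired prompt + completion.
        return [p + c for p, c in zip(prompts, completions)]


class OfflineSDPOSketch(OnlineSDPOSketch):
    """Hypothetical offline (SDFT-style) variant: same trainer, fixed data."""

    def __init__(self, demonstrations):
        # demonstrations: mapping from prompt to a pre-collected target.
        self.demonstrations = demonstrations

    def _build_teacher_inputs(self, prompts, completions):
        # Offline case: ignore the fresh rollouts and distill from the
        # stored demonstrations instead, leaving the rest of the loop intact.
        return [p + self.demonstrations[p] for p in prompts]
```

If this shape holds, the offline/online distinction stays confined to one hook, which would support the single-implementation plan discussed above.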
That's a good point, Leon. I need to review the PR carefully, but in general I'd rather isolate first and abstract later, if needed (abstractions are easy to do, hard to undo).
@qgallouedec @LeonEricsson if you look at my implementation of the offline SDFT in #4941 (comment), I think it can be improved a lot; I tried to follow the official code from the authors with small modifications. Feel free to ping us on how we can make this better. Cannot wait to start experimenting hehehehe
Any progress on this is much appreciated |
Fixes #4929