
Add SDPO (Self-Distillation Policy Optimization) trainer#4935

Open
MengAiDev wants to merge 12 commits into huggingface:main from MengAiDev:4929

Conversation

@MengAiDev
Contributor

Implements SDPO algorithm from arxiv.org/abs/2601.20802. SDPO augments on-policy optimization with self-distillation from the model's own high-reward trajectories, converting tokenized feedback into a dense learning signal.

  • Add SDPOConfig with distillation parameters (alpha, topk, ema_update_rate, etc.)
  • Add SDPOTrainer extending GRPOTrainer with self-distillation loss
  • Add comprehensive tests for SDPOConfig and SDPOTrainer
  • Add example script demonstrating SDPO usage

Fixes #4929
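A minimal sketch of how the pieces described above could fit together. The field names mirror the parameters listed in this PR (`alpha`, `topk`, `ema_update_rate`), but the `SDPOConfig` shape and the loss mixing below are assumptions for illustration, not the merged API:

```python
import math
from dataclasses import dataclass


@dataclass
class SDPOConfig:
    # Hypothetical fields mirroring the parameter names listed in this PR.
    alpha: float = 0.5             # weight of the self-distillation term
    topk: int = 4                  # high-reward trajectories kept as teachers
    ema_update_rate: float = 0.01  # EMA rate for refreshing the teacher copy


def sdpo_loss(pg_loss, student_logprobs, teacher_logprobs, alpha):
    """Mix the on-policy loss with a per-token KL toward the teacher.

    Uses KL(teacher || student) ~= sum_t p_t * (log p_t - log q_t), computed
    here from sampled-token log-probabilities only (a simplification; the
    real trainer would distill over full logits).
    """
    kl = sum(
        math.exp(t) * (t - s)
        for t, s in zip(teacher_logprobs, student_logprobs)
    ) / max(len(teacher_logprobs), 1)
    return (1 - alpha) * pg_loss + alpha * kl
```

When teacher and student agree, the distillation term vanishes and the loss reduces to the alpha-scaled policy-gradient loss, which is the dense-signal mixing the description above refers to.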

MengAiDev and others added 9 commits January 30, 2026 10:04
@kashif
Collaborator

kashif commented Feb 2, 2026

@MengAiDev I have cleaned up the structure, docs, and tests. Next we need to address the main TODOs regarding the teacher logits.

@kashif
Collaborator

kashif commented Feb 2, 2026

cc @jonhue here is a port of SDPO for TRL

@jonhue

jonhue commented Feb 2, 2026

@MengAiDev @kashif Thanks so much for implementing this!! Let's coordinate with @Shekswess and #4941. It might be cleanest to have one implementation for SDFT & SDPO ("self-distillation") since both are algorithmically the same and they differ only in whether data is offline or online.
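The online/offline distinction described above can be isolated in a single selection step. A hypothetical sketch of that idea (function and argument names are illustrative, taken from neither PR):

```python
def select_teacher_data(online, rollouts=None, rollout_rewards=None,
                        offline_dataset=None, topk=2):
    """Pick teacher trajectories for self-distillation.

    Online (SDPO-style): keep the top-k highest-reward rollouts from the
    current batch, so the model distills from its own best generations.
    Offline (SDFT-style): use a fixed expert dataset instead.
    The distillation loss downstream is identical in both cases.
    """
    if online:
        ranked = sorted(
            zip(rollout_rewards, rollouts), key=lambda p: p[0], reverse=True
        )
        return [traj for _, traj in ranked[:topk]]
    return list(offline_dataset)
```

Keeping the data source behind one flag like this is what would let a single "self-distillation" trainer cover both algorithms.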

@kashif
Collaborator

kashif commented Feb 2, 2026

Agree! Let's try that if it's OK for you @MengAiDev

@Shekswess

Wohoo! This is really awesome, bravo legends @kashif @jonhue @MengAiDev. Maybe we should then also have the offline version of the trainer, so that folks who are GPU poor (like me, hahahaha) can experiment with these approaches.

@LeonEricsson
Collaborator

Regarding the discussion on how to combine SDFT/SDPO PRs:

This PR inherits from GRPOTrainer, while the SDFT PR modifies it in place. Both approaches carry baggage from GRPOTrainer that isn’t necessarily applicable to SDPO/SDFT — but this also provides a nice playground for experimentation.

The tradeoff with inheritance is less control, but I like how it nicely isolates SDPO’s key contributions and exposes relevant hparams clearly. If future research demands more flexibility, we can revisit and consider breaking out SDPO into its own trainer.

If we proceed with this PR's approach, extending it to cover the offline case should, at first glance, just require modifying the _build_teacher_inputs function.
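If that route is taken, the offline extension could hypothetically look like the sketch below. `_build_teacher_inputs` is named in this thread, but its signature and the surrounding class here are stand-ins for the PR's internals, not actual code:

```python
class OfflineSelfDistillMixin:
    """Hypothetical mixin: swap online teacher rollouts for a fixed dataset.

    In the online (SDPO) case the trainer would build teacher inputs from
    the model's own high-reward generations; this override of that single
    hook draws from an offline expert dataset instead (the SDFT setting).
    """

    def __init__(self, offline_dataset):
        self.offline_dataset = list(offline_dataset)
        self._cursor = 0

    def _build_teacher_inputs(self, batch_size):
        # Cycle through the offline dataset instead of ranking rollouts.
        out = []
        for _ in range(batch_size):
            idx = self._cursor % len(self.offline_dataset)
            out.append(self.offline_dataset[idx])
            self._cursor += 1
        return out
```

Because only the one hook changes, the rest of the trainer (loss, EMA teacher, hparams) would be shared between the online and offline variants.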

@qgallouedec
Member

That's a good point, Leon. I need to review the PR carefully, but in general I'd rather isolate first and abstract later, if needed (abstractions are easy to do, hard to undo).

@Shekswess

@qgallouedec @LeonEricsson if you look at my implementation of the offline SDFT in #4941 (comment), I think it can be improved quite a bit. I tried to follow the authors' official code with small modifications. Feel free to ping us on how we can make this better. Can't wait to start experimenting hehehe

@niksdagr8

Any progress on this is much appreciated


Successfully merging this pull request may close these issues.

SDPO: Reinforcement Learning via Self-Distillation
