Add SDPO (Self-Distillation Policy Optimization) trainer #4935
MengAiDev wants to merge 12 commits into huggingface:main
Conversation
Implements the SDPO algorithm from arxiv.org/abs/2601.20802. SDPO augments on-policy optimization with self-distillation from the model's own high-reward trajectories, converting tokenized feedback into a dense learning signal.

- Add SDPOConfig with distillation parameters (alpha, topk, ema_update_rate, etc.)
- Add SDPOTrainer extending GRPOTrainer with a self-distillation loss
- Add comprehensive tests for SDPOConfig and SDPOTrainer
- Add an example script demonstrating SDPO usage
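For context, here is a minimal, framework-free sketch of what such a dense distillation signal can look like: a forward KL between the teacher's distribution truncated to its top-k tokens and the student's distribution, scaled by alpha. This is illustrative only, not the PR's implementation; the names `alpha` and `topk` are borrowed from the SDPOConfig fields, and the real trainer operates on batched PyTorch tensors rather than flat lists.

```python
import math

def self_distillation_loss(student_logits, teacher_logits, alpha=0.5, topk=4):
    """Per-token distillation signal: forward KL(teacher || student) restricted
    to the teacher's top-k tokens, scaled by alpha. Logits are flat lists over
    the vocabulary (a single-token sketch of the batched tensor version)."""
    def log_softmax(xs):
        m = max(xs)
        z = math.log(sum(math.exp(x - m) for x in xs)) + m
        return [x - z for x in xs]

    # Indices of the teacher's top-k logits form the truncated target support.
    top_idx = sorted(range(len(teacher_logits)),
                     key=lambda i: teacher_logits[i], reverse=True)[:topk]
    # Teacher probabilities renormalized over the top-k support.
    t_log = log_softmax([teacher_logits[i] for i in top_idx])
    # Student log-probabilities over the full vocab, gathered at the same indices.
    s_log_full = log_softmax(student_logits)
    s_log = [s_log_full[i] for i in top_idx]
    # Forward KL over the truncated support; zero when the two agree exactly.
    kl = sum(math.exp(t) * (t - s) for t, s in zip(t_log, s_log))
    return alpha * kl
```

With identical student and teacher logits (and topk covering the full vocab) the loss is zero, so the distillation term only pulls the student toward the teacher where they disagree.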
@MengAiDev I have cleaned up the structure, docs, and tests. Next we need to address the main TODOs regarding the teacher logits.
cc @jonhue here is a port of SDPO for TRL
@MengAiDev @kashif Thanks so much for implementing this!! Let's coordinate with @Shekswess and #4941. It might be cleanest to have a single implementation for SDFT & SDPO ("self-distillation"), since the two are algorithmically the same and differ only in whether the data is offline or online.
Agree! Let's try that if it's OK with you @MengAiDev
Wohoo!
Regarding the discussion on how to combine the SDFT/SDPO PRs: this PR inherits from GRPOTrainer, while the SDFT PR modifies it in place. Both approaches carry baggage from GRPOTrainer that isn’t necessarily applicable to SDPO/SDFT, but this also provides a nice playground for experimentation. The tradeoff with inheritance is less control, but I like how it nicely isolates SDPO’s key contributions and exposes the relevant hparams clearly. If future research demands more flexibility, we can revisit and consider breaking SDPO out into its own trainer. If we proceed with this PR’s approach, extending it to cover the offline case should, at first glance, only require modifying the _build_teacher_inputs function.
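To make the "only modify _build_teacher_inputs" idea concrete, here is a hypothetical sketch of what such an override could look like. The real method's signature and return type in the PR may differ; the stub classes below stand in for SDPOTrainer purely to illustrate the online/offline split, and the string concatenation is a placeholder for actual tokenized inputs.

```python
class OnlineSDPOSketch:
    """Stand-in for the trainer's online path (not the PR's actual class)."""

    def _build_teacher_inputs(self, prompts, completions):
        # Online case: the teacher distills from the model's own
        # high-reward rollouts, paired prompt + completion.
        return [p + c for p, c in zip(prompts, completions)]


class OfflineSDPOSketch(OnlineSDPOSketch):
    """Hypothetical offline (SDFT-style) variant: same trainer, fixed data."""

    def __init__(self, demonstrations):
        # demonstrations: mapping from prompt to a pre-collected target.
        self.demonstrations = demonstrations

    def _build_teacher_inputs(self, prompts, completions):
        # Offline case: ignore the fresh rollouts and distill from the
        # stored demonstrations instead, leaving the rest of the loop intact.
        return [p + self.demonstrations[p] for p in prompts]
```

If this shape holds, the offline/online distinction stays confined to one hook, which would support the single-implementation plan discussed above.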
That's a good point, Leon. I need to review the PR carefully, but in general I'd rather isolate first and abstract later, if needed (abstractions are easy to do, hard to undo).
@qgallouedec @LeonEricsson if you look at my implementation of the offline SDFT in #4941 (comment), I think it can be improved a lot; I tried to follow the official code from the authors with small modifications. Feel free to ping us on how we can make this better. Cannot wait to start experimenting hehehehe
Any progress on this is much appreciated |
Fixes #4929