Add validation for conversational prompts in multimodal training#5067
Merged
qgallouedec merged 6 commits intomainfrom Feb 17, 2026
Merged
Add validation for conversational prompts in multimodal training#5067qgallouedec merged 6 commits intomainfrom
qgallouedec merged 6 commits intomainfrom
Conversation
5 tasks
|
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update. |
|
This addresses the error message for string prompts, but the dtype mismatch (expected scalar type BFloat16 but found Float in layer_norm) reported in #4451 comments is a separate crash that still happens even with conversational prompts. I have a fix for that want me to open a separate PR for it, or should I add it here? |
Member
Author
|
Yes please open a separate PR 🙏 |
|
on it |
5 tasks
Member
Author
|
@codex review |
1 similar comment
Member
Author
|
@codex review |
qgallouedec
added a commit
to kansalaman/trl
that referenced
this pull request
Mar 3, 2026
commit 489331e703e1e8d39534957f465fadce7f00ff99
Author: Quentin Gallouédec <gallouedec.quentin@gmail.com>
Date: Tue Mar 3 14:50:42 2026 +0000
Replace deprecated asyncio.iscoroutinefunction with inspect.iscoroutinefunction in RLOO/GRPO trainers
commit 484c1c1acf0b437c20e230d5e135613daf1a59fa
Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Date: Tue Mar 3 08:42:04 2026 -0600
CI: Add Qwen 3.5 tiny model to tests (#5204)
commit 7eebb294a9175ea2f0ffbf20cf759f772491d815
Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Date: Tue Mar 3 07:35:22 2026 +0100
Decouple rollout dispatch from vLLM backend in GRPO _generate_single_turn (#5122)
Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com>
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Co-authored-by: swappy <59965507+rycerzes@users.noreply.github.com>
commit 0bf875c0cbb879c4b264f66a6e556769d42e2f52
Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Date: Mon Mar 2 15:54:35 2026 +0100
Mark CI test_training_vlm_and_liger as xfail (#5202)
commit 7544c3a784147dbfc53bb1314558137320ecc3ed
Author: Michael Royzen <45830328+michaelroyzen@users.noreply.github.com>
Date: Fri Feb 27 14:42:57 2026 -0500
Support sequence sampling in Liger Kernel and pass importance_samplin… (#5190)
Co-authored-by: Michael Royzen <michaelroyzen@mac.mynetworksettings.com>
Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>
commit 5cffd59a8a814b9132c6d08e5aa88347a41c66e3
Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Date: Fri Feb 27 15:43:33 2026 +0100
Set CI PYTORCH_ALLOC_CONF env variable to avoid OOM (#5197)
commit eb8b8a510b3ee0e7e83e33f8cfbb6eada8eb7f34
Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Date: Fri Feb 27 08:11:51 2026 -0600
Re-add liger-kernel to dev deps (#5164)
Co-authored-by: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
commit 68f807b5e2ba4994898a7ef21ba631b64fb7c4b5
Author: Zhenkun Cai <zekucai@gmail.com>
Date: Fri Feb 27 05:11:51 2026 -0800
Add `pad_to_multiple_of` to GRPOTrainer and RLOOTrainer (#5180)
commit e53c98feb463c0897451b307432360c1616a8905
Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Date: Fri Feb 27 13:34:17 2026 +0100
Fix CI tests patching BaseTrainer (#5192)
commit bd2d21e02cc722221c0c7f91f4ddc7cbd9d271fa
Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Date: Fri Feb 27 08:27:22 2026 +0100
Refactor CLI [6/N]: Refactor env/vllm-serve commands with delayed imports (#5187)
commit e63cd79c68fc62edf63f01904ee02b0e63ab4336
Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Date: Fri Feb 27 08:25:30 2026 +0100
Refactor CLI [5/N]: Refactor TrainingCommand with delayed imports (#5186)
commit e941ff58121d382b470f8c8011dd76088192c46b
Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Date: Fri Feb 27 08:23:38 2026 +0100
Fix deprecation warning of fork in multi-threaded process (#5185)
commit b9263efa25e05ebf1c8c1525a9d5a6a7e94efbb2
Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Date: Fri Feb 27 08:22:58 2026 +0100
Fix deprecation warning of create_reference_model (#5184)
commit 410c00bfaead36b0048921a123739bd0cb4c3e7c
Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Date: Thu Feb 26 10:38:56 2026 -0600
Align documentation with the intended public API (#5162)
commit 519225384f9aaa7acf3959fbf6a218c2490d4a0e
Author: Sergio Paniego Blanco <sergiopaniegoblanco@gmail.com>
Date: Thu Feb 26 15:44:50 2026 +0100
Add minimal CARLA example script (#5161)
commit 64b47513982e2845c8cb6f4d5d611037f605d9bf
Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Date: Thu Feb 26 11:11:52 2026 +0100
Refactor CLI [4/N]: Replace top-level TrlParser with ArgumentParser (#5170)
commit f00379fa221689d67a3736c44eaf07137c11d5f9
Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Date: Thu Feb 26 10:02:45 2026 +0100
Make _BaseConfig and _BaseTrainer explicitly private (#5169)
commit eb973af2d1109c84600c7fdddf259e06a547f583
Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Date: Thu Feb 26 09:11:09 2026 +0100
Document parameters with differing default values in core configs (#5168)
commit b2b3045dfe3a3b6a0c52785b055b60e9a1a0e73b
Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Date: Thu Feb 26 07:56:32 2026 +0100
Handle mm_token_type_ids in SFT/GRPO/RLOO to fix IndexError (#5178)
commit 27e3e2ff68929b25045caf8af32799b2e1dc3965
Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Date: Wed Feb 25 16:03:45 2026 -0600
⬆️ Bump dev version (#5182)
commit d24e19424da2837d435a7884c0b307b605413829
Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Date: Wed Feb 25 15:56:26 2026 -0600
Release: v0.29 (#5181)
commit 70cf097fb8a39b8ad86aa6e27d49f081e96da4a5
Author: LeonEricsson <70749762+LeonEricsson@users.noreply.github.com>
Date: Wed Feb 25 22:01:16 2026 +0100
feature: Configurable num logprobs in vLLM generation (#5107)
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com>
commit 57d749336487d7ece06e58b941e4180f13649d8f
Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Date: Wed Feb 25 11:17:13 2026 -0600
Rename input keys in `RewardTrainer` collator from `chosen/rejected_input_ids` to `chosen/rejected_ids` (#5179)
commit a0d7d8e1257dea15fae6df434285958d22ce9c4e
Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Date: Wed Feb 25 17:31:50 2026 +0100
Update upstream tracking info about CI PyTorch JIT deprecation warnings (#5166)
commit 51fdc53e08b0ee39b65ba699fb49281d183701ce
Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Date: Wed Feb 25 17:16:55 2026 +0100
Document parameters with differing default values in experimental configs (#5172)
commit dd15cbb04a47c8efb4c8ed13e315dc4f2e1f853e
Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Date: Wed Feb 25 17:07:45 2026 +0100
Fix default learning_rate in BCO according to paper (#5173)
commit 0b2cd5c04e26e13358413c00e98a56e2c2914eb9
Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Date: Wed Feb 25 16:42:43 2026 +0100
Accept mm_token_type_ids in GRPO/RLOO _get_per_token_logps_and_entropies (#5176)
commit 95cedba36e5e015f9402bb997529337d6c90b0bb
Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Date: Wed Feb 25 16:39:36 2026 +0100
Fix default learning_rate in PPO according to paper (#5174)
commit 6d78858d176b9fb385b6d0f332d369e1ee2e27fb
Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Date: Wed Feb 25 16:26:16 2026 +0100
Fix experimental TestUpdateWithReplayBuffer: ValueError: `train_dataset` is required (#5171)
commit 0efaec33fbd3445eb1142c306e797940fad4de28
Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Date: Wed Feb 25 09:25:39 2026 -0600
Revert changes in vLLM client/server (#5165)
commit e540d687f8df6f3596fa6eb3cc50116b41d58f42
Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Date: Wed Feb 25 10:48:43 2026 +0100
Refactor CLI [3/N]: Self-contain VllmServeCommand argument parsing (#5160)
commit 9cc95a97927e59c3532ce2be3babcfd8a35adcd9
Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Date: Wed Feb 25 09:57:53 2026 +0100
Refactor CLI [2/N]: Move accelerate concerns into TrainingCommand (#5159)
commit 827457ce5845c5a5b02dab164e12f55cd1c4c532
Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Date: Wed Feb 25 09:34:28 2026 +0100
Raise ValueError for None train_dataset in core trainers (#5157)
commit 8b3934ce1681c9f959167804692d6d94fbb36eb0
Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Date: Wed Feb 25 07:57:00 2026 +0100
Fix Liquid syntax error in DPO trainer docs caused by double braces in LaTeX (#5153)
commit 4cd198e856b98cae6ed6d0632ab86ca22b432e23
Author: Blake Ledden <47259830+bledden@users.noreply.github.com>
Date: Tue Feb 24 19:41:05 2026 -0800
fix: wake up vLLM weights before sync to prevent writes to freed memory (#5147)
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com>
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
commit 1149b746db9f39dc28859e426b16f0e6557db240
Author: ehofm <ella@rilix.ai>
Date: Tue Feb 24 20:36:07 2026 -0500
Fix structured_outputs handling and tool normalization in vLLM backend (#5155)
Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com>
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
commit ea2b4958d0165e01a11b6f07ec024ee8c1d1835d
Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Date: Tue Feb 24 18:10:55 2026 -0600
Fix CI by removing liger-kernel from dev deps (#5163)
commit cfbdd3bea4448cde878c0da0de49551f553c61fe
Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Date: Mon Feb 23 22:27:02 2026 -0600
Fix `SFTTrainer` support for single-image data (#5132)
commit fa313fd57244008953753047795c954a782f9cfc
Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Date: Mon Feb 23 16:06:04 2026 +0100
Add support for Python 3.14 (#4225)
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
commit bc4edf6f02e6f07549d43b3543cb54d597cd3d91
Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Date: Mon Feb 23 15:06:58 2026 +0100
Fix type of TrainingArguments.logging_steps in docs (#5149)
commit 5269393f4269462ce5d4a9227a97af6911da7939
Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Date: Mon Feb 23 15:06:14 2026 +0100
Use BaseConfig in all experimental configs (#5148)
commit ef08730432721d67139e273775acf14846fa95d9
Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Date: Mon Feb 23 15:04:11 2026 +0100
Fix PPOTrainer.save_model (#5151)
commit ae97f06954b274f582f82ac60e444897f73f14c3
Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Date: Mon Feb 23 08:00:44 2026 -0600
Fix wording in DPO and SFT trainer documentation for clarity (#5140)
commit f150780cda7b0a82a4840c44b7026732ee17c4bb
Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Date: Mon Feb 23 09:06:38 2026 +0100
Move common fields from stable trainer configs to BaseConfig (#5136)
commit 93f2e480daa0ee9962a8a5deff6ea3da347fe911
Author: casinca <47400729+casinca@users.noreply.github.com>
Date: Sun Feb 22 20:09:44 2026 +0100
refactor(gkd_trainer): small optim (#5143)
commit 8067ea7558ed4477afce710bbf2f8a1a79973ba7
Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Date: Fri Feb 20 11:16:55 2026 -0600
Add `environment_factory` to `GRPOTrainer` (#5093)
Co-authored-by: Sergio Paniego Blanco <sergiopaniegoblanco@gmail.com>
commit 7a4156a1bb3224a3c7f5861d39ef76367273a26b
Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Date: Fri Feb 20 16:05:55 2026 +0100
Fix `trl <command> --help` TypeError caused by unescaped `%` in `TrainingArguments` help strings (#5135)
commit c3ead5b556d9ea588b4a95cae1775913118ddbc6
Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Date: Fri Feb 20 11:09:02 2026 +0100
Fix NameError: name 'importlib' is not defined (#5134)
commit b7fa6bf17322f03d5ec47d12efc142de9ea5981a
Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Date: Fri Feb 20 09:46:30 2026 +0100
Fix import latency [2/N]: Implement native _is_package_available (#5129)
commit bb147645fad777c01ce1ccd2f10350b3cc50fceb
Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Date: Fri Feb 20 09:45:59 2026 +0100
Fix import latency [1/N]: Extract _LazyModule to dedicated module (#5128)
commit e3b7897c873f94c26bf1a661df19e428239be114
Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Date: Thu Feb 19 20:53:15 2026 -0600
Refactor DPO (#3906)
Co-authored-by: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>
commit a68fb896f008f58a3d37abf11ae665357b7c679a
Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Date: Thu Feb 19 13:15:42 2026 -0600
Remove revision references in dataset loading for toolcall tests (#5133)
commit 699b8420cd6601474788effdab063d3d5e7bbc3b
Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Date: Thu Feb 19 17:53:30 2026 +0100
Refactor TRL CLI into modular command architecture (#5124)
commit b46614e235f126c1c8d0fd9f41f4d217a8299c34
Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Date: Thu Feb 19 17:08:14 2026 +0100
Implement Agent Skills [4/N]: Create skills CLI (#5103)
commit f8181886c6a59f5f8c2a2bf31bb4cb7bda225d39
Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Date: Wed Feb 18 11:59:13 2026 -0600
Update tool handling to support JSON string schemas in trainers (#5118)
commit 9fc9a7dcebe3938a273e18ca3ed5b2cfdb6c0839
Author: Sergio Paniego Blanco <sergiopaniegoblanco@gmail.com>
Date: Wed Feb 18 18:11:42 2026 +0100
Add Tiny Aya tool calling examples (script/notebook) (#5123)
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com>
commit 27431134e3447181821ffaf94c405a44d87d1bc1
Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Date: Wed Feb 18 10:10:14 2026 -0600
Add GLM-4.5 model to tests (#5114)
commit 0e531bdd1eb654bed32d12b474ea998285cb1253
Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Date: Wed Feb 18 09:22:06 2026 -0600
Add check for `None` in `get_trackio_space_url()` to prevent errors (#5115)
commit 8b082bb2d4d599d66cf36df0b754c2f4de0371de
Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Date: Wed Feb 18 09:18:02 2026 -0600
Fix Qwen3 schema (#5111)
commit 269217f92092e4497260e34bd535756cb9e76f64
Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Date: Wed Feb 18 08:27:27 2026 -0600
Add test for Cohere2 models (#5116)
commit 57df014377bef538c87c616379d4c11aeaf05b30
Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Date: Wed Feb 18 06:45:38 2026 -0600
Add more tests for `get_training_chat_template` (#5108)
commit 70efa963f1c9bb88ec3144b051db6fdf4ffc10a4
Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Date: Tue Feb 17 10:51:47 2026 -0600
Update version check for transformers to 5.2.0 in online_dpo_trainer.py (#5110)
commit 269ed992dca0858f290e08b5ebae271a15df8aa6
Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Date: Tue Feb 17 06:49:34 2026 -0600
Add validation for conversational prompts in multimodal training (#5067)
commit 997536a2b56d4a0824bb55f9265e6561c3fd1e43
Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Date: Tue Feb 17 10:30:06 2026 +0100
Implement Agent Skills [3/N]: Create skills installer (#5100)
commit 8b9b972878243505d26b3dc69945613ff5ddc98b
Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Date: Mon Feb 16 15:50:59 2026 -0600
Remove outdated liger-kernel compatibility checks and warnings in tests and SFTTrainer (#5105)
commit 8c232f64b5bb00ef854bff157f6857241d415fe0
Author: Harikrishna KP <harikp2002@gmail.com>
Date: Tue Feb 17 02:09:35 2026 +0530
Fix SFT loss type rewards being overwritten in dpo_loss() (#5079)
commit 99b26fb2e6f241195fc9b378ee8d50a6219083b2
Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Date: Mon Feb 16 14:34:54 2026 -0600
Add Trackio integration for model card visualization (#5101)
commit c94c032129af436c55764fef66389f30856df3d0
Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Date: Mon Feb 16 21:13:09 2026 +0100
Fix style (#5106)
commit 3d1c785762ce87892a7eaf18d1c0fb8771a74bc3
Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Date: Mon Feb 16 19:17:55 2026 +0100
Implement Agent Skills [2/N]: Create skills module (#5097)
commit 1702fc07b2d0c8ba23ad3299879d4edaaccb3b30
Author: LeonEricsson <70749762+LeonEricsson@users.noreply.github.com>
Date: Mon Feb 16 19:09:59 2026 +0100
feature: top_k selective_log_softmax (#5104)
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
commit 29ace1ad72152f4d648bf23763a319ce2400b9e6
Author: flutist <30485581+flutist@users.noreply.github.com>
Date: Tue Feb 17 00:56:52 2026 +0800
Fix DPO and RLOO incompatibility with FSDP2 (#4838)
Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com>
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
commit 3d2e898ce4d538b5502f735a9c5116c8b45aaa46
Author: Yuki Uehara <74698040+yukiu00@users.noreply.github.com>
Date: Tue Feb 17 01:28:43 2026 +0900
Pass vllm_is_ratio to LigerFusedLinearGRPOLoss in compute_liger_loss (#5031)
Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>
commit b6957fc4c04a100e6829cb56e28f20668e4ad1ae
Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Date: Mon Feb 16 17:02:42 2026 +0100
Implement Agent Skills [1/N]: Create training skill (MVP) (#5096)
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
commit abf6033b99fc75eb9d58458b44b81f7f7faebdc1
Author: Behrooz Azarkhalili <80390531+behroozazarkhalili@users.noreply.github.com>
Date: Mon Feb 16 04:49:30 2026 -0800
docs: Unify model examples to use trl-lib namespace (#4431)
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Co-authored-by: Sergio Paniego Blanco <sergiopaniegoblanco@gmail.com>
commit 7694e0e878361c0de6c6cb566c123f46af13d91d
Author: Nabin Oli <107109731+nabin2004@users.noreply.github.com>
Date: Sat Feb 14 00:36:27 2026 +0545
docs: add Multi-Node Training subsection (#4384) (#5091)
Co-authored-by: Sergio Paniego Blanco <sergiopaniegoblanco@gmail.com>
commit 28fc3f2c336bb7f734aab49c1ad073e152dccf61
Author: Behrooz Azarkhalili <80390531+behroozazarkhalili@users.noreply.github.com>
Date: Thu Feb 12 10:37:15 2026 -0800
docs: Add MPO paper (2411.10442) to paper index (#5089)
Co-authored-by: Sergio Paniego Blanco <sergiopaniegoblanco@gmail.com>
commit 051b52fbf2c68edcae092357d0d4118b35a5f60b
Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Date: Thu Feb 12 18:53:17 2026 +0100
Validate reward model has 1 num_labels (#5087)
commit a558fba8a5700933207c3963a1dab8a28291f2f1
Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Date: Thu Feb 12 18:04:13 2026 +0100
Fix BFD packing for SFT datasets (#5076)
commit 0073db963788d6cc77d51789f4b3d2c34930cfdc
Author: Behrooz Azarkhalili <80390531+behroozazarkhalili@users.noreply.github.com>
Date: Thu Feb 12 01:55:06 2026 -0800
docs: Add PPO paper (1707.06347) to paper index (#5085)
Co-authored-by: Sergio Paniego Blanco <sergiopaniegoblanco@gmail.com>
commit 29ed9cb4ff346100bee004acd1ce7cc97554f064
Author: Behrooz Azarkhalili <80390531+behroozazarkhalili@users.noreply.github.com>
Date: Thu Feb 12 01:45:57 2026 -0800
docs: Add T5 packing paper (1910.10683) to paper index (#5084)
Co-authored-by: Sergio Paniego Blanco <sergiopaniegoblanco@gmail.com>
commit d0e06fcda40607ad7bd1a3b639a3018c3bb4bfca
Author: Behrooz Azarkhalili <80390531+behroozazarkhalili@users.noreply.github.com>
Date: Thu Feb 12 01:38:35 2026 -0800
docs: Add PRM paper (2211.14275) to paper index (#5083)
Co-authored-by: Sergio Paniego Blanco <sergiopaniegoblanco@gmail.com>
commit ee979a9d2f23c100cb9d4010b4ad99275a6c726c
Author: Behrooz Azarkhalili <80390531+behroozazarkhalili@users.noreply.github.com>
Date: Thu Feb 12 01:29:29 2026 -0800
docs: Add GKD paper (2306.13649) to paper index (#5082)
Co-authored-by: Sergio Paniego Blanco <sergiopaniegoblanco@gmail.com>
commit ff84817d27241643abfe3f7691448e509e093320
Author: Behrooz Azarkhalili <80390531+behroozazarkhalili@users.noreply.github.com>
Date: Thu Feb 12 01:23:04 2026 -0800
docs: Add CPO paper (2401.08417) to paper index (#5081)
Co-authored-by: sergiopaniego <sergiopaniegoblanco@gmail.com>
commit fe890df6e2a84345a29be849bb5f27ca72052034
Author: Behrooz Azarkhalili <80390531+behroozazarkhalili@users.noreply.github.com>
Date: Thu Feb 12 01:05:10 2026 -0800
docs: Add ORPO paper (2403.07691) to paper index (#5080)
Co-authored-by: Sergio Paniego Blanco <sergiopaniegoblanco@gmail.com>
commit c88f8c550137ddd1ddde56baebae2d9b97b9d54d
Author: Behrooz Azarkhalili <80390531+behroozazarkhalili@users.noreply.github.com>
Date: Thu Feb 12 01:01:37 2026 -0800
docs: Add TR-DPO paper (2404.09656) to paper index (#5078)
commit 0562c3fa26c1bc827aff83800b046f9a2af925a6
Author: Logan Vegna <logan.vegna@shopify.com>
Date: Wed Feb 11 16:17:04 2026 -0500
[SFT] Fix high vRAM consumption during eval with liger kernel (#5069)
Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>
commit 6b38db6ad85cf67ce1b7d4f037e5e5840d474587
Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Date: Wed Feb 11 19:22:00 2026 +0100
Fix CI ValueError: Pointer argument (at 0) cannot be accessed from Triton (cpu tensor?) (#5074)
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
commit 060fbfebbddf1e539e5dcee456bef643c29036d3
Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Date: Wed Feb 11 10:44:54 2026 -0600
Update model from SequenceClassification to CausalLM in `RewardTrainer` tests (#5060)
commit 0933b7fc5ddb933c632708bba5936b99238168d8
Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Date: Wed Feb 11 17:18:12 2026 +0100
Fix logging warning suppression for transformers 4.56.2 (#5077)
commit a07fb82b9a4333ea91cfe289a697b9b178d99021
Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Date: Wed Feb 11 09:21:48 2026 -0600
fix: Set `num_labels` to 1 in causal model initialization for RewardTrainer (#5066)
commit 29fe68205caf4acbf888307487e8423d692ee496
Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Date: Wed Feb 11 15:08:11 2026 +0100
Fix GRPO multi-turn training with liger kernels (#4975)
commit 68399dfa6a03e4dea6ea5087c4d181b3b400cab5
Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Date: Wed Feb 11 07:39:27 2026 -0600
fix: Use `launch_args` for all trainers (#5059)
commit d1b066fdc4a8a7d0bde59e3bf1aeaac8803746d1
Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Date: Wed Feb 11 07:37:54 2026 -0600
Fix logging warning suppression with scoped override for seq-clf head key (#5058)
commit 0c3d33b955730308bab3d28ba2ef6eebe704c7f8
Author: Behrooz Azarkhalili <80390531+behroozazarkhalili@users.noreply.github.com>
Date: Wed Feb 11 02:41:07 2026 -0800
docs: Add SimPO paper (2405.14734) to paper index (#5071)
Co-authored-by: Sergio Paniego Blanco <sergiopaniegoblanco@gmail.com>
commit f23e3a775155f458108cb01ebcfde085ecca4733
Author: Behrooz Azarkhalili <80390531+behroozazarkhalili@users.noreply.github.com>
Date: Wed Feb 11 02:33:11 2026 -0800
docs: Add RPO paper (2405.16436) to paper index (#5070)
Co-authored-by: Sergio Paniego Blanco <sergiopaniegoblanco@gmail.com>
commit e46005c07c1a8194f4ff71b749dceb44f17d7eb7
Author: Behrooz Azarkhalili <80390531+behroozazarkhalili@users.noreply.github.com>
Date: Wed Feb 11 02:22:52 2026 -0800
docs: Add XPO (2405.21046) to Paper Index (#5068)
Co-authored-by: Sergio Paniego Blanco <sergiopaniegoblanco@gmail.com>
commit 6d9bba1b3b9f181d9cf53eb9092a3d52b66de93b
Author: Behrooz Azarkhalili <80390531+behroozazarkhalili@users.noreply.github.com>
Date: Wed Feb 11 02:15:59 2026 -0800
docs: Add REINFORCE++ (2501.03262) to Paper Index (#5062)
Co-authored-by: Sergio Paniego Blanco <sergiopaniegoblanco@gmail.com>
commit b992e9284aac5979ab5716d14587067857663398
Author: Behrooz Azarkhalili <80390531+behroozazarkhalili@users.noreply.github.com>
Date: Wed Feb 11 01:37:06 2026 -0800
docs: Add INTELLECT-2 (2505.07291) to Paper Index (#5061)
Co-authored-by: Sergio Paniego Blanco <sergiopaniegoblanco@gmail.com>
commit c985dbadd3a499dc049f8c39c61034525d381006
Author: Jen Wei <45276133+JenWei0312@users.noreply.github.com>
Date: Wed Feb 11 02:25:55 2026 -0700
docs: add DeepSeek-R1 training dynamics and GRPO example (#5053)
Co-authored-by: Sergio Paniego Blanco <sergiopaniegoblanco@gmail.com>
commit d934eb757806501a5106b6e4374d920961dc4e9f
Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Date: Tue Feb 10 19:14:40 2026 +0100
Remove deprecated mergekit_utils moved to experimental (#5057)
commit 991fd0755aa1cce7a800d271dae4b525b6357bfd
Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Date: Tue Feb 10 11:57:11 2026 -0600
Remove duplicated tests for SFT and add gradient checkpointing tests (#5054)
commit d42b23f63f164af241c34c79e6a855d1eb896d4d
Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Date: Tue Feb 10 18:17:10 2026 +0100
Remove deprecated classes moved to experimental (#5044)
commit e1a84cf626d249e9b55447d63eece88cbf92d100
Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Date: Tue Feb 10 18:06:05 2026 +0100
Remove deprecated RLOOConfig.max_prompt_length (#5056)
commit fc560370d97042cb7f90a9bf0e2e30d29a304240
Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Date: Tue Feb 10 17:47:21 2026 +0100
Remove deprecated XPO after moved to experimental (#5055)
commit 13bd37e1426eb81aca81cd68eb8d0efcbc6351b9
Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Date: Tue Feb 10 17:36:36 2026 +0100
Remove deprecated PRM after moved to experimental (#5052)
commit 0aea3144031abacb0efadd8aab5a3ca9fe6380e8
Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Date: Tue Feb 10 17:33:24 2026 +0100
Remove deprecated PPO after moved to experimental (#5051)
commit d705ac4d0f13168724f548c9ecbb1a586c187e16
Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Date: Tue Feb 10 17:06:11 2026 +0100
Remove deprecated ORPO after moved to experimental (#5050)
commit b393c6bf6605d04f00a110085bf59feef59ffa6a
Author: Salman Chishti <13schishti@gmail.com>
Date: Tue Feb 10 15:29:23 2026 +0000
Upgrade GitHub Actions to latest versions (#4893)
Signed-off-by: Salman Muin Kayser Chishti <13schishti@gmail.com>
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
commit ce2ea744c5f504a026aa9ea41815bfa382417a9d
Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Date: Tue Feb 10 15:47:05 2026 +0100
Remove deprecated Judges after moved to experimental (#5048)
commit 4620e91d21ad0dd3d885abe29446d7a52a9d368e
Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Date: Tue Feb 10 15:18:49 2026 +0100
Remove deprecated CPO after moved to experimental (#5046)
commit 6e47225d012aba64efb16789356bee9d037ec171
Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Date: Tue Feb 10 15:18:33 2026 +0100
Remove deprecated BCO after moved to experimental (#5045)
commit 17277e2d963611603eb2655af429975feade3b5c
Author: LeonEricsson <70749762+LeonEricsson@users.noreply.github.com>
Date: Tue Feb 10 15:13:24 2026 +0100
[GRPO] fix: remove SAPO temperature check (#5042)
commit 7267b2d3589bcdccb46e8cfd51d63416d4378c76
Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Date: Tue Feb 10 14:42:18 2026 +0100
⬆️ Bump dev version (#5049)
commit 4aaaf064c15ad80bea91895c8f202d44ad17cdb4
Author: casinca <47400729+casinca@users.noreply.github.com>
Date: Tue Feb 10 14:27:45 2026 +0100
[minor] docs: typo in `grpo_trainer.md` (#5047)
commit 49ef33428c47235991acb4e185ea599b70c6dab4
Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Date: Tue Feb 10 14:20:48 2026 +0100
Release: 0.28 (#5043)
commit a958acc1e92d9ccf8404d800bf073c4cc7e5dd85
Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Date: Tue Feb 10 03:43:23 2026 -0600
Add Online Direct Preference Optimization section to paper index (#5037)
commit 8b935c6378b78adcb4fda9e66a944e05cc99b681
Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Date: Tue Feb 10 03:41:41 2026 -0600
Fix multiprocessing start method to 'spawn' for test compatibility with Python 3.12+ (#5036)
commit 40fff2e3bab905c2e7096360f4a8f014aab4cd14
Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Date: Tue Feb 10 03:32:24 2026 -0600
Deprecate FDivergenceType in DPOConfig; update f_divergence_type to use string values (#5039)
Co-authored-by: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
commit fe1949b4da60dc7adbcf3a0bb17a4f42280c5c28
Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Date: Tue Feb 10 03:22:43 2026 -0600
Deprecate string usage for `ref_model` in DPOTrainer (#5040)
Co-authored-by: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
commit 9f7c33600b7555a72234926a846ee65ff2508624
Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Date: Tue Feb 10 02:51:40 2026 -0600
Rename AOT loss type 'aot_pair' to 'aot_unpaired' in DPO (#5038)
Co-authored-by: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
commit 19c5f4460cd9b405a95fabb05322ceb98864e915
Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Date: Tue Feb 10 08:04:20 2026 +0100
Allow testing with transformers 5.1.0 via xfail marks (#5034)
commit 442509524b4e7c8ee4d9f1d6f1f1087b5dcd1a0f
Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Date: Mon Feb 9 23:45:11 2026 +0100
Fix CI FutureWarning: max_prompt_length is deprecated (#5019)
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
commit 0ef315a0f992a30f82792e87b5e3f0fab58a6107
Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Date: Mon Feb 9 22:45:23 2026 +0100
Filter max_prompt_length UserWarning in all test cases (#5035)
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
commit 8a27a17e583756b75ce71eb40a9ed295a610ade1
Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Date: Mon Feb 9 22:43:34 2026 +0100
Fix CI FutureWarning: tools is deprecated (#5015)
commit ff55949cccaaa23a752d951d184c04ff579aad89
Author: Haseeb Asif <149416177+Haseebasif7@users.noreply.github.com>
Date: Tue Feb 10 02:42:55 2026 +0500
Add length-unbiased GRPO loss (LUSPO) (#4988)
Co-authored-by: Haseeb Asif <haseeb@Haseebs-MacBook-Air.local>
Co-authored-by: Leon Ericsson <leon.ericsson@icloud.com>
Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com>
commit 765e397ed83f344a8ba7082673d4c4616beeb3a2
Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Date: Mon Feb 9 15:00:49 2026 -0600
[CI] Silence PyTorch JIT and DataLoader deprecation warnings (#4999)
Co-authored-by: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
commit b5bd2b98ed615676a7bee40c8ae17de62421bb4a
Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Date: Mon Feb 9 21:12:57 2026 +0100
Mark Qwen3VL tests as xfail for transformers 5.0.x (#5029)
commit 7189bc68d8ab0d19b67c0fc0b849c83679389b46
Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Date: Mon Feb 9 21:04:35 2026 +0100
Fix CI FutureWarning: use_logits_to_keep is deprecated (#5013)
commit db0d95523e5b8039c94c50df1b6286ff8b7e29ce
Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Date: Mon Feb 9 15:53:56 2026 +0100
Fix CI FutureWarning: rpo_alpha is deprecated (#5011)
commit fa06506f9d1c9546f63ae513cf7a5ba1be3247ae
Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Date: Mon Feb 9 15:52:31 2026 +0100
Fix typo in xfail test reason (#5028)
commit 4abd67951f996b511f4b913ead2386fbe357061b
Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Date: Mon Feb 9 15:51:21 2026 +0100
Fix CI FutureWarning: generate_during_eval is deprecated (#5017)
commit 9f1e7dd7fd58be3327748234fb952599c4bd4f09
Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Date: Mon Feb 9 15:48:59 2026 +0100
Pin transformers < 5 in judges extra due to incompatibility (#5024)
commit 7c4e7f86047b82ad0e5ff7c8e3bb280b73024f31
Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Date: Mon Feb 9 15:22:02 2026 +0100
Fix vision model prompt truncation bug in DPOTrainer (#5023)
commit a68c82a617be59086b83f5ce941175270926de3f
Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Date: Mon Feb 9 15:00:22 2026 +0100
Fix typo in DPO max_prompt_length deprecation warning message (#5020)
commit 5eb25938d44b781687061a69586a06501b11e915
Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Date: Mon Feb 9 14:35:44 2026 +0100
Fix CI FutureWarning: ref_model_init_kwargs is deprecated (#5009)
commit 58f467babd998fe5fe41598b535ceacda690cef0
Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Date: Mon Feb 9 07:34:48 2026 -0600
Add support for `nested_gather` in OnlineDPOTrainer for transformers v5.2.0 and above (#4981)
commit 71a349335ce554180b2b4947d33594090f74d5cf
Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Date: Mon Feb 9 14:34:44 2026 +0100
Fix CI TRLExperimentalWarning in regular tests (#5007)
commit a7333c8c68f564005ab74fd999ad246de405122f
Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Date: Mon Feb 9 14:34:15 2026 +0100
Filter CI SWIG deprecation warnings (#5004)
commit 98b00171b81411cd8bf7d6a9135af70c9879aaee
Author: Nabin Oli <107109731+nabin2004@users.noreply.github.com>
Date: Mon Feb 9 18:59:18 2026 +0545
docs: add CGPO/Mixture of Judges (2409.20370) to Paper Index + link ref to AllTrueJudge (#5002)
Co-authored-by: Sergio Paniego Blanco <sergiopaniegoblanco@gmail.com>
commit 728b0e372fb7de141093aff5513697a4fa743137
Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Date: Mon Feb 9 07:13:47 2026 -0600
[tests] Remove xfail for transformers version >= 5.0.0 due to upstream bug resolution (#5000)
commit 637de450e748d1f612c0f6fed6be4df9cbbf1c39
Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Date: Sun Feb 8 14:36:10 2026 -0600
Add `sanitize_logprob` function for NaN handling in vLLM log probabilities (#5001)
commit bfb94262b81fd28017a11d7b9ddc61e3095cc2b6
Author: Akshay Ballal <61191840+akshayballal95@users.noreply.github.com>
Date: Sat Feb 7 14:48:58 2026 +0100
Fix GRPO tool calling for corrupted tool calls (#4890)
commit 7a39ff3995f2f8b7cb4f8ca29a09390ac587a43d
Author: casinca <47400729+casinca@users.noreply.github.com>
Date: Fri Feb 6 23:05:14 2026 +0100
perf: Qwen SAPO loss optimization (#4956)
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com>
commit bd206704f9fc2c08039c522b40b0f68654bb006f
Author: Sergio Paniego Blanco <sergiopaniegoblanco@gmail.com>
Date: Fri Feb 6 23:00:24 2026 +0100
Update sampling mode to token level for safety (#4989)
Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com>
commit aa7d457f9bec736c75439f78d081a6dc012ce353
Author: cmunley1 <cmunley@nvidia.com>
Date: Fri Feb 6 13:37:53 2026 -0800
Update NeMo-Gym to use `env_mask` (#4986)
Signed-off-by: Christian Munley <cmunley@nvidia.com>
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
commit 5db1c11c52bc95255ea73e7eae3840fbeeb293a2
Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Date: Fri Feb 6 11:27:47 2026 -0600
Add distributed smoke tests workflow for Transformers branch (#4996)
commit 90a35d12c9c64129eb023b499a812c5e638db846
Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Date: Fri Feb 6 10:47:49 2026 -0600
Add GitHub Actions workflow for testing against Transformers branch (#4995)
commit f11b4c3fdd511d9adfda74ceec02042cee65a0f3
Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Date: Fri Feb 6 09:46:14 2026 -0600
Fix ZeRO-3 + PEFT + gradient checkpointing (#4951)
commit 27cbe98ac7487f326be51e180a1ee078c23b3836
Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Date: Fri Feb 6 16:14:18 2026 +0100
Fix post_init warning stacklevel to 3 (#4993)
commit 57cac251bdde714f97458b39b24702cf624dec66
Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Date: Fri Feb 6 16:13:47 2026 +0100
Fix deprecation of DPOConfig.max_completion_length (#4992)
commit ce72c067f6b55d4352c71939d9be6f4dfdaf68a0
Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Date: Fri Feb 6 16:13:14 2026 +0100
Assert chat_template is applied in test_train_with_chat_template_kwargs (#4991)
commit ffdaba3a97299c0c381512f807c57aa753f6314a
Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Date: Fri Feb 6 16:11:40 2026 +0100
Fix import of AutoModelForCausalLMWithValueHead from experimental (#4990)
commit 97a8a9672c0d5fbacb5c60934f40a7af404adecf
Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Date: Fri Feb 6 07:10:06 2026 -0600
Use local variable instead of attribute in collator tests (#4957)
commit 4e212bdeed6c7bf081e494960c65a84a35a85d6e
Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Date: Fri Feb 6 07:07:28 2026 -0600
Update dataset configuration name in toolcall dataset loading (#4984)
commit c82f6aa4766f83b66ea3f37e7bbf3b30453a1cda
Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Date: Fri Feb 6 14:03:31 2026 +0100
Fix passing tokenizer in test_train_with_chat_template_kwargs (#4987)
commit c581c1e8829904c6838c38d70d7b5fa646e2f0fc
Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Date: Fri Feb 6 14:03:02 2026 +0100
Pin transformers!=5.1.0 in deepspeed extra due to incompatibility (#4985)
commit 98aca7f4fdb2c2879c86aa9bd18ccddece112b70
Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Date: Fri Feb 6 06:54:48 2026 -0600
Replace `warmup_ratio` with `warmup_steps` (#4983)
commit 032ee139d90d1279549e2b538f11a2c3b7c22aa7
Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Date: Fri Feb 6 00:34:04 2026 -0600
[CI] Disallow installation of transformers 5.1.0 due to compatibility issues with DeepSpeed (#4982)
commit a0e5f265604356d2a107edccd920b5236582a3d7
Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Date: Thu Feb 5 13:42:37 2026 -0600
Deprecate parameters in `DPOConfig` (#4969)
Co-authored-by: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
commit f0a738d954775e50d4bd4a4df4fc5d1826e2f0b4
Author: Sergio Paniego Blanco <sergiopaniegoblanco@gmail.com>
Date: Thu Feb 5 20:12:15 2026 +0100
Simplify instructions of installation of OpenEnv (#4980)
commit a92d14336e5380f1ce2b8cbca78ebd224933d2ba
Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Date: Thu Feb 5 13:05:32 2026 -0600
Replace `torch.allclose` with `torch.testing.assert_close` (#4977)
commit b0b798a82953094fb4d254915e4c360789ac2838
Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Date: Thu Feb 5 19:53:14 2026 +0100
Support truncated completions in GRPO multi-turn training (#4976)
commit ac194a917b2a8b7d514097d34078ec191ef4a0e3
Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Date: Thu Feb 5 16:56:50 2026 +0100
Fix add_column in test_train_with_chat_template_kwargs (#4979)
commit 3a76b7a8690e25838f2332b81d7efbfed2615277
Author: Sergio Paniego Blanco <sergiopaniegoblanco@gmail.com>
Date: Thu Feb 5 16:24:29 2026 +0100
Set specific OpenEnv version when installed (#4978)
commit 0113ad7022118e4a7afe62b4e190f0b9aee4cadf
Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Date: Thu Feb 5 13:28:21 2026 +0100
Remove truncation from tokenizer calls if no max_length (#4972)
commit eee98f77a25bb386a0aba85dcb93ad50511224f9
Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Date: Thu Feb 5 13:27:01 2026 +0100
Remove padding_value from experimental CPO and use pad_token_id (#4962)
commit 1354860c5c33bdee7a89a8845474ac4074094ed6
Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Date: Thu Feb 5 06:07:05 2026 -0600
Fix test_train_with_chat_template_kwargs (#4971)
commit 1bd2a52ec2d8344050af736d60cdc735181ae4b8
Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Date: Thu Feb 5 04:18:38 2026 -0600
Revert change in GRPO from NeMo-Gym Integration (#4970)
commit 22ad7e6b3f2ec7dfc2567fa0535955812fe69a42
Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Date: Thu Feb 5 08:14:54 2026 +0100
Remove max_prompt_length from experimental ORPO (#4966)
commit 657babd9300007308c7b9ad329790f0564daf94d
Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Date: Thu Feb 5 08:13:10 2026 +0100
Remove max_prompt_length from experimental CPO (#4965)
commit 50e35de16578daa9aee75758fad8bdc0f707e37d
Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Date: Thu Feb 5 08:09:25 2026 +0100
Remove max_prompt_length from experimental BCO (#4964)
commit 35bcab1d4da9fca092f9cefbc611fcd9eddaaf42
Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Date: Thu Feb 5 08:02:34 2026 +0100
Remove max_prompt_length from experimental PRM (#4963)
commit e4995b2d26122879c03605f8ee136bcb241b4171
Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Date: Wed Feb 4 14:17:22 2026 -0600
Add test for training with `compute_metrics` in `RewardTrainer` (#4958)
commit cb5a73bfd97a3cb36712cd540545c2512cfa96a6
Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Date: Wed Feb 4 14:14:14 2026 -0600
Add test for tool call data in `RewardTrainer` (#4959)
commit 90b875c575b816e9015670ad812db43c8ab9a0e3
Author: cmunley1 <cmunley@nvidia.com>
Date: Wed Feb 4 08:56:55 2026 -0800
NeMo-Gym Integration (#4848)
Signed-off-by: Christian Munley <cmunley@nvidia.com>
Signed-off-by: cmunley1 <cmunley@nvidia.com>
Signed-off-by: Lawrence Lane <llane@nvidia.com>
Co-authored-by: Sergio Paniego Blanco <sergiopaniegoblanco@gmail.com>
Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>
Co-authored-by: Lawrence Lane <llane@nvidia.com>
commit 5cb7eee1548bc72ed6fd84080c200a0adf74add2
Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Date: Tue Feb 3 11:53:35 2026 -0600
Remove access to `warnings_issued` (#4960)
commit 2a55ed701122f3d210669c70b441e3baff6184b6
Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Date: Tue Feb 3 08:28:32 2026 -0600
Add test for training with `compute_metrics` in `SFTTrainer` (#4950)
commit 7b54e7253093610ab69bb8c32b2a4eb6926721ea
Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Date: Tue Feb 3 10:43:40 2026 +0100
Minor fix docs style (#4953)
commit 2a9fb3f22a8bbad1412af3bb2526febd7160b85f
Author: mel3c <gaozh1988@live.com>
Date: Tue Feb 3 16:02:07 2026 +0800
Fix PPO run_name parameter not taking effect (#4945)
commit a03c2fcda3a328bde9af4abc5c02f6e7e942140f
Author: Sergio Paniego Blanco <sergiopaniegoblanco@gmail.com>
Date: Mon Feb 2 16:27:47 2026 +0100
Update wordle.py example with masking of env tokens (#4895)
Co-authored-by: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com>
commit 68bc37700d2b66e1fbfa49282495f5419dd8abeb
Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Date: Mon Feb 2 15:25:02 2026 +0100
Remove ref_model_init_kwargs from experimental BCO (#4946)
commit 239c74d9ffb8ca67a9a667fb7a2a91576d554f28
Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Date: Mon Feb 2 09:25:41 2026 +0100
Fix SFTTrainer init logic: remove TrainingArguments.push_to_hub_token only for transformers < v5 (#4942)
commit 035c3ff151b953ca72cdfe0ee966bc1469a26fde
Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Date: Thu Jan 29 14:08:08 2026 -0600
[GRPO] Add parquet logging for completions with individual rewards (#4818)
Co-authored-by: Daniel van Strien <davanstrien@gmail.com>
Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>
Co-authored-by: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
commit 414e60f557eb0d0888db841c5e0e8f568e7607a8
Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Date: Thu Jan 29 18:57:12 2026 +0100
Set default top_k to 0 in VLLMClient (#4927)
commit df332dc924e1bdc75bcfc5573950a17648db2eb4
Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Date: Thu Jan 29 09:03:32 2026 -0600
Fix import statement for import_utils in vllm_client.py (#4932)
commit 27998e9584df0102b849878506be7d4808486771
Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Date: Thu Jan 29 15:51:19 2026 +0100
Fix profiling of VLLMGeneration.sync_weights (#4931)
commit 43fb8d310633448a0c4c731a2efe9c1ca55e6184
Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Date: Thu Jan 29 15:11:24 2026 +0100
Set model dtype to float32 in experimental tests of trainers (#4925)
commit 5a7481ec9340dfad5f23c54f90e15a139c1dff85
Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Date: Thu Jan 29 14:35:56 2026 +0100
Move VLLMClient to generation module (#4928)
commit 21a0d70400179e4047c60183d7fb61988a249989
Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Date: Thu Jan 29 14:32:59 2026 +0100
Require transformers<5 with PairRMJudge (#4926)
commit 4348375ab2c6bad36ef90e1061b804b0449148f1
Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Date: Thu Jan 29 14:30:13 2026 +0100
Set model dtype to float32 in tests of trainers (#4924)
commit a6cbf279d7d3bc4024e6e6273d967509e7221e83
Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Date: Thu Jan 29 07:06:06 2026 -0600
Support tool call data in `is_conversational` (#4923)
commit ad91c6ffa91073684c4cf6dc2008e2994dd940e7
Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Date: Wed Jan 28 11:47:15 2026 -0600
Add validation for `sync_ref_model` in `GRPOTrainer` and `RLOOTrainer` when using PEFT models (#4912)
Co-authored-by: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
commit 04717ffca8fd91a0fa5ee610fbdc75ef8f3c5a22
Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Date: Wed Jan 28 11:16:12 2026 -0600
Update learning rate comments and add assertions for reference model parameters in GRPO and RLOO tests (#4914)
Co-authored-by: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
commit b322d9ba8092399b956882f61978ab3e90868c77
Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Date: Wed Jan 28 10:54:04 2026 -0600
Remove chat template setup in dpo_vlm.py (#4906)
commit a70b4e014756dc8595ac226d833deaba9784f756
Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Date: Wed Jan 28 09:55:19 2026 -0600
Fix extra EOS appended in DPO preprocessing for conversational data (#4908)
Co-authored-by: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
commit 8464b0e4b22c571bbf565a03ee154a5692c8d056
Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Date: Wed Jan 28 16:03:53 2026 +0100
Fix CI ValueError for 0 temperature (#4916)
commit 5461a74bc622660039e2038b6b0e5a43bdc712ae
Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Date: Wed Jan 28 15:58:12 2026 +0100
Fix CI AssertionError: assert not True (#4921)
commit d54381a4a90cb18152842158c62aad9895022448
Author: Boyi Zhang <68804418+billycrapediem@users.noreply.github.com>
Date: Wed Jan 28 09:54:29 2026 -0500
docs: add DoRA (2402.09353) to Paper Index (#4892)
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
commit f2f6b32bdc3688b124d72caa412d60a8f12d80c0
Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Date: Wed Jan 28 08:52:58 2026 -0600
Remove gradient checkpointing option from various training scripts (#4905)
commit 6cbc102f5fe94804e5a7579ff1aec270b97e4f5f
Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Date: Wed Jan 28 08:35:23 2026 -0600
Comment about overriding prediction_step in GRPOTrainer and RLOOTrainer (#4913)
commit f40edf9328adbe6c85acfb9dd9745e9c1393197e
Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Date: Wed Jan 28 08:17:35 2026 -0600
`device_map` init consistency in GRPO/RLOO/KTO (#4909)
commit a7070f940e8e0565adfbe9bbedd68b7850334b03
Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Date: Wed Jan 28 08:14:30 2026 -0600
Fix help text formatting for `max_length` in `RewardConfig` and `SFTConfig` (#4910)
commit 66efc0e52e55d77c2edf3e67c6c1f08e274ac9f8
Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Date: Wed Jan 28 08:08:11 2026 -0600
Rearrange variable assignments in `DataCollatorForVisionLanguageModeling` (#4911)
commit e9a2f16004a00a50e69e5779f58bf0bc24937de7
Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Date: Wed Jan 28 11:43:26 2026 +0100
Fix CI TypeError in llm-blender tests (#4919)
commit 4f8232098c10c98ad7febe971da4eb362d13433c
Author: adityachallapally <avasanthc@gmail.com>
Date: Wed Jan 28 01:10:29 2026 -0800
Created new PTT integration docs as requested (#4907)
Co-authored-by: Aditya Challapally <adchalla@microsoft.com>
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
commit 0eb66d8f2fc63b3d00d8dbc18f99c3f48750bd16
Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Date: Tue Jan 27 16:53:29 2026 +0100
Refactor vLLM generation [1/N]: Extract vLLM generation (#4700)
commit 226ef57192b49801c3be8c55c798c6d5b134b080
Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Date: Tue Jan 27 14:34:11 2026 +0100
Fix CI AssertionError: Parameter has not changed (#4904)
commit 956986ebd53ff0d8dfa688e9d1033488dcad55d6
Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Date: Tue Jan 27 14:33:56 2026 +0100
Fix CI NotImplementedError for bfloat16 (#4902)
commit 4322778d7f696a4fc1fc33612b02eeb5ec700109
Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Date: Mon Jan 26 12:57:43 2026 -0600
Transformers v5 release: extend xfail condition for `TestGRPOTrainer.test_training_vlm_and_liger` and update version checks (#4898)
commit e106972dd6d839f4a3d3fcaffc1f386b4fbe66bf
Author: Cola Chan (SII) <57797863+141forever@users.noreply.github.com>
Date: Mon Jan 26 17:56:38 2026 +0800
GOLD training speed up (#4888)
Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>
commit c477e88e05023dbcd45211c1a802788650598909
Author: Yi-Chen Li <ychenli.X@gmail.com>
Date: Fri Jan 23 21:25:36 2026 +0800
Fix RewardTrainer's results not reproducible (#4887)
Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com>
commit ba053232324b207554116f806edbb2ec8b6ab9f5
Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Date: Thu Jan 22 08:15:31 2026 -0600
Fix import path for `get_open_port` based on vLLM version (#4883)
commit e66a138438a3beba08756543fa41b7a90054ee8c
Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Date: Thu Jan 22 08:14:48 2026 -0600
Mark ZeRO 2 as xfail in distributed tests due to current failure (#4885)
commit a60d75aa1efa6ac5330649aafd425859da685a63
Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Date: Wed Jan 21 15:44:01 2026 -0600
Test distributed training for `RewardTrainer`, `RLOOTrainer` and `GRPOTrainer` (#4823)
Co-authored-by: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
commit 60e46742576876209658446b50f144541873301b
Author: Wing Lian <wing@axolotl.ai>
Date: Wed Jan 21 16:15:03 2026 -0500
Enable vLLM sleep mode for generation in Online DPO (#4882)
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>
Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com>
commit 0a881bcee992a25e2fc0e980cc43a7428ce17373
Author: Kirill Dubovikov <dubovikov.kirill@gmail.com>
Date: Thu Jan 22 01:09:34 2026 +0400
Bugfix: Logprob drift in vLLM serving mode (compared to colocate mode) (#4873)
Co-authored-by: Kirill Dubovikov <kirill.dubivokov@mbzuai.ac.ae>
Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com>
commit 16b090302b8fd408870baa7452b5c3a29e03c346
Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Date: Wed Jan 21 03:00:04 2026 -0600
Fix SFT training for prompt-completion type and transformers v5 (#4880)
commit b080a4c27a60988be213354f551e26d3a4b2eef9
Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Date: Wed Jan 21 08:07:42 2026 +0100
Remove label_pad_token_id from experimental trainers (#4878)
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
There is confusion around whether data should be pre-processed before being passed to
GRPOTrainer. This adds a clear, actionable error message instead of the crypticSee #5064
Closes #4870
Closes #4746
Closes #4451
Closes #5041