Add validation for conversational prompts in multimodal training by qgallouedec · Pull Request #5067 · huggingface/trl

qgallouedec · 2026-02-10T21:22:01Z

There is confusion around whether data should be pre-processed before being passed to GRPOTrainer. This adds a clear, actionable error message instead of the cryptic

TypeError: string indices must be integers, not 'str'.

See #5064

Closes #4870
Closes #4746
Closes #4451
Closes #5041

HuggingFaceDocBuilderDev · 2026-02-10T21:24:47Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

akshan-main · 2026-02-10T21:46:37Z

This addresses the error message for string prompts, but the dtype mismatch (expected scalar type BFloat16 but found Float in layer_norm) reported in #4451 comments is a separate crash that still happens even with conversational prompts. I have a fix for that want me to open a separate PR for it, or should I add it here?

qgallouedec · 2026-02-10T23:10:49Z

Yes please open a separate PR 🙏

akshan-main · 2026-02-11T04:50:51Z

on it

qgallouedec · 2026-02-11T19:27:08Z

@codex review

qgallouedec · 2026-02-11T19:34:58Z

@codex review

albertvillanova

Thanks.

commit 489331e703e1e8d39534957f465fadce7f00ff99 Author: Quentin Gallouédec <gallouedec.quentin@gmail.com> Date: Tue Mar 3 14:50:42 2026 +0000 Replace deprecated asyncio.iscoroutinefunction with inspect.iscoroutinefunction in RLOO/GRPO trainers commit 484c1c1acf0b437c20e230d5e135613daf1a59fa Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com> Date: Tue Mar 3 08:42:04 2026 -0600 CI: Add Qwen 3.5 tiny model to tests (#5204) commit 7eebb294a9175ea2f0ffbf20cf759f772491d815 Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com> Date: Tue Mar 3 07:35:22 2026 +0100 Decouple rollout dispatch from vLLM backend in GRPO _generate_single_turn (#5122) Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com> Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com> Co-authored-by: swappy <59965507+rycerzes@users.noreply.github.com> commit 0bf875c0cbb879c4b264f66a6e556769d42e2f52 Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com> Date: Mon Mar 2 15:54:35 2026 +0100 Mark CI test_training_vlm_and_liger as xfail (#5202) commit 7544c3a784147dbfc53bb1314558137320ecc3ed Author: Michael Royzen <45830328+michaelroyzen@users.noreply.github.com> Date: Fri Feb 27 14:42:57 2026 -0500 Support sequence sampling in Liger Kernel and pass importance_samplin… (#5190) Co-authored-by: Michael Royzen <michaelroyzen@mac.mynetworksettings.com> Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com> commit 5cffd59a8a814b9132c6d08e5aa88347a41c66e3 Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com> Date: Fri Feb 27 15:43:33 2026 +0100 Set CI PYTORCH_ALLOC_CONF env variable to avoid OOM (#5197) commit eb8b8a510b3ee0e7e83e33f8cfbb6eada8eb7f34 Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com> Date: Fri Feb 27 08:11:51 2026 -0600 Re-add liger-kernel to dev deps (#5164) Co-authored-by: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com> commit 68f807b5e2ba4994898a7ef21ba631b64fb7c4b5 Author: Zhenkun Cai <zekucai@gmail.com> Date: Fri Feb 27 05:11:51 2026 -0800 Add `pad_to_multiple_of` to GRPOTrainer and RLOOTrainer (#5180) commit e53c98feb463c0897451b307432360c1616a8905 Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com> Date: Fri Feb 27 13:34:17 2026 +0100 Fix CI tests patching BaseTrainer (#5192) commit bd2d21e02cc722221c0c7f91f4ddc7cbd9d271fa Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com> Date: Fri Feb 27 08:27:22 2026 +0100 Refactor CLI [6/N]: Refactor env/vllm-serve commands with delayed imports (#5187) commit e63cd79c68fc62edf63f01904ee02b0e63ab4336 Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com> Date: Fri Feb 27 08:25:30 2026 +0100 Refactor CLI [5/N]: Refactor TrainingCommand with delayed imports (#5186) commit e941ff58121d382b470f8c8011dd76088192c46b Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com> Date: Fri Feb 27 08:23:38 2026 +0100 Fix deprecation warning of fork in multi-threaded process (#5185) commit b9263efa25e05ebf1c8c1525a9d5a6a7e94efbb2 Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com> Date: Fri Feb 27 08:22:58 2026 +0100 Fix deprecation warning of create_reference_model (#5184) commit 410c00bfaead36b0048921a123739bd0cb4c3e7c Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com> Date: Thu Feb 26 10:38:56 2026 -0600 Align documentation with the intended public API (#5162) commit 519225384f9aaa7acf3959fbf6a218c2490d4a0e Author: Sergio Paniego Blanco <sergiopaniegoblanco@gmail.com> Date: Thu Feb 26 15:44:50 2026 +0100 Add minimal CARLA example script (#5161) commit 64b47513982e2845c8cb6f4d5d611037f605d9bf Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com> Date: Thu Feb 26 11:11:52 2026 +0100 Refactor CLI [4/N]: Replace top-level TrlParser with ArgumentParser (#5170) commit f00379fa221689d67a3736c44eaf07137c11d5f9 Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com> Date: Thu Feb 26 10:02:45 2026 +0100 Make _BaseConfig and _BaseTrainer explicitly private (#5169) commit eb973af2d1109c84600c7fdddf259e06a547f583 Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com> Date: Thu Feb 26 09:11:09 2026 +0100 Document parameters with differing default values in core configs (#5168) commit b2b3045dfe3a3b6a0c52785b055b60e9a1a0e73b Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com> Date: Thu Feb 26 07:56:32 2026 +0100 Handle mm_token_type_ids in SFT/GRPO/RLOO to fix IndexError (#5178) commit 27e3e2ff68929b25045caf8af32799b2e1dc3965 Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com> Date: Wed Feb 25 16:03:45 2026 -0600 ⬆️ Bump dev version (#5182) commit d24e19424da2837d435a7884c0b307b605413829 Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com> Date: Wed Feb 25 15:56:26 2026 -0600 Release: v0.29 (#5181) commit 70cf097fb8a39b8ad86aa6e27d49f081e96da4a5 Author: LeonEricsson <70749762+LeonEricsson@users.noreply.github.com> Date: Wed Feb 25 22:01:16 2026 +0100 feature: Configurable num logprobs in vLLM generation (#5107) Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com> Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com> commit 57d749336487d7ece06e58b941e4180f13649d8f Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com> Date: Wed Feb 25 11:17:13 2026 -0600 Rename input keys in `RewardTrainer` collator from `chosen/rejected_input_ids` to `chosen/rejected_ids` (#5179) commit a0d7d8e1257dea15fae6df434285958d22ce9c4e Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com> Date: Wed Feb 25 17:31:50 2026 +0100 Update upstream tracking info about CI PyTorch JIT deprecation warnings (#5166) commit 51fdc53e08b0ee39b65ba699fb49281d183701ce Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com> Date: Wed Feb 25 17:16:55 2026 +0100 Document parameters with differing default values in experimental configs (#5172) commit dd15cbb04a47c8efb4c8ed13e315dc4f2e1f853e Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com> Date: Wed Feb 25 17:07:45 2026 +0100 Fix default learning_rate in BCO according to paper (#5173) commit 0b2cd5c04e26e13358413c00e98a56e2c2914eb9 Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com> Date: Wed Feb 25 16:42:43 2026 +0100 Accept mm_token_type_ids in GRPO/RLOO _get_per_token_logps_and_entropies (#5176) commit 95cedba36e5e015f9402bb997529337d6c90b0bb Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com> Date: Wed Feb 25 16:39:36 2026 +0100 Fix default learning_rate in PPO according to paper (#5174) commit 6d78858d176b9fb385b6d0f332d369e1ee2e27fb Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com> Date: Wed Feb 25 16:26:16 2026 +0100 Fix experimental TestUpdateWithReplayBuffer: ValueError: `train_dataset` is required (#5171) commit 0efaec33fbd3445eb1142c306e797940fad4de28 Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com> Date: Wed Feb 25 09:25:39 2026 -0600 Revert changes in vLLM client/server (#5165) commit e540d687f8df6f3596fa6eb3cc50116b41d58f42 Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com> Date: Wed Feb 25 10:48:43 2026 +0100 Refactor CLI [3/N]: Self-contain VllmServeCommand argument parsing (#5160) commit 9cc95a97927e59c3532ce2be3babcfd8a35adcd9 Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com> Date: Wed Feb 25 09:57:53 2026 +0100 Refactor CLI [2/N]: Move accelerate concerns into TrainingCommand (#5159) commit 827457ce5845c5a5b02dab164e12f55cd1c4c532 Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com> Date: Wed Feb 25 09:34:28 2026 +0100 Raise ValueError for None train_dataset in core trainers (#5157) commit 8b3934ce1681c9f959167804692d6d94fbb36eb0 Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com> Date: Wed Feb 25 07:57:00 2026 +0100 Fix Liquid syntax error in DPO trainer docs caused by double braces in LaTeX (#5153) commit 4cd198e856b98cae6ed6d0632ab86ca22b432e23 Author: Blake Ledden <47259830+bledden@users.noreply.github.com> Date: Tue Feb 24 19:41:05 2026 -0800 fix: wake up vLLM weights before sync to prevent writes to freed memory (#5147) Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com> Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com> Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com> commit 1149b746db9f39dc28859e426b16f0e6557db240 Author: ehofm <ella@rilix.ai> Date: Tue Feb 24 20:36:07 2026 -0500 Fix structured_outputs handling and tool normalization in vLLM backend (#5155) Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com> Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com> commit ea2b4958d0165e01a11b6f07ec024ee8c1d1835d Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com> Date: Tue Feb 24 18:10:55 2026 -0600 Fix CI by removing liger-kernel from dev deps (#5163) commit cfbdd3bea4448cde878c0da0de49551f553c61fe Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com> Date: Mon Feb 23 22:27:02 2026 -0600 Fix `SFTTrainer` support for single-image data (#5132) commit fa313fd57244008953753047795c954a782f9cfc Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com> Date: Mon Feb 23 16:06:04 2026 +0100 Add support for Python 3.14 (#4225) Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com> commit bc4edf6f02e6f07549d43b3543cb54d597cd3d91 Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com> Date: Mon Feb 23 15:06:58 2026 +0100 Fix type of TrainingArguments.logging_steps in docs (#5149) commit 5269393f4269462ce5d4a9227a97af6911da7939 Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com> Date: Mon Feb 23 15:06:14 2026 +0100 Use BaseConfig in all experimental configs (#5148) commit ef08730432721d67139e273775acf14846fa95d9 Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com> Date: Mon Feb 23 15:04:11 2026 +0100 Fix PPOTrainer.save_model (#5151) commit ae97f06954b274f582f82ac60e444897f73f14c3 Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com> Date: Mon Feb 23 08:00:44 2026 -0600 Fix wording in DPO and SFT trainer documentation for clarity (#5140) commit f150780cda7b0a82a4840c44b7026732ee17c4bb Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com> Date: Mon Feb 23 09:06:38 2026 +0100 Move common fields from stable trainer configs to BaseConfig (#5136) commit 93f2e480daa0ee9962a8a5deff6ea3da347fe911 Author: casinca <47400729+casinca@users.noreply.github.com> Date: Sun Feb 22 20:09:44 2026 +0100 refactor(gkd_trainer): small optim (#5143) commit 8067ea7558ed4477afce710bbf2f8a1a79973ba7 Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com> Date: Fri Feb 20 11:16:55 2026 -0600 Add `environment_factory` to `GRPOTrainer` (#5093) Co-authored-by: Sergio Paniego Blanco <sergiopaniegoblanco@gmail.com> commit 7a4156a1bb3224a3c7f5861d39ef76367273a26b Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com> Date: Fri Feb 20 16:05:55 2026 +0100 Fix `trl <command> --help` TypeError caused by unescaped `%` in `TrainingArguments` help strings (#5135) commit c3ead5b556d9ea588b4a95cae1775913118ddbc6 Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com> Date: Fri Feb 20 11:09:02 2026 +0100 Fix NameError: name 'importlib' is not defined (#5134) commit b7fa6bf17322f03d5ec47d12efc142de9ea5981a Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com> Date: Fri Feb 20 09:46:30 2026 +0100 Fix import latency [2/N]: Implement native _is_package_available (#5129) commit bb147645fad777c01ce1ccd2f10350b3cc50fceb Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com> Date: Fri Feb 20 09:45:59 2026 +0100 Fix import latency [1/N]: Extract _LazyModule to dedicated module (#5128) commit e3b7897c873f94c26bf1a661df19e428239be114 Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com> Date: Thu Feb 19 20:53:15 2026 -0600 Refactor DPO (#3906) Co-authored-by: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com> Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com> commit a68fb896f008f58a3d37abf11ae665357b7c679a Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com> Date: Thu Feb 19 13:15:42 2026 -0600 Remove revision references in dataset loading for toolcall tests (#5133) commit 699b8420cd6601474788effdab063d3d5e7bbc3b Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com> Date: Thu Feb 19 17:53:30 2026 +0100 Refactor TRL CLI into modular command architecture (#5124) commit b46614e235f126c1c8d0fd9f41f4d217a8299c34 Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com> Date: Thu Feb 19 17:08:14 2026 +0100 Implement Agent Skills [4/N]: Create skills CLI (#5103) commit f8181886c6a59f5f8c2a2bf31bb4cb7bda225d39 Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com> Date: Wed Feb 18 11:59:13 2026 -0600 Update tool handling to support JSON string schemas in trainers (#5118) commit 9fc9a7dcebe3938a273e18ca3ed5b2cfdb6c0839 Author: Sergio Paniego Blanco <sergiopaniegoblanco@gmail.com> Date: Wed Feb 18 18:11:42 2026 +0100 Add Tiny Aya tool calling examples (script/notebook) (#5123) Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com> Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com> commit 27431134e3447181821ffaf94c405a44d87d1bc1 Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com> Date: Wed Feb 18 10:10:14 2026 -0600 Add GLM-4.5 model to tests (#5114) commit 0e531bdd1eb654bed32d12b474ea998285cb1253 Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com> Date: Wed Feb 18 09:22:06 2026 -0600 Add check for `None` in `get_trackio_space_url()` to prevent errors (#5115) commit 8b082bb2d4d599d66cf36df0b754c2f4de0371de Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com> Date: Wed Feb 18 09:18:02 2026 -0600 Fix Qwen3 schema (#5111) commit 269217f92092e4497260e34bd535756cb9e76f64 Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com> Date: Wed Feb 18 08:27:27 2026 -0600 Add test for Cohere2 models (#5116) commit 57df014377bef538c87c616379d4c11aeaf05b30 Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com> Date: Wed Feb 18 06:45:38 2026 -0600 Add more tests for `get_training_chat_template` (#5108) commit 70efa963f1c9bb88ec3144b051db6fdf4ffc10a4 Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com> Date: Tue Feb 17 10:51:47 2026 -0600 Update version check for transformers to 5.2.0 in online_dpo_trainer.py (#5110) commit 269ed992dca0858f290e08b5ebae271a15df8aa6 Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com> Date: Tue Feb 17 06:49:34 2026 -0600 Add validation for conversational prompts in multimodal training (#5067) commit 997536a2b56d4a0824bb55f9265e6561c3fd1e43 Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com> Date: Tue Feb 17 10:30:06 2026 +0100 Implement Agent Skills [3/N]: Create skills installer (#5100) commit 8b9b972878243505d26b3dc69945613ff5ddc98b Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com> Date: Mon Feb 16 15:50:59 2026 -0600 Remove outdated liger-kernel compatibility checks and warnings in tests and SFTTrainer (#5105) commit 8c232f64b5bb00ef854bff157f6857241d415fe0 Author: Harikrishna KP <harikp2002@gmail.com> Date: Tue Feb 17 02:09:35 2026 +0530 Fix SFT loss type rewards being overwritten in dpo_loss() (#5079) commit 99b26fb2e6f241195fc9b378ee8d50a6219083b2 Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com> Date: Mon Feb 16 14:34:54 2026 -0600 Add Trackio integration for model card visualization (#5101) commit c94c032129af436c55764fef66389f30856df3d0 Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com> Date: Mon Feb 16 21:13:09 2026 +0100 Fix style (#5106) commit 3d1c785762ce87892a7eaf18d1c0fb8771a74bc3 Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com> Date: Mon Feb 16 19:17:55 2026 +0100 Implement Agent Skills [2/N]: Create skills module (#5097) commit 1702fc07b2d0c8ba23ad3299879d4edaaccb3b30 Author: LeonEricsson <70749762+LeonEricsson@users.noreply.github.com> Date: Mon Feb 16 19:09:59 2026 +0100 feature: top_k selective_log_softmax (#5104) Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com> commit 29ace1ad72152f4d648bf23763a319ce2400b9e6 Author: flutist <30485581+flutist@users.noreply.github.com> Date: Tue Feb 17 00:56:52 2026 +0800 Fix DPO and RLOO incompatibility with FSDP2 (#4838) Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com> Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com> commit 3d2e898ce4d538b5502f735a9c5116c8b45aaa46 Author: Yuki Uehara <74698040+yukiu00@users.noreply.github.com> Date: Tue Feb 17 01:28:43 2026 +0900 Pass vllm_is_ratio to LigerFusedLinearGRPOLoss in compute_liger_loss (#5031) Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com> commit b6957fc4c04a100e6829cb56e28f20668e4ad1ae Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com> Date: Mon Feb 16 17:02:42 2026 +0100 Implement Agent Skills [1/N]: Create training skill (MVP) (#5096) Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com> commit abf6033b99fc75eb9d58458b44b81f7f7faebdc1 Author: Behrooz Azarkhalili <80390531+behroozazarkhalili@users.noreply.github.com> Date: Mon Feb 16 04:49:30 2026 -0800 docs: Unify model examples to use trl-lib namespace (#4431) Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com> Co-authored-by: Sergio Paniego Blanco <sergiopaniegoblanco@gmail.com> commit 7694e0e878361c0de6c6cb566c123f46af13d91d Author: Nabin Oli <107109731+nabin2004@users.noreply.github.com> Date: Sat Feb 14 00:36:27 2026 +0545 docs: add Multi-Node Training subsection (#4384) (#5091) Co-authored-by: Sergio Paniego Blanco <sergiopaniegoblanco@gmail.com> commit 28fc3f2c336bb7f734aab49c1ad073e152dccf61 Author: Behrooz Azarkhalili <80390531+behroozazarkhalili@users.noreply.github.com> Date: Thu Feb 12 10:37:15 2026 -0800 docs: Add MPO paper (2411.10442) to paper index (#5089) Co-authored-by: Sergio Paniego Blanco <sergiopaniegoblanco@gmail.com> commit 051b52fbf2c68edcae092357d0d4118b35a5f60b Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com> Date: Thu Feb 12 18:53:17 2026 +0100 Validate reward model has 1 num_labels (#5087) commit a558fba8a5700933207c3963a1dab8a28291f2f1 Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com> Date: Thu Feb 12 18:04:13 2026 +0100 Fix BFD packing for SFT datasets (#5076) commit 0073db963788d6cc77d51789f4b3d2c34930cfdc Author: Behrooz Azarkhalili <80390531+behroozazarkhalili@users.noreply.github.com> Date: Thu Feb 12 01:55:06 2026 -0800 docs: Add PPO paper (1707.06347) to paper index (#5085) Co-authored-by: Sergio Paniego Blanco <sergiopaniegoblanco@gmail.com> commit 29ed9cb4ff346100bee004acd1ce7cc97554f064 Author: Behrooz Azarkhalili <80390531+behroozazarkhalili@users.noreply.github.com> Date: Thu Feb 12 01:45:57 2026 -0800 docs: Add T5 packing paper (1910.10683) to paper index (#5084) Co-authored-by: Sergio Paniego Blanco <sergiopaniegoblanco@gmail.com> commit d0e06fcda40607ad7bd1a3b639a3018c3bb4bfca Author: Behrooz Azarkhalili <80390531+behroozazarkhalili@users.noreply.github.com> Date: Thu Feb 12 01:38:35 2026 -0800 docs: Add PRM paper (2211.14275) to paper index (#5083) Co-authored-by: Sergio Paniego Blanco <sergiopaniegoblanco@gmail.com> commit ee979a9d2f23c100cb9d4010b4ad99275a6c726c Author: Behrooz Azarkhalili <80390531+behroozazarkhalili@users.noreply.github.com> Date: Thu Feb 12 01:29:29 2026 -0800 docs: Add GKD paper (2306.13649) to paper index (#5082) Co-authored-by: Sergio Paniego Blanco <sergiopaniegoblanco@gmail.com> commit ff84817d27241643abfe3f7691448e509e093320 Author: Behrooz Azarkhalili <80390531+behroozazarkhalili@users.noreply.github.com> Date: Thu Feb 12 01:23:04 2026 -0800 docs: Add CPO paper (2401.08417) to paper index (#5081) Co-authored-by: sergiopaniego <sergiopaniegoblanco@gmail.com> commit fe890df6e2a84345a29be849bb5f27ca72052034 Author: Behrooz Azarkhalili <80390531+behroozazarkhalili@users.noreply.github.com> Date: Thu Feb 12 01:05:10 2026 -0800 docs: Add ORPO paper (2403.07691) to paper index (#5080) Co-authored-by: Sergio Paniego Blanco <sergiopaniegoblanco@gmail.com> commit c88f8c550137ddd1ddde56baebae2d9b97b9d54d Author: Behrooz Azarkhalili <80390531+behroozazarkhalili@users.noreply.github.com> Date: Thu Feb 12 01:01:37 2026 -0800 docs: Add TR-DPO paper (2404.09656) to paper index (#5078) commit 0562c3fa26c1bc827aff83800b046f9a2af925a6 Author: Logan Vegna <logan.vegna@shopify.com> Date: Wed Feb 11 16:17:04 2026 -0500 [SFT] Fix high vRAM consumption during eval with liger kernel (#5069) Co-authored-by: Cursor <cursoragent@cursor.com> Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com> commit 6b38db6ad85cf67ce1b7d4f037e5e5840d474587 Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com> Date: Wed Feb 11 19:22:00 2026 +0100 Fix CI ValueError: Pointer argument (at 0) cannot be accessed from Triton (cpu tensor?) (#5074) Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com> commit 060fbfebbddf1e539e5dcee456bef643c29036d3 Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com> Date: Wed Feb 11 10:44:54 2026 -0600 Update model from SequenceClassification to CausalLM in `RewardTrainer` tests (#5060) commit 0933b7fc5ddb933c632708bba5936b99238168d8 Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com> Date: Wed Feb 11 17:18:12 2026 +0100 Fix logging warning suppression for transformers 4.56.2 (#5077) commit a07fb82b9a4333ea91cfe289a697b9b178d99021 Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com> Date: Wed Feb 11 09:21:48 2026 -0600 fix: Set `num_labels` to 1 in causal model initialization for RewardTrainer (#5066) commit 29fe68205caf4acbf888307487e8423d692ee496 Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com> Date: Wed Feb 11 15:08:11 2026 +0100 Fix GRPO multi-turn training with liger kernels (#4975) commit 68399dfa6a03e4dea6ea5087c4d181b3b400cab5 Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com> Date: Wed Feb 11 07:39:27 2026 -0600 fix: Use `launch_args` for all trainers (#5059) commit d1b066fdc4a8a7d0bde59e3bf1aeaac8803746d1 Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com> Date: Wed Feb 11 07:37:54 2026 -0600 Fix logging warning suppression with scoped override for seq-clf head key (#5058) commit 0c3d33b955730308bab3d28ba2ef6eebe704c7f8 Author: Behrooz Azarkhalili <80390531+behroozazarkhalili@users.noreply.github.com> Date: Wed Feb 11 02:41:07 2026 -0800 docs: Add SimPO paper (2405.14734) to paper index (#5071) Co-authored-by: Sergio Paniego Blanco <sergiopaniegoblanco@gmail.com> commit f23e3a775155f458108cb01ebcfde085ecca4733 Author: Behrooz Azarkhalili <80390531+behroozazarkhalili@users.noreply.github.com> Date: Wed Feb 11 02:33:11 2026 -0800 docs: Add RPO paper (2405.16436) to paper index (#5070) Co-authored-by: Sergio Paniego Blanco <sergiopaniegoblanco@gmail.com> commit e46005c07c1a8194f4ff71b749dceb44f17d7eb7 Author: Behrooz Azarkhalili <80390531+behroozazarkhalili@users.noreply.github.com> Date: Wed Feb 11 02:22:52 2026 -0800 docs: Add XPO (2405.21046) to Paper Index (#5068) Co-authored-by: Sergio Paniego Blanco <sergiopaniegoblanco@gmail.com> commit 6d9bba1b3b9f181d9cf53eb9092a3d52b66de93b Author: Behrooz Azarkhalili <80390531+behroozazarkhalili@users.noreply.github.com> Date: Wed Feb 11 02:15:59 2026 -0800 docs: Add REINFORCE++ (2501.03262) to Paper Index (#5062) Co-authored-by: Sergio Paniego Blanco <sergiopaniegoblanco@gmail.com> commit b992e9284aac5979ab5716d14587067857663398 Author: Behrooz Azarkhalili <80390531+behroozazarkhalili@users.noreply.github.com> Date: Wed Feb 11 01:37:06 2026 -0800 docs: Add INTELLECT-2 (2505.07291) to Paper Index (#5061) Co-authored-by: Sergio Paniego Blanco <sergiopaniegoblanco@gmail.com> commit c985dbadd3a499dc049f8c39c61034525d381006 Author: Jen Wei <45276133+JenWei0312@users.noreply.github.com> Date: Wed Feb 11 02:25:55 2026 -0700 docs: add DeepSeek-R1 training dynamics and GRPO example (#5053) Co-authored-by: Sergio Paniego Blanco <sergiopaniegoblanco@gmail.com> commit d934eb757806501a5106b6e4374d920961dc4e9f Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com> Date: Tue Feb 10 19:14:40 2026 +0100 Remove deprecated mergekit_utils moved to experimental (#5057) commit 991fd0755aa1cce7a800d271dae4b525b6357bfd Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com> Date: Tue Feb 10 11:57:11 2026 -0600 Remove duplicated tests for SFT and add gradient checkpointing tests (#5054) commit d42b23f63f164af241c34c79e6a855d1eb896d4d Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com> Date: Tue Feb 10 18:17:10 2026 +0100 Remove deprecated classes moved to experimental (#5044) commit e1a84cf626d249e9b55447d63eece88cbf92d100 Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com> Date: Tue Feb 10 18:06:05 2026 +0100 Remove deprecated RLOOConfig.max_prompt_length (#5056) commit fc560370d97042cb7f90a9bf0e2e30d29a304240 Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com> Date: Tue Feb 10 17:47:21 2026 +0100 Remove deprecated XPO after moved to experimental (#5055) commit 13bd37e1426eb81aca81cd68eb8d0efcbc6351b9 Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com> Date: Tue Feb 10 17:36:36 2026 +0100 Remove deprecated PRM after moved to experimental (#5052) commit 0aea3144031abacb0efadd8aab5a3ca9fe6380e8 Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com> Date: Tue Feb 10 17:33:24 2026 +0100 Remove deprecated PPO after moved to experimental (#5051) commit d705ac4d0f13168724f548c9ecbb1a586c187e16 Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com> Date: Tue Feb 10 17:06:11 2026 +0100 Remove deprecated ORPO after moved to experimental (#5050) commit b393c6bf6605d04f00a110085bf59feef59ffa6a Author: Salman Chishti <13schishti@gmail.com> Date: Tue Feb 10 15:29:23 2026 +0000 Upgrade GitHub Actions to latest versions (#4893) Signed-off-by: Salman Muin Kayser Chishti <13schishti@gmail.com> Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com> commit ce2ea744c5f504a026aa9ea41815bfa382417a9d Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com> Date: Tue Feb 10 15:47:05 2026 +0100 Remove deprecated Judges after moved to experimental (#5048) commit 4620e91d21ad0dd3d885abe29446d7a52a9d368e Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com> Date: Tue Feb 10 15:18:49 2026 +0100 Remove deprecated CPO after moved to experimental (#5046) commit 6e47225d012aba64efb16789356bee9d037ec171 Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com> Date: Tue Feb 10 15:18:33 2026 +0100 Remove deprecated BCO after moved to experimental (#5045) commit 17277e2d963611603eb2655af429975feade3b5c Author: LeonEricsson <70749762+LeonEricsson@users.noreply.github.com> Date: Tue Feb 10 15:13:24 2026 +0100 [GRPO] fix: remove SAPO temperature check (#5042) commit 7267b2d3589bcdccb46e8cfd51d63416d4378c76 Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com> Date: Tue Feb 10 14:42:18 2026 +0100 ⬆️ Bump dev version (#5049) commit 4aaaf064c15ad80bea91895c8f202d44ad17cdb4 Author: casinca <47400729+casinca@users.noreply.github.com> Date: Tue Feb 10 14:27:45 2026 +0100 [minor] docs: typo in `grpo_trainer.md` (#5047) commit 49ef33428c47235991acb4e185ea599b70c6dab4 Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com> Date: Tue Feb 10 14:20:48 2026 +0100 Release: 0.28 (#5043) commit a958acc1e92d9ccf8404d800bf073c4cc7e5dd85 Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com> Date: Tue Feb 10 03:43:23 2026 -0600 Add Online Direct Preference Optimization section to paper index (#5037) commit 8b935c6378b78adcb4fda9e66a944e05cc99b681 Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com> Date: Tue Feb 10 03:41:41 2026 -0600 Fix multiprocessing start method to 'spawn' for test compatibility with Python 3.12+ (#5036) commit 40fff2e3bab905c2e7096360f4a8f014aab4cd14 Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com> Date: Tue Feb 10 03:32:24 2026 -0600 Deprecate FDivergenceType in DPOConfig; update f_divergence_type to use string values (#5039) Co-authored-by: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com> commit fe1949b4da60dc7adbcf3a0bb17a4f42280c5c28 Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com> Date: Tue Feb 10 03:22:43 2026 -0600 Deprecate string usage for `ref_model` in DPOTrainer (#5040) Co-authored-by: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com> commit 9f7c33600b7555a72234926a846ee65ff2508624 Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com> Date: Tue Feb 10 02:51:40 2026 -0600 Rename AOT loss type 'aot_pair' to 'aot_unpaired' in DPO (#5038) Co-authored-by: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com> commit 19c5f4460cd9b405a95fabb05322ceb98864e915 Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com> Date: Tue Feb 10 08:04:20 2026 +0100 Allow testing with transformers 5.1.0 via xfail marks (#5034) commit 442509524b4e7c8ee4d9f1d6f1f1087b5dcd1a0f Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com> Date: Mon Feb 9 23:45:11 2026 +0100 Fix CI FutureWarning: max_prompt_length is deprecated (#5019) Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com> commit 0ef315a0f992a30f82792e87b5e3f0fab58a6107 Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com> Date: Mon Feb 9 22:45:23 2026 +0100 Filter max_prompt_length UserWarning in all test cases (#5035) Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com> commit 8a27a17e583756b75ce71eb40a9ed295a610ade1 Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com> Date: Mon Feb 9 22:43:34 2026 +0100 Fix CI FutureWarning: tools is deprecated (#5015) commit ff55949cccaaa23a752d951d184c04ff579aad89 Author: Haseeb Asif <149416177+Haseebasif7@users.noreply.github.com> Date: Tue Feb 10 02:42:55 2026 +0500 Add length-unbiased GRPO loss (LUSPO) (#4988) Co-authored-by: Haseeb Asif <haseeb@Haseebs-MacBook-Air.local> Co-authored-by: Leon Ericsson <leon.ericsson@icloud.com> Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com> commit 765e397ed83f344a8ba7082673d4c4616beeb3a2 Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com> Date: Mon Feb 9 15:00:49 2026 -0600 [CI] Silence PyTorch JIT and DataLoader deprecation warnings (#4999) Co-authored-by: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com> commit b5bd2b98ed615676a7bee40c8ae17de62421bb4a Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com> Date: Mon Feb 9 21:12:57 2026 +0100 Mark Qwen3VL tests as xfail for transformers 5.0.x (#5029) commit 7189bc68d8ab0d19b67c0fc0b849c83679389b46 Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com> Date: Mon Feb 9 21:04:35 2026 +0100 Fix CI FutureWarning: use_logits_to_keep is deprecated (#5013) commit db0d95523e5b8039c94c50df1b6286ff8b7e29ce Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com> Date: Mon Feb 9 15:53:56 2026 +0100 Fix CI FutureWarning: rpo_alpha is deprecated (#5011) commit fa06506f9d1c9546f63ae513cf7a5ba1be3247ae Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com> Date: Mon Feb 9 15:52:31 2026 +0100 Fix typo in xfail test reason (#5028) commit 4abd67951f996b511f4b913ead2386fbe357061b Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com> Date: Mon Feb 9 15:51:21 2026 +0100 Fix CI FutureWarning: generate_during_eval is deprecated (#5017) commit 9f1e7dd7fd58be3327748234fb952599c4bd4f09 Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com> Date: Mon Feb 9 15:48:59 2026 +0100 Pin transformers < 5 in judges extra due to incompatibility (#5024) commit 7c4e7f86047b82ad0e5ff7c8e3bb280b73024f31 Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com> Date: Mon Feb 9 15:22:02 2026 +0100 Fix vision model prompt truncation bug in DPOTrainer (#5023) commit a68c82a617be59086b83f5ce941175270926de3f Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com> Date: Mon Feb 9 15:00:22 2026 +0100 Fix typo in DPO max_prompt_length deprecation warning message (#5020) commit 5eb25938d44b781687061a69586a06501b11e915 Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com> Date: Mon Feb 9 14:35:44 2026 +0100 Fix CI FutureWarning: ref_model_init_kwargs is deprecated (#5009) commit 58f467babd998fe5fe41598b535ceacda690cef0 Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com> Date: Mon Feb 9 07:34:48 2026 -0600 Add support for `nested_gather` in OnlineDPOTrainer for transformers v5.2.0 and above (#4981) commit 71a349335ce554180b2b4947d33594090f74d5cf Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com> Date: Mon Feb 9 14:34:44 2026 +0100 Fix CI TRLExperimentalWarning in regular tests (#5007) commit a7333c8c68f564005ab74fd999ad246de405122f Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com> Date: Mon Feb 9 14:34:15 2026 +0100 Filter CI SWIG deprecation warnings (#5004) commit 98b00171b81411cd8bf7d6a9135af70c9879aaee Author: Nabin Oli <107109731+nabin2004@users.noreply.github.com> Date: Mon Feb 9 18:59:18 2026 +0545 docs: add CGPO/Mixture of Judges (2409.20370) to Paper Index + link ref to AllTrueJudge (#5002) Co-authored-by: Sergio Paniego Blanco <sergiopaniegoblanco@gmail.com> commit 728b0e372fb7de141093aff5513697a4fa743137 Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com> Date: Mon Feb 9 07:13:47 2026 -0600 [tests] Remove xfail for transformers version >= 5.0.0 due to upstream bug resolution (#5000) commit 637de450e748d1f612c0f6fed6be4df9cbbf1c39 Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com> Date: Sun Feb 8 14:36:10 2026 -0600 Add `sanitize_logprob` function for NaN handling in vLLM log probabilities (#5001) commit bfb94262b81fd28017a11d7b9ddc61e3095cc2b6 Author: Akshay Ballal <61191840+akshayballal95@users.noreply.github.com> Date: Sat Feb 7 14:48:58 2026 +0100 Fix GRPO tool calling for corrupted tool calls (#4890) commit 7a39ff3995f2f8b7cb4f8ca29a09390ac587a43d Author: casinca <47400729+casinca@users.noreply.github.com> Date: Fri Feb 6 23:05:14 2026 +0100 perf: Qwen SAPO loss optimization (#4956) Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com> Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com> commit bd206704f9fc2c08039c522b40b0f68654bb006f Author: Sergio Paniego Blanco <sergiopaniegoblanco@gmail.com> Date: Fri Feb 6 23:00:24 2026 +0100 Update sampling mode to token level for safety (#4989) Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com> commit aa7d457f9bec736c75439f78d081a6dc012ce353 Author: cmunley1 <cmunley@nvidia.com> Date: Fri Feb 6 13:37:53 2026 -0800 Update NeMo-Gym to use `env_mask` (#4986) Signed-off-by: Christian Munley <cmunley@nvidia.com> Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com> commit 5db1c11c52bc95255ea73e7eae3840fbeeb293a2 Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com> Date: Fri Feb 6 11:27:47 2026 -0600 Add distributed smoke tests workflow for Transformers branch (#4996) commit 90a35d12c9c64129eb023b499a812c5e638db846 Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com> Date: Fri Feb 6 10:47:49 2026 -0600 Add GitHub Actions workflow for testing against Transformers branch (#4995) commit f11b4c3fdd511d9adfda74ceec02042cee65a0f3 Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com> Date: Fri Feb 6 09:46:14 2026 -0600 Fix ZeRO-3 + PEFT + gradient checkpointing (#4951) commit 27cbe98ac7487f326be51e180a1ee078c23b3836 Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com> Date: Fri Feb 6 16:14:18 2026 +0100 Fix post_init warning stacklevel to 3 (#4993) commit 57cac251bdde714f97458b39b24702cf624dec66 Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com> Date: Fri Feb 6 16:13:47 2026 +0100 Fix deprecation of DPOConfig.max_completion_length (#4992) commit ce72c067f6b55d4352c71939d9be6f4dfdaf68a0 Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com> Date: Fri Feb 6 16:13:14 2026 +0100 Assert chat_template is applied in test_train_with_chat_template_kwargs (#4991) commit ffdaba3a97299c0c381512f807c57aa753f6314a Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com> Date: Fri Feb 6 16:11:40 2026 +0100 Fix import of AutoModelForCausalLMWithValueHead from experimental (#4990) commit 97a8a9672c0d5fbacb5c60934f40a7af404adecf Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com> Date: Fri Feb 6 07:10:06 2026 -0600 Use local variable instead of attribute in collator tests (#4957) commit 4e212bdeed6c7bf081e494960c65a84a35a85d6e Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com> Date: Fri Feb 6 07:07:28 2026 -0600 Update dataset configuration name in toolcall dataset loading (#4984) commit c82f6aa4766f83b66ea3f37e7bbf3b30453a1cda Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com> Date: Fri Feb 6 14:03:31 2026 +0100 Fix passing tokenizer in test_train_with_chat_template_kwargs (#4987) commit c581c1e8829904c6838c38d70d7b5fa646e2f0fc Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com> Date: Fri Feb 6 14:03:02 2026 +0100 Pin transformers!=5.1.0 in deepspeed extra due to incompatibility (#4985) commit 98aca7f4fdb2c2879c86aa9bd18ccddece112b70 Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com> Date: Fri Feb 6 06:54:48 2026 -0600 Replace `warmup_ratio` with `warmup_steps` (#4983) commit 032ee139d90d1279549e2b538f11a2c3b7c22aa7 Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com> Date: Fri Feb 6 00:34:04 2026 -0600 [CI] Disallow installation of transformers 5.1.0 due to compatibility issues with DeepSpeed (#4982) commit a0e5f265604356d2a107edccd920b5236582a3d7 Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com> Date: Thu Feb 5 13:42:37 2026 -0600 Deprecate parameters in `DPOConfig` (#4969) Co-authored-by: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com> commit f0a738d954775e50d4bd4a4df4fc5d1826e2f0b4 Author: Sergio Paniego Blanco <sergiopaniegoblanco@gmail.com> Date: Thu Feb 5 20:12:15 2026 +0100 Simplify instructions of installation of OpenEnv (#4980) commit a92d14336e5380f1ce2b8cbca78ebd224933d2ba Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com> Date: Thu Feb 5 13:05:32 2026 -0600 Replace `torch.allclose` with `torch.testing.assert_close` (#4977) commit b0b798a82953094fb4d254915e4c360789ac2838 Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com> Date: Thu Feb 5 19:53:14 2026 +0100 Support truncated completions in GRPO multi-turn training (#4976) commit ac194a917b2a8b7d514097d34078ec191ef4a0e3 Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com> Date: Thu Feb 5 16:56:50 2026 +0100 Fix add_column in test_train_with_chat_template_kwargs (#4979) commit 3a76b7a8690e25838f2332b81d7efbfed2615277 Author: Sergio Paniego Blanco <sergiopaniegoblanco@gmail.com> Date: Thu Feb 5 16:24:29 2026 +0100 Set specific OpenEnv version when installed (#4978) commit 0113ad7022118e4a7afe62b4e190f0b9aee4cadf Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com> Date: Thu Feb 5 13:28:21 2026 +0100 Remove truncation from tokenizer calls if no max_length (#4972) commit eee98f77a25bb386a0aba85dcb93ad50511224f9 Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com> Date: Thu Feb 5 13:27:01 2026 +0100 Remove padding_value from experimental CPO and use pad_token_id (#4962) commit 1354860c5c33bdee7a89a8845474ac4074094ed6 Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com> Date: Thu Feb 5 06:07:05 2026 -0600 Fix test_train_with_chat_template_kwargs (#4971) commit 1bd2a52ec2d8344050af736d60cdc735181ae4b8 Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com> Date: Thu Feb 5 04:18:38 2026 -0600 Revert change in GRPO from NeMo-Gym Integration (#4970) commit 22ad7e6b3f2ec7dfc2567fa0535955812fe69a42 Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com> Date: Thu Feb 5 08:14:54 2026 +0100 Remove max_prompt_length from experimental ORPO (#4966) commit 657babd9300007308c7b9ad329790f0564daf94d Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com> Date: Thu Feb 5 08:13:10 2026 +0100 Remove max_prompt_length from experimental CPO (#4965) commit 50e35de16578daa9aee75758fad8bdc0f707e37d Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com> Date: Thu Feb 5 08:09:25 2026 +0100 Remove max_prompt_length from experimental BCO (#4964) commit 35bcab1d4da9fca092f9cefbc611fcd9eddaaf42 Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com> Date: Thu Feb 5 08:02:34 2026 +0100 Remove max_prompt_length from experimental PRM (#4963) commit e4995b2d26122879c03605f8ee136bcb241b4171 Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com> Date: Wed Feb 4 14:17:22 2026 -0600 Add test for training with `compute_metrics` in `RewardTrainer` (#4958) commit cb5a73bfd97a3cb36712cd540545c2512cfa96a6 Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com> Date: Wed Feb 4 14:14:14 2026 -0600 Add test for tool call data in `RewardTrainer` (#4959) commit 90b875c575b816e9015670ad812db43c8ab9a0e3 Author: cmunley1 <cmunley@nvidia.com> Date: Wed Feb 4 08:56:55 2026 -0800 NeMo-Gym Integration (#4848) Signed-off-by: Christian Munley <cmunley@nvidia.com> Signed-off-by: cmunley1 <cmunley@nvidia.com> Signed-off-by: Lawrence Lane <llane@nvidia.com> Co-authored-by: Sergio Paniego Blanco <sergiopaniegoblanco@gmail.com> Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com> Co-authored-by: Lawrence Lane <llane@nvidia.com> commit 5cb7eee1548bc72ed6fd84080c200a0adf74add2 Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com> Date: Tue Feb 3 11:53:35 2026 -0600 Remove access to `warnings_issued` (#4960) commit 2a55ed701122f3d210669c70b441e3baff6184b6 Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com> Date: Tue Feb 3 08:28:32 2026 -0600 Add test for training with `compute_metrics` in `SFTTrainer` (#4950) commit 7b54e7253093610ab69bb8c32b2a4eb6926721ea Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com> Date: Tue Feb 3 10:43:40 2026 +0100 Minor fix docs style (#4953) commit 2a9fb3f22a8bbad1412af3bb2526febd7160b85f Author: mel3c <gaozh1988@live.com> Date: Tue Feb 3 16:02:07 2026 +0800 Fix PPO run_name parameter not taking effect (#4945) commit a03c2fcda3a328bde9af4abc5c02f6e7e942140f Author: Sergio Paniego Blanco <sergiopaniegoblanco@gmail.com> Date: Mon Feb 2 16:27:47 2026 +0100 Update wordle.py example with masking of env tokens (#4895) Co-authored-by: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com> Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com> Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com> commit 68bc37700d2b66e1fbfa49282495f5419dd8abeb Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com> Date: Mon Feb 2 15:25:02 2026 +0100 Remove ref_model_init_kwargs from experimental BCO (#4946) commit 239c74d9ffb8ca67a9a667fb7a2a91576d554f28 Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com> Date: Mon Feb 2 09:25:41 2026 +0100 Fix SFTTrainer init logic: remove TrainingArguments.push_to_hub_token only for transformers < v5 (#4942) commit 035c3ff151b953ca72cdfe0ee966bc1469a26fde Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com> Date: Thu Jan 29 14:08:08 2026 -0600 [GRPO] Add parquet logging for completions with individual rewards (#4818) Co-authored-by: Daniel van Strien <davanstrien@gmail.com> Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com> Co-authored-by: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> commit 414e60f557eb0d0888db841c5e0e8f568e7607a8 Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com> Date: Thu Jan 29 18:57:12 2026 +0100 Set default top_k to 0 in VLLMClient (#4927) commit df332dc924e1bdc75bcfc5573950a17648db2eb4 Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com> Date: Thu Jan 29 09:03:32 2026 -0600 Fix import statement for import_utils in vllm_client.py (#4932) commit 27998e9584df0102b849878506be7d4808486771 Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com> Date: Thu Jan 29 15:51:19 2026 +0100 Fix profiling of VLLMGeneration.sync_weights (#4931) commit 43fb8d310633448a0c4c731a2efe9c1ca55e6184 Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com> Date: Thu Jan 29 15:11:24 2026 +0100 Set model dtype to float32 in experimental tests of trainers (#4925) commit 5a7481ec9340dfad5f23c54f90e15a139c1dff85 Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com> Date: Thu Jan 29 14:35:56 2026 +0100 Move VLLMClient to generation module (#4928) commit 21a0d70400179e4047c60183d7fb61988a249989 Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com> Date: Thu Jan 29 14:32:59 2026 +0100 Require transformers<5 with PairRMJudge (#4926) commit 4348375ab2c6bad36ef90e1061b804b0449148f1 Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com> Date: Thu Jan 29 14:30:13 2026 +0100 Set model dtype to float32 in tests of trainers (#4924) commit a6cbf279d7d3bc4024e6e6273d967509e7221e83 Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com> Date: Thu Jan 29 07:06:06 2026 -0600 Support tool call data in `is_conversational` (#4923) commit ad91c6ffa91073684c4cf6dc2008e2994dd940e7 Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com> Date: Wed Jan 28 11:47:15 2026 -0600 Add validation for `sync_ref_model` in `GRPOTrainer` and `RLOOTrainer` when using PEFT models (#4912) Co-authored-by: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com> commit 04717ffca8fd91a0fa5ee610fbdc75ef8f3c5a22 Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com> Date: Wed Jan 28 11:16:12 2026 -0600 Update learning rate comments and add assertions for reference model parameters in GRPO and RLOO tests (#4914) Co-authored-by: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com> commit b322d9ba8092399b956882f61978ab3e90868c77 Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com> Date: Wed Jan 28 10:54:04 2026 -0600 Remove chat template setup in dpo_vlm.py (#4906) commit a70b4e014756dc8595ac226d833deaba9784f756 Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com> Date: Wed Jan 28 09:55:19 2026 -0600 Fix extra EOS appended in DPO preprocessing for conversational data (#4908) Co-authored-by: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com> commit 8464b0e4b22c571bbf565a03ee154a5692c8d056 Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com> Date: Wed Jan 28 16:03:53 2026 +0100 Fix CI ValueError for 0 temperature (#4916) commit 5461a74bc622660039e2038b6b0e5a43bdc712ae Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com> Date: Wed Jan 28 15:58:12 2026 +0100 Fix CI AssertionError: assert not True (#4921) commit d54381a4a90cb18152842158c62aad9895022448 Author: Boyi Zhang <68804418+billycrapediem@users.noreply.github.com> Date: Wed Jan 28 09:54:29 2026 -0500 docs: add DoRA (2402.09353) to Paper Index (#4892) Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com> commit f2f6b32bdc3688b124d72caa412d60a8f12d80c0 Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com> Date: Wed Jan 28 08:52:58 2026 -0600 Remove gradient checkpointing option from various training scripts (#4905) commit 6cbc102f5fe94804e5a7579ff1aec270b97e4f5f Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com> Date: Wed Jan 28 08:35:23 2026 -0600 Comment about overriding prediction_step in GRPOTrainer and RLOOTrainer (#4913) commit f40edf9328adbe6c85acfb9dd9745e9c1393197e Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com> Date: Wed Jan 28 08:17:35 2026 -0600 `device_map` init consistency in GRPO/RLOO/KTO (#4909) commit a7070f940e8e0565adfbe9bbedd68b7850334b03 Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com> Date: Wed Jan 28 08:14:30 2026 -0600 Fix help text formatting for `max_length` in `RewardConfig` and `SFTConfig` (#4910) commit 66efc0e52e55d77c2edf3e67c6c1f08e274ac9f8 Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com> Date: Wed Jan 28 08:08:11 2026 -0600 Rearrange variable assignments in `DataCollatorForVisionLanguageModeling` (#4911) commit e9a2f16004a00a50e69e5779f58bf0bc24937de7 Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com> Date: Wed Jan 28 11:43:26 2026 +0100 Fix CI TypeError in llm-blender tests (#4919) commit 4f8232098c10c98ad7febe971da4eb362d13433c Author: adityachallapally <avasanthc@gmail.com> Date: Wed Jan 28 01:10:29 2026 -0800 Created new PTT integration docs as requested (#4907) Co-authored-by: Aditya Challapally <adchalla@microsoft.com> Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com> commit 0eb66d8f2fc63b3d00d8dbc18f99c3f48750bd16 Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com> Date: Tue Jan 27 16:53:29 2026 +0100 Refactor vLLM generation [1/N]: Extract vLLM generation (#4700) commit 226ef57192b49801c3be8c55c798c6d5b134b080 Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com> Date: Tue Jan 27 14:34:11 2026 +0100 Fix CI AssertionError: Parameter has not changed (#4904) commit 956986ebd53ff0d8dfa688e9d1033488dcad55d6 Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com> Date: Tue Jan 27 14:33:56 2026 +0100 Fix CI NotImplementedError for bfloat16 (#4902) commit 4322778d7f696a4fc1fc33612b02eeb5ec700109 Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com> Date: Mon Jan 26 12:57:43 2026 -0600 Transformers v5 release: extend xfail condition for `TestGRPOTrainer.test_training_vlm_and_liger` and update version checks (#4898) commit e106972dd6d839f4a3d3fcaffc1f386b4fbe66bf Author: Cola Chan (SII) <57797863+141forever@users.noreply.github.com> Date: Mon Jan 26 17:56:38 2026 +0800 GOLD training speed up (#4888) Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com> commit c477e88e05023dbcd45211c1a802788650598909 Author: Yi-Chen Li <ychenli.X@gmail.com> Date: Fri Jan 23 21:25:36 2026 +0800 Fix RewardTrainer's results not reproducible (#4887) Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com> commit ba053232324b207554116f806edbb2ec8b6ab9f5 Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com> Date: Thu Jan 22 08:15:31 2026 -0600 Fix import path for `get_open_port` based on vLLM version (#4883) commit e66a138438a3beba08756543fa41b7a90054ee8c Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com> Date: Thu Jan 22 08:14:48 2026 -0600 Mark ZeRO 2 as xfail in distributed tests due to current failure (#4885) commit a60d75aa1efa6ac5330649aafd425859da685a63 Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com> Date: Wed Jan 21 15:44:01 2026 -0600 Test distributed training for `RewardTrainer`, `RLOOTrainer` and `GRPOTrainer` (#4823) Co-authored-by: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com> commit 60e46742576876209658446b50f144541873301b Author: Wing Lian <wing@axolotl.ai> Date: Wed Jan 21 16:15:03 2026 -0500 Enable vLLM sleep mode for generation in Online DPO (#4882) Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com> Co-authored-by: lewtun <lewis.c.tunstall@gmail.com> Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com> commit 0a881bcee992a25e2fc0e980cc43a7428ce17373 Author: Kirill Dubovikov <dubovikov.kirill@gmail.com> Date: Thu Jan 22 01:09:34 2026 +0400 Bugfix: Logprob drift in vLLM serving mode (compared to colocate mode) (#4873) Co-authored-by: Kirill Dubovikov <kirill.dubivokov@mbzuai.ac.ae> Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com> commit 16b090302b8fd408870baa7452b5c3a29e03c346 Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com> Date: Wed Jan 21 03:00:04 2026 -0600 Fix SFT training for prompt-completion type and transformers v5 (#4880) commit b080a4c27a60988be213354f551e26d3a4b2eef9 Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com> Date: Wed Jan 21 08:07:42 2026 +0100 Remove label_pad_token_id from experimental trainers (#4878)

qgallouedec added 2 commits February 10, 2026 21:10

Add validation for conversational prompts in multimodal training

ca31894

style

a00fcad

qgallouedec mentioned this pull request Feb 10, 2026

Fix GRPO VLM prompt handling for string prompts #5064

Open

5 tasks

qgallouedec requested review from albertvillanova, edbeeching, kashif, lewtun and sergiopaniego February 10, 2026 21:25

akshan-main mentioned this pull request Feb 11, 2026

Cast multimodal forward_kwargs to compute dtype for bf16/fp16 training #5073

Open

5 tasks

qgallouedec added 4 commits February 12, 2026 09:32

Merge branch 'main' into actionable-error-grpo-vlm

f6df004

Merge branch 'main' into actionable-error-grpo-vlm

9dacd16

Merge branch 'main' into actionable-error-grpo-vlm

f97d7c3

Merge branch 'main' into actionable-error-grpo-vlm

ada9914

albertvillanova approved these changes Feb 17, 2026

View reviewed changes

qgallouedec merged commit 269ed99 into main Feb 17, 2026
13 of 14 checks passed

qgallouedec deleted the actionable-error-grpo-vlm branch February 17, 2026 12:49

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add validation for conversational prompts in multimodal training#5067

Add validation for conversational prompts in multimodal training#5067
qgallouedec merged 6 commits intomainfrom
actionable-error-grpo-vlm

qgallouedec commented Feb 10, 2026 •

edited

Loading

Uh oh!

HuggingFaceDocBuilderDev commented Feb 10, 2026

Uh oh!

akshan-main commented Feb 10, 2026

Uh oh!

qgallouedec commented Feb 10, 2026

Uh oh!

akshan-main commented Feb 11, 2026

Uh oh!

qgallouedec commented Feb 11, 2026

Uh oh!

qgallouedec commented Feb 11, 2026

Uh oh!

albertvillanova left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Conversation

qgallouedec commented Feb 10, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

HuggingFaceDocBuilderDev commented Feb 10, 2026

Uh oh!

akshan-main commented Feb 10, 2026

Uh oh!

qgallouedec commented Feb 10, 2026

Uh oh!

akshan-main commented Feb 11, 2026

Uh oh!

qgallouedec commented Feb 11, 2026

Uh oh!

qgallouedec commented Feb 11, 2026

Uh oh!

albertvillanova left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

qgallouedec commented Feb 10, 2026 •

edited

Loading