Skip to content

Add validation for conversational prompts in multimodal training#5067

Merged
qgallouedec merged 6 commits intomainfrom
actionable-error-grpo-vlm
Feb 17, 2026
Merged

Add validation for conversational prompts in multimodal training#5067
qgallouedec merged 6 commits intomainfrom
actionable-error-grpo-vlm

Conversation

@qgallouedec
Copy link
Member

@qgallouedec qgallouedec commented Feb 10, 2026

There is confusion around whether data should be pre-processed before being passed to GRPOTrainer. This adds a clear, actionable error message instead of the cryptic

TypeError: string indices must be integers, not 'str'.

See #5064

Closes #4870
Closes #4746
Closes #4451
Closes #5041

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@akshan-main
Copy link

This addresses the error message for string prompts, but the dtype mismatch (expected scalar type BFloat16 but found Float in layer_norm) reported in #4451 comments is a separate crash that still happens even with conversational prompts. I have a fix for that want me to open a separate PR for it, or should I add it here?

@qgallouedec
Copy link
Member Author

Yes please open a separate PR 🙏

@akshan-main
Copy link

on it

@qgallouedec
Copy link
Member Author

@codex review

1 similar comment
@qgallouedec
Copy link
Member Author

@codex review

Copy link
Member

@albertvillanova albertvillanova left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks.

@qgallouedec qgallouedec merged commit 269ed99 into main Feb 17, 2026
13 of 14 checks passed
@qgallouedec qgallouedec deleted the actionable-error-grpo-vlm branch February 17, 2026 12:49
qgallouedec added a commit to kansalaman/trl that referenced this pull request Mar 3, 2026
commit 489331e703e1e8d39534957f465fadce7f00ff99
Author: Quentin Gallouédec <gallouedec.quentin@gmail.com>
Date:   Tue Mar 3 14:50:42 2026 +0000

    Replace deprecated asyncio.iscoroutinefunction with inspect.iscoroutinefunction in RLOO/GRPO trainers

commit 484c1c1acf0b437c20e230d5e135613daf1a59fa
Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Date:   Tue Mar 3 08:42:04 2026 -0600

    CI: Add Qwen 3.5 tiny model to tests (#5204)

commit 7eebb294a9175ea2f0ffbf20cf759f772491d815
Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Date:   Tue Mar 3 07:35:22 2026 +0100

    Decouple rollout dispatch from vLLM backend in GRPO _generate_single_turn (#5122)

    Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com>
    Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
    Co-authored-by: swappy <59965507+rycerzes@users.noreply.github.com>

commit 0bf875c0cbb879c4b264f66a6e556769d42e2f52
Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Date:   Mon Mar 2 15:54:35 2026 +0100

    Mark CI test_training_vlm_and_liger as xfail (#5202)

commit 7544c3a784147dbfc53bb1314558137320ecc3ed
Author: Michael Royzen <45830328+michaelroyzen@users.noreply.github.com>
Date:   Fri Feb 27 14:42:57 2026 -0500

    Support sequence sampling in Liger Kernel and pass importance_samplin… (#5190)

    Co-authored-by: Michael Royzen <michaelroyzen@mac.mynetworksettings.com>
    Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>

commit 5cffd59a8a814b9132c6d08e5aa88347a41c66e3
Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Date:   Fri Feb 27 15:43:33 2026 +0100

    Set CI PYTORCH_ALLOC_CONF env variable to avoid OOM (#5197)

commit eb8b8a510b3ee0e7e83e33f8cfbb6eada8eb7f34
Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Date:   Fri Feb 27 08:11:51 2026 -0600

    Re-add liger-kernel to dev deps (#5164)

    Co-authored-by: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>

commit 68f807b5e2ba4994898a7ef21ba631b64fb7c4b5
Author: Zhenkun Cai <zekucai@gmail.com>
Date:   Fri Feb 27 05:11:51 2026 -0800

    Add `pad_to_multiple_of` to GRPOTrainer and RLOOTrainer (#5180)

commit e53c98feb463c0897451b307432360c1616a8905
Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Date:   Fri Feb 27 13:34:17 2026 +0100

    Fix CI tests patching BaseTrainer (#5192)

commit bd2d21e02cc722221c0c7f91f4ddc7cbd9d271fa
Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Date:   Fri Feb 27 08:27:22 2026 +0100

    Refactor CLI [6/N]: Refactor env/vllm-serve commands with delayed imports (#5187)

commit e63cd79c68fc62edf63f01904ee02b0e63ab4336
Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Date:   Fri Feb 27 08:25:30 2026 +0100

    Refactor CLI [5/N]: Refactor TrainingCommand with delayed imports (#5186)

commit e941ff58121d382b470f8c8011dd76088192c46b
Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Date:   Fri Feb 27 08:23:38 2026 +0100

    Fix deprecation warning of fork in multi-threaded process (#5185)

commit b9263efa25e05ebf1c8c1525a9d5a6a7e94efbb2
Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Date:   Fri Feb 27 08:22:58 2026 +0100

    Fix deprecation warning of create_reference_model (#5184)

commit 410c00bfaead36b0048921a123739bd0cb4c3e7c
Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Date:   Thu Feb 26 10:38:56 2026 -0600

    Align documentation with the intended public API (#5162)

commit 519225384f9aaa7acf3959fbf6a218c2490d4a0e
Author: Sergio Paniego Blanco <sergiopaniegoblanco@gmail.com>
Date:   Thu Feb 26 15:44:50 2026 +0100

    Add minimal CARLA example script (#5161)

commit 64b47513982e2845c8cb6f4d5d611037f605d9bf
Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Date:   Thu Feb 26 11:11:52 2026 +0100

    Refactor CLI [4/N]: Replace top-level TrlParser with ArgumentParser (#5170)

commit f00379fa221689d67a3736c44eaf07137c11d5f9
Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Date:   Thu Feb 26 10:02:45 2026 +0100

    Make _BaseConfig and _BaseTrainer explicitly private (#5169)

commit eb973af2d1109c84600c7fdddf259e06a547f583
Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Date:   Thu Feb 26 09:11:09 2026 +0100

    Document parameters with differing default values in core configs (#5168)

commit b2b3045dfe3a3b6a0c52785b055b60e9a1a0e73b
Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Date:   Thu Feb 26 07:56:32 2026 +0100

    Handle mm_token_type_ids in SFT/GRPO/RLOO to fix IndexError (#5178)

commit 27e3e2ff68929b25045caf8af32799b2e1dc3965
Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Date:   Wed Feb 25 16:03:45 2026 -0600

    ⬆️ Bump dev version (#5182)

commit d24e19424da2837d435a7884c0b307b605413829
Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Date:   Wed Feb 25 15:56:26 2026 -0600

    Release: v0.29 (#5181)

commit 70cf097fb8a39b8ad86aa6e27d49f081e96da4a5
Author: LeonEricsson <70749762+LeonEricsson@users.noreply.github.com>
Date:   Wed Feb 25 22:01:16 2026 +0100

    feature: Configurable num logprobs in vLLM generation (#5107)

    Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
    Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com>

commit 57d749336487d7ece06e58b941e4180f13649d8f
Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Date:   Wed Feb 25 11:17:13 2026 -0600

    Rename input keys in `RewardTrainer` collator from `chosen/rejected_input_ids` to `chosen/rejected_ids` (#5179)

commit a0d7d8e1257dea15fae6df434285958d22ce9c4e
Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Date:   Wed Feb 25 17:31:50 2026 +0100

    Update upstream tracking info about CI PyTorch JIT deprecation warnings (#5166)

commit 51fdc53e08b0ee39b65ba699fb49281d183701ce
Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Date:   Wed Feb 25 17:16:55 2026 +0100

    Document parameters with differing default values in experimental configs (#5172)

commit dd15cbb04a47c8efb4c8ed13e315dc4f2e1f853e
Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Date:   Wed Feb 25 17:07:45 2026 +0100

    Fix default learning_rate in BCO according to paper (#5173)

commit 0b2cd5c04e26e13358413c00e98a56e2c2914eb9
Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Date:   Wed Feb 25 16:42:43 2026 +0100

    Accept mm_token_type_ids in GRPO/RLOO _get_per_token_logps_and_entropies (#5176)

commit 95cedba36e5e015f9402bb997529337d6c90b0bb
Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Date:   Wed Feb 25 16:39:36 2026 +0100

    Fix default learning_rate in PPO according to paper (#5174)

commit 6d78858d176b9fb385b6d0f332d369e1ee2e27fb
Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Date:   Wed Feb 25 16:26:16 2026 +0100

    Fix experimental TestUpdateWithReplayBuffer: ValueError: `train_dataset` is required (#5171)

commit 0efaec33fbd3445eb1142c306e797940fad4de28
Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Date:   Wed Feb 25 09:25:39 2026 -0600

    Revert changes in vLLM client/server (#5165)

commit e540d687f8df6f3596fa6eb3cc50116b41d58f42
Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Date:   Wed Feb 25 10:48:43 2026 +0100

    Refactor CLI [3/N]: Self-contain VllmServeCommand argument parsing (#5160)

commit 9cc95a97927e59c3532ce2be3babcfd8a35adcd9
Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Date:   Wed Feb 25 09:57:53 2026 +0100

    Refactor CLI [2/N]: Move accelerate concerns into TrainingCommand (#5159)

commit 827457ce5845c5a5b02dab164e12f55cd1c4c532
Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Date:   Wed Feb 25 09:34:28 2026 +0100

    Raise ValueError for None train_dataset in core trainers (#5157)

commit 8b3934ce1681c9f959167804692d6d94fbb36eb0
Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Date:   Wed Feb 25 07:57:00 2026 +0100

    Fix Liquid syntax error in DPO trainer docs caused by double braces in LaTeX (#5153)

commit 4cd198e856b98cae6ed6d0632ab86ca22b432e23
Author: Blake Ledden <47259830+bledden@users.noreply.github.com>
Date:   Tue Feb 24 19:41:05 2026 -0800

    fix: wake up vLLM weights before sync to prevent writes to freed memory (#5147)

    Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
    Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com>
    Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>

commit 1149b746db9f39dc28859e426b16f0e6557db240
Author: ehofm <ella@rilix.ai>
Date:   Tue Feb 24 20:36:07 2026 -0500

    Fix structured_outputs handling and tool normalization in vLLM backend (#5155)

    Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com>
    Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>

commit ea2b4958d0165e01a11b6f07ec024ee8c1d1835d
Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Date:   Tue Feb 24 18:10:55 2026 -0600

    Fix CI by removing liger-kernel from dev deps (#5163)

commit cfbdd3bea4448cde878c0da0de49551f553c61fe
Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Date:   Mon Feb 23 22:27:02 2026 -0600

    Fix `SFTTrainer` support for single-image data (#5132)

commit fa313fd57244008953753047795c954a782f9cfc
Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Date:   Mon Feb 23 16:06:04 2026 +0100

    Add support for Python 3.14 (#4225)

    Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>

commit bc4edf6f02e6f07549d43b3543cb54d597cd3d91
Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Date:   Mon Feb 23 15:06:58 2026 +0100

    Fix type of TrainingArguments.logging_steps in docs (#5149)

commit 5269393f4269462ce5d4a9227a97af6911da7939
Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Date:   Mon Feb 23 15:06:14 2026 +0100

    Use BaseConfig in all experimental configs (#5148)

commit ef08730432721d67139e273775acf14846fa95d9
Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Date:   Mon Feb 23 15:04:11 2026 +0100

    Fix PPOTrainer.save_model (#5151)

commit ae97f06954b274f582f82ac60e444897f73f14c3
Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Date:   Mon Feb 23 08:00:44 2026 -0600

    Fix wording in DPO and SFT trainer documentation for clarity (#5140)

commit f150780cda7b0a82a4840c44b7026732ee17c4bb
Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Date:   Mon Feb 23 09:06:38 2026 +0100

    Move common fields from stable trainer configs to BaseConfig (#5136)

commit 93f2e480daa0ee9962a8a5deff6ea3da347fe911
Author: casinca <47400729+casinca@users.noreply.github.com>
Date:   Sun Feb 22 20:09:44 2026 +0100

    refactor(gkd_trainer): small optim (#5143)

commit 8067ea7558ed4477afce710bbf2f8a1a79973ba7
Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Date:   Fri Feb 20 11:16:55 2026 -0600

    Add `environment_factory` to `GRPOTrainer` (#5093)

    Co-authored-by: Sergio Paniego Blanco <sergiopaniegoblanco@gmail.com>

commit 7a4156a1bb3224a3c7f5861d39ef76367273a26b
Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Date:   Fri Feb 20 16:05:55 2026 +0100

    Fix `trl <command> --help` TypeError caused by unescaped `%` in `TrainingArguments` help strings (#5135)

commit c3ead5b556d9ea588b4a95cae1775913118ddbc6
Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Date:   Fri Feb 20 11:09:02 2026 +0100

    Fix NameError: name 'importlib' is not defined (#5134)

commit b7fa6bf17322f03d5ec47d12efc142de9ea5981a
Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Date:   Fri Feb 20 09:46:30 2026 +0100

    Fix import latency [2/N]: Implement native _is_package_available (#5129)

commit bb147645fad777c01ce1ccd2f10350b3cc50fceb
Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Date:   Fri Feb 20 09:45:59 2026 +0100

    Fix import latency [1/N]: Extract _LazyModule to dedicated module (#5128)

commit e3b7897c873f94c26bf1a661df19e428239be114
Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Date:   Thu Feb 19 20:53:15 2026 -0600

    Refactor DPO (#3906)

    Co-authored-by: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
    Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>

commit a68fb896f008f58a3d37abf11ae665357b7c679a
Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Date:   Thu Feb 19 13:15:42 2026 -0600

    Remove revision references in dataset loading for toolcall tests (#5133)

commit 699b8420cd6601474788effdab063d3d5e7bbc3b
Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Date:   Thu Feb 19 17:53:30 2026 +0100

    Refactor TRL CLI into modular command architecture (#5124)

commit b46614e235f126c1c8d0fd9f41f4d217a8299c34
Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Date:   Thu Feb 19 17:08:14 2026 +0100

    Implement Agent Skills [4/N]: Create skills CLI (#5103)

commit f8181886c6a59f5f8c2a2bf31bb4cb7bda225d39
Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Date:   Wed Feb 18 11:59:13 2026 -0600

    Update tool handling to support JSON string schemas in trainers (#5118)

commit 9fc9a7dcebe3938a273e18ca3ed5b2cfdb6c0839
Author: Sergio Paniego Blanco <sergiopaniegoblanco@gmail.com>
Date:   Wed Feb 18 18:11:42 2026 +0100

    Add Tiny Aya tool calling examples (script/notebook) (#5123)

    Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
    Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com>

commit 27431134e3447181821ffaf94c405a44d87d1bc1
Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Date:   Wed Feb 18 10:10:14 2026 -0600

    Add GLM-4.5 model to tests (#5114)

commit 0e531bdd1eb654bed32d12b474ea998285cb1253
Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Date:   Wed Feb 18 09:22:06 2026 -0600

    Add check for `None` in `get_trackio_space_url()` to prevent errors (#5115)

commit 8b082bb2d4d599d66cf36df0b754c2f4de0371de
Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Date:   Wed Feb 18 09:18:02 2026 -0600

    Fix Qwen3 schema (#5111)

commit 269217f92092e4497260e34bd535756cb9e76f64
Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Date:   Wed Feb 18 08:27:27 2026 -0600

    Add test for Cohere2 models (#5116)

commit 57df014377bef538c87c616379d4c11aeaf05b30
Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Date:   Wed Feb 18 06:45:38 2026 -0600

    Add more tests for `get_training_chat_template` (#5108)

commit 70efa963f1c9bb88ec3144b051db6fdf4ffc10a4
Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Date:   Tue Feb 17 10:51:47 2026 -0600

    Update version check for transformers to 5.2.0 in online_dpo_trainer.py (#5110)

commit 269ed992dca0858f290e08b5ebae271a15df8aa6
Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Date:   Tue Feb 17 06:49:34 2026 -0600

    Add validation for conversational prompts in multimodal training (#5067)

commit 997536a2b56d4a0824bb55f9265e6561c3fd1e43
Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Date:   Tue Feb 17 10:30:06 2026 +0100

    Implement Agent Skills [3/N]: Create skills installer (#5100)

commit 8b9b972878243505d26b3dc69945613ff5ddc98b
Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Date:   Mon Feb 16 15:50:59 2026 -0600

    Remove outdated liger-kernel compatibility checks and warnings in tests and SFTTrainer (#5105)

commit 8c232f64b5bb00ef854bff157f6857241d415fe0
Author: Harikrishna KP <harikp2002@gmail.com>
Date:   Tue Feb 17 02:09:35 2026 +0530

    Fix SFT loss type rewards being overwritten in dpo_loss() (#5079)

commit 99b26fb2e6f241195fc9b378ee8d50a6219083b2
Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Date:   Mon Feb 16 14:34:54 2026 -0600

    Add Trackio integration for model card visualization (#5101)

commit c94c032129af436c55764fef66389f30856df3d0
Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Date:   Mon Feb 16 21:13:09 2026 +0100

    Fix style (#5106)

commit 3d1c785762ce87892a7eaf18d1c0fb8771a74bc3
Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Date:   Mon Feb 16 19:17:55 2026 +0100

    Implement Agent Skills [2/N]:  Create skills module (#5097)

commit 1702fc07b2d0c8ba23ad3299879d4edaaccb3b30
Author: LeonEricsson <70749762+LeonEricsson@users.noreply.github.com>
Date:   Mon Feb 16 19:09:59 2026 +0100

    feature: top_k selective_log_softmax (#5104)

    Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>

commit 29ace1ad72152f4d648bf23763a319ce2400b9e6
Author: flutist <30485581+flutist@users.noreply.github.com>
Date:   Tue Feb 17 00:56:52 2026 +0800

    Fix DPO and RLOO incompatibility with FSDP2 (#4838)

    Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com>
    Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>

commit 3d2e898ce4d538b5502f735a9c5116c8b45aaa46
Author: Yuki Uehara <74698040+yukiu00@users.noreply.github.com>
Date:   Tue Feb 17 01:28:43 2026 +0900

    Pass vllm_is_ratio to LigerFusedLinearGRPOLoss in compute_liger_loss (#5031)

    Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>

commit b6957fc4c04a100e6829cb56e28f20668e4ad1ae
Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Date:   Mon Feb 16 17:02:42 2026 +0100

    Implement Agent Skills [1/N]: Create training skill (MVP) (#5096)

    Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>

commit abf6033b99fc75eb9d58458b44b81f7f7faebdc1
Author: Behrooz Azarkhalili <80390531+behroozazarkhalili@users.noreply.github.com>
Date:   Mon Feb 16 04:49:30 2026 -0800

    docs: Unify model examples to use trl-lib namespace (#4431)

    Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
    Co-authored-by: Sergio Paniego Blanco <sergiopaniegoblanco@gmail.com>

commit 7694e0e878361c0de6c6cb566c123f46af13d91d
Author: Nabin Oli <107109731+nabin2004@users.noreply.github.com>
Date:   Sat Feb 14 00:36:27 2026 +0545

    docs: add Multi-Node Training subsection (#4384) (#5091)

    Co-authored-by: Sergio Paniego Blanco <sergiopaniegoblanco@gmail.com>

commit 28fc3f2c336bb7f734aab49c1ad073e152dccf61
Author: Behrooz Azarkhalili <80390531+behroozazarkhalili@users.noreply.github.com>
Date:   Thu Feb 12 10:37:15 2026 -0800

    docs: Add MPO paper (2411.10442) to paper index (#5089)

    Co-authored-by: Sergio Paniego Blanco <sergiopaniegoblanco@gmail.com>

commit 051b52fbf2c68edcae092357d0d4118b35a5f60b
Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Date:   Thu Feb 12 18:53:17 2026 +0100

    Validate reward model has 1 num_labels (#5087)

commit a558fba8a5700933207c3963a1dab8a28291f2f1
Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Date:   Thu Feb 12 18:04:13 2026 +0100

    Fix BFD packing for SFT datasets (#5076)

commit 0073db963788d6cc77d51789f4b3d2c34930cfdc
Author: Behrooz Azarkhalili <80390531+behroozazarkhalili@users.noreply.github.com>
Date:   Thu Feb 12 01:55:06 2026 -0800

    docs: Add PPO paper (1707.06347) to paper index (#5085)

    Co-authored-by: Sergio Paniego Blanco <sergiopaniegoblanco@gmail.com>

commit 29ed9cb4ff346100bee004acd1ce7cc97554f064
Author: Behrooz Azarkhalili <80390531+behroozazarkhalili@users.noreply.github.com>
Date:   Thu Feb 12 01:45:57 2026 -0800

    docs: Add T5 packing paper (1910.10683) to paper index (#5084)

    Co-authored-by: Sergio Paniego Blanco <sergiopaniegoblanco@gmail.com>

commit d0e06fcda40607ad7bd1a3b639a3018c3bb4bfca
Author: Behrooz Azarkhalili <80390531+behroozazarkhalili@users.noreply.github.com>
Date:   Thu Feb 12 01:38:35 2026 -0800

    docs: Add PRM paper (2211.14275) to paper index (#5083)

    Co-authored-by: Sergio Paniego Blanco <sergiopaniegoblanco@gmail.com>

commit ee979a9d2f23c100cb9d4010b4ad99275a6c726c
Author: Behrooz Azarkhalili <80390531+behroozazarkhalili@users.noreply.github.com>
Date:   Thu Feb 12 01:29:29 2026 -0800

    docs: Add GKD paper (2306.13649) to paper index (#5082)

    Co-authored-by: Sergio Paniego Blanco <sergiopaniegoblanco@gmail.com>

commit ff84817d27241643abfe3f7691448e509e093320
Author: Behrooz Azarkhalili <80390531+behroozazarkhalili@users.noreply.github.com>
Date:   Thu Feb 12 01:23:04 2026 -0800

    docs: Add CPO paper (2401.08417) to paper index (#5081)

    Co-authored-by: sergiopaniego <sergiopaniegoblanco@gmail.com>

commit fe890df6e2a84345a29be849bb5f27ca72052034
Author: Behrooz Azarkhalili <80390531+behroozazarkhalili@users.noreply.github.com>
Date:   Thu Feb 12 01:05:10 2026 -0800

    docs: Add ORPO paper (2403.07691) to paper index (#5080)

    Co-authored-by: Sergio Paniego Blanco <sergiopaniegoblanco@gmail.com>

commit c88f8c550137ddd1ddde56baebae2d9b97b9d54d
Author: Behrooz Azarkhalili <80390531+behroozazarkhalili@users.noreply.github.com>
Date:   Thu Feb 12 01:01:37 2026 -0800

    docs: Add TR-DPO paper (2404.09656) to paper index (#5078)

commit 0562c3fa26c1bc827aff83800b046f9a2af925a6
Author: Logan Vegna <logan.vegna@shopify.com>
Date:   Wed Feb 11 16:17:04 2026 -0500

    [SFT] Fix high vRAM consumption during eval with liger kernel (#5069)

    Co-authored-by: Cursor <cursoragent@cursor.com>
    Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>

commit 6b38db6ad85cf67ce1b7d4f037e5e5840d474587
Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Date:   Wed Feb 11 19:22:00 2026 +0100

    Fix CI ValueError: Pointer argument (at 0) cannot be accessed from Triton (cpu tensor?) (#5074)

    Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>

commit 060fbfebbddf1e539e5dcee456bef643c29036d3
Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Date:   Wed Feb 11 10:44:54 2026 -0600

    Update model from SequenceClassification to CausalLM in `RewardTrainer` tests (#5060)

commit 0933b7fc5ddb933c632708bba5936b99238168d8
Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Date:   Wed Feb 11 17:18:12 2026 +0100

    Fix logging warning suppression for transformers 4.56.2 (#5077)

commit a07fb82b9a4333ea91cfe289a697b9b178d99021
Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Date:   Wed Feb 11 09:21:48 2026 -0600

    fix: Set `num_labels` to 1 in causal model initialization for RewardTrainer (#5066)

commit 29fe68205caf4acbf888307487e8423d692ee496
Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Date:   Wed Feb 11 15:08:11 2026 +0100

    Fix GRPO multi-turn training with liger kernels (#4975)

commit 68399dfa6a03e4dea6ea5087c4d181b3b400cab5
Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Date:   Wed Feb 11 07:39:27 2026 -0600

    fix: Use `launch_args` for all trainers (#5059)

commit d1b066fdc4a8a7d0bde59e3bf1aeaac8803746d1
Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Date:   Wed Feb 11 07:37:54 2026 -0600

    Fix logging warning suppression with scoped override for seq-clf head key (#5058)

commit 0c3d33b955730308bab3d28ba2ef6eebe704c7f8
Author: Behrooz Azarkhalili <80390531+behroozazarkhalili@users.noreply.github.com>
Date:   Wed Feb 11 02:41:07 2026 -0800

    docs: Add SimPO paper (2405.14734) to paper index (#5071)

    Co-authored-by: Sergio Paniego Blanco <sergiopaniegoblanco@gmail.com>

commit f23e3a775155f458108cb01ebcfde085ecca4733
Author: Behrooz Azarkhalili <80390531+behroozazarkhalili@users.noreply.github.com>
Date:   Wed Feb 11 02:33:11 2026 -0800

    docs: Add RPO paper (2405.16436) to paper index (#5070)

    Co-authored-by: Sergio Paniego Blanco <sergiopaniegoblanco@gmail.com>

commit e46005c07c1a8194f4ff71b749dceb44f17d7eb7
Author: Behrooz Azarkhalili <80390531+behroozazarkhalili@users.noreply.github.com>
Date:   Wed Feb 11 02:22:52 2026 -0800

    docs: Add XPO (2405.21046) to Paper Index (#5068)

    Co-authored-by: Sergio Paniego Blanco <sergiopaniegoblanco@gmail.com>

commit 6d9bba1b3b9f181d9cf53eb9092a3d52b66de93b
Author: Behrooz Azarkhalili <80390531+behroozazarkhalili@users.noreply.github.com>
Date:   Wed Feb 11 02:15:59 2026 -0800

    docs: Add REINFORCE++ (2501.03262) to Paper Index (#5062)

    Co-authored-by: Sergio Paniego Blanco <sergiopaniegoblanco@gmail.com>

commit b992e9284aac5979ab5716d14587067857663398
Author: Behrooz Azarkhalili <80390531+behroozazarkhalili@users.noreply.github.com>
Date:   Wed Feb 11 01:37:06 2026 -0800

    docs: Add INTELLECT-2 (2505.07291) to Paper Index (#5061)

    Co-authored-by: Sergio Paniego Blanco <sergiopaniegoblanco@gmail.com>

commit c985dbadd3a499dc049f8c39c61034525d381006
Author: Jen Wei <45276133+JenWei0312@users.noreply.github.com>
Date:   Wed Feb 11 02:25:55 2026 -0700

    docs: add DeepSeek-R1 training dynamics and GRPO example (#5053)

    Co-authored-by: Sergio Paniego Blanco <sergiopaniegoblanco@gmail.com>

commit d934eb757806501a5106b6e4374d920961dc4e9f
Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Date:   Tue Feb 10 19:14:40 2026 +0100

    Remove deprecated mergekit_utils moved to experimental (#5057)

commit 991fd0755aa1cce7a800d271dae4b525b6357bfd
Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Date:   Tue Feb 10 11:57:11 2026 -0600

    Remove duplicated tests for SFT and add gradient checkpointing tests (#5054)

commit d42b23f63f164af241c34c79e6a855d1eb896d4d
Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Date:   Tue Feb 10 18:17:10 2026 +0100

    Remove deprecated classes moved to experimental (#5044)

commit e1a84cf626d249e9b55447d63eece88cbf92d100
Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Date:   Tue Feb 10 18:06:05 2026 +0100

    Remove deprecated RLOOConfig.max_prompt_length (#5056)

commit fc560370d97042cb7f90a9bf0e2e30d29a304240
Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Date:   Tue Feb 10 17:47:21 2026 +0100

    Remove deprecated XPO after moved to experimental (#5055)

commit 13bd37e1426eb81aca81cd68eb8d0efcbc6351b9
Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Date:   Tue Feb 10 17:36:36 2026 +0100

    Remove deprecated PRM after moved to experimental (#5052)

commit 0aea3144031abacb0efadd8aab5a3ca9fe6380e8
Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Date:   Tue Feb 10 17:33:24 2026 +0100

    Remove deprecated PPO after moved to experimental (#5051)

commit d705ac4d0f13168724f548c9ecbb1a586c187e16
Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Date:   Tue Feb 10 17:06:11 2026 +0100

    Remove deprecated ORPO after moved to experimental (#5050)

commit b393c6bf6605d04f00a110085bf59feef59ffa6a
Author: Salman Chishti <13schishti@gmail.com>
Date:   Tue Feb 10 15:29:23 2026 +0000

    Upgrade GitHub Actions to latest versions (#4893)

    Signed-off-by: Salman Muin Kayser Chishti <13schishti@gmail.com>
    Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>

commit ce2ea744c5f504a026aa9ea41815bfa382417a9d
Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Date:   Tue Feb 10 15:47:05 2026 +0100

    Remove deprecated Judges after moved to experimental (#5048)

commit 4620e91d21ad0dd3d885abe29446d7a52a9d368e
Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Date:   Tue Feb 10 15:18:49 2026 +0100

    Remove deprecated CPO after moved to experimental (#5046)

commit 6e47225d012aba64efb16789356bee9d037ec171
Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Date:   Tue Feb 10 15:18:33 2026 +0100

    Remove deprecated BCO after moved to experimental (#5045)

commit 17277e2d963611603eb2655af429975feade3b5c
Author: LeonEricsson <70749762+LeonEricsson@users.noreply.github.com>
Date:   Tue Feb 10 15:13:24 2026 +0100

    [GRPO] fix: remove SAPO temperature check (#5042)

commit 7267b2d3589bcdccb46e8cfd51d63416d4378c76
Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Date:   Tue Feb 10 14:42:18 2026 +0100

    ⬆️  Bump dev version (#5049)

commit 4aaaf064c15ad80bea91895c8f202d44ad17cdb4
Author: casinca <47400729+casinca@users.noreply.github.com>
Date:   Tue Feb 10 14:27:45 2026 +0100

    [minor] docs: typo in `grpo_trainer.md` (#5047)

commit 49ef33428c47235991acb4e185ea599b70c6dab4
Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Date:   Tue Feb 10 14:20:48 2026 +0100

    Release: 0.28 (#5043)

commit a958acc1e92d9ccf8404d800bf073c4cc7e5dd85
Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Date:   Tue Feb 10 03:43:23 2026 -0600

    Add Online Direct Preference Optimization section to paper index (#5037)

commit 8b935c6378b78adcb4fda9e66a944e05cc99b681
Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Date:   Tue Feb 10 03:41:41 2026 -0600

    Fix multiprocessing start method to 'spawn' for test compatibility with Python 3.12+ (#5036)

commit 40fff2e3bab905c2e7096360f4a8f014aab4cd14
Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Date:   Tue Feb 10 03:32:24 2026 -0600

    Deprecate FDivergenceType in DPOConfig; update f_divergence_type to use string values (#5039)

    Co-authored-by: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>

commit fe1949b4da60dc7adbcf3a0bb17a4f42280c5c28
Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Date:   Tue Feb 10 03:22:43 2026 -0600

    Deprecate string usage for `ref_model` in DPOTrainer (#5040)

    Co-authored-by: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>

commit 9f7c33600b7555a72234926a846ee65ff2508624
Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Date:   Tue Feb 10 02:51:40 2026 -0600

    Rename AOT loss type 'aot_pair' to 'aot_unpaired' in DPO (#5038)

    Co-authored-by: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>

commit 19c5f4460cd9b405a95fabb05322ceb98864e915
Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Date:   Tue Feb 10 08:04:20 2026 +0100

    Allow testing with transformers 5.1.0 via xfail marks (#5034)

commit 442509524b4e7c8ee4d9f1d6f1f1087b5dcd1a0f
Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Date:   Mon Feb 9 23:45:11 2026 +0100

    Fix CI FutureWarning: max_prompt_length is deprecated (#5019)

    Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>

commit 0ef315a0f992a30f82792e87b5e3f0fab58a6107
Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Date:   Mon Feb 9 22:45:23 2026 +0100

    Filter max_prompt_length UserWarning in all test cases (#5035)

    Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>

commit 8a27a17e583756b75ce71eb40a9ed295a610ade1
Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Date:   Mon Feb 9 22:43:34 2026 +0100

    Fix CI FutureWarning: tools is deprecated (#5015)

commit ff55949cccaaa23a752d951d184c04ff579aad89
Author: Haseeb Asif <149416177+Haseebasif7@users.noreply.github.com>
Date:   Tue Feb 10 02:42:55 2026 +0500

    Add length-unbiased GRPO loss (LUSPO) (#4988)

    Co-authored-by: Haseeb Asif <haseeb@Haseebs-MacBook-Air.local>
    Co-authored-by: Leon Ericsson <leon.ericsson@icloud.com>
    Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com>

commit 765e397ed83f344a8ba7082673d4c4616beeb3a2
Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Date:   Mon Feb 9 15:00:49 2026 -0600

    [CI] Silence PyTorch JIT and DataLoader deprecation warnings (#4999)

    Co-authored-by: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>

commit b5bd2b98ed615676a7bee40c8ae17de62421bb4a
Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Date:   Mon Feb 9 21:12:57 2026 +0100

    Mark Qwen3VL tests as xfail for transformers 5.0.x (#5029)

commit 7189bc68d8ab0d19b67c0fc0b849c83679389b46
Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Date:   Mon Feb 9 21:04:35 2026 +0100

    Fix CI FutureWarning: use_logits_to_keep is deprecated (#5013)

commit db0d95523e5b8039c94c50df1b6286ff8b7e29ce
Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Date:   Mon Feb 9 15:53:56 2026 +0100

    Fix CI FutureWarning: rpo_alpha is deprecated (#5011)

commit fa06506f9d1c9546f63ae513cf7a5ba1be3247ae
Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Date:   Mon Feb 9 15:52:31 2026 +0100

    Fix typo in xfail test reason (#5028)

commit 4abd67951f996b511f4b913ead2386fbe357061b
Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Date:   Mon Feb 9 15:51:21 2026 +0100

    Fix CI FutureWarning: generate_during_eval is deprecated (#5017)

commit 9f1e7dd7fd58be3327748234fb952599c4bd4f09
Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Date:   Mon Feb 9 15:48:59 2026 +0100

    Pin transformers < 5 in judges extra due to incompatibility (#5024)

commit 7c4e7f86047b82ad0e5ff7c8e3bb280b73024f31
Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Date:   Mon Feb 9 15:22:02 2026 +0100

    Fix vision model prompt truncation bug in DPOTrainer (#5023)

commit a68c82a617be59086b83f5ce941175270926de3f
Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Date:   Mon Feb 9 15:00:22 2026 +0100

    Fix typo in DPO max_prompt_length deprecation warning message (#5020)

commit 5eb25938d44b781687061a69586a06501b11e915
Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Date:   Mon Feb 9 14:35:44 2026 +0100

    Fix CI FutureWarning: ref_model_init_kwargs is deprecated (#5009)

commit 58f467babd998fe5fe41598b535ceacda690cef0
Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Date:   Mon Feb 9 07:34:48 2026 -0600

    Add support for `nested_gather` in OnlineDPOTrainer for transformers v5.2.0 and above (#4981)

commit 71a349335ce554180b2b4947d33594090f74d5cf
Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Date:   Mon Feb 9 14:34:44 2026 +0100

    Fix CI TRLExperimentalWarning in regular tests (#5007)

commit a7333c8c68f564005ab74fd999ad246de405122f
Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Date:   Mon Feb 9 14:34:15 2026 +0100

    Filter CI SWIG deprecation warnings (#5004)

commit 98b00171b81411cd8bf7d6a9135af70c9879aaee
Author: Nabin Oli <107109731+nabin2004@users.noreply.github.com>
Date:   Mon Feb 9 18:59:18 2026 +0545

    docs: add CGPO/Mixture of Judges (2409.20370) to Paper Index + link ref to AllTrueJudge (#5002)

    Co-authored-by: Sergio Paniego Blanco <sergiopaniegoblanco@gmail.com>

commit 728b0e372fb7de141093aff5513697a4fa743137
Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Date:   Mon Feb 9 07:13:47 2026 -0600

    [tests] Remove xfail for transformers version >= 5.0.0 due to upstream bug resolution (#5000)

commit 637de450e748d1f612c0f6fed6be4df9cbbf1c39
Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Date:   Sun Feb 8 14:36:10 2026 -0600

    Add `sanitize_logprob` function for NaN handling in vLLM log probabilities (#5001)

commit bfb94262b81fd28017a11d7b9ddc61e3095cc2b6
Author: Akshay Ballal <61191840+akshayballal95@users.noreply.github.com>
Date:   Sat Feb 7 14:48:58 2026 +0100

    Fix GRPO tool calling for corrupted tool calls (#4890)

commit 7a39ff3995f2f8b7cb4f8ca29a09390ac587a43d
Author: casinca <47400729+casinca@users.noreply.github.com>
Date:   Fri Feb 6 23:05:14 2026 +0100

    perf: Qwen SAPO loss optimization (#4956)

    Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
    Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com>

commit bd206704f9fc2c08039c522b40b0f68654bb006f
Author: Sergio Paniego Blanco <sergiopaniegoblanco@gmail.com>
Date:   Fri Feb 6 23:00:24 2026 +0100

    Update sampling mode to token level for safety (#4989)

    Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com>

commit aa7d457f9bec736c75439f78d081a6dc012ce353
Author: cmunley1 <cmunley@nvidia.com>
Date:   Fri Feb 6 13:37:53 2026 -0800

    Update NeMo-Gym to use `env_mask` (#4986)

    Signed-off-by: Christian Munley <cmunley@nvidia.com>
    Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>

commit 5db1c11c52bc95255ea73e7eae3840fbeeb293a2
Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Date:   Fri Feb 6 11:27:47 2026 -0600

    Add distributed smoke tests workflow for Transformers branch (#4996)

commit 90a35d12c9c64129eb023b499a812c5e638db846
Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Date:   Fri Feb 6 10:47:49 2026 -0600

    Add GitHub Actions workflow for testing against Transformers branch (#4995)

commit f11b4c3fdd511d9adfda74ceec02042cee65a0f3
Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Date:   Fri Feb 6 09:46:14 2026 -0600

    Fix ZeRO-3 + PEFT + gradient checkpointing (#4951)

commit 27cbe98ac7487f326be51e180a1ee078c23b3836
Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Date:   Fri Feb 6 16:14:18 2026 +0100

    Fix post_init warning stacklevel to 3 (#4993)

commit 57cac251bdde714f97458b39b24702cf624dec66
Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Date:   Fri Feb 6 16:13:47 2026 +0100

    Fix deprecation of DPOConfig.max_completion_length (#4992)

commit ce72c067f6b55d4352c71939d9be6f4dfdaf68a0
Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Date:   Fri Feb 6 16:13:14 2026 +0100

    Assert chat_template is applied in test_train_with_chat_template_kwargs (#4991)

commit ffdaba3a97299c0c381512f807c57aa753f6314a
Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Date:   Fri Feb 6 16:11:40 2026 +0100

    Fix import of AutoModelForCausalLMWithValueHead from experimental (#4990)

commit 97a8a9672c0d5fbacb5c60934f40a7af404adecf
Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Date:   Fri Feb 6 07:10:06 2026 -0600

    Use local variable instead of attribute in collator tests (#4957)

commit 4e212bdeed6c7bf081e494960c65a84a35a85d6e
Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Date:   Fri Feb 6 07:07:28 2026 -0600

    Update dataset configuration name in toolcall dataset loading (#4984)

commit c82f6aa4766f83b66ea3f37e7bbf3b30453a1cda
Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Date:   Fri Feb 6 14:03:31 2026 +0100

    Fix passing tokenizer in test_train_with_chat_template_kwargs (#4987)

commit c581c1e8829904c6838c38d70d7b5fa646e2f0fc
Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Date:   Fri Feb 6 14:03:02 2026 +0100

    Pin transformers!=5.1.0 in deepspeed extra due to incompatibility (#4985)

commit 98aca7f4fdb2c2879c86aa9bd18ccddece112b70
Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Date:   Fri Feb 6 06:54:48 2026 -0600

    Replace `warmup_ratio` with `warmup_steps` (#4983)

commit 032ee139d90d1279549e2b538f11a2c3b7c22aa7
Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Date:   Fri Feb 6 00:34:04 2026 -0600

    [CI] Disallow installation of transformers 5.1.0 due to compatibility issues with DeepSpeed (#4982)

commit a0e5f265604356d2a107edccd920b5236582a3d7
Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Date:   Thu Feb 5 13:42:37 2026 -0600

    Deprecate parameters in `DPOConfig` (#4969)

    Co-authored-by: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>

commit f0a738d954775e50d4bd4a4df4fc5d1826e2f0b4
Author: Sergio Paniego Blanco <sergiopaniegoblanco@gmail.com>
Date:   Thu Feb 5 20:12:15 2026 +0100

    Simplify instructions of installation of OpenEnv  (#4980)

commit a92d14336e5380f1ce2b8cbca78ebd224933d2ba
Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Date:   Thu Feb 5 13:05:32 2026 -0600

    Replace `torch.allclose` with `torch.testing.assert_close` (#4977)

commit b0b798a82953094fb4d254915e4c360789ac2838
Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Date:   Thu Feb 5 19:53:14 2026 +0100

    Support truncated completions in GRPO multi-turn training (#4976)

commit ac194a917b2a8b7d514097d34078ec191ef4a0e3
Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Date:   Thu Feb 5 16:56:50 2026 +0100

    Fix add_column in test_train_with_chat_template_kwargs (#4979)

commit 3a76b7a8690e25838f2332b81d7efbfed2615277
Author: Sergio Paniego Blanco <sergiopaniegoblanco@gmail.com>
Date:   Thu Feb 5 16:24:29 2026 +0100

    Set specific OpenEnv version when installed (#4978)

commit 0113ad7022118e4a7afe62b4e190f0b9aee4cadf
Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Date:   Thu Feb 5 13:28:21 2026 +0100

    Remove truncation from tokenizer calls if no max_length (#4972)

commit eee98f77a25bb386a0aba85dcb93ad50511224f9
Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Date:   Thu Feb 5 13:27:01 2026 +0100

    Remove padding_value from experimental CPO and use pad_token_id (#4962)

commit 1354860c5c33bdee7a89a8845474ac4074094ed6
Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Date:   Thu Feb 5 06:07:05 2026 -0600

    Fix test_train_with_chat_template_kwargs (#4971)

commit 1bd2a52ec2d8344050af736d60cdc735181ae4b8
Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Date:   Thu Feb 5 04:18:38 2026 -0600

    Revert change in GRPO from NeMo-Gym Integration (#4970)

commit 22ad7e6b3f2ec7dfc2567fa0535955812fe69a42
Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Date:   Thu Feb 5 08:14:54 2026 +0100

    Remove max_prompt_length from experimental ORPO (#4966)

commit 657babd9300007308c7b9ad329790f0564daf94d
Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Date:   Thu Feb 5 08:13:10 2026 +0100

    Remove max_prompt_length from experimental CPO (#4965)

commit 50e35de16578daa9aee75758fad8bdc0f707e37d
Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Date:   Thu Feb 5 08:09:25 2026 +0100

    Remove max_prompt_length from experimental BCO (#4964)

commit 35bcab1d4da9fca092f9cefbc611fcd9eddaaf42
Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Date:   Thu Feb 5 08:02:34 2026 +0100

    Remove max_prompt_length from experimental PRM (#4963)

commit e4995b2d26122879c03605f8ee136bcb241b4171
Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Date:   Wed Feb 4 14:17:22 2026 -0600

    Add test for training with `compute_metrics` in `RewardTrainer` (#4958)

commit cb5a73bfd97a3cb36712cd540545c2512cfa96a6
Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Date:   Wed Feb 4 14:14:14 2026 -0600

    Add test for tool call data in `RewardTrainer` (#4959)

commit 90b875c575b816e9015670ad812db43c8ab9a0e3
Author: cmunley1 <cmunley@nvidia.com>
Date:   Wed Feb 4 08:56:55 2026 -0800

    NeMo-Gym Integration (#4848)

    Signed-off-by: Christian Munley <cmunley@nvidia.com>
    Signed-off-by: cmunley1 <cmunley@nvidia.com>
    Signed-off-by: Lawrence Lane <llane@nvidia.com>
    Co-authored-by: Sergio Paniego Blanco <sergiopaniegoblanco@gmail.com>
    Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>
    Co-authored-by: Lawrence Lane <llane@nvidia.com>

commit 5cb7eee1548bc72ed6fd84080c200a0adf74add2
Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Date:   Tue Feb 3 11:53:35 2026 -0600

    Remove access to `warnings_issued` (#4960)

commit 2a55ed701122f3d210669c70b441e3baff6184b6
Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Date:   Tue Feb 3 08:28:32 2026 -0600

    Add test for training with `compute_metrics` in `SFTTrainer` (#4950)

commit 7b54e7253093610ab69bb8c32b2a4eb6926721ea
Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Date:   Tue Feb 3 10:43:40 2026 +0100

    Minor fix docs style (#4953)

commit 2a9fb3f22a8bbad1412af3bb2526febd7160b85f
Author: mel3c <gaozh1988@live.com>
Date:   Tue Feb 3 16:02:07 2026 +0800

    Fix PPO run_name parameter not taking effect (#4945)

commit a03c2fcda3a328bde9af4abc5c02f6e7e942140f
Author: Sergio Paniego Blanco <sergiopaniegoblanco@gmail.com>
Date:   Mon Feb 2 16:27:47 2026 +0100

    Update wordle.py example with masking of env tokens (#4895)

    Co-authored-by: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
    Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
    Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com>

commit 68bc37700d2b66e1fbfa49282495f5419dd8abeb
Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Date:   Mon Feb 2 15:25:02 2026 +0100

    Remove ref_model_init_kwargs from experimental BCO (#4946)

commit 239c74d9ffb8ca67a9a667fb7a2a91576d554f28
Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Date:   Mon Feb 2 09:25:41 2026 +0100

    Fix SFTTrainer init logic: remove TrainingArguments.push_to_hub_token only for transformers < v5 (#4942)

commit 035c3ff151b953ca72cdfe0ee966bc1469a26fde
Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Date:   Thu Jan 29 14:08:08 2026 -0600

    [GRPO] Add parquet logging for completions with individual rewards (#4818)

    Co-authored-by: Daniel van Strien <davanstrien@gmail.com>
    Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>
    Co-authored-by: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
    Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

commit 414e60f557eb0d0888db841c5e0e8f568e7607a8
Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Date:   Thu Jan 29 18:57:12 2026 +0100

    Set default top_k to 0 in VLLMClient (#4927)

commit df332dc924e1bdc75bcfc5573950a17648db2eb4
Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Date:   Thu Jan 29 09:03:32 2026 -0600

    Fix import statement for import_utils in vllm_client.py (#4932)

commit 27998e9584df0102b849878506be7d4808486771
Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Date:   Thu Jan 29 15:51:19 2026 +0100

    Fix profiling of VLLMGeneration.sync_weights (#4931)

commit 43fb8d310633448a0c4c731a2efe9c1ca55e6184
Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Date:   Thu Jan 29 15:11:24 2026 +0100

    Set model dtype to float32 in experimental tests of trainers (#4925)

commit 5a7481ec9340dfad5f23c54f90e15a139c1dff85
Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Date:   Thu Jan 29 14:35:56 2026 +0100

    Move VLLMClient to generation module (#4928)

commit 21a0d70400179e4047c60183d7fb61988a249989
Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Date:   Thu Jan 29 14:32:59 2026 +0100

    Require transformers<5 with PairRMJudge (#4926)

commit 4348375ab2c6bad36ef90e1061b804b0449148f1
Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Date:   Thu Jan 29 14:30:13 2026 +0100

    Set model dtype to float32 in tests of trainers (#4924)

commit a6cbf279d7d3bc4024e6e6273d967509e7221e83
Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Date:   Thu Jan 29 07:06:06 2026 -0600

    Support tool call data in `is_conversational` (#4923)

commit ad91c6ffa91073684c4cf6dc2008e2994dd940e7
Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Date:   Wed Jan 28 11:47:15 2026 -0600

    Add validation for `sync_ref_model` in `GRPOTrainer` and `RLOOTrainer` when using PEFT models (#4912)

    Co-authored-by: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>

commit 04717ffca8fd91a0fa5ee610fbdc75ef8f3c5a22
Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Date:   Wed Jan 28 11:16:12 2026 -0600

    Update learning rate comments and add assertions for reference model parameters in GRPO and RLOO tests (#4914)

    Co-authored-by: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>

commit b322d9ba8092399b956882f61978ab3e90868c77
Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Date:   Wed Jan 28 10:54:04 2026 -0600

    Remove chat template setup in dpo_vlm.py (#4906)

commit a70b4e014756dc8595ac226d833deaba9784f756
Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Date:   Wed Jan 28 09:55:19 2026 -0600

    Fix extra EOS appended in DPO preprocessing for conversational data (#4908)

    Co-authored-by: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>

commit 8464b0e4b22c571bbf565a03ee154a5692c8d056
Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Date:   Wed Jan 28 16:03:53 2026 +0100

    Fix CI ValueError for 0 temperature (#4916)

commit 5461a74bc622660039e2038b6b0e5a43bdc712ae
Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Date:   Wed Jan 28 15:58:12 2026 +0100

    Fix CI AssertionError: assert not True (#4921)

commit d54381a4a90cb18152842158c62aad9895022448
Author: Boyi Zhang <68804418+billycrapediem@users.noreply.github.com>
Date:   Wed Jan 28 09:54:29 2026 -0500

    docs: add DoRA (2402.09353) to Paper Index (#4892)

    Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>

commit f2f6b32bdc3688b124d72caa412d60a8f12d80c0
Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Date:   Wed Jan 28 08:52:58 2026 -0600

    Remove gradient checkpointing option from various training scripts  (#4905)

commit 6cbc102f5fe94804e5a7579ff1aec270b97e4f5f
Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Date:   Wed Jan 28 08:35:23 2026 -0600

    Comment about overriding prediction_step in GRPOTrainer and RLOOTrainer (#4913)

commit f40edf9328adbe6c85acfb9dd9745e9c1393197e
Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Date:   Wed Jan 28 08:17:35 2026 -0600

    `device_map` init consistency in GRPO/RLOO/KTO (#4909)

commit a7070f940e8e0565adfbe9bbedd68b7850334b03
Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Date:   Wed Jan 28 08:14:30 2026 -0600

    Fix help text formatting for `max_length` in `RewardConfig` and `SFTConfig` (#4910)

commit 66efc0e52e55d77c2edf3e67c6c1f08e274ac9f8
Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Date:   Wed Jan 28 08:08:11 2026 -0600

    Rearrange variable assignments in `DataCollatorForVisionLanguageModeling` (#4911)

commit e9a2f16004a00a50e69e5779f58bf0bc24937de7
Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Date:   Wed Jan 28 11:43:26 2026 +0100

    Fix CI TypeError in llm-blender tests (#4919)

commit 4f8232098c10c98ad7febe971da4eb362d13433c
Author: adityachallapally <avasanthc@gmail.com>
Date:   Wed Jan 28 01:10:29 2026 -0800

    Created new PTT integration docs as requested (#4907)

    Co-authored-by: Aditya Challapally <adchalla@microsoft.com>
    Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>

commit 0eb66d8f2fc63b3d00d8dbc18f99c3f48750bd16
Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Date:   Tue Jan 27 16:53:29 2026 +0100

    Refactor vLLM generation [1/N]: Extract vLLM generation (#4700)

commit 226ef57192b49801c3be8c55c798c6d5b134b080
Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Date:   Tue Jan 27 14:34:11 2026 +0100

    Fix CI AssertionError: Parameter has not changed (#4904)

commit 956986ebd53ff0d8dfa688e9d1033488dcad55d6
Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Date:   Tue Jan 27 14:33:56 2026 +0100

    Fix CI NotImplementedError for bfloat16 (#4902)

commit 4322778d7f696a4fc1fc33612b02eeb5ec700109
Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Date:   Mon Jan 26 12:57:43 2026 -0600

    Transformers v5 release: extend xfail condition for `TestGRPOTrainer.test_training_vlm_and_liger` and update version checks (#4898)

commit e106972dd6d839f4a3d3fcaffc1f386b4fbe66bf
Author: Cola Chan (SII) <57797863+141forever@users.noreply.github.com>
Date:   Mon Jan 26 17:56:38 2026 +0800

    GOLD training speed up (#4888)

    Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>

commit c477e88e05023dbcd45211c1a802788650598909
Author: Yi-Chen Li <ychenli.X@gmail.com>
Date:   Fri Jan 23 21:25:36 2026 +0800

    Fix RewardTrainer's results not reproducible (#4887)

    Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com>

commit ba053232324b207554116f806edbb2ec8b6ab9f5
Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Date:   Thu Jan 22 08:15:31 2026 -0600

    Fix import path for `get_open_port` based on vLLM version (#4883)

commit e66a138438a3beba08756543fa41b7a90054ee8c
Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Date:   Thu Jan 22 08:14:48 2026 -0600

    Mark ZeRO 2 as xfail in distributed tests due to current failure (#4885)

commit a60d75aa1efa6ac5330649aafd425859da685a63
Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Date:   Wed Jan 21 15:44:01 2026 -0600

    Test distributed training for `RewardTrainer`, `RLOOTrainer` and `GRPOTrainer` (#4823)

    Co-authored-by: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>

commit 60e46742576876209658446b50f144541873301b
Author: Wing Lian <wing@axolotl.ai>
Date:   Wed Jan 21 16:15:03 2026 -0500

    Enable vLLM sleep mode for generation in Online DPO (#4882)

    Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
    Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>
    Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com>

commit 0a881bcee992a25e2fc0e980cc43a7428ce17373
Author: Kirill Dubovikov <dubovikov.kirill@gmail.com>
Date:   Thu Jan 22 01:09:34 2026 +0400

    Bugfix: Logprob drift in vLLM serving mode (compared to colocate mode) (#4873)

    Co-authored-by: Kirill Dubovikov <kirill.dubivokov@mbzuai.ac.ae>
    Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com>

commit 16b090302b8fd408870baa7452b5c3a29e03c346
Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Date:   Wed Jan 21 03:00:04 2026 -0600

    Fix SFT training for prompt-completion type and transformers v5 (#4880)

commit b080a4c27a60988be213354f551e26d3a4b2eef9
Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Date:   Wed Jan 21 08:07:42 2026 +0100

    Remove label_pad_token_id from experimental trainers (#4878)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

4 participants