Change online training verifier#1371
Draft
jacklanchantin wants to merge 66 commits into online_training from
…airseq2 into jacklanchantin/drgrpo
added 10 commits
October 23, 2025 03:57
…2 into jacklanchantin/drgrpo
9090bd5 to 14d2571
…q2 into jacklanchantin/drgrpo
* do not throttle client-server port
* maybe_sync_fix
* move tokenizer
* ppl/logp reward
* remove prepare_preference_batch func (intended for online dpo)
* keep a no-op prepare_preference_batch method to instantiate
* make logging clearer.
* 1. add additional logic to add whitespace if needed. 2. not reusing prefix token from fs2.
* pass prefix text rather than tokens in rm (it would also support rm tokenizer different from policy model)
* clear up string_input flag.

---------

Co-authored-by: uralik <kulikov@cs.nyu.edu>
Co-authored-by: jacklanchantin <jacklanchantin@gmail.com>
…airseq2 into jacklanchantin/drgrpo
f967068 to a6013ce
…ks w/o think in rollout. (#1488)
Fix the training pipeline bug: update_avg_think_rollout_length is not always called, which later causes a synchronization issue in the ALLGATHER called by sync_and_compute_collection (in metrics/_bag.py).
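To illustrate the failure mode described in that commit, here is a toy sketch in plain Python, not the fairseq2 code. It assumes the problem is that a rank with no think rollouts never registers the metric, so ranks disagree on what to gather. The function `record_think_lengths`, the key name, and `gather_keys` are hypothetical stand-ins; the real fix in #1488 may look different.

```python
from typing import Dict, List


def record_think_lengths(metrics: Dict[str, List[float]], think_lengths: List[int]) -> None:
    """Hypothetical stand-in for update_avg_think_rollout_length."""
    # Always create the entry, even when this rank saw no think rollouts,
    # so that every rank holds the same metric keys when they are gathered.
    entry = metrics.setdefault("avg_think_rollout_length", [0.0, 0.0])  # [sum, count]
    entry[0] += float(sum(think_lengths))
    entry[1] += float(len(think_lengths))


def gather_keys(per_rank: List[Dict[str, List[float]]]) -> List[str]:
    """Toy check standing in for a collective: all ranks must agree on keys."""
    keys = [sorted(m) for m in per_rank]
    if any(k != keys[0] for k in keys):
        raise RuntimeError("ranks disagree on metric keys; a real all-gather could hang or fail")
    return keys[0]


# Rank 0 saw two rollouts with think sections, rank 1 saw none.
rank0: Dict[str, List[float]] = {}
rank1: Dict[str, List[float]] = {}
record_think_lengths(rank0, [12, 20])
record_think_lengths(rank1, [])
shared_keys = gather_keys([rank0, rank1])
```

Because the update runs unconditionally, the rank without think rollouts contributes a zero sum and zero count instead of skipping the metric, and the key check passes.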
What does this PR do? Please describe:
* adv_std_normarlization (for DrGRPO)
* loss_token_mean for normalizing over all tokens
* tis_imp_ratio_cap to use truncated importance sampling correction

Fixes #{issue number}
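For reviewers unfamiliar with these three options, the following is a minimal sketch in plain Python of what each one plausibly controls. It is not the code in this PR: the function names, signatures, the `eps` default, and the behavior when `token_mean` is off (a per-sequence mean averaged over sequences) are all assumptions made for illustration.

```python
import math
from typing import List, Optional, Sequence


def group_advantages(
    rewards: Sequence[float], std_normalize: bool, eps: float = 1e-6
) -> List[float]:
    """Center rewards within one rollout group; optionally divide by the std.

    Dr. GRPO drops the std division, so std_normalize=False mimics it.
    """
    mean = sum(rewards) / len(rewards)
    centered = [r - mean for r in rewards]
    if not std_normalize:
        return centered
    std = math.sqrt(sum(c * c for c in centered) / len(rewards))
    return [c / (std + eps) for c in centered]


def truncated_is_weight(
    logp_train: float, logp_rollout: float, cap: Optional[float]
) -> float:
    """Truncated importance sampling: min(pi_train / pi_rollout, cap)."""
    if cap is None:  # assumption: no cap means the correction is disabled
        return 1.0
    return min(math.exp(logp_train - logp_rollout), cap)


def aggregate_loss(token_losses: Sequence[Sequence[float]], token_mean: bool) -> float:
    """Reduce per-token losses to a scalar.

    token_mean=True averages over all tokens in the batch; otherwise
    (assumed alternative) average per sequence, then across sequences.
    """
    if token_mean:
        total = sum(sum(seq) for seq in token_losses)
        count = sum(len(seq) for seq in token_losses)
        return total / count
    per_seq = [sum(seq) / len(seq) for seq in token_losses]
    return sum(per_seq) / len(per_seq)


# Small worked inputs: one group of four rewards, two sequences of unequal length.
adv_plain = group_advantages([1.0, 0.0, 1.0, 0.0], std_normalize=False)
adv_std = group_advantages([1.0, 0.0, 1.0, 0.0], std_normalize=True)
w_capped = truncated_is_weight(logp_train=0.0, logp_rollout=-1.0, cap=2.0)
loss_tok = aggregate_loss([[1.0, 1.0, 1.0, 1.0], [3.0]], token_mean=True)
loss_seq = aggregate_loss([[1.0, 1.0, 1.0, 1.0], [3.0]], token_mean=False)
```

With unequal sequence lengths the two reductions differ (1.4 over all tokens versus 2.0 per sequence here), which is why the choice of normalization changes how much long rollouts contribute to the gradient.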
Does your PR introduce any breaking changes? If yes, please list them:
List of all backwards-incompatible changes.
Check list: