…q2 into swarna/skyworkv2
) in self._config.loss_config.validation_vllm_sampling_params.items():
    policy_sampling_params.__setattr__(k, v)

# For a pairwise RM, need to sample at least two judgments
this is 2 rollouts per prompt, but i assume there are 2 copies of the prompt (2 different orders)? so shouldn't you only need 1 rollout for each order?
sorry, this is a typo! I meant 2 "rollouts" instead of "judgments".
How many judgment prompts are created from those two rollouts (i.e., 2, with (a,b) and (b,a) covering both orders) is handled automatically by the "all_pairs" setting in the reward config. With all_pairs, we get N * (N-1) judgments, so N = 2 gives 2 judgments. Let me know if this makes sense!
yep, makes sense now
@swarnaHub Is this ready for review? You can change the status from draft if so?
84e7bd7 to 4ea811d
fce332b to 6703f1b
What does this PR do? Please describe:
This PR enables the use of generative RMs in GRPO training. Below is a summary of the main changes:
This is what a typical reward config looks like:
reward:
  name: "generative_pairwise_verifier"
  config:
    prompt_key: prompt_raw
    tokenizer: /datasets/pretrained-llms/Qwen3-8B/
    judgment_extractor: "j1_pairwise_score_extractor"
    pair_type: "all_pairs"
To use a pairwise RM, set name to "generative_pairwise_verifier"; for pointwise RMs, use "generative_pointwise_verifier". There is currently no option for k-wise judgments, but I am adding support for it now.
The "judgment_extractor" field refers to a class that implements how we (1) prompt the RM, (2) extract scores/judgments from the judgment CoTs, and (3) aggregate multiple scores (e.g., when sampling several judgments for self-consistency).
GenRMs can be used in a reference-free or a reference-based manner (e.g., for math, you might have access to reference answers); this need not be stated explicitly in the config. If the input file has a reference answer field, it will be used. All extractors now take the reference answer as an argument, which can be empty.
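To make the three extractor responsibilities and the optional reference answer concrete, here is a minimal sketch of what such a class could look like. Everything in it is an assumption for illustration: the class name ToyPairwiseExtractor, the method names, the prompt wording, and the "[[A]]" / "[[B]]" verdict format are hypothetical and are not the interface of j1_pairwise_score_extractor in this PR.

```python
import re
from statistics import mean


class ToyPairwiseExtractor:
    """Hypothetical extractor; NOT the PR's j1_pairwise_score_extractor."""

    # Assumed verdict format: the judge ends its CoT with [[A]] or [[B]].
    VERDICT = re.compile(r"\[\[([AB])\]\]")

    def prompt(self, question, response_a, response_b, reference_answer=""):
        """(1) Build the judge prompt; the reference is included only if given."""
        parts = [
            f"Question:\n{question}",
            f"Response A:\n{response_a}",
            f"Response B:\n{response_b}",
        ]
        if reference_answer:
            parts.append(f"Reference answer:\n{reference_answer}")
        parts.append("Think step by step, then end with [[A]] or [[B]].")
        return "\n\n".join(parts)

    def extract(self, judgment):
        """(2) Return 1.0 if A wins, 0.0 if B wins, None if no verdict is found."""
        matches = self.VERDICT.findall(judgment)
        if not matches:
            return None
        return 1.0 if matches[-1] == "A" else 0.0

    def aggregate(self, judgments):
        """(3) Average the parseable verdicts; fall back to a tie (0.5)."""
        scores = [s for s in map(self.extract, judgments) if s is not None]
        return mean(scores) if scores else 0.5
```

Averaging is only one possible aggregation; majority voting over the sampled judgments would fit the same interface.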
Finally, "pair_type" selects the setting in which a pairwise RM is used. There are three options now: (1) "pivot" (all rollouts are judged against a reference rollout), (2) "random_pairs" (we randomly sample N pairwise comparisons), and (3) "all_pairs" (all N*(N-1) pairs are constructed for judgment). See the GenerativePairwiseVerifier class in _rewards.py for details.
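The three pair_type settings can be sketched as index-pair construction over N rollouts. This is an illustrative sketch only, not the code in GenerativePairwiseVerifier: the function name build_pairs, its parameters, the choice of rollout 0 as the default pivot, and sampling random pairs without replacement are all assumptions.

```python
import itertools
import random


def build_pairs(n, pair_type, pivot=0, seed=0):
    """Return ordered (i, j) rollout-index pairs to send to a pairwise judge.

    Hypothetical helper for illustration; names and defaults are assumptions.
    """
    if pair_type == "all_pairs":
        # Every ordered pair, so both (a, b) and (b, a) are judged: N*(N-1) pairs.
        return list(itertools.permutations(range(n), 2))
    if pair_type == "pivot":
        # Each non-pivot rollout is compared against one fixed pivot rollout.
        return [(i, pivot) for i in range(n) if i != pivot]
    if pair_type == "random_pairs":
        # Sample N distinct ordered pairs (fewer if fewer than N exist).
        candidates = list(itertools.permutations(range(n), 2))
        rng = random.Random(seed)
        return rng.sample(candidates, min(n, len(candidates)))
    raise ValueError(f"unknown pair_type: {pair_type!r}")
```

With N = 2 rollouts, "all_pairs" yields the two judgments discussed in the review thread: one for (a, b) and one for (b, a).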
Unrelated change: This PR also adds support for the Skywork v2 RM.