Description
While debugging the calx example, we observed that as the number of training iterations increases, the agent tends to stop calling tools.
When using GRPO for multi-turn agent RL training (e.g., tool-calling scenarios), a single rollout can produce multiple training samples — one per conversation turn — where each turn has a different prompt (because earlier turns' tool calls and tool responses get appended to the context).
For example, with rollout_n=4 for one question:
| Sample | Prompt | Response | Reward |
|---|---|---|---|
| Rollout1-Turn0 | [system, user] | "Let me calculate... <tool_call>..." | 1.0 |
| Rollout1-Turn1 | [system, user, assistant, tool_response] | "The answer is 40" | 1.0 |
| Rollout2-Turn0 | [system, user] | "The answer is π square cm" (no tool) | 0.0 |
| Rollout3-Turn0 | [system, user] | "I'll use the calculator... <tool_call>..." | 1.0 |
| Rollout3-Turn1 | [system, user, assistant, tool_response] | "### ANSWER: 40" | 1.0 |
| Rollout4-Turn0 | [system, user] | "The answer is 35" (no tool) | 0.0 |
All 6 samples share the same uid (derived from the same original question), so GRPO computes advantages within this single group. Standard GRPO assumes all samples in a group are generated from the same prompt. However, in multi-turn agent scenarios, Turn0 samples have prompt [system, user], while Turn1 samples have prompt [system, user, assistant, tool_response], which are fundamentally different inputs.

Is it theoretically valid to compute group-normalized advantages across samples with different prompts?
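To make the concern concrete, here is a minimal sketch (not the framework's actual implementation) of how group normalization would treat the six samples above when they all share one uid; the reward values and ordering follow the table, and the epsilon is only for illustration.

```python
import numpy as np

# Rewards in table order: R1-T0, R1-T1, R2-T0, R3-T0, R3-T1, R4-T0.
# All six samples fall into the same group because they share a uid,
# even though Turn0 and Turn1 samples were generated from different prompts.
rewards = np.array([1.0, 1.0, 0.0, 1.0, 1.0, 0.0])

# GRPO-style group normalization: subtract the group mean and divide by
# the group standard deviation (epsilon guards against zero variance).
advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-6)

print(advantages)
# approximately [ 0.707  0.707 -1.414  0.707  0.707 -1.414]
```

Under this scheme, a Turn1 sample conditioned on [system, user, assistant, tool_response] is normalized against Turn0 samples conditioned on [system, user], so its advantage reflects how the whole group scored rather than how good that response was given its own prompt.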