
GRPO grouping in multi-turn agent RL: is it valid to mix samples with different prompts in the same group? #489

@yangdongdong2000

Description


While debugging the calx example, we observed that as the number of training iterations increases, the agent tends to stop calling tools.
When using GRPO for multi-turn agent RL training (e.g., tool-calling scenarios), a single rollout can produce multiple training samples — one per conversation turn — where each turn has a different prompt (because earlier turns' tool calls and tool responses get appended to the context).
For example, with rollout_n=4 for one question:

| Sample | Prompt | Response | Reward |
|---|---|---|---|
| Rollout1-Turn0 | [system, user] | "Let me calculate... `<tool_call>`..." | 1.0 |
| Rollout1-Turn1 | [system, user, assistant, tool_response] | "The answer is 40" | 1.0 |
| Rollout2-Turn0 | [system, user] | "The answer is π square cm" (no tool) | 0.0 |
| Rollout3-Turn0 | [system, user] | "I'll use the calculator... `<tool_call>`..." | 1.0 |
| Rollout3-Turn1 | [system, user, assistant, tool_response] | "### ANSWER: 40" | 1.0 |
| Rollout4-Turn0 | [system, user] | "The answer is 35" (no tool) | 0.0 |

All 6 samples share the same uid (derived from the same original question), so GRPO computes advantages within this single group. Standard GRPO assumes all samples in a group are generated from the same prompt. However, in multi-turn agent scenarios, Turn0 samples have prompt [system, user], while Turn1 samples have prompt [system, user, assistant, tool_response] — these are fundamentally different inputs. Is it theoretically valid to compute group-normalized advantages across samples with different prompts?
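To make the concern concrete, here is a minimal sketch (not the actual training code) of standard GRPO group normalization applied to the six samples above, assuming the usual `(reward - group_mean) / (group_std + eps)` formula and grouping purely by a shared `uid`. The sample labels and rewards are taken from the table; the `uid` value is illustrative.

```python
# Sketch of GRPO advantage computation when per-turn samples share one uid.
# Turn0 and Turn1 samples land in the same normalization pool despite
# having different prompts.
from collections import defaultdict
import numpy as np

samples = [
    # (uid, label, reward) -- the six samples from the table above
    ("q1", "Rollout1-Turn0", 1.0),
    ("q1", "Rollout1-Turn1", 1.0),
    ("q1", "Rollout2-Turn0", 0.0),
    ("q1", "Rollout3-Turn0", 1.0),
    ("q1", "Rollout3-Turn1", 1.0),
    ("q1", "Rollout4-Turn0", 0.0),
]

groups = defaultdict(list)
for uid, label, reward in samples:
    groups[uid].append((label, reward))

eps = 1e-6
for uid, members in groups.items():
    rewards = np.array([r for _, r in members])
    # The baseline is the mean reward over ALL samples sharing the uid,
    # so Turn0 responses to [system, user] and Turn1 responses to
    # [system, user, assistant, tool_response] are normalized together.
    advantages = (rewards - rewards.mean()) / (rewards.std() + eps)
    for (label, r), adv in zip(members, advantages):
        print(f"{uid} {label}: reward={r:.1f} advantage={adv:+.3f}")
```

With these rewards the group mean is 4/6 ≈ 0.667, so every successful turn (including the Turn1 continuations that only exist because a tool was called) gets the same positive advantage, while the two no-tool failures get a larger negative one; whether that cross-prompt baseline is statistically meaningful is exactly the question.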
