Description
While debugging the calx example, we observed that as the number of training iterations increases, the agent tends to stop calling tools.
When using GRPO for multi-turn agent RL training (e.g., tool-calling scenarios), a single rollout can produce multiple training samples — one per conversation turn — where each turn has a different prompt (because earlier turns' tool calls and tool responses get appended to the context).
For example, with rollout_n=4 for one question:
| Sample | Prompt | Response | Reward |
|---|---|---|---|
| Rollout1-Turn0 | [system, user] | "Let me calculate... <tool_call>..." | 1.0 |
| Rollout1-Turn1 | [system, user, assistant, tool_response] | "The answer is 40" | 1.0 |
| Rollout2-Turn0 | [system, user] | "The answer is π square cm" (no tool) | 0.0 |
| Rollout3-Turn0 | [system, user] | "I'll use the calculator... <tool_call>..." | 1.0 |
| Rollout3-Turn1 | [system, user, assistant, tool_response] | "### ANSWER: 40" | 1.0 |
| Rollout4-Turn0 | [system, user] | "The answer is 35" (no tool) | 0.0 |
All 6 samples share the same uid (derived from the same original question), so GRPO computes advantages within this single group. Standard GRPO assumes all samples in a group are generated from the same prompt. However, in multi-turn agent scenarios, Turn0 samples have prompt [system, user], while Turn1 samples have prompt [system, user, assistant, tool_response], which are fundamentally different inputs.

Is it theoretically valid to compute group-normalized advantages across samples with different prompts?
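To make the concern concrete, here is a minimal sketch (not the framework's actual implementation) of how group normalization would treat the six samples above when they all share one uid; the reward values and ordering follow the table, and the epsilon is only for illustration.

```python
import numpy as np

# Rewards in table order: R1-T0, R1-T1, R2-T0, R3-T0, R3-T1, R4-T0.
# All six samples fall into the same group because they share a uid,
# even though Turn0 and Turn1 samples were generated from different prompts.
rewards = np.array([1.0, 1.0, 0.0, 1.0, 1.0, 0.0])

# GRPO-style group normalization: subtract the group mean and divide by
# the group standard deviation (epsilon guards against zero variance).
advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-6)

print(advantages)
# approximately [ 0.707  0.707 -1.414  0.707  0.707 -1.414]
```

Under this scheme, a Turn1 sample conditioned on [system, user, assistant, tool_response] is normalized against Turn0 samples conditioned on [system, user], so its advantage reflects how the whole group scored rather than how good that response was given its own prompt.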