Problem
Agent Lightning's training loop (RL, SFT, prompt optimization) optimizes toward a reward signal. This works well for tasks with stable reward landscapes, but introduces a structural risk: trained agents lose the ability to discover novel behaviors that weren't represented in the reward signal.
This isn't hypothetical: reward hacking, policy collapse, and mode collapse in generative models are well-documented failure modes in the RL literature, but Agent Lightning doesn't currently surface any mechanisms to detect or mitigate them.
Concrete Engineering Gaps
- No exploration decay monitoring
Once an agent's policy converges, its trajectory diversity drops. Currently there's no metric tracking behavioral diversity over training runs — e.g., how many distinct tool-call sequences the agent explores at epoch N vs epoch 0. Without this, you can't distinguish "agent got better" from "agent got narrower."
Suggested metric: Trajectory entropy over a sliding window. If entropy drops below a threshold while reward plateaus, the agent has likely collapsed to a fixed policy rather than genuinely mastering the task space.
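A minimal sketch of the sliding-window entropy idea above, assuming each trajectory can be reduced to a sequence of tool-call names. The class name and that representation are hypothetical, not part of Agent Lightning's API:

```python
# Sketch only: Shannon entropy over tool-call sequences in a sliding window
# of recent trajectories. The trajectory format (a list of tool-call names
# per rollout) is a stand-in, not Agent Lightning's actual schema.
from collections import Counter, deque
from math import log2


class TrajectoryEntropyMonitor:
    def __init__(self, window_size: int = 500):
        self.window: deque[tuple[str, ...]] = deque(maxlen=window_size)

    def record(self, tool_calls: list[str]) -> None:
        # Collapse each rollout to its tool-call sequence signature.
        self.window.append(tuple(tool_calls))

    def entropy(self) -> float:
        # Shannon entropy (bits) over distinct sequences in the window.
        counts = Counter(self.window)
        total = sum(counts.values())
        if total == 0:
            return 0.0
        return -sum((c / total) * log2(c / total) for c in counts.values())


# Usage: call record() once per rollout; if entropy() falls while mean reward
# plateaus, flag likely policy collapse rather than genuine mastery.
```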
- No reward function staleness detection
The reward signal is assumed to be correct for the duration of training. But in real deployments, the environment changes — APIs evolve, data distributions shift, new capabilities become available. A reward function that was correct last month may be optimizing for a stale objective today.
Suggested mechanism: Periodic reward function auditing. Track correlation between reward signal and downstream task success (measured independently). When correlation degrades, flag the reward function for review rather than continuing to train against it.
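One way this audit could look, assuming a downstream success signal is measured independently of the training reward. The class, thresholds, and hook points are assumptions; no such mechanism exists in Agent Lightning today:

```python
# Sketch only: periodically checks how well the training reward still tracks
# an independently measured downstream success metric, and flags the reward
# function for review when the correlation degrades.
from collections import deque
from statistics import StatisticsError, correlation  # Pearson's r, Python 3.10+


class RewardStalenessAuditor:
    def __init__(self, window_size: int = 200, min_r: float = 0.4):
        self.rewards: deque[float] = deque(maxlen=window_size)
        self.successes: deque[float] = deque(maxlen=window_size)
        self.min_r = min_r

    def record(self, reward: float, downstream_success: float) -> None:
        self.rewards.append(reward)
        self.successes.append(downstream_success)

    def is_stale(self) -> bool:
        # Not enough evidence yet: don't raise false alarms.
        if len(self.rewards) < 30:
            return False
        try:
            return correlation(self.rewards, self.successes) < self.min_r
        except StatisticsError:
            # Zero variance in either series is itself suspicious.
            return True


# Usage: when is_stale() returns True, pause training against this reward and
# route the reward function to human review instead.
```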
- No mechanism for productive multi-objective tension
Many real agent tasks involve competing objectives that shouldn't be collapsed into a single scalar reward. For example:
- Speed vs. thoroughness in research agents
- Cost vs. quality in code generation
- Exploitation of known-good strategies vs. exploration of potentially-better ones
Agent Lightning's current architecture assumes a single reward signal per trajectory. This forces the user to pre-resolve tensions that might be more valuable if left as explicit competing objectives.
Suggested approach: Support Pareto-front tracking across multiple reward dimensions. Let the trainer surface the trade-off frontier rather than collapsing it into a weighted sum. This preserves optionality and lets downstream consumers choose their operating point.
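A rough sketch of Pareto-front tracking over vector-valued rewards. The reward-vector format and the `ScoredTrajectory` type are illustrative assumptions, not an existing Agent Lightning interface:

```python
# Sketch only: keep the Pareto front of trajectories scored on multiple
# reward dimensions (e.g. speed vs. thoroughness) instead of collapsing
# them into a single scalar.
from dataclasses import dataclass


@dataclass
class ScoredTrajectory:
    trajectory_id: str
    rewards: tuple[float, ...]  # one entry per objective, higher is better


def dominates(a: tuple[float, ...], b: tuple[float, ...]) -> bool:
    # a dominates b if it is >= on every objective and > on at least one.
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(candidates: list[ScoredTrajectory]) -> list[ScoredTrajectory]:
    front: list[ScoredTrajectory] = []
    for c in candidates:
        if any(dominates(o.rewards, c.rewards) for o in candidates if o is not c):
            continue
        front.append(c)
    return front


# Usage: surface pareto_front(batch) to the user so they choose the operating
# point, rather than baking a weighted sum into the reward function.
```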
- No "policy dissolution" mechanism for stale behaviors
Once a behavior is trained into an agent, there's no structured way to detect that it has become counterproductive and should be unlearned. Fine-tuned behaviors accrete: the agent accumulates layers of learned policy but never sheds outdated ones.
Suggested mechanism: Attach TTL metadata or validity conditions to trained behaviors. For example: "this prompt optimization was trained against GPT-4o pricing as of 2025-08. Re-evaluate if pricing model changes." The trainer could periodically re-validate policies against their stated conditions and flag stale ones.
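A possible shape for that metadata, with field names and the re-validation hook chosen for illustration (none of this exists in Agent Lightning today):

```python
# Sketch only: attach validity metadata to a trained artifact and re-check it
# periodically, flagging stale policies for re-training or removal.
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Callable, Optional


@dataclass
class PolicyValidity:
    description: str                       # e.g. "prompt tuned against 2025-08 pricing"
    expires_at: Optional[datetime] = None  # hard TTL (timezone-aware)
    still_valid: Optional[Callable[[], bool]] = None  # cheap re-validation probe


def is_stale(v: PolicyValidity, now: Optional[datetime] = None) -> bool:
    now = now or datetime.now(timezone.utc)
    if v.expires_at is not None and now >= v.expires_at:
        return True
    if v.still_valid is not None and not v.still_valid():
        return True
    return False


# Usage: run is_stale() on a schedule over all trained behaviors; anything
# flagged goes to review instead of silently remaining in the policy.
```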
- Trajectory capture doesn't distinguish novel vs. routine behavior
`agl.emit_*()` captures what the agent did and what reward it got. It doesn't capture whether the behavior was novel (first time the agent tried this approach) or routine (repeating a known-good policy). This distinction matters because:
- High reward + novel behavior = genuine discovery (should be studied, not just reinforced)
- High reward + routine behavior = exploitation (fine, but not learning)
- Low reward + novel behavior = exploration (might be valuable signal, currently just negative reward)
Suggested approach: Add an optional novelty flag or auto-compute trajectory similarity to historical traces. Novel high-reward trajectories could be surfaced to the user for analysis rather than just folded into the training batch.
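One way the auto-computed variant could work, assuming trajectories are again reduced to tool-call sequences; the similarity measure, threshold, and class name are all assumptions:

```python
# Sketch only: mark a trajectory as novel if its tool-call sequence is far
# (by SequenceMatcher similarity) from everything seen so far.
from difflib import SequenceMatcher


class NoveltyTracker:
    def __init__(self, novelty_threshold: float = 0.5, max_history: int = 2000):
        self.history: list[tuple[str, ...]] = []
        self.threshold = novelty_threshold
        self.max_history = max_history

    def is_novel(self, tool_calls: list[str]) -> bool:
        seq = tuple(tool_calls)
        # Similarity of 1.0 means an identical sequence already exists in history.
        best = max(
            (SequenceMatcher(None, seq, past).ratio() for past in self.history),
            default=0.0,
        )
        novel = best < self.threshold
        if len(self.history) < self.max_history:
            self.history.append(seq)
        return novel


# Usage: tag each emitted trajectory with is_novel(); route novel high-reward
# trajectories to human review instead of folding them silently into training.
```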
Why This Matters at Scale
The README highlights 128-GPU training and production deployments (Tencent's Youtu-Agent). At that scale, these gaps compound:
- Policy collapse is harder to detect across distributed training
- Reward staleness affects all replicas simultaneously
- Novel behaviors are drowned out by the volume of routine trajectories
- Multi-objective tensions get resolved implicitly by whoever designed the reward, with no visibility into what was lost
Context
These observations come from building an MCP server where AI agents autonomously discover DeFi strategies. We found that the environment design (what signals agents see, what ambiguity is preserved, what tensions are left unresolved) matters as much as agent optimization for producing useful novel behavior. The failure mode we kept hitting: agents that were "better" by any metric were also more predictable and less likely to find strategies we hadn't anticipated.