Description
Summary
With the framework adapter strategy (#74), Agent Arena no longer controls the LLM call pipeline. The current inspector (TraceStore/ReasoningTrace) tries to capture both game-side events and LLM internals. It should be refocused to capture only what frameworks can't see: the game world.
Current State
TraceStore captures the full pipeline:
- Observation → Prompt built → LLM request → Raw response → Parsed decision
- TraceStepName enum: observation, retrieved, prompt, llm_request, response, decision
- Episode-based JSONL persistence
- Web UI for viewing traces
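The current pipeline's step names can be sketched as an enum; this is a hypothetical reconstruction based on the step names listed above, not the actual TraceStore source:

```python
# Sketch of the current TraceStepName enum described above.
# The real definition in TraceStore may differ in naming and structure.
from enum import Enum

class TraceStepName(str, Enum):
    OBSERVATION = "observation"    # what the agent perceived (kept)
    RETRIEVED = "retrieved"        # memory retrieval (kept)
    PROMPT = "prompt"              # exact prompt text (to be removed)
    LLM_REQUEST = "llm_request"    # raw LLM request (to be removed)
    RESPONSE = "response"          # raw LLM response (to be removed)
    DECISION = "decision"          # parsed decision (kept)
```

Under this refactor, the prompt, llm_request, and response steps would be delegated to framework observability.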
Problem
When a framework (LangGraph, Claude SDK, etc.) owns the LLM call:
- We don't have access to the exact prompt, raw response, or token counts
- Frameworks have their own observability (LangSmith, Anthropic Console, etc.) that is more comprehensive than our inspector
- Trying to intercept framework internals is fragile and framework-specific
Proposed Refactor
Keep (game-side debugging — only we can provide this)
- Observation logging: what the agent perceived each tick (nearby resources, hazards, health)
- Tool execution logging: which tools called, what params, success/failure, duration
- Episode boundaries and game events (resource clusters, damage, crafting)
- Outcome tracking: score, damage taken, resources collected, exploration %
- Spatial context: "agent remembered berries from 30 ticks ago"
- Web UI for episode replay
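A minimal sketch of what a kept, game-side trace event might look like with the existing episode-based JSONL persistence. All names here (GameTraceEvent, to_jsonl) are illustrative assumptions, not the real API:

```python
# Illustrative game-side trace event covering the "Keep" list above.
import json
import time
from dataclasses import dataclass, asdict, field

@dataclass
class GameTraceEvent:
    tick: int
    kind: str                       # "observation" | "tool_call" | "game_event" | "outcome"
    payload: dict = field(default_factory=dict)
    timestamp: float = field(default_factory=time.time)

def to_jsonl(event: GameTraceEvent) -> str:
    """Serialize one event as a single JSONL line for episode persistence."""
    return json.dumps(asdict(event))

# Example: at tick 47 the agent perceived 3 berries and a fire hazard.
line = to_jsonl(GameTraceEvent(
    tick=47,
    kind="observation",
    payload={"resources": {"berries": 3}, "hazards": ["fire"]},
))
```

Because only the game loop sees observations and tool outcomes, this data cannot be recovered from LangSmith or other framework tooling.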
Remove (delegate to framework observability)
- Prompt capture (TraceStepName.prompt)
- LLM request/response logging (TraceStepName.llm_request, TraceStepName.response)
- Token counting
- Raw response parsing traces
Add
- Framework trace linking: Store a reference (e.g., LangSmith run URL) per tick so users can click from game replay → LLM trace for that specific decision
- Tool call/result pairs: Log both what the agent requested and what the game returned
- Decision quality annotations: Was this a good/bad decision based on outcome?
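The proposed additions could be combined into one per-tick record: a tool call/result pair plus an optional link into the framework's own trace. Field names below are assumptions, and the URL is a placeholder, not a real run:

```python
# Sketch of a per-tick tool call/result record with a framework trace link.
from dataclasses import dataclass
from typing import Optional

@dataclass
class ToolCallRecord:
    tick: int
    tool: str                       # e.g. "move_to"
    params: dict                    # what the agent requested
    result: dict                    # what the game returned
    success: bool
    duration_ms: float
    framework_trace_url: Optional[str] = None  # e.g. a LangSmith run URL

record = ToolCallRecord(
    tick=47,
    tool="move_to",
    params={"x": 10, "y": 0, "z": 5},
    result={"arrived": True},
    success=True,
    duration_ms=1250.0,
    framework_trace_url="https://smith.langchain.com/runs/<run-id>",  # placeholder
)
```

Storing the URL per tick is what lets the replay UI offer a "View LLM reasoning" link for that specific decision.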
User Experience
┌─────────────────────────────────────────────┐
│ AGENT ARENA INSPECTOR (game-side) │
│ │
│ Tick 47: Saw {3 berries, 1 fire} │
│ Tick 47: Agent called move_to(10, 0, 5) │
│ Tick 52: Arrived, collected berry │
│ Tick 52: Score: 4 resources, 0 damage │
│ [View LLM reasoning →] (opens LangSmith) │
│ │
├─────────────────────────────────────────────┤
│ LANGSMITH / FRAMEWORK TOOLS (LLM-side) │
│ │
│ Full prompt, model thinking, token usage │
│ (handled by framework, not us) │
└─────────────────────────────────────────────┘
Dependencies
- #74 (framework adapters) — Add framework adapter system for LangGraph, Claude Agent SDK, and other agent frameworks; defines how frameworks integrate
- #71 (tool completion callbacks) — Add tool completion callbacks from Godot to Python; needed for tool result logging
Estimated Effort
2-3 days
Context
Identified during strategic architecture discussion about framework integration. See docs/framework_integration_strategy.md for full context.