Refactor inspector into game-side debugger, delegate LLM tracing to frameworks #75

@justinmadison

Summary

With the framework adapter strategy (#74), Agent Arena no longer controls the LLM call pipeline. The current inspector (TraceStore/ReasoningTrace) tries to capture both game-side events and LLM internals. It should be refocused to capture only what frameworks can't see: the game world.

Current State

TraceStore captures the full pipeline:

  • Observation → Prompt built → LLM request → Raw response → Parsed decision
  • TraceStepName enum: observation, retrieved, prompt, llm_request, response, decision
  • Episode-based JSONL persistence
  • Web UI for viewing traces

Problem

When a framework (LangGraph, Claude SDK, etc.) owns the LLM call:

  • We don't have access to the exact prompt, raw response, or token counts
  • Frameworks have their own observability (LangSmith, Anthropic Console, etc.) that is more comprehensive than our inspector
  • Trying to intercept framework internals is fragile and framework-specific

Proposed Refactor

Keep (game-side debugging — only we can provide this)

  • Observation logging: what the agent perceived each tick (nearby resources, hazards, health)
  • Tool execution logging: which tools were called, with what params, success/failure, duration
  • Episode boundaries and game events (resource clusters, damage, crafting)
  • Outcome tracking: score, damage taken, resources collected, exploration %
  • Spatial context: what the agent remembered and from when (e.g., "agent remembered berries from 30 ticks ago")
  • Web UI for episode replay
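
As a rough illustration of the "keep" list, a per-tick game-side record might look like the sketch below. The names `TickRecord`, `ToolExecution`, and `to_jsonl_line`, and every field name, are hypothetical and not taken from the current TraceStore. The only thing assumed from this issue is that persistence stays episode-based JSONL.

```python
import json
from dataclasses import dataclass, field, asdict
from typing import Any


@dataclass
class ToolExecution:
    """One tool call as seen by the game (hypothetical shape)."""
    tool: str
    params: dict[str, Any]
    success: bool
    duration_ms: float


@dataclass
class TickRecord:
    """Everything the game side can observe for a single tick (hypothetical)."""
    episode_id: str
    tick: int
    observation: dict[str, Any]  # nearby resources, hazards, health
    tool_executions: list[ToolExecution] = field(default_factory=list)
    game_events: list[str] = field(default_factory=list)  # e.g. "damage"
    outcome: dict[str, float] = field(default_factory=dict)  # score, damage


def to_jsonl_line(record: TickRecord) -> str:
    """Serialize one tick as a single JSON object suitable for a JSONL file."""
    return json.dumps(asdict(record), sort_keys=True)


record = TickRecord(
    episode_id="ep-001",
    tick=47,
    observation={"berries": 3, "fire": 1, "health": 100},
    tool_executions=[
        ToolExecution("move_to", {"x": 10, "y": 0, "z": 5}, True, 12.5)
    ],
    outcome={"resources_collected": 3, "damage_taken": 0},
)
line = to_jsonl_line(record)
```

Nothing in this record requires access to the prompt or the raw model response, which is the point of the split.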

Remove (delegate to framework observability)

  • Prompt capture (TraceStepName.prompt)
  • LLM request/response logging (TraceStepName.llm_request, TraceStepName.response)
  • Token counting
  • Raw response parsing traces
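
One way to picture the removal is as a partition of the step enum. The sketch below assumes `TraceStepName` is a string-valued enum with the six members listed under Current State; the real definition may differ. `LLM_SIDE_STEPS`, `is_game_side`, and `filter_game_side` are hypothetical helpers, and the `"name"` key on trace entries is also an assumption. Only the three steps named above are treated as LLM-side; `retrieved` and `decision` are left in place because this issue does not list them for removal.

```python
from enum import Enum


class TraceStepName(str, Enum):
    """Step names as listed under Current State (actual code may differ)."""
    observation = "observation"
    retrieved = "retrieved"
    prompt = "prompt"
    llm_request = "llm_request"
    response = "response"
    decision = "decision"


# Steps this issue proposes delegating to framework observability.
LLM_SIDE_STEPS = frozenset({
    TraceStepName.prompt,
    TraceStepName.llm_request,
    TraceStepName.response,
})


def is_game_side(step: TraceStepName) -> bool:
    """True for steps the game-side inspector would still record."""
    return step not in LLM_SIDE_STEPS


def filter_game_side(entries: list[dict]) -> list[dict]:
    """Drop LLM-side entries; each entry is assumed to carry a 'name' key."""
    return [e for e in entries if is_game_side(TraceStepName(e["name"]))]


kept = [step.value for step in TraceStepName if is_game_side(step)]
```

A filter like `filter_game_side` could also be applied to existing JSONL episodes during migration, if old traces need to stay readable.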

Add

  • Framework trace linking: Store a reference (e.g., LangSmith run URL) per tick so users can click from game replay → LLM trace for that specific decision
  • Tool call/result pairs: Log both what the agent requested and what the game returned
  • Decision quality annotations: Was this a good/bad decision based on outcome?
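
The three additions could share one per-tick structure. The sketch below is only an assumption about shape: `TickTrace`, `ToolCallPair`, and `annotate_quality` are hypothetical names, the URL is a placeholder rather than a real LangSmith link format, and the quality rule is a toy heuristic standing in for whatever outcome-based rule is eventually chosen.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class ToolCallPair:
    """What the agent requested and what the game returned for one call."""
    tool: str
    requested: dict[str, Any]
    returned: dict[str, Any]


@dataclass
class TickTrace:
    """Game-side record for one tick plus a pointer to the framework trace."""
    tick: int
    tool_calls: list[ToolCallPair] = field(default_factory=list)
    framework_trace_url: Optional[str] = None  # opaque link, never parsed
    quality: Optional[str] = None  # "good", "bad", or None if unlabeled


def annotate_quality(
    trace: TickTrace, score_delta: float, damage_delta: float
) -> TickTrace:
    """Toy heuristic (assumption): damage means bad, score gain means good."""
    if damage_delta > 0:
        trace.quality = "bad"
    elif score_delta > 0:
        trace.quality = "good"
    else:
        trace.quality = None
    return trace


trace = TickTrace(
    tick=47,
    tool_calls=[
        ToolCallPair("move_to", {"target": [10, 0, 5]}, {"success": True})
    ],
    framework_trace_url="https://example.com/traces/run-123",  # placeholder
)
annotate_quality(trace, score_delta=1.0, damage_delta=0.0)
```

Storing the framework reference as an opaque string keeps the inspector independent of any one framework's trace format.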

User Experience

┌─────────────────────────────────────────────┐
│  AGENT ARENA INSPECTOR (game-side)          │
│                                             │
│  Tick 47: Saw {3 berries, 1 fire}           │
│  Tick 47: Agent called move_to(10, 0, 5)    │
│  Tick 52: Arrived, collected berry          │
│  Tick 52: Score: 4 resources, 0 damage      │
│  [View LLM reasoning →] (opens LangSmith)  │
│                                             │
├─────────────────────────────────────────────┤
│  LANGSMITH / FRAMEWORK TOOLS (LLM-side)     │
│                                             │
│  Full prompt, model thinking, token usage   │
│  (handled by framework, not us)             │
└─────────────────────────────────────────────┘
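
The game-side half of the mock-up could be produced from logged data with a few formatting helpers. This is a minimal sketch with hypothetical function names and argument shapes; the actual Web UI will presumably render structured data rather than plain strings.

```python
from typing import Optional, Sequence


def format_observation(tick: int, seen: dict[str, int]) -> str:
    """Render what the agent perceived, e.g. 'Tick 47: Saw {3 berries, 1 fire}'."""
    items = ", ".join(f"{count} {name}" for name, count in seen.items())
    return f"Tick {tick}: Saw {{{items}}}"


def format_tool_call(tick: int, tool: str, args: Sequence[object]) -> str:
    """Render a tool call with positional arguments."""
    arg_text = ", ".join(str(a) for a in args)
    return f"Tick {tick}: Agent called {tool}({arg_text})"


def format_trace_link(url: Optional[str]) -> Optional[str]:
    """Render the jump to the framework trace, or nothing if no link exists."""
    if url is None:
        return None
    return f"[View LLM reasoning] ({url})"


lines = [
    format_observation(47, {"berries": 3, "fire": 1}),
    format_tool_call(47, "move_to", (10, 0, 5)),
]
link = format_trace_link("https://example.com/traces/run-123")  # placeholder
if link is not None:
    lines.append(link)
```

When a framework provides no trace reference, the link line is simply omitted and the game-side timeline still stands on its own.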

Dependencies

Estimated Effort

2-3 days

Context

Identified during strategic architecture discussion about framework integration. See docs/framework_integration_strategy.md for full context.

Metadata

  • Assignees: No one assigned
  • Type: No type
  • Status: Backlog
  • Milestone: No milestone
  • Relationships: None yet
  • Development: No branches or pull requests