Refactor inspector into game-side debugger, delegate LLM tracing to frameworks #75

## Summary

With the framework adapter strategy (#74), Agent Arena no longer controls the LLM call pipeline. The current inspector (TraceStore/ReasoningTrace) tries to capture both game-side events and LLM internals. It should be refocused to capture only what frameworks cannot see: the game world.

## Current State

TraceStore captures the full pipeline:
- Observation → Prompt built → LLM request → Raw response → Parsed decision
- TraceStepName enum: observation, retrieved, prompt, llm_request, response, decision
- Episode-based JSONL persistence
- Web UI for viewing traces

## Problem

When a framework (LangGraph, Claude SDK, etc.) owns the LLM call:
- We don't have access to the exact prompt, raw response, or token counts
- Frameworks have their own observability (LangSmith, Anthropic Console, etc.) that is more comprehensive than our inspector
- Trying to intercept framework internals is fragile and framework-specific

## Proposed Refactor

### Keep (game-side debugging — only we can provide this)
- Observation logging: what the agent perceived each tick (nearby resources, hazards, health)
- Tool execution logging: which tools were called, with what params, success/failure, duration
- Episode boundaries and game events (resource clusters, damage, crafting)
- Outcome tracking: score, damage taken, resources collected, exploration %
- Spatial context: "agent remembered berries from 30 ticks ago"
- Web UI for episode replay
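
As a rough illustration of what a game-side record could look like, here is a minimal sketch of a per-tick entry that serializes to one JSONL line, in keeping with the existing episode-based JSONL persistence. The class names (`TickRecord`, `ToolExecution`) and every field name are assumptions made for this example, not the current TraceStore schema.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Any


@dataclass
class ToolExecution:
    """One tool invocation as seen by the game (hypothetical shape)."""
    tool: str
    params: dict[str, Any]
    success: bool
    duration_ms: float


@dataclass
class TickRecord:
    """Game-side record for a single tick (hypothetical shape)."""
    episode_id: str
    tick: int
    observation: dict[str, Any]          # what the agent perceived this tick
    tool_executions: list[ToolExecution] = field(default_factory=list)
    events: list[str] = field(default_factory=list)       # game events
    outcome: dict[str, float] = field(default_factory=dict)  # score, damage, ...

    def to_jsonl(self) -> str:
        """Serialize to a single JSON line for episode-based persistence."""
        return json.dumps(asdict(self), sort_keys=True)


record = TickRecord(
    episode_id="ep-001",
    tick=47,
    observation={"berries": 3, "fire": 1, "health": 100},
    tool_executions=[
        ToolExecution("move_to", {"x": 10, "y": 0, "z": 5}, True, 12.5)
    ],
    events=["resource_cluster_seen"],
    outcome={"resources_collected": 4, "damage_taken": 0},
)
line = record.to_jsonl()
```

Nothing in this record depends on the prompt or the model response, so it can be produced regardless of which framework owns the LLM call.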

### Remove (delegate to framework observability)
- Prompt capture (TraceStepName.prompt)
- LLM request/response logging (TraceStepName.llm_request, TraceStepName.response)
- Token counting
- Raw response parsing traces
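
If existing episode files need to be viewed after the refactor, one option is to drop the LLM-side steps on load. The sketch below filters on the three step names listed above. It assumes each stored step is a dict with a `"name"` key, which is a hypothetical layout and may not match the real JSONL files. The issue does not say what happens to `retrieved` and `decision`, so this sketch leaves them untouched.

```python
from typing import Any

# Step names slated for removal, taken from the TraceStepName values above.
LLM_SIDE_STEPS = frozenset({"prompt", "llm_request", "response"})


def strip_llm_steps(steps: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Return the steps that are not LLM-side, preserving order.

    Assumes each step is a dict with a "name" key (hypothetical layout).
    """
    return [step for step in steps if step.get("name") not in LLM_SIDE_STEPS]


legacy_steps = [
    {"name": "observation", "data": {"berries": 3}},
    {"name": "prompt", "data": "..."},
    {"name": "llm_request", "data": "..."},
    {"name": "response", "data": "..."},
    {"name": "decision", "data": {"tool": "move_to"}},
]
kept = strip_llm_steps(legacy_steps)
```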

### Add
- **Framework trace linking**: Store a reference (e.g., LangSmith run URL) per tick so users can click from game replay → LLM trace for that specific decision
- **Tool call/result pairs**: Log both what the agent requested and what the game returned
- **Decision quality annotations**: Was this a good/bad decision based on outcome?
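
The three additions could live on one per-decision record, sketched below. All names here (`ToolCallPair`, `DecisionRecord`, `annotate`) are hypothetical, the trace URL is a placeholder string that a framework adapter would supply, and the good/bad rule is only one possible heuristic chosen for illustration.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class ToolCallPair:
    """What the agent requested and what the game returned (hypothetical)."""
    requested: dict[str, Any]
    returned: dict[str, Any]


@dataclass
class DecisionRecord:
    """One decision at one tick, with an optional link to the framework trace."""
    tick: int
    tool_call: ToolCallPair
    trace_url: Optional[str] = None    # opaque reference supplied by the adapter
    annotation: Optional[str] = None   # "good", "bad", or "neutral"


def annotate(record: DecisionRecord, score_delta: float,
             damage_delta: float) -> DecisionRecord:
    """Label a decision from its outcome (assumed heuristic, not a spec).

    Any damage marks the decision "bad"; otherwise a score gain marks it
    "good"; everything else is "neutral".
    """
    if damage_delta > 0:
        record.annotation = "bad"
    elif score_delta > 0:
        record.annotation = "good"
    else:
        record.annotation = "neutral"
    return record


decision = DecisionRecord(
    tick=47,
    tool_call=ToolCallPair(
        requested={"tool": "move_to", "args": [10, 0, 5]},
        returned={"status": "arrived", "tick": 52},
    ),
    trace_url="https://example.invalid/runs/abc123",  # placeholder, not a real run
)
annotate(decision, score_delta=1, damage_delta=0)
```

Storing the link as an opaque string keeps the inspector independent of any particular framework's observability API. The replay UI only needs to render it as a hyperlink.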

## User Experience

```
┌─────────────────────────────────────────────┐
│  AGENT ARENA INSPECTOR (game-side)          │
│                                             │
│  Tick 47: Saw {3 berries, 1 fire}           │
│  Tick 47: Agent called move_to(10, 0, 5)    │
│  Tick 52: Arrived, collected berry          │
│  Tick 52: Score: 4 resources, 0 damage      │
│  [View LLM reasoning →] (opens LangSmith)  │
│                                             │
├─────────────────────────────────────────────┤
│  LANGSMITH / FRAMEWORK TOOLS (LLM-side)     │
│                                             │
│  Full prompt, model thinking, token usage   │
│  (handled by framework, not us)             │
└─────────────────────────────────────────────┘
```

## Dependencies
- #74 (framework adapters) — defines how frameworks integrate
- #71 (tool completion callbacks) — needed for tool result logging

## Estimated Effort
2-3 days

## Context
Identified during strategic architecture discussion about framework integration. See `docs/framework_integration_strategy.md` for full context.
