
Implement persistent cross-episode memory for agent learning across runs #76

@justinmadison

Description

Summary

Implement persistent cross-episode memory so agents learn and improve from run to run. This is one of Agent Arena's most distinctive features -- no agent framework provides game-episode-aware persistent learning out of the box.

An agent that dies to fire in Episode 1 should avoid that area in Episode 2. An agent that discovers a berry cluster should go there first next time. An agent that learns "craft torch before exploring" should do that automatically in future episodes.

Why This Matters

Cross-episode learning is:

  • Our killer feature: No framework (LangGraph, CrewAI, etc.) provides this. It is uniquely valuable in a simulation environment.
  • The most compelling AI learning opportunity: Memory management, pattern recognition, strategy evaluation, and reward attribution are core agent challenges.
  • Visually rewarding: Watching an agent visibly improve across episodes is powerful for learners.
  • Framework-compatible: Persistent memory is exposed as query tools -- frameworks do not need to know about persistence.

Architecture

How It Fits the Three-Category Model

Cross-episode memory surfaces through query tools. The framework does not need to know or care that results come from a persistent store:

CONTEXT (this tick observation -- current episode only)
  "You see 2 berries nearby, health=100, explored 0%"

QUERY TOOLS (now include persistent data from past episodes)
  get_episode_summary(count=3)     -> "Last run: berry cluster NE, died to fire at center"
  query_spatial_memory(pos, radius) -> includes locations from previous episodes with confidence
  get_strategy_notes()              -> "Craft torch before exploring (confirmed 2x)"
  recall_location("workbench")      -> "(5, 0, 8) -- seen in 3 previous episodes"

ACTION TOOLS (this tick decision -- unchanged)
  move_to(...)  |  collect(...)  |  craft_item(...)  |  explore(...)
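The key property of this layering is that query tools transparently merge current-episode observations with the persistent store. A minimal sketch of that merge (the function and field names here are illustrative assumptions, not the existing codebase):

```python
# Sketch: a query tool that merges this episode's sightings with the
# persistent cross-episode store, flagging where each result came from.

def query_spatial_memory(current, persistent, pos, radius):
    """Return nearby objects; current-episode sightings get full
    confidence, persistent ones carry their stored confidence score."""
    results = []
    for obj in current:  # observed this episode
        if _dist(obj["pos"], pos) <= radius:
            results.append({**obj, "source": "current", "confidence": 1.0})
    for obj in persistent:  # remembered from past episodes
        if _dist(obj["pos"], pos) <= radius:
            results.append({**obj, "source": "persistent"})
    return results

def _dist(a, b):
    # Euclidean distance between two (x, y, z) positions.
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
```

The framework calling this tool sees one flat result list; whether an entry came from tick 12 of this run or episode 3 of last week is just metadata.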

Per-Episode Lifecycle

Episode starts
    |
    v
Agent plays (ticks 1..N)
  - Current observations via context
  - Query tools return current + persistent memory
  - SpatialMemory tracks object locations this episode
  - EpisodeMemoryManager tracks key events this episode
    |
    v
Episode ends (agent dies, time runs out, objective met)
    |
    v
Post-Episode Processing
  1. Generate episode summary (score, key events, decisions)
  2. Merge spatial knowledge into persistent store (with confidence)
  3. Evaluate strategies (which decisions correlated with good outcomes?)
  4. Save to persistent store
    |
    v
Next episode starts with richer memory

Persistent Store

Simple file-based storage per agent (no database needed at this scale):

persistent_memory/
  agent_001/
    episodes.json        # Episode summaries with scores and key events
    spatial_knowledge.json  # Aggregated object locations with confidence scores
    strategies.json      # Learned strategies with confirmation counts
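A sketch of what the PersistentMemoryStore planned in Phase 1 could look like for this layout -- the implementation details here are a guess, only the three-file-per-agent shape comes from the layout above:

```python
import json
from pathlib import Path

class PersistentMemoryStore:
    """Per-agent file store matching the directory layout above.
    One JSON file per knowledge type; no database needed at this scale."""

    FILES = ("episodes", "spatial_knowledge", "strategies")

    def __init__(self, agent_id, root="persistent_memory"):
        self.dir = Path(root) / agent_id
        self.dir.mkdir(parents=True, exist_ok=True)

    def load(self, name):
        # Missing file means "nothing learned yet", not an error.
        path = self.dir / f"{name}.json"
        return json.loads(path.read_text()) if path.exists() else {}

    def save(self, name, data):
        assert name in self.FILES, f"unknown store: {name}"
        (self.dir / f"{name}.json").write_text(json.dumps(data, indent=2))
```

Because each file is plain JSON, a learner can open `episodes.json` in any editor and inspect exactly what the agent "remembers".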

Three Layers of Learning (Implementation Phases)

Layer 1: Factual Memory (what happened)

Store raw observations and events from past episodes.

  • Episode summaries: "Collected 4 berries, took 15 damage, score 62"
  • Object locations: "Berries seen at (10,0,5), (12,0,3) in Episode 3"
  • Events: "Took fire damage at (5,0,5) on tick 23"

AI concepts learned: Memory storage, retrieval, recency weighting

Layer 2: Pattern Recognition (what recurs)

Aggregate facts across episodes to identify reliable patterns.

  • "Berries consistently spawn in NE quadrant (seen 4/5 episodes, high confidence)"
  • "Fire always appears near center (seen 3/5 episodes)"
  • Confidence scores decay if patterns are not confirmed in recent episodes

AI concepts learned: Statistical aggregation, confidence scoring, belief updating
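One possible update rule for these confidence scores -- the exact constants are assumptions; any monotone confirm/decay scheme would do:

```python
def update_confidence(pattern, confirmed_this_episode,
                      boost=0.2, decay=0.1):
    """Belief-update sketch: confirmations push confidence toward 1.0,
    unconfirmed episodes decay it toward 0.0."""
    c = pattern.get("confidence", 0.0)
    if confirmed_this_episode:
        c = min(1.0, c + boost)
        pattern["episodes_confirmed"] = pattern.get("episodes_confirmed", 0) + 1
    else:
        c = max(0.0, c - decay)  # pattern not seen: confidence decays
    pattern["confidence"] = c
    return pattern
```

With `boost > decay`, a pattern confirmed in most episodes converges to high confidence, while a one-off observation fades within a few runs.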

Layer 3: Strategy Learning (what works)

Track which decisions led to good outcomes across episodes.

  • "Episodes where I crafted torch first scored 40% higher on average"
  • "Exploring before collecting leads to better resource discovery"
  • "Approaching fire from the east is safer (less damage taken)"

AI concepts learned: Reward attribution, strategy evaluation, counterfactual reasoning
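The simplest form of this reward attribution is comparing mean episode scores with and without a given decision. A sketch (function name and output fields are illustrative; note this measures correlation, not causation):

```python
from statistics import mean

def evaluate_strategy(scores_with, scores_without):
    """Compare mean episode score when a decision was taken vs. not.
    Returns None until there is evidence on both sides."""
    if not scores_with or not scores_without:
        return None  # not enough episodes to compare yet
    lift = mean(scores_with) / mean(scores_without) - 1.0
    return {"episodes_with": len(scores_with),
            "episodes_without": len(scores_without),
            "score_lift": round(lift, 2)}
```

A real system would want more episodes and some control for confounders before trusting a lift number, which is exactly the counterfactual-reasoning lesson this layer teaches.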

Existing Code to Build On

  • EpisodeMemoryManager -- exists (tracks within-episode events, detects boundaries). Needed: add persistence to disk between episodes.
  • SpatialMemory -- exists (grid-based index with staleness tracking). Needed: add cross-episode merge with confidence decay.
  • Episode summaries -- exist in EpisodeMemoryManager. Needed: add post-episode summary generation and storage.
  • Strategy notes -- do not exist. New: track decision-outcome correlations.
  • Persistent store -- does not exist. New: JSON file per agent that survives restarts.
  • Confidence scoring -- does not exist. New: track how many episodes confirm a pattern.

Implementation Plan

Phase 1: Persistence Layer (2 days)

  • Create PersistentMemoryStore class (JSON file read/write per agent)
  • Add episode boundary hooks: on_episode_start() loads, on_episode_end() saves
  • Store episode summaries with score, key events, tick count, timestamp
  • Expose get_episode_summary(count=N) as query tool returning past N episodes

Phase 2: Spatial Persistence (2 days)

  • On episode end, merge current SpatialMemory into persistent spatial store
  • Add confidence scores: objects seen in multiple episodes get higher confidence
  • Add recency decay: old observations lose confidence over time
  • query_spatial_memory() returns both current and persistent results (flagged appropriately)
  • recall_location() searches persistent store for named object types
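A sketch of the `recall_location()` tool over the persistent spatial store, assuming entries keyed by object type and position with confidence scores (the data shape is an assumption consistent with Phase 2, not existing code):

```python
def recall_location(spatial, object_type, min_confidence=0.3):
    """Search the persistent spatial store for the highest-confidence
    sighting of a named object type; None if nothing credible is known."""
    candidates = [e for e in spatial.values()
                  if e["type"] == object_type
                  and e.get("confidence", 0.0) >= min_confidence]
    if not candidates:
        return None
    best = max(candidates, key=lambda e: e["confidence"])
    return {"pos": best["pos"],
            "episodes_seen": best.get("episodes_seen", 1),
            "confidence": best["confidence"]}
```

The `min_confidence` threshold is where the "when to trust old data" lesson from Phase 4's tutorial becomes concrete.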

Phase 3: Strategy Learning with RAG (3-4 days)

Phases 1 and 2 use structured JSON storage because the data is well-defined (episode summaries, object coordinates, scores). Phase 3 is where RAG becomes the right tool -- strategy insights are unstructured, natural language observations that accumulate over many episodes:

  • "When I explored north before crafting, I found more resources but took more damage"
  • "Crafting a torch helped in episodes with fire but was wasted effort in episodes without"
  • "Approaching the berry cluster from the east avoided the fire hazard"

These do not fit neatly into JSON. A question like "how do I craft a torch?" or "what should I do when health is low near fire?" needs semantic search across many past experiences.

Implementation approach: Ingest tick-by-tick experience logs and strategy notes into a RAG store (we already have Milvus + nomic-embed-text on the Ubuntu server). Expose via a query tool:

ask_experience("how do I craft a torch?")
  -> "Need 2 wood. Must be near a workbench. Crafted successfully in Ep3 tick 45.
      Tip: gather wood first, workbench is usually near center of map."

ask_experience("what should I do when health is low?")
  -> "Avoid fire (center area). Move to NE corner where berries are safe.
      In Ep2, fleeing north when health dropped below 50 prevented death."

  • Log tick-by-tick experience data (decisions, outcomes, context) during episodes
  • Post-episode: ingest experience logs into RAG store (Milvus via existing infrastructure)
  • Expose ask_experience(question) as query tool -- semantic search over past experiences
  • Track key decisions per episode (first action, crafting order, exploration strategy)
  • Correlate decisions with episode outcomes (score, damage, resources, survival time)
  • Generate strategy notes: "When I did X, outcome was Y (N episodes)"
  • Expose get_strategy_notes() as query tool (structured) alongside ask_experience() (RAG)
  • Add confirmation counting: strategies confirmed across episodes gain weight

Note: Phases 1-2 (structured JSON) should be completed first. They solve 80% of cross-episode learning with 20% of the effort. RAG in Phase 3 handles the long tail of unstructured experiential knowledge that grows over many episodes.
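The shape of the `ask_experience()` flow can be sketched with a toy in-memory index -- bag-of-words cosine similarity stands in for the real nomic-embed-text embeddings and Milvus search, so the API shape is the point here, not retrieval quality:

```python
import math
from collections import Counter

class ExperienceIndex:
    """Toy stand-in for the ask_experience() RAG pipeline: ingest
    free-text experience notes, retrieve the most similar ones."""

    def __init__(self):
        self.docs = []  # (term counts, original text)

    def ingest(self, text):
        self.docs.append((Counter(text.lower().split()), text))

    def ask(self, question, top_k=1):
        q = Counter(question.lower().split())

        def cosine(a, b):
            dot = sum(a[t] * b[t] for t in a)
            na = math.sqrt(sum(v * v for v in a.values()))
            nb = math.sqrt(sum(v * v for v in b.values()))
            return dot / (na * nb) if na and nb else 0.0

        ranked = sorted(self.docs, key=lambda d: cosine(q, d[0]), reverse=True)
        return [text for _, text in ranked[:top_k]]
```

In the real implementation, `ingest()` would embed and write to Milvus post-episode, and `ask()` would be the query tool the framework calls.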

Phase 4: Integration with Framework Starters (1-2 days)

  • LangGraph starter demonstrates cross-episode learning
  • Tutorial explains: memory management, when to trust old data, confidence thresholds
  • Show measurable improvement: graph score across episodes

Example: Agent Improving Over 5 Episodes

Episode 1 (Score: 25)
  - Wanders randomly, finds 2 berries, walks into fire, dies
  - Saves: fire location, berry locations, "died to fire" event

Episode 2 (Score: 45)
  - Remembers fire location, avoids center
  - Finds berry cluster in NE corner
  - Saves: NE berry cluster (confidence: 1 episode), fire avoidance strategy

Episode 3 (Score: 62)
  - Goes directly to NE corner (remembered from Ep2)
  - Berry cluster confirmed (confidence: 2 episodes)
  - Discovers workbench, crafts torch
  - Saves: torch crafting improved exploration

Episode 4 (Score: 78)
  - Crafts torch first (strategy from Ep3)
  - Heads to NE corner (high confidence)
  - Explores more safely with torch
  - Strategy confirmed: "torch first" now has 2 confirmations

Episode 5 (Score: 91)
  - Optimal opening: craft torch -> head NE -> collect berries -> explore south
  - All strategies have high confidence
  - Agent is now consistently performing well

Dependencies

Success Criteria

  • Agent measurably improves over 5+ episodes on foraging scenario
  • Persistent memory survives process restarts
  • Query tools return cross-episode data transparently
  • At least one starter demonstrates and teaches cross-episode learning
  • Learner can inspect what the agent "learned" via memory files or inspector
