
Implement persistent cross-episode memory for agent learning across runs #76

@justinmadison

Description

Summary

Implement persistent cross-episode memory so agents learn and improve from run to run. This is one of Agent Arena's most distinctive features -- no agent framework provides game-episode-aware persistent learning out of the box.

An agent that dies to fire in Episode 1 should avoid that area in Episode 2. An agent that discovers a berry cluster should go there first next time. An agent that learns "craft torch before exploring" should do that automatically in future episodes.

Why This Matters

Cross-episode learning is:

  • Our killer feature: No framework (LangGraph, CrewAI, etc.) provides this. It is uniquely valuable in a simulation environment.
  • The most compelling AI learning opportunity: Memory management, pattern recognition, strategy evaluation, and reward attribution are core agent challenges.
  • Visually rewarding: Watching an agent visibly improve across episodes is powerful for learners.
  • Framework-compatible: Persistent memory is exposed as query tools -- frameworks do not need to know about persistence.

Architecture

How It Fits the Three-Category Model

Cross-episode memory surfaces through query tools. The framework does not need to know or care that results come from a persistent store:

CONTEXT (this tick observation -- current episode only)
  "You see 2 berries nearby, health=100, explored 0%"

QUERY TOOLS (now include persistent data from past episodes)
  get_episode_summary(count=3)     -> "Last run: berry cluster NE, died to fire at center"
  query_spatial_memory(pos, radius) -> includes locations from previous episodes with confidence
  get_strategy_notes()              -> "Craft torch before exploring (confirmed 2x)"
  recall_location("workbench")      -> "(5, 0, 8) -- seen in 3 previous episodes"

ACTION TOOLS (this tick decision -- unchanged)
  move_to(...)  |  collect(...)  |  craft_item(...)  |  explore(...)
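The key property of this layering is that query tools transparently merge current-episode observations with the persistent store. A minimal sketch of that merge (the function and field names here are illustrative assumptions, not the existing codebase):

```python
# Sketch: a query tool that merges this episode's sightings with the
# persistent cross-episode store, flagging where each result came from.

def query_spatial_memory(current, persistent, pos, radius):
    """Return nearby objects; current-episode sightings get full
    confidence, persistent ones carry their stored confidence score."""
    results = []
    for obj in current:  # observed this episode
        if _dist(obj["pos"], pos) <= radius:
            results.append({**obj, "source": "current", "confidence": 1.0})
    for obj in persistent:  # remembered from past episodes
        if _dist(obj["pos"], pos) <= radius:
            results.append({**obj, "source": "persistent"})
    return results

def _dist(a, b):
    # Euclidean distance between two (x, y, z) positions.
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
```

The framework calling this tool sees one flat result list; whether an entry came from tick 12 of this run or episode 3 of last week is just metadata.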

Per-Episode Lifecycle

Episode starts
    |
    v
Agent plays (ticks 1..N)
  - Current observations via context
  - Query tools return current + persistent memory
  - SpatialMemory tracks object locations this episode
  - EpisodeMemoryManager tracks key events this episode
    |
    v
Episode ends (agent dies, time runs out, objective met)
    |
    v
Post-Episode Processing
  1. Generate episode summary (score, key events, decisions)
  2. Merge spatial knowledge into persistent store (with confidence)
  3. Evaluate strategies (which decisions correlated with good outcomes?)
  4. Save to persistent store
    |
    v
Next episode starts with richer memory

Persistent Store

Simple file-based storage per agent (no database needed at this scale):

persistent_memory/
  agent_001/
    episodes.json        # Episode summaries with scores and key events
    spatial_knowledge.json  # Aggregated object locations with confidence scores
    strategies.json      # Learned strategies with confirmation counts
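A sketch of what the PersistentMemoryStore planned in Phase 1 could look like for this layout -- the implementation details here are a guess, only the three-file-per-agent shape comes from the layout above:

```python
import json
from pathlib import Path

class PersistentMemoryStore:
    """Per-agent file store matching the directory layout above.
    One JSON file per knowledge type; no database needed at this scale."""

    FILES = ("episodes", "spatial_knowledge", "strategies")

    def __init__(self, agent_id, root="persistent_memory"):
        self.dir = Path(root) / agent_id
        self.dir.mkdir(parents=True, exist_ok=True)

    def load(self, name):
        # Missing file means "nothing learned yet", not an error.
        path = self.dir / f"{name}.json"
        return json.loads(path.read_text()) if path.exists() else {}

    def save(self, name, data):
        assert name in self.FILES, f"unknown store: {name}"
        (self.dir / f"{name}.json").write_text(json.dumps(data, indent=2))
```

Because each file is plain JSON, a learner can open `episodes.json` in any editor and inspect exactly what the agent "remembers".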

Three Layers of Learning (Implementation Phases)

Layer 1: Factual Memory (what happened)

Store raw observations and events from past episodes.

  • Episode summaries: "Collected 4 berries, took 15 damage, score 62"
  • Object locations: "Berries seen at (10,0,5), (12,0,3) in Episode 3"
  • Events: "Took fire damage at (5,0,5) on tick 23"

AI concepts learned: Memory storage, retrieval, recency weighting

Layer 2: Pattern Recognition (what recurs)

Aggregate facts across episodes to identify reliable patterns.

  • "Berries consistently spawn in NE quadrant (seen 4/5 episodes, high confidence)"
  • "Fire always appears near center (seen 3/5 episodes)"
  • Confidence scores decay if patterns are not confirmed in recent episodes

AI concepts learned: Statistical aggregation, confidence scoring, belief updating
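One possible update rule for these confidence scores -- the exact constants are assumptions; any monotone confirm/decay scheme would do:

```python
def update_confidence(pattern, confirmed_this_episode,
                      boost=0.2, decay=0.1):
    """Belief-update sketch: confirmations push confidence toward 1.0,
    unconfirmed episodes decay it toward 0.0."""
    c = pattern.get("confidence", 0.0)
    if confirmed_this_episode:
        c = min(1.0, c + boost)
        pattern["episodes_confirmed"] = pattern.get("episodes_confirmed", 0) + 1
    else:
        c = max(0.0, c - decay)  # pattern not seen: confidence decays
    pattern["confidence"] = c
    return pattern
```

With `boost > decay`, a pattern confirmed in most episodes converges to high confidence, while a one-off observation fades within a few runs.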

Layer 3: Strategy Learning (what works)

Track which decisions led to good outcomes across episodes.

  • "Episodes where I crafted torch first scored 40% higher on average"
  • "Exploring before collecting leads to better resource discovery"
  • "Approaching fire from the east is safer (less damage taken)"

AI concepts learned: Reward attribution, strategy evaluation, counterfactual reasoning
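The simplest form of this reward attribution is comparing mean episode scores with and without a given decision. A sketch (function name and output fields are illustrative; note this measures correlation, not causation):

```python
from statistics import mean

def evaluate_strategy(scores_with, scores_without):
    """Compare mean episode score when a decision was taken vs. not.
    Returns None until there is evidence on both sides."""
    if not scores_with or not scores_without:
        return None  # not enough episodes to compare yet
    lift = mean(scores_with) / mean(scores_without) - 1.0
    return {"episodes_with": len(scores_with),
            "episodes_without": len(scores_without),
            "score_lift": round(lift, 2)}
```

A real system would want more episodes and some control for confounders before trusting a lift number, which is exactly the counterfactual-reasoning lesson this layer teaches.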

Existing Code to Build On

  • EpisodeMemoryManager -- exists (tracks within-episode events, detects boundaries). Needed: add persistence to disk between episodes.
  • SpatialMemory -- exists (grid-based index with staleness tracking). Needed: add cross-episode merge with confidence decay.
  • Episode summaries -- exist in EpisodeMemoryManager. Needed: add post-episode summary generation and storage.
  • Strategy notes -- do not exist. New: track decision-outcome correlations.
  • Persistent store -- does not exist. New: JSON file per agent that survives restarts.
  • Confidence scoring -- does not exist. New: track how many episodes confirm a pattern.

Implementation Plan

Phase 1: Persistence Layer (2 days)

  • Create PersistentMemoryStore class (JSON file read/write per agent)
  • Add episode boundary hooks: on_episode_start() loads, on_episode_end() saves
  • Store episode summaries with score, key events, tick count, timestamp
  • Expose get_episode_summary(count=N) as query tool returning past N episodes

Phase 2: Spatial Persistence (2 days)

  • On episode end, merge current SpatialMemory into persistent spatial store
  • Add confidence scores: objects seen in multiple episodes get higher confidence
  • Add recency decay: old observations lose confidence over time
  • query_spatial_memory() returns both current and persistent results (flagged appropriately)
  • recall_location() searches persistent store for named object types
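A sketch of the `recall_location()` tool over the persistent spatial store, assuming entries keyed by object type and position with confidence scores (the data shape is an assumption consistent with Phase 2, not existing code):

```python
def recall_location(spatial, object_type, min_confidence=0.3):
    """Search the persistent spatial store for the highest-confidence
    sighting of a named object type; None if nothing credible is known."""
    candidates = [e for e in spatial.values()
                  if e["type"] == object_type
                  and e.get("confidence", 0.0) >= min_confidence]
    if not candidates:
        return None
    best = max(candidates, key=lambda e: e["confidence"])
    return {"pos": best["pos"],
            "episodes_seen": best.get("episodes_seen", 1),
            "confidence": best["confidence"]}
```

The `min_confidence` threshold is where the "when to trust old data" lesson from Phase 4's tutorial becomes concrete.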

Phase 3: Strategy Learning with RAG (3-4 days)

Phases 1 and 2 use structured JSON storage because the data is well-defined (episode summaries, object coordinates, scores). Phase 3 is where RAG becomes the right tool -- strategy insights are unstructured, natural language observations that accumulate over many episodes:

  • "When I explored north before crafting, I found more resources but took more damage"
  • "Crafting a torch helped in episodes with fire but was wasted effort in episodes without"
  • "Approaching the berry cluster from the east avoided the fire hazard"

These do not fit neatly into JSON. A question like "how do I craft a torch?" or "what should I do when health is low near fire?" needs semantic search across many past experiences.

Implementation approach: Ingest tick-by-tick experience logs and strategy notes into a RAG store (we already have Milvus + nomic-embed-text on the Ubuntu server). Expose via a query tool:

ask_experience("how do I craft a torch?")
  -> "Need 2 wood. Must be near a workbench. Crafted successfully in Ep3 tick 45.
      Tip: gather wood first, workbench is usually near center of map."

ask_experience("what should I do when health is low?")
  -> "Avoid fire (center area). Move to NE corner where berries are safe.
      In Ep2, fleeing north when health dropped below 50 prevented death."

  • Log tick-by-tick experience data (decisions, outcomes, context) during episodes
  • Post-episode: ingest experience logs into RAG store (Milvus via existing infrastructure)
  • Expose ask_experience(question) as query tool -- semantic search over past experiences
  • Track key decisions per episode (first action, crafting order, exploration strategy)
  • Correlate decisions with episode outcomes (score, damage, resources, survival time)
  • Generate strategy notes: "When I did X, outcome was Y (N episodes)"
  • Expose get_strategy_notes() as query tool (structured) alongside ask_experience() (RAG)
  • Add confirmation counting: strategies confirmed across episodes gain weight

Note: Phases 1-2 (structured JSON) should be completed first. They solve 80% of cross-episode learning with 20% of the effort. RAG in Phase 3 handles the long tail of unstructured experiential knowledge that grows over many episodes.
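The shape of the `ask_experience()` flow can be sketched with a toy in-memory index -- bag-of-words cosine similarity stands in for the real nomic-embed-text embeddings and Milvus search, so the API shape is the point here, not retrieval quality:

```python
import math
from collections import Counter

class ExperienceIndex:
    """Toy stand-in for the ask_experience() RAG pipeline: ingest
    free-text experience notes, retrieve the most similar ones."""

    def __init__(self):
        self.docs = []  # (term counts, original text)

    def ingest(self, text):
        self.docs.append((Counter(text.lower().split()), text))

    def ask(self, question, top_k=1):
        q = Counter(question.lower().split())

        def cosine(a, b):
            dot = sum(a[t] * b[t] for t in a)
            na = math.sqrt(sum(v * v for v in a.values()))
            nb = math.sqrt(sum(v * v for v in b.values()))
            return dot / (na * nb) if na and nb else 0.0

        ranked = sorted(self.docs, key=lambda d: cosine(q, d[0]), reverse=True)
        return [text for _, text in ranked[:top_k]]
```

In the real implementation, `ingest()` would embed and write to Milvus post-episode, and `ask()` would be the query tool the framework calls.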

Phase 4: Integration with Framework Starters (1-2 days)

  • LangGraph starter demonstrates cross-episode learning
  • Tutorial explains: memory management, when to trust old data, confidence thresholds
  • Show measurable improvement: graph score across episodes

Example: Agent Improving Over 5 Episodes

Episode 1 (Score: 25)
  - Wanders randomly, finds 2 berries, walks into fire, dies
  - Saves: fire location, berry locations, "died to fire" event

Episode 2 (Score: 45)
  - Remembers fire location, avoids center
  - Finds berry cluster in NE corner
  - Saves: NE berry cluster (confidence: 1 episode), fire avoidance strategy

Episode 3 (Score: 62)
  - Goes directly to NE corner (remembered from Ep2)
  - Berry cluster confirmed (confidence: 2 episodes)
  - Discovers workbench, crafts torch
  - Saves: torch crafting improved exploration

Episode 4 (Score: 78)
  - Crafts torch first (strategy from Ep3)
  - Heads to NE corner (high confidence)
  - Explores more safely with torch
  - Strategy confirmed: "torch first" now has 2 confirmations

Episode 5 (Score: 91)
  - Optimal opening: craft torch -> head NE -> collect berries -> explore south
  - All strategies have high confidence
  - Agent is now consistently performing well

Dependencies

Success Criteria

  • Agent measurably improves over 5+ episodes on foraging scenario
  • Persistent memory survives process restarts
  • Query tools return cross-episode data transparently
  • At least one starter demonstrates and teaches cross-episode learning
  • Learner can inspect what the agent "learned" via memory files or inspector
