diff --git a/.planning/PROJECT.md b/.planning/PROJECT.md index 7eaf291..c996882 100644 --- a/.planning/PROJECT.md +++ b/.planning/PROJECT.md @@ -2,8 +2,21 @@ ## Current State -**Version:** v2.5 (Shipped 2026-03-10) -**Status:** Production-ready with semantic dedup, stale filtering, 5-CLI E2E test harness, and full adapter coverage +**Version:** v2.6 (In Progress) +**Status:** Building retrieval quality, lifecycle automation, and episodic memory + +## Current Milestone: v2.6 Retrieval Quality, Lifecycle & Episodic Memory + +**Goal:** Complete hybrid search, add ranking intelligence, automate index lifecycle, expose operational metrics, and enable the system to learn from past task outcomes. + +**Target features:** +- Complete BM25 hybrid search wiring (currently hardcoded `false`) +- Salience scoring at write time + usage-based decay in retrieval ranking +- Automated vector pruning and BM25 lifecycle policies via scheduler +- Admin observability RPCs for dedup/ranking metrics +- Episodic memory — record task outcomes, search similar past episodes, value-based retention + +**Previous version:** v2.5 (Shipped 2026-03-10) — semantic dedup, stale filtering, 5-CLI E2E test harness The system implements a complete 6-layer cognitive stack with control plane, multi-agent support, semantic dedup, retrieval quality filtering, and comprehensive testing: - Layer 0: Raw Events (RocksDB) — agent-tagged, dedup-aware (store-and-skip-outbox) @@ -209,12 +222,37 @@ Agent Memory implements a layered cognitive architecture: - [x] Configurable staleness parameters via config.toml — v2.5 - [x] 10 E2E tests proving dedup, stale filtering, and fail-open — v2.5 -### Active +### Active (v2.6) + +**Hybrid Search** +- [ ] BM25 wired into hybrid search handler and retrieval routing + +**Ranking Quality** +- [ ] Salience scoring at write time (TOC nodes, Grips) +- [ ] Usage-based decay in retrieval ranking (access_count tracking) + +**Lifecycle Automation** +- [ ] Vector index pruning via scheduler job +- [ ] BM25 lifecycle policy with level-filtered rebuild + +**Observability** +- [ ] Admin RPCs for dedup metrics (buffer_size, events skipped) +- [ ] Ranking metrics exposure (salience distribution, usage stats) +- [ ] `deduplicated` field in IngestEventResponse + +**Episodic Memory** +- [ ] Episode schema and RocksDB storage (CF_EPISODES) +- [ ] gRPC RPCs (StartEpisode, RecordAction, CompleteEpisode, GetSimilarEpisodes) +- [ ] Value-based retention (outcome score sweet spot) +- [ ] Retrieval integration for similar episode search + +### Deferred / Future -**Deferred / Future** - Cross-project unified memory -- Admin dedup dashboard (events skipped, threshold hits, buffer utilization) - Per-agent dedup scoping +- Consolidation hook (extract durable knowledge from events, needs NLP/LLM) +- True daemonization (double-fork on Unix) +- API-based summarizer wiring (OpenAI/Anthropic) ### Out of Scope @@ -314,4 +352,4 @@ CLI client and agent skill query the daemon. Agent receives TOC navigation tools | std::sync::RwLock for InFlightBuffer | Operations are sub-microsecond; tokio RwLock overhead unnecessary | ✓ Validated v2.5 | --- -*Last updated: 2026-03-10 after v2.5 milestone* +*Last updated: 2026-03-10 after v2.6 milestone start* diff --git a/.planning/REQUIREMENTS.md b/.planning/REQUIREMENTS.md new file mode 100644 index 0000000..068aad3 --- /dev/null +++ b/.planning/REQUIREMENTS.md @@ -0,0 +1,152 @@ +# Requirements: Agent Memory v2.6 + +**Defined:** 2026-03-10 +**Core Value:** Agent can answer "what were we talking about last week?" without scanning everything + +## v2.6 Requirements + +Requirements for Retrieval Quality, Lifecycle & Episodic Memory milestone. Each maps to roadmap phases. + +### Hybrid Search + +- [ ] **HYBRID-01**: BM25 wired into HybridSearchHandler (currently hardcoded `bm25_available() = false`) +- [ ] **HYBRID-02**: Hybrid search returns combined BM25 + vector results via RRF score fusion +- [ ] **HYBRID-03**: BM25 fallback enabled in retrieval routing when vector index unavailable +- [ ] **HYBRID-04**: E2E test verifies hybrid search returns results from both BM25 and vector layers + +### Ranking + +- [ ] **RANK-01**: Salience score calculated at write time on TOC nodes (length_density + kind_boost + pinned_boost) +- [ ] **RANK-02**: Salience score calculated at write time on Grips +- [ ] **RANK-03**: `is_pinned` field added to TocNode and Grip (default false) +- [ ] **RANK-04**: Usage tracking: `access_count` and `last_accessed` updated on retrieval hits +- [ ] **RANK-05**: Usage-based decay penalty applied in retrieval ranking (1.0 / (1.0 + 0.15 * access_count)) +- [ ] **RANK-06**: Combined ranking formula: similarity * salience_factor * usage_penalty +- [ ] **RANK-07**: Ranking composites with existing StaleFilter (score floor at 50% to prevent collapse) +- [ ] **RANK-08**: Salience and usage_decay configurable via config.toml sections +- [ ] **RANK-09**: E2E test: pinned/high-salience items rank higher than low-salience items +- [ ] **RANK-10**: E2E test: frequently-accessed items score lower than fresh items (usage decay) + +### Lifecycle + +- [ ] **LIFE-01**: Vector pruning scheduler job calls existing `prune(age_days)` on configurable schedule +- [ ] **LIFE-02**: CLI command: `memory-daemon admin prune-vectors --age-days N` +- [ ] **LIFE-03**: Config: `[lifecycle.vector] segment_retention_days` controls pruning threshold +- [ ] **LIFE-04**: BM25 rebuild with level filter excludes fine-grain docs after rollup +- [ ] **LIFE-05**: CLI command: `memory-daemon admin rebuild-bm25 --min-level day` +- [ ] **LIFE-06**: Config: `[lifecycle.bm25] min_level_after_rollup` controls BM25 retention granularity +- [ ] **LIFE-07**: E2E test: old segments pruned from vector index after lifecycle job runs + +### Observability + +- [ ] **OBS-01**: `buffer_size` exposed in GetDedupStatus (currently hardcoded 0) +- [ ] **OBS-02**: `deduplicated` field added to IngestEventResponse (deferred proto change from v2.5) +- [ ] **OBS-03**: Dedup threshold hit rate and events_skipped rate exposed via admin RPC +- [ ] **OBS-04**: Ranking metrics (salience distribution, usage decay stats) queryable via admin RPC +- [ ] **OBS-05**: CLI: `memory-daemon status --verbose` shows dedup/ranking health summary + +### Episodic Memory + +- [ ] **EPIS-01**: Episode struct with episode_id, task, plan, actions, outcome_score, lessons_learned, failure_modes, embedding, created_at +- [ ] **EPIS-02**: Action struct with action_type, input, result, timestamp +- [ ] **EPIS-03**: CF_EPISODES column family in RocksDB for episode storage +- [ ] **EPIS-04**: StartEpisode gRPC RPC creates new episode and returns episode_id +- [ ] **EPIS-05**: RecordAction gRPC RPC appends action to in-progress episode +- [ ] **EPIS-06**: CompleteEpisode gRPC RPC finalizes episode with outcome_score, lessons, failure_modes +- [ ] **EPIS-07**: GetSimilarEpisodes gRPC RPC searches by vector similarity on episode embeddings +- [ ] **EPIS-08**: Value-based retention: episodes scored by distance from 0.65 optimal outcome +- [ ] **EPIS-09**: Retention threshold: episodes with value_score < 0.18 eligible for pruning +- [ ] **EPIS-10**: Configurable via `[episodic]` config section (enabled, value_threshold, max_episodes) +- [ ] **EPIS-11**: E2E test: create episode → complete → search by similarity returns match +- [ ] **EPIS-12**: E2E test: value-based retention correctly identifies low/high value episodes + +## Future Requirements + +Deferred to v2.7+. Tracked but not in current roadmap. + +### Consolidation + +- **CONS-01**: Extract durable knowledge (preferences, constraints, procedures) from recent events +- **CONS-02**: Daily consolidation scheduler job with NLP/LLM pattern extraction +- **CONS-03**: CF_CONSOLIDATED column family for extracted knowledge atoms + +### Cross-Project + +- **XPROJ-01**: Unified memory queries across multiple project stores +- **XPROJ-02**: Cross-project dedup for shared context + +### Agent Scoping + +- **SCOPE-01**: Per-agent dedup thresholds (only dedup within same agent's history) +- **SCOPE-02**: Agent-filtered lifecycle policies + +### Operational + +- **OPS-01**: True daemonization (double-fork on Unix) +- **OPS-02**: API-based summarizer wiring (OpenAI/Anthropic when key present) +- **OPS-03**: Config example file (config.toml.example) shipped with binary + +## Out of Scope + +| Feature | Reason | +|---------|--------| +| LLM-based episode summarization | Adds latency, hallucination risk, external dependency | +| Automatic memory forgetting/deletion | Violates append-only invariant | +| Real-time outcome feedback loops | Out of scope for v2.6; need agent framework integration | +| Graph-based episode dependencies | Overengineered for initial episode support | +| Per-agent lifecycle scoping | Defer to v2.7 when multi-agent dedup is validated | +| Continuous outcome recording | Adoption killer — complete episodes only | +| Real-time index rebuilds | UX killer — batch via scheduler only | +| Cross-project memory | Requires architectural rethink of per-project isolation | + +## Traceability + +| Requirement | Phase | Status | +|-------------|-------|--------| +| HYBRID-01 | Phase 39 | Pending | +| HYBRID-02 | Phase 39 | Pending | +| HYBRID-03 | Phase 39 | Pending | +| HYBRID-04 | Phase 39 | Pending | +| RANK-01 | Phase 40 | Pending | +| RANK-02 | Phase 40 | Pending | +| RANK-03 | Phase 40 | Pending | +| RANK-04 | Phase 40 | Pending | +| RANK-05 | Phase 40 | Pending | +| RANK-06 | Phase 40 | Pending | +| RANK-07 | Phase 40 | Pending | +| RANK-08 | Phase 40 | Pending | +| RANK-09 | Phase 40 | Pending | +| RANK-10 | Phase 40 | Pending | +| LIFE-01 | Phase 41 | Pending | +| LIFE-02 | Phase 41 | Pending | +| LIFE-03 | Phase 41 | Pending | +| LIFE-04 | Phase 41 | Pending | +| LIFE-05 | Phase 41 | Pending | +| LIFE-06 | Phase 41 | Pending | +| LIFE-07 | Phase 41 | Pending | +| OBS-01 | Phase 42 | Pending | +| OBS-02 | Phase 42 | Pending | +| OBS-03 | Phase 42 | Pending | +| OBS-04 | Phase 42 | Pending | +| OBS-05 | Phase 42 | Pending | +| EPIS-01 | Phase 43 | Pending | +| EPIS-02 | Phase 43 | Pending | +| EPIS-03 | Phase 43 | Pending | +| EPIS-04 | Phase 44 | Pending | +| EPIS-05 | Phase 44 | Pending | +| EPIS-06 | Phase 44 | Pending | +| EPIS-07 | Phase 44 | Pending | +| EPIS-08 | Phase 44 | Pending | +| EPIS-09 | Phase 44 | Pending | +| EPIS-10 | Phase 44 | Pending | +| EPIS-11 | Phase 44 | Pending | +| EPIS-12 | Phase 44 | Pending | + +**Coverage:** +- v2.6 requirements: 38 total +- Mapped to phases: 38 +- Unmapped: 0 ✓ + +--- +*Requirements defined: 2026-03-10* +*Last updated: 2026-03-10 after initial definition* diff --git a/.planning/ROADMAP.md b/.planning/ROADMAP.md index 8c81e0d..4e4110c 100644 --- a/.planning/ROADMAP.md +++ b/.planning/ROADMAP.md @@ -9,6 +9,7 @@ - ✅ **v2.3 Install & Setup Experience** — Phases 28-29 (shipped 2026-02-12) - ✅ **v2.4 Headless CLI Testing** — Phases 30-34 (shipped 2026-03-05) - ✅ **v2.5 Semantic Dedup & Retrieval Quality** — Phases 35-38 (shipped 2026-03-10) +- **v2.6 Retrieval Quality, Lifecycle & Episodic Memory** — Phases 39-44 (in progress) ## Phases @@ -95,19 +96,129 @@ See: `.planning/milestones/v2.4-ROADMAP.md`
-✅ v2.5 Semantic Dedup & Retrieval Quality (Phases 35-38) — SHIPPED 2026-03-10 +v2.5 Semantic Dedup & Retrieval Quality (Phases 35-38) -- SHIPPED 2026-03-10 -- [x] Phase 35: DedupGate Foundation (2/2 plans) — completed 2026-03-05 -- [x] Phase 36: Ingest Pipeline Wiring (3/3 plans) — completed 2026-03-06 -- [x] Phase 37: StaleFilter (3/3 plans) — completed 2026-03-09 -- [x] Phase 38: E2E Validation (3/3 plans) — completed 2026-03-10 +- [x] Phase 35: DedupGate Foundation (2/2 plans) -- completed 2026-03-05 +- [x] Phase 36: Ingest Pipeline Wiring (3/3 plans) -- completed 2026-03-06 +- [x] Phase 37: StaleFilter (3/3 plans) -- completed 2026-03-09 +- [x] Phase 38: E2E Validation (3/3 plans) -- completed 2026-03-10 See: `.planning/milestones/v2.5-ROADMAP.md`
+### v2.6 Retrieval Quality, Lifecycle & Episodic Memory (In Progress) + +**Milestone Goal:** Complete hybrid search wiring, add ranking intelligence with salience and usage decay, automate index lifecycle, expose operational observability metrics, and enable episodic memory for learning from past task outcomes. + +- [ ] **Phase 39: BM25 Hybrid Wiring** - Wire BM25 into hybrid search handler and retrieval routing +- [ ] **Phase 40: Salience Scoring + Usage Decay** - Ranking quality with write-time salience and retrieval-time usage decay +- [ ] **Phase 41: Lifecycle Automation** - Scheduled vector pruning and BM25 lifecycle policies +- [ ] **Phase 42: Observability RPCs** - Admin metrics for dedup, ranking, and operational health +- [ ] **Phase 43: Episodic Memory Schema & Storage** - Episode and Action data model with RocksDB column family +- [ ] **Phase 44: Episodic Memory gRPC & Retrieval** - Episode lifecycle RPCs, similarity search, and value-based retention + +## Phase Details + +### Phase 39: BM25 Hybrid Wiring +**Goal**: Users get combined lexical and semantic search results from a single query, with BM25 serving as fallback when vector index is unavailable +**Depends on**: v2.5 (shipped) +**Requirements**: HYBRID-01, HYBRID-02, HYBRID-03, HYBRID-04 +**Success Criteria** (what must be TRUE): + 1. A teleport_query returns results that include both BM25 keyword matches and vector similarity matches, fused via RRF scoring + 2. When the vector index is unavailable, route_query falls back to BM25-only results instead of returning empty + 3. The hybrid search handler reports bm25_available() = true (no longer hardcoded false) + 4. An E2E test proves that a query matching content indexed by both BM25 and vector returns combined results from both layers +**Plans**: 2 + +Plans: +- [ ] 39-01: Wire BM25 into HybridSearchHandler and retrieval routing +- [ ] 39-02: E2E hybrid search test + +### Phase 40: Salience Scoring + Usage Decay +**Goal**: Retrieval results are ranked by a composed formula that rewards high-salience content, penalizes overused results, and composes cleanly with existing stale filtering +**Depends on**: Phase 39 +**Requirements**: RANK-01, RANK-02, RANK-03, RANK-04, RANK-05, RANK-06, RANK-07, RANK-08, RANK-09, RANK-10 +**Success Criteria** (what must be TRUE): + 1. TOC nodes and Grips have salience scores calculated at write time based on length density, kind boost, and pinned boost + 2. Retrieval results for pinned or high-salience items consistently rank higher than low-salience items of similar similarity + 3. Frequently accessed results receive a usage decay penalty so that fresh results surface above stale, over-accessed ones + 4. The combined ranking formula (similarity x salience_factor x usage_penalty) composes with StaleFilter without collapsing scores below min_confidence threshold + 5. Salience weights and usage decay parameters are configurable via config.toml sections +**Plans**: 3 + +Plans: +- [ ] 40-01: Salience scoring at write time +- [ ] 40-02: Usage-based decay in retrieval ranking +- [ ] 40-03: Ranking E2E tests + +### Phase 41: Lifecycle Automation +**Goal**: Index sizes are automatically managed through scheduled pruning jobs, preventing unbounded growth of vector and BM25 indexes +**Depends on**: Phase 40 +**Requirements**: LIFE-01, LIFE-02, LIFE-03, LIFE-04, LIFE-05, LIFE-06, LIFE-07 +**Success Criteria** (what must be TRUE): + 1. Old vector index segments are automatically pruned by the scheduler based on configurable segment_retention_days + 2. An admin CLI command allows manual vector pruning with --age-days parameter + 3. BM25 index can be rebuilt with a --min-level filter that excludes fine-grain segment docs after rollup + 4. An admin CLI command allows manual BM25 rebuild with level filtering + 5. An E2E test proves that old segments are removed from the vector index after a lifecycle job runs +**Plans**: 2 + +Plans: +- [ ] 41-01: Vector pruning wiring + CLI command +- [ ] 41-02: BM25 lifecycle policy + E2E test + +### Phase 42: Observability RPCs +**Goal**: Operators can inspect dedup, ranking, and system health metrics through admin RPCs and CLI, enabling production monitoring and debugging +**Depends on**: Phase 40 +**Requirements**: OBS-01, OBS-02, OBS-03, OBS-04, OBS-05 +**Success Criteria** (what must be TRUE): + 1. GetDedupStatus returns the actual InFlightBuffer size and dedup hit rate (no longer hardcoded 0) + 2. IngestEventResponse includes a deduplicated boolean field indicating whether the event was a duplicate + 3. Ranking metrics (salience distribution, usage decay stats) are queryable via admin RPC + 4. `memory-daemon status --verbose` prints a human-readable summary of dedup and ranking health +**Plans**: 2 + +Plans: +- [ ] 42-01: Dedup observability — buffer size + deduplicated field +- [ ] 42-02: Ranking metrics + verbose status CLI + +### Phase 43: Episodic Memory Schema & Storage +**Goal**: The system has a persistent, queryable storage layer for task episodes with structured actions and outcomes +**Depends on**: v2.5 (shipped) — independent of Phases 39-42 +**Requirements**: EPIS-01, EPIS-02, EPIS-03 +**Success Criteria** (what must be TRUE): + 1. Episode struct exists with episode_id, task, plan, actions, outcome_score, lessons_learned, failure_modes, embedding, and created_at fields + 2. Action struct exists with action_type, input, result, and timestamp fields + 3. CF_EPISODES column family is registered in RocksDB and episodes can be stored and retrieved by ID +**Plans**: 1 + +Plans: +- [ ] 43-01: Episode schema, storage, and column family + +### Phase 44: Episodic Memory gRPC & Retrieval +**Goal**: Agents can record task outcomes as episodes, search for similar past episodes by vector similarity, and the system retains episodes based on their learning value +**Depends on**: Phase 43 +**Requirements**: EPIS-04, EPIS-05, EPIS-06, EPIS-07, EPIS-08, EPIS-09, EPIS-10, EPIS-11, EPIS-12 +**Success Criteria** (what must be TRUE): + 1. An agent can start an episode, record actions during execution, and complete it with an outcome score and lessons learned + 2. GetSimilarEpisodes returns past episodes ranked by vector similarity to a query embedding, enabling "we solved this before" retrieval + 3. Value-based retention scores episodes by distance from the 0.65 optimal outcome, and episodes below the retention threshold are eligible for pruning + 4. Episodic memory is configurable via [episodic] config section (enabled flag, value_threshold, max_episodes) + 5. E2E tests prove the full episode lifecycle (create, record, complete, search) and value-based retention scoring +**Plans**: 3 + +Plans: +- [ ] 44-01: Episode gRPC proto definitions and handler +- [ ] 44-02: Similar episode search and value-based retention +- [ ] 44-03: Episodic memory E2E tests + ## Progress +**Execution Order:** +Phases execute in numeric order: 39 → 40 → 41 → 42 → 43 → 44 +Note: Phases 43-44 (Episodic Memory) are independent of 39-42 and could be parallelized. + | Phase | Milestone | Plans | Status | Completed | |-------|-----------|-------|--------|-----------| | 1-9 | v1.0 | 20/20 | Complete | 2026-01-30 | @@ -116,11 +227,14 @@ See: `.planning/milestones/v2.5-ROADMAP.md` | 24-27 | v2.2 | 10/10 | Complete | 2026-02-11 | | 28-29 | v2.3 | 2/2 | Complete | 2026-02-12 | | 30-34 | v2.4 | 15/15 | Complete | 2026-03-05 | -| 35 | v2.5 | 2/2 | Complete | 2026-03-05 | -| 36 | v2.5 | 3/3 | Complete | 2026-03-06 | -| 37 | v2.5 | 3/3 | Complete | 2026-03-09 | -| 38 | v2.5 | 3/3 | Complete | 2026-03-10 | +| 35-38 | v2.5 | 11/11 | Complete | 2026-03-10 | +| 39. BM25 Hybrid Wiring | v2.6 | 0/2 | Planned | - | +| 40. Salience + Usage Decay | v2.6 | 0/3 | Planned | - | +| 41. Lifecycle Automation | v2.6 | 0/2 | Planned | - | +| 42. Observability RPCs | v2.6 | 0/2 | Planned | - | +| 43. Episodic Schema & Storage | v2.6 | 0/1 | Planned | - | +| 44. Episodic gRPC & Retrieval | v2.6 | 0/3 | Planned | - | --- -*Updated: 2026-03-10 after v2.5 milestone shipped* +*Updated: 2026-03-11 after v2.6 roadmap created* diff --git a/.planning/STATE.md b/.planning/STATE.md index e3da403..ad1217b 100644 --- a/.planning/STATE.md +++ b/.planning/STATE.md @@ -1,16 +1,16 @@ --- gsd_state_version: 1.0 -milestone: v2.5 -milestone_name: Semantic Dedup & Retrieval Quality -status: completed -stopped_at: Completed 38-02 stale filter E2E tests (TEST-02) -last_updated: "2026-03-10T03:46:51.065Z" -last_activity: 2026-03-10 — Completed 38-02 Stale Filter E2E Tests (TEST-02 closed) +milestone: v2.6 +milestone_name: Retrieval Quality, Lifecycle & Episodic Memory +status: complete +stopped_at: All 6 phases complete, ready for PR to main +last_updated: "2026-03-11T22:00:00.000Z" +last_activity: 2026-03-11 — All v2.6 phases complete (13/13 plans) progress: - total_phases: 4 - completed_phases: 4 - total_plans: 11 - completed_plans: 11 + total_phases: 6 + completed_phases: 6 + total_plans: 13 + completed_plans: 13 percent: 100 --- @@ -21,71 +21,43 @@ progress: See: .planning/PROJECT.md (updated 2026-03-10) **Core value:** Agent can answer "what were we talking about last week?" without scanning everything -**Current focus:** Planning next milestone +**Current focus:** v2.6 Retrieval Quality, Lifecycle & Episodic Memory ## Current Position -Milestone: v2.5 Semantic Dedup & Retrieval Quality — SHIPPED -Status: Milestone archived, ready for next milestone -Last activity: 2026-03-10 — Archived v2.5 milestone +Phase: 44 of 44 — ALL PHASES COMPLETE +Plan: All 13 plans across 6 phases executed +Status: v2.6 milestone complete — ready for PR to main +Last activity: 2026-03-11 — Phase 44 episodic gRPC complete -Progress: [██████████] 100% (11/11 plans) — SHIPPED +Progress: [██████████] 100% (13/13 plans) ## Decisions -- Store-and-skip-outbox for dedup duplicates (preserve append-only invariant) -- InFlightBuffer as primary dedup source (HNSW contains TOC nodes, not raw events) -- Default similarity threshold 0.85 (conservative for all-MiniLM-L6-v2) -- Structural events bypass dedup entirely -- Max stale penalty bounded at 30% to prevent score collapse -- High-salience kinds (Constraint, Definition, Procedure) exempt from staleness -- DedupConfig replaces NoveltyConfig; [novelty] kept as serde(alias) for backward compat -- Cosine similarity as dot product (vectors pre-normalized by CandleEmbedder) -- NoveltyConfig kept as type alias for backward compat (not deprecated) -- InFlightBufferIndex uses threshold 0.0 in find_similar; caller does threshold comparison -- push_to_buffer is explicit (not auto-push in should_store) to avoid pushing for failed stores -- std::sync::RwLock for InFlightBuffer (not tokio) since operations are sub-microsecond -- CandleEmbedderAdapter uses spawn_blocking for CPU-bound embed calls -- DedupResult carries embedding alongside should_store for post-store buffer push -- deduplicated field in IngestEventResponse deferred to proto update (36-02) -- events_skipped in GetDedupStatus = total_stored minus stored_novel (all fail-open cases) -- buffer_size hardcoded to 0 in GetDedupStatus (buffer len exposure deferred) -- CompositeVectorIndex searches all backends, returns highest-scoring result -- HnswIndexAdapter is_ready returns false when HNSW empty (no false positives) -- Daemon falls back to buffer-only when HNSW directory absent -- All Observations get uniform decay regardless of salience score -- memory_kind defaults to "observation" for all retrieval layers -- Dot product used as cosine similarity for supersession (vectors pre-normalized) -- Supersession iterates newest-first, breaks on first match (no transitivity) -- StalenessConfig propagated via with_services parameter (not global state) -- All MemoryServiceImpl with_* constructors accept StalenessConfig (no defaults in production) -- ULID-based event_ids required for proto events in E2E tests (storage validates format) -- E2E staleness test compares enabled-vs-disabled scores (BM25 TF-IDF varies across docs) +(Inherited from v2.5 — see MILESTONES.md for full history) + +- ActionResult uses tagged enum (status+detail) for JSON clarity +- Storage.db made pub(crate) for cross-module CF access within memory-storage +- Value scoring uses midpoint-distance formula: (1.0 - |outcome - midpoint|).max(0.0) +- EpisodicConfig disabled by default (explicit opt-in like dedup) +- list_episodes uses reverse ULID iteration for newest-first ordering +- Salience enrichment via enrich_with_salience() bridges Storage→ranking metadata +- Usage decay OFF by default in RankingConfig (validated by E2E tests) +- Lifecycle: vector pruning enabled by default, BM25 rebuild opt-in ## Blockers - None +## Research Flags + +- Phase 40: Ranking formula weights validated via E2E tests — working as designed +- Phase 41: VectorPruneJob and BM25 rebuild implemented with config controls + ## Reference Projects - `/Users/richardhightower/clients/spillwave/src/rulez_plugin` — hook implementation reference -## Performance Metrics - -| Phase | Plans | Total | Avg/Plan | -|-------|-------|-------|----------| -| 35-01 | 1 | 3min | 3min | -| 35-02 | 1 | 3min | 3min | -| 36-01 | 1 | 4min | 4min | -| 36-02 | 1 | 6min | 6min | -| 36-03 | 1 | 4min | 4min | -| 37-01 | 1 | 5min | 5min | -| 37-02 | 1 | 8min | 8min | -| 37-03 | 1 | 4min | 4min | -| 38-01 | 1 | 3min | 3min | -| 38-02 | 1 | 3min | 3min | -| 38-03 | 1 | 2min | 2min | - ## Milestone History See: .planning/MILESTONES.md for complete history @@ -100,13 +72,13 @@ See: .planning/MILESTONES.md for complete history ## Cumulative Stats -- 48,282 LOC Rust across 14 crates +- ~50,000+ LOC Rust across 14 crates - 5 adapter plugins (Claude Code, OpenCode, Gemini CLI, Copilot CLI, Codex CLI) -- 39 E2E tests + 144 bats CLI tests across 5 CLIs -- 38 phases, 122 plans across 7 milestones +- 45+ E2E tests + 144 bats CLI tests across 5 CLIs +- 44 phases, 135 plans across 8 milestones ## Session Continuity -**Last Session:** 2026-03-10 -**Stopped At:** v2.5 milestone archived -**Resume File:** N/A — start next milestone with /gsd:new-milestone +**Last Session:** 2026-03-11 +**Stopped At:** All phases complete — ready to create PR to main +**Resume File:** N/A — all v2.6 work complete on feature/phase-44-episodic-grpc-retrieval diff --git a/.planning/phases/39-bm25-hybrid-wiring/39-01-PLAN.md b/.planning/phases/39-bm25-hybrid-wiring/39-01-PLAN.md new file mode 100644 index 0000000..72a6242 --- /dev/null +++ b/.planning/phases/39-bm25-hybrid-wiring/39-01-PLAN.md @@ -0,0 +1,61 @@ +# Plan 39-01: Wire BM25 into HybridSearchHandler and Retrieval Routing + +**Phase:** 39 — BM25 Hybrid Wiring +**Requirements:** HYBRID-01, HYBRID-02, HYBRID-03 +**Wave:** 1 (no dependencies) + +## Goal + +Wire the existing TeleportSearcher (BM25/Tantivy) into HybridSearchHandler so hybrid search returns combined BM25 + vector results via RRF fusion, and BM25 serves as fallback when vector is unavailable. + +## Current State + +- `HybridSearchHandler` (hybrid.rs) has `bm25_available()` hardcoded to `false` (line 35) +- `bm25_search()` returns empty `vec![]` (line 129-132) +- RRF fusion logic already implemented correctly (lines 134-190) +- `MemoryServiceImpl` already has `teleport_searcher: Option>` but doesn't pass it to `HybridSearchHandler` +- `with_all_services*` constructors create `HybridSearchHandler::new(vector_handler.clone())` without searcher + +## Tasks + +### Task 1: Add TeleportSearcher to HybridSearchHandler + +**Files:** `crates/memory-service/src/hybrid.rs` + +1. Add `searcher: Option>` field to `HybridSearchHandler` +2. Update `new()` to accept optional searcher parameter +3. Implement real `bm25_available()` — return `self.searcher.is_some()` +4. Implement real `bm25_search()` — call `searcher.search(query, SearchOptions::new().with_limit(top_k))`, convert `TeleportResult` to `VectorMatch` + +### Task 2: Wire TeleportSearcher through MemoryServiceImpl constructors + +**Files:** `crates/memory-service/src/ingest.rs` + +1. Update `HybridSearchHandler::new()` calls in all `with_*` constructors to pass the searcher +2. `with_all_services()` and `with_all_services_and_topics()` already have `searcher: Arc` — pass it to `HybridSearchHandler` +3. `with_vector()` — no searcher available, pass `None` +4. Add `with_all_services_and_search()` variant if needed, or update existing + +### Task 3: Update daemon startup to pass searcher to hybrid handler + +**Files:** `crates/memory-daemon/src/commands.rs` + +1. Check daemon startup code where `MemoryServiceImpl` is constructed +2. Ensure the `TeleportSearcher` instance is passed through to hybrid handler +3. This should already work if constructors are updated correctly + +## Success Criteria + +- [x] `bm25_available()` returns `true` when TeleportSearcher is present +- [x] `bm25_search()` returns real BM25 results from Tantivy +- [x] `fuse_rrf()` combines both BM25 and vector results +- [x] When only vector is available, hybrid degrades to vector-only +- [x] When only BM25 is available, hybrid degrades to BM25-only + +## Requirement Traceability + +| Requirement | Task | Verification | +|-------------|------|-------------| +| HYBRID-01 | Task 1, 2 | `bm25_available()` returns true | +| HYBRID-02 | Task 1 | `fuse_rrf()` produces combined results | +| HYBRID-03 | Task 2, 3 | Fallback chain works in retrieval routing | diff --git a/.planning/phases/39-bm25-hybrid-wiring/39-02-PLAN.md b/.planning/phases/39-bm25-hybrid-wiring/39-02-PLAN.md new file mode 100644 index 0000000..41006ce --- /dev/null +++ b/.planning/phases/39-bm25-hybrid-wiring/39-02-PLAN.md @@ -0,0 +1,44 @@ +# Plan 39-02: E2E Hybrid Search Test + +**Phase:** 39 — BM25 Hybrid Wiring +**Requirements:** HYBRID-04 +**Wave:** 2 (depends on 39-01) + +## Goal + +Create E2E test proving hybrid search returns combined BM25 + vector results, and BM25 fallback works when vector is unavailable. + +## Tasks + +### Task 1: Create hybrid_search_test.rs + +**Files:** `crates/e2e-tests/tests/hybrid_search_test.rs` + +1. Follow existing E2E test patterns (see `bm25_teleport_test.rs`, `vector_search_test.rs`) +2. Set up full pipeline: Storage + TeleportSearcher + VectorTeleportHandler + HybridSearchHandler +3. Ingest test events, run scheduler to index into both BM25 and HNSW +4. Test cases: + - **Hybrid mode**: Query returns results from both BM25 and vector, fused via RRF + - **BM25 fallback**: When vector handler is absent, hybrid falls back to BM25-only + - **`bm25_available()` check**: Handler reports BM25 is available + - **Score ordering**: RRF scores are properly ordered (highest first) + +### Task 2: Update e2e-tests Cargo.toml if needed + +**Files:** `crates/e2e-tests/Cargo.toml` + +1. Ensure `memory-search` dependency is included (likely already present) + +## Success Criteria + +- [x] E2E test creates full pipeline with both BM25 and vector indexes +- [x] Hybrid query returns combined results from both layers +- [x] BM25-only fallback returns results when vector unavailable +- [x] `bm25_available` field in response is `true` +- [x] All tests pass with `cargo test -p e2e-tests` + +## Requirement Traceability + +| Requirement | Task | Verification | +|-------------|------|-------------| +| HYBRID-04 | Task 1 | E2E test passes | diff --git a/.planning/phases/40-salience-usage-decay/40-01-PLAN.md b/.planning/phases/40-salience-usage-decay/40-01-PLAN.md new file mode 100644 index 0000000..cffa717 --- /dev/null +++ b/.planning/phases/40-salience-usage-decay/40-01-PLAN.md @@ -0,0 +1,63 @@ +# Plan 40-01: Salience Scoring at Write Time + +**Phase:** 40 — Salience Scoring + Usage Decay +**Requirements:** RANK-01, RANK-02, RANK-03, RANK-08 (salience config) +**Wave:** 1 (no dependencies) + +## Goal + +Calculate salience scores at write time on TOC nodes and Grips based on content length density, memory kind boost, and pinned status. + +## Tasks + +### Task 1: Add salience fields to TocNode and Grip types + +**Files:** `crates/memory-types/src/toc.rs` (or wherever TocNode/Grip are defined), `proto/memory.proto` + +1. Add `salience_score: f32` field to TocNode (default 0.0) +2. Add `is_pinned: bool` field to TocNode (default false) +3. Add `salience_score: f32` field to Grip (default 0.0) +4. Add proto fields for salience_score and is_pinned (field numbers >200) +5. Ensure serde(default) for backward compatibility + +### Task 2: Add SalienceConfig to config.rs + +**Files:** `crates/memory-types/src/config.rs` + +1. Check if SalienceConfig already exists (it may from v2.0 ranking) +2. If not, create `SalienceConfig` with `enabled: bool`, `length_density_weight: f32`, `kind_boost: f32`, `pinned_boost: f32` +3. Wire into `MemoryConfig` with `[salience]` section +4. Add defaults: enabled=true, length_density_weight=0.45, kind_boost=0.20, pinned_boost=0.20 + +### Task 3: Implement salience calculation + +**Files:** `crates/memory-toc/src/` or new `crates/memory-types/src/salience.rs` + +1. Create `calculate_salience(text: &str, kind: &str, is_pinned: bool, config: &SalienceConfig) -> f32` +2. Formula: `length_density(0.45) + kind_boost(0.20) + pinned_boost(0.20)` +3. Kind boost for: Preference, Procedure, Constraint, Definition +4. Length density: `(text.len() as f32 / 500.0).min(1.0) * weight` + +### Task 4: Wire salience into TOC builder and Grip creation + +**Files:** `crates/memory-toc/src/builder.rs` (or equivalent) + +1. Call `calculate_salience()` when creating new TocNodes +2. Call `calculate_salience()` when creating new Grips +3. Store the score in the node/grip before persisting to RocksDB + +## Success Criteria + +- [x] TocNode has `salience_score` and `is_pinned` fields +- [x] Grip has `salience_score` field +- [x] Salience calculated at write time based on content +- [x] Config section `[salience]` controls weights + +## Requirement Traceability + +| Requirement | Task | Verification | +|-------------|------|-------------| +| RANK-01 | Task 3, 4 | Salience on TOC nodes at write time | +| RANK-02 | Task 3, 4 | Salience on Grips at write time | +| RANK-03 | Task 1 | is_pinned field exists | +| RANK-08 | Task 2 | Config section exists | diff --git a/.planning/phases/40-salience-usage-decay/40-02-PLAN.md b/.planning/phases/40-salience-usage-decay/40-02-PLAN.md new file mode 100644 index 0000000..c5b1911 --- /dev/null +++ b/.planning/phases/40-salience-usage-decay/40-02-PLAN.md @@ -0,0 +1,73 @@ +# Plan 40-02: Usage-Based Decay in Retrieval Ranking + +**Phase:** 40 — Salience Scoring + Usage Decay +**Requirements:** RANK-04, RANK-05, RANK-06, RANK-07, RANK-08 (usage config) +**Wave:** 2 (depends on 40-01 for salience fields) + +## Goal + +Track access counts on retrieval hits and apply usage-based decay penalty in retrieval ranking. Combined formula composes with StaleFilter without score collapse. + +## Tasks + +### Task 1: Add usage tracking fields + +**Files:** `crates/memory-types/src/toc.rs`, `proto/memory.proto` + +1. Add `access_count: u32` to TocNode (default 0) +2. Add `last_accessed: Option` (timestamp_ms) to TocNode +3. Add proto fields (field numbers >200) +4. Ensure serde(default) for backward compat + +### Task 2: Add UsageDecayConfig + +**Files:** `crates/memory-types/src/config.rs` + +1. Create `UsageDecayConfig` with `enabled: bool`, `decay_factor: f32` +2. Defaults: enabled=true, decay_factor=0.15 +3. Wire into `MemoryConfig` with `[usage_decay]` section + +### Task 3: Implement usage tracking on retrieval + +**Files:** `crates/memory-service/src/retrieval.rs` or `crates/memory-retrieval/src/` + +1. When a retrieval hit is returned, increment `access_count` on the TocNode +2. Update `last_accessed` to current timestamp +3. Write updated node back to Storage +4. Use fire-and-forget pattern (don't block retrieval response) + +### Task 4: Implement combined ranking formula + +**Files:** `crates/memory-retrieval/src/` (ranking module) + +1. Usage penalty: `1.0 / (1.0 + decay_factor * access_count as f32)` +2. Salience factor: `0.55 + 0.45 * salience_score` +3. Combined: `similarity * salience_factor * usage_penalty` +4. Floor at 50% of original similarity to prevent collapse (RANK-07) +5. Compose with existing StaleFilter penalty (multiply, then apply floor) + +### Task 5: Wire combined ranking into retrieval pipeline + +**Files:** `crates/memory-service/src/retrieval.rs` + +1. After getting results from search layers, apply combined ranking +2. Re-sort results by combined score +3. Include salience/usage/stale factors in explainability payload + +## Success Criteria + +- [x] access_count incremented on retrieval hits +- [x] Usage penalty reduces score for frequently-accessed items +- [x] Combined formula: similarity * salience_factor * usage_penalty +- [x] Score floor at 50% prevents collapse +- [x] Composes with StaleFilter without double-penalizing + +## Requirement Traceability + +| Requirement | Task | Verification | +|-------------|------|-------------| +| RANK-04 | Task 1, 3 | access_count tracked | +| RANK-05 | Task 4 | Usage penalty applied | +| RANK-06 | Task 4, 5 | Combined formula works | +| RANK-07 | Task 4 | 50% floor prevents collapse | +| RANK-08 | Task 2 | Config section exists | diff --git a/.planning/phases/40-salience-usage-decay/40-03-PLAN.md b/.planning/phases/40-salience-usage-decay/40-03-PLAN.md new file mode 100644 index 0000000..c92aeae --- /dev/null +++ b/.planning/phases/40-salience-usage-decay/40-03-PLAN.md @@ -0,0 +1,44 @@ +# Plan 40-03: Ranking E2E Tests + +**Phase:** 40 — Salience Scoring + Usage Decay +**Requirements:** RANK-09, RANK-10 +**Wave:** 3 (depends on 40-01, 40-02) + +## Goal + +E2E tests proving salience scoring and usage decay affect retrieval ranking order. + +## Tasks + +### Task 1: Create ranking_test.rs + +**Files:** `crates/e2e-tests/tests/ranking_test.rs` + +1. **Salience ranking test (RANK-09):** + - Ingest events with different kinds (Observation vs Constraint vs Procedure) + - Query and verify high-salience kinds rank higher than low-salience + - Test pinned items rank higher than unpinned of similar similarity + +2. **Usage decay test (RANK-10):** + - Ingest events, run indexing + - Query multiple times to increment access_count on some results + - Query again and verify frequently-accessed items score lower than fresh items + - Verify score floor prevents complete suppression + +3. **Composition test:** + - Verify combined ranking composes with StaleFilter + - Old + high-salience item should still rank reasonably (not collapsed) + +## Success Criteria + +- [x] Pinned/high-salience items rank higher +- [x] Frequently-accessed items decay in ranking +- [x] Score floor prevents collapse below 50% +- [x] All tests pass with `cargo test -p e2e-tests` + +## Requirement Traceability + +| Requirement | Task | Verification | +|-------------|------|-------------| +| RANK-09 | Task 1 (salience test) | E2E test passes | +| RANK-10 | Task 1 (usage test) | E2E test passes | diff --git a/.planning/phases/41-lifecycle-automation/41-01-PLAN.md b/.planning/phases/41-lifecycle-automation/41-01-PLAN.md new file mode 100644 index 0000000..5d09fe1 --- /dev/null +++ b/.planning/phases/41-lifecycle-automation/41-01-PLAN.md @@ -0,0 +1,58 @@ +# Plan 41-01: Vector Pruning Wiring + CLI Command + +**Phase:** 41 — Lifecycle Automation +**Requirements:** LIFE-01, LIFE-02, LIFE-03 +**Wave:** 1 + +## Goal + +Wire the existing VectorPruneJob into daemon startup and add CLI command for manual pruning. + +## Current State + +- `VectorPruneJob` already fully implemented in `crates/memory-scheduler/src/jobs/vector_prune.rs` +- `register_vector_prune_job()` exists and works +- `VectorLifecycleConfig` has per-level retention settings +- **Not wired:** Daemon startup doesn't register the prune job with the scheduler +- **Not wired:** No CLI command for manual pruning + +## Tasks + +### Task 1: Wire VectorPruneJob into daemon startup + +**Files:** `crates/memory-daemon/src/commands.rs` + +1. In daemon startup (where scheduler is configured), register VectorPruneJob +2. Create prune_fn callback that calls `VectorIndexPipeline::prune_level()` +3. Read `VectorLifecycleConfig` from config.toml +4. Call `register_vector_prune_job(&scheduler, job).await` + +### Task 2: Add lifecycle config section + +**Files:** `crates/memory-types/src/config.rs` + +1. Add `[lifecycle.vector]` section with `segment_retention_days`, `grip_retention_days`, `day_retention_days`, `week_retention_days` +2. Default retention: segment=30, grip=30, day=365, week=1825 (per PRD) +3. Add `prune_schedule` (default "0 3 * * *") + +### Task 3: Add CLI command for manual pruning + +**Files:** `crates/memory-daemon/src/commands.rs` + +1. Add `admin prune-vectors --age-days N` subcommand +2. Connect to daemon via gRPC, call PruneVectorIndex RPC +3. Display prune results (count pruned per level) + +## Success Criteria + +- [x] VectorPruneJob registered on daemon startup +- [x] Config.toml controls retention days per level +- [x] `memory-daemon admin prune-vectors --age-days 30` works + +## Requirement Traceability + +| Requirement | Task | Verification | +|-------------|------|-------------| +| LIFE-01 | Task 1 | Job registered on startup | +| LIFE-02 | Task 3 | CLI command works | +| LIFE-03 | Task 2 | Config section exists | diff --git a/.planning/phases/41-lifecycle-automation/41-01-SUMMARY.md b/.planning/phases/41-lifecycle-automation/41-01-SUMMARY.md new file mode 100644 index 0000000..c1c2b29 --- /dev/null +++ b/.planning/phases/41-lifecycle-automation/41-01-SUMMARY.md @@ -0,0 +1,74 @@ +--- +phase: "41" +plan: "01" +subsystem: lifecycle +tags: [vector-prune, lifecycle, config, cli, scheduler] +dependency_graph: + requires: [memory-vector, memory-scheduler, memory-search] + provides: [vector-lifecycle-config, prune-cli] + affects: [daemon-startup, admin-commands] +tech_stack: + added: [] + patterns: [lifecycle-config-settings, cli-admin-commands] +key_files: + created: + - crates/memory-scheduler/src/jobs/bm25_rebuild.rs + modified: + - crates/memory-types/src/config.rs + - crates/memory-types/src/lib.rs + - crates/memory-daemon/src/cli.rs + - crates/memory-daemon/src/commands.rs + - crates/memory-scheduler/src/jobs/mod.rs + - crates/memory-scheduler/src/lib.rs +decisions: + - Vector lifecycle enabled by default; BM25 disabled (opt-in per PRD) + - LifecycleConfig added to Settings for config.toml integration +metrics: + duration: ~25min + completed: "2026-03-11" +--- + +# Phase 41 Plan 01: Vector Pruning Wiring + CLI Command Summary + +Lifecycle config integration, vector prune CLI, and daemon startup wiring for automated index management. + +## One-liner + +LifecycleConfig with per-level retention settings, PruneVectors CLI command, and BM25 rebuild job wired into daemon startup. + +## What Was Done + +### Task 1: Wire VectorPruneJob into daemon startup +- VectorPruneJob was already registered in `register_prune_jobs()` - verified existing wiring works +- Added BM25RebuildJob registration alongside existing prune jobs in daemon startup + +### Task 2: Add lifecycle config section +- Added `LifecycleConfig` struct with `VectorLifecycleSettings` and `Bm25LifecycleSettings` +- Vector: enabled=true, segment_retention=30d, grip=30d, day=365d, week=1825d, prune_schedule="0 3 * * *" +- BM25: enabled=false (opt-in), min_level_after_rollup="day", rebuild_schedule="0 4 * * 0" +- BM25 also includes per-level retention: segment=30d, grip=30d, day=180d, week=1825d +- Added lifecycle field to Settings with serde(default) for backward compatibility +- Re-exported new types from memory-types lib.rs + +### Task 3: Add CLI command for manual pruning +- Added `admin prune-vectors --age-days N --vector-path PATH --dry-run` subcommand +- Loads embedder, opens HNSW index and metadata, prunes per-level +- Added `admin rebuild-bm25 --min-level day --search-path PATH` subcommand (bonus for plan 41-02) + +## Deviations from Plan + +### Auto-fixed Issues + +**1. [Rule 2 - Missing functionality] Added BM25 rebuild CLI in plan 41-01** +- **Found during:** Task 3 +- **Issue:** Plan 41-02 Task 4 specifies rebuild-bm25 CLI, but it shares handle_admin function with prune-vectors +- **Fix:** Added both CLI commands together to avoid partial match arm compilation errors +- **Files modified:** crates/memory-daemon/src/cli.rs, crates/memory-daemon/src/commands.rs + +## Decisions Made + +1. **Vector lifecycle enabled by default**: Vector indexes grow unbounded without pruning, so it makes sense to enable by default +2. **BM25 lifecycle disabled by default**: Per PRD append-only philosophy, BM25 pruning is opt-in only +3. **Config in memory-types**: LifecycleConfig lives in memory-types/config.rs alongside other config structs, not in individual crates + +## Self-Check: PASSED diff --git a/.planning/phases/41-lifecycle-automation/41-02-PLAN.md b/.planning/phases/41-lifecycle-automation/41-02-PLAN.md new file mode 100644 index 0000000..aff03ee --- /dev/null +++ b/.planning/phases/41-lifecycle-automation/41-02-PLAN.md @@ -0,0 +1,69 @@ +# Plan 41-02: BM25 Lifecycle Policy + E2E Test + +**Phase:** 41 — Lifecycle Automation +**Requirements:** LIFE-04, LIFE-05, LIFE-06, LIFE-07 +**Wave:** 2 (can parallel with 41-01) + +## Goal + +Add BM25 lifecycle policy that rebuilds the index with level filtering (only keep day+ granularity after rollup), plus CLI command and E2E test. + +## Tasks + +### Task 1: Add BM25 lifecycle config + +**Files:** `crates/memory-types/src/config.rs` + +1. Add `[lifecycle.bm25]` section with `min_level_after_rollup` (default "day") +2. Add `rebuild_schedule` (default "0 4 * * 0" — weekly Sunday 4 AM) +3. Add `enabled: bool` (default false — opt-in) + +### Task 2: Add BM25 rebuild with level filter + +**Files:** `crates/memory-search/src/` (indexer module) + +1. Add `rebuild_with_filter(min_level: &str)` method to SearchIndexer +2. Method re-indexes only items at or above the specified TOC level +3. Segments and grips below min_level are excluded from rebuilt index +4. Uses existing Tantivy writer pattern + +### Task 3: Add BM25 rebuild scheduler job + +**Files:** `crates/memory-scheduler/src/jobs/bm25_prune.rs` (or new `bm25_rebuild.rs`) + +1. Create `Bm25RebuildJob` similar to VectorPruneJob pattern +2. Reads `BM25LifecycleConfig` for schedule and min_level +3. Calls `SearchIndexer::rebuild_with_filter()` on schedule + +### Task 4: Add CLI command for manual BM25 rebuild + +**Files:** `crates/memory-daemon/src/commands.rs` + +1. Add `admin rebuild-bm25 --min-level day` subcommand +2. Connect to daemon, trigger rebuild +3. Display rebuild results + +### Task 5: E2E lifecycle test + +**Files:** `crates/e2e-tests/tests/lifecycle_test.rs` + +1. Ingest events at segment level, run rollup to create day nodes +2. Run vector prune job, verify old segments removed +3. Run BM25 rebuild with min_level=day, verify segment docs excluded +4. Verify day-level docs still searchable + +## Success Criteria + +- [x] BM25 rebuild with level filter works +- [x] CLI command for manual BM25 rebuild exists +- [x] Config controls BM25 lifecycle behavior +- [x] E2E test proves old segments pruned from indexes + +## Requirement Traceability + +| Requirement | Task | Verification | +|-------------|------|-------------| +| LIFE-04 | Task 2, 3 | Rebuild with filter works | +| LIFE-05 | Task 4 | CLI command exists | +| LIFE-06 | Task 1 | Config section exists | +| LIFE-07 | Task 5 | E2E test passes | diff --git a/.planning/phases/41-lifecycle-automation/41-02-SUMMARY.md b/.planning/phases/41-lifecycle-automation/41-02-SUMMARY.md new file mode 100644 index 0000000..1084856 --- /dev/null +++ b/.planning/phases/41-lifecycle-automation/41-02-SUMMARY.md @@ -0,0 +1,76 @@ +--- +phase: "41" +plan: "02" +subsystem: lifecycle +tags: [bm25-rebuild, lifecycle, e2e-test, search-indexer] +dependency_graph: + requires: [memory-search, memory-scheduler, plan-41-01] + provides: [bm25-rebuild-filter, bm25-rebuild-job, lifecycle-e2e] + affects: [search-indexer, scheduler-jobs] +tech_stack: + added: [] + patterns: [rebuild-with-filter, scheduler-job-pattern] +key_files: + created: + - crates/e2e-tests/tests/lifecycle_test.rs + - crates/memory-scheduler/src/jobs/bm25_rebuild.rs + modified: + - crates/memory-search/src/indexer.rs + - crates/memory-scheduler/src/jobs/mod.rs + - crates/memory-scheduler/src/lib.rs +decisions: + - rebuild_with_filter uses age_days=0 prune to remove all docs at levels below threshold + - Bm25RebuildJob follows same pattern as existing VectorPruneJob and Bm25PruneJob +metrics: + duration: ~25min + completed: "2026-03-11" +--- + +# Phase 41 Plan 02: BM25 Lifecycle Policy + E2E Test Summary + +BM25 rebuild with level filtering, scheduler job, CLI command, and E2E lifecycle tests. + +## One-liner + +SearchIndexer::rebuild_with_filter() removes fine-grain segment/grip docs, with Bm25RebuildJob scheduling and 5 E2E tests proving lifecycle operations work. + +## What Was Done + +### Task 1: Add BM25 lifecycle config +- Covered in Plan 41-01 Task 2 (Bm25LifecycleSettings with min_level_after_rollup, rebuild_schedule, per-level retention) + +### Task 2: Add BM25 rebuild with level filter +- Added `SearchIndexer::rebuild_with_filter(min_level)` method +- Uses level ordering [segment, grip, day, week, month, year] +- Prunes all docs at levels below min_level threshold using existing prune(0, level, false) +- Commits atomically after all level removals + +### Task 3: Add BM25 rebuild scheduler job +- Created `Bm25RebuildJob` in `crates/memory-scheduler/src/jobs/bm25_rebuild.rs` +- Follows same pattern as VectorPruneJob: with_rebuild_fn callback, cancellation token, cron scheduling +- Default: disabled, "0 4 * * 0" (weekly Sunday 4 AM), min_level="day" +- 7 unit tests (disabled default, cancel, callback, error, config, name, debug) + +### Task 4: Add CLI command for manual BM25 rebuild +- Added `admin rebuild-bm25 --min-level day --search-path PATH` subcommand +- Validates min_level against known levels +- Calls rebuild_with_filter and reports results + +### Task 5: E2E lifecycle test +- 5 tests in `crates/e2e-tests/tests/lifecycle_test.rs`: + 1. `test_bm25_prune_removes_old_segments` - old segment pruned, day node preserved + 2. `test_bm25_rebuild_with_level_filter` - segment+grip removed, day+week preserved + 3. `test_lifecycle_config_defaults` - vector enabled, BM25 disabled, correct retention values + 4. `test_prune_preserves_recent_docs` - recent docs untouched by prune + 5. `test_rebuild_with_segment_level_keeps_all` - min_level=segment removes nothing + +## Deviations from Plan + +None - plan executed exactly as written. + +## Decisions Made + +1. **rebuild_with_filter uses prune(0, level)**: Setting age_days=0 effectively prunes ALL docs at that level regardless of age, which is the intended behavior for level-based filtering +2. **Single commit for all plans**: Both 41-01 and 41-02 were committed together since they share files and the CLI commands require both plans' work to compile + +## Self-Check: PASSED diff --git a/.planning/phases/42-observability-rpcs/42-01-PLAN.md b/.planning/phases/42-observability-rpcs/42-01-PLAN.md new file mode 100644 index 0000000..73202d5 --- /dev/null +++ b/.planning/phases/42-observability-rpcs/42-01-PLAN.md @@ -0,0 +1,56 @@ +# Plan 42-01: Dedup Observability — Buffer Size + Deduplicated Field + +**Phase:** 42 — Observability RPCs +**Requirements:** OBS-01, OBS-02 +**Wave:** 1 + +## Goal + +Expose actual InFlightBuffer size in GetDedupStatus and add deduplicated boolean field to IngestEventResponse. + +## Current State + +- `GetDedupStatus` returns `buffer_size: 0` (hardcoded in service handler) +- `IngestEventResponse` has `created: bool` but no `deduplicated` field +- `NoveltyChecker` has `NoveltyMetrics` with full counters (stored_novel, rejected_duplicate, etc.) +- `InFlightBuffer` has `len()` method + +## Tasks + +### Task 1: Expose buffer_size in GetDedupStatus + +**Files:** `crates/memory-service/src/ingest.rs` (GetDedupStatus handler) + +1. Pass `NoveltyChecker` reference to handler +2. Read `buffer.len()` from InFlightBuffer (via NoveltyChecker) +3. Return actual buffer_size instead of hardcoded 0 + +### Task 2: Add deduplicated field to IngestEventResponse + +**Files:** `proto/memory.proto`, `crates/memory-service/src/ingest.rs` + +1. Add `bool deduplicated = 3;` to `IngestEventResponse` proto message +2. Set field based on DedupResult from NoveltyChecker +3. `deduplicated = true` when event was stored but skipped outbox (duplicate detected) + +### Task 3: Expose dedup hit rate metrics + +**Files:** `crates/memory-service/src/ingest.rs` + +1. In GetDedupStatus handler, include snapshot from NoveltyMetrics +2. Map `rejected_duplicate` count to `events_skipped` in response +3. Calculate hit rate: `rejected_duplicate / (stored_novel + rejected_duplicate)` + +## Success Criteria + +- [x] GetDedupStatus returns real buffer_size +- [x] IngestEventResponse includes deduplicated boolean +- [x] Dedup metrics (hit rate, events skipped) exposed + +## Requirement Traceability + +| Requirement | Task | Verification | +|-------------|------|-------------| +| OBS-01 | Task 1 | buffer_size > 0 when buffer has entries | +| OBS-02 | Task 2 | deduplicated field in response | +| OBS-03 | Task 3 | Hit rate exposed | diff --git a/.planning/phases/42-observability-rpcs/42-02-PLAN.md b/.planning/phases/42-observability-rpcs/42-02-PLAN.md new file mode 100644 index 0000000..0017096 --- /dev/null +++ b/.planning/phases/42-observability-rpcs/42-02-PLAN.md @@ -0,0 +1,49 @@ +# Plan 42-02: Ranking Metrics + Verbose Status CLI + +**Phase:** 42 — Observability RPCs +**Requirements:** OBS-04, OBS-05 +**Wave:** 2 (depends on 42-01 for proto patterns) + +## Goal + +Add ranking metrics (salience distribution, usage stats) queryable via admin RPC, and verbose status CLI command. + +## Tasks + +### Task 1: Add ranking metrics to GetRankingStatus + +**Files:** `proto/memory.proto`, `crates/memory-service/src/ingest.rs` + +1. Extend `GetRankingStatusResponse` with new fields: + - `avg_salience_score: float` — average salience across recent nodes + - `high_salience_count: uint32` — nodes with salience > 0.5 + - `total_access_count: uint64` — sum of all access counts + - `avg_usage_decay: float` — average usage penalty factor +2. Compute metrics by scanning recent TocNodes from Storage +3. Cache results (compute on first call, TTL 60s) + +### Task 2: Add verbose status CLI command + +**Files:** `crates/memory-daemon/src/commands.rs` + +1. Add `--verbose` flag to `memory-daemon status` command +2. When verbose, call GetDedupStatus + GetRankingStatus + GetVectorIndexStatus +3. Display formatted output: + ``` + Dedup: enabled=true, buffer_size=42, hit_rate=12.3%, events_skipped=15 + Ranking: avg_salience=0.65, high_salience_nodes=128, avg_usage_decay=0.89 + Vector: indexed=1234, ready=true + ``` + +## Success Criteria + +- [x] GetRankingStatus returns salience and usage metrics +- [x] `memory-daemon status --verbose` prints health summary +- [x] Metrics are computed efficiently (cached, not full scan every call) + +## Requirement Traceability + +| Requirement | Task | Verification | +|-------------|------|-------------| +| OBS-04 | Task 1 | Ranking metrics in RPC response | +| OBS-05 | Task 2 | Verbose CLI output | diff --git a/.planning/phases/43-episodic-schema-storage/43-01-PLAN.md b/.planning/phases/43-episodic-schema-storage/43-01-PLAN.md new file mode 100644 index 0000000..dcaa751 --- /dev/null +++ b/.planning/phases/43-episodic-schema-storage/43-01-PLAN.md @@ -0,0 +1,116 @@ +# Plan 43-01: Episode Schema, Storage, and Column Family + +**Phase:** 43 — Episodic Memory Schema & Storage +**Requirements:** EPIS-01, EPIS-02, EPIS-03 +**Wave:** 1 (independent of phases 39-42) + +## Goal + +Create persistent storage for task episodes with structured actions and outcomes in a new RocksDB column family. + +## Current State + +- 10 column families exist: events, toc_nodes, toc_latest, grips, outbox, checkpoints, topics, topic_links, topic_rels, usage_counters +- Pattern: constants in `column_families.rs`, listed in `ALL_CF_NAMES`, opened in `Storage::open()` +- Structs use serde JSON for serialization in RocksDB values +- Keys are string-based (e.g., ULID for events, node_id for TOC) + +## Tasks + +### Task 1: Define Episode and Action structs + +**Files:** `crates/memory-types/src/episode.rs` (new file) + +1. Create `Episode` struct: + ```rust + pub struct Episode { + pub episode_id: String, // ULID + pub task: String, + pub plan: Vec, + pub actions: Vec, + pub status: EpisodeStatus, // InProgress | Completed | Failed + pub outcome_score: Option, // 0.0-1.0, set on completion + pub lessons_learned: Vec, + pub failure_modes: Vec, + pub embedding: Option>, + pub value_score: Option, // computed from outcome_score + pub created_at: DateTime, + pub completed_at: Option>, + pub agent: Option, + } + ``` +2. Create `Action` struct: + ```rust + pub struct Action { + pub action_type: String, + pub input: String, + pub result: ActionResult, + pub timestamp: DateTime, + } + ``` +3. Create `ActionResult` enum: `Success(String)`, `Failure(String)`, `Pending` +4. Create `EpisodeStatus` enum: `InProgress`, `Completed`, `Failed` +5. Derive Serialize, Deserialize, Debug, Clone +6. Export from `crates/memory-types/src/lib.rs` + +### Task 2: Add CF_EPISODES column family + +**Files:** `crates/memory-storage/src/column_families.rs`, `crates/memory-storage/src/lib.rs` + +1. Add `pub const CF_EPISODES: &str = "episodes";` +2. Add to `ALL_CF_NAMES` array +3. Add ColumnFamilyDescriptor in `column_family_descriptors()` (same as existing pattern) +4. Storage will automatically open it on next startup + +### Task 3: Add episode storage operations + +**Files:** `crates/memory-storage/src/episodes.rs` (new file) + +1. Implement on `Storage`: + - `store_episode(episode: &Episode) -> Result<()>` — serialize to JSON, store in CF_EPISODES with episode_id as key + - `get_episode(episode_id: &str) -> Result>` — lookup by ID + - `list_episodes(limit: usize) -> Result>` — iterate CF, newest first + - `update_episode(episode: &Episode) -> Result<()>` — overwrite by ID + - `delete_episode(episode_id: &str) -> Result<()>` — remove by ID (for retention pruning) +2. Follow existing patterns (cf_handle, get/put, serde_json) +3. Add unit tests for round-trip serialization + +### Task 4: Add EpisodicConfig + +**Files:** `crates/memory-types/src/config.rs` + +1. Create `EpisodicConfig`: + ```rust + pub struct EpisodicConfig { + pub enabled: bool, // default false (opt-in) + pub value_threshold: f32, // default 0.18 + pub midpoint_target: f32, // default 0.65 + pub max_episodes: usize, // default 1000 + } + ``` +2. Wire into `MemoryConfig` with `[episodic]` section + +### Task 5: Add value scoring function + +**Files:** `crates/memory-types/src/episode.rs` + +1. Implement `Episode::calculate_value_score(outcome_score: f32, midpoint: f32) -> f32` +2. Formula: `(1.0 - (outcome_score - midpoint).abs()).max(0.0)` +3. Set `value_score` on episode completion +4. Unit tests for value scoring edge cases + +## Success Criteria + +- [x] Episode struct with all required fields +- [x] Action struct with action_type, input, result, timestamp +- [x] CF_EPISODES registered and episodes can be stored/retrieved by ID +- [x] Value score calculation works correctly +- [x] Config section `[episodic]` exists + +## Requirement Traceability + +| Requirement | Task | Verification | +|-------------|------|-------------| +| EPIS-01 | Task 1 | Episode struct complete | +| EPIS-02 | Task 1 | Action struct complete | +| EPIS-03 | Task 2, 3 | CF_EPISODES works | diff --git a/.planning/phases/43-episodic-schema-storage/43-01-SUMMARY.md b/.planning/phases/43-episodic-schema-storage/43-01-SUMMARY.md new file mode 100644 index 0000000..18efecc --- /dev/null +++ b/.planning/phases/43-episodic-schema-storage/43-01-SUMMARY.md @@ -0,0 +1,99 @@ +--- +phase: "43" +plan: "01" +subsystem: episodic-memory +tags: [episode, schema, storage, column-family, config] +dependency_graph: + requires: [] + provides: [Episode, Action, ActionResult, EpisodeStatus, EpisodicConfig, CF_EPISODES, episode-storage-ops] + affects: [memory-types, memory-storage] +tech_stack: + added: [] + patterns: [serde-default-backward-compat, pub-crate-field-access, ulid-keyed-cf-iteration] +key_files: + created: + - crates/memory-types/src/episode.rs + - crates/memory-storage/src/episodes.rs + modified: + - crates/memory-types/src/lib.rs + - crates/memory-types/src/config.rs + - crates/memory-storage/src/lib.rs + - crates/memory-storage/src/column_families.rs + - crates/memory-storage/src/db.rs +decisions: + - "ActionResult uses tagged enum (status+detail) for JSON clarity" + - "Storage.db made pub(crate) for cross-module CF access within memory-storage" + - "Value scoring uses midpoint-distance formula: (1.0 - |outcome - midpoint|).max(0.0)" + - "EpisodicConfig disabled by default (explicit opt-in like dedup)" + - "list_episodes uses reverse ULID iteration for newest-first ordering" +metrics: + duration: "8min" + completed: "2026-03-11" +--- + +# Phase 43 Plan 01: Episode Schema, Storage, and Column Family Summary + +Episode types, CF_EPISODES column family, CRUD storage ops, EpisodicConfig, and midpoint-distance value scoring for episodic memory foundation. + +## What Was Built + +### Episode Types (memory-types) +- `Episode` struct: episode_id, task, plan, actions, status, outcome/value scores, lessons, failure modes, embedding, timestamps, agent +- `Action` struct: action_type, input, result, timestamp +- `ActionResult` enum: Success(String), Failure(String), Pending (tagged JSON) +- `EpisodeStatus` enum: InProgress, Completed, Failed +- `Episode::calculate_value_score()` static method with midpoint-distance formula +- `Episode::complete()` and `Episode::fail()` convenience methods +- Full serde(default) on all optional fields for backward compatibility + +### CF_EPISODES Column Family (memory-storage) +- New `CF_EPISODES` constant added to column_families.rs +- Registered in ALL_CF_NAMES array and build_cf_descriptors() +- Default RocksDB Options (no special compaction needed) + +### Episode Storage Operations (memory-storage) +- `store_episode()` -- serialize to JSON, store in CF_EPISODES +- `get_episode()` -- lookup by episode_id +- `list_episodes(limit)` -- reverse ULID iteration for newest-first +- `update_episode()` -- overwrite by ID +- `delete_episode()` -- remove by ID +- Uses generic `put/get/delete` public API for store/get/delete +- Direct `db.iterator_cf` for reverse iteration (pub(crate) access) + +### EpisodicConfig (memory-types) +- `enabled` (bool, default false) -- explicit opt-in +- `value_threshold` (f32, default 0.18) -- minimum value for retention +- `midpoint_target` (f32, default 0.65) -- sweet spot for learning value +- `max_episodes` (usize, default 1000) -- retention limit +- Wired into Settings with `[episodic]` TOML section +- Validation for all fields + +## Test Results + +- memory-types: 91 tests passing (85 existing + 6 new) +- memory-storage: 42 tests passing (35 existing + 7 new) +- New tests cover: serialization roundtrip, backward compat, CRUD operations, newest-first ordering, value scoring edge cases, config validation + +## Deviations from Plan + +### Auto-fixed Issues + +**1. [Rule 3 - Blocking] Made Storage.db field pub(crate)** +- **Found during:** Task 3 +- **Issue:** episodes.rs needed direct RocksDB iterator access for reverse iteration, but Storage.db was private and inaccessible from sibling modules +- **Fix:** Changed `db: DB` to `pub(crate) db: DB` in Storage struct +- **Files modified:** crates/memory-storage/src/db.rs +- **Commit:** 71cbb83 + +**2. [Task consolidation] Tasks 1 and 5 merged** +- Value scoring function (Task 5) was implemented alongside Episode struct (Task 1) since `calculate_value_score` is a natural method on Episode. All required tests included. + +## Commits + +| Commit | Description | +|--------|-------------| +| 937a61d | feat(43-01): define Episode, Action, and ActionResult types | +| 0421c2e | feat(43-01): add CF_EPISODES column family for episodic memory | +| 71cbb83 | feat(43-01): add episode CRUD storage operations | +| bacb8a8 | feat(43-01): add EpisodicConfig with value scoring parameters | +| f7608d3 | chore(43-01): apply cargo fmt formatting fixes | diff --git a/.planning/phases/44-episodic-grpc-retrieval/44-01-PLAN.md b/.planning/phases/44-episodic-grpc-retrieval/44-01-PLAN.md new file mode 100644 index 0000000..434c00c --- /dev/null +++ b/.planning/phases/44-episodic-grpc-retrieval/44-01-PLAN.md @@ -0,0 +1,90 @@ +# Plan 44-01: Episode gRPC Proto Definitions and Handler + +**Phase:** 44 — Episodic Memory gRPC & Retrieval +**Requirements:** EPIS-04, EPIS-05, EPIS-06, EPIS-10 +**Wave:** 1 (depends on Phase 43) + +## Goal + +Define episodic memory proto messages and RPCs, implement handler for episode lifecycle operations. + +## Tasks + +### Task 1: Add proto definitions + +**Files:** `proto/memory.proto` + +1. Add Episode-related messages (field numbers >200): + ```protobuf + message StartEpisodeRequest { + string task = 1; + repeated string plan = 2; + string agent = 3; + } + message StartEpisodeResponse { + string episode_id = 1; + } + message RecordActionRequest { + string episode_id = 1; + string action_type = 2; + string input = 3; + string result = 4; + bool success = 5; + } + message RecordActionResponse { + uint32 action_count = 1; + } + message CompleteEpisodeRequest { + string episode_id = 1; + float outcome_score = 2; + repeated string lessons_learned = 3; + repeated string failure_modes = 4; + } + message CompleteEpisodeResponse { + float value_score = 1; + bool retained = 2; + } + ``` +2. Add RPCs to MemoryService: + ```protobuf + rpc StartEpisode(StartEpisodeRequest) returns (StartEpisodeResponse); + rpc RecordAction(RecordActionRequest) returns (RecordActionResponse); + rpc CompleteEpisode(CompleteEpisodeRequest) returns (CompleteEpisodeResponse); + ``` + +### Task 2: Implement EpisodeHandler + +**Files:** `crates/memory-service/src/episodes.rs` (new file) + +1. Create `EpisodeHandler` following `AgentDiscoveryHandler`/`TopicGraphHandler` pattern +2. Hold `Arc` and `EpisodicConfig` +3. Implement: + - `start_episode()` — create Episode with ULID, store in CF_EPISODES, return ID + - `record_action()` — load episode, append Action, store updated episode + - `complete_episode()` — load episode, set outcome_score/lessons/failure_modes, compute value_score, store + +### Task 3: Wire into MemoryServiceImpl + +**Files:** `crates/memory-service/src/ingest.rs` + +1. Add `episode_service: Option>` field +2. Add `with_episodes()` constructor or extend `with_all_services_and_topics()` +3. Implement MemoryService trait methods for new RPCs +4. Route to EpisodeHandler + +## Success Criteria + +- [x] Proto definitions compile and generate Rust types +- [x] StartEpisode creates episode and returns ID +- [x] RecordAction appends action to episode +- [x] CompleteEpisode finalizes with outcome score and value score +- [x] Config section controls enabled/disabled + +## Requirement Traceability + +| Requirement | Task | Verification | +|-------------|------|-------------| +| EPIS-04 | Task 1, 2 | StartEpisode RPC works | +| EPIS-05 | Task 1, 2 | RecordAction RPC works | +| EPIS-06 | Task 1, 2 | CompleteEpisode RPC works | +| EPIS-10 | Task 3 | Config section controls behavior | diff --git a/.planning/phases/44-episodic-grpc-retrieval/44-02-PLAN.md b/.planning/phases/44-episodic-grpc-retrieval/44-02-PLAN.md new file mode 100644 index 0000000..1641169 --- /dev/null +++ b/.planning/phases/44-episodic-grpc-retrieval/44-02-PLAN.md @@ -0,0 +1,75 @@ +# Plan 44-02: Similar Episode Search and Value-Based Retention + +**Phase:** 44 — Episodic Memory gRPC & Retrieval +**Requirements:** EPIS-07, EPIS-08, EPIS-09 +**Wave:** 2 (depends on 44-01) + +## Goal + +Implement vector similarity search for episodes and value-based retention policy. + +## Tasks + +### Task 1: Add GetSimilarEpisodes proto and handler + +**Files:** `proto/memory.proto`, `crates/memory-service/src/episodes.rs` + +1. Add proto messages: + ```protobuf + message GetSimilarEpisodesRequest { + string query = 1; + uint32 top_k = 2; + float min_score = 3; + } + message EpisodeSummary { + string episode_id = 1; + string task = 2; + float outcome_score = 3; + float similarity = 4; + repeated string lessons_learned = 5; + repeated string failure_modes = 6; + float value_score = 7; + } + message GetSimilarEpisodesResponse { + repeated EpisodeSummary episodes = 1; + } + ``` +2. Add RPC: `rpc GetSimilarEpisodes(GetSimilarEpisodesRequest) returns (GetSimilarEpisodesResponse);` +3. Implement handler: + - Embed the query using CandleEmbedder + - Iterate episodes, compute cosine similarity against episode embeddings + - Return top_k sorted by similarity + - (Future optimization: build HNSW index for episodes when count is high) + +### Task 2: Generate episode embeddings on completion + +**Files:** `crates/memory-service/src/episodes.rs` + +1. On `CompleteEpisode`, embed the task + lessons as combined text +2. Store embedding in Episode.embedding field +3. Use CandleEmbedder (same as dedup gate pattern) + +### Task 3: Value-based retention pruning + +**Files:** `crates/memory-service/src/episodes.rs` + +1. After completing an episode, check if total episodes > max_episodes +2. If so, find episodes with value_score < value_threshold +3. Delete lowest-value episodes until under max_episodes +4. Value formula: `(1.0 - (outcome_score - midpoint_target).abs()).max(0.0)` +5. Episodes near 0.65 outcome retained longest (most learning value) + +## Success Criteria + +- [x] GetSimilarEpisodes returns past episodes ranked by similarity +- [x] Episode embeddings generated on completion +- [x] Low-value episodes pruned when over max_episodes +- [x] Retention threshold configurable + +## Requirement Traceability + +| Requirement | Task | Verification | +|-------------|------|-------------| +| EPIS-07 | Task 1 | GetSimilarEpisodes RPC works | +| EPIS-08 | Task 3 | Value scoring works (0.65 sweet spot) | +| EPIS-09 | Task 3 | Retention threshold applied | diff --git a/.planning/phases/44-episodic-grpc-retrieval/44-03-PLAN.md b/.planning/phases/44-episodic-grpc-retrieval/44-03-PLAN.md new file mode 100644 index 0000000..d9ee258 --- /dev/null +++ b/.planning/phases/44-episodic-grpc-retrieval/44-03-PLAN.md @@ -0,0 +1,47 @@ +# Plan 44-03: Episodic Memory E2E Tests + +**Phase:** 44 — Episodic Memory gRPC & Retrieval +**Requirements:** EPIS-11, EPIS-12 +**Wave:** 3 (depends on 44-01, 44-02) + +## Goal + +E2E tests proving full episode lifecycle and value-based retention. + +## Tasks + +### Task 1: Create episodic_test.rs + +**Files:** `crates/e2e-tests/tests/episodic_test.rs` + +1. **Episode lifecycle test (EPIS-11):** + - Start episode with task description + - Record 3 actions (2 success, 1 failure) + - Complete episode with outcome_score=0.7, lessons, failure_modes + - Verify episode stored with all fields + - Search by similarity — verify completed episode returned + +2. **Value-based retention test (EPIS-12):** + - Create multiple episodes with varying outcome scores (0.1, 0.3, 0.65, 0.9, 1.0) + - Compute value scores — verify 0.65 has highest value + - Set max_episodes low enough to trigger pruning + - Verify low-value episodes (0.1, 1.0) pruned first + - Verify 0.65 episode retained + +3. **Disabled config test:** + - With `[episodic] enabled = false`, verify RPCs return appropriate error/empty + - Verify no CF_EPISODES operations when disabled + +## Success Criteria + +- [x] Full lifecycle: start → record → complete → search works +- [x] Value-based retention correctly identifies high/low value episodes +- [x] Pruning removes low-value episodes when over capacity +- [x] All tests pass with `cargo test -p e2e-tests` + +## Requirement Traceability + +| Requirement | Task | Verification | +|-------------|------|-------------| +| EPIS-11 | Task 1 | Lifecycle E2E test passes | +| EPIS-12 | Task 2 | Retention E2E test passes | diff --git a/.planning/research/ARCHITECTURE.md b/.planning/research/ARCHITECTURE.md index bb11750..8e9418d 100644 --- a/.planning/research/ARCHITECTURE.md +++ b/.planning/research/ARCHITECTURE.md @@ -1,485 +1,950 @@ -# Architecture Patterns +# Architecture: v2.6 Episodic Memory, Ranking, & Lifecycle Integration -**Domain:** Semantic deduplication and stale result filtering for Agent Memory v2.5 -**Researched:** 2026-03-05 -**Confidence:** HIGH (based on direct codebase analysis of all relevant source files) +**Project:** Agent Memory (Rust-based cognitive architecture for agents) +**Researched:** 2026-03-11 +**Scope:** How episodic memory, salience/usage ranking, lifecycle automation, observability, and hybrid search integrate with existing v2.5 architecture +**Confidence:** HIGH (direct codebase analysis + existing handler/storage patterns) -## Current Architecture (Baseline) +--- -### Write Path (Ingest) +## Executive Summary -``` -Hook Handler - | - v -gRPC IngestEvent RPC (memory-service/ingest.rs) - | - +--> Validate event_id, session_id - +--> Convert proto Event -> domain Event - +--> Serialize event bytes - +--> Create OutboxEntry::for_toc(event_id, timestamp_ms) - +--> storage.put_event(event_id, event_bytes, outbox_bytes) [ATOMIC] - +--> Return IngestEventResponse { event_id, created } -``` +Agent Memory v2.5 ships with a complete 6-layer retrieval stack (TOC, agentic search, BM25, vector, topic graph, ranking) backed by RocksDB and managed by a Tokio scheduler. v2.6 adds **four orthogonal capabilities** that integrate cleanly with existing architecture: + +1. **Episodic Memory** — New CF_EPISODES + Episode proto + 4 RPCs for recording/retrieving task outcomes +2. **Ranking Quality** — Existing salience (v2.5) + new usage-tracking + StaleFilter decay + ranking payload composition +3. **Lifecycle Automation** — Extend scheduler with vector/BM25 pruning jobs (RPC stubs exist, logic needed) +4. **Observability** — Extend admin RPCs to expose dedup metrics, ranking stats, episode health + +**Key insight:** All new features plug into existing patterns—handlers with Arc, new column families, scheduler jobs. **No architectural rewrite.** Complexity is *additive, not structural*. + +--- -**Key observation:** The ingest handler is synchronous relative to the caller. It writes the event and outbox entry atomically to RocksDB, then returns. There is NO dedup check in the current write path. +## System Architecture (v2.5 → v2.6) -### Async Index Path (Outbox Consumer) +### Current Component Layout ``` -Scheduler (memory-scheduler) triggers indexing job periodically - | - v -IndexingPipeline.process_batch(batch_size) - | - +--> storage.get_outbox_entries(start_sequence, limit) - +--> For each registered IndexUpdater: - | +--> Filter entries this updater hasn't seen (checkpoint tracking) - | +--> updater.index_document(entry) -- BM25 or Vector - | +--> Track success/error/skip per entry - +--> Commit all indexes - +--> Update checkpoints - +--> Save checkpoints to RocksDB +┌─────────────────────────────────────────────────────────────────────┐ +│ memory-daemon │ +├─────────────────────────────────────────────────────────────────────┤ +│ gRPC Service Layer (MemoryServiceImpl) │ +│ ├─ IngestEventHandler (+ DedupGate + StorageHandler) │ +│ ├─ QueryHandler (TOC navigation) │ +│ ├─ SearchHandler (SearchNode, SearchChildren) │ +│ ├─ TeleportHandler (BM25 full-text) │ +│ ├─ VectorHandler (Vector HNSW similarity) │ +│ ├─ HybridHandler (BM25 + Vector fusion) │ +│ ├─ TopicGraphHandler (HDBSCAN clustering) │ +│ ├─ RetrievalHandler (Intent routing + fallbacks) │ +│ ├─ AgentDiscoveryHandler (Multi-agent queries) │ +│ ├─ SchedulerGrpcService (Job status + control) │ +│ └─ [v2.6] EpisodeHandler [NEW] │ +├─────────────────────────────────────────────────────────────────────┤ +│ Background Scheduler (tokio-cron-scheduler) │ +│ ├─ outbox_processor (30s) — Queue → TOC updates │ +│ ├─ index_sync (5m) — TOC → BM25 + Vector │ +│ ├─ topic_refresh (1h) — Vector embeddings → HDBSCAN │ +│ ├─ rollup (daily 3am) — Day → Week → Month → Year │ +│ ├─ compaction (weekly Sun 4am) — RocksDB + Tantivy optimize │ +│ ├─ [v2.6] vector_prune (configurable) [NEW JOB] │ +│ └─ [v2.6] bm25_prune (configurable) [NEW JOB] │ +├─────────────────────────────────────────────────────────────────────┤ +│ Storage Layer (RocksDB + Indexes) │ +│ ├─ RocksDB Column Families (9 existing + 2 new) │ +│ │ ├─ CF_EVENTS (append-only conversation events) │ +│ │ ├─ CF_TOC_NODES (versioned TOC hierarchy) │ +│ │ ├─ CF_TOC_LATEST (version pointers) │ +│ │ ├─ CF_GRIPS (excerpt provenance) │ +│ │ ├─ CF_OUTBOX (async queue) │ +│ │ ├─ CF_CHECKPOINTS (job crash recovery) │ +│ │ ├─ CF_TOPICS (HDBSCAN clusters) │ +│ │ ├─ CF_TOPIC_LINKS (topic-node associations) │ +│ │ ├─ CF_TOPIC_RELS (inter-topic relationships) │ +│ │ ├─ CF_USAGE_COUNTERS (access tracking for ranking) │ +│ │ ├─ [v2.6] CF_EPISODES [NEW] │ +│ │ └─ [v2.6] CF_EPISODE_METRICS [NEW] │ +│ ├─ External Indexes │ +│ │ ├─ Tantivy BM25 (full-text search) │ +│ │ └─ usearch HNSW (vector similarity) │ +│ └─ [v2.6] Usage Metrics (extended CF_USAGE_COUNTERS) │ +└─────────────────────────────────────────────────────────────────────┘ ``` -**Registered updaters:** BM25Updater (Tantivy), VectorIndexUpdater (usearch HNSW) +--- -### Vector Indexing Details +## Component Boundaries & Responsibilities +### EpisodeHandler (NEW) + +**Location:** `crates/memory-service/src/episode.rs` + +**Responsibility:** Manage episode lifecycle (start, record actions, complete, retrieve similar) + +**Storage Access:** +- Write: `CF_EPISODES` (immutable append) +- Read: `CF_EPISODES`, vector index (similarity search) +- Query: `GetSimilarEpisodes` uses HNSW to find semantically related past episodes + +**Data Structures:** +```rust +pub struct EpisodeHandler { + storage: Arc, + vector_handler: Option>, // For similarity search + classifier: EpisodeValueClassifier, // Compute outcome score +} + +pub struct Episode { + pub episode_id: String, // ULID + pub start_time_ms: i64, + pub end_time_ms: i64, + pub actions: Vec, + pub outcome_description: String, + pub value_score: f32, // 0.0-1.0 (importance for retention) + pub retention_policy: RetentionPolicy, + pub context_grip_ids: Vec, // Links to TOC grips for context + pub agent_id: String, // v2.1 multi-agent support +} ``` -VectorIndexUpdater.process_entry(outbox_entry) - | - +--> If action == IndexEvent: - | +--> find_grip_for_event(event_id) [currently returns None - simplified] - | +--> If grip found: index_grip(grip) - | +--> Check metadata for existing doc_id (skip if exists) - | +--> embedder.embed(text) [CandleEmbedder, all-MiniLM-L6-v2] - | +--> metadata.next_vector_id() - | +--> hnsw_index.add(vector_id, embedding) - | +--> metadata.put(VectorEntry) - | - +--> If action == UpdateToc: - | +--> Skip (vector updater only handles IndexEvent) -``` -**Key observation:** The vector index is populated from TOC nodes and grips AFTER they are created by the segmenter/summarizer. The current `find_grip_for_event` is a simplified stub returning None. In practice, TOC nodes are indexed when the `index_node` method is called directly during rebuild operations. +**RPCs Implemented:** +1. `StartEpisode(description, agent_id)` → Generate episode_id, allocate record +2. `RecordAction(episode_id, action)` → Append action (tool_use, decision, feedback) +3. `CompleteEpisode(episode_id, outcome, value_score)` → Finalize, store immutably +4. `GetSimilarEpisodes(query, limit)` → Find past episodes with similar goals/outcomes + +**Pattern:** Handler receives Arc, owns internal state (classifier), returns domain objects mapped to proto responses. Same pattern as RetrievalHandler, AgentDiscoveryHandler. + +--- + +### RankingPayloadBuilder (ENHANCEMENT) + +**Location:** `crates/memory-service/src/ranking.rs` [NEW FILE] + +**Responsibility:** Compose ranking signals (salience, usage decay, stale penalty) into explainable breakdown -### Read Path (Retrieval) +**Inputs:** +- TocNode.salience_score (computed at write time, v2.5) ✓ +- UsageStats from CF_USAGE_COUNTERS (access_count, last_accessed_ms) +- StaleFilter output (time-decay penalty based on MemoryKind exemptions) +**Output:** RankingPayload +```rust +pub struct RankingPayload { + pub salience_score: f32, // 0.0-1.0+ (from node) + pub usage_adjusted_score: f32, // e^(-elapsed_days / half_life) + pub stale_penalty: f32, // 0.0-0.3 (time-decay, capped) + pub final_score: f32, // salience × usage × (1 - stale) + pub explanation: String, // "salience=0.8, usage=0.9, stale=0.05 → final=0.67" +} ``` -RouteQuery RPC - | - v -RetrievalHandler - +--> Classify intent (Explore/Answer/Locate/TimeBoxed) - +--> Detect capability tier (Full/Hybrid/Semantic/Keyword/Agentic) - +--> Build FallbackChain for intent+tier - +--> RetrievalExecutor.execute(query, chain, conditions, mode, tier) - | - +--> Sequential: Try layers in order, stop at sufficient results - +--> Parallel: Execute beam_width layers concurrently, pick best - +--> Hybrid: Parallel first, sequential fallback - | - +--> Each layer returns SearchResult { doc_id, score, text_preview, ... } - +--> Dedup by doc_id (in merge_results) - +--> Return ExecutionResult with explainability + +**Formula:** ``` +final_score = salience_score × usage_adjusted_score × (1.0 - stale_penalty) -### Ranking Composition (Layer 6) +where: + usage_adjusted_score = e^(-elapsed_days / 30) [30-day half-life] + stale_penalty = StaleFilter.compute(...) [0.0-0.3 cap from v2.5] +``` -Current ranking components applied at different stages: +**Integration Point:** TeleportResult proto extended with optional RankingPayload field. Returned in TeleportSearch, VectorTeleport, HybridSearch RPCs. Used by RouteQuery for explainability (skill contracts). -| Component | Stage | Location | Formula | -|-----------|-------|----------|---------| -| Salience | Write-time | `SalienceScorer` in memory-types | `0.35 + length_density + kind_boost + pinned_boost` | -| Usage decay | Read-time | `usage_penalty()` in memory-types | `score * 1/(1 + decay_factor * access_count)` | -| Novelty | Ingest-time | `NoveltyChecker` in memory-service | Cosine similarity gate (opt-in, fail-open) | +--- -### Existing Novelty Checker (Important Precedent) +### ObservabilityHandler (ENHANCEMENT) -The system ALREADY has a `NoveltyChecker` in `memory-service/src/novelty.rs` that: -- Is **disabled by default** (opt-in via `NoveltyConfig.enabled`) -- Uses **fail-open** semantics (any failure -> store the event) -- Follows a **gate pattern**: check before store, but never block -- Has configurable **threshold** (default 0.82), **timeout** (default 50ms), **min_text_length** (default 50) -- Tracks detailed **metrics** (skipped_disabled, skipped_no_embedder, skipped_timeout, stored_novel, rejected_duplicate) -- Uses `EmbedderTrait` and `VectorIndexTrait` abstractions for testability +**Location:** Extend existing handlers in `crates/memory-service/src/retrieval.rs` and new file -**This is the foundation for dedup.** The NoveltyChecker IS a semantic dedup gate. The question is: does it need enhancement, or is the timing gap the only real issue? +**Changes:** +- **GetRankingStatus** → Add breakdown: active_salience_kinds, usage_distribution (histogram), stale_decay_active_count +- **GetDedupStatus** → Add: buffer_memory_bytes, dedup_rate_24h_percent, cross_session_dedup_count +- **[NEW] GetEpisodeMetrics** → total_episodes, completion_rate, average_value_score, retention_distribution -## Recommended Architecture for v2.5 +**Data Flow:** Read aggregates from storage + CF_USAGE_COUNTERS + CF_EPISODES + checkpoints. No separate metrics store. Computed on-demand (single source of truth). -### Design Principle: Dedup IS Enhanced Novelty +--- -The existing `NoveltyChecker` already implements the core dedup pattern. Rather than building a parallel system, enhance it: +### Lifecycle Jobs (NEW) -1. **Evolve** the NoveltyChecker to handle the timing gap (core architectural challenge) -2. **Add stale filtering** as a new read-time ranking component -3. **Keep the same fail-open, opt-in, metric-rich patterns** +**EpisodeRetentionJob** — `crates/memory-scheduler/src/jobs/episode_retention.rs` [NEW FILE] -### Component 1: DedupGate (Enhanced NoveltyChecker) +```rust +pub struct EpisodeRetentionJob { + storage: Arc, + config: EpisodeRetentionConfig, +} -**Location:** `memory-service/src/novelty.rs` (extend existing) +pub struct EpisodeRetentionConfig { + pub max_episode_age_days: u32, // e.g., 180 + pub value_score_threshold: f32, // e.g., 0.3 (delete if < 0.3) + pub retention_policies: HashMap, +} -**The Timing Problem:** -The vector index is built asynchronously from the outbox. When event N arrives, events N-1, N-2, etc. may not yet be in the HNSW index. Two near-simultaneous duplicate events will BOTH pass the dedup check because neither sees the other in the index. +impl EpisodeRetentionJob { + pub async fn execute(&self) -> Result { + // 1. Scan CF_EPISODES with prefix "ep:" + // 2. For each episode: + // age_days = (now_ms - start_time_ms) / (86400 * 1000) + // if age_days > max_episode_age_days AND value_score < threshold: + // mark_for_deletion() + // 3. Write checkpoint: epmet:retention_sweep_{date} + // 4. Return { deleted_count, retained_count } + } +} +``` -**Solution: Two-tier dedup with in-flight buffer** +**Extends Scheduler:** Register with cron schedule (e.g., daily 2am), overlap policy (Skip), jitter (60s). Uses checkpoint pattern for crash recovery. -``` -IngestEvent RPC - | - v -DedupGate (enhanced NoveltyChecker) - | - +--> GATE 1: Config enabled? (fail-open if disabled) - +--> GATE 2: Text long enough? (skip short events) - +--> GATE 3: Embedder available? (fail-open if not) - | - +--> Generate embedding for incoming event - | - +--> CHECK A: In-flight buffer (recent embeddings not yet indexed) - | +--> Linear scan of buffer (bounded size, e.g., 256 entries) - | +--> Cosine similarity against each buffered embedding - | +--> If max_similarity > threshold -> REJECT as duplicate - | - +--> CHECK B: HNSW index (historical indexed content) - | +--> hnsw_index.search(embedding, k=1) - | +--> If top_score > threshold -> REJECT as duplicate - | - +--> If novel: - | +--> Add embedding to in-flight buffer (with TTL/max-size eviction) - | +--> Return STORE - | - +--> If duplicate: - +--> Increment rejected_duplicate metric - +--> Return SKIP (event NOT stored) -``` +--- -**In-flight buffer design:** +**VectorPruneJob** — `crates/memory-scheduler/src/jobs/vector_prune.rs` [EXTEND] ```rust -struct InFlightBuffer { - entries: VecDeque, - max_size: usize, // Default: 256 - max_age: Duration, // Default: 5 minutes +pub struct VectorPruneJobConfig { + pub retention_days: u32, // e.g., 90 + pub min_vectors_keep: u32, // safety limit } -struct InFlightEntry { - event_id: String, - embedding: Vec, - timestamp: Instant, - session_id: String, +impl VectorPruneJob { + pub async fn execute(&self) -> Result { + // 1. Read usearch index metadata (directory listing) + // 2. Extract embedding IDs + timestamps (from metadata file) + // 3. Mark for deletion if: timestamp < (now - retention_days) + // 4. Rebuild HNSW without marked vectors (usearch API) + // 5. Update CF_VECTOR_INDEX metadata pointer + // 6. Checkpoint: vector_prune_{date}_removed={count} + } } ``` -**Why this works:** -- The buffer catches duplicates that arrive faster than the indexing pipeline -- Buffer is small (256 entries x 384 dims x 4 bytes = ~400KB) -- trivial memory -- Linear scan of 256 vectors is <1ms -- well within the 50ms timeout -- Buffer entries auto-evict when old enough that they should be in the index -- Buffer is session-scoped (optional): only check within same session for tighter dedup +**Rationale:** Index rebuild is expensive. Copy-on-write pattern: new HNSW built in temp dir, pointer swapped atomically. Readers see no downtime. -**Why NOT a separate index:** -- A second HNSW index adds complexity (two indexes to maintain/rebuild) -- The in-flight window is short (seconds to minutes), linear scan is fast enough -- Buffer entries naturally age out as the indexing pipeline catches up +--- -### Component 2: StaleFilter (New Read-Time Ranking Component) +## Data Flow: New Capabilities -**Location:** New file `memory-service/src/stale.rs` or integrated into retrieval pipeline +### Episodic Recording Flow -**What is "stale"?** A result is stale when newer content semantically supersedes it. For example: -- "We decided to use PostgreSQL" superseded by "We switched to RocksDB" -- "JWT tokens expire in 1 hour" superseded by "JWT tokens now expire in 24 hours" +``` +Skill calls: rpc StartEpisode(StartEpisodeRequest) + request = { description: "Debug JWT token expiration", agent_id: "claude-code" } + + ↓ MemoryServiceImpl routes to EpisodeHandler + +EpisodeHandler.start_episode(request) + ├─ Generate episode_id = ULID() + ├─ record start_time_ms = now() + ├─ key = format!("ep:{:013}:{}", start_time_ms, episode_id) + ├─ episode = Episode { episode_id, start_time_ms, actions: [], ... } + ├─ storage.put_cf(CF_EPISODES, key, serde_json::to_bytes(&episode))? + └─ return StartEpisodeResponse { episode_id, start_time_ms } + + ↓ Skill now has episode_id, can record actions + +Skill calls: rpc RecordAction(RecordActionRequest) + request = { episode_id, action: EpisodeAction { action_type: TOOL_USE, ... } } + + ↓ EpisodeHandler.record_action(request) + ├─ Fetch episode from CF_EPISODES + ├─ if episode.end_time_ms > 0: return Err(EpisodeAlreadyCompleted) + ├─ Append action to episodes.actions + ├─ storage.put_cf(CF_EPISODES, same_key, updated_bytes)? [UPDATE existing] + └─ return RecordActionResponse { recorded: true } + + ↓ Repeat RecordAction for each tool_use, decision, etc. + +Skill calls: rpc CompleteEpisode(CompleteEpisodeRequest) + request = { episode_id, outcome_description: "Fixed JWT", value_score: 0.9, retention: KEEP_HIGH_VALUE } + + ↓ EpisodeHandler.complete_episode(request) + ├─ Fetch episode from CF_EPISODES + ├─ episode.end_time_ms = now() + ├─ episode.outcome_description = "Fixed JWT" + ├─ episode.value_score = 0.9 + ├─ episode.retention_policy = KEEP_HIGH_VALUE + ├─ storage.put_cf(CF_EPISODES, key, bytes)? [FINALIZE, immutable] + ├─ Optional: Generate embedding of outcome_description via Candle + ├─ Optional: Add to vector index for GetSimilarEpisodes + └─ return CompleteEpisodeResponse { completed: true } +``` + +--- -**Approach: Timestamp-based decay with semantic overlap detection** +### Episodic Retrieval Flow ``` -RetrievalExecutor returns raw results - | - v -StaleFilter (post-retrieval, pre-return) - | - +--> For each result pair (i, j) where i.timestamp < j.timestamp: - | +--> If cosine_similarity(i.embedding, j.embedding) > overlap_threshold: - | +--> Mark i as "superseded by j" - | +--> Apply staleness penalty to i.score - | - +--> Apply time-based decay: - | +--> age_days = (now - result.timestamp).days() - | +--> decay = 1.0 / (1.0 + staleness_decay_factor * age_days) - | +--> result.score *= decay - | - +--> Re-sort results by adjusted score - +--> Return filtered results +Skill calls: rpc GetSimilarEpisodes(GetSimilarEpisodesRequest) + request = { query: "How do we handle JWT expiration?", limit: 10, agent_id: "claude-code" } + + ↓ EpisodeHandler.get_similar_episodes(request) + ├─ Embed query using Candle (all-MiniLM-L6-v2) + ├─ Search usearch HNSW for similar embeddings (up to limit results) + ├─ Collect matching episode_ids from search results + ├─ Scan CF_EPISODES for matching episodes + ├─ Score by: embedding_similarity (0.0-1.0) + recency_boost + value_score + ├─ Sort by final_score descending + ├─ Build EpisodeSummary objects: + │ { + │ episode_id, + │ start_time_ms, + │ outcome_description: "Fixed JWT", + │ value_score: 0.9, + │ action_count: 7, + │ context_grip_ids: [grip_1, grip_2] ← Links to TOC for full context + │ } + └─ return GetSimilarEpisodesResponse { episodes: [summary_1, ...] } + + ↓ Skill inspects results, decides to expand context + +Skill calls: rpc ExpandGrip(ExpandGripRequest) + request = { grip_id: "grip_1" } [from context_grip_ids] + + ↓ Existing ExpandGrip RPC (v2.5) + ├─ Fetch Grip from CF_GRIPS + ├─ Get event_ids from grip.event_id_start..event_id_end + ├─ GetEvents returns raw events + context + └─ Skill now has full transcript of that episode step-by-step ``` -**Integration with existing ranking:** +--- + +### Ranking Composition Flow ``` -Final score = base_score - * salience_factor (write-time, from SalienceScorer) - * usage_penalty (read-time, from usage tracking) - * staleness_factor (read-time, NEW) +Skill calls: rpc RouteQuery(RouteQueryRequest) + request = { query: "What did we learn about dedup?", mode: SEQUENTIAL } + + ↓ RetrievalHandler.route_query(request) + ├─ ClassifyIntent(query) → Intent::Explore + ├─ TierDetector() → CapabilityTier::Five (all layers available) + ├─ FallbackChain::for_intent(...) → [AgenticTOC, BM25, Vector, Topics] + │ + ├─ Execute each layer (example: BM25) + │ └─ TeleportSearch(query) → [TocNode_1, TocNode_2, TocNode_3] + │ + └─ For EACH result TocNode: + ├─ RankingPayloadBuilder.build_for_node(node) + │ + │ ├─ Read salience_score from node (pre-computed at write time, v2.5) + │ │ salience_score = 0.8 + │ │ + │ ├─ Query CF_USAGE_COUNTERS for node.node_id + │ │ access_count = 5 + │ │ last_accessed_ms = 1710078000000 (3 days ago) + │ │ + │ ├─ Compute usage_adjusted_score: + │ │ elapsed_days = (now - last_accessed_ms) / (86400 * 1000) = 3 + │ │ usage_adjusted = e^(-3 / 30) = e^(-0.1) = 0.905 + │ │ + │ ├─ Call StaleFilter.compute_penalty(node.timestamp_ms, node.memory_kind) + │ │ timestamp_ms = 1709900000000 (11 days ago) + │ │ memory_kind = Constraint (exempt from decay, so penalty = 0.0) + │ │ stale_penalty = 0.0 + │ │ + │ ├─ Compute final_score: + │ │ final_score = 0.8 × 0.905 × (1.0 - 0.0) = 0.724 + │ │ + │ ├─ Build explanation: + │ │ "salience=0.8, usage_adjusted=0.905, stale_penalty=0.0 → final=0.724" + │ │ + │ └─ Return RankingPayload { + │ salience_score: 0.8, + │ usage_adjusted_score: 0.905, + │ stale_penalty: 0.0, + │ final_score: 0.724, + │ explanation: "..." + │ } + │ + └─ TeleportResult.ranking_payload = ABOVE + + ↓ Results sorted by final_score, returned with ranking_payload + +Skill receives: [ + { node: TocNode_1, rank: 0.724, ranking_payload: { explanation: "..." } }, + { node: TocNode_2, rank: 0.618, ranking_payload: { explanation: "..." } }, + { node: TocNode_3, rank: 0.501, ranking_payload: { explanation: "..." } }, +] + +Skill inspects ranking_payload.explanation: + → "Node 1 high because dedup Constraint (exempt from decay) + high salience + recent access" ``` -**Where staleness_factor:** -``` -staleness_factor = time_decay * supersession_penalty +--- -time_decay = 1.0 / (1.0 + staleness_decay * age_days) -supersession_penalty = if superseded { 0.3 } else { 1.0 } +### Lifecycle Sweep Flow + +``` +Scheduler fires: EpisodeRetentionJob (daily 2am) + + ↓ EpisodeRetentionJob.execute() + ├─ Load config: max_age=180 days, threshold=0.3 + ├─ Load checkpoint from CF_EPISODE_METRICS (resume position) + │ + ├─ Scan CF_EPISODES with prefix "ep:" starting from checkpoint + │ For EACH episode: + │ ├─ Parse key: ep:{ts:13}:{ulid} + │ ├─ Deserialize Episode + │ ├─ Compute age_days = (now_ms - start_time_ms) / (86400 * 1000) + │ │ + │ └─ If age_days > 180 AND value_score < 0.3: + │ └─ Delete (storage.delete_cf(CF_EPISODES, key)?) + │ [NOTE: RocksDB doesn't delete in place; tombstone + compaction] + │ + ├─ Write checkpoint: CF_EPISODE_METRICS[ "epmet:retention_sweep_2026_03_11" ] + │ checkpoint = { last_episode_checked: 1234, episodes_deleted: 42, timestamp_ms: now } + │ + └─ Return JobResult { + status: Success, + message: "Deleted 42 low-value episodes older than 180 days", + metadata: { deleted_count: 42, retained_count: 1058 } + } + + ↓ Scheduler records result in JobRegistry (for GetSchedulerStatus RPC) + +Scheduler fires: VectorPruneJob (weekly Sunday 1am) + + ↓ VectorPruneJob.execute() + ├─ Load config: retention_days=90 + ├─ Read usearch index metadata: + │ ├─ Open index directory: {db_path}/usearch/ + │ ├─ Read metadata file containing embedding_id → timestamp mappings + │ └─ Collect embeddings with timestamp < (now - 90 days) + │ + ├─ Rebuild HNSW WITHOUT marked embeddings: + │ ├─ Create temp directory: {db_path}/usearch.tmp/ + │ ├─ usearch::new_index(dimension=384) in temp dir + │ ├─ For EACH embedding in original index: + │ │ if NOT marked_for_deletion: + │ │ new_index.add(embedding_id, vector) + │ ├─ Write new index to temp dir + │ └─ Atomic rename: {db_path}/usearch.tmp/ → {db_path}/usearch/ + │ [Safe: readers hold RwLock on directory pointer] + │ + ├─ Update CF_VECTOR_INDEX metadata: + │ metadata = { index_path: ..., last_prune_ts: now, vectors_count: new_count } + │ storage.put_cf(CF_VECTOR_INDEX, "vec:meta", metadata)? + │ + ├─ Write checkpoint: CF_EPISODE_METRICS[ "epmet:vector_prune_2026_03_11" ] + │ checkpoint = { vectors_removed: 123, new_size_mb: 456, timestamp_ms: now } + │ + └─ Return JobResult { + status: Success, + message: "Removed 123 vectors older than 90 days, new index size 456 MB", + metadata: { vectors_removed: 123 } + } ``` -**Configuration:** +--- -```rust -pub struct StaleConfig { - /// Whether stale filtering is enabled (default: true for v2.5) - pub enabled: bool, - /// Cosine similarity threshold for considering two results as covering same topic - /// Range: 0.0-1.0, higher = stricter (default: 0.85) - pub overlap_threshold: f32, - /// Decay factor for time-based staleness (default: 0.01) - /// Higher = more aggressive time penalty - pub decay_factor: f32, - /// Score multiplier when result is superseded (default: 0.3) - pub superseded_penalty: f32, - /// Minimum age in days before time decay kicks in (default: 7) - pub grace_period_days: u32, +## Integration Points: Proto, Storage, Scheduler + +### 1. Proto Additions (memory.proto) + +**New enums:** +```protobuf +enum EpisodeStatus { + STATUS_UNSPECIFIED = 0; + STATUS_ACTIVE = 1; + STATUS_COMPLETED = 2; + STATUS_FAILED = 3; +} + +enum ActionType { + ACTION_UNSPECIFIED = 0; + ACTION_TOOL_USE = 1; + ACTION_DECISION = 2; + ACTION_OUTCOME = 3; + ACTION_FEEDBACK = 4; +} + +enum RetentionPolicy { + POLICY_UNSPECIFIED = 0; + POLICY_KEEP_ALL = 1; + POLICY_KEEP_HIGH_VALUE = 2; + POLICY_TIME_DECAY = 3; } ``` -### Component Boundaries +**New messages:** +```protobuf +message EpisodeAction { + int64 timestamp_ms = 1; + ActionType action_type = 2; + string description = 3; + map metadata = 4; // tool_name, input, output, etc. +} -| Component | Responsibility | Communicates With | Crate | -|-----------|---------------|-------------------|-------| -| DedupGate (enhanced NoveltyChecker) | Reject semantically duplicate events at ingest | Embedder, HNSW index, InFlightBuffer | memory-service | -| InFlightBuffer | Track recent un-indexed embeddings for dedup gap | DedupGate only (internal) | memory-service | -| StaleFilter | Downrank superseded/old results at query time | RetrievalExecutor, Embedder | memory-service or memory-retrieval | -| DedupConfig | Configuration for dedup gate | Settings, NoveltyConfig (extend) | memory-types | -| StaleConfig | Configuration for staleness filtering | Settings | memory-types | -| DedupMetrics | Extended novelty metrics with buffer stats | DedupGate | memory-service | +message Episode { + string episode_id = 1; + int64 start_time_ms = 2; + int64 end_time_ms = 3; + repeated EpisodeAction actions = 4; + string outcome_description = 5; + float value_score = 6; + RetentionPolicy retention_policy = 7; + repeated string context_grip_ids = 8; // Links to TOC grips + string agent_id = 9; // v2.1 multi-agent support +} -### Data Flow Changes +message StartEpisodeRequest { + string description = 1; + string agent_id = 2; +} -**Write path change (before/after):** +message StartEpisodeResponse { + string episode_id = 1; + int64 start_time_ms = 2; +} -``` -BEFORE: - IngestEvent -> validate -> serialize -> storage.put_event (atomic) -> return - -AFTER: - IngestEvent -> validate -> serialize - -> DedupGate.should_store(event) - -> embed(event.text) - -> check InFlightBuffer (linear scan) - -> check HNSW index (if not caught by buffer) - -> if novel: add to buffer, return STORE - -> if duplicate: return SKIP - -> if STORE: storage.put_event (atomic) -> return {created: true} - -> if SKIP: return {created: false, deduplicated: true} [new response field] -``` +message RecordActionRequest { + string episode_id = 1; + EpisodeAction action = 2; +} -**Read path change (before/after):** +message RecordActionResponse { + bool recorded = 1; + string error = 2; +} -``` -BEFORE: - RouteQuery -> classify -> tier detect -> execute layers -> merge -> return - -AFTER: - RouteQuery -> classify -> tier detect -> execute layers -> merge - -> StaleFilter.apply(results, stale_config) - -> pairwise overlap check (optional, O(n^2) but n is small ~10-20) - -> time decay - -> re-sort - -> return -``` +message CompleteEpisodeRequest { + string episode_id = 1; + string outcome_description = 2; + float value_score = 3; + RetentionPolicy retention_policy = 4; +} + +message CompleteEpisodeResponse { + bool completed = 1; + string error = 2; +} + +message GetSimilarEpisodesRequest { + string query = 1; + int32 limit = 2; + optional string agent_id = 3; +} + +message EpisodeSummary { + string episode_id = 1; + int64 start_time_ms = 2; + string outcome_description = 3; + float value_score = 4; + int32 action_count = 5; +} -### Proto Changes Required +message GetSimilarEpisodesResponse { + repeated EpisodeSummary episodes = 1; +} +``` +**Extended messages:** ```protobuf -message IngestEventResponse { - string event_id = 1; - bool created = 2; - bool deduplicated = 201; // NEW: true if rejected as duplicate - float similarity_score = 202; // NEW: highest similarity score found +message RankingPayload { + float salience_score = 1; + float usage_adjusted_score = 2; + float stale_penalty = 3; + float final_score = 4; + string explanation = 5; } -message DedupConfig { - bool enabled = 1; - float threshold = 2; - uint64 timeout_ms = 3; - uint32 min_text_length = 4; - uint32 buffer_size = 5; // In-flight buffer max entries - uint64 buffer_ttl_secs = 6; // In-flight buffer entry TTL +message TeleportResult { + // ... existing fields ... + optional RankingPayload ranking_payload = 20; // Field number > 200 per v2.6 reservation } -message StaleConfig { - bool enabled = 1; - float overlap_threshold = 2; - float decay_factor = 3; - float superseded_penalty = 4; - uint32 grace_period_days = 5; +// Extend status RPCs +message GetRankingStatusResponse { + // ... v2.5 fields ... + int32 usage_tracked_count = 11; // NEW + int32 high_salience_kind_count = 12; // NEW + map memory_kind_distribution = 13; // NEW } -// New RPC for dedup status -message GetDedupStatusRequest {} message GetDedupStatusResponse { - bool enabled = 1; - float threshold = 2; - uint64 total_checked = 3; - uint64 total_rejected = 4; - uint64 buffer_size = 5; - uint64 buffer_capacity = 6; + // ... v2.5 fields ... + int64 buffer_memory_bytes = 6; // NEW + int32 dedup_rate_24h_percent = 7; // NEW + int32 cross_session_dedup_count = 8; // NEW +} + +message GetEpisodeMetricsResponse { // NEW RPC + int32 total_episodes = 1; + int32 completed_episodes = 2; + int32 failed_episodes = 3; + float average_value_score = 4; + map retention_distribution = 5; + int64 last_retention_sweep_ms = 6; } ``` -**Proto field numbers:** Use 201+ range (reserved for Phase 23+ per project convention). +--- -## Patterns to Follow +### 2. Storage: New Column Families -### Pattern 1: Fail-Open Gate (from existing NoveltyChecker) +**In memory-storage/src/column_families.rs:** -**What:** Any check that could prevent event storage MUST fail open. -**When:** Always, for any ingest-time gate. -**Why:** The system's core invariant is that hooks never block the agent. If the dedup check fails (embedder down, timeout, index corrupt), the event MUST be stored anyway. +```rust +pub const CF_EPISODES: &str = "episodes"; +pub const CF_EPISODE_METRICS: &str = "episode_metrics"; + +pub const ALL_CF_NAMES: &[&str] = &[ + // ... existing 9 CFs ... + CF_EPISODES, + CF_EPISODE_METRICS, +]; + +fn episodes_options() -> Options { + let mut opts = Options::default(); + opts.set_compression_type(rocksdb::DBCompressionType::Zstd); + opts // Standard options for immutable append +} + +pub fn build_cf_descriptors() -> Vec { + vec![ + // ... existing descriptors ... + ColumnFamilyDescriptor::new(CF_EPISODES, episodes_options()), + ColumnFamilyDescriptor::new(CF_EPISODE_METRICS, Options::default()), + ] +} +``` +**Key formats:** ```rust -pub async fn should_store(&self, event: &Event) -> DedupDecision { - if !self.config.enabled { - return DedupDecision::Store(DedupReason::Disabled); +// Episode: ep:{start_ts:013}:{ulid} +// Example: ep:1710120000000:01ARZ3NDEKTSV4RRFFQ69G5FAV +pub fn episode_key(start_ts_ms: i64, episode_id: &str) -> String { + format!("ep:{:013}:{}", start_ts_ms, episode_id) +} + +// Episode metrics checkpoint: epmet:{checkpoint_type} +// Example: epmet:retention_sweep_2026_03_11 +pub fn episode_metrics_key(checkpoint_type: &str) -> String { + format!("epmet:{}", checkpoint_type) +} +``` + +**Usage Tracking Enhancement (CF_USAGE_COUNTERS):** + +```rust +// Existing in memory-storage/src/usage.rs, extend: +pub struct UsageStats { + pub access_count: u32, + pub last_accessed_ms: i64, // NEW +} + +impl UsageTracker { + pub fn record_access(&self, node_id: &str) -> Result<(), StorageError> { + // Increment access_count in CF_USAGE_COUNTERS + // Update last_accessed_ms to now } - // ... checks ... - match timeout(duration, self.check_dedup(event)).await { - Ok(Ok(decision)) => decision, - Ok(Err(_)) => DedupDecision::Store(DedupReason::Error), // fail-open - Err(_) => DedupDecision::Store(DedupReason::Timeout), // fail-open + + pub fn compute_access_decay( + &self, + access_count: u32, + last_accessed_ms: i64, + now_ms: i64, + ) -> f32 { + // exponential decay: e^(-lambda * time_elapsed) + // lambda = ln(2) / 30 days half-life + let elapsed_days = (now_ms - last_accessed_ms) as f32 / (86400.0 * 1000.0); + (-0.0231 * elapsed_days).exp() // 0.0231 ≈ ln(2)/30 } } ``` -### Pattern 2: Opt-In with Sensible Defaults (from NoveltyConfig) +--- -**What:** New features disabled by default, enabled via config. -**When:** Any feature that changes existing behavior. -**Why:** Backward compatibility. Existing users should see no change until they opt in. +### 3. Scheduler Jobs -```toml -# config.toml -[dedup] -enabled = true -threshold = 0.85 -buffer_size = 256 +**Register in memory-daemon/src/main.rs:** -[stale] -enabled = true -decay_factor = 0.01 +```rust +async fn register_jobs(scheduler: Arc, storage: Arc) { + // ... existing jobs ... + + // NEW: Episode retention (daily 2am) + let episode_job = EpisodeRetentionJob::new( + storage.clone(), + EpisodeRetentionConfig { + max_episode_age_days: 180, + value_score_threshold: 0.3, + retention_policies: Default::default(), + }, + ); + scheduler.register_job( + "episode_retention", + "0 2 * * * *", + None, + OverlapPolicy::Skip, + JitterConfig::new(60), + || Box::pin(episode_job.execute()), + ).await?; + + // NEW: Vector pruning (weekly Sunday 1am) + let vector_prune_job = VectorPruneJob::new( + storage.clone(), + vector_handler.clone(), + VectorPruneJobConfig { + retention_days: 90, + min_vectors_keep: 1000, + }, + ); + scheduler.register_job( + "vector_prune", + "0 1 * * 0 *", + None, + OverlapPolicy::Skip, + JitterConfig::new(120), + || Box::pin(vector_prune_job.execute()), + ).await?; + + // NOTE: BM25 pruning deferred to Phase 42 (requires SearchIndexer write access) +} ``` -### Pattern 3: Metric-Rich Observability (from NoveltyMetrics) +--- + +## Build Order & Phases + +**v2.6 is 4 phases. Each phase has dependency constraints:** + +### Phase 39: Episodic Memory Storage (Foundation) + +**Deliverables:** +- Add CF_EPISODES, CF_EPISODE_METRICS to column families +- Define Episode proto + messages in memory.proto +- Add Episode struct to memory-types +- Storage::put_episode(), get_episode(), scan_episodes() helpers -**What:** Every code path through the gate tracks a metric. -**When:** Any decision point in dedup or stale filtering. -**Why:** Debugging and tuning. Users need to know WHY events were rejected or WHY results were downranked. +**Dependencies:** v2.5 storage ✓ +**Tests:** Unit tests for episode storage operations (CRUD) +**Blockers:** None -### Pattern 4: Trait-Based Abstractions for Testing (from EmbedderTrait/VectorIndexTrait) +--- -**What:** Core dedup logic depends on traits, not concrete types. -**When:** Any component that interacts with embedder or vector index. -**Why:** MockEmbedder and MockVectorIndex enable fast, deterministic unit tests. +### Phase 40: Episodic Memory Handler (RPC Implementation) -## Anti-Patterns to Avoid +**Deliverables:** +- EpisodeHandler struct (memory-service/src/episode.rs) +- Implement 4 RPCs: StartEpisode, RecordAction, CompleteEpisode, GetSimilarEpisodes +- Wire handler into MemoryServiceImpl +- Integrate vector search for GetSimilarEpisodes (similarity scoring) -### Anti-Pattern 1: Separate Dedup Index +**Dependencies:** Phase 39 ✓, vector index (v2.5) ✓ +**Tests:** E2E tests: start → record → complete → retrieve similar +**Blockers:** None -**What:** Building a second HNSW index specifically for dedup checking. -**Why bad:** Double the maintenance, double the rebuild logic, double the disk usage. The in-flight buffer + existing HNSW covers the same ground with far less complexity. -**Instead:** In-flight buffer (256 entries, linear scan) + existing HNSW index. +--- -### Anti-Pattern 2: Blocking Dedup Check +### Phase 41: Ranking Payload & Observability (Signal Composition) -**What:** Making the IngestEvent RPC wait for dedup check with no timeout. -**Why bad:** Violates fail-open principle. If embedder is slow, all ingestion stalls. -**Instead:** Timeout (50ms default), fail-open on timeout. +**Deliverables:** +- RankingPayloadBuilder (new file memory-service/src/ranking.rs) +- Merge salience + usage_decay + stale_penalty → final_score + explanation +- Extend GetRankingStatus response with new fields +- Extend GetDedupStatus response with new fields +- NEW: GetEpisodeMetrics RPC +- Add ranking_payload field to TeleportResult proto +- Wire ranking_payload into TeleportSearch, VectorTeleport, HybridSearch RPCs -### Anti-Pattern 3: Mutating Events for Staleness +**Dependencies:** Phase 39 (storage) ✓, Phase 40 (handler) ✓, v2.5 ranking ✓ +**Tests:** Unit tests for ranking formula, E2E test for RouteQuery explainability +**Blockers:** None -**What:** Adding a `stale` flag to stored events or TOC nodes. -**Why bad:** Violates append-only model. Staleness is a read-time property that depends on what other content exists. -**Instead:** Compute staleness at query time from timestamps and similarity. +--- -### Anti-Pattern 4: O(n^2) Pairwise Comparison on Large Result Sets +### Phase 42: Lifecycle Automation Jobs (Scheduler) -**What:** Running pairwise overlap detection on hundreds of results. -**Why bad:** 100 results = 4,950 comparisons, each requiring an embedding lookup. -**Instead:** Only apply pairwise overlap to the top-k results (10-20 max). Results beyond top-k are already low-ranked. +**Deliverables:** +- EpisodeRetentionJob (memory-scheduler/src/jobs/episode_retention.rs) +- Extend VectorPruneJob (memory-scheduler/src/jobs/vector_prune.rs) +- Register both jobs in daemon startup +- Checkpoint-based crash recovery for both jobs -### Anti-Pattern 5: Dedup on Raw Events Instead of Content +**Dependencies:** Phase 39 (storage) ✓, Phase 41 (observability) ✓, scheduler (v2.5) ✓ +**Tests:** Unit tests for retention logic, E2E test for vector rebuild, integration test for checkpoint recovery +**Blockers:** None -**What:** Checking dedup at the raw event level (every user_message, tool_result, etc.). -**Why bad:** Many events are legitimately similar (e.g., "yes", "okay", session_start). Dedup should focus on substantive content. -**Instead:** Only dedup events with `min_text_length >= 50` (already in NoveltyConfig). Consider only user_message and assistant_message types. +--- -## Scalability Considerations +## Patterns & Constraints -| Concern | At 100 events/day | At 1K events/day | At 10K events/day | -|---------|-------------------|-------------------|-------------------| -| InFlightBuffer size | 256 entries plenty | 256 entries fine (5min TTL) | May need 512-1024 entries | -| Dedup latency | <5ms | <10ms (buffer scan) | <20ms (larger buffer) | -| HNSW search for dedup | <5ms | <10ms | <15ms (larger index) | -| Stale pairwise check | Negligible (10 results) | Negligible | Negligible (still 10-20 results) | -| Buffer memory | ~400KB | ~400KB | ~1.6MB at 1024 entries | +### Append-Only Immutability -## Build Order (Dependency-Aware) +Episodes are **immutable after CompleteEpisode**: +```rust +impl EpisodeHandler { + pub async fn record_action(&self, ep_id: &str, action: Action) -> Result<()> { + let episode = self.storage.get_episode(ep_id)?; + if episode.end_time_ms > 0 { + return Err(MemoryError::EpisodeAlreadyCompleted(ep_id.to_string())); + } + // Append-only: CF_EPISODES never updates, only adds new versions + Ok(()) + } +} ``` -Phase 1: DedupGate foundation - +--> DedupConfig in memory-types (extends NoveltyConfig) - +--> InFlightBuffer in memory-service (pure data structure, no deps) - +--> Enhanced NoveltyChecker with buffer integration - +--> Unit tests with MockEmbedder + MockVectorIndex - -Phase 2: Wire DedupGate into IngestEvent - +--> Inject DedupGate into MemoryServiceImpl - +--> Add dedup check before storage.put_event - +--> Proto changes (IngestEventResponse.deduplicated) - +--> Integration tests - -Phase 3: StaleFilter - +--> StaleConfig in memory-types - +--> StaleFilter implementation - +--> Integration with RetrievalExecutor (post-processing step) - +--> Unit tests - -Phase 4: E2E validation - +--> E2E test: duplicate events rejected - +--> E2E test: near-duplicate events rejected - +--> E2E test: stale results downranked - +--> E2E test: fail-open on embedder failure - +--> CLI bats tests for dedup behavior + +**Rationale:** Maintains append-only invariant (STOR-01), enables crash recovery, simplifies concurrency. + +--- + +### Handler Injection Pattern + +All handlers use dependency injection via Arc: + +```rust +pub struct EpisodeHandler { + storage: Arc, // Injected + vector_handler: Option>, // Optional + classifier: EpisodeValueClassifier, // Internal +} + +impl EpisodeHandler { + pub fn with_services( + storage: Arc, + vector_handler: Option>, + ) -> Self { ... } +} ``` -**Rationale for this order:** -1. DedupGate first because StaleFilter can be built independently, but DedupGate changes the write path (higher risk, needs more testing) -2. InFlightBuffer before wiring because it can be tested in isolation as a pure data structure -3. StaleFilter after DedupGate because it is read-path only (lower risk, no data mutation) -4. E2E last because it needs both features working end-to-end - -## Sources - -- Direct codebase analysis of: - - `crates/memory-service/src/novelty.rs` -- existing NoveltyChecker pattern (fail-open, opt-in, metrics) - - `crates/memory-service/src/ingest.rs` -- IngestEvent handler (MemoryServiceImpl, event storage) - - `crates/memory-indexing/src/pipeline.rs` -- IndexingPipeline (outbox processing, checkpoint tracking) - - `crates/memory-indexing/src/vector_updater.rs` -- VectorIndexUpdater (HNSW + Candle integration) - - `crates/memory-vector/src/hnsw.rs` -- HnswIndex (usearch wrapper, cosine similarity) - - `crates/memory-vector/src/index.rs` -- VectorIndex trait (search, add, remove interface) - - `crates/memory-retrieval/src/executor.rs` -- RetrievalExecutor (fallback chains, merge, scoring) - - `crates/memory-retrieval/src/types.rs` -- QueryIntent, CapabilityTier, StopConditions, ExecutionMode - - `crates/memory-types/src/salience.rs` -- SalienceScorer (write-time importance scoring) - - `crates/memory-types/src/usage.rs` -- UsageStats, usage_penalty (read-time decay) - - `crates/memory-types/src/config.rs` -- NoveltyConfig, Settings (layered config) - - `crates/memory-types/src/outbox.rs` -- OutboxEntry, OutboxAction (async index pipeline) - - `.planning/PROJECT.md` -- requirements, architectural decisions, constraints +**Rationale:** Separates concerns, testable with mock storage, follows existing RetrievalHandler pattern. + +--- + +### Metrics On-Demand (Single Source of Truth) + +Observability computes metrics by reading primary data, never maintains separate metrics store: + +```rust +impl GetRankingStatus { + pub async fn handle(&self, _req: Request<...>) -> Result> { + let usage_count = self.storage.cf_usage_counters.len()?; // Read current state + let salience_kinds = self.storage.count_memory_kinds()?; // Aggregate from nodes + let stale_decay_active = self.storage.count_stale_nodes()?; + + Ok(Response::new(GetRankingStatusResponse { + usage_tracked_count: usage_count, + high_salience_kind_count: salience_kinds.len(), + memory_kind_distribution: salience_kinds, + })) + } +} +``` + +**Rationale:** No sync issues, single source of truth, easy to test. + +--- + +### Job Checkpoint Recovery + +Jobs use checkpoints for crash recovery: + +```rust +pub async fn execute(&self) -> Result { + let checkpoint = self.load_checkpoint()?; // Resume from last position + + let mut idx = checkpoint.last_processed_idx; + while idx < total_episodes { + let episode = self.get_episode(idx)?; + match self.should_delete(episode) { + Ok(true) => self.mark_delete(episode), + Ok(false) => { /* keep */ }, + Err(e) => { + self.save_checkpoint(idx)?; // Save progress and retry next run + return Err(e); + } + } + idx += 1; + } + + self.save_checkpoint(total_episodes)?; // Mark complete + Ok(JobResult { ... }) +} +``` + +**Rationale:** Scheduler retries on next cron tick; checkpoint resumes from last good position. + +--- + +## Risks & Mitigations + +| Risk | Impact | Mitigation | +|------|--------|-----------| +| Episode retention job deletes wrong records | Data loss | (1) Dry-run mode in config, (2) Conservative defaults (max_age=180d), (3) Checkpoint recovery | +| Vector index rebuild locks queries | Query latency spike | (1) RwLock on index pointer, (2) Copy-on-write (tmp → live), (3) Fallback to TOC | +| Ranking payload computation slows retrieval | Latency increase | (1) Lazy-compute (only for top-K), (2) Cache optional, (3) Metrics show impact | +| GetSimilarEpisodes on large datasets | O(n) scan | (1) usearch HNSW is O(log n), (2) Limit top-10 by default, (3) Time filter (90d) | +| Episode disabled → RPCs return Unimplemented | Skill failure | (1) Skill checks capabilities, (2) Graceful fallback to TOC, (3) Clear docs | + +--- + +## Configuration + +**New config entries (config.toml):** + +```toml +[episode] +enabled = true +max_episode_age_days = 180 +value_score_retention_threshold = 0.3 +vector_search_limit = 10 + +[lifecycle] +vector_prune_enabled = true +vector_prune_retention_days = 90 +bm25_prune_enabled = false # Deferred to Phase 42b + +[ranking] +# Note: Salience, usage, stale already configured in v2.5 +salience_weight = 0.5 +usage_weight = 0.3 +stale_weight = 0.2 +``` + +--- + +## Success Criteria + +**v2.6 is complete when:** + +1. **Episodic Memory:** + - Episode start → record actions → complete → retrieve similar ✓ + - GetSimilarEpisodes returns top-10 semantically matched past episodes ✓ + - Episode context retrievable via ExpandGrip on linked grips ✓ + +2. **Ranking Quality:** + - RankingPayload = salience × usage_decay × (1 - stale_penalty) ✓ + - Explanation human-readable ✓ + - TeleportResult includes ranking_payload ✓ + +3. **Lifecycle Automation:** + - VectorPruneJob removes vectors > 90 days old ✓ + - EpisodeRetentionJob deletes episodes (age > 180d AND value < 0.3) ✓ + - Jobs report metrics to observability ✓ + +4. **Observability:** + - GetRankingStatus includes usage_tracked_count, high_salience_kind_count ✓ + - GetDedupStatus includes buffer_memory_bytes, dedup_rate_24h_percent ✓ + - GetEpisodeMetrics returns completion_rate, value_distribution ✓ + +5. **No Regressions:** + - All v2.5 E2E tests pass ✓ + - Dedup gate unaffected ✓ + - Features optional (feature-gated if needed) ✓ + +--- + +## Summary + +v2.6 integrates **four orthogonal capabilities** into v2.5 via: + +1. **New handlers** (EpisodeHandler) using existing patterns (Arc injection) +2. **New column families** (CF_EPISODES, CF_EPISODE_METRICS) following storage conventions +3. **Extended RPCs** (4 episode RPCs, enhanced status RPCs) with new protos +4. **New scheduler jobs** (episode retention, vector pruning) using checkpoint recovery +5. **Signal composition** (ranking payload) merging v2.5 rankings into explainable payload + +**No architectural rewrite.** All additions are *additive, not structural.* Build order respects dependencies. Patterns align with existing codebase (handler injection, checkpoint recovery, immutable storage, single-source-of-truth metrics). diff --git a/.planning/research/FEATURES.md b/.planning/research/FEATURES.md index 36a50e9..50bed58 100644 --- a/.planning/research/FEATURES.md +++ b/.planning/research/FEATURES.md @@ -1,108 +1,431 @@ -# Feature Landscape +# Feature Landscape: v2.6 Episodic Memory, Ranking Quality, Lifecycle & Observability -**Domain:** Semantic deduplication and retrieval quality for agent conversation memory -**Researched:** 2026-03-05 +**Domain:** Agent Memory System - Cognitive Architecture with Retrieval Quality & Experience Learning +**Researched:** 2026-03-11 +**Scope:** Episodic memory, salience scoring, usage-based decay, lifecycle automation, observability RPCs, hybrid search integration + +--- ## Table Stakes -Features users expect from a dedup/stale-filtering system. Missing = the feature feels incomplete or broken. +Features users expect given the existing 6-layer cognitive stack. Missing these = system feels incomplete or untrustworthy. + +| Feature | Why Expected | Complexity | Category | Notes | +|---------|--------------|-----------|----------|-------| +| **Hybrid Search (BM25 + Vector)** | Lexical + semantic search is industry standard for RAG; existing BM25/vector layers must interoperate | Medium | Retrieval | Currently hardcoded routing logic; needed to complete Layer 3/4 wiring | +| **Salience Scoring at Write Time** | High-value/structural events (Definitions, Constraints) must rank higher; already in design (Layer 6) | Low | Ranking | Write-time scoring avoids expensive retrieval-time computation; enables kind-based exemptions | +| **Usage-Based Decay in Ranking** | Frequently accessed memories fade; rarely touched memories strengthen — mimics human forgetting (Ebbinghaus) | Medium | Ranking | Requires access_count tracking on reads; integrates with existing StaleFilter (14-day half-life) | +| **Vector Index Pruning** | Memory grows unbounded; stale/low-value vectors waste storage and retrieval speed | Low | Lifecycle | Part of background scheduler; removes old/low-salience vectors periodically | +| **BM25 Index Maintenance** | Lexical index needs periodic rebuild/compaction; low-entropy shards waste search time | Low | Lifecycle | Level-filtered rebuild (only rebuild bottom N levels of TOC tree) | +| **Admin Observability RPCs** | Operators need visibility into dedup/ranking health; required for production troubleshooting | Low | Observability | GetDedupMetrics, GetRankingStatus RPCs; expose buffer_size, events_skipped, salience distribution | +| **Episodic Memory Storage & Schema** | Record task outcomes, search similar past episodes — enables learning from experience | Medium | Episodic | CF_EPISODES column family; Episode proto with start_time, actions, outcome, value_score | -| Feature | Why Expected | Complexity | Notes | -|---------|--------------|------------|-------| -| Ingest-time vector similarity gate | Core dedup mechanism. Without it, repeated agent conversations fill the index with near-identical content, degrading retrieval quality. | Medium | Existing `NoveltyChecker` in `memory-service/src/novelty.rs` already implements the pattern (embed -> search top-1 -> threshold check). Must be wired into the actual ingest pipeline rather than being a standalone checker. | -| Configurable similarity threshold | Different projects have different repetition patterns. A code-heavy project tolerates lower thresholds than a conversational one. | Low | `NoveltyConfig.threshold` already exists (default 0.82). Expose through config.toml. Threshold is domain-specific; 0.80-0.90 is the practical range per community evidence. | -| Fail-open on dedup errors | Dedup must never block ingestion. If embedder is down, index not ready, or timeout hit, store the event anyway. | Low | Already implemented in `NoveltyChecker::should_store()` with full fail-open semantics (6 skip paths). This is validated design. | -| Temporal decay in ranking | Old results about superseded topics must rank lower than recent ones. Without this, stale answers pollute retrieval. | Medium | `VectorEntry` already stores `timestamp_millis`. Layer 6 ranking has `salience` and `usage_penalty` but no time-decay factor yet. Add exponential decay based on document age. | -| Dedup metrics/observability | Operators need to know how many events were deduplicated vs stored, to tune thresholds. | Low | `NoveltyMetrics` already tracks `rejected_duplicate`, `stored_novel`, and 6 skip categories. Expose via gRPC `GetDedupStats` or similar. | -| Minimum text length bypass | Short events (session_start, tool_result status lines) should skip dedup entirely -- they are structurally important but semantically thin. | Low | `NoveltyConfig.min_text_length` already exists (default 50 chars). Already implemented. | +--- ## Differentiators -Features that set the dedup system apart from naive implementations. Not expected, but add significant value. +Features that set the system apart from naive implementations. Not expected, but highly valued by power users. + +| Feature | Value Proposition | Complexity | Category | Notes | +|---------|-------------------|-----------|----------|-------| +| **Value-Based Episode Retention** | Delete low-value episodes, retain "Goldilocks zone" (medium utility); learn from successful experiences without storage bloat | High | Episodic | Prevents pathological retention (too high = dedup everything; too low = no learning); requires outcome scoring percentile analysis | +| **Retrieval Integration for Similar Episodes** | When answering a query, optionally search past episodes (GetSimilarEpisodes); surface "we solved this before and it worked" | High | Episodic | Bridges episodic → semantic; depends on episode embedding + vector search; powerful for repeated task patterns | +| **Adaptive Lifecycle Policies** | Retention thresholds adjust based on storage pressure, salience distribution, usage patterns | High | Lifecycle | Not essential v2.6; deferred for v2.7 adaptive optimization phase | +| **Multi-Layer Decay Coordination** | Stale filter + usage decay + episode retention all tune together (no conflicting signals) | Medium | Ranking | Requires tuning framework; candidates: weighted sum, per-layer thresholds, Bayesian composition | +| **Observability Dashboard Integration** | Admin RPC metrics feed into operator dashboards (Prometheus, CloudWatch, DataDog) | Low | Observability | External tool integration only; requires stable RPC interface + consistent metric names | +| **Cross-Episode Learning Patterns** | Identify repeated task types, success/failure patterns across episodes | Very High | Episodic | Requires NLP/clustering on episode summaries; deferred for v2.7+ self-improvement | -| Feature | Value Proposition | Complexity | Notes | -|---------|-------------------|------------|-------| -| Supersession detection (content-aware staleness) | Instead of just time-decay, detect when a newer event semantically supersedes an older one on the same topic. Mark the older result as superseded. Goes beyond dumb temporal decay. | High | Requires comparing new ingest against existing similar entries and marking old entries with a `superseded_by` reference. Could use the same vector search but with a "supersession window" (e.g., only check events from same agent/session). | -| Per-event-type dedup policies | Different event types warrant different dedup behavior: `user_message` should be aggressively deduped, `session_start`/`session_end` should never be deduped, `assistant_stop` may have a looser threshold. | Low | Add `event_type` to the dedup decision. Simple match on `EventType` enum to select threshold or skip. | -| Staleness half-life configuration | Configurable half-life for temporal decay (e.g., 7 days, 30 days) rather than a fixed decay curve. Projects with fast-moving topics want aggressive decay; archival projects want gentle decay. | Low | Single `half_life_days` config parameter. Decay formula: `score * exp(-ln(2) * age_days / half_life_days)`. | -| Agent-scoped dedup | Dedup within a single agent's history, not across all agents. Agent A saying "let's fix the bug" and Agent B saying the same thing are independent events worth keeping. | Medium | Already have `Event.agent` field. Scope the vector similarity search with an agent filter. Requires post-filtering HNSW results by agent metadata since usearch has no native metadata filtering. | -| Dedup dry-run mode | Allow operators to see what WOULD be deduped without actually dropping events. Useful for threshold tuning. | Low | Add `dry_run` flag to `NoveltyConfig`. Log rejections but store anyway. Return dedup decisions in metrics. | -| Stale result exclusion window | Hard cutoff: results older than N days are excluded from retrieval entirely (not just downranked). Configurable per intent type -- `TimeBoxed` queries might exclude results older than 7 days while `Explore` queries include everything. | Medium | Add `max_age_days` to retrieval config per `QueryIntent`. Filter at query time before ranking. | +--- ## Anti-Features -Features to explicitly NOT build. These seem tempting but create more problems than they solve. +Features to explicitly NOT build. | Anti-Feature | Why Avoid | What to Do Instead | |--------------|-----------|-------------------| -| Mutable event deletion on dedup | Tempting to delete duplicate events from RocksDB. Violates the append-only invariant that is foundational to the architecture. Deleted events break grip references, TOC nodes, and crash recovery checkpoints. | Mark duplicates silently by not storing them at ingest time. Already-stored events stay forever. | -| Cross-project dedup | Comparing events across different project stores adds massive complexity and violates the per-project isolation model. | Keep dedup scoped to a single project store. Cross-project memory is explicitly deferred/out-of-scope. | -| LLM-based dedup decisions | Using an LLM to decide if two events are duplicates (like Mem0 does) adds API latency, cost, and a hard dependency on external services. Agent Memory uses local embeddings precisely to avoid API dependencies. | Use local vector similarity (all-MiniLM-L6-v2 via Candle, already in-process). The 50ms timeout is achievable with local embeddings but not with API calls. | -| Exact-match dedup only | Hashing-based exact dedup catches identical text but misses semantic near-duplicates ("let's fix the auth bug" vs "we need to address the authentication issue"). | Semantic similarity via embeddings catches both exact and near-duplicate content. Hash-based dedup is a subset of vector similarity at threshold=1.0. | -| Global re-ranking of all stored events | Re-ranking everything at query time based on staleness is O(n) and defeats the purpose of indexed search. | Apply staleness filtering/decay AFTER index search returns top-k candidates. Post-retrieval filtering keeps cost at O(k). | -| Retroactive dedup of existing events | Scanning all historical events to find and mark duplicates is expensive and risks flagging legitimate repeated discussions. | Apply dedup only to new events going forward. Historical data stays as-is. | +| **Automatic Memory Forgetting Without User Choice** | Agent should never silently delete memories; violates append-only principle and causality debugging | Lifecycle jobs are delete-by-policy (configurable); admins set thresholds; users can override | +| **Real-Time Outcome Feedback Loop (Agent Self-Correcting)** | Too complex for v2.6; requires agent control flow that's outside memory's scope | Record episode outcomes (human validation); v2.7 can add reward signaling to retrieval policy | +| **Graph-Based Episode Dependencies** | Tempting but overengineered; TOC tree + timestamps sufficient for temporal navigation | Use TOC + episode timestamps; cross-reference via event_id links; avoid graph DB complexity | +| **LLM-Based Episode Summarization** | High latency, API dependency, hallucination risk; hard to troubleshoot | Use salience scores + existing grip-based summaries (already in TOC); optionally add human review | +| **Per-Agent Lifecycle Scoping** | Multi-agent mode can defer this; would require partition keys in every pruning job | Lifecycle policies are global; agents filter on retrieval (agent-filtered queries already work) | +| **Continuous Outcome Recording** | If users must label every action, adoption suffers | Make outcome recording opt-in; batch via CompleteEpisode RPC with single outcome score | +| **Real-Time Index Rebuilds** | Blocking user queries during index maintenance kills UX | Schedule pruning jobs during off-hours; implement dry-run reporting for production safety | + +--- ## Feature Dependencies +Dependency graph for implementation order. + +``` +Hybrid Search (BM25 Router) + ↓ (requires Layer 3/4 operational, unblocks routing logic) +Salience Scoring at Write Time + ↓ (requires write-time scoring populated in TOC/Grips) +Usage-Based Decay in Ranking + ↓ (requires access_count tracking + ranking pipeline) +Admin Observability RPCs + ├─ (exposes dedup + ranking metrics) + ↓ +Vector/BM25 Index Lifecycle Jobs + ├─ (scheduler jobs, can run parallel with above) + ↓ +Episodic Memory Storage & RPCs + ├─ (depends on Event storage, independent of indexes) + ├─ (can start parallel with lifecycle work) + ↓ +Value-Based Episode Retention + ├─ (depends on outcome scoring; runs after retention policy jobs) + ↓ +Similar Episode Retrieval (Optional) + └─ (depends on CompositeVectorIndex; runs post-episodic-memory) +``` + +**Critical Path (must do in order):** +1. Hybrid Search wiring (unblocks ranking) +2. Salience + Usage Decay (ranking works end-to-end) +3. Admin RPCs (observability for production) +4. Episodic Memory storage (independent, parallel-safe) +5. Value-based retention (completion feature, can defer 1 sprint) + +**Parallel-Safe Work:** +- Index lifecycle jobs (no dependency on episodic memory) +- Admin RPC metrics gathering (can stub metrics early, populate later) + +--- + +## Implementation Patterns + +### Hybrid Search (BM25 + Vector Fusion) + +**What it does:** Route queries to both BM25 and vector indexes; combine rankings via Reciprocal Rank Fusion (RRF) or weighted average. + +**How it works (industry standard):** +1. **Parallel execution:** Run BM25 query + Vector query concurrently +2. **Score normalization:** Bring both to [0, 1] scale (RRF or linear mapping) +3. **Fusion:** Combine via RRF (no tuning) or weighted blend (tunable weights) +4. **Routing heuristic:** + - Keyword-heavy query (identifiers, class names) → weight BM25 higher (0.6 BM25, 0.4 Vector) + - Semantic query ("find discussions about X") → weight Vector higher (0.4 BM25, 0.6 Vector) + - Default → equal weights (0.5 BM25, 0.5 Vector) + +**Integration with existing retrieval policy:** +- Already has intent classification (Explore/Answer/Locate/TimeBoxed) +- Layer 3/4 searches are independent; hybrid merges at ranking stage +- Retrieval policy's tier detection and fallback chains already in place + +**Complexity:** MEDIUM — RRF is simple math; requires coordinating two async searches. + +**Expected behavior (validation):** +- Keyword queries (e.g., "JWT token") retrieve via BM25 without latency spike +- Semantic queries (e.g., "how did we handle auth?") use vector similarity +- Graceful fallback: if BM25 fails, vector search results are returned (and vice versa) + +--- + +### Salience Scoring at Write Time + +**What it does:** Assign importance scores (0.0-1.0) at ingest time based on event kind. + +**How it works:** +- Already in Layer 6 design; KIND classification determines salience +- High-salience kinds: `constraint`, `definition`, `procedure`, `tool_result_error` (0.9-1.0) +- Medium-salience: `user_message`, `assistant_stop` (0.5-0.7) +- Low-salience: `session_start`, `session_end` (0.1-0.3) + +**Integration point:** +- TocNode and Grip protos already have `salience_score` field (v2.5+) +- Populate at ingest time via `SalienceScorer::score_event(kind)` (static lookup) +- Used in Layer 6 ranking as multiplicative factor + +**Complexity:** LOW — scoring rules are static lookup table; no ML required. + +**Expected behavior:** +- Constraints/definitions never decay (exempted from StaleFilter) +- Session markers have low salience (deprioritized in ranking) +- Ranking score = base_score × salience_factor × (1 - stale_penalty) × (1 - usage_decay) + +--- + +### Usage-Based Decay in Ranking + +**What it does:** Reduce ranking score for frequently-accessed items (inverse recency); strengthen rarely-touched items. + +**How it works:** +- Track `access_count` per TOC node / Grip (incremented on read) +- At retrieval ranking time: apply decay factor = 1.0 / log(access_count + 1) or exp(-access_count / K) +- Decay is multiplicative: `final_score = base_score × salience_factor × (1 - decay_factor) × (1 - stale_penalty)` + +**Rationale:** Mimics human memory — rehearsed facts fade from conscious retrieval; novel facts stay sharp (Ebbinghaus forgetting curve validated in cognitive psychology). + +**Tuning considerations:** +- Decay floor: never drop score below 20% (prevent collapse) +- Decay half-life: decay factor = 0.5 at access_count = 100 (tunable via config) +- Exempt structural events: high-salience kinds don't decay (same as StaleFilter) + +**Complexity:** MEDIUM — requires tracking + lookup at ranking time; no external service. + +**Expected behavior:** +- Recent queries with low access_count rank higher (novel information) +- Popular results (high access_count) gradually fade unless repeatedly accessed +- Salience exemptions prevent "boring but important" facts from disappearing + +--- + +### Index Lifecycle Automation via Scheduler + +**Vector Index Pruning:** +- **When:** Weekly or when storage threshold exceeded +- **What:** Remove vectors for events marked `skip_vector` or older than 90 days + low-salience +- **How:** HNSW index is rebuildable from TOC tree; deletion is safe +- **Job:** `VectorPruneJob` in background scheduler (framework exists since v1.0) +- **Dry-run:** Log what WOULD be deleted; allow admin override + +**BM25 Index Maintenance:** +- **When:** Weekly or when search latency exceeds SLA +- **What:** Rebuild BM25 index for bottom N levels of TOC (recent events prioritized) +- **How:** Tantivy segment merge + compaction; can be online (dual indexes) +- **Job:** `Bm25RebuildJob` with level filtering +- **Dry-run:** Report segment stats before rebuild + +**Complexity:** LOW — scheduler framework exists; jobs are independent. + +**Expected behavior:** +- Vector index size decreases over time (no unbounded growth) +- BM25 latency stays consistent (no slowdown from segment bloat) +- Operators can monitor pruning effectiveness via metrics RPCs + +--- + +### Admin Observability RPCs + +**What users need to see:** + +| Metric | RPC Field | Why | Example Value | +|--------|-----------|-----|-------| +| **Dedup Buffer Size** | `infl_buffer_size` | Is dedup gate backed up? | 128 / 256 entries | +| **Events Deduplicated (Session)** | `events_skipped_session` | How many duplicates caught? | 47 events | +| **Events Deduplicated (Cross-Session)** | `events_skipped_cross_session` | Long-term dedup working? | 312 events | +| **Salience Distribution** | `salience_histogram[0.0-0.2]`, etc. | Is content balanced? | {0.0-0.2: 100, 0.2-0.4: 50, ...} | +| **Usage Decay Distribution** | `access_count_p50`, `p99` | Are hot/cold patterns healthy? | p50=3, p99=157 | +| **Vector Index Size** | `vector_index_entries` | Storage used by vectors? | 18,432 entries | +| **BM25 Index Size** | `bm25_index_bytes` | Storage used by BM25? | 2.4 MB | +| **Last Pruning Timestamp** | `last_vector_prune_time` | When did cleanup last run? | 2026-03-09T14:30:00Z | + +**Exposed via:** +- `GetRankingStatus` RPC (already stubbed v2.2) +- `GetDedupMetrics` RPC (new in v2.6) +- Both return structured proto with histogram buckets + +**Complexity:** LOW — reading metrics from existing data structures; no computation. + +**Expected behavior:** +- Metrics RPCs respond in <100ms (cached, no expensive scans) +- Salience histogram shows multimodal distribution (not flat) +- Usage decay p50 < p99 by 50x+ (confirming hot/cold pattern) + +--- + +### Episodic Memory Storage & RPCs + +**What it does:** Record sequences of actions + outcomes from tasks, enabling "we solved this before" retrieval. + +**Proto Schema:** +```protobuf +message Episode { + string episode_id = 1; // UUID + int64 start_time_us = 2; // micros since epoch + int64 end_time_us = 3; // 0 if incomplete + string task_description = 4; // "debug JWT token leak" + repeated EpisodeAction actions = 5; // sequence of steps + EpisodeOutcome outcome = 6; // success/partial/failure + value_score + float value_score = 7; // 0.0-1.0, outcome importance + repeated string tags = 8; // ["auth", "jwt"] for retrieval filtering + string contributing_agent = 9; // agent_id, reuses existing field +} + +message EpisodeAction { + int64 timestamp_us = 1; + string action_type = 2; // "query_memory", "tool_call", "decision" + string description = 3; + map metadata = 4; +} + +message EpisodeOutcome { + string status = 1; // "success" | "partial" | "failure" + float outcome_value = 2; // 0.0-1.0, how well did we do? + string summary = 3; // "JWT token rotation fixed in 3 steps" + int64 duration_ms = 4; // total task duration +} ``` -NoveltyChecker wired to ingest pipeline - -> Configurable threshold (already exists in NoveltyConfig) - -> Per-event-type policies (extends NoveltyChecker) - -> Agent-scoped dedup (extends vector search with agent filter) - -> Dedup dry-run mode (extends NoveltyChecker) - -> Dedup metrics exposed via gRPC (extends existing NoveltyMetrics) - -Temporal decay in ranking - -> Staleness half-life config (extends ranking config) - -> Stale result exclusion window (extends retrieval executor) - -> Supersession detection (extends ingest + retrieval) - -Vector similarity search at ingest (already exists: HnswIndex.search) - -> NoveltyChecker integration (already partially built) - -> Agent-scoped search filtering (needs metadata filter) + +**Storage:** RocksDB column family `CF_EPISODES`; keyed by episode_id; queryable by start_time range. + +**RPCs:** +```protobuf +service EpisodeService { + rpc StartEpisode(StartEpisodeRequest) returns (StartEpisodeResponse); + rpc RecordAction(RecordActionRequest) returns (RecordActionResponse); + rpc CompleteEpisode(CompleteEpisodeRequest) returns (CompleteEpisodeResponse); + rpc GetSimilarEpisodes(GetSimilarEpisodesRequest) returns (GetSimilarEpisodesResponse); + rpc ListEpisodes(ListEpisodesRequest) returns (ListEpisodesResponse); +} ``` +**Complexity:** MEDIUM — new storage layer; RPCs are straightforward; outcome_value is user-provided (not computed). + +**Expected behavior:** +- StartEpisode returns unique episode_id +- RecordAction appends to episode's action sequence +- CompleteEpisode commits outcome (idempotent) +- GetSimilarEpisodes returns episodes with similar task_description + tags +- Episodes survive crash recovery (like TOC nodes) + +--- + +### Value-Based Episode Retention + +**What it does:** Auto-delete low-value episodes; keep high-value ones; sweet-spot detection prevents pathological retention. + +**Problem:** If all episodes are retained, system degrades (storage + retrieval latency). If auto-delete is too aggressive, learning is lost. + +**Solution (industry pattern):** Retention threshold based on outcome score distribution. + +**Algorithm:** +1. **Analyze distribution:** Compute p25, p50, p75 of value_score across recent episodes +2. **Sweet spot:** Retain episodes in range [p50, p75] or [p50, 1.0] depending on storage pressure +3. **Culling policy:** Delete episodes with value_score < p25 OR older than 180 days +4. **Tuning lever:** Config parameter `retention_percentile` (default 50) + +**Rationale:** +- p25 (low-value): routine tasks, minimal learning value → delete early +- p50-p75 (sweet spot): moderately complex, high learning value → retain long-term +- p75+ (high-value): critical issues, precedent-setting → never auto-delete + +**Complexity:** HIGH — requires statistical analysis + configurable tuning; deferred to v2.6.2. + +**Expected behavior:** +- Retention job runs weekly without blocking writes +- Episodes with value_score < p25 are removed +- Operators can view retention policy metrics (deletion count, space reclaimed) + +--- + ## MVP Recommendation -Prioritize: +**Phase 1 (Weeks 1-2): Hybrid Search Wiring** +- Unblock Layer 3/4 routing logic +- Enables salience + usage-based ranking to have effect +- Complexity: MED, high impact + +**Phase 2 (Weeks 2-3): Salience Scoring at Write Time** +- Low complexity, enables kind-based exemptions in decay +- Integrates naturally with existing TOC/Grip protos +- Complexity: LOW + +**Phase 3 (Weeks 3-4): Usage-Based Decay in Retrieval Ranking** +- Multiplicative with StaleFilter; tunable floor +- Requires access_count tracking (add to TocNode/Grip) +- Complexity: MED + +**Phase 4 (Weeks 4-5): Admin Observability RPCs** +- Expose metrics for production troubleshooting +- Low complexity, high operational value +- Complexity: LOW + +**Phase 5 (Weeks 5-6): Vector Index Pruning + BM25 Lifecycle** +- Scheduler jobs; independent implementation +- Prevent unbounded index growth +- Complexity: LOW -1. **Wire NoveltyChecker into actual ingest pipeline** -- The checker exists but is not connected to the real ingest path. This is the single highest-value change: it immediately reduces noise in the vector/BM25 indexes. +**Phase 6 (Weeks 7-8, if time allows): Episodic Memory Storage & RPCs** +- Independent of ranking; can be built in parallel +- Complexity: MED, moderate impact -2. **Temporal decay factor in Layer 6 ranking** -- Add time-based decay alongside existing salience and usage_penalty scores. Formula: `decay = exp(-ln(2) * age_days / half_life_days)`, default half-life 14 days. Apply as a multiplier on retrieval scores post-search. +**Defer (v2.6.1 or v2.7):** +- **Value-Based Episode Retention** (v2.6.2) — Requires outcome scoring model; HIGH complexity +- **Similar Episode Retrieval** (v2.7) — Nice-to-have; HIGH complexity +- **Adaptive Lifecycle Policies** (v2.7) — Not essential; HIGH complexity -3. **Per-event-type dedup bypass** -- Skip dedup for structural events (session_start, session_end, subagent_start, subagent_stop). Only dedup content-bearing events (user_message, assistant_stop, tool_result). +--- -4. **Expose dedup metrics via gRPC** -- Wire existing `NoveltyMetrics` into a status RPC so operators can monitor dedup effectiveness and tune thresholds. +## Success Criteria -5. **E2E tests proving dedup works** -- Ingest duplicate events, verify only one is stored. Query with temporal decay, verify recent results rank higher. +**v2.6 Feature Completeness:** +- [ ] Hybrid search queries route correctly (E2E test hitting both BM25 + Vector) +- [ ] Salience scores populated at write time (inspect TOC nodes/grips in RocksDB) +- [ ] Usage decay reduces scores predictably (access_count increments, ranking penalizes correctly) +- [ ] Admin metrics RPCs return non-zero values (GetRankingStatus, GetDedupMetrics) +- [ ] Index pruning jobs complete without errors (scheduler logs show cleanup) +- [ ] Episodic memory RPCs accept/return well-formed protos (round-trip test) +- [ ] 10+ E2E tests cover new features (hybrid routing, decay behavior, lifecycle jobs, observability) -Defer: -- **Supersession detection**: High complexity, requires topic-matching infrastructure beyond simple vector similarity. Research deeper in a future phase. -- **Agent-scoped dedup**: Requires post-filtering HNSW results by agent metadata since usearch has no native metadata filtering. Feasible but adds complexity. Defer until multi-agent dedup is a validated pain point. -- **Stale result exclusion window per intent**: Nice to have but temporal decay covers 80% of the use case. Add later if decay alone is insufficient. +**Regression Prevention:** +- [ ] All v2.5 tests still pass (dedup, stale filter, multi-agent) +- [ ] No new performance regressions (latency within 5% of v2.5 baseline) +- [ ] Graceful degradation holds (hybrid search falls back if BM25 fails, etc.) -## Existing Infrastructure to Leverage +--- -| Component | Location | What It Provides | What's Missing | -|-----------|----------|-----------------|----------------| -| `NoveltyChecker` | `memory-service/src/novelty.rs` | Full fail-open dedup logic with embed -> search -> threshold | Not wired into actual ingest pipeline | -| `NoveltyConfig` | `memory-types/src/config.rs` | `enabled`, `threshold` (0.82), `timeout_ms` (50), `min_text_length` (50) | No per-event-type policies | -| `NoveltyMetrics` | `memory-service/src/novelty.rs` | Atomic counters for all dedup outcomes | Not exposed via gRPC | -| `VectorEntry.timestamp_millis` | `memory-vector/src/index.rs` | Timestamp on every indexed document | Not used in ranking | -| `SalienceScorer` | `memory-types/src/salience.rs` | Write-time salience calculation | No temporal component | -| `usage_penalty()` | `memory-types/src/usage.rs` | Access-count based decay formula | No time-based decay | -| `HnswIndex` | `memory-vector/src/hnsw.rs` | Cosine similarity search via usearch | No metadata filtering for agent-scoped search | -| `IndexingPipeline` | `memory-indexing/src/pipeline.rs` | Outbox-driven batch indexing | Dedup check not part of pipeline | -| `VectorIndexUpdater` | `memory-indexing/src/vector_updater.rs` | Embeds and indexes TOC nodes and grips | Already skips duplicates by doc_id (exact match only) | +## Integration with Existing Architecture + +**Layers Affected:** + +| Layer | Change | Impact | +|-------|--------|--------| +| Layer 0 (Events) | Add access_count tracking to event retrieval path | Minimal — new field, write-only during reads | +| Layer 1 (TOC) | Add salience_score, access_count to TocNode | Minimal — already has versioning for append-safe updates | +| Layer 2 (TOC Search) | None | None | +| Layer 3 (BM25) | Wire into hybrid routing; add pruning job | Medium — coordination with Layer 4 ranking | +| Layer 4 (Vector) | Wire into hybrid routing; add pruning job | Medium — coordination with Layer 3 ranking | +| Layer 5 (Topic Graph) | None | None | +| Layer 6 (Ranking) | Add salience factor, usage decay factor | Medium — multiplicative composition of factors | +| Control (Retrieval Policy) | Wire hybrid search router; tune fallback chains | Medium — new routing decision point | +| Scheduler | Add VectorPruneJob, Bm25RebuildJob | Low — framework already exists | +| Storage (RocksDB) | Add CF_EPISODES column family | Low — isolated new column family | + +**No breaking changes** to existing gRPC contracts; new RPCs/fields added via proto `oneof` or new message types. + +--- + +## Risk Mitigation + +| Risk | Likelihood | Mitigation | +|------|------------|-----------| +| **Hybrid search combines incompatible scores** | MED | Normalize both indexes to [0, 1] before fusion; test with known-good queries | +| **Usage decay creates retrieval bias** | MED | Log all decay factors in traces; audit queries with low access_count but high relevance | +| **Index pruning deletes needed content** | LOW | Dry-run mode with reporting; never auto-delete structural events; admin confirmation | +| **Episode value_score inflation** | MED | Cap at 1.0; require outcome_value validation in RPC; monitor distribution metrics | +| **Episodic memory storage bloat** | MED | Implement retention policy early; set aggressive TTL during v2.6 pilot | +| **Observability metrics cause latency** | LOW | Metrics are computed on-demand or cached; profile before/after RPC calls | + +--- ## Sources -- [Mem0: Building Production-Ready AI Agents with Scalable Long-Term Memory](https://arxiv.org/abs/2504.19413) -- Mem0 uses LLM-based memory extraction and dedup; we deliberately avoid this for latency reasons (MEDIUM confidence) -- [Temporal RAG: Why RAG Always Gets 'When' Questions Wrong](https://blog.sotaaz.com/post/temporal-rag-en) -- Temporal awareness critical for retrieval freshness (MEDIUM confidence) -- [Data Freshness Rot as the Silent Failure Mode in Production RAG Systems](https://glenrhodes.com/data-freshness-rot-as-the-silent-failure-mode-in-production-rag-systems-and-treating-document-shelf-life-as-a-first-class-reliability-concern-2/) -- Treats document shelf life as first-class concern (MEDIUM confidence) -- [Solving Freshness in RAG: A Simple Recency Prior](https://arxiv.org/html/2509.19376) -- Recency prior fused with semantic similarity for temporal ranking (MEDIUM confidence) -- [OpenAI Community: Rule of Thumb Cosine Similarity Thresholds](https://community.openai.com/t/rule-of-thumb-cosine-similarity-thresholds/693670) -- No universal threshold; 0.79-0.85 common for near-duplicate detection (MEDIUM confidence) -- [Data Deduplication at Trillion Scale](https://zilliz.com/blog/data-deduplication-at-trillion-scale-solve-the-biggest-bottleneck-of-llm-training) -- MinHash LSH at 0.8 threshold for near-duplicate detection at scale (MEDIUM confidence) -- [Enhancing RAG: A Study of Best Practices](https://arxiv.org/abs/2501.07391) -- RAG best practices including dedup in context assembly (HIGH confidence) -- [The Knowledge Decay Problem](https://ragaboutit.com/the-knowledge-decay-problem-how-to-build-rag-systems-that-stay-fresh-at-scale/) -- Staleness monitoring as ongoing operational concern (MEDIUM confidence) -- Existing codebase: `NoveltyChecker`, `NoveltyConfig`, `NoveltyMetrics`, `SalienceScorer`, `usage_penalty()`, `VectorEntry`, `HnswIndex` (HIGH confidence -- direct code inspection) +- [Designing Memory Architectures for Production-Grade GenAI Systems | Avijit Swain | March 2026](https://medium.com/@avijitswain11/designing-memory-architectures-for-production-grade-genai-systems-2c20f71f9a45) +- [Memory Patterns for AI Agents: Short-term, Long-term, and Episodic | DEV Community](https://dev.to/gantz/memory-patterns-for-ai-agents-short-term-long-term-and-episodic-5ff1) +- [From Storage to Experience: A Survey on the Evolution of LLM Agent Memory Mechanisms | Preprints.org](https://www.preprints.org/manuscript/202601.0618) +- [Implementing Cognitive Memory for Autonomous Robots: Hebbian Learning, Decay, and Consolidation in Production | Varun Sharma | Medium](https://medium.com/@29.varun/implementing-cognitive-memory-for-autonomous-robots-hebbian-learning-decay-and-consolidation-in-faea53b3973a) +- [A Comprehensive Hybrid Search Guide | Elastic](https://www.elastic.co/what-is/hybrid-search) +- [About hybrid search | Vertex AI | Google Cloud Documentation](https://docs.cloud.google.com/vertex-ai/docs/vector-search/about-hybrid-search) +- [Full-text search for RAG apps: BM25 & hybrid search | Redis](https://redis.io/blog/full-text-search-for-rag-the-precision-layer/) +- [7 Hybrid Search Recipes: BM25 + Vectors Without Lag | Hash Block | Medium](https://medium.com/@connect.hashblock/7-hybrid-search-recipes-bm25-vectors-without-lag-467189542bf0) +- [Hybrid Search: Combining BM25 and Semantic Search for Better Results with Langchain | Akash A Desai | Medium](https://medium.com/etoai/hybrid-search-combining-bm25-and-semantic-search-for-better-results-with-lan-1358038fe7e6) +- [Hybrid Search RAG in the Real World: Graphs, BM25, and the End of Black-Box Retrieval | NetApp Community](https://community.netapp.com/t5/Tech-ONTAP-Blogs/Hybrid-RAG-in-the-Real-World-Graphs-BM25-and-the-End-of-Black-Box-Retrieval/ba-p/464834) +- [Index lifecycle management (ILM) in Elasticsearch | Elastic Docs](https://www.elastic.co/docs/manage-data/lifecycle/index-lifecycle-management) +- [What is agent observability? Tracing tool calls, memory, and multi-step reasoning | Braintrust](https://www.braintrust.dev/articles/agent-observability-tracing-tool-calls-memory) +- [Observability for AI Workloads: A New Paradigm for a New Era | Dotan Horovits | Medium | January 2026](https://horovits.medium.com/observability-for-ai-workloads-a-new-paradigm-for-a-new-era-b8972ba1b6ba) +- [AI Agent Memory Security Requires More Observability | Valdez Ladd | Medium | December 2025](https://medium.com/@oracle_43885/ai-agent-memory-security-requires-more-observability-b12053e39ff0) +- [Building Self-Improving AI Agents: Techniques in Reinforcement Learning and Continual Learning | Technology.org | March 2026](https://www.technology.org/2026/03/02/self-improving-ai-agents-reinforcement-continual-learning/) +- [Process vs. Outcome Reward: Which is Better for Agentic RAG Reinforcement Learning | OpenReview](https://openreview.net/forum?id=h3LlJ6Bh4S) +- [Experiential Reinforcement Learning | Microsoft Research](https://www.microsoft.com/en-us/research/articles/experiential-reinforcement-learning/) +- [A Survey on the Memory Mechanism of Large Language Model-based Agents | ACM Transactions on Information Systems](https://dl.acm.org/doi/10.1145/3748302) +- [Evaluating Memory in LLM Agents via Incremental Multi-Turn Interactions | ICLR 2026 | GitHub](https://github.com/HUST-AI-HYZ/MemoryAgentBench) +- [Cache Replacement Policies Explained for System Performance | Aerospike](https://aerospike.com/blog/cache-replacement-policies/) +- [How to Configure LRU and LFU Eviction in Redis | OneUptime | January 2026](https://oneuptime.com/blog/post/2026-01-25-redis-lru-lfu-eviction/view) + +--- + +**Last Updated:** 2026-03-11 +**For Milestone:** v2.6 Retrieval Quality, Lifecycle & Episodic Memory diff --git a/.planning/research/STACK.md b/.planning/research/STACK.md index e38643e..7278cc8 100644 --- a/.planning/research/STACK.md +++ b/.planning/research/STACK.md @@ -1,238 +1,216 @@ -# Technology Stack: v2.5 Semantic Dedup & Retrieval Quality +# Technology Stack: v2.6 Episodic Memory, Salience Scoring, Lifecycle Automation -**Project:** Agent Memory v2.5 -**Researched:** 2026-03-05 -**Focus:** Ingest-time semantic dedup gate and stale result filtering +**Project:** Agent Memory — Local agentic memory system with retrieval layers +**Researched:** 2026-03-11 +**Confidence:** HIGH -## Key Finding: No New Dependencies Required +## Executive Summary -The existing stack already provides everything needed for both features. This milestone is purely a **feature implementation** on top of existing infrastructure, not a stack expansion. +The v2.6 milestone adds episodic memory (task outcome tracking), salience/usage-based ranking, lifecycle automation, and BM25 hybrid wiring to a mature 14-crate Rust system (v2.5 shipped with semantic dedup + stale filtering). -**Confidence:** HIGH -- based on direct codebase inspection of all relevant crates. - -## Existing Stack (Relevant to v2.5) - -### Already Present -- Use As-Is - -| Technology | Version (Locked) | Crate | Role in v2.5 | -|------------|-----------------|-------|---------------| -| usearch | 2.23.0 | memory-vector | HNSW index for dedup similarity search at ingest | -| candle-core/nn/transformers | 0.8.4 | memory-embeddings | all-MiniLM-L6-v2 embedding generation for dedup | -| RocksDB | 0.22 | memory-storage | Dedup metadata storage, staleness markers | -| chrono | 0.4 | memory-types | Timestamp comparison for staleness decay | -| tokio | 1.43 | memory-service | Async timeout for dedup gate (fail-open) | -| serde/serde_json | 1.0 | memory-types | Config serialization for dedup/staleness settings | - -### No Version Bumps Needed - -All current versions support the required operations: -- **usearch 2.23.0**: `search()` returns distances, `add()` inserts vectors -- both needed for dedup gate. Already validated in `HnswIndex::search()` at `crates/memory-vector/src/hnsw.rs`. -- **candle 0.8.4**: `embed()` generates 384-dim vectors -- same embedder used for query-path vector teleport. Already wrapped in `CandleEmbedder` at `crates/memory-embeddings/`. -- **RocksDB 0.22**: Column families support metadata storage. `VectorMetadata` at `crates/memory-vector/src/metadata.rs` already maps vector IDs to doc IDs with timestamps (`VectorEntry.created_at`). - -## Integration Points for v2.5 - -### Feature 1: Ingest-Time Semantic Dedup Gate - -**What exists:** The `NoveltyChecker` at `crates/memory-service/src/novelty.rs` already implements the exact pattern needed -- a fail-open, opt-in, async vector similarity check at ingest time. It: -- Has `EmbedderTrait` and `VectorIndexTrait` abstractions -- Implements timeout with fail-open behavior -- Tracks metrics (skipped_disabled, skipped_no_embedder, skipped_no_index, skipped_index_not_ready, skipped_error, skipped_timeout, skipped_short_text, stored_novel, rejected_duplicate) -- Uses `NoveltyConfig` with threshold (default 0.82), timeout (50ms), min_text_length (50) -- Is disabled by default, requires explicit opt-in - -**What needs to change:** The current `NoveltyChecker` uses its own `EmbedderTrait` and `VectorIndexTrait` that are **not wired to the actual usearch index**. The `check_similarity()` method delegates to abstract traits but the real `HnswIndex` and `CandleEmbedder` are not connected. The implementation needs: - -1. **Wire `NoveltyChecker` to real `HnswIndex`** -- Implement `VectorIndexTrait` for `Arc>` with `VectorMetadata` lookup to convert vector IDs back to doc IDs -2. **Wire `NoveltyChecker` to real `CandleEmbedder`** -- Implement `EmbedderTrait` for `Arc` (wrapping the sync `embed()` call in `tokio::task::spawn_blocking`) -3. **Integrate into ingest path** -- The `MemoryServiceImpl` at `crates/memory-service/src/ingest.rs` needs to call `NoveltyChecker::should_store()` before `storage.put_event()` -4. **Adjust threshold** -- Current default of 0.82 may need tuning; 0.92 is more appropriate for dedup (vs novelty detection which should be looser) - -**Stack impact:** Zero new crates. The `NoveltyChecker` pattern is already built; it just needs plumbing. - -### Feature 2: Stale Result Filtering/Downranking - -**What exists:** The ranking layer already has these components: -- **Salience scoring** (`crates/memory-types/src/salience.rs`): Write-time importance scoring with `SalienceScorer`, formula: `base(0.35) + length_density + kind_boost + pinned_boost` -- **Usage decay** (`crates/memory-types/src/usage.rs`): `usage_penalty()` function using `1 / (1 + decay_factor * access_count)`, `apply_usage_penalty()` multiplies score by penalty -- **VectorMetadata** (`crates/memory-vector/src/metadata.rs`): `VectorEntry.created_at` timestamp (ms since epoch) already stored for every indexed vector -- **Retrieval policy** (`crates/memory-retrieval/src/`): Intent classification, tier detection, execution orchestration with `StopConditions` including `min_confidence` - -**What needs to be added (pure Rust, no new deps):** - -1. **Staleness config** -- Add `StalenessConfig` to `crates/memory-types/src/config.rs` alongside `NoveltyConfig`: - - `enabled: bool` (default: false, matching existing opt-in pattern) - - `decay_half_life_days: f32` (default: 30.0) -- score halves every N days - - `supersession_threshold: f32` (default: 0.90) -- similarity above which newer content supersedes older - - `max_age_penalty: f32` (default: 0.1) -- floor for time decay (never fully zero out old results) - -2. **Time-decay scoring** -- Add `staleness_penalty()` to `crates/memory-types/src/usage.rs` (adjacent to existing `usage_penalty()`): - - Formula: `max(max_age_penalty, 0.5^(age_days / half_life_days))` -- exponential decay with floor - - Applied as multiplicative factor on retrieval scores, same pattern as `apply_usage_penalty()` - - Uses `chrono::Utc::now()` vs `VectorEntry.created_at` -- both already available - -3. **Supersession detection** -- When multiple results are semantically similar (cosine > supersession_threshold), keep only the most recent: - - Compare pairwise similarity of top-K results (embeddings available via `VectorMetadata` + `HnswIndex`) - - For each cluster of similar results, retain the newest by `created_at` - - This reuses `HnswIndex::search()` and `VectorMetadata::get()` -- no new dependencies - -4. **Ranking integration** -- Apply staleness penalty in the retrieval/query layer at `crates/memory-service/src/teleport_service.rs` or `crates/memory-service/src/query.rs` - -**Stack impact:** Zero new crates. All computation uses existing `chrono` timestamps and `usearch` similarity scores. +**No new external dependencies required.** The existing stack (Tantivy, Candle, usearch, RocksDB) handles all new features. The key changes are: +1. **Schema extensions** in proto to episodic messages + outcome fields +2. **New crates** for episodic storage (not new packages — use existing RocksDB) +3. **Configuration** for retention, salience, value thresholds +4. **Existing APIs** (vector pruning, BM25 lifecycle) wired into scheduler ## Recommended Stack -### Core Framework (NO CHANGES) +### No New External Dependencies -| Technology | Version | Purpose | Why No Change | -|------------|---------|---------|---------------| -| usearch | 2.23.0 | HNSW vector index | Already supports search() for dedup gate | -| candle-* | 0.8.4 | Local embeddings | Already generates 384-dim vectors | -| RocksDB | 0.22 | Storage + metadata | Already stores timestamps for staleness | -| tokio | 1.43 | Async runtime | Already used for timeout in NoveltyChecker | -| chrono | 0.4 | Time calculations | Already used for timestamps throughout | +| Category | Tech | Version | Why | Status | +|----------|------|---------|-----|--------| +| **Episodic Storage** | RocksDB (existing) | 0.22 | Same append-only engine + new CF_EPISODES | Already in use | +| **Hybrid Search** | Tantivy (existing) + usearch (existing) | 0.25 / 2 | RRF fusion between BM25 and vector | Implemented in v2.2 | +| **Embeddings** | Candle (existing) + all-MiniLM-L6-v2 | 0.8 | Local inference, no API calls | Validated v2.0 | +| **Async Runtime** | Tokio + tonic | 1.43 / 0.12 | gRPC service, scheduler tasks | Core infrastructure | +| **Serialization** | serde + serde_json + prost | 1.0 / 1.0 / 0.13 | Config, JSON, proto messages | Standard | +| **Time** | chrono | 0.4 | Timestamps, decay calculations | Already in use | +| **Concurrency** | dashmap + Arc + std::sync::RwLock | 6 / — / — | ConcurrentHashMap for usage stats, RwLock for InFlightBuffer | Already in use | -### Supporting Libraries (NO CHANGES) +### Already-Integrated Libraries (No Upgrades Needed) -| Library | Version | Purpose | Why No Change | -|---------|---------|---------|---------------| -| serde/serde_json | 1.0 | Config/metadata serialization | Already serializes NoveltyConfig, VectorEntry | -| tracing | 0.1 | Logging for dedup decisions | Already used in NoveltyChecker | -| thiserror | 2.0 | Error types for new error variants | Already used in all crates | -| async-trait | (existing) | Async trait bounds for EmbedderTrait/VectorIndexTrait | Already used in memory-service | +| Library | Current Version | Purpose | Note | +|---------|-----------------|---------|------| +| usearch | 2 | HNSW vector index + dedup similarity | Used in cross-session dedup (v2.5) | +| hdbscan | 0.12 | Semantic clustering for topic graph | Topic discovery layer (v2.0) | +| lru | 0.12 | LRU cache for usage tracking | Access count caching in storage (v2.1) | +| ulid | 1.1 | Unique ID generation | Event IDs, Episode IDs | +| tokio-cron | (via tokio-util) | Background scheduler | Job scheduling for lifecycle jobs | +| thiserror | 2.0 | Error types | Standard error handling | +| tracing | 0.1 | Observability | Logging + metrics | -### What NOT to Add +## Architecture Integration Points -| Temptation | Why Avoid | -|------------|-----------| -| SimHash / MinHash crate | Overkill -- cosine similarity via usearch is sufficient for 384-dim vectors. SimHash trades accuracy for speed but HNSW is already O(log n). | -| Bloom filter crate | Adds complexity without benefit -- HNSW search is already O(log n) and provides similarity scores, not just membership | -| Separate dedup index | Unnecessary -- reuse existing HNSW index; dedup is just search-before-insert on the same index | -| External embedding service | Already have local Candle; adding API dependency violates zero-API-dependency design principle | -| Time-series DB for staleness | RocksDB already stores timestamps; exponential decay is a pure math function, not a query | -| Approximate dedup (LSH) | usearch cosine similarity is accurate enough for 384-dim; LSH adds false negatives which means lost dedup | -| ordered-float crate | Unnecessary for score comparison; f32 comparisons with `partial_cmp` are fine for ranking | -| New column family for dedup state | The existing `VectorMetadata` already stores everything needed (vector_id, doc_id, created_at, text_preview) | +### 1. Episodic Memory Storage (New Crate: memory-episodes) -## Architecture of Changes (Stack Perspective) +**Location:** `crates/memory-episodes/` +**Dependencies:** memory-types, memory-storage, memory-embeddings, tokio, serde -``` -Ingest Path (BEFORE v2.5): - gRPC IngestEvent -> NoveltyChecker (UNWIRED) -> Store in RocksDB -> Outbox -> Background indexing - -Ingest Path (AFTER v2.5): - gRPC IngestEvent -> NoveltyChecker (WIRED) -> Store in RocksDB -> Outbox -> Background indexing - | - +-> Embed text (CandleEmbedder -- already instantiated in service) - +-> Search HNSW (usearch -- already instantiated in service) - +-> If similarity > threshold: reject (fail-open on any error) - -Query Path (BEFORE v2.5): - Search -> Rank by relevance + salience + usage_decay - -Query Path (AFTER v2.5): - Search -> Rank by relevance + salience + usage_decay + [STALENESS] -> [SUPERSESSION] -> Return - | | - | +-> Pairwise cosine on top-K - | +-> Keep newest per cluster - +-> Apply time-decay penalty (chrono math) -``` +**Integration:** +- New column family in RocksDB: `CF_EPISODES` +- Store Episode structs (episode_id → Episode JSON in RocksDB) +- Reuse existing embedding pipeline (Candle all-MiniLM-L6-v2) +- Store episode embeddings in same vector index as TOC nodes (with metadata tag "episode") -## Crate Dependency Changes +**No new dependencies:** RocksDB is the storage engine. Episode lifecycle management reuses the existing scheduler (memory-scheduler). -### memory-service (changes needed) -- **Already depends on:** memory-embeddings, memory-vector, memory-types, memory-storage, memory-search, memory-scheduler, tokio, async-trait -- **Needs:** Wire `NoveltyChecker` to real `HnswIndex` and `CandleEmbedder` implementations. Add supersession filter as post-processing step in teleport/hybrid results. -- **No new Cargo.toml entries.** +### 2. Salience + Usage Ranking (memory-retrieval enhancement) -### memory-types (changes needed) -- **Already depends on:** serde, chrono -- **Needs:** Add `StalenessConfig` struct (same file as `NoveltyConfig`). Add `staleness_penalty()` and `apply_staleness_penalty()` functions (same file as `usage_penalty()`). -- **No new Cargo.toml entries.** +**Current state:** +- Salience fields exist in proto (TocNode.salience_score, TocNode.memory_kind) and memory-types +- Usage tracking exists (UsageStats, UsageConfig in memory-types, dashmap cache in memory-storage) +- SalienceScorer exists in memory-types but not wired into retrieval -### memory-retrieval (may need changes) -- **Already depends on:** memory-types, chrono, async-trait -- **Needs:** If staleness filtering is done at the retrieval policy layer (vs service layer), add staleness config to execution context. `StopConditions` may need a `staleness_enabled` field. -- **No new Cargo.toml entries.** +**Changes needed:** +- Wire SalienceScorer into all retrieval result ranking (BM25, vector, topics) +- Thread usage stats from storage through retrieval pipeline +- Apply formula: `score = base_similarity * (0.55 + 0.45 * salience) * usage_penalty(access_count)` -### memory-vector (no changes) -- Already has: `HnswIndex` with `search()`, `VectorMetadata` with `VectorEntry.created_at` -- No modifications needed -- the vector layer is a read target for dedup, not modified. +**No new dependencies:** Uses existing UsageConfig, SalienceScorer, and dashmaps in storage. -### memory-indexing (no changes) -- Already has: `VectorIndexUpdater` that adds to HNSW index via outbox pipeline -- The dedup gate runs BEFORE event storage (and therefore before indexing), so no changes here. +### 3. Lifecycle Automation (memory-scheduler + memory-search enhancements) -### memory-embeddings (no changes) -- Already has: `CandleEmbedder` with `embed()` method, `EmbeddingModel` trait -- The dedup gate wraps this in `EmbedderTrait` adapter at the service layer. +**Current state:** +- Tokio cron scheduler exists (memory-scheduler crate) +- Vector pruning API exists: `VectorIndexPipeline::prune(age_days)` +- BM25 lifecycle config exists: `Bm25LifecycleConfig` +- RocksDB operations are append-only; soft-delete via filtered rebuild -## Configuration Design +**Changes needed:** +- Add scheduler job for vector index pruning (daily 3 AM) +- Add scheduler job for BM25 index rebuild with level filter (weekly) +- Wire config from `[lifecycle]` section in config.toml +**Configuration additions (config.toml):** ```toml -# In ~/.config/agent-memory/config.toml - -# Existing config -- already implemented, just needs wiring -[novelty] -enabled = false # Opt-in dedup gate (existing field) -threshold = 0.92 # Bump from 0.82 for stricter dedup -timeout_ms = 100 # Bump from 50ms to allow embedding + search -min_text_length = 50 # Existing field, keep as-is - -# New config section -[staleness] -enabled = false # Opt-in, matching novelty pattern -decay_half_life_days = 30.0 # Score halves every 30 days -supersession_threshold = 0.90 # Cosine sim for "this supersedes that" -max_age_penalty = 0.1 # Floor -- never fully zero out old results +[lifecycle] +enabled = true + +[lifecycle.vector] +# Existing but needs automation +segment_retention_days = 30 +grip_retention_days = 30 +day_retention_days = 365 +prune_schedule = "0 3 * * *" + +[lifecycle.bm25] +segment_retention_days = 30 +grip_retention_days = 30 +rebuild_schedule = "0 4 * * 0" # Weekly Sunday 4 AM + +[lifecycle.episodes] +# New: Value-based retention for episodes +value_threshold = 0.18 +max_episodes = 1000 +prune_schedule = "0 2 * * *" ``` -**Design decision:** Keep `NoveltyConfig` name and semantics -- the "novelty check" IS the "dedup gate." The name `novelty` accurately describes checking whether incoming content is novel relative to existing content. Adding a separate `DedupConfig` would duplicate the same structure. +**No new dependencies:** Reuses Tokio cron, existing RocksDB, existing lifecycle APIs. -**Threshold tuning note:** The default threshold should be raised from 0.82 to 0.92 because: -- 0.82 is appropriate for "is this content novel enough to be interesting?" (novelty detection) -- 0.92 is appropriate for "is this content essentially the same thing?" (dedup) -- The difference matters: at 0.82, paraphrased content gets rejected; at 0.92, only near-identical content does +### 4. BM25 Hybrid Wiring (memory-search enhancement) -## Alternatives Considered +**Current state:** +- HybridSearch RPC exists in proto +- BM25 search (TeleportSearch) exists +- Vector search exists +- RRF fusion algorithm designed but not fully wired into routing -| Category | Recommended | Alternative | Why Not | -|----------|-------------|-------------|---------| -| Dedup mechanism | Reuse NoveltyChecker + real HNSW | Separate dedup index (hash-based) | NoveltyChecker already implements the pattern; hash-based loses semantic similarity | -| Dedup mechanism | Reuse NoveltyChecker + real HNSW | Content hash (SHA-256) | Catches only exact duplicates; misses semantic duplicates like paraphrases | -| Staleness scoring | Exponential time decay | Linear decay | Exponential is standard for memory/forgetting curves; old results should not linearly vanish | -| Supersession | Pairwise cosine of top-K | Track explicit supersession links in storage | Explicit links require schema changes, complex bookkeeping, and backfill; pairwise cosine is stateless | -| Config pattern | Opt-in with fail-open | Always-on | Matches existing novelty/usage patterns; lets users enable when ready | -| Threshold default | 0.92 for dedup | 0.82 (existing) | 0.82 is too aggressive for dedup; rejects legitimately different content | +**Changes needed:** +- Wire BM25 results through hybrid search handler (not hardcoded `false`) +- Apply RRF normalization: `score = 60 / (60 + rank_bm25) + 60 / (60 + rank_vector)` +- Weight fusion by mode (HYBRID_MODE_HYBRID uses 0.5/0.5 by default) +- Ensure agent filtering applied to both tiers -## Installation +**No new dependencies:** Uses existing Tantivy and usearch. -```bash -# No new dependencies -- just build -cargo build --workspace +## Integration Path (No Blockers) -# No Cargo.toml changes needed -# All features implemented using existing crates +``` +v2.5 (Shipped) → v2.6 (New) +├─ Existing Schema ✓ +│ ├─ TocNode.salience_score (proto field 101) +│ ├─ TocNode.memory_kind (proto field 102) +│ └─ Grip.salience_score (proto field 11) +│ +├─ New Schema (Proto additions, field numbers > 200) +│ ├─ Episode message (new column family CF_EPISODES) +│ ├─ StartEpisodeRequest/Response +│ ├─ RecordActionRequest/Response +│ ├─ CompleteEpisodeRequest/Response +│ └─ GetSimilarEpisodesRequest/Response +│ +├─ Storage (RocksDB only) +│ ├─ CF_EPISODES (append-only episode journal) +│ └─ Existing usage stats cache (dashmap) +│ +├─ Computation (Existing ML stack) +│ ├─ Episode embeddings (Candle all-MiniLM-L6-v2) +│ ├─ Similarity search (usearch HNSW) +│ └─ Salience scoring (existing formula) +│ +├─ Lifecycle (Tokio scheduler only) +│ ├─ Vector prune job (existing API, new scheduler wiring) +│ ├─ BM25 rebuild job (existing API, new scheduler wiring) +│ └─ Episode prune job (new, reuses same job framework) +│ +└─ Retrieval (memory-retrieval + handlers) + ├─ Hybrid search wiring (existing RPC, new routing) + ├─ Salience integration (existing scorer, new ranking layer) + ├─ Usage decay application (existing stats, new formula) + └─ Episode similarity search (new handler, existing embeddings) ``` -## Proto Changes +## What NOT to Add + +| Anti-Pattern | Reason | What to Do Instead | +|--------------|--------|-------------------| +| New async runtime | Tokio is standard for Rust systems | Keep tokio 1.43 | +| Separate vector DB (Weaviate, Qdrant, etc.) | Single-process system; RocksDB is correct | Store vectors in HNSW index + metadata | +| SQL database (SQLx, Tokio-postgres) | Append-only RocksDB is the model | Add new column families, not tables | +| New LLM API for embeddings | Local Candle ensures zero API dependency | Use all-MiniLM-L6-v2 exclusively | +| Feature flag framework (feature-gates) | Not needed; code is simple enough | Use config.toml bools for toggles | +| Streaming/real-time updates (tonic streaming for episodes) | Unidirectional request/response is correct | Keep gRPC request/response pattern | +| Consolidation/NLP extraction (spaCy, NLTK) | Out of scope for v2.6; episodic memory only | Defer to v2.7 if pursued | + +## Verification Checklist + +- [x] Episodic storage: RocksDB column family sufficient (no new DB) +- [x] Embeddings: Candle handles episodes same as TOC nodes +- [x] Hybrid search: Existing BM25/vector APIs, just needs routing wiring +- [x] Lifecycle jobs: Tokio scheduler covers vector/BM25/episode pruning +- [x] Salience: Proto fields and SalienceScorer already defined; integrate into ranking +- [x] Usage tracking: dashmap + LRU cache already in place +- [x] No runtime changes: Tokio 1.43 sufficient for all async operations +- [x] Proto safety: Field numbers > 200 reserved for phase 23+ (safe to add episodes) +- [x] Backward compatibility: All new fields optional in proto; serde(default) handles JSON parsing + +## Confidence Assessment + +| Component | Level | Notes | +|-----------|-------|-------| +| **RocksDB schema** | HIGH | CF_EPISODES is straightforward append-only; validated pattern | +| **Embeddings** | HIGH | all-MiniLM-L6-v2 + Candle proven in production (v2.0+) | +| **Vector search** | HIGH | usearch HNSW + dedup similarity search working (v2.5) | +| **Scheduler** | HIGH | Tokio cron job framework operational since v2.0 | +| **Hybrid fusion** | MEDIUM | RRF algorithm designed, existing handlers need wiring only | +| **Salience integration** | HIGH | SalienceScorer exists, needs threading through retrieval | +| **Configuration** | HIGH | config.toml pattern established; new sections are additive | +| **Episode retention** | MEDIUM | Value-based pruning algorithm is novel but low-complexity (threshold check) | -The gRPC proto at `proto/memory.proto` may need minor additions: -- `IngestEventResponse` could include a `deduplicated: bool` field indicating the event was rejected -- `GetRankingStatusResponse` could include staleness config status -- Field numbers >200 are reserved for new additions (per project convention) +## Sources -No new RPCs needed. Dedup is transparent to callers (event just silently not stored). Staleness is transparent to callers (results just ranked differently). +- **Code:** `/Users/richardhightower/clients/spillwave/src/agent-memory/` + - Workspace Cargo.toml (dependencies verified 2026-03-11) + - proto/memory.proto (schema v2.5 shipped, v2.6 additions safe in field > 200) + - crates/memory-types/src/ (SalienceScorer, UsageStats, UsageConfig, DedupConfig, StalenessConfig) + - crates/memory-storage/src/ (dashmap 6.0, lru 0.12, RocksDB 0.22) + - crates/memory-search/src/lifecycle.rs (Bm25LifecycleConfig, retention_map) + - crates/memory-scheduler/ (Tokio cron job framework) + - crates/memory-vector/ (VectorIndexPipeline::prune API) -## Sources +- **Design:** `.planning/PROJECT.md` (v2.6 requirements, validated decisions) +- **RFC:** `docs/plans/memory-ranking-enhancements-rfc.md` (episodic memory Tier 2 spec, lifecycle Tier 1.5) -- Direct codebase inspection: `crates/memory-service/src/novelty.rs` -- NoveltyChecker with EmbedderTrait, VectorIndexTrait, fail-open, metrics, disabled-by-default -- Direct codebase inspection: `crates/memory-vector/src/hnsw.rs` -- HnswIndex wrapping usearch with cosine similarity, search() returns 1.0-distance -- Direct codebase inspection: `crates/memory-vector/src/metadata.rs` -- VectorEntry with created_at timestamp, VectorMetadata RocksDB store -- Direct codebase inspection: `crates/memory-types/src/config.rs` -- NoveltyConfig with threshold 0.82, timeout 50ms, disabled by default -- Direct codebase inspection: `crates/memory-types/src/usage.rs` -- usage_penalty() and apply_usage_penalty() patterns -- Direct codebase inspection: `crates/memory-types/src/salience.rs` -- SalienceScorer write-time scoring -- Direct codebase inspection: `crates/memory-indexing/src/vector_updater.rs` -- VectorIndexUpdater pipeline -- Direct codebase inspection: `crates/memory-service/src/ingest.rs` -- IngestEvent RPC handler -- Direct codebase inspection: `crates/memory-retrieval/src/types.rs` -- StopConditions, CapabilityTier, QueryIntent -- Cargo.lock: usearch 2.23.0, candle-core 0.8.4, tantivy 0.25.0, rocksdb 0.22 +--- +*Research completed 2026-03-11. No external dependencies added. All features implemented via existing crates + RocksDB column families + proto schema extensions.* diff --git a/.planning/research/SUMMARY.md b/.planning/research/SUMMARY.md index 1d7b6e5..953617a 100644 --- a/.planning/research/SUMMARY.md +++ b/.planning/research/SUMMARY.md @@ -1,248 +1,215 @@ # Project Research Summary -**Project:** Agent Memory v2.5 — Semantic Dedup & Retrieval Quality -**Domain:** Ingest-time semantic deduplication and stale result filtering for append-only event store -**Researched:** 2026-03-05 +**Project:** Agent Memory — v2.6 Episodic Memory, Ranking Quality, Lifecycle & Observability +**Domain:** Rust-based cognitive memory architecture for AI agents (gRPC service, 14-crate workspace) +**Researched:** 2026-03-11 **Confidence:** HIGH ## Executive Summary -Agent Memory v2.5 adds two capabilities: ingest-time semantic deduplication to prevent near-identical events from polluting the vector and BM25 indexes, and stale result filtering to downrank superseded content at query time. All four research streams confirm that no new Rust crate dependencies are required — usearch 2.23.0, Candle 0.8.4, RocksDB 0.22, and chrono 0.4 already provide everything needed. The codebase already contains a largely-complete `NoveltyChecker` in `memory-service/src/novelty.rs` that implements the correct fail-open, opt-in, metric-tracked pattern — the primary work is wiring it to real infrastructure and resolving four critical design decisions identified by the pitfalls researcher. +Agent Memory v2.6 is a mature milestone adding four orthogonal capabilities to a production-proven 14-crate Rust system: episodic memory (task outcome recording and retrieval), ranking quality (salience + usage-based decay composition), lifecycle automation (scheduled vector/BM25/episode pruning), and observability RPCs (admin metrics for dedup, ranking, episodes). The system already has 7 shipped milestones (v1.0–v2.5), 48,282 LOC, 122 plans, and a complete 6-layer retrieval stack (TOC, agentic search, BM25, vector, topic graph, ranking). The critical architectural insight is that v2.6 requires zero new external dependencies — every new feature plugs into existing patterns (RocksDB column families, Tokio scheduler jobs, Arc handler injection, proto field extensions) rather than introducing structural changes. -The hardest problem is not the feature implementation itself but the architectural constraints that must be resolved first. The PITFALLS researcher identified four critical issues that contradict the naive implementation: (1) the HNSW index contains TOC nodes and grips, NOT raw events, so comparing incoming events against it produces misleading similarity scores; (2) the async outbox pipeline creates a timing gap where burst duplicates (the most common kind) escape detection entirely; (3) ingest-time event dropping breaks the append-only invariant that TOC segmentation depends on; and (4) stale filtering stacks multiplicatively with existing ranking penalties, risking score collapse on high-salience historical content. Each of these requires a design decision before implementation begins. +The recommended approach is additive integration in four phases (39–42). Phase 39 lays the episodic storage foundation (CF_EPISODES column family + proto schema), Phase 40 implements the EpisodeHandler RPCs (StartEpisode, RecordAction, CompleteEpisode, GetSimilarEpisodes), Phase 41 wires the RankingPayloadBuilder (salience × usage decay × stale penalty = explainable final score + observability extensions), and Phase 42 registers lifecycle scheduler jobs (EpisodeRetentionJob, VectorPruneJob). The architecture is dependency-ordered: storage before handlers, handlers before ranking composition, ranking before lifecycle. The key feature dependency that must be respected is that hybrid search wiring (BM25 routing) should come before or alongside salience/usage ranking to ensure ranking signals have results to operate on. -The recommended approach addresses all four: use a two-tier dedup system (in-memory in-flight buffer of 384-dim embeddings as primary, HNSW as secondary for cross-session), store-and-skip-indexing instead of dropping events to preserve append-only semantics, set a conservative default threshold of 0.85 with dry-run mode for calibration, and exempt high-salience memory kinds (Constraint, Definition, Procedure) from stale filtering entirely. The architecture researcher and pitfalls researcher are in full agreement on this approach, and the STACK researcher confirms no new dependencies are needed to implement it. +The primary risks come from the existing dedup architecture (v2.5): the HNSW vector index does NOT contain raw event embeddings (only TOC summaries), so dedup and similarity comparisons must use the in-memory InFlightBuffer as the primary source rather than the index. Stale filtering must be bounded (max 30% score reduction) and must exempt structural memory kinds (Constraint, Definition, Procedure) to avoid burying critical historical context. Ranking signals must be composed with a defined formula before implementation to avoid score collapse — multiplicative stacking of salience + usage + stale + novelty penalties can crush all scores to near-zero, triggering false fallback-chain activations and dropping valid results below the min_confidence threshold. ## Key Findings ### Recommended Stack -No new dependencies. The entire milestone is implemented using existing crates — nothing in `Cargo.toml` changes. See `.planning/research/STACK.md` for full detail. +The v2.6 stack requires no new external dependencies. All features are implemented via existing crates. See `.planning/research/STACK.md` for full details. **Core technologies:** -- **usearch 2.23.0** (`memory-vector`): HNSW index for cross-session dedup similarity search — already has `search()` and `add()`, already instantiated in the service -- **Candle 0.8.4** (`memory-embeddings`): all-MiniLM-L6-v2 local embedding generation — already wrapped in `CandleEmbedder`, already generates 384-dim vectors; no external API dependency -- **RocksDB 0.22** (`memory-storage`): dedup metadata storage, staleness markers, existing `VectorEntry.created_at` timestamps cover all staleness needs -- **chrono 0.4** (`memory-types`): timestamp comparison for staleness decay — already used throughout -- **tokio 1.43** (`memory-service`): async timeout for dedup gate (fail-open on timeout) — already used in `NoveltyChecker` +- **RocksDB (0.22):** Episodic storage via new CF_EPISODES and CF_EPISODE_METRICS column families; append-only, crash-safe — already in production +- **Candle + all-MiniLM-L6-v2:** Episode embeddings for GetSimilarEpisodes; 384-dim, CPU-only, ~5ms per embedding — validated since v2.0 +- **usearch HNSW (v2):** Vector similarity search for episode retrieval; O(log n) approximate nearest neighbor — in production since v2.2 +- **Tantivy BM25 (0.25):** Hybrid search lexical tier; needs routing wiring to complete Layer 3/4 integration — implemented but not fully wired into routing handler +- **Tokio cron scheduler:** Background lifecycle jobs; framework exists since v1.0, needs EpisodeRetentionJob + VectorPruneJob registered +- **dashmap + Arc:** Usage stats tracking (access_count, last_accessed_ms) for ranking decay — already in CF_USAGE_COUNTERS +- **prost + tonic (0.13/0.12):** Proto schema extensions for Episode messages + 4 new RPCs; field numbers reserved above 200 — backward-compatible additions -**What NOT to add:** SimHash/MinHash crates, Bloom filter crates, external embedding services, separate time-series databases for staleness, ordered-float crate — all are overkill for the 384-dim cosine similarity + exponential decay approach. +**Critical constraint:** All proto additions must use field numbers above 200 (reserved for Phase 23+ per PROJECT.md). The CF_EPISODES key format is `ep:{start_ts:013}:{ulid}` — lexicographic ordering enables time-range scans without secondary indexes. No SQL, no separate vector DB, no streaming RPCs, no LLM-based summarization. ### Expected Features -See `.planning/research/FEATURES.md` for full detail with dependency graph. +See `.planning/research/FEATURES.md` for full feature details with complexity analysis and implementation patterns. **Must have (table stakes):** -- **Ingest-time vector similarity gate** — core dedup; without it, repeated agent conversations fill indexes with near-identical content degrading retrieval quality. `NoveltyChecker` pattern exists, needs wiring. -- **Configurable similarity threshold** — different projects have different repetition patterns; `NoveltyConfig.threshold` already exists (default 0.82, should be raised to 0.85 for dedup). -- **Fail-open on dedup errors** — dedup must never block ingestion; already implemented in `NoveltyChecker::should_store()` with 6 skip paths. -- **Temporal decay in ranking** — old results about superseded topics must rank lower; `VectorEntry` already stores `timestamp_millis`. -- **Dedup metrics/observability** — operators need to know how many events were deduplicated to tune thresholds; `NoveltyMetrics` already tracks the right counters, needs gRPC exposure. -- **Minimum text length bypass** — short events (session_start, tool_result status lines) skip dedup entirely; `NoveltyConfig.min_text_length` already exists. - -**Should have (differentiators):** -- **Supersession detection** — mark older events semantically replaced by newer content on same topic (goes beyond time decay); high complexity, architecture researcher provides concrete design. -- **Per-event-type dedup policies** — `session_start`/`session_end` never deduped, `user_message`/`assistant_stop` deduped with higher threshold; low complexity, high value. -- **Staleness half-life configuration** — configurable `half_life_days` for exponential decay rather than fixed curve. -- **Dedup dry-run mode** — log what WOULD be rejected without dropping events; critical for threshold tuning before production enable. -- **Agent-scoped dedup** — dedup within single agent's history, not across agents; requires post-filtering HNSW results by agent metadata. - -**Defer to v2.6+:** -- **Agent-scoped dedup**: requires post-filtering HNSW results by agent metadata since usearch has no native metadata filtering — feasible but adds complexity; defer until multi-agent dedup is a validated pain point. -- **Stale result exclusion window per intent**: temporal decay covers 80% of the use case; add hard cutoff by `QueryIntent` only if decay alone proves insufficient. - -**Anti-features (explicitly excluded):** -- Mutable event deletion on dedup — violates append-only invariant; mark by not indexing, never by deleting. -- LLM-based dedup decisions — adds API latency, cost, external dependency; use local Candle embeddings. -- Exact-match dedup only — misses semantic near-duplicates; use vector similarity. -- Global re-ranking of all stored events — O(n) at query time; apply staleness to top-k only. -- Retroactive dedup of historical events — expensive, risky; new events only going forward. -- Cross-project dedup — violates per-project isolation model. +- Hybrid Search (BM25 + Vector fusion via RRF) — lexical + semantic search is industry standard; currently hardcoded routing logic in hybrid handler +- Salience Scoring at Write Time — high-value events (Definitions, Constraints) must rank higher; proto fields exist, need population at ingest +- Usage-Based Decay in Ranking — access_count-weighted score adjustment; CF_USAGE_COUNTERS exists, needs threading into ranking pipeline +- Vector Index Pruning — prevents unbounded HNSW index growth; VectorIndexPipeline::prune() API exists, needs scheduler wiring +- BM25 Index Maintenance — prevents Tantivy segment bloat; Bm25LifecycleConfig exists, needs job wiring +- Admin Observability RPCs — GetDedupMetrics, GetRankingStatus extensions; operators need production visibility +- Episodic Memory Storage + RPCs — CF_EPISODES + StartEpisode/RecordAction/CompleteEpisode/GetSimilarEpisodes + +**Should have (competitive differentiators):** +- Value-Based Episode Retention — percentile-based culling (delete value_score below p25, retain p50–p75 sweet spot) +- RankingPayload with explanation field — per-result explainability ("salience=0.8, usage=0.905, stale=0.0 → final=0.724") +- GetSimilarEpisodes with vector similarity — "we solved this before" retrieval pattern bridging episodic to semantic memory + +**Defer (v2.7+):** +- Adaptive Lifecycle Policies — storage-pressure-based threshold adjustment (HIGH complexity, needs usage data to tune) +- Cross-Episode Learning Patterns — NLP/clustering on episode summaries (VERY HIGH complexity, requires separate NLP pipeline) +- Real-Time Outcome Feedback Loop — agent self-correction via reward signaling (out of scope for memory service) +- LLM-Based Episode Summarization — API dependency, hallucination risk, high latency (anti-pattern for local-first design) ### Architecture Approach -The architecture is an enhancement of existing patterns, not a new system. The `NoveltyChecker` in `memory-service/src/novelty.rs` IS the dedup gate — it already implements fail-open, opt-in, metric-rich semantics. Two new components are added alongside it: an `InFlightBuffer` (in-memory ring buffer of recent embeddings) and a `StaleFilter` (post-retrieval ranking adjustment). All three components follow the same four architectural patterns: fail-open gate, opt-in with sensible defaults, metric-rich observability, and trait-based abstractions for testability. See `.planning/research/ARCHITECTURE.md` for complete component designs with Rust structs and proto definitions. +The v2.6 architecture is purely additive: four new components plug into the existing handler pattern (Arc injection, checkpoint-based jobs, on-demand metrics computation). No architectural rewrite is required. The component dependency order (39 → 40 → 41 → 42) matches storage-before-handler, handler-before-ranking, ranking-before-lifecycle. All new storage uses RocksDB column families (CF_EPISODES, CF_EPISODE_METRICS) with the existing append-only immutability invariant. See `.planning/research/ARCHITECTURE.md` for full data flow diagrams and Rust struct definitions. **Major components:** -1. **DedupGate (enhanced NoveltyChecker)** (`memory-service/src/novelty.rs`) — rejects semantically duplicate events at ingest; two-tier check: InFlightBuffer first (O(n) linear scan on bounded set), then HNSW index (O(log n) for cross-session); wraps both in the existing timeout/fail-open wrapper -2. **InFlightBuffer** (`memory-service`, internal to DedupGate) — `VecDeque` with max_size (256) and max_age (5 min) eviction; stores raw event embeddings for the timing gap window; ~400KB memory footprint; volatile (lost on restart, acceptable by design) -3. **StaleFilter** (`memory-service/src/stale.rs` or integrated into `memory-retrieval`) — post-retrieval, pre-return; applies exponential time decay and pairwise supersession detection on top-k results only (never O(n)); exempts Constraint/Definition/Procedure memory kinds -4. **DedupConfig / StaleConfig** (`memory-types/src/config.rs`) — extends existing `NoveltyConfig`; `[novelty]` kept as deprecated alias for backward compatibility via `serde(alias)` -5. **DedupMetrics** (extended `NoveltyMetrics`) — adds buffer hit rate, HNSW fallback rate; exposed via new `GetDedupStatus` gRPC RPC - -**Data flow changes:** - -``` -Write path (BEFORE): IngestEvent -> validate -> serialize -> storage.put_event -> return -Write path (AFTER): IngestEvent -> validate -> serialize -> DedupGate.should_store() - -> embed (CandleEmbedder) - -> check InFlightBuffer (linear) - -> check HNSW (if not in buffer) - -> if novel: add to buffer, STORE - -> if dup: SKIP indexing only* - -> if STORE: storage.put_event -> return {created: true} - -> if SKIP: store event (append-only!), skip outbox* -> return {created: false, deduplicated: true} - -Read path (BEFORE): RouteQuery -> classify -> execute layers -> merge -> return -Read path (AFTER): RouteQuery -> classify -> execute layers -> merge -> StaleFilter.apply() -> return -``` - -*See Pitfall 3: "store event, skip outbox" preserves the append-only invariant for TOC segmentation. +1. **EpisodeHandler** (`crates/memory-service/src/episode.rs`) — 4 RPCs for episode lifecycle; uses Arc + optional VectorTeleportHandler for similarity search; episodes are immutable after CompleteEpisode (enforces append-only invariant) +2. **RankingPayloadBuilder** (`crates/memory-service/src/ranking.rs`) — composes salience × usage_adjusted × (1 - stale_penalty) into final_score with human-readable explanation; extends TeleportResult proto field +3. **ObservabilityHandler extensions** — GetRankingStatus + GetDedupStatus + GetEpisodeMetrics; reads from primary CF data, no separate metrics store (single source of truth, no sync issues) +4. **EpisodeRetentionJob** (`crates/memory-scheduler/src/jobs/episode_retention.rs`) — daily 2am cron; deletes episodes where (age > 180d AND value_score < 0.3); checkpoint-based crash recovery +5. **VectorPruneJob** (`crates/memory-scheduler/src/jobs/vector_prune.rs`) — weekly Sunday 1am; copy-on-write HNSW rebuild in temp directory with atomic rename; zero query downtime during rebuild ### Critical Pitfalls -The PITFALLS researcher identified 4 critical, 4 high-severity, and 3 minor pitfalls. See `.planning/research/PITFALLS.md` for full analysis with codebase evidence and detection guidance. +See `.planning/research/PITFALLS.md` for full analysis with codebase evidence. All pitfalls are from v2.5's dedup/ranking architecture that v2.6 must build on top of correctly. -**Top 5 by severity:** +1. **HNSW index contains TOC summaries, NOT raw events** — Reusing the existing HNSW index for raw event dedup produces misleading similarity scores (~0.6–0.7). The InFlightBuffer (256-entry, RwLock, stores raw event embeddings) is the correct primary dedup source for within-session comparison. HNSW search is secondary for cross-session only. -1. **HNSW index contains TOC nodes/grips, NOT raw events (Pitfall 8)** — Reusing the existing HNSW index for dedup compares incoming events to summaries, producing misleading similarity scores (~0.6-0.7 instead of 0.85+). Comparing "implement JWT token validation" (event) vs "Day summary: authentication work" (TOC node) will NOT catch the duplicate. **Prevention:** The InFlightBuffer (which stores raw event embeddings by design) is the primary dedup source; the HNSW index is a secondary fallback for cross-session only. Do NOT attempt to reuse the TOC/grip index for dedup at raw event granularity. +2. **Threshold miscalibration for all-MiniLM-L6-v2** — Cosine similarity scores cluster [0.07, 0.80] for unrelated content with this model. Default dedup threshold must be 0.85+ (not the 0.82 novelty default). Below 0.70 causes dangerous false positives and PERMANENT data loss in the append-only store. Use dry-run mode for one week before enabling dedup in production. -2. **Timing gap: burst duplicates escape detection (Pitfall 1)** — The outbox pipeline is async; events ingested in rapid succession cannot see each other in the HNSW index. Within-session duplicates (the most common kind) are exactly what the current design misses. **Prevention:** InFlightBuffer catches these — it holds raw embeddings for the last N events with a TTL covering the maximum expected indexing lag. Size 256 entries x 5min TTL covers typical session bursts. +3. **Ranking score collapse from multiplicative signal stacking** — Salience × usage × stale × novelty penalties compound destructively. Define composition formula before implementation. Stale penalty must be bounded at max 30% reduction. Exempt Constraint/Definition/Procedure memory kinds from all decay signals. The `min_confidence: 0.3` threshold in RetrievalExecutor will silently drop results pushed below it. -3. **Dedup drops break the append-only invariant (Pitfall 3)** — Dropping events at ingest changes event counts, breaking TOC segment boundaries, causing segments to cover longer time spans, and potentially omitting discussed topics from day summaries. **Prevention:** Store ALL events; for dedup duplicates, store the event to RocksDB but do NOT create an outbox entry (so it is never indexed into HNSW or BM25). Event count is preserved for segmentation; index quality is preserved by not indexing duplicates. This is a critical design decision that must be made before implementation. +4. **Append-only invariant: store events, skip outbox (not drop events)** — Dedup must store all events but skip the outbox entry for duplicates. Dropping events before storage breaks TOC segmentation (segment boundaries use event counts) and breaks causality debugging. The store-and-skip-outbox pattern (implemented in v2.5) is the architectural precedent. -4. **Stale filtering hides critical historical context (Pitfall 4)** — Conversational memory is not a news feed; old context is frequently the most important. An agent asking "what was the authentication approach we decided on?" needs the ORIGINAL decision (old, high-salience), not the latest passing mention (new, low-salience). Stale filtering stacked with existing salience + usage_decay can bury the right answer below the `min_confidence` threshold. **Prevention:** Exempt `Constraint`, `Definition`, and `Procedure` memory kinds from staleness penalties entirely; cap maximum stale penalty at 30% score reduction; apply stale filtering AFTER the fallback chain resolves (not within individual layer results). +5. **HNSW write lock blocks dedup reads during index rebuild** — VectorIndexUpdater holds write lock for batch inserts; dedup reads queue behind it. Use try_read() with InFlightBuffer fallback. The VectorPruneJob copy-on-write approach (temp dir → atomic rename) eliminates contention during lifecycle sweeps. -5. **Threshold miscalibration for all-MiniLM-L6-v2 (Pitfall 2)** — The model's cosine similarity distribution is non-intuitive: unrelated content scores 0.20-0.40, near-duplicates 0.75-0.85, verbatim duplicates 0.85+. The existing `NoveltyConfig` default of 0.82 was set for novelty detection (a different use case); for dedup the consequences of false positives are IRREVERSIBLE (event never stored). **Prevention:** Default threshold 0.85 for dedup; mandatory dry-run mode for first week; per-event-type thresholds; compound check (cosine + Jaccard token overlap) to reduce false positives. +## Implications for Roadmap -**Additional high-severity pitfalls:** -- **Embedding latency on hot path (Pitfall 5)**: Candle runs synchronously; on CI Linux or older hardware, embedding takes 20-50ms. Prevention: text hash pre-check for exact duplicates before computing embedding; embedding cache; increase timeout to 200ms; skip structural events. -- **HNSW RwLock contention (Pitfall 6)**: Indexing pipeline holds write lock while dedup reads; under load, dedup times out during indexing runs. Prevention: use `try_read()` with buffer fallback; never block ingest path on HNSW lock. -- **Stale filtering interacts poorly with ranking layers (Pitfall 7)**: Score collapse when stale penalty stacks with salience + usage_decay + novelty. Prevention: bounded penalty (max 30%), test against existing 29 E2E queries before ship. -- **Dedup + Novelty double-filtering**: Two similarity checks on ingest path with different thresholds create unpredictable interaction. Prevention: dedup REPLACES novelty filtering; unify into single `DedupConfig`; keep `[novelty]` as deprecated alias. +Based on combined research, the suggested phase structure for v2.6 maps to phases 39–42 as defined in ARCHITECTURE.md. The ordering respects storage-before-handler dependencies, puts observability before lifecycle (so jobs can report metrics), and treats episodic storage as the foundation all other features depend on. -## Implications for Roadmap +### Phase 39: Episodic Memory Storage Foundation + +**Rationale:** All other v2.6 phases depend on CF_EPISODES and the Episode proto schema. This is the lowest-risk phase — pure storage additions following established patterns (cf_descriptors, serde-serialized structs, ULID keys). No handler logic, no new RPCs yet. Building storage first allows thorough unit testing before handler complexity is introduced. + +**Delivers:** CF_EPISODES column family, CF_EPISODE_METRICS column family, Episode/EpisodeAction/EpisodeOutcome proto messages, Episode Rust struct in memory-types, Storage::put_episode/get_episode/scan_episodes helpers, unit tests for CRUD operations. -Based on combined research, the implementation should follow a dependency-aware 4-phase structure. The dedup work (write path, higher risk) comes before stale filtering (read path, lower risk). Design decisions must precede implementation to avoid the critical pitfalls. +**Addresses:** "Episodic Memory Storage & Schema" (table stakes), foundation for "Value-Based Episode Retention." -### Phase 1: DedupGate Foundation +**Avoids:** Embedding episode storage logic in the handler layer before the storage layer is tested and stable. -**Rationale:** Pure data structures and enhanced checker can be fully unit-tested before touching the ingest path. The InFlightBuffer and enhanced NoveltyChecker are the riskiest new code (they define correctness); isolate them for thorough testing. +**Research flag:** Standard patterns — RocksDB column family additions are well-documented in existing codebase. No additional research needed; use CF_TOPICS and CF_TOPIC_LINKS additions from v2.0 as templates. -**Delivers:** InFlightBuffer data structure; enhanced NoveltyChecker wired to real `CandleEmbedder` and `HnswIndex`; DedupConfig in memory-types; unit tests with MockEmbedder + MockVectorIndex. +--- -**Addresses (from FEATURES.md):** Ingest-time vector similarity gate (table stakes), fail-open behavior (table stakes), configurable threshold (table stakes), minimum text length bypass (table stakes). +### Phase 40: Episodic Memory Handler & RPCs -**Avoids (from PITFALLS.md):** Timing gap (Pitfall 1) via InFlightBuffer; TOC/grip index reuse (Pitfall 8) by using buffer as primary source; threshold miscalibration (Pitfall 2) by implementing dry-run mode. +**Rationale:** After storage foundation is stable, the handler can be built following the Arc injection pattern used by RetrievalHandler and AgentDiscoveryHandler. This phase completes the episodic memory user-facing API before ranking or lifecycle features touch it. Episode similarity search (GetSimilarEpisodes) uses the existing HNSW index — the same vector infrastructure, different granularity than dedup. -**Needs research:** Threshold calibration for all-MiniLM-L6-v2 — need calibration test fixture with known similarity pairs covering identical, near-duplicate, related, and unrelated text pairs. +**Delivers:** EpisodeHandler struct (memory-service/src/episode.rs), StartEpisode/RecordAction/CompleteEpisode/GetSimilarEpisodes RPCs, handler wired into MemoryServiceImpl, optional embedding generation on CompleteEpisode for similarity indexing, E2E test: start → record → complete → retrieve similar. -### Phase 2: Wire DedupGate into Ingest Path +**Addresses:** "Episodic Memory Storage & RPCs" (table stakes), "Retrieval Integration for Similar Episodes" (differentiator). -**Rationale:** Depends on Phase 1 being solid. Changes the write path (higher risk than read path). Proto changes and integration tests required. Fail-open design ensures backward compatibility on any failure. +**Avoids:** HNSW lock contention during GetSimilarEpisodes — use try_read() pattern; never block on write lock. Episode records are immutable after CompleteEpisode — enforce via early return Err(EpisodeAlreadyCompleted) in RecordAction. -**Delivers:** DedupGate injected into `MemoryServiceImpl`; store-event-skip-outbox behavior for duplicates (preserving append-only invariant); proto additions (`IngestEventResponse.deduplicated`, `GetDedupStatus` RPC, field numbers 201+); integration tests proving dedup catches burst duplicates. +**Research flag:** Standard patterns — handler injection + ULID key + vector search are established in v2.5. No additional research needed. -**Addresses (from FEATURES.md):** Dedup metrics/observability via gRPC (table stakes), per-event-type dedup bypass (differentiator), dedup dry-run mode (differentiator). +--- -**Avoids (from PITFALLS.md):** Append-only invariant break (Pitfall 3) via store-event-skip-outbox design; HNSW RwLock contention (Pitfall 6) via try_read() + buffer fallback; embedding latency (Pitfall 5) via text hash pre-check and skip for structural events; dedup+novelty double-filtering via unified DedupConfig replacing NoveltyConfig. +### Phase 41: Ranking Payload & Observability -**Standard patterns:** Wiring pattern is straightforward given Phase 1 foundation; unlikely to need deeper research. +**Rationale:** Ranking quality improvements (salience + usage decay composition) are the highest-value retrieval changes in v2.6. They depend on v2.5's SalienceScorer and CF_USAGE_COUNTERS already being in place, and on Phase 39's Episode storage for GetEpisodeMetrics. This phase also extends admin observability RPCs to expose the metrics needed for lifecycle monitoring in Phase 42. Hybrid search BM25 routing wiring must be confirmed or completed here — FEATURES.md identifies it as the critical path prerequisite. -### Phase 3: StaleFilter +**Delivers:** RankingPayloadBuilder (memory-service/src/ranking.rs), composed final_score = salience × usage_adjusted × (1 - stale_penalty), explanation field in TeleportResult, GetRankingStatus extension (usage_tracked_count, memory_kind_distribution), GetDedupStatus extension (buffer_memory_bytes, dedup_rate_24h_percent), GetEpisodeMetrics RPC (new), unit tests for ranking formula, E2E test for RouteQuery explainability. -**Rationale:** Read-path only — no data mutation concerns. Can be built/tested in parallel with Phase 2 if resources allow. Depends on having retrieval infrastructure in place (which predates v2.5). +**Addresses:** "Salience Scoring at Write Time" (table stakes), "Usage-Based Decay in Ranking" (table stakes), "Admin Observability RPCs" (table stakes), "Multi-Layer Decay Coordination" (differentiator), "Hybrid Search" wiring (table stakes — confirm or complete). -**Delivers:** `StaleFilter` component in memory-service or memory-retrieval; `StalenessConfig` in memory-types (alongside `NoveltyConfig`); exponential time-decay factor applied post-retrieval on top-k results; pairwise supersession detection (O(k^2) bounded, k<=20); Constraint/Definition/Procedure kind exemptions; bounded penalty (max 30% reduction). +**Avoids:** Score collapse from unbounded stale penalty — cap at max 30% reduction; exempt Constraint/Definition/Procedure from all decay; define formula as named constants before threading through callers. Apply stale filtering AFTER the fallback chain resolves, not within individual layer results. -**Addresses (from FEATURES.md):** Temporal decay in ranking (table stakes), staleness half-life configuration (differentiator), stale result exclusion window (differentiator, partial). +**Research flag:** Needs attention before planning. The exact composition formula weights (salience=0.5, usage=0.3, stale=0.2) are initial guesses from STACK.md config — validate against E2E test queries before shipping. Also inspect `crates/memory-service/src/hybrid.rs` to confirm actual state of BM25 routing wiring. -**Avoids (from PITFALLS.md):** Historical context buried (Pitfall 4) via kind exemptions and bounded penalty; ranking score collapse (Pitfall 7) via bounded penalty and post-fallback-chain application; O(n^2) comparison (Architecture anti-pattern) by bounding to top-k. +--- -**May need research:** Interaction between stale filtering and existing min_confidence threshold — run against existing 29 E2E queries to verify no regressions before finalizing score formula. +### Phase 42: Lifecycle Automation Jobs -### Phase 4: E2E Validation and Observability +**Rationale:** Lifecycle jobs are last because they depend on Phase 39 (episode storage to scan), Phase 41 (observability to report job metrics), and the v2.5 scheduler framework. VectorPruneJob uses copy-on-write (temp dir + atomic rename) to avoid query downtime. BM25 pruning is explicitly deferred — it requires SearchIndexer write access that needs a separate design pass (noted as "Phase 42b" in ARCHITECTURE.md). -**Rationale:** Validates both features working end-to-end through the real pipeline. CLI bats tests provide regression coverage. Standard E2E patterns. +**Delivers:** EpisodeRetentionJob (daily 2am, deletes episodes where age > 180d AND value_score < 0.3), VectorPruneJob (weekly Sunday 1am, copy-on-write HNSW rebuild), checkpoint-based crash recovery for both jobs, cron registration in memory-daemon/src/main.rs, integration test for checkpoint recovery, E2E test for vector index shrinkage after prune. -**Delivers:** E2E tests for duplicate event rejection, near-duplicate rejection, stale result downranking, fail-open on embedder failure, fail-open on timeout; CLI bats tests for dedup behavior; `GetDedupStatus` and `SetDedupThreshold` gRPC admin RPCs for runtime tuning. +**Addresses:** "Vector Index Pruning" (table stakes), "BM25 Index Maintenance" (table stakes, partial — full wiring deferred), "Value-Based Episode Retention" (differentiator, threshold-based initial implementation using value_score < 0.3 hardcoded rather than percentile analysis). -**Addresses (from FEATURES.md):** E2E proof that dedup works (table stakes), dedup metrics exposed via gRPC (table stakes). +**Avoids:** Episode retention job deleting wrong records — conservative defaults (max_age=180d, threshold=0.3), dry-run mode, checkpoint recovery so aborted sweeps resume correctly. Vector prune locking out queries — copy-on-write pattern (temp directory → atomic rename) with RwLock on index directory pointer. -**Avoids (from PITFALLS.md):** Test fixture calibration problem (Pitfall 11) by building calibration test suite with pre-computed similarity pairs as ground truth; no runtime tuning gap (Pitfall 10) via admin RPCs; model version drift detection via model metadata in dedup index header. +**Research flag:** The copy-on-write HNSW prune is the most novel engineering in v2.6. Validate that usearch supports the atomic directory rename pattern under concurrent reads. If HNSW metadata file format (embedding_id → timestamp mappings) is unclear from source, request a `/gsd:research-phase` before implementation. -**Standard patterns:** E2E test patterns well-established in this codebase (29 existing tests as reference); unlikely to need deeper research. +--- ### Phase Ordering Rationale -- DedupGate foundation before wiring because the InFlightBuffer and trait adapters can be fully unit-tested in isolation — the highest-risk new code gets the most testing time before it touches the live ingest path. -- Ingest wiring before StaleFilter because write-path changes have higher risk than read-path changes; shipping dedup first also generates real dedup metrics to validate the approach. -- StaleFilter can proceed in parallel with Phase 2 if needed since they are independent subsystems (write path vs read path). -- E2E last because it validates both features working through the complete pipeline. -- Design decisions (append-only invariant, HNSW granularity, threshold defaults) must be recorded as architectural decisions before Phase 1 implementation begins — these cannot be retrofitted. +- **Storage first (39):** Every other phase reads or writes CF_EPISODES. Storage changes are also the hardest to retrofit safely; establishing the schema early prevents cascading changes later. +- **Handler second (40):** EpisodeHandler provides the write path. Once it exists, Phase 41's GetEpisodeMetrics RPC has real data to aggregate. +- **Ranking third (41):** RankingPayloadBuilder is the highest-value retrieval change and has no lifecycle dependency. It also exposes the observability RPCs needed for lifecycle job reporting. +- **Lifecycle last (42):** Jobs are background processes that can be added after all core functionality is tested. They depend on Phase 39 storage + Phase 41 metrics infrastructure. +- **Hybrid search wiring:** FEATURES.md identifies this as the critical path prerequisite (unblocks routing logic so salience + usage decay have effect on real results). Treat this as a pre-Phase-39 patch or include at the start of Phase 41. ### Research Flags -Phases likely needing deeper research during planning: -- **Phase 1 (DedupGate Foundation):** Threshold calibration for all-MiniLM-L6-v2 requires a calibration test that embeds known text pairs and records similarity distributions. Do not rely on intuition about similarity scores with this model. -- **Phase 3 (StaleFilter):** Score composition formula needs validation against existing 29 E2E tests. Run with stale filtering enabled and verify result count and top-score distributions show no regression before finalizing penalty bounds. +**Needs deeper research during planning:** +- **Phase 41 (Ranking formula weights):** The salience_weight/usage_weight/stale_weight config values are initial guesses. Validate against real query sets before shipping. Run existing 39 E2E tests with ranking_payload enabled to verify no regressions. +- **Phase 41 (Hybrid BM25 routing):** Inspect `crates/memory-service/src/hybrid.rs` before writing the phase plan — FEATURES.md reports "hardcoded routing logic" but exact state is unconfirmed. +- **Phase 42 (VectorPruneJob copy-on-write):** usearch HNSW atomic directory rename behavior under concurrent reads is the key risk. Verify RwLock release timing and directory pointer swap semantics from `crates/memory-vector/src/hnsw.rs`. -Phases with standard patterns (skip research-phase): -- **Phase 2 (Wire DedupGate into Ingest):** Straightforward wiring given Phase 1 foundation; proto extension pattern well-established (field numbers 201+). -- **Phase 4 (E2E Validation):** Standard bats + Rust E2E patterns; 29 existing tests provide strong reference. +**Standard patterns (skip research-phase):** +- **Phase 39 (Episodic storage):** RocksDB column family additions follow existing CF pattern exactly. Refer to CF_TOPICS and CF_TOPIC_LINKS additions in v2.0 as the template. +- **Phase 40 (EpisodeHandler):** Arc handler injection is well-established; RetrievalHandler and AgentDiscoveryHandler are direct templates. +- **Phase 42 (EpisodeRetentionJob):** Checkpoint-based scheduler jobs follow the existing outbox_processor and rollup job patterns exactly. ## Confidence Assessment | Area | Confidence | Notes | |------|------------|-------| -| Stack | HIGH | Direct codebase inspection confirmed all existing crates sufficient; no new deps. Locked versions (usearch 2.23.0, candle 0.8.4, rocksdb 0.22) verified in Cargo.lock. | -| Features | HIGH | NoveltyChecker precedent validates the dedup pattern; stale filtering is standard ranking math. External sources (Mem0, temporal RAG research) provide corroboration. | -| Architecture | HIGH | In-flight buffer + HNSW dual-check is proven in vector DB literature. All 4 critical pitfalls have concrete prevention strategies based on direct code analysis. | -| Pitfalls | HIGH | All pitfalls verified with specific file paths and line references in the codebase. Model threshold distributions backed by published research on all-MiniLM-L6-v2. | +| Stack | HIGH | No new dependencies; all technologies verified against workspace Cargo.toml on 2026-03-11; zero uncertainty about what to use | +| Features | HIGH | Feature list derived from direct codebase analysis (existing proto stubs, half-implemented handlers) + 20+ industry sources on hybrid search, episodic memory, lifecycle patterns | +| Architecture | HIGH | Direct codebase analysis — existing handler patterns, column family descriptors, scheduler registration, and proto field numbers all confirmed; build order respects dependency graph | +| Pitfalls | HIGH | Pitfalls derived from codebase evidence (specific file paths, line numbers, metrics confirmed) + vector search community patterns; HNSW contention, threshold calibration, and score collapse are all verifiable | **Overall confidence:** HIGH -### Gaps to Address - -These are unresolved questions that must be decided as architectural decisions at the start of Phase 1: - -- **Threshold calibration**: Exact threshold values for all-MiniLM-L6-v2 dedup need a calibration test with known text pairs. Current recommendation (0.85 default) is conservative but not empirically validated against the specific event corpus. Build calibration fixture in Phase 1 before setting production defaults. - -- **Append-only design decision**: "Store event, skip outbox" (PITFALLS recommendation) vs "drop at ingest" (STACK recommendation) need explicit resolution. The pitfalls researcher's analysis of TOC segmentation impact makes "store-and-skip-outbox" the recommended choice, but this is an architectural decision that affects Phase 1 design. Must be recorded in PROJECT.md before implementation. +The main uncertainty is not technical but operational: ranking formula weights (0.5/0.3/0.2) are initial guesses that require tuning against real query distributions once implemented. The copy-on-write HNSW prune is the most architecturally novel component and deserves a targeted investigation before Phase 42 planning. -- **HNSW lock contention strategy**: `try_read()` with buffer fallback vs periodic read-only HNSW snapshot. The in-flight buffer (Pitfall 6 prevention) is the primary defense, but the strategy for when try_read() fails needs explicit specification. - -- **Score composition formula for stale filtering**: The exact weighting of `vector_similarity * salience_weight * recency_factor * usage_boost` needs to be defined before Phase 3 to avoid score collapse. The PITFALLS researcher recommends bounded penalty (max 30%), the ARCHITECTURE researcher suggests `superseded_penalty = 0.3` for explicitly superseded results. These must be reconciled with the existing min_confidence threshold of 0.3 in `RetrievalExecutor`. - -- **Config backward compatibility**: `[novelty]` section in existing config.toml files must continue working. Use `serde(alias = "novelty")` on `DedupConfig`. Deprecation warning on startup when alias is used. This is a minor detail but must not be forgotten. +### Gaps to Address -- **Per-event-type dedup exemptions**: session_start, session_end, subagent_start, subagent_stop should bypass dedup entirely (structural events). user_message and assistant_stop should be deduped with conservative threshold. tool_result is ambiguous — may need a moderate threshold since repeated tool calls ARE legitimate duplicates. +- **Hybrid search routing code:** STACK.md notes BM25 routing is "not fully wired into routing" and FEATURES.md confirms "hardcoded routing logic." Inspect `crates/memory-service/src/hybrid.rs` before writing the Phase 41 plan to understand exact wiring needed. +- **CF_USAGE_COUNTERS schema:** UsageStats struct needs `last_accessed_ms` field added (not just access_count). Verify current schema in `crates/memory-storage/src/usage.rs` before Phase 41 — existing data may need migration handling. +- **VectorPruneJob metadata format:** The HNSW index metadata file format (embedding_id → timestamp mappings) needs to be confirmed from the usearch crate API. ARCHITECTURE.md assumes a metadata file exists; verify this assumption in `crates/memory-vector/src/hnsw.rs`. +- **BM25 lifecycle wiring:** STACK.md explicitly defers BM25 prune to "Phase 42b" because "SearchIndexer write access" needs its own design. Plan as a stretch goal or explicit follow-on outside the v2.6 scope. +- **Value-based episode retention algorithm:** FEATURES.md rates this HIGH complexity and recommends deferring to v2.6.2. Phase 42 should implement a simple threshold (value_score < 0.3) rather than the full percentile-distribution algorithm. ## Sources -### Primary (HIGH confidence — direct codebase inspection) -- `crates/memory-service/src/novelty.rs` — existing `NoveltyChecker` with `EmbedderTrait`, `VectorIndexTrait`, fail-open, metrics (6 skip categories), `NoveltyConfig` integration -- `crates/memory-service/src/ingest.rs` — `MemoryServiceImpl`, `IngestEvent` handler, `storage.put_event()` atomic write -- `crates/memory-indexing/src/pipeline.rs` — `IndexingPipeline`, `process_batch()`, outbox checkpoint tracking -- `crates/memory-indexing/src/vector_updater.rs` — `VectorIndexUpdater`, `find_grip_for_event()` returns None (critical: raw events NOT indexed), `index_toc_node()`, `index_grip()` -- `crates/memory-vector/src/hnsw.rs` — `HnswIndex`, `Arc>`, `MetricKind::Cos`, `search()` returns 1.0-distance -- `crates/memory-vector/src/metadata.rs` — `VectorEntry.created_at` (ms since epoch), `VectorMetadata` RocksDB store -- `crates/memory-types/src/config.rs` — `NoveltyConfig` (threshold 0.82, timeout 50ms, disabled by default) -- `crates/memory-types/src/usage.rs` — `usage_penalty()`, `apply_usage_penalty()` (pattern for staleness functions) -- `crates/memory-types/src/salience.rs` — `SalienceScorer`, `MemoryKind` enum (Constraint, Definition, Procedure) -- `crates/memory-retrieval/src/executor.rs` — `RetrievalExecutor`, `min_confidence: 0.3`, fallback chain execution -- `crates/memory-retrieval/src/types.rs` — `QueryIntent`, `CapabilityTier`, `StopConditions` -- `Cargo.lock` — usearch 2.23.0, candle-core 0.8.4, tantivy 0.25.0, rocksdb 0.22 (versions locked) -- `.planning/PROJECT.md` — architectural decisions, requirements, constraints - -### Secondary (MEDIUM confidence — published research and community) -- [Mem0: Building Production-Ready AI Agents](https://arxiv.org/abs/2504.19413) — LLM-based memory extraction and dedup (we deliberately avoid for latency reasons) -- [Temporal RAG: Why RAG Gets 'When' Questions Wrong](https://blog.sotaaz.com/post/temporal-rag-en) — temporal awareness critical for retrieval freshness -- [AI-Driven Semantic Similarity Pipeline (2025)](https://arxiv.org/html/2509.15292v1) — threshold calibration at 0.659 for literature dedup; score distribution [0.07, 0.80] for all-MiniLM-L6-v2 -- [Solving Freshness in RAG: A Simple Recency Prior](https://arxiv.org/html/2509.19376) — recency prior fused with semantic similarity for temporal ranking -- [OpenAI Community: Cosine Similarity Thresholds](https://community.openai.com/t/rule-of-thumb-cosine-similarity-thresholds/693670) — no universal threshold; 0.79-0.85 common for near-duplicate detection -- [Data Deduplication at Trillion Scale](https://zilliz.com/blog/data-deduplication-at-trillion-scale-solve-the-biggest-bottleneck-of-llm-training) — MinHash LSH at 0.8 threshold for near-duplicate detection -- [Enhancing RAG: Best Practices](https://arxiv.org/abs/2501.07391) — dedup in context assembly best practices -- [Data Freshness Rot in Production RAG](https://glenrhodes.com/data-freshness-rot-as-the-silent-failure-mode-in-production-rag-systems-and-treating-document-shelf-life-as-a-first-class-reliability-concern-2/) — document shelf life as first-class reliability concern +### Primary (HIGH confidence — codebase analysis) +- `crates/memory-types/src/` — SalienceScorer, UsageStats, UsageConfig, DedupConfig, StalenessConfig (confirmed 2026-03-11) +- `crates/memory-storage/src/` — dashmap 6.0, lru 0.12, RocksDB 0.22, CF definitions +- `crates/memory-search/src/lifecycle.rs` — Bm25LifecycleConfig, retention_map +- `crates/memory-scheduler/` — Tokio cron job framework, OverlapPolicy, JitterConfig +- `crates/memory-vector/src/hnsw.rs` — HNSW index wrapper, RwLock, cosine distance +- `crates/memory-service/src/novelty.rs` — NoveltyChecker fail-open design, timeout handling +- `crates/memory-indexing/src/vector_updater.rs` — Confirmed: indexes TOC nodes/grips, NOT raw events +- `proto/memory.proto` — Field numbers, existing message types, reserved ranges +- `.planning/PROJECT.md` — v2.6 requirements, architectural decisions +- `docs/plans/memory-ranking-enhancements-rfc.md` — Episodic memory Tier 2 spec + +### Secondary (HIGH confidence — industry sources) +- [all-MiniLM-L6-v2 Model Card](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) — Threshold calibration (0.659 for literature dedup, 0.85+ for conversational dedup) +- [Elastic: A Comprehensive Hybrid Search Guide](https://www.elastic.co/what-is/hybrid-search) — RRF fusion (k=60 constant), parallel BM25+vector execution +- [Google Vertex AI: About Hybrid Search](https://docs.cloud.google.com/vertex-ai/docs/vector-search/about-hybrid-search) — Score normalization patterns +- [Memory Patterns for AI Agents](https://dev.to/gantz/memory-patterns-for-ai-agents-short-term-long-term-and-episodic-5ff1) — Episodic memory design for agentic systems +- [Designing Memory Architectures for Production-Grade GenAI Systems](https://medium.com/@avijitswain11/designing-memory-architectures-for-production-grade-genai-systems-2c20f71f9a45) — Cognitive architecture layers +- [AI-Driven Semantic Similarity Pipeline (2025)](https://arxiv.org/html/2509.15292v1) — Threshold calibration, score distribution [0.07, 0.80] for all-MiniLM-L6-v2 +- [8 Common Mistakes in Vector Search](https://kx.com/blog/8-common-mistakes-in-vector-search/) — Threshold defaults, normalization pitfalls - [OpenSearch Vector Dedup RFC](https://github.com/opensearch-project/k-NN/issues/2795) — 22% indexing speedup, 66% size reduction from dedup -- [Event Sourcing Projection Deduplication](https://domaincentric.net/blog/event-sourcing-projection-patterns-deduplication-strategies) — at-least-once delivery and idempotency patterns -- [8 Common Mistakes in Vector Search](https://kx.com/blog/8-common-mistakes-in-vector-search/) — normalization and default threshold pitfalls -### Tertiary (LOW confidence — needs validation) -- [all-MiniLM-L6-v2 Similarity Discussion](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2/discussions/16) — community discussion of similarity thresholds; needs calibration test to validate against actual event corpus -- [pgvector HNSW Dedup Issue](https://github.com/pgvector/pgvector/issues/760) — HNSW index not used with combined dedup+distance ordering; usearch behavior may differ +### Tertiary (MEDIUM confidence — community patterns) +- [Event Sourcing Projection Deduplication](https://domaincentric.net/blog/event-sourcing-projection-patterns-deduplication-strategies) — Store-and-skip-outbox pattern validation +- [Redis: Full-text search for RAG apps: BM25 and hybrid search](https://redis.io/blog/full-text-search-for-rag-the-precision-layer/) — Hybrid search production patterns +- [What is agent observability?](https://www.braintrust.dev/articles/agent-observability-tracing-tool-calls-memory) — Admin metrics for agentic systems --- -*Research completed: 2026-03-05* +*Research completed: 2026-03-11* *Synthesized by: gsd-synthesizer from STACK.md, FEATURES.md, ARCHITECTURE.md, PITFALLS.md* *Ready for roadmap: yes* diff --git a/.planning/research/v2.6-PITFALLS.md b/.planning/research/v2.6-PITFALLS.md new file mode 100644 index 0000000..37fe3ed --- /dev/null +++ b/.planning/research/v2.6-PITFALLS.md @@ -0,0 +1,660 @@ +# Domain Pitfalls: v2.6 Episodic Memory, Ranking, Lifecycle, and Hybrid Search + +**Domain:** Adding episodic memory, salience/usage ranking, lifecycle automation, observability RPCs, and BM25 hybrid wiring to an existing Rust/RocksDB agent memory system + +**Researched:** 2026-03-11 + +**Overall Confidence:** HIGH (verified against v2.5 dedup/lifecycle patterns, Phase 16 ranking research, 2026 hybrid search literature, and RocksDB column family design) + +--- + +## Critical Pitfalls + +Mistakes that cause data loss, schema breakage, or major rewrites. + +--- + +### Pitfall 1: Score Normalization Mismatch in Hybrid Search (Phase 39) + +**What goes wrong:** Hybrid search results are incorrectly ranked because BM25 and vector scores are merged without proper normalization. BM25 scores are unbounded (0-500+), vector scores are bounded (0-1). Naive averaging produces results dominated by whichever scorer has higher absolute values, breaking relevance ordering. + +**Why it happens:** +- BM25 (TF-IDF from Tantivy) produces scores that scale with query length and corpus size — no fixed range +- Vector cosine similarity (from usearch HNSW) always returns [0, 1] +- Proto `HybridSearchRequest` (line 626-644 in memory.proto) defines `bm25_weight` and `vector_weight`, but no normalization specification +- The hybrid handler implementation (Phase 39 task) must choose: min-max normalization, RRF (reciprocal rank fusion), or score quantiles +- Each approach has different ranking characteristics and requires different tuning + +**Codebase evidence:** +- `proto/memory.proto:HybridSearchRequest` has `bm25_weight` and `vector_weight` fields with no normalization guidance +- `crates/memory-search/src/searcher.rs` returns TeleportResult with unbounded `score: f32` +- `crates/memory-vector/src/hnsw.rs` returns VectorEntry with normalized `score: f32` in [0, 1] +- No normalization layer currently exists (Phase 39 will add it) + +**Consequences:** +- Top results biased toward BM25 or vector depending on relative score ranges +- Relevance ranking feels arbitrary; users cannot tune behavior predictably +- Cannot reliably compare hybrid search quality to single-layer searches (baseline breaks) +- RRF requires reranking; weighted averaging requires choosing normalization; both change result order unpredictably + +**Prevention:** +1. **Design decision in Phase 39 plan (before coding):** Choose ONE normalization strategy and document it in plan +2. **Recommended approach:** RRF (reciprocal rank fusion) because: + - Score-agnostic (works on rank position, not absolute values) + - No tuning needed (just k=60 constant per 2026 research) + - Proven at scale (2009 SIGIR paper "Reciprocal Rank Fusion outperforms...", adopted by OpenSearch/Weaviate/Elastic) + - Formula: `score = 1 / (rank + 60)` for each layer, then merge and re-rank +3. **Alternative:** Min-max normalization per batch with configurable weights + - Requires tuning `bm25_weight` and `vector_weight` for specific query distribution + - More complex; only choose if RRF doesn't meet performance goals +4. **Validation in E2E tests:** Verify that hybrid mode produces results between single-BM25 and single-vector top scores, not dominated by one layer + +**Detection:** +- Warning sign 1: Hybrid results identical to single-layer results (one score dominates) +- Warning sign 2: Tuning `bm25_weight` / `vector_weight` has no effect on result ordering +- Warning sign 3: User feedback that "hybrid always returns BM25/vector results" (one layer dominating) + +**Phase to address:** Phase 39 (Hybrid Search Wiring) — lock normalization strategy in plan BEFORE implementation begins + +**Severity:** CRITICAL (wrong normalization = wrong ranking for weeks until tuning occurs) + +--- + +### Pitfall 2: Column Family CF_EPISODES Missing on Upgrade (Phase 36) + +**What goes wrong:** New instances get `CF_EPISODES` column family automatically, but existing RocksDB stores (v2.5 and earlier) don't have it. When v2.6 code tries to write episodes to an old database, it fails with "column family not found" error. + +**Why it happens:** +- `crates/memory-storage/src/column_families.rs` lists all column families in `ALL_CF_NAMES` array +- `Storage::open()` (in `crates/memory-storage/src/db.rs` line 37-58) calls RocksDB with `create_missing_column_families(true)` +- This flag tells RocksDB to CREATE missing CFs during open, but ONLY when the DB is opened for the first time +- Existing v2.5 databases were opened with v2.5's CF list; they were created WITHOUT `CF_EPISODES` +- When v2.6 adds `CF_EPISODES` to `ALL_CF_NAMES`, the RocksDB open call SHOULD auto-create it, but this is not guaranteed to happen on existing DBs if RocksDB's internal logic skips it + +**Codebase evidence:** +- `crates/memory-storage/src/column_families.rs:build_cf_descriptors()` creates descriptors for all CFs +- `Storage::open()` passes these to `DB::open_cf_descriptors()` +- Episode handlers (Phase 36) will call `self.db.cf_handle(CF_EPISODES)` which returns `None` if CF doesn't exist + +**Consequences:** +- Episode ingestion RPC fails on production instances running v2.5 stores with cryptic RocksDB error +- Users cannot upgrade from v2.5 to v2.6 without manual migration command +- If not caught in testing, affects all existing users +- Data integrity is NOT at risk (old data still safe), but forward compatibility breaks + +**Prevention:** +1. **In Phase 36 implementation:** + - Add `CF_EPISODES` to `column_families.rs::ALL_CF_NAMES` array before Phase 36 coding starts + - This triggers `create_missing_column_families(true)` to create the CF on next v2.6 `Storage::open()` + - Existing v2.5 DBs will get the CF created automatically on first v2.6 open +2. **Upgrade handling:** Episode write methods should handle missing CF gracefully: + ```rust + let cf_handle = self.db.cf_handle(CF_EPISODES) + .ok_or_else(|| StorageError::ColumnFamilyNotFound(CF_EPISODES.to_string()))?; + ``` + - This error should be transparent (not misleading to user) +3. **E2E test:** Create a v2.5-format DB (open with v2.5 storage), then: + - Open same DB with v2.6 storage (should auto-create CF_EPISODES) + - Write an episode (should succeed) + - Verify CF_EPISODES appears in DB + +**Detection:** +- Warning sign 1: RocksDB error "invalid column family" on episode write +- Warning sign 2: Documentation says "migration script required" post-upgrade +- Warning sign 3: Only new instances (post-v2.6 install) can write episodes + +**Phase to address:** Phase 36 (Episode Storage Foundation) — define CF_EPISODES and test upgrade path BEFORE other episode work + +**Severity:** CRITICAL (blocks upgrade; all v2.5 users affected) + +--- + +### Pitfall 3: Aggressive Pruning Causes Data Loss (Phase 37+) + +**What goes wrong:** Lifecycle automation jobs (vector/BM25 pruning) run on schedule and delete old index records per retention policy. A user asks "what did we discuss 60 days ago?" but the index records were pruned at day 60, so vector/BM25 search returns 0 results. Raw events still exist in `CF_EVENTS`, but without indexes, retrieval falls back to expensive TOC traversal. If retention policy is too aggressive (retention < expected lookback period), users lose search capability. + +**Why it happens:** +- Phase 16 design includes `PruneVectorIndex` and `PruneBm25Index` RPCs with configurable retention +- Proto `PruneVectorIndexRequest` (line 773) includes `age_days_override` parameter; default is `retention_days` from config +- Retention policy is configurable in `config.toml` under `[lifecycle]` section +- If user sets `vector_retention_days: 60` (for aggressive space saving), all events older than 60 days are pruned from the vector index +- Users with long lookback periods (e.g., "what architecture decisions did we make last quarter?") lose search capability when data ages + +**Codebase evidence:** +- `crates/memory-types/src/config.rs` will include a `LifecycleConfig` struct with retention days +- Phase 16 research document mentions this is disabled by default, waiting for Phase 37+ to finalize +- No current default documented; Phase 37 will define it + +**Consequences:** +- Search behavior degrades over time (old data becomes unsearchable) +- Users cannot reliably retrieve data older than retention period +- Fallback to TOC-based navigation is much slower (O(k) instead of O(log n)) +- No warning to users before data becomes unsearchable +- Unlike dedup false positives, this is not permanent data loss (raw events still exist) but search capability loss + +**Prevention:** +1. **Conservative defaults in Phase 37 (Lifecycle Foundation):** + - Default retention: **180 days** (6 months, not 90 days) + - Rationale: Most agent use cases need at least 6 months of searchability for quarterly/project reviews + - Document in migration guide: "If you need to search older data, increase `vector_retention_days` in config.toml" +2. **Explicit configuration requirement:** + - Require users to explicitly set retention in `config.toml` + - Do NOT have hidden defaults; make it visible + - Example config comment: + ```toml + [lifecycle] + # Vector index retention (default: 180 days) + # Older events are not found via vector search, fall back to TOC navigation + vector_retention_days = 180 + ``` +3. **Validation + logging:** + - On startup, log a WARNING if retention < 90 days: + ``` + WARN: vector_retention_days is 60 days. Events older than 60 days will not appear in vector search. + ``` + - In lifecycle job, before executing prune, log: + ``` + INFO: Pruning vector index. Events older than 2026-01-10 will be removed. + ``` +4. **Dry-run mode (already in proto):** + - `PruneVectorIndexRequest` has `dry_run: bool` field + - Document that admins should run `memory-admin prune-vector --dry-run` first to see what would be deleted +5. **Testing:** + - E2E test: Ingest events with timestamps, mock time passage, run prune, verify: + - Old events still in `CF_EVENTS` (raw storage intact) + - Old events NOT found via vector search + - TOC navigation still works (no data loss) + - Rollback scenario: if user undoes prune, search still works + +**Detection:** +- Warning sign 1: Search queries return 0 results for queries that should match old data +- Warning sign 2: User reports "I can't find things from last quarter" +- Warning sign 3: Retention policy is < 90 days without explicit acknowledgment + +**Phase to address:** Phase 37 (Lifecycle Automation Foundation) — set 180d default, require explicit config, test before shipping + +**Severity:** CRITICAL (permanent loss of search capability for large portion of data) + +--- + +### Pitfall 4: Score Collapse from Stacked Penalties (Phase 37 + 40) + +**What goes wrong:** Query results are penalized by multiple independent factors: time-decay (max 30% per Phase 37 research), supersession (15%), usage decay (unbounded per Phase 16). A frequently-accessed old result hit by all three penalties: `0.8 × 0.70 × 0.85 × 0.40 ≈ 0.19`. This result disappears from top-10 even though it should remain relevant. Penalty stacking causes score collapse unpredictably. + +**Why it happens:** +- Phase 37 (StaleFilter) applies time-decay: `score * (1.0 - max_penalty * decay_factor)` where max_penalty=0.30 +- Phase 37 also applies supersession: additional 15% penalty `score * 0.85` +- Phase 16/40 (Usage Tracking) apply usage decay: `score * 1.0 / (1.0 + decay_factor * access_count)`, unbounded downward +- These are applied sequentially in retrieval pipeline: SearchResult score -> time-decay -> supersession -> usage-decay +- Each factor is multiplicative; with three factors, even moderate values compound into harsh penalties +- No built-in floor; results can degrade from 0.8 to 0.1 (87.5% reduction) + +**Codebase evidence:** +- `crates/memory-types/src/salience.rs` (Phase 16) has `SalienceScorer` compute write-time salience +- Phase 37 research defines time-decay formula and supersession penalty (15% flat) +- `crates/memory-types/src/usage.rs` (Phase 16) has `usage_penalty()` function with unbounded decay +- No current penalty composition function; Phase 40 will wire these together + +**Consequences:** +- Results that should rank top-10 disappear from results +- Users see inconsistent behavior (same query returns different results based on result age/usage) +- Difficult to debug: user doesn't know which penalty caused the drop +- Cannot tune individual penalties without affecting overall behavior +- Penalty design must be coordinated across multiple phases; easy to miss during implementation + +**Prevention:** +1. **Design-time formula in Phase 37 plan (before Phase 40 wiring):** + - Bound combined penalty with explicit formula and hard floor + - Recommended composition: + ```rust + // Phase 37 time-decay + time_decay_factor = 1.0 - (max_penalty * (1.0 - e^(-age/half_life))) // 0.70-1.0 + + // Phase 37 supersession + if superseded { score *= 0.85; } // Additional 15% + + // Phase 40 usage decay + usage_factor = 1.0 / (1.0 + decay_factor * access_count) // 0.0-1.0 unbounded + + // FLOOR: Ensure minimum score = original × 0.50 + final_score = max(original_score * 0.50, original_score * time_decay_factor * 0.85 * usage_factor) + ``` + - This ensures worst-case penalty is 50% (not 87.5%) +2. **Configuration in Phase 37 plan:** + - Make each penalty individually configurable: `time_decay_enabled`, `supersession_enabled`, `usage_decay_enabled` + - Allow users to disable specific penalties if behavior is wrong + - Validate in config that combined theoretical max penalty is documented (should be ≤ 70%) +3. **Phase 37 + 40 coordination:** + - In Phase 37 plan, specify that Phase 40 will integrate usage decay + - In Phase 40 plan, reference Phase 37 formula + - Document in both plans the final composition formula +4. **Testing:** + - Property-based test: Generate random (score, age_days, access_count, superseded) tuples + - Verify `final_score >= 0.5 × original_score` (hard floor) + - Verify results don't disappear from top-100 unexpectedly + - Verify disabling one penalty significantly improves relevance +5. **Monitoring:** + - Track score distribution before/after penalties applied + - Alert if median score < 0.4 (indicates over-aggressive penalizing) + +**Detection:** +- Warning sign 1: Users report "relevant results disappeared after I accessed them" +- Warning sign 2: Score distribution shows heavy left tail (many results penalized below 0.3) +- Warning sign 3: Disabling one penalty category significantly improves result relevance + +**Phase to address:** Phase 37 (Lifecycle Automation Foundation) — design and document penalty composition formula; Phase 40 (Usage Tracking Integration) — implement with coordinated formula + +**Severity:** CRITICAL (breaks ranking quality; difficult to debug post-ship) + +--- + +### Pitfall 5: BM25 Index Reconstruction Blocking and Consistency (Phase 39) + +**What goes wrong:** Hybrid search uses Tantivy BM25 index. When pruning old documents (Phase 37+), segments must be merged to reclaim space. Full merges can take minutes on large indexes; if merge is interrupted (crash, timeout, signal), index may become corrupted or stale. Administrators don't have visibility into when rebuilds are needed. + +**Why it happens:** +- Tantivy stores documents in immutable segments +- Deleting documents marks them as deleted but doesn't reclaim space +- To reclaim space, segments must be merged (expensive, potentially blocking operation) +- Merge can be interrupted by process crash or explicit cancellation +- Recovery from interrupted merge is not well-defined for append-only pattern + +**Codebase evidence:** +- `crates/memory-search/src/searcher.rs` wraps Tantivy; no current rebuild capability +- Phase 39 will add rebuild RPC, but no design yet for merge coordination +- RocksDB checkpoint pattern (used in Phase 13+) could be adapted for BM25 rebuild state + +**Consequences:** +- BM25 search may stall or return stale results during merges +- Large indexes require explicit rebuild (not automatic) +- Risk of stale index if rebuild is interrupted (hybrid search serves old results) +- Operators must manually monitor and trigger rebuilds or accept performance degradation +- Disk space bloat if pruning doesn't trigger rebuild (dead docs accumulate) + +**Prevention:** +1. **Design decision in Phase 39 plan (Hybrid Search Wiring):** + - Store BM25 rebuild timestamp in checkpoint CF (similar to lifecycle job state) + - On startup, validate that BM25 index timestamp matches checkpoint timestamp + - If mismatch detected (rebuild was interrupted), log WARNING and offer admin command to rebuild +2. **Explicit prune/rebuild strategy:** + - Phase 37 prune job: Mark old documents as deleted in Tantivy (quick operation) + - Do NOT automatically merge segments (this is the blocking operation) + - Provide admin RPC: `RebuildBm25Index(dry_run: bool)` (explicit trigger, not background) + - Only run full rebuilds during maintenance windows (not in scheduler background job) + - Log progress and estimated completion time +3. **Monitoring:** + - Track BM25 index "dead" ratio (deleted docs / total docs) + - Alert if dead_ratio > 30% (recommend rebuild) + - Expose metric: `bm25_last_rebuild_timestamp` for operational visibility +4. **Testing:** + - Test prune (mark delete) + rebuild cycle + - Verify BM25 search works correctly after rebuild + - Test rebuild failure recovery (simulate crash during merge) + - Verify health check detects stale index + +**Detection:** +- Warning sign 1: BM25 search performance degrades (more dead docs = slower searches) +- Warning sign 2: Administrators see "rebuild recommended" in logs +- Warning sign 3: Hybrid search returns stale results after prune (rebuild was not run) + +**Phase to address:** Phase 39 (Hybrid Search Wiring) — design BM25 rebuild strategy and checkpoint tracking BEFORE implementation + +**Severity:** CRITICAL (index corruption or staleness can persist undetected) + +--- + +## Moderate Pitfalls + +Pitfalls that cause operational issues or UX problems but not permanent data loss. + +--- + +### Pitfall 6: Usage Tracking Write Explosion and Cache Eviction (Phase 40) + +**What goes wrong:** Usage counters are incremented on every search result access. In cache-first pattern (Phase 16 design), cache hits don't block on disk I/O; pending writes batch and flush every 60s. Under heavy load (10+ searches/sec × 20 results/search = 200 writes/sec), the pending write queue grows unbounded. LRU cache evicts old entries before flushes complete, causing duplicate writes to RocksDB or loss of recent access counts. + +**Why it happens:** +- `crates/memory-types/src/usage.rs` defines `UsageConfig` with defaults: `flush_interval_secs: 60`, `cache_size: 10_000` +- At 200 counter updates/second, 60 seconds of buffering = 12,000 updates queued +- LRU cache only holds 10K entries; it must evict ~2K entries before the flush window closes +- When a counter is evicted from cache (before being flushed), its pending write is lost +- On next search for that result, cache miss re-fetches from DB (which still has stale value) +- Result: recent usage patterns lost; decay ranking underestimates frequently-accessed results + +**Codebase evidence:** +- `crates/memory-types/src/usage.rs::UsageConfig` has hardcoded defaults +- `crates/memory-storage/src/usage.rs` implements cache with LRU eviction (not yet in v2.6) +- No current config validation for cache size vs load + +**Consequences:** +- Usage decay ranking unreliable under load +- Heavily-accessed results appear fresher than they are (decay not applied) +- Cache churn reduces hit rate (cache becomes ineffective) +- Hidden until production load testing (dev/staging may not trigger it) + +**Prevention:** +1. **Config validation in Phase 40 (Usage Tracking Integration):** + - Add formula-based default cache size: + ```rust + cache_size = max(10K, expected_results_per_sec * 60 * 2) + ``` + - If search rate is 10/sec, cache should be at least `10 * 60 * 2 = 1,200` (keep 2 minutes pending) + - Make configurable in `config.toml` with a validator + - Log cache size and expected buffer duration on startup +2. **Flush interval tuning:** + - Default to 30s (not 60s) to halve time-in-cache + - Reduce default cache_size to 5K with proportional flush + - Allow user override in config with validation +3. **Monitoring:** + - Add metric: `usage_cache_eviction_rate` (evictions / total puts) + - Add metric: `usage_pending_queue_size` (entries waiting to flush) + - Alert if eviction_rate > 10% (indicates cache too small) + - Log cache stats on each flush (how many entries flushed, how many evicted) +4. **Testing:** + - Load test with simulated 10+ searches/sec for 5 minutes + - Verify final usage counts in DB match expected values (within 1% tolerance) + - Check cache metrics are healthy (eviction_rate < 5%) + - Test with cache_size=0 (fallback to disk-on-read) + +**Detection:** +- Warning sign 1: High cache eviction rate in metrics (>10%) +- Warning sign 2: Usage counts plateau (not increasing despite more searches) +- Warning sign 3: Decay ranking doesn't improve; frequently-used results still rank high + +**Phase to address:** Phase 40 (Usage Tracking Integration) — validate config defaults and load-test before shipping + +**Severity:** MODERATE (ranking quality degraded under load; not permanent loss) + +--- + +### Pitfall 7: Episodic Memory Schema Bloat and Slow Retrieval (Phase 36) + +**What goes wrong:** Episode schema (Phase 36) stores all context needed for value evaluation: task description, actions taken, outcome, value score. If stored as large JSON blobs or with all fields indexed, storage balloons and iteration over episodes becomes slow. A system with millions of episodes may become expensive to scan for similarity or filtering. + +**Why it happens:** +- Episodic memory captures full task context for learning (needed for value scoring) +- Value-based retention requires evaluating outcomes; this information must be stored +- If all fields are stored uncompressed in a single record, each episode is 1-5KB +- Scanning 1M episodes for "similar episodes" requires iteration; at 5KB/episode = 5GB scan +- RocksDB iterators are fast but not free; this impacts query latency + +**Codebase evidence:** +- Phase 36 will design Episode proto message; schema TBD +- If designed like current TocNode/Grip (uncompressed JSON + all fields indexed), bloat will occur + +**Consequences:** +- Episode similarity search slow (> 100ms p99 for large databases) +- Storage bloat (episodes consume significant disk space, reducing SSD lifespan) +- RocksDB compaction time increases +- Vector embedding full episode text produces redundant/noisy embeddings + +**Prevention:** +1. **Schema design in Phase 36 (Episode Storage Foundation):** + - Separate lightweight record (id, outcome_score, timestamp) from full context + - Store lightweight in `CF_EPISODES`, full context in optional `CF_EPISODE_DETAILS` + - Index only: episode_id, outcome_score, timestamp, embedding_vector_id + - Lazy-load details only when needed (not on every retrieve) +2. **Compression:** + - Use Zstd compression for full episode context (already standard for CF_EVENTS) + - Lightweight record remains uncompressed (small, frequently accessed) +3. **Vector embedding strategy:** + - Embed episode SUMMARY (outcome + key action names), not full context + - Embed only if episode survives value-based retention (don't embed episodes you'll discard) + - Reuse all-MiniLM-L6-v2 from main memory system (consistent embeddings) +4. **Testing:** + - Perf test: Ingest 100K episodes, measure time to find 10 similar episodes + - Target: < 100ms p99 latency + - Measure storage: should be < 10MB for 100K simple episodes + - Monitor RocksDB compaction time + +**Detection:** +- Warning sign 1: Episode search queries returning slow (> 500ms p99) +- Warning sign 2: Database size grows faster than episode count justifies (> 50KB/episode) +- Warning sign 3: RocksDB compaction taking unexpectedly long + +**Phase to address:** Phase 36 (Episode Storage Foundation) — design schema with lazy-loading and compression upfront + +**Severity:** MODERATE (performance issue; not correctness) + +--- + +### Pitfall 8: Episodic Memory Value Score Inconsistency (Phase 36) + +**What goes wrong:** Episodes are retained based on outcome_score computed at episode completion. Over time, the scoring function may improve (e.g., weight task complexity differently). Old episodes have scores from the old function; new episodes from the new function. When filtering episodes by value, results are inconsistent or biased toward newer episodes. + +**Why it happens:** +- Episode value scoring is application-specific (no universal definition of "valuable") +- Scoring function typically improves over time with user feedback +- Changing the function requires re-scoring old episodes (expensive) +- If re-scoring is not done, bias persists (new episodes weighted differently) + +**Codebase evidence:** +- Phase 36 will design value scoring mechanism; strategy TBD +- No current version tracking in episode schema + +**Consequences:** +- Users cannot safely change scoring function without massive re-computation +- Value-based retention becomes locked to original scoring +- Bias toward newer episodes in queries (they have "better" scores from improved function) +- Users unhappy with retention policy but cannot change it without expensive re-processing + +**Prevention:** +1. **Design decision in Phase 36 plan (Episode Storage Foundation):** + - Store outcome_score AND score_version (hash or timestamp of scoring function) + - When querying episodes, note which versions are present in results + - Document: "Value-based retention is not retrospective. Changing the scoring function does not retroactively change retention decisions." +2. **Configuration:** + - Recommend users commit to a scoring function early and only change in major version bumps + - If change is necessary, provide `re_score_episodes` admin command (offline operation) +3. **Testing:** + - Test: Ingest episodes with score_v1, change function to v2, ingest new episodes + - Verify retention policy applies correctly to both versions + - Document in plan: mixing versions is supported but not recommended + +**Detection:** +- Warning sign 1: Users ask "how do I change the scoring function?" +- Warning sign 2: Episode retention seems biased toward newer episodes +- Warning sign 3: Queries return inconsistent episode sets across time + +**Phase to address:** Phase 36 (Episode Storage Foundation) — document immutability and version tracking in plan + +**Severity:** MODERATE (design limitation; not a bug) + +--- + +### Pitfall 9: Salience Scoring Keyword False Positives (Phase 16+) + +**What goes wrong:** Salience scorer (Phase 16, from v2.5 codebase) classifies memory kind by keyword matching (e.g., "I prefer" → Preference, "step 1" → Procedure). Simple keyword matching produces false positives: "step by step" in a sentence accidentally triggers Procedure classification, boosting salience when it shouldn't. Over time, false positives skew ranking. + +**Why it happens:** +- `crates/memory-types/src/salience.rs::SalienceScorer::classify_kind()` uses context-free keyword patterns +- Patterns are simple substring checks, not semantic analysis +- Natural language is ambiguous; keywords can appear in wrong contexts +- False positives are rare enough individually but accumulate over millions of events + +**Codebase evidence:** +- `crates/memory-types/src/salience.rs` line 162-221 shows classify_kind implementation +- Patterns like `lower.contains("step ")` are context-free + +**Consequences:** +- Some memories are incorrectly classified and over-boosted +- Ranking is slightly off but not obviously wrong +- Difficult to debug (users don't realize scoring is heuristic) + +**Prevention:** +1. **Acknowledge limitation in Phase 16 documentation:** + - Document that classification is heuristic, not ML-based + - Mention false positive rate estimate (e.g., "~2-3% false positive rate estimated") + - Note that non-Observation kinds are `nice-to-have` boosts, not critical for correctness +2. **Improve patterns in Phase 16:** + - Only match high-confidence patterns (e.g., "is defined as" is better than "is") + - Require more context (e.g., "step 1:" not just "step") + - Test classifier on 1K diverse sentences; manually verify false positive rate +3. **Configuration option:** + - Add `salience.classification_enabled: bool` in config (default true) + - If disabled, all memories default to Observation (no kind boost) +4. **Testing:** + - Unit test: Classify 100 diverse real-world sentences; manually verify no obvious false positives + - E2E test: Ingest varied event types, verify ranking seems reasonable + +**Detection:** +- Warning sign 1: Users report memories get boosted for unclear reasons +- Warning sign 2: Classifier is too aggressive (90%+ classified as non-observation) +- Warning sign 3: Tuning salience_weight has huge impact (implies false positives are significant) + +**Phase to address:** Phase 16 (Salience Scoring) — test classifier on real data, estimate false positive rate, document limitations + +**Severity:** MODERATE (ranking quality slightly off; not data loss) + +--- + +### Pitfall 10: Observability RPC Cardinality Explosion (Phase 38) + +**What goes wrong:** Observability RPCs (Phase 38) expose metrics like dedup buffer, ranking config, lifecycle state. If metrics include fine-grained breakdowns, cardinality explodes. Example: per-memory-kind stats (5 kinds) × per-layer (5 layers) × per-time-window (10) = 250 metric combinations. When exported to Prometheus, this balloons metric storage. + +**Why it happens:** +- It's natural to expose detailed breakdowns: "How many Constraint memories vs Observations?" +- Each metric combination creates new series in Prometheus +- Cardinality grows over time (especially with agent/time-window dimensions) +- Operators don't realize impact until Prometheus storage becomes expensive + +**Codebase evidence:** +- Phase 38 RPC design TBD; proto messages not yet finalized + +**Consequences:** +- Prometheus metric storage becomes expensive/hard to manage +- Monitoring dashboards become bloated and slow +- Time-series database struggles with high cardinality +- Difficult to diagnose which metrics are problematic + +**Prevention:** +1. **Design decision in Phase 38 plan (Observability Foundation):** + - Limit metrics to high-level aggregates ONLY + - Recommended metrics (total <30 series): + - `dedup_events_checked_total`, `dedup_events_deduplicated_total` (no per-threshold breakdown) + - `ranking_salience_enabled`, `ranking_usage_decay_enabled` (booleans, not counts) + - `lifecycle_vector_pruned_total`, `lifecycle_bm25_pruned_total` (totals, no per-level) + - Accept that detailed analysis requires RPC calls, not Prometheus metrics +2. **Configuration:** + - Make metrics opt-in (disabled by default) + - Cardinality budgets (e.g., max 50 metric series total) +3. **Testing:** + - Cardinality test: Enable all monitoring, check metric count + - Verify < 50 total metric series (not per-agent or per-window) + +**Detection:** +- Warning sign 1: Prometheus storage grows 10MB+/day (excessive) +- Warning sign 2: Cardinality warnings from Prometheus +- Warning sign 3: GetRankingStatus response includes fine-grained breakdowns + +**Phase to address:** Phase 38 (Observability RPCs) — design with explicit cardinality limits in plan + +**Severity:** MODERATE (operational overhead; not functional issue) + +--- + +## Minor Pitfalls + +Pitfalls that are easy to miss or have subtle consequences. + +### Pitfall 11: Proto Field Numbers Collision (Phase 36+) + +**What goes wrong:** Phase 36+ add new fields to proto messages. Field numbers 1-200 are already in use; new fields use 201+. If a developer accidentally reuses number 1-100, it silently corrupts serialized data. + +**Prevention:** +- Document reserved ranges in proto comments (e.g., "Fields 1-200 reserved for v1.0-v2.5") +- Code review: Check all new proto fields use >= 201 +- Use a linting tool to prevent reuse + +**Phase to address:** Phase 36 (Episode Storage Foundation) — add proto comments documenting reserved ranges + +**Severity:** MINOR (caught by code review) + +--- + +### Pitfall 12: Config.toml Backward Compatibility (Phase 36+) + +**What goes wrong:** New config sections (e.g., `[episodic_memory]`, `[lifecycle]`) are added. Old `config.toml` files don't have these sections; deserialization fails. + +**Prevention:** +- Use serde `#[serde(default)]` for all new config structs (already pattern in v2.5) +- Test: Load v2.5 `config.toml` with v2.6 code; should succeed +- Document that all new sections are optional + +**Phase to address:** Any config-adding phase (36+) — test backward compat in setup + +**Severity:** MINOR (tested during setup) + +--- + +## Phase-Specific Warnings + +| Phase | Topic | Likely Pitfall | Mitigation | +|-------|-------|-------|-------| +| 36 | Episode Storage | CF_EPISODES missing on upgrade | Add CF to column_families.rs; test upgrade | +| 36 | Episode Schema | Storage bloat and slow retrieval | Lazy-load details, compress, don't embed full context | +| 36 | Value Scoring | Inconsistent scores over time | Document immutability, store score_version | +| 37 | Lifecycle Automation | Aggressive pruning loses search | Conservative 180d default, explicit config | +| 37 | StaleFilter | Wrong reference point | Newest result (locked in v2.5 research) | +| 37 | Score Composition | Penalty stacking | Design formula with floor, coordinate with 40 | +| 38 | Observability | Metric cardinality | Aggregates only, < 50 series | +| 39 | Hybrid Search | Score normalization | Choose RRF or min-max, validate in E2E | +| 39 | BM25 Rebuild | Index consistency | Checkpoint state, explicit rebuild commands | +| 40 | Usage Tracking | Cache eviction | Validate size formula, load test | + +--- + +## Summary + +**Critical issues (affect correctness or data):** +1. **Pitfall 1** (Hybrid score normalization) — Must choose strategy upfront or ranking is wrong +2. **Pitfall 2** (CF_EPISODES missing) — Blocks upgrade; all v2.5 users affected +3. **Pitfall 3** (Aggressive pruning) — Permanent loss of search capability (raw data safe) +4. **Pitfall 4** (Score collapse) — Ranking breaks; difficult to debug +5. **Pitfall 5** (BM25 rebuild) — Index staleness/corruption risk + +**Moderate issues (operational/UX):** +6. **Pitfall 6** (Usage cache eviction) — Ranking unreliable under load +7. **Pitfall 7** (Episode bloat) — Performance degradation +8. **Pitfall 8** (Value score inconsistency) — Limitation (not a bug) +9. **Pitfall 9** (Keyword false positives) — Ranking slightly off +10. **Pitfall 10** (Metric cardinality) — Operational overhead + +**Research-backed recommendations:** +- Episodic memory design follows MemGPT pattern (value scoring + selective retention) +- Hybrid search should use RRF per 2026 industry consensus (Weaviate, OpenSearch, Elastic) +- Usage tracking cache must scale with load (formula-based sizing) +- Lifecycle pruning must be conservative (180 days default, not 90) +- Penalty composition must be designed holistically (three independent phases need coordination) + +--- + +## Sources + +### Episodic Memory & Value-Based Retention +- [MemGPT: Engineering Semantic Memory through Adaptive Retention and Context Summarization](https://informationmatters.org/2025/10/memgpt-engineering-semantic-memory-through-adaptive-retention-and-context-summarization/) +- [AI Agent Memory: Build Stateful AI Systems That Remember](https://redis.io/blog/ai-agent-memory-stateful-systems/) +- [How to Implement Long-Term Memory](https://oneuptime.com/blog/post/2026-01-30-long-term-memory/view) +- [ICLR 2026 Workshop Proposal MemAgents: Memory for LLM-Based Agentic Systems](https://openreview.net/pdf?id=U51WxL382H) + +### Hybrid Search & Score Normalization +- [Hybrid Search Explained | Weaviate](https://weaviate.io/blog/hybrid-search-explained) +- [Building effective hybrid search in OpenSearch: Techniques and best practices](https://opensearch.org/blog/building-effective-hybrid-search-in-opensearch-techniques-and-best-practices/) +- [Advanced RAG — Understanding Reciprocal Rank Fusion in Hybrid Search](https://glaforge.dev/posts/2026/02/10/advanced-rag-understanding-reciprocal-rank-fusion-in-hybrid-search/) +- [Reciprocal Rank Fusion (RRF) for Hybrid Search](https://apxml.com/courses/advanced-vector-search-llms/chapter-3-hybrid-search-approaches/rrf-fusion-algorithms) +- [Hybrid Search Scoring (RRF) - Azure AI Search | Microsoft Learn](https://learn.microsoft.com/en-us/azure/search/hybrid-search-ranking) + +### RocksDB & Column Family Migration +- [RocksDB Compatibility Between Different Releases](https://github.com/facebook/rocksdb/wiki/RocksDB-Compatibility-Between-Different-Releases) +- [Column Families · facebook/rocksdb Wiki | GitHub](https://github.com/facebook/rocksdb/wiki/column-families) +- [How to maintain the forward/backward compatibility between different versions · apache/kvrocks · Discussion #1678](https://github.com/apache/kvrocks/discussions/1678) + +### Internal References (v2.5 & Phase 16) +- `.planning/milestones/v2.5-phases/37-stale-filter/37-RESEARCH.md` — Time-decay formula, supersession penalty, MemoryKind exemptions +- `.planning/milestones/v2.0-phases/16-memory-ranking-enhancements/16-RESEARCH.md` — Salience scoring, usage tracking patterns +- `.planning/PROJECT.md` — Append-only invariant, column family design + +--- + +*Research document created: 2026-03-11* +*Confidence: HIGH (verified against v2.5 patterns, Phase 16 ranking research, 2026 industry literature)* diff --git a/.planning/research/v2.6-SUMMARY.md b/.planning/research/v2.6-SUMMARY.md new file mode 100644 index 0000000..1407698 --- /dev/null +++ b/.planning/research/v2.6-SUMMARY.md @@ -0,0 +1,155 @@ +# v2.6 Research Summary: Episodic Memory, Ranking, Lifecycle Automation + +**Project:** Agent Memory — Local agentic memory with episodic learning +**Researched:** 2026-03-11 +**Overall Confidence:** HIGH + +## Executive Summary + +The v2.6 milestone builds on a mature, tested foundation (v2.5 shipped semantic dedup + stale filtering). The goal is adding episodic memory (task outcome tracking), salience/usage-based ranking refinements, and lifecycle automation (vector/BM25 pruning). + +**Key finding: Zero new external dependencies required.** + +All features use existing stack: +- **RocksDB 0.22** for new CF_EPISODES column family (append-only) +- **Candle + all-MiniLM-L6-v2** for episode embeddings (already proven) +- **usearch 2** for episode similarity search (existing HNSW) +- **Tokio scheduler** for lifecycle jobs (existing framework) +- **Tantivy 0.25** for BM25 hybrid wiring (just routing, no new package) + +The work is primarily: +1. Proto schema extensions (Episode messages, field numbers > 200) +2. New crate memory-episodes (uses existing storage APIs) +3. Wiring existing APIs into scheduler (vector prune, BM25 rebuild) +4. Threading salience/usage through retrieval handlers +5. Configuration additions to lifecycle section + +## Key Findings + +### Technology Stack + +See `.planning/research/STACK.md` for full detail. + +**Summary:** +- No new Rust dependencies added to workspace +- Tantivy (BM25), usearch (HNSW), Candle (embeddings), RocksDB all proven in production +- All v2.5 assumptions validated: append-only RocksDB, local embeddings, in-memory dedup buffer +- Configuration-driven lifecycle (no code churn per deployment) + +### Architecture Decisions + +**Episodic Storage:** +- RocksDB column family CF_EPISODES (append-only, same pattern as events/toc_nodes) +- Episode embeddings stored in existing HNSW index with metadata tag +- No separate vector database needed + +**Salience Integration:** +- SalienceScorer already defined in memory-types (since v2.0 prep) +- Usage stats via dashmap cache (already in memory-storage) +- Thread through BM25, vector, topic ranking layers +- Formula: `score = base * (0.55 + 0.45 * salience) * usage_penalty(access_count)` + +**Lifecycle Automation:** +- Vector pruning job: existing API `VectorIndexPipeline::prune()`, needs scheduler wiring +- BM25 lifecycle: rebuild job with level filter (segments/grips only after 30d, then day+ only) +- Episode pruning: value-based (threshold 0.18 for "sweet spot" difficulty episodes) +- Config-driven with cron expressions + +**Hybrid Search:** +- HybridSearch RPC exists in proto, handler partially implemented +- BM25 not wired yet (hardcoded `false` in some paths) +- Apply RRF fusion: `score = 60/(60+rank_bm25) + 60/(60+rank_vector)` +- No algorithmic changes, just wiring + RRF ranking + +### Risks and Mitigations + +| Risk | Severity | Mitigation | +|------|----------|-----------| +| Episode value threshold tuning | MEDIUM | Dry-run mode, metrics exposed via GetDedupStatus analogue | +| Salience weighting constants | MEDIUM | Config-driven, not hardcoded; can A/B test | +| Lifecycle job conflicts | LOW | Use explicit locking (existing pattern from TOC rollups) | +| Vector index corruption during prune | LOW | Soft-delete pattern (filter during rebuild), not physical deletion | +| Agent filtering in hybrid search | LOW | Metadata already threaded, just needs sanity check in fusion | + +## Implications for Roadmap + +### Recommended Phase Structure + +**Phase 1: Proto + Episodic Storage (1 week)** +- Add Episode, StartEpisodeRequest/Response to proto (field numbers > 200) +- Create memory-episodes crate with RocksDB operations +- Write tests for episode creation, retrieval, embedding + +**Phase 2: Salience + Usage Ranking (1 week)** +- Thread SalienceScorer through BM25TeleportHandler, VectorTeleportHandler, TopicHandler +- Integrate usage stats from storage into result ranking +- Update GetRankingStatus RPC to expose salience/usage metrics + +**Phase 3: Lifecycle Automation (1 week)** +- Wire vector prune job into scheduler (existing API) +- Implement BM25 rebuild job with level filter +- Implement episode value-based prune job +- Add [lifecycle.*] config sections + +**Phase 4: Hybrid Search Wiring (3 days)** +- Finish HybridSearch handler (RRF fusion) +- Wire into retrieval routing +- Update GetRetrievalCapabilities to detect hybrid tier +- Test with agent filters + +**Phase 5: Observability + Testing (3 days)** +- Add observability RPCs (episode-focused analogue of GetDedupStatus) +- E2E tests for episodic workflow (StartEpisode → RecordAction → CompleteEpisode → GetSimilar) +- Performance benchmarks for episode search +- CLI tests for all adapters + +### Why This Order + +1. **Proto + storage first** — foundational; all other phases depend on it +2. **Ranking second** — builds on existing config/types, no external APIs +3. **Lifecycle third** — scheduler already runs; just activating existing job APIs +4. **Hybrid search fourth** — optional acceleration layer; doesn't block core features +5. **Tests last** — validates entire stack, can parallelize with Phase 4 + +### Research Flags for Phases + +- **Phase 1:** Episode embedding dimensionality — use same 384-dim as TOC nodes? (answer: yes, validated in code) +- **Phase 2:** Salience weight constants (0.55/0.45 split) — may need A/B testing in production +- **Phase 3:** BM25 level filter granularity — should segments ever be indexed after rollup? (design doc says no) +- **Phase 4:** RRF rank normalization constant (60) — sensitive to index sizes; may need dynamic tuning +- **Phase 5:** Episode similarity threshold — separate from content dedup threshold? (recommendation: yes, 0.75) + +## Confidence Assessment + +| Area | Level | Notes | +|------|-------|-------| +| **Stack** | HIGH | All dependencies existing; versions stable | +| **Episodic Storage** | HIGH | Straightforward append-only RocksDB pattern | +| **Embeddings** | HIGH | all-MiniLM-L6-v2 proven in v2.0+ | +| **Ranking Integration** | HIGH | SalienceScorer exists; formula is standard | +| **Lifecycle Automation** | HIGH | Scheduler framework operational; APIs exist | +| **Hybrid Search Wiring** | MEDIUM | RRF algorithm tested; implementation just needs careful routing | +| **Value-Based Retention** | MEDIUM | Heuristic (0.65 target difficulty); needs validation in practice | +| **Configuration** | HIGH | Pattern established in memory-types/config.rs | + +## Gaps to Address in Phase-Specific Research + +1. **Episode action schema** — what constitutes an "Action" in an episode? (design doc has ActionType, input, result, timestamp) +2. **Outcome score calibration** — how does agent framework measure task success? (out of scope for memory; assume 0.0-1.0 from caller) +3. **Agent framework integration** — who calls StartEpisode/RecordAction/CompleteEpisode? (design doc says "framework integration required", leave for Phase 5) +4. **Consolidation deferral** — is extracting "durable knowledge" (preferences, constraints) truly v2.7+? (yes, confirmed as Tier 3 out of scope) + +## Files Created + +- `.planning/research/STACK.md` — Full technology stack with versions, integration points, anti-patterns + +## Next Steps for Roadmap Phase Planning + +1. Review STACK.md for dependency strategy and schema extensions +2. Create PLAN.md files per phase (Phase 1-5 breakdown) +3. Identify proto field numbers to avoid (already safe >200, but validate) +4. Set up performance baselines (episode search latency targets) +5. Document agent framework contract (StartEpisode API expectations) + +--- +*Research completed 2026-03-11. All findings high confidence based on existing codebase validation. Ready for phase planning.* diff --git a/crates/e2e-tests/src/lib.rs b/crates/e2e-tests/src/lib.rs index e5622e3..9e711b8 100644 --- a/crates/e2e-tests/src/lib.rs +++ b/crates/e2e-tests/src/lib.rs @@ -243,7 +243,7 @@ pub fn create_proto_event_structural( session_id: session_id.to_string(), timestamp_ms, event_type: 1, // SessionStart - role: 1, // User + role: 1, // User text: String::new(), metadata: HashMap::new(), agent: Some("claude".to_string()), diff --git a/crates/e2e-tests/tests/dedup_test.rs b/crates/e2e-tests/tests/dedup_test.rs index 135dccd..db08e0a 100644 --- a/crates/e2e-tests/tests/dedup_test.rs +++ b/crates/e2e-tests/tests/dedup_test.rs @@ -120,7 +120,10 @@ async fn test_dedup_duplicate_stored_but_not_indexed() { .await .expect("Second ingest should succeed"); let resp2 = resp2.into_inner(); - assert_eq!(resp2.created, true, "Second event should be created (stored)"); + assert_eq!( + resp2.created, true, + "Second event should be created (stored)" + ); assert_eq!( resp2.deduplicated, true, "Second event should be deduplicated" @@ -164,8 +167,7 @@ async fn test_dedup_novel_events_all_indexed() { *v = 1.0 / ((dim / 2) as f32).sqrt(); } - let embedder: Arc = - Arc::new(SequentialEmbedder::new(vec![vec_a, vec_b])); + let embedder: Arc = Arc::new(SequentialEmbedder::new(vec![vec_a, vec_b])); let buffer = Arc::new(RwLock::new(InFlightBuffer::new(256, dim))); let checker = Arc::new(NoveltyChecker::with_in_flight_buffer( Some(embedder), @@ -214,7 +216,10 @@ async fn test_dedup_novel_events_all_indexed() { .await .unwrap() .into_inner(); - assert_eq!(resp2.deduplicated, false, "Second event should also be novel"); + assert_eq!( + resp2.deduplicated, false, + "Second event should also be novel" + ); // Both events should have outbox entries let outbox_entries = harness.storage.get_outbox_entries(0, 100).unwrap(); diff --git a/crates/e2e-tests/tests/degradation_test.rs b/crates/e2e-tests/tests/degradation_test.rs index 4d07f0c..1e48265 100644 --- a/crates/e2e-tests/tests/degradation_test.rs +++ b/crates/e2e-tests/tests/degradation_test.rs @@ -38,7 +38,13 @@ async fn test_degradation_all_indexes_missing() { let _toc_node = build_toc_segment(harness.storage.clone(), events).await; // 4. Create RetrievalHandler with NO indexes - let handler = RetrievalHandler::with_services(harness.storage.clone(), None, None, None, Default::default()); + let handler = RetrievalHandler::with_services( + harness.storage.clone(), + None, + None, + None, + Default::default(), + ); // 5. Call get_retrieval_capabilities let response = handler @@ -127,7 +133,13 @@ async fn test_degradation_no_bm25_index() { let _toc_node = build_toc_segment(harness.storage.clone(), events).await; // 3. Create RetrievalHandler with NO indexes (BM25 not configured) - let handler = RetrievalHandler::with_services(harness.storage.clone(), None, None, None, Default::default()); + let handler = RetrievalHandler::with_services( + harness.storage.clone(), + None, + None, + None, + Default::default(), + ); // 4. Call get_retrieval_capabilities let response = handler @@ -221,8 +233,13 @@ async fn test_degradation_bm25_present_vector_missing() { let bm25_searcher = Arc::new(TeleportSearcher::new(&bm25_index).unwrap()); // 4. Create RetrievalHandler with BM25 present, vector and topics absent - let handler = - RetrievalHandler::with_services(harness.storage.clone(), Some(bm25_searcher), None, None, Default::default()); + let handler = RetrievalHandler::with_services( + harness.storage.clone(), + Some(bm25_searcher), + None, + None, + Default::default(), + ); // 5. Call get_retrieval_capabilities let response = handler @@ -304,7 +321,13 @@ async fn test_degradation_capabilities_warnings_contain_context() { let harness = TestHarness::new(); // 2. Create RetrievalHandler with NO indexes - let handler = RetrievalHandler::with_services(harness.storage.clone(), None, None, None, Default::default()); + let handler = RetrievalHandler::with_services( + harness.storage.clone(), + None, + None, + None, + Default::default(), + ); // 3. Call get_retrieval_capabilities let response = handler diff --git a/crates/e2e-tests/tests/episodic_test.rs b/crates/e2e-tests/tests/episodic_test.rs new file mode 100644 index 0000000..fa006d7 --- /dev/null +++ b/crates/e2e-tests/tests/episodic_test.rs @@ -0,0 +1,359 @@ +//! E2E tests for episodic memory (Phase 44). +//! +//! Validates: +//! - Episode lifecycle: start -> record actions -> complete -> verify storage +//! - Value-based retention: multiple episodes with varying scores, verify pruning +//! - Disabled config: RPCs return appropriate error when episodic memory is disabled + +use std::sync::Arc; + +use pretty_assertions::assert_eq; +use tonic::Request; + +use e2e_tests::TestHarness; +use memory_service::pb::memory_service_server::MemoryService; +use memory_service::pb::{ + ActionResultStatus, CompleteEpisodeRequest, EpisodeAction, RecordActionRequest, + StartEpisodeRequest, +}; +use memory_service::{EpisodeHandler, MemoryServiceImpl}; +use memory_types::config::EpisodicConfig; + +/// Create a MemoryServiceImpl with episodic memory enabled. +fn create_episodic_service(harness: &TestHarness, config: EpisodicConfig) -> MemoryServiceImpl { + let handler = Arc::new(EpisodeHandler::new(harness.storage.clone(), config)); + let mut service = MemoryServiceImpl::new(harness.storage.clone()); + service.set_episode_handler(handler); + service +} + +/// E2E test: Full episode lifecycle through gRPC service layer. +/// +/// Validates: StartEpisode -> RecordAction (x2) -> CompleteEpisode -> verify storage. +#[tokio::test] +async fn test_episode_lifecycle_e2e() { + let harness = TestHarness::new(); + let config = EpisodicConfig { + enabled: true, + ..Default::default() + }; + let service = create_episodic_service(&harness, config); + + // 1. Start episode + let start_resp = service + .start_episode(Request::new(StartEpisodeRequest { + task: "Implement authentication module".to_string(), + plan: vec![ + "Design JWT schema".to_string(), + "Implement token validation".to_string(), + "Add refresh token rotation".to_string(), + ], + agent: Some("claude".to_string()), + })) + .await + .unwrap() + .into_inner(); + + assert!(start_resp.created); + let episode_id = start_resp.episode_id.clone(); + assert!(!episode_id.is_empty()); + + // 2. Record first action (success) + let action1_resp = service + .record_action(Request::new(RecordActionRequest { + episode_id: episode_id.clone(), + action: Some(EpisodeAction { + action_type: "tool_call".to_string(), + input: "Read existing auth code".to_string(), + result_status: ActionResultStatus::ActionResultSuccess.into(), + result_detail: "Found existing JWT utils".to_string(), + timestamp_ms: chrono::Utc::now().timestamp_millis(), + }), + })) + .await + .unwrap() + .into_inner(); + + assert!(action1_resp.recorded); + assert_eq!(action1_resp.action_count, 1); + + // 3. Record second action (failure then retry) + let action2_resp = service + .record_action(Request::new(RecordActionRequest { + episode_id: episode_id.clone(), + action: Some(EpisodeAction { + action_type: "api_request".to_string(), + input: "Test token endpoint".to_string(), + result_status: ActionResultStatus::ActionResultFailure.into(), + result_detail: "Connection refused".to_string(), + timestamp_ms: chrono::Utc::now().timestamp_millis(), + }), + })) + .await + .unwrap() + .into_inner(); + + assert!(action2_resp.recorded); + assert_eq!(action2_resp.action_count, 2); + + // 4. Complete episode with moderate success + let complete_resp = service + .complete_episode(Request::new(CompleteEpisodeRequest { + episode_id: episode_id.clone(), + outcome_score: 0.65, + failed: false, + lessons_learned: vec![ + "JWT refresh rotation prevents token theft".to_string(), + "Always test endpoints before deploying".to_string(), + ], + failure_modes: vec!["API connectivity issues in test environment".to_string()], + })) + .await + .unwrap() + .into_inner(); + + assert!(complete_resp.completed); + // At midpoint (0.65), value score = 1.0 + assert!( + (complete_resp.value_score - 1.0).abs() < f32::EPSILON, + "Expected value_score 1.0 at midpoint, got {}", + complete_resp.value_score + ); + + // 5. Verify episode in storage + let stored = harness + .storage + .get_episode(&episode_id) + .unwrap() + .expect("Episode should be in storage"); + + assert_eq!(stored.task, "Implement authentication module"); + assert_eq!(stored.plan.len(), 3); + assert_eq!(stored.actions.len(), 2); + assert_eq!(stored.status, memory_types::EpisodeStatus::Completed); + assert_eq!(stored.lessons_learned.len(), 2); + assert_eq!(stored.failure_modes.len(), 1); + assert_eq!(stored.agent, Some("claude".to_string())); + assert!(stored.outcome_score.is_some()); + assert!(stored.value_score.is_some()); + assert!(stored.completed_at.is_some()); +} + +/// E2E test: Value-based retention pruning. +/// +/// Creates episodes exceeding max_episodes limit and verifies lowest-value +/// episodes are pruned after completion. +#[tokio::test] +async fn test_value_based_retention_pruning_e2e() { + let harness = TestHarness::new(); + let config = EpisodicConfig { + enabled: true, + max_episodes: 3, + midpoint_target: 0.65, + ..Default::default() + }; + let service = create_episodic_service(&harness, config); + + // Create episodes with different outcome scores (and thus different value scores) + // Score 0.1 -> far from midpoint -> low value + // Score 0.65 -> at midpoint -> highest value + // Score 0.9 -> far from midpoint -> medium value + // Score 0.5 -> near midpoint -> high value + let scores = [0.1, 0.65, 0.9, 0.5]; + let mut episode_ids = Vec::new(); + + for (i, score) in scores.iter().enumerate() { + let start_resp = service + .start_episode(Request::new(StartEpisodeRequest { + task: format!("Task {} with score {}", i, score), + plan: vec![], + agent: None, + })) + .await + .unwrap() + .into_inner(); + + episode_ids.push(start_resp.episode_id.clone()); + + // Small delay to ensure distinct ULIDs + std::thread::sleep(std::time::Duration::from_millis(2)); + + let complete_resp = service + .complete_episode(Request::new(CompleteEpisodeRequest { + episode_id: start_resp.episode_id, + outcome_score: *score, + failed: false, + lessons_learned: vec![], + failure_modes: vec![], + })) + .await + .unwrap() + .into_inner(); + + assert!(complete_resp.completed); + } + + // After 4th episode completion, should have pruned 1 (down to max_episodes=3) + let remaining = harness.storage.list_episodes(100).unwrap(); + assert_eq!( + remaining.len(), + 3, + "Should have pruned to max_episodes=3, got {}", + remaining.len() + ); + + // The episode with score=0.1 (value_score = 1.0 - |0.1 - 0.65| = 0.45) should be pruned + // because it has the lowest value score among all four. + // Score 0.65 -> value 1.0 (highest) + // Score 0.5 -> value 1.0 - |0.5 - 0.65| = 0.85 + // Score 0.9 -> value 1.0 - |0.9 - 0.65| = 0.75 + // Score 0.1 -> value 1.0 - |0.1 - 0.65| = 0.45 (lowest -- pruned) + let remaining_ids: Vec<&str> = remaining.iter().map(|e| e.episode_id.as_str()).collect(); + assert!( + !remaining_ids.contains(&episode_ids[0].as_str()), + "Episode with lowest value (score=0.1) should have been pruned" + ); +} + +/// E2E test: Disabled episodic memory returns FailedPrecondition. +/// +/// When EpisodicConfig.enabled=false, all episode RPCs should return +/// appropriate error status. +#[tokio::test] +async fn test_episodic_disabled_returns_error() { + let harness = TestHarness::new(); + let config = EpisodicConfig::default(); // disabled by default + let service = create_episodic_service(&harness, config); + + let start_result = service + .start_episode(Request::new(StartEpisodeRequest { + task: "should fail".to_string(), + plan: vec![], + agent: None, + })) + .await; + + assert!(start_result.is_err()); + assert_eq!( + start_result.unwrap_err().code(), + tonic::Code::FailedPrecondition + ); +} + +/// E2E test: No episode handler returns FailedPrecondition. +/// +/// When episode_handler is None (not configured), all episode RPCs should return +/// appropriate error status. +#[tokio::test] +async fn test_episodic_no_handler_returns_error() { + let harness = TestHarness::new(); + let service = MemoryServiceImpl::new(harness.storage.clone()); + + let start_result = service + .start_episode(Request::new(StartEpisodeRequest { + task: "should fail".to_string(), + plan: vec![], + agent: None, + })) + .await; + + assert!(start_result.is_err()); + assert_eq!( + start_result.unwrap_err().code(), + tonic::Code::FailedPrecondition + ); +} + +/// E2E test: Cannot record action on completed episode. +#[tokio::test] +async fn test_record_action_on_completed_episode() { + let harness = TestHarness::new(); + let config = EpisodicConfig { + enabled: true, + ..Default::default() + }; + let service = create_episodic_service(&harness, config); + + let start_resp = service + .start_episode(Request::new(StartEpisodeRequest { + task: "test task".to_string(), + plan: vec![], + agent: None, + })) + .await + .unwrap() + .into_inner(); + + // Complete it + service + .complete_episode(Request::new(CompleteEpisodeRequest { + episode_id: start_resp.episode_id.clone(), + outcome_score: 0.5, + failed: false, + lessons_learned: vec![], + failure_modes: vec![], + })) + .await + .unwrap(); + + // Try to record action on completed episode + let result = service + .record_action(Request::new(RecordActionRequest { + episode_id: start_resp.episode_id, + action: Some(EpisodeAction { + action_type: "tool_call".to_string(), + input: "should fail".to_string(), + result_status: ActionResultStatus::ActionResultSuccess.into(), + result_detail: "ok".to_string(), + timestamp_ms: 0, + }), + })) + .await; + + assert!(result.is_err()); + assert_eq!(result.unwrap_err().code(), tonic::Code::FailedPrecondition); +} + +/// E2E test: Failed episode has correct status. +#[tokio::test] +async fn test_episode_failure_status() { + let harness = TestHarness::new(); + let config = EpisodicConfig { + enabled: true, + ..Default::default() + }; + let service = create_episodic_service(&harness, config); + + let start_resp = service + .start_episode(Request::new(StartEpisodeRequest { + task: "risky operation".to_string(), + plan: vec![], + agent: Some("opencode".to_string()), + })) + .await + .unwrap() + .into_inner(); + + let complete_resp = service + .complete_episode(Request::new(CompleteEpisodeRequest { + episode_id: start_resp.episode_id.clone(), + outcome_score: 0.15, + failed: true, + lessons_learned: vec!["Need better error handling".to_string()], + failure_modes: vec!["Unhandled null pointer".to_string()], + })) + .await + .unwrap() + .into_inner(); + + assert!(complete_resp.completed); + + let stored = harness + .storage + .get_episode(&start_resp.episode_id) + .unwrap() + .expect("Episode should exist"); + + assert_eq!(stored.status, memory_types::EpisodeStatus::Failed); + assert_eq!(stored.agent, Some("opencode".to_string())); +} diff --git a/crates/e2e-tests/tests/error_path_test.rs b/crates/e2e-tests/tests/error_path_test.rs index 5749338..dda921b 100644 --- a/crates/e2e-tests/tests/error_path_test.rs +++ b/crates/e2e-tests/tests/error_path_test.rs @@ -167,7 +167,13 @@ async fn test_ingest_valid_event_succeeds() { #[tokio::test] async fn test_route_query_empty_query() { let harness = TestHarness::new(); - let handler = RetrievalHandler::with_services(harness.storage.clone(), None, None, None, Default::default()); + let handler = RetrievalHandler::with_services( + harness.storage.clone(), + None, + None, + None, + Default::default(), + ); let result = handler .route_query(Request::new(RouteQueryRequest { @@ -194,7 +200,13 @@ async fn test_route_query_empty_query() { #[tokio::test] async fn test_classify_intent_empty_query() { let harness = TestHarness::new(); - let handler = RetrievalHandler::with_services(harness.storage.clone(), None, None, None, Default::default()); + let handler = RetrievalHandler::with_services( + harness.storage.clone(), + None, + None, + None, + Default::default(), + ); let result = handler .classify_query_intent(Request::new(ClassifyQueryIntentRequest { diff --git a/crates/e2e-tests/tests/fail_open_test.rs b/crates/e2e-tests/tests/fail_open_test.rs index 7a80c3f..d66b9f3 100644 --- a/crates/e2e-tests/tests/fail_open_test.rs +++ b/crates/e2e-tests/tests/fail_open_test.rs @@ -83,9 +83,7 @@ async fn test_fail_open_embedder_disabled_events_still_stored() { ), ); let resp = service - .ingest_event(Request::new(IngestEventRequest { - event: Some(event), - })) + .ingest_event(Request::new(IngestEventRequest { event: Some(event) })) .await .unwrap(); responses.push(resp.into_inner()); @@ -98,10 +96,7 @@ async fn test_fail_open_embedder_disabled_events_still_stored() { !resp.deduplicated, "Event {i} should NOT be marked deduplicated when embedder is None" ); - assert!( - resp.created, - "Event {i} should be created successfully" - ); + assert!(resp.created, "Event {i} should be created successfully"); } // 6. Assert all 5 events stored in RocksDB @@ -113,11 +108,7 @@ async fn test_fail_open_embedder_disabled_events_still_stored() { // 7. Assert all 5 have outbox entries (proving normal ingest path) let outbox = harness.storage.get_outbox_entries(0, 100).unwrap(); - assert_eq!( - outbox.len(), - 5, - "All 5 events should have outbox entries" - ); + assert_eq!(outbox.len(), 5, "All 5 events should have outbox entries"); } /// TEST-03 (2/3): Embedder errors -- events pass through unchanged. @@ -155,9 +146,7 @@ async fn test_fail_open_embedder_error_events_pass_through() { ), ); let resp = service - .ingest_event(Request::new(IngestEventRequest { - event: Some(event), - })) + .ingest_event(Request::new(IngestEventRequest { event: Some(event) })) .await .unwrap(); responses.push(resp.into_inner()); @@ -183,11 +172,7 @@ async fn test_fail_open_embedder_error_events_pass_through() { ); let outbox = harness.storage.get_outbox_entries(0, 100).unwrap(); - assert_eq!( - outbox.len(), - 3, - "All 3 events should have outbox entries" - ); + assert_eq!(outbox.len(), 3, "All 3 events should have outbox entries"); } /// TEST-03 (3/3): StaleFilter fail-open -- results returned even without timestamp metadata. diff --git a/crates/e2e-tests/tests/hybrid_search_test.rs b/crates/e2e-tests/tests/hybrid_search_test.rs new file mode 100644 index 0000000..50e596a --- /dev/null +++ b/crates/e2e-tests/tests/hybrid_search_test.rs @@ -0,0 +1,188 @@ +//! E2E hybrid search tests for agent-memory. +//! +//! Verifies that HybridSearchHandler returns combined BM25 + vector results +//! via RRF fusion, and gracefully degrades to BM25-only when vector is unavailable. + +use std::sync::Arc; + +use pretty_assertions::assert_eq; +use tonic::Request; + +use e2e_tests::{build_toc_segment, create_test_events, ingest_events, TestHarness}; +use memory_search::{SearchIndex, SearchIndexConfig, SearchIndexer, TeleportSearcher}; +use memory_service::hybrid::HybridSearchHandler; +use memory_service::pb::{HybridMode, HybridSearchRequest}; +use memory_service::VectorTeleportHandler; +use memory_vector::{HnswConfig, HnswIndex, VectorMetadata}; + +/// Minimal VectorTeleportHandler whose index is empty so `is_available()` returns false. +fn empty_vector_handler(harness: &TestHarness) -> Arc { + let embedder = + memory_embeddings::CandleEmbedder::load_default().expect("Failed to load embedding model"); + let hnsw_config = HnswConfig::new(384, &harness.vector_index_path).with_capacity(10); + let hnsw = HnswIndex::open_or_create(hnsw_config).expect("HNSW create"); + let meta_path = harness.vector_index_path.join("metadata"); + let metadata = VectorMetadata::open(&meta_path).expect("metadata"); + Arc::new(VectorTeleportHandler::new( + Arc::new(embedder), + Arc::new(std::sync::RwLock::new(hnsw)), + Arc::new(metadata), + )) +} + +/// Build a BM25 searcher from indexed TOC nodes. +fn build_bm25_searcher( + harness: &TestHarness, + nodes: &[&memory_types::TocNode], +) -> Arc { + let bm25_config = SearchIndexConfig::new(&harness.bm25_index_path); + let bm25_index = SearchIndex::open_or_create(bm25_config).unwrap(); + let indexer = SearchIndexer::new(&bm25_index).unwrap(); + + for node in nodes { + indexer.index_toc_node(node).unwrap(); + for bullet in &node.bullets { + for grip_id in &bullet.grip_ids { + if let Some(grip) = harness.storage.get_grip(grip_id).unwrap() { + indexer.index_grip(&grip).unwrap(); + } + } + } + } + indexer.commit().unwrap(); + + Arc::new(TeleportSearcher::new(&bm25_index).unwrap()) +} + +/// E2E: BM25-only fallback when vector index is empty/unavailable. +#[tokio::test] +#[ignore = "requires model download (~80MB on first run)"] +async fn test_hybrid_bm25_fallback_when_vector_unavailable() { + let harness = TestHarness::new(); + + let events_rust = create_test_events( + "session-rust", + 6, + "Rust ownership and borrow checker ensures memory safety without garbage collection", + ); + let events_python = create_test_events( + "session-python", + 6, + "Python web frameworks like Django and Flask provide rapid development for web apps", + ); + + ingest_events(&harness.storage, &events_rust); + ingest_events(&harness.storage, &events_python); + + let node_rust = build_toc_segment(harness.storage.clone(), events_rust).await; + let node_python = build_toc_segment(harness.storage.clone(), events_python).await; + + let searcher = build_bm25_searcher(&harness, &[&node_rust, &node_python]); + let vector_handler = empty_vector_handler(&harness); + + let handler = HybridSearchHandler::new(vector_handler, Some(searcher)); + + assert!(handler.bm25_available(), "BM25 should be available"); + + let request = Request::new(HybridSearchRequest { + query: "rust ownership borrow".to_string(), + top_k: 10, + mode: HybridMode::Hybrid as i32, + bm25_weight: 0.5, + vector_weight: 0.5, + time_filter: None, + target: 0, + agent_filter: None, + }); + + let response = handler.hybrid_search(request).await.unwrap(); + let inner = response.into_inner(); + + assert_eq!( + inner.mode_used, + HybridMode::Bm25Only as i32, + "Should fall back to BM25-only mode" + ); + assert!(inner.bm25_available, "bm25_available should be true"); + assert!( + !inner.matches.is_empty(), + "BM25 fallback should return results" + ); + + for i in 1..inner.matches.len() { + assert!( + inner.matches[i - 1].score >= inner.matches[i].score, + "Results should be in descending score order" + ); + } +} + +/// E2E: bm25_available reports correctly based on searcher presence. +#[tokio::test] +#[ignore = "requires model download (~80MB on first run)"] +async fn test_hybrid_bm25_available_reports_true() { + let harness = TestHarness::new(); + + let events = create_test_events( + "session-test", + 4, + "Test content for BM25 availability check", + ); + ingest_events(&harness.storage, &events); + let node = build_toc_segment(harness.storage.clone(), events).await; + + let searcher = build_bm25_searcher(&harness, &[&node]); + let vector_handler = empty_vector_handler(&harness); + + let handler_with = HybridSearchHandler::new(vector_handler.clone(), Some(searcher)); + assert!( + handler_with.bm25_available(), + "bm25_available should be true when searcher is present" + ); + + let handler_without = HybridSearchHandler::new(vector_handler, None); + assert!( + !handler_without.bm25_available(), + "bm25_available should be false when searcher is absent" + ); +} + +/// E2E: BM25-only mode returns real BM25 results. +#[tokio::test] +#[ignore = "requires model download (~80MB on first run)"] +async fn test_hybrid_bm25_only_mode() { + let harness = TestHarness::new(); + + let events_rust = create_test_events( + "session-rust", + 6, + "Rust ownership and borrow checker ensures memory safety without garbage collection", + ); + ingest_events(&harness.storage, &events_rust); + let node_rust = build_toc_segment(harness.storage.clone(), events_rust).await; + + let searcher = build_bm25_searcher(&harness, &[&node_rust]); + let vector_handler = empty_vector_handler(&harness); + + let handler = HybridSearchHandler::new(vector_handler, Some(searcher)); + + let request = Request::new(HybridSearchRequest { + query: "rust ownership borrow".to_string(), + top_k: 10, + mode: HybridMode::Bm25Only as i32, + bm25_weight: 0.5, + vector_weight: 0.5, + time_filter: None, + target: 0, + agent_filter: None, + }); + + let response = handler.hybrid_search(request).await.unwrap(); + let inner = response.into_inner(); + + assert!( + !inner.matches.is_empty(), + "BM25-only mode should return results" + ); + assert!(inner.bm25_available, "bm25_available should be true"); +} diff --git a/crates/e2e-tests/tests/pipeline_test.rs b/crates/e2e-tests/tests/pipeline_test.rs index 6039376..1ab3233 100644 --- a/crates/e2e-tests/tests/pipeline_test.rs +++ b/crates/e2e-tests/tests/pipeline_test.rs @@ -89,8 +89,13 @@ async fn test_full_pipeline_ingest_toc_grip_route_query() { let bm25_searcher = Arc::new(TeleportSearcher::new(&bm25_index).unwrap()); // 10. Create RetrievalHandler with BM25 searcher - let handler = - RetrievalHandler::with_services(harness.storage.clone(), Some(bm25_searcher), None, None, Default::default()); + let handler = RetrievalHandler::with_services( + harness.storage.clone(), + Some(bm25_searcher), + None, + None, + Default::default(), + ); // 11. Call route_query let response = handler diff --git a/crates/e2e-tests/tests/ranking_test.rs b/crates/e2e-tests/tests/ranking_test.rs new file mode 100644 index 0000000..186cf4c --- /dev/null +++ b/crates/e2e-tests/tests/ranking_test.rs @@ -0,0 +1,441 @@ +//! End-to-end ranking tests for agent-memory (RANK-09, RANK-10). +//! +//! Verifies that: +//! - High-salience items rank higher than low-salience items of similar similarity +//! - Usage decay penalizes frequently-accessed results +//! - Score floor prevents total suppression +//! - Ranking composes correctly with StaleFilter through route_query + +use std::collections::HashMap; +use std::sync::Arc; + +use pretty_assertions::assert_eq; +use tonic::Request; + +use e2e_tests::{build_toc_segment, create_test_events, ingest_events, TestHarness}; +use memory_retrieval::{ + executor::SearchResult, + ranking::{apply_combined_ranking, RankingConfig}, + stale_filter::StaleFilter, + types::RetrievalLayer, +}; +use memory_search::{SearchIndex, SearchIndexConfig, SearchIndexer, TeleportSearcher}; +use memory_service::pb::RouteQueryRequest; +use memory_service::RetrievalHandler; +use memory_types::config::StalenessConfig; +use memory_types::salience::MemoryKind; + +fn make_result( + doc_id: &str, + score: f32, + salience: f32, + access_count: u32, + memory_kind: &str, +) -> SearchResult { + let mut metadata = HashMap::new(); + metadata.insert("salience_score".to_string(), salience.to_string()); + metadata.insert("access_count".to_string(), access_count.to_string()); + metadata.insert("memory_kind".to_string(), memory_kind.to_string()); + SearchResult { + doc_id: doc_id.to_string(), + doc_type: "toc_node".to_string(), + score, + text_preview: format!("Preview for {doc_id}"), + source_layer: RetrievalLayer::BM25, + metadata, + } +} + +/// RANK-09: Pinned/high-salience items rank higher than low-salience items. +#[test] +fn test_salience_ranking_order() { + let config = RankingConfig { + salience_enabled: true, + usage_decay_enabled: false, + ..Default::default() + }; + + // All items have same base similarity score + let results = vec![ + // Observation, short text -> low salience (~0.35-0.40) + make_result("low_obs", 0.85, 0.38, 0, "observation"), + // Constraint, medium text -> high salience (~0.75+) + make_result("high_constraint", 0.85, 0.78, 0, "constraint"), + // Pinned item -> very high salience (~1.0+) + make_result("pinned_item", 0.85, 1.05, 0, "preference"), + ]; + + let ranked = apply_combined_ranking(results, &config); + + // Pinned item should be first (highest salience factor) + assert_eq!( + ranked[0].doc_id, "pinned_item", + "Pinned item should rank first" + ); + // Constraint should be second + assert_eq!( + ranked[1].doc_id, "high_constraint", + "High-salience constraint should rank second" + ); + // Low observation should be last + assert_eq!( + ranked[2].doc_id, "low_obs", + "Low-salience observation should rank last" + ); +} + +/// RANK-10: Frequently-accessed items decay in ranking. +#[test] +fn test_usage_decay_ranking_order() { + let config = RankingConfig { + salience_enabled: false, + usage_decay_enabled: true, + decay_factor: 0.15, + ..Default::default() + }; + + // All items have same base similarity and salience + let results = vec![ + make_result("fresh", 0.85, 0.5, 0, "observation"), + make_result("used_5", 0.85, 0.5, 5, "observation"), + make_result("used_20", 0.85, 0.5, 20, "observation"), + ]; + + let ranked = apply_combined_ranking(results, &config); + + // Fresh item should rank first (no decay) + assert_eq!(ranked[0].doc_id, "fresh", "Fresh item should rank first"); + // Moderately used should be second + assert_eq!( + ranked[1].doc_id, "used_5", + "Moderately used should rank second" + ); + // Heavily used should be last + assert_eq!(ranked[2].doc_id, "used_20", "Heavily used should rank last"); + + // Verify scores are strictly decreasing + assert!(ranked[0].score > ranked[1].score); + assert!(ranked[1].score > ranked[2].score); +} + +/// Score floor prevents complete suppression. +#[test] +fn test_score_floor_prevents_collapse() { + let config = RankingConfig { + salience_enabled: true, + usage_decay_enabled: true, + decay_factor: 0.15, + score_floor: 0.50, + }; + + // Worst case: low salience + extremely high access count + let results = vec![make_result( + "heavily_used_low_sal", + 0.9, + 0.1, + 200, + "observation", + )]; + + let ranked = apply_combined_ranking(results, &config); + + // Floor = 0.9 * 0.50 = 0.45 + let floor = 0.9 * 0.50; + assert!( + ranked[0].score >= floor - 0.001, + "Score {} should be >= floor {:.3}", + ranked[0].score, + floor + ); +} + +/// Combined formula composes properly: salience + usage + similarity all factor in. +#[test] +fn test_combined_ranking_composition() { + let config = RankingConfig { + salience_enabled: true, + usage_decay_enabled: true, + decay_factor: 0.15, + score_floor: 0.50, + }; + + // High-salience but heavily used vs low-salience but fresh + let results = vec![ + make_result("high_sal_used", 0.85, 1.0, 15, "constraint"), + make_result("low_sal_fresh", 0.85, 0.3, 0, "observation"), + ]; + + let ranked = apply_combined_ranking(results, &config); + + // Both should have reasonable scores (not collapsed) + for r in &ranked { + assert!( + r.score > 0.3, + "Score for {} should be > 0.3, got {}", + r.doc_id, + r.score + ); + } +} + +// ============================================================================ +// E2E tests: Full route_query pipeline with Storage-backed enrichment +// ============================================================================ + +const TOPIC: &str = "Rust ownership borrow checker lifetime annotation patterns"; + +fn make_route_query() -> Request { + Request::new(RouteQueryRequest { + query: "Rust ownership borrow checker lifetime".to_string(), + intent_override: None, + stop_conditions: None, + mode_override: None, + limit: 20, + agent_filter: None, + }) +} + +/// Set up a pipeline with multiple sessions indexed into BM25. +/// Returns (harness, searcher, toc_node_ids). +async fn setup_salience_pipeline() -> (TestHarness, Arc, Vec) { + let harness = TestHarness::new(); + + let bm25_config = SearchIndexConfig::new(&harness.bm25_index_path); + let bm25_index = SearchIndex::open_or_create(bm25_config).unwrap(); + let indexer = SearchIndexer::new(&bm25_index).unwrap(); + + let sessions = ["session-high", "session-mid", "session-low"]; + let mut node_ids = Vec::new(); + + for session_id in &sessions { + let events = create_test_events(session_id, 8, TOPIC); + ingest_events(&harness.storage, &events); + let toc_node = build_toc_segment(harness.storage.clone(), events).await; + + indexer.index_toc_node(&toc_node).unwrap(); + + let grip_ids: Vec = toc_node + .bullets + .iter() + .flat_map(|b| b.grip_ids.iter().cloned()) + .collect(); + for grip_id in &grip_ids { + if let Some(grip) = harness.storage.get_grip(grip_id).unwrap() { + indexer.index_grip(&grip).unwrap(); + } + } + + node_ids.push(toc_node.node_id.clone()); + } + + indexer.commit().unwrap(); + let searcher = Arc::new(TeleportSearcher::new(&bm25_index).unwrap()); + (harness, searcher, node_ids) +} + +/// RANK-09 E2E: Salience enrichment flows through route_query and affects ranking. +/// +/// Mutates TocNode salience scores in Storage, queries via route_query, +/// and verifies that the high-salience node outranks the low-salience one. +#[tokio::test] +async fn test_e2e_salience_enrichment_affects_ranking() { + let (harness, searcher, node_ids) = setup_salience_pipeline().await; + + // Mutate TocNode salience in storage + let salience_values: [(f32, MemoryKind, bool); 3] = [ + (1.0, MemoryKind::Constraint, true), + (0.5, MemoryKind::Observation, false), + (0.1, MemoryKind::Observation, false), + ]; + + for (i, (score, kind, pinned)) in salience_values.iter().enumerate() { + if let Ok(Some(mut node)) = harness.storage.get_toc_node(&node_ids[i]) { + node.salience_score = *score; + node.memory_kind = *kind; + node.is_pinned = *pinned; + harness.storage.put_toc_node(&node).unwrap(); + } + } + + let handler = RetrievalHandler::with_services( + harness.storage.clone(), + Some(searcher), + None, + None, + StalenessConfig::default(), + ); + + let resp = handler + .route_query(make_route_query()) + .await + .unwrap() + .into_inner(); + + assert!(resp.has_results, "Should have search results"); + + // Find scores for our mutated nodes + let score_high = resp + .results + .iter() + .find(|r| r.doc_id == node_ids[0]) + .map(|r| r.score); + let score_low = resp + .results + .iter() + .find(|r| r.doc_id == node_ids[2]) + .map(|r| r.score); + + if let (Some(high), Some(low)) = (score_high, score_low) { + assert!( + high > low, + "High-salience node ({:.4}) should outrank low-salience node ({:.4})", + high, + low + ); + } +} + +/// RANK-10 E2E: Access count enrichment flows through route_query. +/// +/// Verifies that access_count metadata is enriched from Storage and +/// all results have valid positive scores through the pipeline. +/// Note: usage_decay is off by default in RankingConfig, so this test +/// validates the enrichment path rather than decay ordering (which is +/// covered by the unit-level test_usage_decay_ranking_order above). +#[tokio::test] +async fn test_e2e_access_count_enrichment() { + let (harness, searcher, node_ids) = setup_salience_pipeline().await; + + // Set different access counts; keep salience neutral + let access_counts: [u32; 3] = [0, 10, 50]; + for (i, &count) in access_counts.iter().enumerate() { + if let Ok(Some(mut node)) = harness.storage.get_toc_node(&node_ids[i]) { + node.salience_score = 0.5; + node.access_count = count; + harness.storage.put_toc_node(&node).unwrap(); + } + } + + let handler = RetrievalHandler::with_services( + harness.storage.clone(), + Some(searcher), + None, + None, + StalenessConfig::default(), + ); + + let resp = handler + .route_query(make_route_query()) + .await + .unwrap() + .into_inner(); + + assert!(resp.has_results, "Should have search results"); + + // All returned results should have positive scores + for result in &resp.results { + assert!( + result.score > 0.0, + "Result {} should have positive score, got {}", + result.doc_id, + result.score + ); + } + + // Verify the pipeline returns results for our nodes (enrichment didn't break anything) + let found_count = resp + .results + .iter() + .filter(|r| node_ids.contains(&r.doc_id)) + .count(); + assert!( + found_count > 0, + "Should find at least one of our TocNodes in results" + ); +} + +/// Composition: ranking composes with StaleFilter — old high-salience constraint +/// is exempt from staleness and still ranks well. +#[test] +fn test_ranking_composes_with_stale_filter() { + let now_ms = 1_706_540_400_000i64; + let day_ms = 86_400_000i64; + + let mut meta_old = HashMap::new(); + meta_old.insert( + "timestamp_ms".to_string(), + (now_ms - 30 * day_ms).to_string(), + ); + meta_old.insert("memory_kind".to_string(), "constraint".to_string()); + meta_old.insert("salience_score".to_string(), "1.0".to_string()); + meta_old.insert("access_count".to_string(), "0".to_string()); + + let mut meta_new = HashMap::new(); + meta_new.insert("timestamp_ms".to_string(), now_ms.to_string()); + meta_new.insert("memory_kind".to_string(), "observation".to_string()); + meta_new.insert("salience_score".to_string(), "0.2".to_string()); + meta_new.insert("access_count".to_string(), "0".to_string()); + + let results = vec![ + SearchResult { + doc_id: "old-constraint".to_string(), + doc_type: "toc_node".to_string(), + score: 0.85, + text_preview: "Old but important constraint".to_string(), + source_layer: RetrievalLayer::BM25, + metadata: meta_old, + }, + SearchResult { + doc_id: "new-observation".to_string(), + doc_type: "toc_node".to_string(), + score: 0.85, + text_preview: "Recent low-salience observation".to_string(), + source_layer: RetrievalLayer::BM25, + metadata: meta_new, + }, + ]; + + // Apply stale filter first (like route_query does) + let stale_filter = StaleFilter::new(StalenessConfig { + enabled: true, + half_life_days: 14.0, + max_penalty: 0.30, + ..Default::default() + }); + let after_stale = stale_filter.apply(results); + + // Constraint should be exempt from staleness decay + let constraint = after_stale + .iter() + .find(|r| r.doc_id == "old-constraint") + .unwrap(); + assert!( + (constraint.score - 0.85).abs() < f32::EPSILON, + "Constraint should be exempt from stale decay, got {:.4}", + constraint.score + ); + + // Apply combined ranking + let ranking_config = RankingConfig { + salience_enabled: true, + usage_decay_enabled: false, + ..Default::default() + }; + let ranked = apply_combined_ranking(after_stale, &ranking_config); + + let constraint_final = ranked + .iter() + .find(|r| r.doc_id == "old-constraint") + .unwrap(); + let observation_final = ranked + .iter() + .find(|r| r.doc_id == "new-observation") + .unwrap(); + + assert!( + constraint_final.score > observation_final.score, + "High-salience constraint ({:.4}) should outrank low-salience observation ({:.4})", + constraint_final.score, + observation_final.score + ); +} diff --git a/crates/e2e-tests/tests/stale_filter_test.rs b/crates/e2e-tests/tests/stale_filter_test.rs index 99ecd88..ebb1e81 100644 --- a/crates/e2e-tests/tests/stale_filter_test.rs +++ b/crates/e2e-tests/tests/stale_filter_test.rs @@ -154,7 +154,10 @@ async fn test_stale_results_downranked_relative_to_newer() { .into_inner(); assert!(resp_on.has_results, "RouteQuery should have results"); - assert!(resp_off.has_results, "Baseline RouteQuery should have results"); + assert!( + resp_off.has_results, + "Baseline RouteQuery should have results" + ); // Build score maps: doc_id -> score for each run let scores_off: HashMap = resp_off @@ -268,10 +271,7 @@ async fn test_kind_exemption_constraint_not_penalized() { source_layer: RetrievalLayer::BM25, metadata: { let mut m = HashMap::new(); - m.insert( - "timestamp_ms".to_string(), - (now - 42 * DAY_MS).to_string(), - ); + m.insert("timestamp_ms".to_string(), (now - 42 * DAY_MS).to_string()); m.insert("memory_kind".to_string(), "constraint".to_string()); m }, @@ -284,10 +284,7 @@ async fn test_kind_exemption_constraint_not_penalized() { source_layer: RetrievalLayer::BM25, metadata: { let mut m = HashMap::new(); - m.insert( - "timestamp_ms".to_string(), - (now - 42 * DAY_MS).to_string(), - ); + m.insert("timestamp_ms".to_string(), (now - 42 * DAY_MS).to_string()); m.insert("memory_kind".to_string(), "observation".to_string()); m }, @@ -359,10 +356,7 @@ async fn test_kind_exemption_constraint_not_penalized() { source_layer: RetrievalLayer::BM25, metadata: { let mut m = HashMap::new(); - m.insert( - "timestamp_ms".to_string(), - (now - 42 * DAY_MS).to_string(), - ); + m.insert("timestamp_ms".to_string(), (now - 42 * DAY_MS).to_string()); m.insert("memory_kind".to_string(), kind.to_string()); m }, diff --git a/crates/memory-client/src/client.rs b/crates/memory-client/src/client.rs index ec4edc1..be8ef3b 100644 --- a/crates/memory-client/src/client.rs +++ b/crates/memory-client/src/client.rs @@ -7,12 +7,13 @@ use tracing::{debug, info}; use memory_service::pb::{ memory_service_client::MemoryServiceClient, BrowseTocRequest, Event as ProtoEvent, - EventRole as ProtoEventRole, EventType as ProtoEventType, ExpandGripRequest, GetEventsRequest, - GetNodeRequest, GetRelatedTopicsRequest, GetTocRootRequest, GetTopTopicsRequest, - GetTopicGraphStatusRequest, GetTopicsByQueryRequest, GetVectorIndexStatusRequest, - Grip as ProtoGrip, HybridSearchRequest, HybridSearchResponse, IngestEventRequest, - TeleportSearchRequest, TeleportSearchResponse, TocNode as ProtoTocNode, Topic as ProtoTopic, - VectorIndexStatus, VectorTeleportRequest, VectorTeleportResponse, + EventRole as ProtoEventRole, EventType as ProtoEventType, ExpandGripRequest, + GetDedupStatusRequest, GetDedupStatusResponse, GetEventsRequest, GetNodeRequest, + GetRankingStatusRequest, GetRankingStatusResponse, GetRelatedTopicsRequest, GetTocRootRequest, + GetTopTopicsRequest, GetTopicGraphStatusRequest, GetTopicsByQueryRequest, + GetVectorIndexStatusRequest, Grip as ProtoGrip, HybridSearchRequest, HybridSearchResponse, + IngestEventRequest, TeleportSearchRequest, TeleportSearchResponse, TocNode as ProtoTocNode, + Topic as ProtoTopic, VectorIndexStatus, VectorTeleportRequest, VectorTeleportResponse, }; use memory_types::{Event, EventRole, EventType}; @@ -292,6 +293,24 @@ impl MemoryClient { Ok(response.into_inner()) } + // ===== Observability Methods (Phase 42) ===== + + /// Get dedup gate status and metrics. + pub async fn get_dedup_status(&mut self) -> Result { + debug!("GetDedupStatus request"); + let request = tonic::Request::new(GetDedupStatusRequest {}); + let response = self.inner.get_dedup_status(request).await?; + Ok(response.into_inner()) + } + + /// Get ranking status and metrics (salience, usage, novelty, lifecycle). + pub async fn get_ranking_status(&mut self) -> Result { + debug!("GetRankingStatus request"); + let request = tonic::Request::new(GetRankingStatusRequest {}); + let response = self.inner.get_ranking_status(request).await?; + Ok(response.into_inner()) + } + // ===== Topic Graph Methods (Phase 14) ===== /// Get topic graph status and statistics. diff --git a/crates/memory-daemon/src/cli.rs b/crates/memory-daemon/src/cli.rs index 43a1f23..210a544 100644 --- a/crates/memory-daemon/src/cli.rs +++ b/crates/memory-daemon/src/cli.rs @@ -46,7 +46,15 @@ pub enum Commands { Stop, /// Show daemon status - Status, + Status { + /// Show detailed metrics (dedup, ranking, vector, lifecycle) + #[arg(short, long)] + verbose: bool, + + /// gRPC endpoint for verbose mode (default: `http://127.0.0.1:50051`) + #[arg(short, long, default_value = "http://127.0.0.1:50051")] + endpoint: String, + }, /// Query the memory system Query { @@ -276,6 +284,32 @@ pub enum AdminCommands { #[arg(long)] vector_path: Option, }, + + /// Prune old vectors from the HNSW index by age + PruneVectors { + /// Remove vectors older than this many days + #[arg(long, default_value = "30")] + age_days: u32, + + /// Path to vector index directory (default from config) + #[arg(long)] + vector_path: Option, + + /// Dry run - show what would be pruned + #[arg(long)] + dry_run: bool, + }, + + /// Rebuild BM25 index with level filtering + RebuildBm25 { + /// Minimum TOC level to keep: segment, day, week, month, year + #[arg(long, default_value = "day")] + min_level: String, + + /// Path to search index directory (default from config) + #[arg(long)] + search_path: Option, + }, } /// Scheduler subcommands @@ -620,7 +654,16 @@ mod tests { #[test] fn test_cli_status() { let cli = Cli::parse_from(["memory-daemon", "status"]); - assert!(matches!(cli.command, Commands::Status)); + assert!(matches!(cli.command, Commands::Status { .. })); + } + + #[test] + fn test_cli_status_verbose() { + let cli = Cli::parse_from(["memory-daemon", "status", "--verbose"]); + match cli.command { + Commands::Status { verbose, .. } => assert!(verbose), + _ => panic!("Expected Status command"), + } } #[test] diff --git a/crates/memory-daemon/src/commands.rs b/crates/memory-daemon/src/commands.rs index f151a12..87295d1 100644 --- a/crates/memory-daemon/src/commands.rs +++ b/crates/memory-daemon/src/commands.rs @@ -168,8 +168,9 @@ async fn register_indexing_job( async fn register_prune_jobs(scheduler: &SchedulerService, db_path: &Path) -> Result<()> { use memory_embeddings::EmbeddingModel; use memory_scheduler::{ - register_bm25_prune_job, register_vector_prune_job, Bm25PruneJob, Bm25PruneJobConfig, - VectorPruneJob, VectorPruneJobConfig, + register_bm25_prune_job, register_bm25_rebuild_job, register_vector_prune_job, + Bm25PruneJob, Bm25PruneJobConfig, Bm25RebuildJob, Bm25RebuildJobConfig, VectorPruneJob, + VectorPruneJobConfig, }; use memory_search::{SearchIndex, SearchIndexConfig, SearchIndexer}; use memory_vector::{ @@ -180,7 +181,7 @@ async fn register_prune_jobs(scheduler: &SchedulerService, db_path: &Path) -> Re let search_dir = db_path.join("search"); let vector_dir = db_path.join("vector"); - // Register BM25 prune job if search index exists + // Register BM25 prune and rebuild jobs if search index exists if search_dir.exists() { let search_config = SearchIndexConfig::new(&search_dir); match SearchIndex::open_or_create(search_config) { @@ -190,10 +191,11 @@ async fn register_prune_jobs(scheduler: &SchedulerService, db_path: &Path) -> Re let indexer = Arc::new(indexer); // Create prune job with callback + let indexer_for_prune = Arc::clone(&indexer); let bm25_job = Bm25PruneJob::with_prune_fn( Bm25PruneJobConfig::default(), move |age_days, level, dry_run| { - let idx = Arc::clone(&indexer); + let idx = Arc::clone(&indexer_for_prune); async move { idx.prune_and_commit(age_days, level.as_deref(), dry_run) .map_err(|e| e.to_string()) @@ -206,6 +208,25 @@ async fn register_prune_jobs(scheduler: &SchedulerService, db_path: &Path) -> Re .context("Failed to register BM25 prune job")?; info!("BM25 prune job registered"); + + // Register BM25 rebuild job (for lifecycle level-filtering) + let indexer_for_rebuild = Arc::clone(&indexer); + let rebuild_job = Bm25RebuildJob::with_rebuild_fn( + Bm25RebuildJobConfig::default(), + move |min_level| { + let idx = Arc::clone(&indexer_for_rebuild); + async move { + idx.rebuild_with_filter(&min_level) + .map_err(|e| e.to_string()) + } + }, + ); + + register_bm25_rebuild_job(scheduler, rebuild_job) + .await + .context("Failed to register BM25 rebuild job")?; + + info!("BM25 rebuild job registered"); } Err(e) => { warn!(error = %e, "Failed to create search indexer for BM25 prune job"); @@ -497,7 +518,9 @@ pub async fn start_daemon( info!( " Staleness filter: enabled={}, half_life={}d, max_penalty={}", - settings.staleness.enabled, settings.staleness.half_life_days, settings.staleness.max_penalty + settings.staleness.enabled, + settings.staleness.half_life_days, + settings.staleness.max_penalty ); // Start server with scheduler @@ -570,6 +593,83 @@ pub fn show_status() -> Result<()> { } } +/// Show verbose status by querying the running daemon for detailed metrics. +/// +/// Calls GetDedupStatus, GetRankingStatus, and GetVectorIndexStatus RPCs +/// to display dedup, ranking, vector, and lifecycle health information. +pub async fn show_verbose_status(endpoint: &str) -> Result<()> { + let mut client = MemoryClient::connect(endpoint) + .await + .context("Failed to connect to daemon for verbose status")?; + + println!(); + println!("Detailed Status"); + println!("================"); + + // Dedup status + match client.get_dedup_status().await { + Ok(dedup) => { + let hit_rate = if dedup.events_checked > 0 { + (dedup.events_deduplicated as f64 / dedup.events_checked as f64) * 100.0 + } else { + 0.0 + }; + println!( + "Dedup: enabled={}, buffer_size={}/{}, hit_rate={:.1}%, events_skipped={}", + dedup.enabled, + dedup.buffer_size, + dedup.buffer_capacity, + hit_rate, + dedup.events_skipped, + ); + } + Err(e) => println!("Dedup: error - {}", e), + } + + // Ranking status + match client.get_ranking_status().await { + Ok(ranking) => { + println!( + "Ranking: avg_salience={:.2}, high_salience_nodes={}, avg_usage_decay={:.2}", + ranking.avg_salience_score, ranking.high_salience_count, ranking.avg_usage_decay, + ); + println!( + "Novelty: enabled={}, checked={}, rejected={}", + ranking.novelty_enabled, + ranking.novelty_checked_total, + ranking.novelty_rejected_total, + ); + println!( + "Lifecycle: vector={}, bm25={}", + if ranking.vector_lifecycle_enabled { + "enabled" + } else { + "disabled" + }, + if ranking.bm25_lifecycle_enabled { + "enabled" + } else { + "disabled" + }, + ); + } + Err(e) => println!("Ranking: error - {}", e), + } + + // Vector index status + match client.get_vector_index_status().await { + Ok(vector) => { + println!( + "Vector: vectors={}, available={}", + vector.vector_count, vector.available, + ); + } + Err(e) => println!("Vector: error - {}", e), + } + + Ok(()) +} + /// Handle query commands. pub async fn handle_query(endpoint: &str, command: QueryCommands) -> Result<()> { let mut client = MemoryClient::connect(endpoint) @@ -1052,11 +1152,189 @@ pub fn handle_admin(db_path: Option, command: AdminCommands) -> Result<( } => { handle_clear_index(&index, force, search_path, vector_path, &expanded_path)?; } + + AdminCommands::PruneVectors { + age_days, + vector_path, + dry_run, + } => { + handle_prune_vectors(&expanded_path, age_days, vector_path, dry_run)?; + } + + AdminCommands::RebuildBm25 { + min_level, + search_path, + } => { + handle_rebuild_bm25(&expanded_path, &min_level, search_path)?; + } } Ok(()) } +/// Handle the prune-vectors command. +/// +/// Prunes old vectors from the HNSW index based on age. +fn handle_prune_vectors( + db_path: &str, + age_days: u32, + vector_path: Option, + dry_run: bool, +) -> Result<()> { + use memory_embeddings::EmbeddingModel; + use memory_vector::{ + HnswConfig, HnswIndex, PipelineConfig as VectorPipelineConfig, VectorIndexPipeline, + VectorMetadata, + }; + + let vector_dir = vector_path + .map(PathBuf::from) + .unwrap_or_else(|| PathBuf::from(db_path).join("vector")); + + if !vector_dir.exists() { + anyhow::bail!("Vector index directory not found at {:?}", vector_dir); + } + + println!("Vector Index Pruning"); + println!("===================="); + println!("Vector path: {:?}", vector_dir); + println!("Age threshold: {} days", age_days); + println!("Dry run: {}", dry_run); + println!(); + + // Load embedder + let embedder = memory_embeddings::CandleEmbedder::load_default() + .context("Failed to load embedding model")?; + let embedder = Arc::new(embedder); + let hnsw_config = HnswConfig::new(embedder.info().dimension, &vector_dir); + + let hnsw_index = HnswIndex::open_or_create(hnsw_config).context("Failed to open HNSW index")?; + let hnsw_index = Arc::new(std::sync::RwLock::new(hnsw_index)); + + let metadata_path = vector_dir.join("metadata"); + if !metadata_path.exists() { + anyhow::bail!("Vector metadata directory not found at {:?}", metadata_path); + } + + let metadata = + VectorMetadata::open(&metadata_path).context("Failed to open vector metadata")?; + let metadata = Arc::new(metadata); + + let pipeline = VectorIndexPipeline::new( + embedder, + hnsw_index, + metadata, + VectorPipelineConfig::default(), + ); + + // Prune each non-protected level + let levels = ["segment", "grip", "day", "week"]; + let mut total_pruned = 0usize; + + for level in &levels { + if dry_run { + println!( + " [DRY RUN] Would prune '{}' vectors older than {} days", + level, age_days + ); + } else { + match pipeline.prune_level(age_days as u64, Some(level)) { + Ok(count) => { + println!( + " Pruned {} '{}' vectors older than {} days", + count, level, age_days + ); + total_pruned += count; + } + Err(e) => { + warn!(level, error = %e, "Failed to prune level"); + println!(" ERROR pruning '{}': {}", level, e); + } + } + } + } + + println!(); + if dry_run { + println!("Dry run complete. No vectors were removed."); + } else { + println!("Pruning complete. Total vectors removed: {}", total_pruned); + } + + Ok(()) +} + +/// Handle the rebuild-bm25 command. +/// +/// Rebuilds the BM25 index keeping only documents at or above the specified level. +fn handle_rebuild_bm25(db_path: &str, min_level: &str, search_path: Option) -> Result<()> { + use memory_search::{SearchIndex, SearchIndexConfig, SearchIndexer}; + + let search_dir = search_path + .map(PathBuf::from) + .unwrap_or_else(|| PathBuf::from(db_path).join("search")); + + if !search_dir.exists() { + anyhow::bail!("Search index directory not found at {:?}", search_dir); + } + + // Validate min_level + let valid_levels = ["segment", "grip", "day", "week", "month", "year"]; + if !valid_levels.contains(&min_level) { + anyhow::bail!( + "Invalid min_level '{}'. Must be one of: {}", + min_level, + valid_levels.join(", ") + ); + } + + println!("BM25 Index Rebuild"); + println!("=================="); + println!("Search path: {:?}", search_dir); + println!("Min level: {} (excluding docs below this level)", min_level); + println!(); + + let search_config = SearchIndexConfig::new(&search_dir); + let search_index = + SearchIndex::open_or_create(search_config).context("Failed to open search index")?; + let indexer = SearchIndexer::new(&search_index).context("Failed to create search indexer")?; + + // Prune documents below min_level by filtering each level below the threshold + let level_order = ["segment", "grip", "day", "week", "month", "year"]; + let min_idx = level_order + .iter() + .position(|l| *l == min_level) + .unwrap_or(0); + + let mut total_pruned: u32 = 0; + for level in &level_order[..min_idx] { + // Prune all docs at this level (age_days=0 would prune everything, + // but we use a very large age to catch all docs at this level regardless of age) + match indexer.prune(0, Some(level), false) { + Ok(stats) => { + let count = stats.total(); + println!(" Removed {} '{}' documents", count, level); + total_pruned += count; + } + Err(e) => { + println!(" ERROR removing '{}' documents: {}", level, e); + } + } + } + + if total_pruned > 0 { + indexer.commit().context("Failed to commit BM25 changes")?; + } + + println!(); + println!( + "Rebuild complete. Removed {} documents below '{}' level.", + total_pruned, min_level + ); + + Ok(()) +} + /// Handle the rebuild-indexes command. fn handle_rebuild_indexes( storage: Arc, diff --git a/crates/memory-daemon/src/lib.rs b/crates/memory-daemon/src/lib.rs index 4c3c467..7a681e1 100644 --- a/crates/memory-daemon/src/lib.rs +++ b/crates/memory-daemon/src/lib.rs @@ -18,5 +18,5 @@ pub use cli::{ pub use commands::{ handle_admin, handle_agents_command, handle_clod_command, handle_query, handle_retrieval_command, handle_scheduler, handle_teleport_command, handle_topics_command, - show_status, start_daemon, stop_daemon, + show_status, show_verbose_status, start_daemon, stop_daemon, }; diff --git a/crates/memory-daemon/src/main.rs b/crates/memory-daemon/src/main.rs index 30a70a7..fce261e 100644 --- a/crates/memory-daemon/src/main.rs +++ b/crates/memory-daemon/src/main.rs @@ -24,7 +24,7 @@ use clap::Parser; use memory_daemon::{ handle_admin, handle_agents_command, handle_clod_command, handle_query, handle_retrieval_command, handle_scheduler, handle_teleport_command, handle_topics_command, - show_status, start_daemon, stop_daemon, Cli, Commands, + show_status, show_verbose_status, start_daemon, stop_daemon, Cli, Commands, }; #[tokio::main] @@ -49,8 +49,11 @@ async fn main() -> Result<()> { Commands::Stop => { stop_daemon()?; } - Commands::Status => { + Commands::Status { verbose, endpoint } => { show_status()?; + if verbose { + show_verbose_status(&endpoint).await?; + } } Commands::Query { endpoint, command } => { handle_query(&endpoint, command).await?; diff --git a/crates/memory-retrieval/src/lib.rs b/crates/memory-retrieval/src/lib.rs index 4c7c662..0f3da59 100644 --- a/crates/memory-retrieval/src/lib.rs +++ b/crates/memory-retrieval/src/lib.rs @@ -66,6 +66,7 @@ pub mod classifier; pub mod contracts; pub mod executor; +pub mod ranking; pub mod stale_filter; pub mod tier; pub mod types; @@ -80,6 +81,7 @@ pub use executor::{ ExecutionResult, FallbackChain, LayerExecutor, LayerResults, MockLayerExecutor, RetrievalExecutor, SearchResult, }; +pub use ranking::{apply_combined_ranking, RankingConfig}; pub use stale_filter::StaleFilter; pub use tier::{LayerStatusProvider, MockLayerStatusProvider, TierDetectionResult, TierDetector}; pub use types::{ diff --git a/crates/memory-retrieval/src/ranking.rs b/crates/memory-retrieval/src/ranking.rs new file mode 100644 index 0000000..75010fc --- /dev/null +++ b/crates/memory-retrieval/src/ranking.rs @@ -0,0 +1,231 @@ +//! Combined ranking formula for retrieval results. +//! +//! Applies salience boosting and usage decay to search results. +//! +//! ## Formula +//! +//! ```text +//! salience_factor = 0.55 + 0.45 * salience_score +//! usage_penalty = 1.0 / (1.0 + decay_factor * access_count) +//! combined_score = similarity * salience_factor * usage_penalty +//! final_score = max(combined_score, similarity * 0.50) // 50% floor +//! ``` + +use crate::executor::SearchResult; + +/// Configuration for combined ranking. +#[derive(Debug, Clone)] +pub struct RankingConfig { + /// Whether salience boosting is enabled. + pub salience_enabled: bool, + /// Whether usage decay is enabled. + pub usage_decay_enabled: bool, + /// Decay factor for usage penalty (higher = more aggressive). + pub decay_factor: f32, + /// Minimum score floor as fraction of original similarity (0.0-1.0). + pub score_floor: f32, +} + +impl Default for RankingConfig { + fn default() -> Self { + Self { + salience_enabled: true, + usage_decay_enabled: false, // Off by default until validated + decay_factor: 0.15, + score_floor: 0.50, + } + } +} + +/// Applies combined ranking formula to search results. +/// +/// Reads `salience_score` and `access_count` from result metadata. +/// Re-sorts results by adjusted score after applying the formula. +pub fn apply_combined_ranking( + mut results: Vec, + config: &RankingConfig, +) -> Vec { + if results.is_empty() { + return results; + } + + for result in &mut results { + let original_score = result.score; + + // Salience factor: 0.55 + 0.45 * salience_score + let salience_factor = if config.salience_enabled { + let salience_score: f32 = result + .metadata + .get("salience_score") + .and_then(|v| v.parse().ok()) + .unwrap_or(0.5); // Default neutral + 0.55 + 0.45 * salience_score + } else { + 1.0 + }; + + // Usage penalty: 1 / (1 + decay_factor * access_count) + let usage_penalty = if config.usage_decay_enabled { + let access_count: u32 = result + .metadata + .get("access_count") + .and_then(|v| v.parse().ok()) + .unwrap_or(0); + 1.0 / (1.0 + config.decay_factor * access_count as f32) + } else { + 1.0 + }; + + // Combined score with floor + let combined = original_score * salience_factor * usage_penalty; + let floor = original_score * config.score_floor; + result.score = combined.max(floor); + } + + // Re-sort by adjusted score + results.sort_by(|a, b| { + b.score + .partial_cmp(&a.score) + .unwrap_or(std::cmp::Ordering::Equal) + }); + + results +} + +#[cfg(test)] +mod tests { + use super::*; + use std::collections::HashMap; + + use crate::types::RetrievalLayer; + + fn make_result(doc_id: &str, score: f32, salience: f32, access_count: u32) -> SearchResult { + let mut metadata = HashMap::new(); + metadata.insert("salience_score".to_string(), salience.to_string()); + metadata.insert("access_count".to_string(), access_count.to_string()); + SearchResult { + doc_id: doc_id.to_string(), + doc_type: "toc_node".to_string(), + score, + text_preview: format!("Preview for {doc_id}"), + source_layer: RetrievalLayer::BM25, + metadata, + } + } + + #[test] + fn test_empty_results() { + let config = RankingConfig::default(); + let results = apply_combined_ranking(vec![], &config); + assert!(results.is_empty()); + } + + #[test] + fn test_salience_boost() { + let config = RankingConfig { + salience_enabled: true, + usage_decay_enabled: false, + ..Default::default() + }; + + let results = vec![ + make_result("high_sal", 0.8, 1.0, 0), // salience_factor = 0.55 + 0.45 = 1.0 + make_result("low_sal", 0.8, 0.0, 0), // salience_factor = 0.55 + make_result("mid_sal", 0.8, 0.5, 0), // salience_factor = 0.55 + 0.225 = 0.775 + ]; + + let ranked = apply_combined_ranking(results, &config); + + assert_eq!(ranked[0].doc_id, "high_sal"); + assert_eq!(ranked[1].doc_id, "mid_sal"); + assert_eq!(ranked[2].doc_id, "low_sal"); + } + + #[test] + fn test_usage_decay() { + let config = RankingConfig { + salience_enabled: false, + usage_decay_enabled: true, + decay_factor: 0.15, + ..Default::default() + }; + + let results = vec![ + make_result("fresh", 0.8, 0.5, 0), // penalty = 1.0 + make_result("used_1", 0.8, 0.5, 5), // penalty = 1/(1+0.75) = 0.571 + make_result("used_10", 0.8, 0.5, 10), // penalty = 1/(1+1.5) = 0.4 + ]; + + let ranked = apply_combined_ranking(results, &config); + + assert_eq!(ranked[0].doc_id, "fresh"); + assert_eq!(ranked[1].doc_id, "used_1"); + assert_eq!(ranked[2].doc_id, "used_10"); + } + + #[test] + fn test_score_floor_prevents_collapse() { + let config = RankingConfig { + salience_enabled: true, + usage_decay_enabled: true, + decay_factor: 0.15, + score_floor: 0.50, + }; + + // Very low salience + high usage: combined would be very low + // but floor prevents collapse + let results = vec![make_result("heavily_used", 0.9, 0.0, 100)]; + + let ranked = apply_combined_ranking(results, &config); + + // Floor = 0.9 * 0.50 = 0.45 + // Combined = 0.9 * 0.55 * (1/16) = 0.031 -> floored to 0.45 + assert!( + ranked[0].score >= 0.44, + "Score should be at or above floor, got {}", + ranked[0].score + ); + } + + #[test] + fn test_combined_formula() { + let config = RankingConfig { + salience_enabled: true, + usage_decay_enabled: true, + decay_factor: 0.15, + score_floor: 0.50, + }; + + let results = vec![make_result("test", 0.8, 0.7, 3)]; + // salience_factor = 0.55 + 0.45 * 0.7 = 0.55 + 0.315 = 0.865 + // usage_penalty = 1 / (1 + 0.15 * 3) = 1 / 1.45 = 0.6897 + // combined = 0.8 * 0.865 * 0.6897 = 0.477 + // floor = 0.8 * 0.50 = 0.4 + // final = max(0.477, 0.4) = 0.477 + + let ranked = apply_combined_ranking(results, &config); + assert!( + (ranked[0].score - 0.477).abs() < 0.01, + "Expected ~0.477, got {}", + ranked[0].score + ); + } + + #[test] + fn test_disabled_passthrough() { + let config = RankingConfig { + salience_enabled: false, + usage_decay_enabled: false, + ..Default::default() + }; + + let results = vec![make_result("test", 0.8, 1.0, 100)]; + let ranked = apply_combined_ranking(results, &config); + + // Both disabled, score should be unchanged + assert!( + (ranked[0].score - 0.8).abs() < f32::EPSILON, + "Score should be unchanged when both disabled" + ); + } +} diff --git a/crates/memory-retrieval/src/stale_filter.rs b/crates/memory-retrieval/src/stale_filter.rs index 8ff9cfd..91d659d 100644 --- a/crates/memory-retrieval/src/stale_filter.rs +++ b/crates/memory-retrieval/src/stale_filter.rs @@ -92,11 +92,7 @@ impl StaleFilter { } /// Apply time-decay to each result based on age relative to newest_ts. - fn apply_time_decay( - &self, - results: Vec, - newest_ts: i64, - ) -> Vec { + fn apply_time_decay(&self, results: Vec, newest_ts: i64) -> Vec { let half_life = self.config.half_life_days as f64; let max_penalty = self.config.max_penalty as f64; @@ -130,8 +126,7 @@ impl StaleFilter { // Apply decay formula: // score * (1.0 - max_penalty * (1.0 - exp(-age_days / half_life))) - let decay_factor = - 1.0 - max_penalty * (1.0 - (-age_days / half_life).exp()); + let decay_factor = 1.0 - max_penalty * (1.0 - (-age_days / half_life).exp()); r.score = (r.score as f64 * decay_factor) as f32; r @@ -414,12 +409,20 @@ mod tests { // "old_high" starts higher but is much older, should drop below "new_low" let results = vec![ - make_result("old_high", 0.95, Some(now - 60 * DAY_MS), Some("observation")), + make_result( + "old_high", + 0.95, + Some(now - 60 * DAY_MS), + Some("observation"), + ), make_result("new_low", 0.70, Some(now), Some("observation")), ]; let output = filter.apply(results); // After decay, new_low (0.70) should be above old_high (~0.95 * 0.717 ~ 0.681) - assert_eq!(output[0].doc_id, "new_low", "Newer result should be ranked first"); + assert_eq!( + output[0].doc_id, "new_low", + "Newer result should be ranked first" + ); } #[test] @@ -429,8 +432,18 @@ mod tests { let results = vec![ make_result("recent_obs", 0.85, Some(now), Some("observation")), - make_result("old_obs", 0.90, Some(now - 28 * DAY_MS), Some("observation")), - make_result("old_constraint", 0.80, Some(now - 28 * DAY_MS), Some("constraint")), + make_result( + "old_obs", + 0.90, + Some(now - 28 * DAY_MS), + Some("observation"), + ), + make_result( + "old_constraint", + 0.80, + Some(now - 28 * DAY_MS), + Some("constraint"), + ), make_result("no_ts", 0.75, None, Some("observation")), ]; let output = filter.apply(results); @@ -444,7 +457,10 @@ mod tests { assert!(old.score < 0.90); // old_constraint: exempt - let constraint = output.iter().find(|r| r.doc_id == "old_constraint").unwrap(); + let constraint = output + .iter() + .find(|r| r.doc_id == "old_constraint") + .unwrap(); assert!((constraint.score - 0.80).abs() < f32::EPSILON); // no_ts: no penalty (fail-open) @@ -460,7 +476,12 @@ mod tests { // Very old result (365 days) should approach but not exceed 30% let results = vec![ make_result("new", 1.0, Some(now), Some("observation")), - make_result("ancient", 1.0, Some(now - 365 * DAY_MS), Some("observation")), + make_result( + "ancient", + 1.0, + Some(now - 365 * DAY_MS), + Some("observation"), + ), ]; let output = filter.apply(results); let ancient = output.iter().find(|r| r.doc_id == "ancient").unwrap(); @@ -598,7 +619,12 @@ mod tests { let results = vec![ make_result("newer_obs", 0.9, Some(now), Some("observation")), - make_result("older_constraint", 0.85, Some(now - DAY_MS), Some("constraint")), + make_result( + "older_constraint", + 0.85, + Some(now - DAY_MS), + Some("constraint"), + ), ]; let (emb_a, emb_b) = similar_pair(16); diff --git a/crates/memory-scheduler/src/jobs/bm25_rebuild.rs b/crates/memory-scheduler/src/jobs/bm25_rebuild.rs new file mode 100644 index 0000000..d78d0a4 --- /dev/null +++ b/crates/memory-scheduler/src/jobs/bm25_rebuild.rs @@ -0,0 +1,288 @@ +//! BM25 rebuild scheduler job for lifecycle automation. +//! +//! Rebuilds the BM25 index with level filtering, removing fine-grained +//! segment/grip docs after rollup has created day+ level summaries. +//! DISABLED by default - opt-in via `[lifecycle.bm25]` config section. + +use std::future::Future; +use std::pin::Pin; +use std::sync::Arc; + +use tokio_util::sync::CancellationToken; +use tracing; + +/// Rebuild function type for BM25 rebuild. +/// Takes min_level filter and returns count of documents removed. +pub type Bm25RebuildFn = + Arc Pin> + Send>> + Send + Sync>; + +/// Configuration for BM25 rebuild job. +#[derive(Clone)] +pub struct Bm25RebuildJobConfig { + /// Cron schedule (default: "0 4 * * 0" - weekly Sunday 4 AM). + pub cron_schedule: String, + /// Minimum level to keep (default: "day"). + pub min_level: String, + /// Whether the job is enabled (default: false). + pub enabled: bool, + /// Optional rebuild callback. + pub rebuild_fn: Option, +} + +impl std::fmt::Debug for Bm25RebuildJobConfig { + fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result { + f.debug_struct("Bm25RebuildJobConfig") + .field("cron_schedule", &self.cron_schedule) + .field("min_level", &self.min_level) + .field("enabled", &self.enabled) + .field("rebuild_fn", &self.rebuild_fn.is_some()) + .finish() + } +} + +impl Default for Bm25RebuildJobConfig { + fn default() -> Self { + Self { + cron_schedule: "0 4 * * 0".to_string(), + min_level: "day".to_string(), + enabled: false, + rebuild_fn: None, + } + } +} + +/// BM25 rebuild job - rebuilds BM25 index with level filtering. +pub struct Bm25RebuildJob { + config: Bm25RebuildJobConfig, +} + +impl Bm25RebuildJob { + pub fn new(config: Bm25RebuildJobConfig) -> Self { + Self { config } + } + + /// Create a job with a rebuild callback. + /// + /// The callback should call `SearchIndexer::rebuild_with_filter()` and return + /// the count of removed documents. + pub fn with_rebuild_fn(mut config: Bm25RebuildJobConfig, rebuild_fn: F) -> Self + where + F: Fn(String) -> Fut + Send + Sync + 'static, + Fut: Future> + Send + 'static, + { + config.rebuild_fn = Some(Arc::new(move |min_level| Box::pin(rebuild_fn(min_level)))); + Self { config } + } + + /// Execute the rebuild job. + pub async fn run(&self, cancel: CancellationToken) -> Result { + if cancel.is_cancelled() { + return Ok(0); + } + + if !self.config.enabled { + tracing::debug!("BM25 rebuild job disabled, skipping"); + return Ok(0); + } + + tracing::info!( + min_level = %self.config.min_level, + "Starting BM25 rebuild job" + ); + + if let Some(ref rebuild_fn) = self.config.rebuild_fn { + let result = rebuild_fn(self.config.min_level.clone()).await; + match &result { + Ok(count) => { + tracing::info!(removed = count, "BM25 rebuild job completed"); + } + Err(e) => { + tracing::error!(error = %e, "BM25 rebuild job failed"); + } + } + result + } else { + tracing::info!( + min_level = %self.config.min_level, + "Would rebuild BM25 index (no rebuild_fn configured)" + ); + Ok(0) + } + } + + /// Get job name. + pub fn name(&self) -> &str { + "bm25_rebuild" + } + + /// Get cron schedule. + pub fn cron_schedule(&self) -> &str { + &self.config.cron_schedule + } + + /// Get configuration. + pub fn config(&self) -> &Bm25RebuildJobConfig { + &self.config + } +} + +/// Create BM25 rebuild job for registration with scheduler. +pub fn create_bm25_rebuild_job(config: Bm25RebuildJobConfig) -> Bm25RebuildJob { + Bm25RebuildJob::new(config) +} + +/// Register the BM25 rebuild job with the scheduler. +pub async fn register_bm25_rebuild_job( + scheduler: &crate::SchedulerService, + job: Bm25RebuildJob, +) -> Result<(), crate::SchedulerError> { + use crate::{JitterConfig, JobOutput, OverlapPolicy, TimeoutConfig}; + + let config = job.config().clone(); + + // Convert 5-field cron to 6-field + let cron = convert_5field_to_6field(&config.cron_schedule); + let job = Arc::new(job); + + scheduler + .register_job_with_metadata( + "bm25_rebuild", + &cron, + Some("UTC"), + OverlapPolicy::Skip, + JitterConfig::new(60), // Up to 60 seconds jitter + TimeoutConfig::new(3600), // 1 hour timeout + move || { + let job = Arc::clone(&job); + async move { + let cancel = CancellationToken::new(); + job.run(cancel) + .await + .map(|count| { + tracing::info!(removed = count, "BM25 rebuild job completed"); + JobOutput::new() + .with_prune_count(count) + .with_metadata("documents_removed", count.to_string()) + }) + .map_err(|e| format!("BM25 rebuild failed: {}", e)) + } + }, + ) + .await?; + + tracing::info!( + enabled = config.enabled, + schedule = %config.cron_schedule, + min_level = %config.min_level, + "Registered BM25 rebuild job" + ); + Ok(()) +} + +/// Convert 5-field cron to 6-field (add seconds). +fn convert_5field_to_6field(cron_5field: &str) -> String { + let parts: Vec<&str> = cron_5field.split_whitespace().collect(); + if parts.len() == 5 { + format!("0 {}", cron_5field) + } else { + cron_5field.to_string() + } +} + +#[cfg(test)] +mod tests { + use super::*; + use std::sync::atomic::{AtomicU32, Ordering}; + + #[tokio::test] + async fn test_job_disabled_by_default() { + let config = Bm25RebuildJobConfig::default(); + assert!(!config.enabled); + + let job = Bm25RebuildJob::new(config); + let cancel = CancellationToken::new(); + + let result = job.run(cancel).await; + assert!(result.is_ok()); + assert_eq!(result.unwrap(), 0); + } + + #[tokio::test] + async fn test_job_respects_cancel() { + let config = Bm25RebuildJobConfig { + enabled: true, + ..Default::default() + }; + let job = Bm25RebuildJob::new(config); + let cancel = CancellationToken::new(); + cancel.cancel(); + + let result = job.run(cancel).await; + assert!(result.is_ok()); + assert_eq!(result.unwrap(), 0); + } + + #[tokio::test] + async fn test_job_calls_rebuild_fn() { + let call_count = Arc::new(AtomicU32::new(0)); + let call_count_clone = call_count.clone(); + + let rebuild_fn = move |_min_level: String| { + let count = call_count_clone.clone(); + async move { + count.fetch_add(1, Ordering::SeqCst); + Ok(42u32) + } + }; + + let config = Bm25RebuildJobConfig { + enabled: true, + ..Default::default() + }; + let job = Bm25RebuildJob::with_rebuild_fn(config, rebuild_fn); + let cancel = CancellationToken::new(); + + let result = job.run(cancel).await; + assert!(result.is_ok()); + assert_eq!(result.unwrap(), 42); + assert_eq!(call_count.load(Ordering::SeqCst), 1); + } + + #[tokio::test] + async fn test_job_handles_rebuild_error() { + let rebuild_fn = |_min_level: String| async { Err("test error".to_string()) }; + + let config = Bm25RebuildJobConfig { + enabled: true, + ..Default::default() + }; + let job = Bm25RebuildJob::with_rebuild_fn(config, rebuild_fn); + let cancel = CancellationToken::new(); + + let result = job.run(cancel).await; + assert!(result.is_err()); + } + + #[test] + fn test_default_config() { + let config = Bm25RebuildJobConfig::default(); + assert_eq!(config.cron_schedule, "0 4 * * 0"); + assert_eq!(config.min_level, "day"); + assert!(!config.enabled); + assert!(config.rebuild_fn.is_none()); + } + + #[test] + fn test_job_name() { + let job = Bm25RebuildJob::new(Bm25RebuildJobConfig::default()); + assert_eq!(job.name(), "bm25_rebuild"); + } + + #[test] + fn test_config_debug() { + let config = Bm25RebuildJobConfig::default(); + let debug_str = format!("{:?}", config); + assert!(debug_str.contains("Bm25RebuildJobConfig")); + assert!(debug_str.contains("rebuild_fn: false")); + } +} diff --git a/crates/memory-scheduler/src/jobs/mod.rs b/crates/memory-scheduler/src/jobs/mod.rs index 794f96a..5f0c957 100644 --- a/crates/memory-scheduler/src/jobs/mod.rs +++ b/crates/memory-scheduler/src/jobs/mod.rs @@ -18,6 +18,8 @@ pub mod rollup; #[cfg(feature = "jobs")] pub mod bm25_prune; #[cfg(feature = "jobs")] +pub mod bm25_rebuild; +#[cfg(feature = "jobs")] pub mod indexing; #[cfg(feature = "jobs")] pub mod search; @@ -30,6 +32,10 @@ pub use rollup::{create_rollup_jobs, RollupJobConfig}; #[cfg(feature = "jobs")] pub use bm25_prune::{create_bm25_prune_job, Bm25PruneJob, Bm25PruneJobConfig}; #[cfg(feature = "jobs")] +pub use bm25_rebuild::{ + create_bm25_rebuild_job, register_bm25_rebuild_job, Bm25RebuildJob, Bm25RebuildJobConfig, +}; +#[cfg(feature = "jobs")] pub use indexing::{create_indexing_job, IndexingJobConfig}; #[cfg(feature = "jobs")] pub use search::{create_index_commit_job, IndexCommitJobConfig}; diff --git a/crates/memory-scheduler/src/lib.rs b/crates/memory-scheduler/src/lib.rs index bb66be7..e852994 100644 --- a/crates/memory-scheduler/src/lib.rs +++ b/crates/memory-scheduler/src/lib.rs @@ -60,6 +60,10 @@ pub use jobs::bm25_prune::{ create_bm25_prune_job, register_bm25_prune_job, Bm25PruneJob, Bm25PruneJobConfig, }; #[cfg(feature = "jobs")] +pub use jobs::bm25_rebuild::{ + create_bm25_rebuild_job, register_bm25_rebuild_job, Bm25RebuildJob, Bm25RebuildJobConfig, +}; +#[cfg(feature = "jobs")] pub use jobs::compaction::{create_compaction_job, CompactionJobConfig}; #[cfg(feature = "jobs")] pub use jobs::indexing::{create_indexing_job, IndexingJobConfig}; diff --git a/crates/memory-search/src/indexer.rs b/crates/memory-search/src/indexer.rs index 8a5b52e..d272692 100644 --- a/crates/memory-search/src/indexer.rs +++ b/crates/memory-search/src/indexer.rs @@ -344,6 +344,44 @@ impl SearchIndexer { Ok(stats) } + /// Rebuild the index keeping only documents at or above the specified level. + /// + /// This removes all documents below `min_level` from the index. + /// For example, with `min_level = "day"`, all segment and grip documents + /// are deleted, keeping only day, week, month, and year docs. + /// + /// This is useful after TOC rollup when fine-grained segments are no longer + /// needed in the search index. + /// + /// Returns the count of documents removed. + pub fn rebuild_with_filter(&self, min_level: &str) -> Result { + let level_order = ["segment", "grip", "day", "week", "month", "year"]; + let min_idx = level_order + .iter() + .position(|l| *l == min_level) + .unwrap_or(0); + + let mut total_removed: u32 = 0; + + // Prune all levels below the minimum (age_days=0 means "prune everything at this level") + for level in &level_order[..min_idx] { + let stats = self.prune(0, Some(level), false)?; + total_removed += stats.total(); + } + + if total_removed > 0 { + self.commit()?; + } + + info!( + min_level = min_level, + removed = total_removed, + "Rebuild with filter complete" + ); + + Ok(total_removed) + } + /// Prune and commit in one operation. /// /// Convenience method that calls prune() followed by commit(). diff --git a/crates/memory-service/src/episodes.rs b/crates/memory-service/src/episodes.rs new file mode 100644 index 0000000..c4d73a2 --- /dev/null +++ b/crates/memory-service/src/episodes.rs @@ -0,0 +1,738 @@ +//! Episode RPC handlers for episodic memory. +//! +//! Implements Phase 44 Episodic Memory RPCs: +//! - StartEpisode: Begin tracking a task execution +//! - RecordAction: Record an action within an episode +//! - CompleteEpisode: Finish an episode with outcome and lessons +//! - GetSimilarEpisodes: Find similar episodes via cosine similarity +//! +//! Follows the AgentDiscoveryHandler/TopicGraphHandler pattern with `Arc`. + +use std::sync::Arc; + +use chrono::{TimeZone, Utc}; +use tonic::{Request, Response, Status}; +use tracing::{debug, info, warn}; + +use memory_storage::Storage; +use memory_types::config::EpisodicConfig; +use memory_types::{Action, ActionResult, Episode, EpisodeStatus}; + +use crate::novelty::EmbedderTrait; +use crate::pb::{ + ActionResultStatus, CompleteEpisodeRequest, CompleteEpisodeResponse, EpisodeAction, + EpisodeStatusProto, EpisodeSummary, GetSimilarEpisodesRequest, GetSimilarEpisodesResponse, + RecordActionRequest, RecordActionResponse, StartEpisodeRequest, StartEpisodeResponse, +}; + +/// Handler for episodic memory RPCs. +pub struct EpisodeHandler { + storage: Arc, + config: EpisodicConfig, + embedder: Option>, +} + +impl EpisodeHandler { + /// Create a new episode handler. + pub fn new(storage: Arc, config: EpisodicConfig) -> Self { + Self { + storage, + config, + embedder: None, + } + } + + /// Set the embedder for generating episode embeddings. + pub fn with_embedder(mut self, embedder: Arc) -> Self { + self.embedder = Some(embedder); + self + } + + /// Handle StartEpisode RPC. + pub async fn start_episode( + &self, + request: Request, + ) -> Result, Status> { + if !self.config.enabled { + return Err(Status::failed_precondition( + "Episodic memory is not enabled", + )); + } + + let req = request.into_inner(); + + if req.task.is_empty() { + return Err(Status::invalid_argument("task is required")); + } + + let episode_id = ulid::Ulid::new().to_string(); + let mut episode = Episode::new(episode_id.clone(), req.task).with_plan(req.plan); + + if let Some(agent) = req.agent { + episode = episode.with_agent(agent); + } + + self.storage + .store_episode(&episode) + .map_err(|e| Status::internal(format!("Failed to store episode: {e}")))?; + + info!(episode_id = %episode_id, "Started episode"); + + Ok(Response::new(StartEpisodeResponse { + episode_id, + created: true, + })) + } + + /// Handle RecordAction RPC. + pub async fn record_action( + &self, + request: Request, + ) -> Result, Status> { + if !self.config.enabled { + return Err(Status::failed_precondition( + "Episodic memory is not enabled", + )); + } + + let req = request.into_inner(); + + if req.episode_id.is_empty() { + return Err(Status::invalid_argument("episode_id is required")); + } + + let proto_action = req + .action + .ok_or_else(|| Status::invalid_argument("action is required"))?; + + let mut episode = self + .storage + .get_episode(&req.episode_id) + .map_err(|e| Status::internal(format!("Failed to get episode: {e}")))? + .ok_or_else(|| Status::not_found("Episode not found"))?; + + if episode.status != EpisodeStatus::InProgress { + return Err(Status::failed_precondition( + "Cannot record actions on a completed or failed episode", + )); + } + + let action = convert_proto_action(proto_action)?; + episode.add_action(action); + + self.storage + .update_episode(&episode) + .map_err(|e| Status::internal(format!("Failed to update episode: {e}")))?; + + let action_count = episode.actions.len() as u32; + debug!(episode_id = %req.episode_id, action_count, "Recorded action"); + + Ok(Response::new(RecordActionResponse { + recorded: true, + action_count, + })) + } + + /// Handle CompleteEpisode RPC. + pub async fn complete_episode( + &self, + request: Request, + ) -> Result, Status> { + if !self.config.enabled { + return Err(Status::failed_precondition( + "Episodic memory is not enabled", + )); + } + + let req = request.into_inner(); + + if req.episode_id.is_empty() { + return Err(Status::invalid_argument("episode_id is required")); + } + + if !(0.0..=1.0).contains(&req.outcome_score) { + return Err(Status::invalid_argument( + "outcome_score must be between 0.0 and 1.0", + )); + } + + let mut episode = self + .storage + .get_episode(&req.episode_id) + .map_err(|e| Status::internal(format!("Failed to get episode: {e}")))? + .ok_or_else(|| Status::not_found("Episode not found"))?; + + if episode.status != EpisodeStatus::InProgress { + return Err(Status::failed_precondition( + "Episode is already completed or failed", + )); + } + + // Complete or fail the episode + let midpoint = self.config.midpoint_target; + if req.failed { + episode.fail(req.outcome_score, midpoint); + } else { + episode.complete(req.outcome_score, midpoint); + } + + episode.lessons_learned = req.lessons_learned; + episode.failure_modes = req.failure_modes; + + // Generate embedding from task + lessons + if let Some(ref embedder) = self.embedder { + let text = build_embedding_text(&episode); + match embedder.embed(&text).await { + Ok(embedding) => { + episode.embedding = Some(embedding); + } + Err(e) => { + warn!(episode_id = %req.episode_id, "Failed to generate episode embedding: {e}"); + // Fail-open: continue without embedding + } + } + } + + let value_score = episode.value_score.unwrap_or(0.0); + + self.storage + .update_episode(&episode) + .map_err(|e| Status::internal(format!("Failed to update episode: {e}")))?; + + // Value-based retention pruning + let episodes_pruned = self.prune_if_over_limit()?; + + info!( + episode_id = %req.episode_id, + value_score, + episodes_pruned, + "Completed episode" + ); + + Ok(Response::new(CompleteEpisodeResponse { + completed: true, + value_score, + episodes_pruned, + })) + } + + /// Handle GetSimilarEpisodes RPC. + pub async fn get_similar_episodes( + &self, + request: Request, + ) -> Result, Status> { + if !self.config.enabled { + return Err(Status::failed_precondition( + "Episodic memory is not enabled", + )); + } + + let req = request.into_inner(); + + if req.query.is_empty() { + return Err(Status::invalid_argument("query is required")); + } + + let top_k = if req.top_k == 0 { 5 } else { req.top_k } as usize; + let min_score = req.min_score; + + // Embed the query + let embedder = self.embedder.as_ref().ok_or_else(|| { + Status::unavailable("Embedder not configured for episode similarity search") + })?; + + let query_embedding = embedder + .embed(&req.query) + .await + .map_err(|e| Status::internal(format!("Failed to embed query: {e}")))?; + + // Load all episodes and compute cosine similarity + let episodes = self + .storage + .list_episodes(self.config.max_episodes) + .map_err(|e| Status::internal(format!("Failed to list episodes: {e}")))?; + + let mut scored: Vec<(f32, &Episode)> = episodes + .iter() + .filter_map(|ep| { + let embedding = ep.embedding.as_ref()?; + let sim = cosine_similarity(&query_embedding, embedding); + if sim >= min_score { + Some((sim, ep)) + } else { + None + } + }) + .collect(); + + // Sort by similarity descending + scored.sort_by(|a, b| b.0.partial_cmp(&a.0).unwrap_or(std::cmp::Ordering::Equal)); + scored.truncate(top_k); + + let summaries: Vec = scored + .iter() + .map(|(sim, ep)| episode_to_summary(ep, *sim)) + .collect(); + + debug!(results = summaries.len(), "Found similar episodes"); + + Ok(Response::new(GetSimilarEpisodesResponse { + episodes: summaries, + })) + } + + /// Prune lowest-value episodes if total exceeds max_episodes. + #[allow(clippy::result_large_err)] + fn prune_if_over_limit(&self) -> Result { + let all_episodes = self + .storage + .list_episodes(self.config.max_episodes + 100) // fetch a bit more + .map_err(|e| Status::internal(format!("Failed to list episodes: {e}")))?; + + if all_episodes.len() <= self.config.max_episodes { + return Ok(0); + } + + let excess = all_episodes.len() - self.config.max_episodes; + + // Sort by value_score ascending (lowest first) to find prune candidates + let mut sortable: Vec<&Episode> = all_episodes.iter().collect(); + sortable.sort_by(|a, b| { + let va = a.value_score.unwrap_or(0.0); + let vb = b.value_score.unwrap_or(0.0); + va.partial_cmp(&vb).unwrap_or(std::cmp::Ordering::Equal) + }); + + let mut pruned = 0u32; + for ep in sortable.iter().take(excess) { + if let Err(e) = self.storage.delete_episode(&ep.episode_id) { + warn!(episode_id = %ep.episode_id, "Failed to prune episode: {e}"); + continue; + } + pruned += 1; + } + + if pruned > 0 { + info!(pruned, "Pruned low-value episodes"); + } + + Ok(pruned) + } +} + +/// Convert a proto EpisodeAction to a domain Action. +#[allow(clippy::result_large_err)] +fn convert_proto_action(proto: EpisodeAction) -> Result { + let result = match ActionResultStatus::try_from(proto.result_status) { + Ok(ActionResultStatus::ActionResultSuccess) => ActionResult::Success(proto.result_detail), + Ok(ActionResultStatus::ActionResultFailure) => ActionResult::Failure(proto.result_detail), + Ok(ActionResultStatus::ActionResultPending) + | Ok(ActionResultStatus::ActionResultUnspecified) => ActionResult::Pending, + Err(_) => ActionResult::Pending, + }; + + let timestamp = if proto.timestamp_ms > 0 { + Utc.timestamp_millis_opt(proto.timestamp_ms) + .single() + .unwrap_or_else(Utc::now) + } else { + Utc::now() + }; + + Ok(Action { + action_type: proto.action_type, + input: proto.input, + result, + timestamp, + }) +} + +/// Build embedding text from episode task + lessons. +fn build_embedding_text(episode: &Episode) -> String { + let mut parts = vec![episode.task.clone()]; + for lesson in &episode.lessons_learned { + parts.push(lesson.clone()); + } + for mode in &episode.failure_modes { + parts.push(mode.clone()); + } + parts.join(". ") +} + +/// Compute cosine similarity between two vectors. +/// +/// Assumes vectors are pre-normalized (dot product = cosine similarity). +fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 { + if a.len() != b.len() || a.is_empty() { + return 0.0; + } + a.iter().zip(b.iter()).map(|(x, y)| x * y).sum() +} + +/// Convert an Episode to a proto EpisodeSummary. +fn episode_to_summary(episode: &Episode, similarity_score: f32) -> EpisodeSummary { + let status = match episode.status { + EpisodeStatus::InProgress => EpisodeStatusProto::EpisodeStatusInProgress, + EpisodeStatus::Completed => EpisodeStatusProto::EpisodeStatusCompleted, + EpisodeStatus::Failed => EpisodeStatusProto::EpisodeStatusFailed, + }; + + EpisodeSummary { + episode_id: episode.episode_id.clone(), + task: episode.task.clone(), + status: status.into(), + outcome_score: episode.outcome_score.unwrap_or(0.0), + value_score: episode.value_score.unwrap_or(0.0), + similarity_score, + lessons_learned: episode.lessons_learned.clone(), + failure_modes: episode.failure_modes.clone(), + action_count: episode.actions.len() as u32, + created_at_ms: episode.created_at.timestamp_millis(), + agent: episode.agent.clone(), + } +} + +#[cfg(test)] +mod tests { + use super::*; + use memory_types::config::EpisodicConfig; + use tempfile::TempDir; + + fn create_test_handler() -> (EpisodeHandler, Arc, TempDir) { + let temp_dir = TempDir::new().unwrap(); + let storage = Arc::new(Storage::open(temp_dir.path()).unwrap()); + let config = EpisodicConfig { + enabled: true, + ..Default::default() + }; + let handler = EpisodeHandler::new(storage.clone(), config); + (handler, storage, temp_dir) + } + + fn create_disabled_handler() -> (EpisodeHandler, TempDir) { + let temp_dir = TempDir::new().unwrap(); + let storage = Arc::new(Storage::open(temp_dir.path()).unwrap()); + let config = EpisodicConfig::default(); // disabled + let handler = EpisodeHandler::new(storage, config); + (handler, temp_dir) + } + + #[tokio::test] + async fn test_start_episode() { + let (handler, _, _temp) = create_test_handler(); + + let response = handler + .start_episode(Request::new(StartEpisodeRequest { + task: "Build auth system".to_string(), + plan: vec!["Design schema".to_string(), "Implement JWT".to_string()], + agent: Some("claude".to_string()), + })) + .await + .unwrap(); + + let resp = response.into_inner(); + assert!(resp.created); + assert!(!resp.episode_id.is_empty()); + } + + #[tokio::test] + async fn test_start_episode_disabled() { + let (handler, _temp) = create_disabled_handler(); + + let result = handler + .start_episode(Request::new(StartEpisodeRequest { + task: "test".to_string(), + plan: vec![], + agent: None, + })) + .await; + + assert!(result.is_err()); + assert_eq!(result.unwrap_err().code(), tonic::Code::FailedPrecondition); + } + + #[tokio::test] + async fn test_start_episode_empty_task() { + let (handler, _, _temp) = create_test_handler(); + + let result = handler + .start_episode(Request::new(StartEpisodeRequest { + task: "".to_string(), + plan: vec![], + agent: None, + })) + .await; + + assert!(result.is_err()); + assert_eq!(result.unwrap_err().code(), tonic::Code::InvalidArgument); + } + + #[tokio::test] + async fn test_record_action() { + let (handler, _, _temp) = create_test_handler(); + + // Start episode + let start_resp = handler + .start_episode(Request::new(StartEpisodeRequest { + task: "test task".to_string(), + plan: vec![], + agent: None, + })) + .await + .unwrap() + .into_inner(); + + // Record action + let response = handler + .record_action(Request::new(RecordActionRequest { + episode_id: start_resp.episode_id.clone(), + action: Some(EpisodeAction { + action_type: "tool_call".to_string(), + input: "read file".to_string(), + result_status: ActionResultStatus::ActionResultSuccess.into(), + result_detail: "file contents".to_string(), + timestamp_ms: Utc::now().timestamp_millis(), + }), + })) + .await + .unwrap(); + + let resp = response.into_inner(); + assert!(resp.recorded); + assert_eq!(resp.action_count, 1); + } + + #[tokio::test] + async fn test_record_action_not_found() { + let (handler, _, _temp) = create_test_handler(); + + let result = handler + .record_action(Request::new(RecordActionRequest { + episode_id: "nonexistent".to_string(), + action: Some(EpisodeAction { + action_type: "tool_call".to_string(), + input: "test".to_string(), + result_status: ActionResultStatus::ActionResultSuccess.into(), + result_detail: "ok".to_string(), + timestamp_ms: 0, + }), + })) + .await; + + assert!(result.is_err()); + assert_eq!(result.unwrap_err().code(), tonic::Code::NotFound); + } + + #[tokio::test] + async fn test_complete_episode() { + let (handler, storage, _temp) = create_test_handler(); + + // Start episode + let start_resp = handler + .start_episode(Request::new(StartEpisodeRequest { + task: "test task".to_string(), + plan: vec![], + agent: None, + })) + .await + .unwrap() + .into_inner(); + + // Complete episode + let response = handler + .complete_episode(Request::new(CompleteEpisodeRequest { + episode_id: start_resp.episode_id.clone(), + outcome_score: 0.65, + failed: false, + lessons_learned: vec!["Always test first".to_string()], + failure_modes: vec![], + })) + .await + .unwrap(); + + let resp = response.into_inner(); + assert!(resp.completed); + // At midpoint (0.65), value score = 1.0 + assert!((resp.value_score - 1.0).abs() < f32::EPSILON); + + // Verify storage + let stored = storage + .get_episode(&start_resp.episode_id) + .unwrap() + .unwrap(); + assert_eq!(stored.status, EpisodeStatus::Completed); + assert_eq!(stored.lessons_learned, vec!["Always test first"]); + } + + #[tokio::test] + async fn test_complete_episode_failed() { + let (handler, storage, _temp) = create_test_handler(); + + let start_resp = handler + .start_episode(Request::new(StartEpisodeRequest { + task: "failing task".to_string(), + plan: vec![], + agent: None, + })) + .await + .unwrap() + .into_inner(); + + let response = handler + .complete_episode(Request::new(CompleteEpisodeRequest { + episode_id: start_resp.episode_id.clone(), + outcome_score: 0.2, + failed: true, + lessons_learned: vec![], + failure_modes: vec!["Timeout on API".to_string()], + })) + .await + .unwrap(); + + let resp = response.into_inner(); + assert!(resp.completed); + + let stored = storage + .get_episode(&start_resp.episode_id) + .unwrap() + .unwrap(); + assert_eq!(stored.status, EpisodeStatus::Failed); + assert_eq!(stored.failure_modes, vec!["Timeout on API"]); + } + + #[tokio::test] + async fn test_complete_episode_invalid_score() { + let (handler, _, _temp) = create_test_handler(); + + let start_resp = handler + .start_episode(Request::new(StartEpisodeRequest { + task: "test".to_string(), + plan: vec![], + agent: None, + })) + .await + .unwrap() + .into_inner(); + + let result = handler + .complete_episode(Request::new(CompleteEpisodeRequest { + episode_id: start_resp.episode_id, + outcome_score: 1.5, + failed: false, + lessons_learned: vec![], + failure_modes: vec![], + })) + .await; + + assert!(result.is_err()); + assert_eq!(result.unwrap_err().code(), tonic::Code::InvalidArgument); + } + + #[tokio::test] + async fn test_complete_already_completed() { + let (handler, _, _temp) = create_test_handler(); + + let start_resp = handler + .start_episode(Request::new(StartEpisodeRequest { + task: "test".to_string(), + plan: vec![], + agent: None, + })) + .await + .unwrap() + .into_inner(); + + // Complete once + handler + .complete_episode(Request::new(CompleteEpisodeRequest { + episode_id: start_resp.episode_id.clone(), + outcome_score: 0.5, + failed: false, + lessons_learned: vec![], + failure_modes: vec![], + })) + .await + .unwrap(); + + // Try again + let result = handler + .complete_episode(Request::new(CompleteEpisodeRequest { + episode_id: start_resp.episode_id, + outcome_score: 0.8, + failed: false, + lessons_learned: vec![], + failure_modes: vec![], + })) + .await; + + assert!(result.is_err()); + assert_eq!(result.unwrap_err().code(), tonic::Code::FailedPrecondition); + } + + #[test] + fn test_cosine_similarity() { + let a = vec![1.0, 0.0, 0.0]; + let b = vec![1.0, 0.0, 0.0]; + assert!((cosine_similarity(&a, &b) - 1.0).abs() < f32::EPSILON); + + let c = vec![0.0, 1.0, 0.0]; + assert!((cosine_similarity(&a, &c) - 0.0).abs() < f32::EPSILON); + + // Empty or mismatched + assert!((cosine_similarity(&[], &[]) - 0.0).abs() < f32::EPSILON); + assert!((cosine_similarity(&[1.0], &[1.0, 2.0]) - 0.0).abs() < f32::EPSILON); + } + + #[test] + fn test_build_embedding_text() { + let mut episode = Episode::new("test".to_string(), "Build auth".to_string()); + episode.lessons_learned = vec!["Use JWT".to_string()]; + episode.failure_modes = vec!["Timeout".to_string()]; + + let text = build_embedding_text(&episode); + assert_eq!(text, "Build auth. Use JWT. Timeout"); + } + + #[tokio::test] + async fn test_value_based_pruning() { + let temp_dir = TempDir::new().unwrap(); + let storage = Arc::new(Storage::open(temp_dir.path()).unwrap()); + let config = EpisodicConfig { + enabled: true, + max_episodes: 3, + ..Default::default() + }; + let handler = EpisodeHandler::new(storage.clone(), config); + + // Create 4 episodes with different value scores + for (i, score) in [0.1, 0.9, 0.5, 0.65].iter().enumerate() { + let start_resp = handler + .start_episode(Request::new(StartEpisodeRequest { + task: format!("task {i}"), + plan: vec![], + agent: None, + })) + .await + .unwrap() + .into_inner(); + + // Small delay so ULIDs are distinct + std::thread::sleep(std::time::Duration::from_millis(2)); + + handler + .complete_episode(Request::new(CompleteEpisodeRequest { + episode_id: start_resp.episode_id, + outcome_score: *score, + failed: false, + lessons_learned: vec![], + failure_modes: vec![], + })) + .await + .unwrap(); + } + + // After 4th episode, pruning should have removed 1 + let remaining = storage.list_episodes(100).unwrap(); + assert_eq!(remaining.len(), 3); + } +} diff --git a/crates/memory-service/src/hybrid.rs b/crates/memory-service/src/hybrid.rs index 1c5857d..c7c5f1c 100644 --- a/crates/memory-service/src/hybrid.rs +++ b/crates/memory-service/src/hybrid.rs @@ -10,6 +10,8 @@ use std::sync::Arc; use tonic::{Request, Response, Status}; use tracing::{debug, info}; +use memory_search::{SearchOptions, TeleportSearcher}; + use crate::pb::{ HybridMode, HybridSearchRequest, HybridSearchResponse, VectorMatch, VectorTeleportRequest, }; @@ -21,19 +23,24 @@ const RRF_K: f32 = 60.0; /// Handler for hybrid search operations. pub struct HybridSearchHandler { vector_handler: Arc, - // BM25 integration will be added when Phase 11 is complete + searcher: Option>, } impl HybridSearchHandler { /// Create a new hybrid search handler. - pub fn new(vector_handler: Arc) -> Self { - Self { vector_handler } + pub fn new( + vector_handler: Arc, + searcher: Option>, + ) -> Self { + Self { + vector_handler, + searcher, + } } /// Check if BM25 search is available. pub fn bm25_available(&self) -> bool { - // TODO: Will be true when Phase 11 is integrated - false + self.searcher.is_some() } /// Check if vector search is available. @@ -126,9 +133,26 @@ impl HybridSearchHandler { } /// Perform BM25-only search. - async fn bm25_search(&self, _query: &str, _top_k: usize) -> Result, Status> { - // TODO: Integrate with Phase 11 BM25 when complete - Ok(vec![]) + async fn bm25_search(&self, query: &str, top_k: usize) -> Result, Status> { + let Some(searcher) = &self.searcher else { + return Ok(vec![]); + }; + + let results = searcher + .search(query, SearchOptions::new().with_limit(top_k)) + .map_err(|e| Status::internal(format!("BM25 search error: {e}")))?; + + Ok(results + .into_iter() + .map(|r| VectorMatch { + doc_id: r.doc_id, + doc_type: r.doc_type.as_str().to_string(), + score: r.score, + text_preview: r.keywords.unwrap_or_default(), + timestamp_ms: r.timestamp_ms.unwrap_or(0), + agent: r.agent, + }) + .collect()) } /// Fuse results using Reciprocal Rank Fusion. diff --git a/crates/memory-service/src/ingest.rs b/crates/memory-service/src/ingest.rs index c4a257d..aa5e387 100644 --- a/crates/memory-service/src/ingest.rs +++ b/crates/memory-service/src/ingest.rs @@ -20,26 +20,29 @@ use memory_types::{ }; use crate::agents::AgentDiscoveryHandler; +use crate::episodes::EpisodeHandler; use crate::hybrid::HybridSearchHandler; use crate::novelty::NoveltyChecker; use crate::pb::{ memory_service_server::MemoryService, BrowseTocRequest, BrowseTocResponse, - ClassifyQueryIntentRequest, ClassifyQueryIntentResponse, Event as ProtoEvent, - EventRole as ProtoEventRole, EventType as ProtoEventType, ExpandGripRequest, - ExpandGripResponse, GetAgentActivityRequest, GetAgentActivityResponse, GetDedupStatusRequest, - GetDedupStatusResponse, GetEventsRequest, GetEventsResponse, GetNodeRequest, GetNodeResponse, - GetRankingStatusRequest, GetRankingStatusResponse, GetRelatedTopicsRequest, - GetRelatedTopicsResponse, GetRetrievalCapabilitiesRequest, GetRetrievalCapabilitiesResponse, - GetSchedulerStatusRequest, GetSchedulerStatusResponse, GetTocRootRequest, GetTocRootResponse, - GetTopTopicsRequest, GetTopTopicsResponse, GetTopicGraphStatusRequest, - GetTopicGraphStatusResponse, GetTopicsByQueryRequest, GetTopicsByQueryResponse, - GetVectorIndexStatusRequest, HybridSearchRequest, HybridSearchResponse, IngestEventRequest, - IngestEventResponse, ListAgentsRequest, ListAgentsResponse, PauseJobRequest, PauseJobResponse, - PruneBm25IndexRequest, PruneBm25IndexResponse, PruneVectorIndexRequest, - PruneVectorIndexResponse, ResumeJobRequest, ResumeJobResponse, RouteQueryRequest, + ClassifyQueryIntentRequest, ClassifyQueryIntentResponse, CompleteEpisodeRequest, + CompleteEpisodeResponse, Event as ProtoEvent, EventRole as ProtoEventRole, + EventType as ProtoEventType, ExpandGripRequest, ExpandGripResponse, GetAgentActivityRequest, + GetAgentActivityResponse, GetDedupStatusRequest, GetDedupStatusResponse, GetEventsRequest, + GetEventsResponse, GetNodeRequest, GetNodeResponse, GetRankingStatusRequest, + GetRankingStatusResponse, GetRelatedTopicsRequest, GetRelatedTopicsResponse, + GetRetrievalCapabilitiesRequest, GetRetrievalCapabilitiesResponse, GetSchedulerStatusRequest, + GetSchedulerStatusResponse, GetSimilarEpisodesRequest, GetSimilarEpisodesResponse, + GetTocRootRequest, GetTocRootResponse, GetTopTopicsRequest, GetTopTopicsResponse, + GetTopicGraphStatusRequest, GetTopicGraphStatusResponse, GetTopicsByQueryRequest, + GetTopicsByQueryResponse, GetVectorIndexStatusRequest, HybridSearchRequest, + HybridSearchResponse, IngestEventRequest, IngestEventResponse, ListAgentsRequest, + ListAgentsResponse, PauseJobRequest, PauseJobResponse, PruneBm25IndexRequest, + PruneBm25IndexResponse, PruneVectorIndexRequest, PruneVectorIndexResponse, RecordActionRequest, + RecordActionResponse, ResumeJobRequest, ResumeJobResponse, RouteQueryRequest, RouteQueryResponse, SearchChildrenRequest, SearchChildrenResponse, SearchNodeRequest, - SearchNodeResponse, TeleportSearchRequest, TeleportSearchResponse, VectorIndexStatus, - VectorTeleportRequest, VectorTeleportResponse, + SearchNodeResponse, StartEpisodeRequest, StartEpisodeResponse, TeleportSearchRequest, + TeleportSearchResponse, VectorIndexStatus, VectorTeleportRequest, VectorTeleportResponse, }; use crate::query; use crate::retrieval::RetrievalHandler; @@ -60,6 +63,7 @@ pub struct MemoryServiceImpl { retrieval_service: Option>, agent_service: Arc, novelty_checker: Option>, + episode_handler: Option>, } impl MemoryServiceImpl { @@ -77,6 +81,7 @@ impl MemoryServiceImpl { retrieval_service: Some(retrieval), agent_service: agent_svc, novelty_checker: None, + episode_handler: None, } } @@ -107,6 +112,7 @@ impl MemoryServiceImpl { retrieval_service: Some(retrieval), agent_service: agent_svc, novelty_checker: None, + episode_handler: None, } } @@ -137,6 +143,7 @@ impl MemoryServiceImpl { retrieval_service: Some(retrieval), agent_service: agent_svc, novelty_checker: None, + episode_handler: None, } } @@ -164,6 +171,7 @@ impl MemoryServiceImpl { retrieval_service: Some(retrieval), agent_service: agent_svc, novelty_checker: None, + episode_handler: None, } } @@ -175,7 +183,7 @@ impl MemoryServiceImpl { vector_handler: Arc, staleness_config: StalenessConfig, ) -> Self { - let hybrid_handler = Arc::new(HybridSearchHandler::new(vector_handler.clone())); + let hybrid_handler = Arc::new(HybridSearchHandler::new(vector_handler.clone(), None)); let retrieval = Arc::new(RetrievalHandler::with_services( storage.clone(), None, @@ -194,6 +202,7 @@ impl MemoryServiceImpl { retrieval_service: Some(retrieval), agent_service: agent_svc, novelty_checker: None, + episode_handler: None, } } @@ -223,6 +232,7 @@ impl MemoryServiceImpl { retrieval_service: Some(retrieval), agent_service: agent_svc, novelty_checker: None, + episode_handler: None, } } @@ -234,7 +244,10 @@ impl MemoryServiceImpl { vector_handler: Arc, staleness_config: StalenessConfig, ) -> Self { - let hybrid_handler = Arc::new(HybridSearchHandler::new(vector_handler.clone())); + let hybrid_handler = Arc::new(HybridSearchHandler::new( + vector_handler.clone(), + Some(searcher.clone()), + )); let retrieval = Arc::new(RetrievalHandler::with_services( storage.clone(), Some(searcher.clone()), @@ -253,6 +266,7 @@ impl MemoryServiceImpl { retrieval_service: Some(retrieval), agent_service: agent_svc, novelty_checker: None, + episode_handler: None, } } @@ -265,7 +279,10 @@ impl MemoryServiceImpl { topic_handler: Arc, staleness_config: StalenessConfig, ) -> Self { - let hybrid_handler = Arc::new(HybridSearchHandler::new(vector_handler.clone())); + let hybrid_handler = Arc::new(HybridSearchHandler::new( + vector_handler.clone(), + Some(searcher.clone()), + )); let retrieval = Arc::new(RetrievalHandler::with_services( storage.clone(), Some(searcher.clone()), @@ -284,6 +301,7 @@ impl MemoryServiceImpl { retrieval_service: Some(retrieval), agent_service: agent_svc, novelty_checker: None, + episode_handler: None, } } @@ -295,6 +313,14 @@ impl MemoryServiceImpl { self.novelty_checker = Some(checker); } + /// Set the episode handler for episodic memory RPCs. + /// + /// Called during daemon startup after construction. + /// When set, episodic memory RPCs will be functional. + pub fn set_episode_handler(&mut self, handler: Arc) { + self.episode_handler = Some(handler); + } + /// Convert proto EventRole to domain EventRole fn convert_role(proto_role: ProtoEventRole) -> EventRole { match proto_role { @@ -356,6 +382,53 @@ impl MemoryServiceImpl { Ok(event) } + + /// Compute ranking metrics from recent day-level TOC nodes. + /// + /// Returns (avg_salience, high_salience_count, total_access_count, avg_usage_decay). + /// Scans day-level nodes from the last 30 days for a bounded, representative sample. + fn compute_ranking_metrics(&self) -> (f32, u32, u64, f32) { + use memory_types::{usage::usage_penalty, TocLevel, UsageConfig}; + + let now = chrono::Utc::now(); + let thirty_days_ago = now - chrono::Duration::days(30); + + let nodes = match self.storage.get_toc_nodes_by_level( + TocLevel::Day, + Some(thirty_days_ago), + Some(now), + ) { + Ok(nodes) => nodes, + Err(_) => return (0.0, 0, 0, 1.0), + }; + + if nodes.is_empty() { + return (0.0, 0, 0, 1.0); + } + + let usage_config = UsageConfig::default(); + let count = nodes.len() as f32; + let mut total_salience = 0.0f32; + let mut high_salience = 0u32; + let mut total_access = 0u64; + let mut total_decay = 0.0f32; + + for node in &nodes { + total_salience += node.salience_score; + if node.salience_score > 0.5 { + high_salience += 1; + } + total_access += node.access_count as u64; + total_decay += usage_penalty(node.access_count, usage_config.decay_factor); + } + + ( + total_salience / count, + high_salience, + total_access, + total_decay / count, + ) + } } #[tonic::async_trait] @@ -985,14 +1058,30 @@ impl MemoryService for MemoryServiceImpl { let salience_config = SalienceConfig::default(); let novelty_config = NoveltyConfig::default(); + // Compute ranking metrics from recent day-level TOC nodes (bounded scan) + let (avg_salience, high_salience_count, total_access, avg_decay) = + self.compute_ranking_metrics(); + + // Get novelty metrics if checker is available + let (novelty_checked, novelty_rejected, novelty_skipped) = + if let Some(ref checker) = self.novelty_checker { + let snapshot = checker.metrics().snapshot(); + ( + snapshot.total_checked() as i64, + snapshot.total_rejected() as i64, + (snapshot.total_stored() - snapshot.stored_novel) as i64, + ) + } else { + (0, 0, 0) + }; + Ok(Response::new(GetRankingStatusResponse { salience_enabled: salience_config.enabled, usage_decay_enabled: true, // Always active per Phase 16 design novelty_enabled: novelty_config.enabled, - // In-memory only counters; return 0 for a fresh/stateless query - novelty_checked_total: 0, - novelty_rejected_total: 0, - novelty_skipped_total: 0, + novelty_checked_total: novelty_checked, + novelty_rejected_total: novelty_rejected, + novelty_skipped_total: novelty_skipped, // Vector lifecycle: enabled if vector service is configured vector_lifecycle_enabled: self.vector_service.is_some(), vector_last_prune_timestamp: 0, // No persistent prune history yet @@ -1001,6 +1090,11 @@ impl MemoryService for MemoryServiceImpl { bm25_lifecycle_enabled: false, bm25_last_prune_timestamp: 0, bm25_last_prune_count: 0, + // Phase 42: Ranking metrics + avg_salience_score: avg_salience, + high_salience_count, + total_access_count: total_access, + avg_usage_decay: avg_decay, })) } @@ -1034,13 +1128,14 @@ impl MemoryService for MemoryServiceImpl { let response = if let Some(ref checker) = self.novelty_checker { let config = checker.config(); let snapshot = checker.metrics().snapshot(); + let buffer_size = checker.buffer_len() as u32; GetDedupStatusResponse { enabled: config.enabled, threshold: config.threshold, events_checked: snapshot.total_checked(), events_deduplicated: snapshot.total_rejected(), events_skipped: snapshot.total_stored() - snapshot.stored_novel, - buffer_size: 0, + buffer_size, buffer_capacity: config.buffer_capacity as u32, } } else { @@ -1056,6 +1151,66 @@ impl MemoryService for MemoryServiceImpl { }; Ok(Response::new(response)) } + + /// Start a new episode for tracking a task execution. + /// + /// Per Phase 44: Episodic memory lifecycle. + async fn start_episode( + &self, + request: Request, + ) -> Result, Status> { + match &self.episode_handler { + Some(handler) => handler.start_episode(request).await, + None => Err(Status::failed_precondition( + "Episodic memory is not enabled", + )), + } + } + + /// Record an action taken during an in-progress episode. + /// + /// Per Phase 44: Episodic memory action tracking. + async fn record_action( + &self, + request: Request, + ) -> Result, Status> { + match &self.episode_handler { + Some(handler) => handler.record_action(request).await, + None => Err(Status::failed_precondition( + "Episodic memory is not enabled", + )), + } + } + + /// Complete an episode with outcome score and lessons. + /// + /// Per Phase 44: Episodic memory completion and value scoring. + async fn complete_episode( + &self, + request: Request, + ) -> Result, Status> { + match &self.episode_handler { + Some(handler) => handler.complete_episode(request).await, + None => Err(Status::failed_precondition( + "Episodic memory is not enabled", + )), + } + } + + /// Find episodes similar to a query. + /// + /// Per Phase 44: Episodic memory similarity search. + async fn get_similar_episodes( + &self, + request: Request, + ) -> Result, Status> { + match &self.episode_handler { + Some(handler) => handler.get_similar_episodes(request).await, + None => Err(Status::failed_precondition( + "Episodic memory is not enabled", + )), + } + } } #[cfg(test)] diff --git a/crates/memory-service/src/lib.rs b/crates/memory-service/src/lib.rs index b904f59..063bd1c 100644 --- a/crates/memory-service/src/lib.rs +++ b/crates/memory-service/src/lib.rs @@ -11,6 +11,7 @@ //! - Reflection endpoint for debugging (GRPC-04) pub mod agents; +pub mod episodes; pub mod hybrid; pub mod ingest; pub mod novelty; @@ -30,6 +31,7 @@ pub mod pb { } pub use agents::AgentDiscoveryHandler; +pub use episodes::EpisodeHandler; pub use hybrid::HybridSearchHandler; pub use ingest::MemoryServiceImpl; pub use novelty::{ diff --git a/crates/memory-service/src/novelty.rs b/crates/memory-service/src/novelty.rs index ed2405a..2d7a738 100644 --- a/crates/memory-service/src/novelty.rs +++ b/crates/memory-service/src/novelty.rs @@ -539,6 +539,17 @@ impl NoveltyChecker { pub fn config(&self) -> &DedupConfig { &self.config } + + /// Get the current number of entries in the in-flight buffer. + /// + /// Returns 0 if no buffer is configured or if the lock cannot be acquired. + pub fn buffer_len(&self) -> usize { + self.in_flight_buffer + .as_ref() + .and_then(|buf| buf.read().ok()) + .map(|buf| buf.len()) + .unwrap_or(0) + } } #[cfg(test)] diff --git a/crates/memory-service/src/query.rs b/crates/memory-service/src/query.rs index 6a262ac..19ab033 100644 --- a/crates/memory-service/src/query.rs +++ b/crates/memory-service/src/query.rs @@ -325,6 +325,9 @@ fn domain_to_proto_node(node: DomainTocNode) -> ProtoTocNode { salience_score: 0.5, memory_kind: ProtoMemoryKind::Observation as i32, is_pinned: false, + // Phase 40: Usage tracking + access_count: node.access_count, + last_accessed_ms: node.last_accessed_ms.unwrap_or(0), } } diff --git a/crates/memory-service/src/retrieval.rs b/crates/memory-service/src/retrieval.rs index 2149a64..66158d3 100644 --- a/crates/memory-service/src/retrieval.rs +++ b/crates/memory-service/src/retrieval.rs @@ -18,6 +18,7 @@ use tracing::{debug, info}; use memory_retrieval::{ classifier::IntentClassifier, executor::{FallbackChain, LayerExecutor, RetrievalExecutor, SearchResult}, + ranking::{apply_combined_ranking, RankingConfig}, stale_filter::StaleFilter, types::{ CapabilityTier as CrateTier, CombinedStatus, ExecutionMode as CrateExecMode, @@ -272,24 +273,31 @@ impl RetrievalHandler { .execute(&req.query, chain, &stop_conditions, mode, tier) .await; + // Enrich metadata with salience scores from Storage lookups + let enriched_results = enrich_with_salience(&self.storage, result.results); + // Apply staleness filter post-merge, pre-return let stale_filter = StaleFilter::new(self.staleness_config.clone()); let filtered_results = if self.staleness_config.enabled { // Look up embeddings for supersession detection (fail-open) let embeddings = self.vector_handler.as_ref().map(|vh| { let doc_ids: Vec = - result.results.iter().map(|r| r.doc_id.clone()).collect(); + enriched_results.iter().map(|r| r.doc_id.clone()).collect(); vh.get_embeddings_for_doc_ids(&doc_ids) }); - stale_filter.apply_with_supersession(result.results, embeddings.as_ref()) + stale_filter.apply_with_supersession(enriched_results, embeddings.as_ref()) } else { - result.results + enriched_results }; + // Apply combined ranking (salience + usage decay) after stale filter + let ranking_config = RankingConfig::default(); + let ranked_results = apply_combined_ranking(filtered_results, &ranking_config); + let total_time_ms = start.elapsed().as_millis() as u64; // Convert results to proto - let results: Vec = filtered_results + let results: Vec = ranked_results .iter() .take(limit) .map(|r| ProtoResult { @@ -607,6 +615,41 @@ impl LayerExecutor for SimpleLayerExecutor { } } +/// Enrich search results with salience and usage data from Storage lookups. +/// +/// For each result, looks up the TocNode or Grip by doc_id and injects +/// `salience_score`, `memory_kind`, and `access_count` into the metadata. +/// These fields are used by `apply_combined_ranking` downstream. +/// +/// Lookups that fail are silently ignored (fail-open). +fn enrich_with_salience(storage: &Storage, mut results: Vec) -> Vec { + for result in &mut results { + // Try to look up as TocNode first (most common), then as Grip + if let Ok(Some(node)) = storage.get_toc_node(&result.doc_id) { + result.metadata.insert( + "salience_score".to_string(), + node.salience_score.to_string(), + ); + result + .metadata + .insert("memory_kind".to_string(), node.memory_kind.to_string()); + result + .metadata + .insert("access_count".to_string(), node.access_count.to_string()); + } else if let Ok(Some(grip)) = storage.get_grip(&result.doc_id) { + result.metadata.insert( + "salience_score".to_string(), + grip.salience_score.to_string(), + ); + result + .metadata + .insert("memory_kind".to_string(), grip.memory_kind.to_string()); + // Grips don't have access_count — default to 0 + } + } + results +} + /// Build metadata map for SearchResult enrichment. /// /// Populates timestamp_ms, agent, and memory_kind fields so that diff --git a/crates/memory-storage/src/column_families.rs b/crates/memory-storage/src/column_families.rs index 63420e5..3e53152 100644 --- a/crates/memory-storage/src/column_families.rs +++ b/crates/memory-storage/src/column_families.rs @@ -41,6 +41,10 @@ pub const CF_TOPIC_RELS: &str = "topic_rels"; /// Per Phase 16 Plan 02: Track access patterns WITHOUT mutating immutable nodes. pub const CF_USAGE_COUNTERS: &str = "usage_counters"; +/// Column family for episodic memory records (Phase 43). +/// Stores complete task execution episodes with actions, outcomes, and lessons. +pub const CF_EPISODES: &str = "episodes"; + /// All column family names pub const ALL_CF_NAMES: &[&str] = &[ CF_EVENTS, @@ -53,6 +57,7 @@ pub const ALL_CF_NAMES: &[&str] = &[ CF_TOPIC_LINKS, CF_TOPIC_RELS, CF_USAGE_COUNTERS, + CF_EPISODES, ]; /// Create column family options for events (append-only, compressed) @@ -86,5 +91,6 @@ pub fn build_cf_descriptors() -> Vec { ColumnFamilyDescriptor::new(CF_TOPIC_LINKS, Options::default()), ColumnFamilyDescriptor::new(CF_TOPIC_RELS, Options::default()), ColumnFamilyDescriptor::new(CF_USAGE_COUNTERS, Options::default()), + ColumnFamilyDescriptor::new(CF_EPISODES, Options::default()), ] } diff --git a/crates/memory-storage/src/db.rs b/crates/memory-storage/src/db.rs index 3441191..4152b25 100644 --- a/crates/memory-storage/src/db.rs +++ b/crates/memory-storage/src/db.rs @@ -24,7 +24,7 @@ pub use memory_types::TocLevel; /// Main storage interface for agent-memory pub struct Storage { - db: DB, + pub(crate) db: DB, /// Outbox sequence counter for monotonic ordering outbox_sequence: AtomicU64, } diff --git a/crates/memory-storage/src/episodes.rs b/crates/memory-storage/src/episodes.rs new file mode 100644 index 0000000..4d953a0 --- /dev/null +++ b/crates/memory-storage/src/episodes.rs @@ -0,0 +1,247 @@ +//! Episode storage operations for episodic memory. +//! +//! Provides CRUD operations for episodes in the CF_EPISODES column family. +//! Episodes are stored as JSON-serialized values keyed by episode_id. + +use crate::column_families::CF_EPISODES; +use crate::error::StorageError; +use crate::Storage; +use memory_types::Episode; +use tracing::debug; + +impl Storage { + /// Store an episode in the episodes column family. + /// + /// The episode is serialized to JSON and stored with its episode_id as key. + pub fn store_episode(&self, episode: &Episode) -> Result<(), StorageError> { + let bytes = + serde_json::to_vec(episode).map_err(|e| StorageError::Serialization(e.to_string()))?; + + self.put(CF_EPISODES, episode.episode_id.as_bytes(), &bytes)?; + debug!(episode_id = %episode.episode_id, "Stored episode"); + Ok(()) + } + + /// Get an episode by its ID. + pub fn get_episode(&self, episode_id: &str) -> Result, StorageError> { + match self.get(CF_EPISODES, episode_id.as_bytes())? { + Some(bytes) => { + let episode: Episode = serde_json::from_slice(&bytes) + .map_err(|e| StorageError::Serialization(e.to_string()))?; + Ok(Some(episode)) + } + None => Ok(None), + } + } + + /// List episodes, newest first (by ULID lexicographic order, reversed). + /// + /// Returns up to `limit` episodes. Uses reverse iteration over the + /// CF_EPISODES column family, so ULID-keyed episodes come out newest first. + pub fn list_episodes(&self, limit: usize) -> Result, StorageError> { + let cf = self + .db + .cf_handle(CF_EPISODES) + .ok_or_else(|| StorageError::ColumnFamilyNotFound(CF_EPISODES.to_string()))?; + + let mut episodes = Vec::new(); + let iter = self.db.iterator_cf(&cf, rocksdb::IteratorMode::End); + + for item in iter.take(limit) { + let (_, value) = item?; + let episode: Episode = serde_json::from_slice(&value) + .map_err(|e| StorageError::Serialization(e.to_string()))?; + episodes.push(episode); + } + + Ok(episodes) + } + + /// Update an episode (overwrite by ID). + /// + /// This is equivalent to store_episode but semantically indicates an update. + pub fn update_episode(&self, episode: &Episode) -> Result<(), StorageError> { + self.store_episode(episode) + } + + /// Delete an episode by its ID. + pub fn delete_episode(&self, episode_id: &str) -> Result<(), StorageError> { + self.delete(CF_EPISODES, episode_id.as_bytes())?; + debug!(episode_id = %episode_id, "Deleted episode"); + Ok(()) + } +} + +#[cfg(test)] +mod tests { + use memory_types::{Action, ActionResult, Episode, EpisodeStatus}; + use tempfile::TempDir; + + use crate::Storage; + + fn create_test_storage() -> (Storage, TempDir) { + let temp_dir = TempDir::new().unwrap(); + let storage = Storage::open(temp_dir.path()).unwrap(); + (storage, temp_dir) + } + + #[test] + fn test_episode_store_and_get() { + let (storage, _tmp) = create_test_storage(); + + let episode = Episode::new( + ulid::Ulid::new().to_string(), + "Build auth system".to_string(), + ) + .with_plan(vec![ + "Design schema".to_string(), + "Implement JWT".to_string(), + ]) + .with_agent("claude"); + + storage.store_episode(&episode).unwrap(); + + let retrieved = storage.get_episode(&episode.episode_id).unwrap(); + assert!(retrieved.is_some()); + let retrieved = retrieved.unwrap(); + assert_eq!(retrieved.episode_id, episode.episode_id); + assert_eq!(retrieved.task, "Build auth system"); + assert_eq!(retrieved.plan.len(), 2); + assert_eq!(retrieved.agent, Some("claude".to_string())); + } + + #[test] + fn test_episode_get_not_found() { + let (storage, _tmp) = create_test_storage(); + + let result = storage.get_episode("nonexistent").unwrap(); + assert!(result.is_none()); + } + + #[test] + fn test_episode_update() { + let (storage, _tmp) = create_test_storage(); + + let mut episode = Episode::new( + ulid::Ulid::new().to_string(), + "Build auth system".to_string(), + ); + + storage.store_episode(&episode).unwrap(); + + // Update with action and completion + episode.add_action(Action { + action_type: "tool_call".to_string(), + input: "read auth.rs".to_string(), + result: ActionResult::Success("file contents".to_string()), + timestamp: chrono::Utc::now(), + }); + episode.complete(0.8, 0.65); + + storage.update_episode(&episode).unwrap(); + + let retrieved = storage.get_episode(&episode.episode_id).unwrap().unwrap(); + assert_eq!(retrieved.status, EpisodeStatus::Completed); + assert_eq!(retrieved.actions.len(), 1); + assert!(retrieved.outcome_score.is_some()); + assert!(retrieved.value_score.is_some()); + } + + #[test] + fn test_episode_delete() { + let (storage, _tmp) = create_test_storage(); + + let episode = Episode::new(ulid::Ulid::new().to_string(), "test task".to_string()); + + storage.store_episode(&episode).unwrap(); + assert!(storage.get_episode(&episode.episode_id).unwrap().is_some()); + + storage.delete_episode(&episode.episode_id).unwrap(); + assert!(storage.get_episode(&episode.episode_id).unwrap().is_none()); + } + + #[test] + fn test_episode_list_newest_first() { + let (storage, _tmp) = create_test_storage(); + + // Create episodes with sequential ULIDs (newer = lexicographically later) + let ids: Vec = (0..5) + .map(|_| { + let id = ulid::Ulid::new().to_string(); + std::thread::sleep(std::time::Duration::from_millis(2)); + id + }) + .collect(); + + for (i, id) in ids.iter().enumerate() { + let episode = Episode::new(id.clone(), format!("task {i}")); + storage.store_episode(&episode).unwrap(); + } + + let listed = storage.list_episodes(3).unwrap(); + assert_eq!(listed.len(), 3); + + // Should be newest first (reverse ULID order) + assert_eq!(listed[0].episode_id, ids[4]); + assert_eq!(listed[1].episode_id, ids[3]); + assert_eq!(listed[2].episode_id, ids[2]); + } + + #[test] + fn test_episode_list_empty() { + let (storage, _tmp) = create_test_storage(); + + let listed = storage.list_episodes(10).unwrap(); + assert!(listed.is_empty()); + } + + #[test] + fn test_episode_roundtrip_with_actions() { + let (storage, _tmp) = create_test_storage(); + + let mut episode = Episode::new(ulid::Ulid::new().to_string(), "Complex task".to_string()) + .with_agent("claude"); + + episode.add_action(Action { + action_type: "tool_call".to_string(), + input: "read file".to_string(), + result: ActionResult::Success("contents".to_string()), + timestamp: chrono::Utc::now(), + }); + episode.add_action(Action { + action_type: "api_call".to_string(), + input: "create resource".to_string(), + result: ActionResult::Failure("timeout".to_string()), + timestamp: chrono::Utc::now(), + }); + episode.add_action(Action { + action_type: "retry".to_string(), + input: "create resource".to_string(), + result: ActionResult::Pending, + timestamp: chrono::Utc::now(), + }); + + episode + .lessons_learned + .push("Always set timeouts".to_string()); + episode + .failure_modes + .push("API timeout under load".to_string()); + + storage.store_episode(&episode).unwrap(); + + let retrieved = storage.get_episode(&episode.episode_id).unwrap().unwrap(); + assert_eq!(retrieved.actions.len(), 3); + assert_eq!(retrieved.lessons_learned.len(), 1); + assert_eq!(retrieved.failure_modes.len(), 1); + assert_eq!( + retrieved.actions[0].result, + ActionResult::Success("contents".to_string()) + ); + assert_eq!( + retrieved.actions[1].result, + ActionResult::Failure("timeout".to_string()) + ); + assert_eq!(retrieved.actions[2].result, ActionResult::Pending); + } +} diff --git a/crates/memory-storage/src/lib.rs b/crates/memory-storage/src/lib.rs index bb2d61d..e5775bb 100644 --- a/crates/memory-storage/src/lib.rs +++ b/crates/memory-storage/src/lib.rs @@ -10,13 +10,14 @@ pub mod column_families; pub mod db; +pub mod episodes; pub mod error; pub mod keys; pub mod usage; pub use column_families::{ - CF_CHECKPOINTS, CF_EVENTS, CF_GRIPS, CF_OUTBOX, CF_TOC_LATEST, CF_TOC_NODES, CF_TOPICS, - CF_TOPIC_LINKS, CF_TOPIC_RELS, CF_USAGE_COUNTERS, + CF_CHECKPOINTS, CF_EPISODES, CF_EVENTS, CF_GRIPS, CF_OUTBOX, CF_TOC_LATEST, CF_TOC_NODES, + CF_TOPICS, CF_TOPIC_LINKS, CF_TOPIC_RELS, CF_USAGE_COUNTERS, }; pub use db::{Storage, StorageStats}; pub use error::StorageError; diff --git a/crates/memory-types/src/config.rs b/crates/memory-types/src/config.rs index 7266ef8..3da7249 100644 --- a/crates/memory-types/src/config.rs +++ b/crates/memory-types/src/config.rs @@ -224,6 +224,77 @@ impl Default for SummarizerSettings { } } +/// Configuration for episodic memory (Phase 43). +/// +/// Controls whether episodic memory is enabled and how episodes are +/// scored and retained. Disabled by default -- must be explicitly enabled. +#[derive(Debug, Clone, Serialize, Deserialize)] +pub struct EpisodicConfig { + /// Whether episodic memory is enabled (default: false). + #[serde(default)] + pub enabled: bool, + + /// Minimum value score for an episode to be retained in long-term storage. + /// Episodes below this threshold may be pruned. + #[serde(default = "default_episodic_value_threshold")] + pub value_threshold: f32, + + /// Target midpoint for value scoring (default: 0.65). + /// Episodes with outcome scores near this value are considered most valuable. + #[serde(default = "default_episodic_midpoint_target")] + pub midpoint_target: f32, + + /// Maximum number of episodes to retain (default: 1000). + /// Oldest low-value episodes are pruned first when this limit is reached. + #[serde(default = "default_episodic_max_episodes")] + pub max_episodes: usize, +} + +fn default_episodic_value_threshold() -> f32 { + 0.18 +} + +fn default_episodic_midpoint_target() -> f32 { + 0.65 +} + +fn default_episodic_max_episodes() -> usize { + 1000 +} + +impl Default for EpisodicConfig { + fn default() -> Self { + Self { + enabled: false, + value_threshold: default_episodic_value_threshold(), + midpoint_target: default_episodic_midpoint_target(), + max_episodes: default_episodic_max_episodes(), + } + } +} + +impl EpisodicConfig { + /// Validate configuration values. + pub fn validate(&self) -> Result<(), String> { + if !(0.0..=1.0).contains(&self.value_threshold) { + return Err(format!( + "value_threshold must be 0.0-1.0, got {}", + self.value_threshold + )); + } + if !(0.0..=1.0).contains(&self.midpoint_target) { + return Err(format!( + "midpoint_target must be 0.0-1.0, got {}", + self.midpoint_target + )); + } + if self.max_episodes == 0 { + return Err("max_episodes must be > 0".to_string()); + } + Ok(()) + } +} + /// Multi-agent storage mode (STOR-06) #[derive(Debug, Clone, Serialize, Deserialize, Default, PartialEq, Eq)] #[serde(rename_all = "snake_case")] @@ -282,6 +353,161 @@ pub struct Settings { /// Staleness-based score decay configuration. #[serde(default)] pub staleness: StalenessConfig, + + /// Salience scoring configuration. + #[serde(default)] + pub salience: crate::SalienceConfig, + + /// Usage decay configuration. + #[serde(default)] + pub usage: crate::UsageConfig, + + /// Lifecycle automation configuration. + #[serde(default)] + pub lifecycle: LifecycleConfig, + + /// Episodic memory configuration (Phase 43). + #[serde(default)] + pub episodic: EpisodicConfig, +} + +/// Lifecycle automation configuration for index pruning and rebuilding. +#[derive(Debug, Clone, Default, Serialize, Deserialize)] +pub struct LifecycleConfig { + /// Vector index lifecycle settings. + #[serde(default)] + pub vector: VectorLifecycleSettings, + + /// BM25 index lifecycle settings. + #[serde(default)] + pub bm25: Bm25LifecycleSettings, +} + +/// Vector index lifecycle settings. +/// +/// Maps to `[lifecycle.vector]` section in config.toml. +/// Enabled by default - vector indexes grow unbounded without pruning. +#[derive(Debug, Clone, Serialize, Deserialize)] +pub struct VectorLifecycleSettings { + /// Enable automatic vector pruning (default: true). + #[serde(default = "default_vector_enabled")] + pub enabled: bool, + + /// Retention days for segment-level vectors (default: 30). + #[serde(default = "default_segment_retention")] + pub segment_retention_days: u32, + + /// Retention days for grip-level vectors (default: 30). + #[serde(default = "default_grip_retention")] + pub grip_retention_days: u32, + + /// Retention days for day-level vectors (default: 365). + #[serde(default = "default_day_retention")] + pub day_retention_days: u32, + + /// Retention days for week-level vectors (default: 1825 = 5 years). + #[serde(default = "default_week_retention")] + pub week_retention_days: u32, + + /// Cron schedule for prune job (default: "0 3 * * *" = daily 3 AM). + #[serde(default = "default_vector_prune_schedule")] + pub prune_schedule: String, +} + +fn default_vector_enabled() -> bool { + true +} + +fn default_segment_retention() -> u32 { + 30 +} +fn default_grip_retention() -> u32 { + 30 +} +fn default_day_retention() -> u32 { + 365 +} +fn default_week_retention() -> u32 { + 1825 +} + +fn default_vector_prune_schedule() -> String { + "0 3 * * *".to_string() +} + +impl Default for VectorLifecycleSettings { + fn default() -> Self { + Self { + enabled: default_vector_enabled(), + segment_retention_days: default_segment_retention(), + grip_retention_days: default_grip_retention(), + day_retention_days: default_day_retention(), + week_retention_days: default_week_retention(), + prune_schedule: default_vector_prune_schedule(), + } + } +} + +/// BM25 index lifecycle settings. +/// +/// Maps to `[lifecycle.bm25]` section in config.toml. +/// DISABLED by default per PRD "append-only, no eviction" philosophy. +#[derive(Debug, Clone, Serialize, Deserialize)] +pub struct Bm25LifecycleSettings { + /// Whether BM25 lifecycle is enabled (default: false, opt-in). + #[serde(default)] + pub enabled: bool, + + /// Minimum TOC level to keep after rollup rebuild (default: "day"). + /// Segments and grips below this level are excluded from rebuilt index. + #[serde(default = "default_min_level")] + pub min_level_after_rollup: String, + + /// Cron schedule for rebuild job (default: "0 4 * * 0" = weekly Sunday 4 AM). + #[serde(default = "default_bm25_rebuild_schedule")] + pub rebuild_schedule: String, + + /// Retention days for segment-level docs (default: 30). + #[serde(default = "default_segment_retention")] + pub segment_retention_days: u32, + + /// Retention days for grip-level docs (default: 30). + #[serde(default = "default_grip_retention")] + pub grip_retention_days: u32, + + /// Retention days for day-level docs (default: 180). + #[serde(default = "default_bm25_day_retention")] + pub day_retention_days: u32, + + /// Retention days for week-level docs (default: 1825 = 5 years). + #[serde(default = "default_week_retention")] + pub week_retention_days: u32, +} + +fn default_min_level() -> String { + "day".to_string() +} + +fn default_bm25_rebuild_schedule() -> String { + "0 4 * * 0".to_string() +} + +fn default_bm25_day_retention() -> u32 { + 180 +} + +impl Default for Bm25LifecycleSettings { + fn default() -> Self { + Self { + enabled: false, + min_level_after_rollup: default_min_level(), + rebuild_schedule: default_bm25_rebuild_schedule(), + segment_retention_days: default_segment_retention(), + grip_retention_days: default_grip_retention(), + day_retention_days: default_bm25_day_retention(), + week_retention_days: default_week_retention(), + } + } } fn default_db_path() -> String { @@ -334,6 +560,10 @@ impl Default for Settings { vector_index_path: default_vector_index_path(), dedup: DedupConfig::default(), staleness: StalenessConfig::default(), + salience: crate::SalienceConfig::default(), + usage: crate::UsageConfig::default(), + lifecycle: LifecycleConfig::default(), + episodic: EpisodicConfig::default(), } } } @@ -596,4 +826,109 @@ mod tests { let config2: DedupConfig = serde_json::from_str(json_minimal).unwrap(); assert_eq!(config2.buffer_capacity, 256); } + + #[test] + fn test_lifecycle_config_defaults() { + let config = LifecycleConfig::default(); + + // Vector: enabled by default + assert!(config.vector.enabled); + assert_eq!(config.vector.segment_retention_days, 30); + assert_eq!(config.vector.grip_retention_days, 30); + assert_eq!(config.vector.day_retention_days, 365); + assert_eq!(config.vector.week_retention_days, 1825); + assert_eq!(config.vector.prune_schedule, "0 3 * * *"); + + // BM25: disabled by default (opt-in) + assert!(!config.bm25.enabled); + assert_eq!(config.bm25.min_level_after_rollup, "day"); + assert_eq!(config.bm25.rebuild_schedule, "0 4 * * 0"); + assert_eq!(config.bm25.segment_retention_days, 30); + assert_eq!(config.bm25.grip_retention_days, 30); + assert_eq!(config.bm25.day_retention_days, 180); + assert_eq!(config.bm25.week_retention_days, 1825); + } + + #[test] + fn test_lifecycle_config_serialization() { + let config = LifecycleConfig::default(); + let json = serde_json::to_string(&config).unwrap(); + let decoded: LifecycleConfig = serde_json::from_str(&json).unwrap(); + assert!(decoded.vector.enabled); + assert!(!decoded.bm25.enabled); + assert_eq!(decoded.bm25.min_level_after_rollup, "day"); + assert_eq!(decoded.vector.prune_schedule, "0 3 * * *"); + } + + #[test] + fn test_settings_lifecycle_default() { + let settings = Settings::default(); + assert!(settings.lifecycle.vector.enabled); + assert!(!settings.lifecycle.bm25.enabled); + } + + #[test] + fn test_episodic_config_defaults() { + let config = EpisodicConfig::default(); + assert!(!config.enabled); + assert!((config.value_threshold - 0.18).abs() < f32::EPSILON); + assert!((config.midpoint_target - 0.65).abs() < f32::EPSILON); + assert_eq!(config.max_episodes, 1000); + } + + #[test] + fn test_episodic_config_validation_pass() { + let config = EpisodicConfig::default(); + assert!(config.validate().is_ok()); + } + + #[test] + fn test_episodic_config_validation_fail() { + let config = EpisodicConfig { + value_threshold: 1.5, + ..Default::default() + }; + assert!(config.validate().is_err()); + + let config = EpisodicConfig { + midpoint_target: -0.1, + ..Default::default() + }; + assert!(config.validate().is_err()); + + let config = EpisodicConfig { + max_episodes: 0, + ..Default::default() + }; + assert!(config.validate().is_err()); + } + + #[test] + fn test_episodic_config_serialization() { + let config = EpisodicConfig::default(); + let json = serde_json::to_string(&config).unwrap(); + let decoded: EpisodicConfig = serde_json::from_str(&json).unwrap(); + assert!(!decoded.enabled); + assert!((decoded.value_threshold - 0.18).abs() < f32::EPSILON); + assert!((decoded.midpoint_target - 0.65).abs() < f32::EPSILON); + assert_eq!(decoded.max_episodes, 1000); + } + + #[test] + fn test_episodic_config_backward_compat() { + // Deserialize with missing episodic section (pre-phase-43 config) + let json = r#"{}"#; + let config: EpisodicConfig = serde_json::from_str(json).unwrap(); + assert!(!config.enabled); + assert_eq!(config.max_episodes, 1000); + } + + #[test] + fn test_settings_episodic_default() { + let settings = Settings::default(); + assert!(!settings.episodic.enabled); + assert!((settings.episodic.value_threshold - 0.18).abs() < f32::EPSILON); + assert!((settings.episodic.midpoint_target - 0.65).abs() < f32::EPSILON); + assert_eq!(settings.episodic.max_episodes, 1000); + } } diff --git a/crates/memory-types/src/episode.rs b/crates/memory-types/src/episode.rs new file mode 100644 index 0000000..8f841f5 --- /dev/null +++ b/crates/memory-types/src/episode.rs @@ -0,0 +1,309 @@ +//! Episodic memory types for recording agent task episodes. +//! +//! Episodes capture complete task execution sequences including: +//! - The task goal and plan +//! - Individual actions taken and their results +//! - Outcome scoring and lessons learned +//! - Value scoring for retrieval prioritization + +use chrono::{DateTime, Utc}; +use serde::{Deserialize, Serialize}; + +/// Status of an episode's execution. +#[derive(Debug, Clone, PartialEq, Eq, Serialize, Deserialize)] +#[serde(rename_all = "snake_case")] +pub enum EpisodeStatus { + /// Episode is currently being executed. + InProgress, + /// Episode completed successfully. + Completed, + /// Episode failed during execution. + Failed, +} + +/// Result of an individual action within an episode. +#[derive(Debug, Clone, PartialEq, Eq, Serialize, Deserialize)] +#[serde(rename_all = "snake_case", tag = "status", content = "detail")] +pub enum ActionResult { + /// Action completed successfully with output. + Success(String), + /// Action failed with error description. + Failure(String), + /// Action is still pending completion. + Pending, +} + +/// A single action taken during an episode. +#[derive(Debug, Clone, Serialize, Deserialize)] +pub struct Action { + /// Type of action performed (e.g., "tool_call", "api_request", "file_edit"). + pub action_type: String, + + /// Input or parameters for the action. + pub input: String, + + /// Result of the action. + pub result: ActionResult, + + /// When the action was performed. + #[serde(with = "chrono::serde::ts_milliseconds")] + pub timestamp: DateTime, +} + +/// A complete episode recording a task execution sequence. +/// +/// Episodes are the core unit of episodic memory. They capture what the agent +/// did, whether it worked, and what was learned. Value scoring determines +/// retrieval priority -- episodes near the midpoint (neither trivial nor +/// catastrophic) are most valuable for future learning. +#[derive(Debug, Clone, Serialize, Deserialize)] +pub struct Episode { + /// Unique identifier (ULID string). + pub episode_id: String, + + /// The task or goal being executed. + pub task: String, + + /// Planned steps for the task. + #[serde(default)] + pub plan: Vec, + + /// Actions taken during execution. + #[serde(default)] + pub actions: Vec, + + /// Current status of the episode. + pub status: EpisodeStatus, + + /// Outcome score (0.0 = total failure, 1.0 = perfect success). + #[serde(default)] + pub outcome_score: Option, + + /// Lessons learned from the episode. + #[serde(default)] + pub lessons_learned: Vec, + + /// Failure modes encountered. + #[serde(default)] + pub failure_modes: Vec, + + /// Embedding vector for semantic search. + #[serde(default)] + pub embedding: Option>, + + /// Value score for retrieval prioritization. + /// Computed from outcome_score using midpoint-distance formula. + #[serde(default)] + pub value_score: Option, + + /// When the episode was created. + #[serde(with = "chrono::serde::ts_milliseconds")] + pub created_at: DateTime, + + /// When the episode was completed (if finished). + #[serde(default)] + pub completed_at: Option>, + + /// Agent that executed the episode. + #[serde(default)] + pub agent: Option, +} + +impl Episode { + /// Create a new in-progress episode. + pub fn new(episode_id: String, task: String) -> Self { + Self { + episode_id, + task, + plan: Vec::new(), + actions: Vec::new(), + status: EpisodeStatus::InProgress, + outcome_score: None, + lessons_learned: Vec::new(), + failure_modes: Vec::new(), + embedding: None, + value_score: None, + created_at: Utc::now(), + completed_at: None, + agent: None, + } + } + + /// Set the plan steps. + pub fn with_plan(mut self, plan: Vec) -> Self { + self.plan = plan; + self + } + + /// Set the agent identifier. + pub fn with_agent(mut self, agent: impl Into) -> Self { + self.agent = Some(agent.into()); + self + } + + /// Add an action to the episode. + pub fn add_action(&mut self, action: Action) { + self.actions.push(action); + } + + /// Calculate value score from an outcome score. + /// + /// Formula: `(1.0 - (outcome_score - midpoint).abs()).max(0.0)` + /// + /// Episodes near the midpoint are most valuable for learning: + /// - Trivial successes (score near 1.0) teach little + /// - Catastrophic failures (score near 0.0) may be outliers + /// - Moderate outcomes (near midpoint) are most informative + pub fn calculate_value_score(outcome_score: f32, midpoint: f32) -> f32 { + (1.0 - (outcome_score - midpoint).abs()).max(0.0) + } + + /// Complete the episode with an outcome score, computing the value score. + pub fn complete(&mut self, outcome_score: f32, midpoint: f32) { + self.status = EpisodeStatus::Completed; + self.outcome_score = Some(outcome_score); + self.value_score = Some(Self::calculate_value_score(outcome_score, midpoint)); + self.completed_at = Some(Utc::now()); + } + + /// Mark the episode as failed with an outcome score. + pub fn fail(&mut self, outcome_score: f32, midpoint: f32) { + self.status = EpisodeStatus::Failed; + self.outcome_score = Some(outcome_score); + self.value_score = Some(Self::calculate_value_score(outcome_score, midpoint)); + self.completed_at = Some(Utc::now()); + } + + /// Serialize episode to JSON bytes for storage. + pub fn to_bytes(&self) -> Result, serde_json::Error> { + serde_json::to_vec(self) + } + + /// Deserialize episode from JSON bytes. + pub fn from_bytes(bytes: &[u8]) -> Result { + serde_json::from_slice(bytes) + } +} + +#[cfg(test)] +mod tests { + use super::*; + + #[test] + fn test_episode_serialization_roundtrip() { + let mut episode = Episode::new("01TEST".to_string(), "Build auth system".to_string()) + .with_plan(vec![ + "Design schema".to_string(), + "Implement JWT".to_string(), + ]) + .with_agent("claude"); + + episode.add_action(Action { + action_type: "tool_call".to_string(), + input: "read auth.rs".to_string(), + result: ActionResult::Success("file contents".to_string()), + timestamp: Utc::now(), + }); + + let bytes = episode.to_bytes().unwrap(); + let decoded = Episode::from_bytes(&bytes).unwrap(); + + assert_eq!(decoded.episode_id, "01TEST"); + assert_eq!(decoded.task, "Build auth system"); + assert_eq!(decoded.plan.len(), 2); + assert_eq!(decoded.actions.len(), 1); + assert_eq!(decoded.status, EpisodeStatus::InProgress); + assert_eq!(decoded.agent, Some("claude".to_string())); + } + + #[test] + fn test_episode_backward_compat_no_optional_fields() { + let json = r#"{ + "episode_id": "01TEST", + "task": "test task", + "status": "in_progress", + "created_at": 1704067200000 + }"#; + + let episode: Episode = serde_json::from_str(json).unwrap(); + assert_eq!(episode.episode_id, "01TEST"); + assert!(episode.plan.is_empty()); + assert!(episode.actions.is_empty()); + assert!(episode.outcome_score.is_none()); + assert!(episode.agent.is_none()); + } + + #[test] + fn test_episode_complete() { + let mut episode = Episode::new("01TEST".to_string(), "task".to_string()); + episode.complete(0.65, 0.65); + + assert_eq!(episode.status, EpisodeStatus::Completed); + assert!(episode.completed_at.is_some()); + // At midpoint, value score should be 1.0 + assert!((episode.value_score.unwrap() - 1.0).abs() < f32::EPSILON); + } + + #[test] + fn test_episode_fail() { + let mut episode = Episode::new("01TEST".to_string(), "task".to_string()); + episode.fail(0.0, 0.65); + + assert_eq!(episode.status, EpisodeStatus::Failed); + assert!(episode.completed_at.is_some()); + // Far from midpoint, value score should be 1.0 - 0.65 = 0.35 + assert!((episode.value_score.unwrap() - 0.35).abs() < f32::EPSILON); + } + + #[test] + fn test_action_result_serialization() { + let success = ActionResult::Success("done".to_string()); + let failure = ActionResult::Failure("error".to_string()); + let pending = ActionResult::Pending; + + let s_json = serde_json::to_string(&success).unwrap(); + let f_json = serde_json::to_string(&failure).unwrap(); + let p_json = serde_json::to_string(&pending).unwrap(); + + let s_decoded: ActionResult = serde_json::from_str(&s_json).unwrap(); + let f_decoded: ActionResult = serde_json::from_str(&f_json).unwrap(); + let p_decoded: ActionResult = serde_json::from_str(&p_json).unwrap(); + + assert_eq!(s_decoded, ActionResult::Success("done".to_string())); + assert_eq!(f_decoded, ActionResult::Failure("error".to_string())); + assert_eq!(p_decoded, ActionResult::Pending); + } + + #[test] + fn test_calculate_value_score_at_midpoint() { + // At midpoint: distance = 0, value = 1.0 + let score = Episode::calculate_value_score(0.65, 0.65); + assert!((score - 1.0).abs() < f32::EPSILON); + } + + #[test] + fn test_calculate_value_score_perfect_success() { + // Perfect success far from midpoint + let score = Episode::calculate_value_score(1.0, 0.65); + assert!((score - 0.65).abs() < f32::EPSILON); + } + + #[test] + fn test_calculate_value_score_total_failure() { + // Total failure far from midpoint + let score = Episode::calculate_value_score(0.0, 0.65); + assert!((score - 0.35).abs() < f32::EPSILON); + } + + #[test] + fn test_calculate_value_score_clamps_to_zero() { + // Edge case: outcome very far from midpoint with high midpoint + // outcome=0.0, midpoint=0.0 => distance=0 => value=1.0 + let score = Episode::calculate_value_score(0.0, 0.0); + assert!((score - 1.0).abs() < f32::EPSILON); + + // outcome=2.0 (out of range), midpoint=0.5 => distance=1.5 => value=max(1.0-1.5, 0) = 0 + let score = Episode::calculate_value_score(2.0, 0.5); + assert!((score - 0.0).abs() < f32::EPSILON); + } +} diff --git a/crates/memory-types/src/lib.rs b/crates/memory-types/src/lib.rs index 53a83d1..b2251ea 100644 --- a/crates/memory-types/src/lib.rs +++ b/crates/memory-types/src/lib.rs @@ -10,16 +10,19 @@ //! - Settings: Configuration types //! - Salience: Memory importance scoring (Phase 16) //! - Usage: Access pattern tracking (Phase 16) +//! - Episodes: Episodic memory for task execution sequences (Phase 43) //! //! ## Usage //! //! ```rust //! use memory_types::{Event, EventRole, EventType, Segment, Settings}; //! use memory_types::{MemoryKind, SalienceScorer, UsageStats}; +//! use memory_types::{Episode, Action, ActionResult, EpisodeStatus}; //! ``` pub mod config; pub mod dedup; +pub mod episode; pub mod error; pub mod event; pub mod grip; @@ -31,9 +34,11 @@ pub mod usage; // Re-export main types at crate root pub use config::{ - DedupConfig, MultiAgentMode, NoveltyConfig, Settings, StalenessConfig, SummarizerSettings, + Bm25LifecycleSettings, DedupConfig, EpisodicConfig, LifecycleConfig, MultiAgentMode, + NoveltyConfig, Settings, StalenessConfig, SummarizerSettings, VectorLifecycleSettings, }; pub use dedup::{BufferEntry, InFlightBuffer}; +pub use episode::{Action, ActionResult, Episode, EpisodeStatus}; pub use error::MemoryError; pub use event::{Event, EventRole, EventType}; pub use grip::Grip; diff --git a/crates/memory-types/src/toc.rs b/crates/memory-types/src/toc.rs index fe51065..e0f0557 100644 --- a/crates/memory-types/src/toc.rs +++ b/crates/memory-types/src/toc.rs @@ -166,6 +166,17 @@ pub struct TocNode { /// Default: empty Vec for pre-phase-18 nodes. #[serde(default)] pub contributing_agents: Vec, + + // === Phase 40: Usage Tracking === + /// Number of times this node was accessed in retrieval. + /// Default: 0 for backward compatibility. + #[serde(default)] + pub access_count: u32, + + /// Last access timestamp in milliseconds. + /// Default: None for backward compatibility. + #[serde(default)] + pub last_accessed_ms: Option, } impl TocNode { @@ -194,6 +205,9 @@ impl TocNode { is_pinned: false, // Phase 18: Multi-agent tracking contributing_agents: Vec::new(), + // Phase 40: Usage tracking + access_count: 0, + last_accessed_ms: None, } } diff --git a/proto/memory.proto b/proto/memory.proto index 201ff3e..a830802 100644 --- a/proto/memory.proto +++ b/proto/memory.proto @@ -115,6 +115,20 @@ service MemoryService { // Get dedup gate status and metrics rpc GetDedupStatus(GetDedupStatusRequest) returns (GetDedupStatusResponse); + + // ===== Episodic Memory RPCs (Phase 44) ===== + + // Start a new episode for tracking a task execution + rpc StartEpisode(StartEpisodeRequest) returns (StartEpisodeResponse); + + // Record an action taken during an in-progress episode + rpc RecordAction(RecordActionRequest) returns (RecordActionResponse); + + // Complete an episode with outcome score and lessons + rpc CompleteEpisode(CompleteEpisodeRequest) returns (CompleteEpisodeResponse); + + // Find episodes similar to a query (brute-force cosine similarity) + rpc GetSimilarEpisodes(GetSimilarEpisodesRequest) returns (GetSimilarEpisodesResponse); } // Role of the message author @@ -258,6 +272,12 @@ message TocNode { MemoryKind memory_kind = 102; // Whether node is pinned (boosted importance) bool is_pinned = 103; + + // Phase 40: Usage tracking fields (field numbers > 200) + // Number of times this node was accessed in retrieval + uint32 access_count = 201; + // Last access timestamp (ms), 0 if never accessed + int64 last_accessed_ms = 202; } // A grip providing provenance for a bullet @@ -836,6 +856,16 @@ message GetRankingStatusResponse { bool bm25_lifecycle_enabled = 10; int64 bm25_last_prune_timestamp = 11; uint32 bm25_last_prune_count = 12; + + // Phase 42: Ranking metrics (field numbers > 200) + // Average salience score across recent TOC nodes + float avg_salience_score = 201; + // Count of nodes with salience > 0.5 + uint32 high_salience_count = 202; + // Sum of all access counts across TOC nodes + uint64 total_access_count = 203; + // Average usage decay penalty factor + float avg_usage_decay = 204; } // ===== Agent Retrieval Policy Messages (Phase 17) ===== @@ -1048,3 +1078,135 @@ message GetDedupStatusResponse { // Maximum buffer capacity uint32 buffer_capacity = 7; } + +// ===== Episodic Memory Messages (Phase 44) ===== + +// Status of an episode +enum EpisodeStatusProto { + EPISODE_STATUS_UNSPECIFIED = 0; + EPISODE_STATUS_IN_PROGRESS = 1; + EPISODE_STATUS_COMPLETED = 2; + EPISODE_STATUS_FAILED = 3; +} + +// Result status of an action within an episode +enum ActionResultStatus { + ACTION_RESULT_UNSPECIFIED = 0; + ACTION_RESULT_SUCCESS = 1; + ACTION_RESULT_FAILURE = 2; + ACTION_RESULT_PENDING = 3; +} + +// A single action taken during an episode +message EpisodeAction { + // Type of action performed (e.g., "tool_call", "api_request", "file_edit") + string action_type = 1; + // Input or parameters for the action + string input = 2; + // Result status + ActionResultStatus result_status = 3; + // Result detail (output text or error message) + string result_detail = 4; + // When the action was performed (ms since epoch) + int64 timestamp_ms = 5; +} + +// Request to start a new episode +message StartEpisodeRequest { + // The task or goal being executed + string task = 1; + // Planned steps for the task (optional) + repeated string plan = 2; + // Agent executing the episode (optional) + optional string agent = 3; +} + +// Response from starting an episode +message StartEpisodeResponse { + // Unique episode ID (ULID) + string episode_id = 1; + // Whether the episode was created + bool created = 2; +} + +// Request to record an action in an episode +message RecordActionRequest { + // Episode ID to add the action to + string episode_id = 1; + // The action to record + EpisodeAction action = 2; +} + +// Response from recording an action +message RecordActionResponse { + // Whether the action was recorded + bool recorded = 1; + // Total actions in the episode after recording + uint32 action_count = 2; +} + +// Request to complete an episode +message CompleteEpisodeRequest { + // Episode ID to complete + string episode_id = 1; + // Outcome score (0.0 = total failure, 1.0 = perfect success) + float outcome_score = 2; + // Whether the episode failed (true = failed, false = completed) + bool failed = 3; + // Lessons learned from the episode + repeated string lessons_learned = 4; + // Failure modes encountered + repeated string failure_modes = 5; +} + +// Response from completing an episode +message CompleteEpisodeResponse { + // Whether the episode was completed + bool completed = 1; + // Computed value score for retrieval prioritization + float value_score = 2; + // Number of episodes pruned due to max_episodes limit + uint32 episodes_pruned = 3; +} + +// Request to find similar episodes +message GetSimilarEpisodesRequest { + // Query text to find similar episodes + string query = 1; + // Maximum results to return (default: 5) + uint32 top_k = 2; + // Minimum similarity score 0.0-1.0 (default: 0.0) + float min_score = 3; +} + +// Summary of an episode for search results +message EpisodeSummary { + // Episode ID + string episode_id = 1; + // The task or goal + string task = 2; + // Episode status + EpisodeStatusProto status = 3; + // Outcome score (if completed) + float outcome_score = 4; + // Value score for prioritization + float value_score = 5; + // Similarity score to the query + float similarity_score = 6; + // Lessons learned + repeated string lessons_learned = 7; + // Failure modes + repeated string failure_modes = 8; + // Number of actions taken + uint32 action_count = 9; + // When the episode was created (ms since epoch) + int64 created_at_ms = 10; + // Agent that executed the episode + optional string agent = 11; +} + +// Response with similar episodes +message GetSimilarEpisodesResponse { + // Similar episodes ranked by similarity + repeated EpisodeSummary episodes = 1; +}