feat: distributed hive mind with DHT sharding (#2876)
…Kuzu

Replace InMemoryHiveGraph with DistributedHiveGraph for 100+ agent deployments. Facts are distributed via a consistent hash ring instead of duplicated everywhere. Queries fan out to K relevant shard owners instead of all N agents.

Key changes:
- dht.py: HashRing (consistent hashing), ShardStore (per-agent storage), DHTRouter
- bloom.py: BloomFilter for compact shard content summaries in gossip
- distributed_hive_graph.py: HiveGraph protocol implementation using DHT
- cognitive_adapter.py: patch Kuzu buffer_pool_size to 256MB (was 80% of RAM)
- constants.py: KUZU_BUFFER_POOL_SIZE, KUZU_MAX_DB_SIZE, DHT constants

Results:
- 100 agents created in 12.3s using 4.8GB RSS (was: OOM crash at 8TB mmap)
- O(F/N) memory per agent instead of O(F) centralized
- O(K) query fan-out instead of O(N) scan-all-agents
- Bloom filter gossip with O(log N) convergence
- 26/26 tests pass in 3.4s

Fixes #2871 (Kuzu mmap OOM with 100 concurrent DBs)
Related: #2866 (5000-turn eval spec)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
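The commit describes a HashRing in dht.py that assigns each fact to an owner agent. The following is a minimal, self-contained sketch of consistent hashing with virtual nodes; the class shape, the vnode count, and the key names are illustrative assumptions, not the PR's actual implementation:

```python
import bisect
import hashlib


def _hash_key(key: str) -> int:
    """Stable 64-bit hash, independent of PYTHONHASHSEED."""
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")


class HashRing:
    """Consistent hash ring sketch. Each agent occupies several
    virtual positions so facts spread roughly evenly (~F/N per agent),
    and adding an agent only moves the keys adjacent to its positions."""

    def __init__(self, vnodes: int = 64):
        self._vnodes = vnodes
        self._ring: list[tuple[int, str]] = []  # (position, agent_id), kept sorted

    def add_agent(self, agent_id: str) -> None:
        for i in range(self._vnodes):
            bisect.insort(self._ring, (_hash_key(f"{agent_id}#{i}"), agent_id))

    def owner(self, fact_key: str) -> str:
        """First agent clockwise from the fact's position on the ring."""
        pos = _hash_key(fact_key)
        idx = bisect.bisect_right(self._ring, (pos, ""))
        return self._ring[idx % len(self._ring)][1]


ring = HashRing()
for name in ("agent-0", "agent-1", "agent-2"):
    ring.add_agent(name)
# The same key always routes to the same owner:
assert ring.owner("fact:sky-is-blue") == ring.owner("fact:sky-is-blue")
```

Because SHA-256 is deterministic, routing needs no coordination: any agent holding the same membership list computes the same owner for a fact.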
Repo Guardian - Passed ✅

All 8 files changed in this PR are legitimate, durable additions to the codebase. No ephemeral content, temporary scripts, or point-in-time documents detected.
Triage Report - DEFER (Low Priority)

Risk Level: LOW

Analysis
Changes: +1,522/-3 across 8 files

Assessment
Experimental distributed hive mind with DHT sharding. Self-contained addition, not on the critical path.

Next Steps
Recommendation: DEFER - merge after resolving high-priority quality audit PRs.

Note: Interesting feature, but it is not blocking any other work. Safe to defer.
Covers DHT sharding, query routing, gossip protocol, federation, performance comparison, eval results, and known issues. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Implements a high-level Memory facade that abstracts backend selection, distributed topology, and config resolution behind a minimal two-method API.

- memory/config.py: MemoryConfig dataclass with from_env(), from_file(), resolve() class methods. Resolution order: explicit kwargs > env vars > YAML file > built-in defaults. All AMPLIHACK_MEMORY_* env vars handled.
- memory/facade.py: Memory class with remember(), recall(), close(), stats(), run_gossip(). Supports backend=cognitive/hierarchical/simple and topology=single/distributed. Distributed topology auto-creates or joins a DistributedHiveGraph and auto-promotes facts via CognitiveAdapter.
- memory/__init__.py: exports Memory and MemoryConfig
- tests/test_memory_facade.py: 48 tests covering defaults, remember/recall, env var config, YAML file config, priority order, distributed topology, shared hive, close(), stats()

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Comprehensive investigation and design document covering:

- Full call graph from GoalSeekingAgent down to memory operations
- Evidence that LearningAgent bypasses AgenticLoop (self.loop never called)
- Corrected OODA loop with Memory.remember()/recall() at every phase
- Unification design merging LearningAgent and GoalSeekingAgent
- Eval compatibility analysis (zero harness changes needed)
- Ordered 6-phase implementation plan with risk assessments
- Three Mermaid diagrams: current call graph, proposed OODA loop, unification architecture

Investigation only — no code changes to agent files.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Workstream 1 — semantic routing in dht.py:

- ShardStore: add _summary_embedding (numpy running average), _embedding_count, _embedding_generator; set_embedding_generator() method; store() computes a running-average embedding for each fact stored when a generator is available
- DHTRouter.set_embedding_generator(): propagates to all existing shards
- DHTRouter.add_agent(): sets the embedding generator on new shards
- DHTRouter.store_fact(): ensures embedding_generator is propagated to the shard
- DHTRouter._select_query_targets(): semantic routing via cosine similarity when embeddings exist; falls back to keyword routing otherwise

Workstream 2 — Memory facade wired into OODA loop:

- AgenticLoop.__init__: accepts optional memory (Memory facade instance)
- AgenticLoop.observe(): OBSERVE phase — remember() + recall() via Memory facade
- AgenticLoop.orient(): ORIENT phase — recall domain knowledge, build world model
- AgenticLoop.perceive(): internally calls observe()+orient(); falls back to memory_retriever keyword search when no Memory facade is configured
- AgenticLoop.learn(): uses memory.remember(outcome_summary) when the facade is set; falls back to memory_retriever.store_fact() otherwise
- LearningAgent.learn_from_content(): calls self.loop.observe() before fact extraction (OBSERVE) and self.loop.learn() after (LEARN)
- LearningAgent.answer_question(): structured around the OODA loop via comments; OBSERVE at entry, existing retrieval IS the ORIENT phase, DECIDE is synthesis, ACT records the Q&A pair; public signatures unchanged

All 74 tests pass (test_distributed_hive + test_memory_facade).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
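The running-average shard embedding and cosine-similarity routing described in Workstream 1 can be sketched as follows. This is an assumed reconstruction using plain lists instead of numpy; class and function names are illustrative, not the PR's actual API:

```python
import math


class ShardSummary:
    """Sketch: maintain a running-average embedding (centroid) of all
    facts stored in a shard, without keeping per-fact vectors."""

    def __init__(self, dim: int):
        self._sum = [0.0] * dim
        self._count = 0

    def add(self, embedding: list[float]) -> None:
        self._count += 1
        for i, v in enumerate(embedding):
            self._sum[i] += v

    @property
    def centroid(self) -> list[float]:
        return [v / self._count for v in self._sum]


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def select_targets(query: list[float], shards: dict[str, ShardSummary], k: int) -> list[str]:
    """Route a query to the k shards whose centroid is most similar."""
    scored = [(cosine(query, s.centroid), name)
              for name, s in shards.items() if s._count]
    return [name for _, name in sorted(scored, reverse=True)[:k]]


# Two shards with distinct content directions; the query routes to the first.
s1 = ShardSummary(2); s1.add([1.0, 0.0]); s1.add([0.9, 0.1])
s2 = ShardSummary(2); s2.add([0.0, 1.0])
assert select_targets([1.0, 0.0], {"code": s1, "recipes": s2}, k=1) == ["code"]
```

A centroid is a lossy summary (a shard with diverse content averages toward the origin), which is presumably why the commit keeps keyword routing as the fallback when embeddings are absent.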
Covers OODA loop, cognitive memory model (6 types), DHT distributed topology, semantic routing, Memory facade, eval harness, and file map. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…buted backends

Implements a pluggable graph persistence layer that abstracts CognitiveMemory from its storage backend.

- graph_store.py: @runtime_checkable Protocol with 12 methods and 6 cognitive memory schema constants (SEMANTIC, EPISODIC, PROCEDURAL, WORKING, STRATEGIC, SOCIAL)
- memory_store.py: InMemoryGraphStore — dict-based, thread-safe, keyword search
- kuzu_store.py: KuzuGraphStore — wraps kuzu.Database with Cypher CREATE/MATCH queries
- distributed_store.py: DistributedGraphStore — DHT ring sharding via HashRing, replication factor, semantic routing, and bloom-filter gossip
- memory/__init__.py: exports all four classes
- facade.py: Memory.graph_store property; constructs the correct backend by topology+backend
- tests/test_graph_store.py: 19 tests (8 parameterized × 2 backends + 3 distributed)

All 19 tests pass: uv run pytest tests/test_graph_store.py -v

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
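The @runtime_checkable Protocol mentioned above lets backends be swapped without inheritance: any class with the right method shapes satisfies the interface. A trimmed sketch (two methods instead of the PR's twelve; the method signatures here are illustrative assumptions):

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class GraphStore(Protocol):
    """Structural interface sketch: backends need not subclass this;
    they only need matching method names to pass isinstance checks."""

    def create_node(self, node_id: str, properties: dict) -> None: ...
    def search_nodes(self, query: str) -> list[dict]: ...


class InMemoryGraphStore:
    """Dict-backed backend that satisfies GraphStore structurally."""

    def __init__(self):
        self._nodes: dict[str, dict] = {}

    def create_node(self, node_id: str, properties: dict) -> None:
        self._nodes[node_id] = properties

    def search_nodes(self, query: str) -> list[dict]:
        # Naive keyword search over stringified properties.
        return [p for p in self._nodes.values()
                if query.lower() in str(p).lower()]


store = InMemoryGraphStore()
assert isinstance(store, GraphStore)  # no inheritance required
```

Note that `runtime_checkable` isinstance checks only verify method *presence*, not signatures, so the parameterized test suite (8 tests × 2 backends in the commit) is what actually enforces behavioral compatibility.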
- Add shard_backend field to MemoryConfig with AMPLIHACK_MEMORY_SHARD_BACKEND env var
- DistributedGraphStore accepts shard_backend, storage_path, kuzu_buffer_pool_mb params
- add_agent() creates KuzuGraphStore or InMemoryGraphStore based on shard_backend; shard_factory takes precedence when provided
- facade.py passes shard_backend and storage_path from MemoryConfig to DistributedGraphStore
- docs: add shard_backend config example and kuzu vs memory guidance
- tests: add test_distributed_with_kuzu_shards verifying persistence across store reopen

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- InMemoryGraphStore: add get_all_node_ids, export_nodes, export_edges, import_nodes, import_edges for shard exchange
- KuzuGraphStore: same 5 methods using Cypher queries; fix direction='in' edge query to return canonical from_id/to_id
- GraphStore Protocol: declare all 5 new methods
- DistributedGraphStore: rewrite run_gossip_round() to exchange full node data via bloom filter gossip; add rebuild_shard() to pull peer data via the DHT ring; update add_agent() to call rebuild_shard() when peers have data
- Tests: add test_export_import_nodes, test_export_import_edges, test_gossip_full_nodes, test_gossip_edges, test_rebuild_on_join (all pass)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
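Bloom-filter gossip, as used by run_gossip_round() above, works because a bloom filter is a compact set summary with no false negatives: a peer can send its filter, and the receiver forwards only items the filter definitely lacks. A minimal sketch, with illustrative sizes and names that are not the PR's bloom.py implementation:

```python
import hashlib


class BloomFilter:
    """Minimal bloom filter: k hash positions per item set in a bit
    array. might_contain() can return false positives but never false
    negatives, which is what makes it safe as a gossip summary."""

    def __init__(self, size_bits: int = 1024, hashes: int = 3):
        self._size = size_bits
        self._hashes = hashes
        self._bits = 0  # big int used as a bit array

    def _positions(self, item: str):
        for i in range(self._hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self._size

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self._bits |= 1 << pos

    def might_contain(self, item: str) -> bool:
        return all(self._bits >> pos & 1 for pos in self._positions(item))


# Gossip sketch: peer A summarizes its shard, peer B sends only the rest.
summary_a = BloomFilter()
for node_id in ("n1", "n2"):
    summary_a.add(node_id)
assert summary_a.might_contain("n1")  # no false negatives, ever
to_send = [n for n in ("n1", "n2", "n3") if not summary_a.might_contain(n)]
```

A false positive only means a peer occasionally skips sending a node the receiver already lacks; a later gossip round (or rebuild_shard() on join) picks it up, which is consistent with the O(log N) convergence claim in the PR.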
- FIX 1: export_edges() filters structural keys correctly from properties
- FIX 2: retract_fact() returns bool; ShardStore.search() skips retracted facts
- FIX 3: _node_content_keys map stored at create_node time; rebuild_shard uses the correct routing key
- FIX 4: _validate_identifier() guards all f-string interpolations in kuzu_store.py
- FIX 5: silent except: pass replaced with ImportError + Exception + logging in dht.py/distributed_store.py
- FIX 6: get_summary_embedding() method added to ShardStore and _AgentShard with lock; call sites updated
- FIX 8: route_query() returns list[str] agent_id strings instead of HiveAgent objects
- FIX 10: _query_targets returns all_ids[:_query_fanout] instead of a *3 over-fetch
- FIX 9: escalate_fact() and broadcast_fact() added to DistributedHiveGraph
- FIX 11: int() parsing of env vars in config.py wrapped in try/except ValueError with logging
- FIX 12: dead code (col_names/param_refs/overwritten query) removed from kuzu_store.py
- FIX 13: export_edges returns 6-tuples (rel_type, from_table, from_id, to_table, to_id, props); import_edges accepts them
- Updated test_graph_store.py assertions to match the new 6-tuple edge format

All 103 tests pass.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
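FIX 4 addresses a classic injection risk: Cypher (like SQL) can parameterize values but not table/label names, so any identifier interpolated into a query string must be validated first. A sketch of the guard, with an assumed regex and function name:

```python
import re

# Identifiers must look like plain names: letters, digits, underscores,
# not starting with a digit. Anything else is rejected outright.
_IDENT = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def validate_identifier(name: str) -> str:
    """Whitelist-check a table/label name before f-string interpolation
    into a Cypher query. Values still go through query parameters; only
    identifiers, which cannot be parameterized, take this path."""
    if not _IDENT.match(name):
        raise ValueError(f"unsafe identifier: {name!r}")
    return name


table = validate_identifier("SemanticFact")
# Identifier interpolated only after validation; the value stays a parameter.
query = f"MATCH (n:{table}) WHERE n.id = $id RETURN n"
```

Whitelisting (accept only a known-safe shape) is safer here than blacklisting dangerous characters, since it fails closed on anything unexpected.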
…replication

- NetworkGraphStore wraps a local GraphStore and replicates create_node/create_edge over a network transport (local/redis/azure_service_bus) using the existing event_bus.py
- Background thread processes incoming events: applies remote writes and responds to distributed search queries
- search_nodes publishes SEARCH_QUERY, collects remote responses within a timeout, and returns merged/deduplicated results
- AMPLIHACK_MEMORY_TRANSPORT and AMPLIHACK_MEMORY_CONNECTION_STRING env vars added to MemoryConfig and the Memory facade; a non-local transport auto-wraps the store with NetworkGraphStore
- 20 unit tests all passing

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
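The replication pattern above (every local write is also published; remote events are applied to the local store) can be sketched with an in-process callback standing in for the Service Bus transport. This is an illustrative reduction of the idea; the real NetworkGraphStore uses event_bus.py, a background thread, and timeouts, none of which appear here:

```python
from typing import Callable


class DictStore:
    """Stand-in local store for the sketch."""
    def __init__(self):
        self.nodes: dict[str, dict] = {}

    def create_node(self, node_id: str, properties: dict) -> None:
        self.nodes[node_id] = properties


class NetworkGraphStore:
    """Sketch: wrap a local store, publish each write to a transport,
    and apply writes received from peers to the local store."""

    def __init__(self, local: DictStore, publish: Callable[[dict], None]):
        self._local = local
        self._publish = publish

    def create_node(self, node_id: str, properties: dict) -> None:
        self._local.create_node(node_id, properties)  # apply locally first
        self._publish({"op": "create_node", "id": node_id, "props": properties})

    def apply_remote(self, event: dict) -> None:
        """Called when an event arrives from another agent."""
        if event["op"] == "create_node":
            self._local.create_node(event["id"], event["props"])


# Two agents wired through an in-process "bus": A publishes straight to B.
a_store, b_store = DictStore(), DictStore()
b = NetworkGraphStore(b_store, publish=lambda event: None)  # B publishes nowhere
a = NetworkGraphStore(a_store, publish=b.apply_remote)
a.create_node("fact-1", {"text": "shared"})
assert "fact-1" in b_store.nodes  # write replicated to the peer
```

Since apply_remote() bypasses publishing, this shape also avoids the echo loop where a replicated write would be re-broadcast by its receiver.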
- src/amplihack/cli/hive.py: argparse-based CLI with create, add-agent, start, status, stop commands
- create: scaffolds ~/.amplihack/hives/NAME/config.yaml with N agents
- add-agent: appends an agent entry with name, prompt, optional kuzu_db path
- start --target local: launches agents as subprocesses with the correct env vars; --target azure delegates to deploy/azure_hive/deploy.sh
- status: shows an agent PID status table with running/stopped states
- stop: sends SIGTERM to all running agent processes
- Hive config YAML matches the spec (name, transport, connection_string, agents list)
- Registered amplihack-hive = amplihack.cli.hive:main in pyproject.toml
- 21 unit tests all passing

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
deploy/azure_hive/ contains:

- Dockerfile: python:3.11-slim base, installs amplihack + kuzu + sentence-transformers, non-root user (amplihack-agent), entrypoint=agent_entrypoint.py
- deploy.sh: az CLI script to provision Service Bus namespace+topic+subscriptions, ACR, Azure File Share, and deploy N Container Apps (5 agents per app via Bicep). Supports --build-only, --infra-only, --cleanup, --status modes
- main.bicep: defines Container Apps Environment, Service Bus, File Share, Container Registry, and N Container App resources with per-agent env vars
- agent_entrypoint.py: reads AMPLIHACK_AGENT_NAME, AMPLIHACK_AGENT_PROMPT, AMPLIHACK_MEMORY_CONNECTION_STRING; creates Memory with NetworkGraphStore; runs the OODA loop with graceful shutdown
- 27 unit tests all passing

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…d with deployment instructions

- agent_memory_architecture.md: add NetworkGraphStore section covering architecture, configuration, environment variables, and integration with the Memory facade
- distributed_hive_mind.md: add comprehensive deployment guide covering local subprocess deployment, Azure Service Bus transport, and Azure Container Apps deployment with deploy.sh / main.bicep; includes a troubleshooting section

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Remove hard docker requirement and add conditional: use local docker if available, fall back to az acr build for environments without Docker daemon. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Covers goal-seeking agents, cognitive memory model, GraphStore protocol, DHT architecture, eval results (94.1% single vs 45.8% federated), Azure deployment, and next steps. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
COPY path must be relative to REPO_ROOT when using ACR remote build with repo root as the build context. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Bicep does not support ceil() or float() functions. Use the equivalent integer arithmetic formula (a + b - 1) / b for ceiling division. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
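The identity behind this fix is that for positive integers, ceil(a / b) equals (a + b - 1) integer-divided by b, so no float math is needed. A quick check in Python (using `//`, which matches Bicep's integer `/` for positive operands):

```python
def ceil_div(a: int, b: int) -> int:
    """Integer ceiling division for positive a, b: equivalent to
    ceil(a / b) but using only integer arithmetic, as in the Bicep fix."""
    return (a + b - 1) // b


# Sizing Container Apps at 5 agents per app:
assert ceil_div(100, 5) == 20  # exact multiple: no extra app
assert ceil_div(101, 5) == 21  # one leftover agent still needs an app
```

The formula works because adding b - 1 pushes any nonzero remainder past the next multiple of b, while exact multiples are unaffected.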
Azure policy 'Storage account public access should be disallowed' requires allowBlobPublicAccess: false on all storage accounts. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Without this, Container Apps may deploy before the ManagedEnvironment storage mount is registered, causing ManagedEnvironmentStorageNotFound. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…rce install

- Copy bloom.py from hive_mind into the memory package so it's available standalone
- Extract HashRing + _hash_key from dht.py into memory/hash_ring.py
- Update distributed_store.py to import from amplihack.memory.bloom and amplihack.memory.hash_ring
- Update Dockerfile to COPY the full source tree and pip install from source so agents/ is included

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Ensures the db_path directory exists before kuzu.Database() is called, preventing FileNotFoundError when the path doesn't exist on first run. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…re init Kuzu creates its own database directory; creating db_path itself causes "Database path cannot be a directory" error. Create parent only so Kuzu can initialize the database directory at the expected path. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…hStore init Kuzu creates its own database directory; it fails with "path cannot be a directory" if given an empty dir. Remove empty stale dir if present, then create parent so Kuzu can initialize cleanly on first run. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
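The sequence of fixes above converges on: remove a stale empty directory at the db path, then create only the parent so Kuzu can initialize the database directory itself. A sketch of that logic (function name is illustrative; the demo simulates the stale-directory case with a temp dir):

```python
import tempfile
from pathlib import Path


def prepare_kuzu_path(db_path: str) -> None:
    """Sketch of the fix: Kuzu must create the database directory
    itself, so clear an empty leftover dir at db_path and ensure only
    the parent exists before calling kuzu.Database(db_path)."""
    path = Path(db_path)
    if path.is_dir() and not any(path.iterdir()):
        path.rmdir()  # empty stale dir from an earlier failed init
    path.parent.mkdir(parents=True, exist_ok=True)


# Demo: simulate a stale empty directory left at the db path.
root = Path(tempfile.mkdtemp())
db = root / "hive" / "agent.kuzu"
db.mkdir(parents=True)       # the problematic leftover
prepare_kuzu_path(str(db))
assert not db.exists()       # stale dir removed
assert db.parent.is_dir()    # parent ready for Kuzu's own init
```

Note the guard only removes *empty* directories: a non-empty directory at db_path may be a real database (or real data) and is deliberately left for Kuzu to reject or open.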
…o 100 agents

- agent_entrypoint.py: default AMPLIHACK_MEMORY_BACKEND=simple, graceful azure-servicebus import fallback, startup log with name/topology/transport/backend, combined 'memory initialized and entering OODA loop' log, event polling in the OODA tick
- deploy.sh: pass memoryBackend=simple to the Bicep template
- main.bicep: add memoryBackend param (default: simple), inject AMPLIHACK_MEMORY_BACKEND env var into all agent containers (Service Bus connection string and ANTHROPIC_API_KEY already injected as secrets); supports HIVE_AGENT_COUNT up to 100 via 20 apps x 5 agents

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…re Service Bus When topology=distributed and transport=azure_service_bus, the Memory facade now creates an InMemoryGraphStore wrapped in NetworkGraphStore instead of the DHT-based DistributedGraphStore. The NetworkGraphStore publishes all writes to the Service Bus topic and a background thread applies remote writes to the local store, enabling cross-container memory sharing across all 20 Container Apps. Also skips legacy DistributedHiveGraph construction when a non-local transport is configured, preventing import failures in the container environment. agent_entrypoint.py updated to pass topology="distributed" and backend="simple" so agents report the correct topology in stats logs. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…test eval results

- distributed_hive_mind.md: add actual Azure resource names (ACR hivacrhivemind.azurecr.io, SB hive-sb-dj2qo2w7vu5zi/hive-graph/100 subs, Storage hivesadj2qo2w7vu5zi), fix region eastus→westus2, add 20 Container Apps amplihive-app-0..19 / 100 agents table, note Azure Files not used for Kuzu (POSIX lock limitation), update eval results table with federated smoke 65.7% and federated full 45.8% rows, add gossip full-graph-nodes note and shard rebuild-on-join detail
- agent_memory_architecture.md: add GraphStore 4-backend table, update gossip section with full graph node exchange and shard rebuild, update eval harness section with all three result rows and Azure deployment context
- memory_ooda_integration.md: update date to 2026-03-06, add transport topology appendix covering local/redis/azure_service_bus with actual Azure resource names and the Kuzu limitation
- hive_mind_presentation.md: update eval results table labels (federated-10 vs federated-100), add routing precision degradation insight, add quality metrics (103 tests, 13 audit findings), update Azure deployment table with real resource names, add gossip full-graph-nodes detail

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
TASK 1 — New tutorial:

- Add docs/tutorial_prompt_to_distributed_hive.md: 10-step guide from a single prompt to 100 distributed agents in Azure. Covers Memory facade zero-config, remember()/recall(), DistributedHiveGraph shared_hive, memory.yaml/env-var config, amplihack-hive create, HiveController manifest, deploy.sh, az containerapp monitoring, and /learn//query HTTP verification. References real Azure resources: hive-mind-rg, hivacrhivemind, hive-sb-dj2qo2w7vu5zi.

TASK 2 — Stale doc refresh:

- docs/hive_mind/GETTING_STARTED.md: fix hive-mind-eval-rg → hive-mind-rg, add hivacrhivemind ACR and hive-sb-dj2qo2w7vu5zi Service Bus names, update Difference table to mention DistributedHiveGraph + NetworkGraphStore
- docs/hive_mind/TUTORIAL.md: same Azure resource and env-var fixes
- docs/hive_mind_presentation.md: update stale baseline from 94.1% → 90.47% (dataset 5000t-seed42-v1.0, median-of-3), fix gap 28.4pp → 24.77pp, add category breakdown table, update speaker notes throughout

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Summary
Replaces InMemoryHiveGraph with DistributedHiveGraph using DHT-based fact sharding. Each agent holds only its shard of the fact space (~F/N facts per agent instead of F everywhere).

Patches CognitiveMemory to use a 256MB buffer pool per agent instead of Kuzu's default 80% of system RAM + 8TB mmap.

continuous_eval.py updated to use DistributedHiveGraph for federated conditions with 100+ agents.

Before (100 agents)
After (100 agents)
Architecture
Test plan
Fixes #2871
Related: #2866
🤖 Generated with Claude Code