feat: distributed hive mind with DHT sharding#2876

Open
rysweet wants to merge 30 commits into main from feat/distributed-hive-mind

Conversation


@rysweet rysweet commented Mar 4, 2026

Summary

  • DistributedHiveGraph: Drop-in replacement for InMemoryHiveGraph using DHT-based fact sharding. Each agent holds only its shard of the fact space (~F/N facts per agent instead of F everywhere).
  • HashRing + ShardStore: Consistent hashing distributes facts across agents with configurable replication factor (default R=3).
  • BloomFilter gossip: Agents exchange compact bloom filter summaries and pull missing facts. O(log N) convergence.
  • Kuzu buffer pool fix: Patches CognitiveMemory to use 256MB buffer pool per agent instead of Kuzu's default 80% of system RAM + 8TB mmap.
  • Eval integration: continuous_eval.py updated to use DistributedHiveGraph for federated conditions with 100+ agents.
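The consistent-hashing placement described above can be sketched in a few lines of Python. Names like `HashRing` and the replication parameter follow the summary; the vnode count and hashing details here are illustrative assumptions, not the PR's actual implementation:

```python
import bisect
import hashlib

def _hash_key(key: str) -> int:
    # Stable 64-bit hash so placement is identical across processes.
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")

class HashRing:
    """Consistent hash ring mapping fact keys to shard-owning agents."""

    def __init__(self, replication: int = 3, vnodes: int = 64):
        self.replication = replication
        self.vnodes = vnodes
        self._ring: list[tuple[int, str]] = []  # sorted (hash, agent_id) pairs

    def add_agent(self, agent_id: str) -> None:
        # Virtual nodes smooth out the key distribution across agents.
        for v in range(self.vnodes):
            bisect.insort(self._ring, (_hash_key(f"{agent_id}#{v}"), agent_id))

    def owners(self, fact_key: str) -> list[str]:
        # Walk clockwise from the key's position, collecting R distinct agents.
        if not self._ring:
            return []
        start = bisect.bisect(self._ring, (_hash_key(fact_key), ""))
        owners: list[str] = []
        for i in range(len(self._ring)):
            _, agent = self._ring[(start + i) % len(self._ring)]
            if agent not in owners:
                owners.append(agent)
            if len(owners) == self.replication:
                break
        return owners
```

With N agents on the ring and replication R, each fact lands on R owners, so per-agent storage tends toward F·R/N rather than F, and adding an agent moves only ~1/N of the keys.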

Before (100 agents)

RuntimeError: Buffer manager exception: Mmap for size 8796093022208 failed.

After (100 agents)

Created 100 agents in 12.3s
RSS: 4863 MB

Architecture

    ┌──────────────────────────────────────────┐
    │       Consistent Hash Ring (DHT)          │
    │  Facts hashed → stored on shard owner(s)  │
    └──┬──────────┬──────────┬─────────┬───────┘
       │          │          │         │
    Agent 0    Agent 1    Agent 2    Agent N
    (shard)    (shard)    (shard)    (shard)
    
    Query:  DHT lookup → fan-out to K agents → merge results
    Gossip: bloom filter exchange → pull missing facts
| Metric | InMemoryHiveGraph | DistributedHiveGraph |
| --- | --- | --- |
| Memory per agent | O(F) all facts | O(F/N) shard only |
| Query fan-out | O(N) all agents | O(K) relevant agents |
| 100-agent creation | OOM crash | 12.3s, 4.8GB RSS |
| Gossip convergence | N/A | O(log N) rounds |
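The "bloom filter exchange → pull missing facts" step can be sketched as follows. This is a minimal filter; the bit count and hash choices are illustrative, not the PR's tuned parameters:

```python
import hashlib

class BloomFilter:
    """Compact set summary: peers test membership, then pull what they lack."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # big int used as a bit array

    def _positions(self, item: str):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def might_contain(self, item: str) -> bool:
        return all(self.bits >> pos & 1 for pos in self._positions(item))

def gossip_pull(local_facts: set[str], peer_filter: BloomFilter) -> set[str]:
    # Facts the peer definitely lacks: bloom filters have no false negatives,
    # so anything the filter rejects is safe to send.
    return {f for f in local_facts if not peer_filter.might_contain(f)}
```

Each round an agent sends its filter, receives whatever the peer determines it is missing, and with random peer selection the shard contents converge in O(log N) rounds, at the cost of a small false-positive rate (facts occasionally not sent until a later round).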

Test plan

  • 26 unit + integration tests (3.4s, all passing)
  • 100-agent creation smoke test (no OOM)
  • Diverse fact distribution verified (47/100 agents with facts at R=3)
  • Query routing finds correct facts across all 20 test categories
  • Federated topology (root + groups) with fact escalation
  • Full 5000-turn eval with 100 agents (blocked on long-running learning phase)

Fixes #2871
Related: #2866

🤖 Generated with Claude Code

Ubuntu and others added 2 commits March 4, 2026 07:02
…Kuzu

Replace InMemoryHiveGraph with DistributedHiveGraph for 100+ agent deployments.
Facts distributed via consistent hash ring instead of duplicated everywhere.
Queries fan out to K relevant shard owners instead of all N agents.

Key changes:
- dht.py: HashRing (consistent hashing), ShardStore (per-agent storage), DHTRouter
- bloom.py: BloomFilter for compact shard content summaries in gossip
- distributed_hive_graph.py: HiveGraph protocol implementation using DHT
- cognitive_adapter.py: Patch Kuzu buffer_pool_size to 256MB (was 80% of RAM)
- constants.py: KUZU_BUFFER_POOL_SIZE, KUZU_MAX_DB_SIZE, DHT constants
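The buffer-pool cap amounts to passing an explicit size to Kuzu instead of accepting its default. A hedged sketch: the constant names come from the commit text, the values are illustrative, and the exact keyword arguments passed to `kuzu.Database` are an assumption about this codebase's adapter:

```python
# Constant names mirror constants.py as listed above; values are illustrative.
KUZU_BUFFER_POOL_SIZE = 256 * 1024 * 1024   # 256 MB per agent, in bytes
KUZU_MAX_DB_SIZE = 8 * 1024 * 1024 * 1024   # cap the mmap far below the 8 TB default

def open_agent_db(db_path: str):
    """Open a per-agent Kuzu database with a bounded buffer pool."""
    import kuzu  # deferred so environments without kuzu can still import this module
    return kuzu.Database(
        db_path,
        buffer_pool_size=KUZU_BUFFER_POOL_SIZE,
        max_db_size=KUZU_MAX_DB_SIZE,
    )
```

With the cap in place, 100 agents consume at most ~25.6 GB of buffer pool in aggregate, instead of each trying to reserve 80% of system RAM.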

Results:
- 100 agents created in 12.3s using 4.8GB RSS (was: OOM crash at 8TB mmap)
- O(F/N) memory per agent instead of O(F) centralized
- O(K) query fan-out instead of O(N) scan-all-agents
- Bloom filter gossip with O(log N) convergence
- 26/26 tests pass in 3.4s

Fixes #2871 (Kuzu mmap OOM with 100 concurrent DBs)
Related: #2866 (5000-turn eval spec)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

github-actions bot commented Mar 4, 2026

🤖 Auto-fixed version bump

The version in pyproject.toml has been automatically bumped to the next patch version.

If you need a minor or major version bump instead, please update pyproject.toml manually and push the change.


github-actions bot commented Mar 4, 2026

Repo Guardian - Passed ✅

All 8 files changed in this PR are legitimate, durable additions to the codebase:

  • Implementation files: 7 production code files implementing distributed hive mind architecture with DHT-based fact sharding
  • Test coverage: 1 comprehensive test suite with 26 unit + integration tests

No ephemeral content, temporary scripts, or point-in-time documents detected.

AI generated by Repo Guardian


github-actions bot commented Mar 5, 2026

Triage Report - DEFER (Low Priority)

Risk Level: LOW
Priority: LOW
Status: Deferred

Analysis

Changes: +1,522/-3 across 8 files
Type: New experimental feature
Age: 30 hours

Assessment

Experimental distributed hive mind with DHT sharding. Self-contained addition, not on critical path.

Next Steps

  1. Wait for CI completion
  2. Merge after higher priority PRs: #2883 (fix: remove CLAUDECODE env var detection, centralize stripping), #2867 (refactor: extract CompactionContext/ValidationResult to compaction_context.py, issue #2845), #2870 (refactor: split stop.py 766 LOC into 3 modules, fix ImportError/except/counter bugs), #2877 (refactor: split cli.py into focused modules), #2881 (fix: make .claude/ hooks canonical, replace amplifier-bundle/ copy with symlink)
  3. Low urgency - experimental feature

Recommendation: DEFER - merge after resolving high-priority quality audit PRs.

Note: Interesting feature but not blocking any other work. Safe to defer.

AI generated by PR Triage Agent

Ubuntu and others added 2 commits March 5, 2026 20:56
Covers DHT sharding, query routing, gossip protocol, federation,
performance comparison, eval results, and known issues.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

github-actions bot commented Mar 5, 2026

🤖 Auto-fixed version bump

The version in pyproject.toml has been automatically bumped to the next patch version.

If you need a minor or major version bump instead, please update pyproject.toml manually and push the change.

Ubuntu and others added 18 commits March 5, 2026 23:10
Implements a high-level Memory facade that abstracts backend selection,
distributed topology, and config resolution behind a minimal two-method API.

- memory/config.py: MemoryConfig dataclass with from_env(), from_file(),
  resolve() class methods. Resolution order: explicit kwargs > env vars >
  YAML file > built-in defaults. All AMPLIHACK_MEMORY_* env vars handled.
- memory/facade.py: Memory class with remember(), recall(), close(), stats(),
  run_gossip(). Supports backend=cognitive/hierarchical/simple and
  topology=single/distributed. Distributed topology auto-creates or joins
  a DistributedHiveGraph and auto-promotes facts via CognitiveAdapter.
- memory/__init__.py: exports Memory and MemoryConfig
- tests/test_memory_facade.py: 48 tests covering defaults, remember/recall,
  env var config, YAML file config, priority order, distributed topology,
  shared hive, close(), stats()

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Comprehensive investigation and design document covering:
- Full call graph from GoalSeekingAgent down to memory operations
- Evidence that LearningAgent bypasses AgenticLoop (self.loop never called)
- Corrected OODA loop with Memory.remember()/recall() at every phase
- Unification design merging LearningAgent and GoalSeekingAgent
- Eval compatibility analysis (zero harness changes needed)
- Ordered 6-phase implementation plan with risk assessments
- Three Mermaid diagrams: current call graph, proposed OODA loop, unification architecture

Investigation only — no code changes to agent files.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Workstream 1 — semantic routing in dht.py:
- ShardStore: add _summary_embedding (numpy running average), _embedding_count,
  _embedding_generator; set_embedding_generator() method; store() computes
  running-average embedding on each fact stored when generator is available
- DHTRouter.set_embedding_generator(): propagates to all existing shards
- DHTRouter.add_agent(): sets embedding generator on new shards
- DHTRouter.store_fact(): ensures embedding_generator propagated to shard
- DHTRouter._select_query_targets(): semantic routing via cosine similarity
  when embeddings exist; falls back to keyword routing otherwise
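The running-average shard summary and cosine-similarity routing from Workstream 1 can be sketched without numpy (the PR uses a numpy running average; the arithmetic is identical):

```python
import math

class ShardSummary:
    """Incrementally averaged embedding summarizing a shard's contents."""

    def __init__(self, dim: int):
        self.avg = [0.0] * dim
        self.count = 0

    def add(self, embedding: list[float]) -> None:
        # Running mean: avg += (v - avg) / n; no past vectors need to be kept.
        self.count += 1
        for i, v in enumerate(embedding):
            self.avg[i] += (v - self.avg[i]) / self.count

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def select_query_targets(query_emb: list[float],
                         shards: dict[str, ShardSummary], k: int) -> list[str]:
    # Rank shard owners by similarity of their summary to the query embedding.
    ranked = sorted(shards, key=lambda s: cosine(query_emb, shards[s].avg),
                    reverse=True)
    return ranked[:k]
```

A shard whose facts cluster around a topic ends up with a summary vector near that topic, so queries route to the K most semantically relevant shards rather than fanning out to all N.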

Workstream 2 — Memory facade wired into OODA loop:
- AgenticLoop.__init__: accepts optional memory (Memory facade instance)
- AgenticLoop.observe(): OBSERVE phase — remember() + recall() via Memory facade
- AgenticLoop.orient(): ORIENT phase — recall domain knowledge, build world model
- AgenticLoop.perceive(): internally calls observe()+orient(); falls back to
  memory_retriever keyword search when no Memory facade configured
- AgenticLoop.learn(): uses memory.remember(outcome_summary) when facade set;
  falls back to memory_retriever.store_fact() otherwise
- LearningAgent.learn_from_content(): calls self.loop.observe() before fact
  extraction (OBSERVE) and self.loop.learn() after (LEARN)
- LearningAgent.answer_question(): structured around OODA loop via comments;
  OBSERVE at entry, existing retrieval IS the ORIENT phase, DECIDE is synthesis,
  ACT records Q&A pair; public signatures unchanged

All 74 tests pass (test_distributed_hive + test_memory_facade).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Covers OODA loop, cognitive memory model (6 types), DHT distributed
topology, semantic routing, Memory facade, eval harness, and file map.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…buted backends

Implements a pluggable graph persistence layer that abstracts CognitiveMemory
from its storage backend.

- graph_store.py: @runtime_checkable Protocol with 12 methods and 6 cognitive
  memory schema constants (SEMANTIC, EPISODIC, PROCEDURAL, WORKING, STRATEGIC, SOCIAL)
- memory_store.py: InMemoryGraphStore — dict-based, thread-safe, keyword search
- kuzu_store.py: KuzuGraphStore — wraps kuzu.Database with Cypher CREATE/MATCH queries
- distributed_store.py: DistributedGraphStore — DHT ring sharding via HashRing,
  replication factor, semantic routing, and bloom-filter gossip
- memory/__init__.py: exports all four classes
- facade.py: Memory.graph_store property; constructs correct backend by topology+backend
- tests/test_graph_store.py: 19 tests (8 parameterized × 2 backends + 3 distributed)

All 19 tests pass: uv run pytest tests/test_graph_store.py -v
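A `@runtime_checkable` protocol of this shape lets backends be swapped without inheritance. Two representative methods are shown here (the real protocol declares twelve), and the `InMemoryGraphStore` body is a toy stand-in:

```python
from typing import Protocol, runtime_checkable

# Cognitive memory schema constants, as named in the commit above.
SEMANTIC, EPISODIC, PROCEDURAL = "semantic", "episodic", "procedural"
WORKING, STRATEGIC, SOCIAL = "working", "strategic", "social"

@runtime_checkable
class GraphStore(Protocol):
    """Structural interface a storage backend must satisfy."""

    def create_node(self, table: str, node_id: str, properties: dict) -> None: ...
    def search_nodes(self, query: str, limit: int = 10) -> list[dict]: ...

class InMemoryGraphStore:
    """Dict-based backend; satisfies GraphStore structurally, no subclassing."""

    def __init__(self):
        self._nodes: dict[tuple[str, str], dict] = {}

    def create_node(self, table: str, node_id: str, properties: dict) -> None:
        self._nodes[(table, node_id)] = dict(properties)

    def search_nodes(self, query: str, limit: int = 10) -> list[dict]:
        hits = [props for props in self._nodes.values()
                if query.lower() in str(props).lower()]
        return hits[:limit]
```

Any class with matching method shapes (Kuzu-backed, distributed, networked) passes the same `isinstance(store, GraphStore)` check, which is what makes the four-backend design pluggable.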

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Add shard_backend field to MemoryConfig with AMPLIHACK_MEMORY_SHARD_BACKEND env var
- DistributedGraphStore accepts shard_backend, storage_path, kuzu_buffer_pool_mb params
- add_agent() creates KuzuGraphStore or InMemoryGraphStore based on shard_backend;
  shard_factory takes precedence when provided
- facade.py passes shard_backend and storage_path from MemoryConfig to DistributedGraphStore
- docs: add shard_backend config example and kuzu vs memory guidance
- tests: add test_distributed_with_kuzu_shards verifying persistence across store reopen

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- InMemoryGraphStore: add get_all_node_ids, export_nodes, export_edges,
  import_nodes, import_edges for shard exchange
- KuzuGraphStore: same 5 methods using Cypher queries; fix direction='in'
  edge query to return canonical from_id/to_id
- GraphStore Protocol: declare all 5 new methods
- DistributedGraphStore: rewrite run_gossip_round() to exchange full node
  data via bloom filter gossip; add rebuild_shard() to pull peer data via
  DHT ring; update add_agent() to call rebuild_shard() when peers have data
- Tests: add test_export_import_nodes, test_export_import_edges,
  test_gossip_full_nodes, test_gossip_edges, test_rebuild_on_join (all pass)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- FIX 1: export_edges() filters structural keys correctly from properties
- FIX 2: retract_fact() returns bool; ShardStore.search() skips retracted facts
- FIX 3: _node_content_keys map stored at create_node time; rebuild_shard uses correct routing key
- FIX 4: _validate_identifier() guards all f-string interpolations in kuzu_store.py
- FIX 5: Silent except:pass replaced with ImportError + Exception + logging in dht.py/distributed_store.py
- FIX 6: get_summary_embedding() method added to ShardStore and _AgentShard with lock; call sites updated
- FIX 8: route_query() returns list[str] agent_id strings instead of HiveAgent objects
- FIX 9: escalate_fact() and broadcast_fact() added to DistributedHiveGraph
- FIX 10: _query_targets returns all_ids[:_query_fanout] instead of *3 over-fetch
- FIX 11: int() parsing of env vars in config.py wrapped in try/except ValueError with logging
- FIX 12: Dead code (col_names/param_refs/overwritten query) removed from kuzu_store.py
- FIX 13: export_edges returns 6-tuples (rel_type, from_table, from_id, to_table, to_id, props); import_edges accepts them
- Updated test_graph_store.py assertions to match new 6-tuple edge format

All 103 tests pass.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…replication

- NetworkGraphStore wraps a local GraphStore and replicates create_node/create_edge
  over a network transport (local/redis/azure_service_bus) using existing event_bus.py
- Background thread processes incoming events: applies remote writes and responds to
  distributed search queries
- search_nodes publishes SEARCH_QUERY, collects remote responses within timeout,
  and returns merged/deduplicated results
- AMPLIHACK_MEMORY_TRANSPORT and AMPLIHACK_MEMORY_CONNECTION_STRING env vars added to
  MemoryConfig and Memory facade; non-local transport auto-wraps store with NetworkGraphStore
- 20 unit tests all passing

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- src/amplihack/cli/hive.py: argparse-based CLI with create, add-agent, start,
  status, stop commands
- create: scaffolds ~/.amplihack/hives/NAME/config.yaml with N agents
- add-agent: appends agent entry with name, prompt, optional kuzu_db path
- start --target local: launches agents as subprocesses with correct env vars;
  --target azure delegates to deploy/azure_hive/deploy.sh
- status: shows agent PID status table with running/stopped states
- stop: sends SIGTERM to all running agent processes
- Hive config YAML matches spec (name, transport, connection_string, agents list)
- Registered amplihack-hive = amplihack.cli.hive:main in pyproject.toml
- 21 unit tests all passing
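A subcommand skeleton matching this CLI's shape might look like the following (argparse only; the command names follow the commit, while flags and handlers are assumptions):

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="amplihack-hive")
    sub = parser.add_subparsers(dest="command", required=True)

    create = sub.add_parser("create", help="scaffold a hive config")
    create.add_argument("name")
    create.add_argument("--agents", type=int, default=5)

    start = sub.add_parser("start", help="launch agents")
    start.add_argument("name")
    start.add_argument("--target", choices=["local", "azure"], default="local")

    for cmd in ("add-agent", "status", "stop"):
        p = sub.add_parser(cmd)
        p.add_argument("name")

    return parser
```

Registering `main` as a console script in pyproject.toml (as the commit describes) then exposes this as the `amplihack-hive` command.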

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
deploy/azure_hive/ contains:
- Dockerfile: python:3.11-slim base, installs amplihack + kuzu + sentence-transformers,
  non-root user (amplihack-agent), entrypoint=agent_entrypoint.py
- deploy.sh: az CLI script to provision Service Bus namespace+topic+subscriptions,
  ACR, Azure File Share, and deploy N Container Apps (5 agents per app via Bicep)
  Supports --build-only, --infra-only, --cleanup, --status modes
- main.bicep: defines Container Apps Environment, Service Bus, File Share,
  Container Registry, and N Container App resources with per-agent env vars
- agent_entrypoint.py: reads AMPLIHACK_AGENT_NAME, AMPLIHACK_AGENT_PROMPT,
  AMPLIHACK_MEMORY_CONNECTION_STRING; creates Memory with NetworkGraphStore;
  runs OODA loop with graceful shutdown
- 27 unit tests all passing

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…d with deployment instructions

- agent_memory_architecture.md: add NetworkGraphStore section covering architecture,
  configuration, environment variables, and integration with Memory facade
- distributed_hive_mind.md: add comprehensive deployment guide covering local
  subprocess deployment, Azure Service Bus transport, and Azure Container Apps
  deployment with deploy.sh / main.bicep; includes troubleshooting section

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Remove hard docker requirement and add conditional: use local docker if available,
fall back to az acr build for environments without Docker daemon.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Covers goal-seeking agents, cognitive memory model, GraphStore protocol,
DHT architecture, eval results (94.1% single vs 45.8% federated),
Azure deployment, and next steps.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
COPY path must be relative to REPO_ROOT when using ACR remote build
with repo root as the build context.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Bicep does not support ceil() or float() functions. Use the equivalent
integer arithmetic formula (a + b - 1) / b for ceiling division.
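The same identity works anywhere division truncates: for non-negative a and positive b, ceil(a/b) equals (a + b - 1) integer-divided by b. A quick check in Python, with the 5-agents-per-app split from this deployment as the worked example:

```python
def ceil_div(a: int, b: int) -> int:
    """Ceiling division using only integer ops, as in the Bicep workaround."""
    return (a + b - 1) // b

# e.g. number of Container Apps needed at 5 agents per app:
apps_for_100 = ceil_div(100, 5)  # 20
apps_for_101 = ceil_div(101, 5)  # 21
```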

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Azure policy 'Storage account public access should be disallowed' requires
allowBlobPublicAccess: false on all storage accounts.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Without this, Container Apps may deploy before the ManagedEnvironment
storage mount is registered, causing ManagedEnvironmentStorageNotFound.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Ubuntu and others added 7 commits March 6, 2026 07:15
…rce install

- Copy bloom.py from hive_mind into memory package so it's available standalone
- Extract HashRing + _hash_key from dht.py into memory/hash_ring.py
- Update distributed_store.py to import from amplihack.memory.bloom and amplihack.memory.hash_ring
- Update Dockerfile to COPY full source tree and pip install from source so agents/ is included

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Ensures the db_path directory exists before kuzu.Database() is called,
preventing FileNotFoundError when the path doesn't exist on first run.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…re init

Kuzu creates its own database directory; creating db_path itself causes
"Database path cannot be a directory" error. Create parent only so Kuzu
can initialize the database directory at the expected path.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…hStore init

Kuzu creates its own database directory; it fails with "path cannot be a
directory" if given an empty dir. Remove empty stale dir if present, then
create parent so Kuzu can initialize cleanly on first run.
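The path-preparation logic that results from this fix and the previous one can be sketched as below. The helper name is hypothetical; the behavior follows the commit messages: remove an empty stale directory if present, create only the parent, and leave the database path itself for Kuzu:

```python
from pathlib import Path

def prepare_kuzu_path(db_path: str) -> Path:
    """Leave db_path itself for Kuzu to create; it rejects pre-existing empty dirs."""
    path = Path(db_path)
    if path.is_dir() and not any(path.iterdir()):
        path.rmdir()  # stale empty dir from an earlier failed init
    path.parent.mkdir(parents=True, exist_ok=True)
    return path
```

Non-empty directories are deliberately left alone, so an existing database is never clobbered.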

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…o 100 agents

- agent_entrypoint.py: default AMPLIHACK_MEMORY_BACKEND=simple, graceful
  azure-servicebus import fallback, startup log with name/topology/transport/backend,
  combined 'memory initialized and entering OODA loop' log, event polling in OODA tick
- deploy.sh: pass memoryBackend=simple to Bicep template
- main.bicep: add memoryBackend param (default: simple), inject AMPLIHACK_MEMORY_BACKEND
  env var into all agent containers (Service Bus connection string and ANTHROPIC_API_KEY
  already injected as secrets); supports HIVE_AGENT_COUNT up to 100 via 20 apps x 5 agents

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…re Service Bus

When topology=distributed and transport=azure_service_bus, the Memory facade
now creates an InMemoryGraphStore wrapped in NetworkGraphStore instead of the
DHT-based DistributedGraphStore. The NetworkGraphStore publishes all writes to
the Service Bus topic and a background thread applies remote writes to the local
store, enabling cross-container memory sharing across all 20 Container Apps.

Also skips legacy DistributedHiveGraph construction when a non-local transport
is configured, preventing import failures in the container environment.

agent_entrypoint.py updated to pass topology="distributed" and backend="simple"
so agents report the correct topology in stats logs.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…test eval results

- distributed_hive_mind.md: add actual Azure resource names (ACR hivacrhivemind.azurecr.io,
  SB hive-sb-dj2qo2w7vu5zi/hive-graph/100 subs, Storage hivesadj2qo2w7vu5zi), fix region
  eastus→westus2, add 20 Container Apps amplihive-app-0..19 / 100 agents table, note Azure
  Files not used for Kuzu (POSIX lock limitation), update eval results table with federated
  smoke 65.7% and federated full 45.8% rows, add gossip full-graph-nodes note and shard
  rebuild on join detail
- agent_memory_architecture.md: add GraphStore 4-backend table, update gossip section with
  full graph node exchange and shard rebuild, update eval harness section with all three
  result rows and Azure deployment context
- memory_ooda_integration.md: update date to 2026-03-06, add transport topology appendix
  covering local/redis/azure_service_bus with actual Azure resource names and Kuzu limitation
- hive_mind_presentation.md: update eval results table labels (federated-10 vs federated-100),
  add routing precision degradation insight, add quality metrics (103 tests, 13 audit findings),
  update Azure deployment table with real resource names, add gossip full-graph-nodes detail

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
TASK 1 — New tutorial:
- Add docs/tutorial_prompt_to_distributed_hive.md: 10-step guide from
  single prompt to 100 distributed agents in Azure. Covers Memory facade
  zero-config, remember()/recall(), DistributedHiveGraph shared_hive,
  memory.yaml/env-var config, amplihack-hive create, HiveController
  manifest, deploy.sh, az containerapp monitoring, and /learn and /query
  HTTP verification. References real Azure resources: hive-mind-rg,
  hivacrhivemind, hive-sb-dj2qo2w7vu5zi.

TASK 2 — Stale doc refresh:
- docs/hive_mind/GETTING_STARTED.md: fix hive-mind-eval-rg → hive-mind-rg,
  add hivacrhivemind ACR and hive-sb-dj2qo2w7vu5zi Service Bus names,
  update Difference table to mention DistributedHiveGraph + NetworkGraphStore
- docs/hive_mind/TUTORIAL.md: same Azure resource and env-var fixes
- docs/hive_mind_presentation.md: update stale baseline from 94.1% →
  90.47% (dataset 5000t-seed42-v1.0, median-of-3), fix gap 28.4pp →
  24.77pp, add category breakdown table, update speaker notes throughout

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Development

Successfully merging this pull request may close these issues.

eval: 5000-turn long horizon results — pre-built DB regression + federated 100-agent OOM

1 participant