
fix(docker): migrate single-node compose from host to bridge networking #2952

Open
bitflicker64 wants to merge 6 commits into apache:master from bitflicker64:docker-fix-bridge-network

Conversation

@bitflicker64
Contributor

@bitflicker64 bitflicker64 commented Feb 15, 2026

Purpose of the PR

Fix the single-node Docker deployment failing on macOS and Windows due to Linux-only host networking.

Closes #2951

Main Changes

  • Remove network_mode: host from the single-node Docker Compose setup

  • Use default bridge networking

  • Add explicit port mappings

    • 8080: Server HTTP
    • 8520: Store HTTP
    • 8620: PD HTTP
  • Add configuration volume mounts

    • docker/pd-conf
    • docker/store-conf
    • docker/server-conf
  • Replace localhost and non-routable addresses with container hostnames

    • PD gRPC host set to pd
    • Store gRPC host set to store
    • Server pd.peers set to pd:8686
  • Update healthcheck endpoints

Problem

The original single-node Docker configuration uses network_mode: host.

This only works on native Linux. Docker Desktop on macOS and Windows does not implement host networking the same way: containers start, but HugeGraph services advertise incorrect addresses such as 127.0.0.1 or 0.0.0.0.

Resulting failures:

  • Server stuck in loop waiting for storage backend
  • PD client UNAVAILABLE io exception errors
  • Store reports zero partitions
  • Cluster never becomes usable even though containers are running

The issue is not process failure but invalid service discovery and advertised endpoints.

Root Cause

  • network_mode: host is Linux-specific
  • Docker Desktop falls back to bridge networking
  • HugeGraph components still advertise localhost-style addresses
  • Other containers cannot route to those addresses

Solution

Switch to bridge networking and advertise container-resolvable hostnames.

Docker DNS resolves service names automatically. Services bind normally while exposing correct internal endpoints.
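The resulting topology can be sketched roughly as follows (service names, hostnames, and ports are taken from this PR; all other keys are trimmed or assumed):

```yaml
# Minimal sketch of the bridge-networking layout described above.
# Service names double as DNS names on the hg-net bridge network,
# so "pd", "store", and "server" are routable between containers.
services:
  pd:
    hostname: pd
    networks: [hg-net]
    ports: ["8620:8620", "8686:8686"]   # PD REST + gRPC
  store:
    hostname: store
    networks: [hg-net]
    ports: ["8520:8520"]                # Store REST
    depends_on:
      pd:
        condition: service_healthy
  server:
    hostname: server
    networks: [hg-net]
    ports: ["8080:8080"]                # Server HTTP
networks:
  hg-net:
    driver: bridge
```

Because every service joins the same user-defined bridge network, each one can advertise its own service name (pd, store, server) instead of 127.0.0.1, and those names resolve identically on Linux, macOS, and Windows.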

Verification

Observed behavior after changes on Docker Desktop macOS:

Container state

docker ps --format "table {{.Names}}\t{{.Status}}\t{{.Ports}}"

server   Up healthy   0.0.0.0:8080->8080
store    Up healthy   0.0.0.0:8520->8520
pd       Up healthy   0.0.0.0:8620->8620

Server startup sequence

Hugegraph server are waiting for storage backend
Initializing HugeGraph Store
Starting HugeGraphServer ... OK
Started

Endpoints

Server:

curl http://localhost:8080

Returns service metadata.

Store:

curl http://localhost:8520

Returns a non-zero leader and partition count:

{"leaderCount":12,"partitionCount":12}

PD:

curl http://localhost:8620

Returns expected auth response, confirming service availability.

Cluster becomes operational after initialization delay.

Why This Works

  • Bridge networking is cross-platform
  • Container names become stable service addresses
  • No platform-dependent networking behavior
  • Services advertise routable endpoints

Does this PR potentially affect the following parts

  • Modify configurations
  • Dependencies
  • The public API
  • Other affects
  • Nope

Documentation Status

  • Doc No Need

Changes Checklist

  • Updated Docker networking configuration
  • Added / verified required port mappings
  • Adjusted service communication to use container hostnames
  • Validated environment-based configuration
  • Verified PD, Store, and Server containers start correctly
  • Confirmed single-node cluster reaches healthy state
  • Confirmed partition assignment and leader election
  • Validate multi-node (3-node) cluster deployment
  • Update documentation if required

Replace network_mode: host with explicit port mappings and add configuration
volumes for PD, Store, and Server services to support macOS/Windows Docker.

- Remove host network mode from all services
- Add explicit port mappings (8620, 8520, 8080)
- Add configuration directories with volume mounts
- Update healthcheck endpoints
- Add PD peers environment variable

Enables HugeGraph cluster to run on all Docker platforms.
@dosubot dosubot bot added size:XL This PR changes 500-999 lines, ignoring generated files. pd PD module store Store module labels Feb 15, 2026
@dosubot dosubot bot added size:XXL This PR changes 1000+ lines, ignoring generated files. and removed size:XL This PR changes 500-999 lines, ignoring generated files. labels Feb 15, 2026
@dosubot dosubot bot added size:L This PR changes 100-499 lines, ignoring generated files. and removed size:XXL This PR changes 1000+ lines, ignoring generated files. labels Feb 20, 2026
@bitflicker64
Contributor Author

Bridge networking changes have been validated successfully across environments:

  • macOS (Docker Desktop)
  • Ubuntu 24.04.4 LTS

Observed behavior:

  • PD container starts and becomes healthy
  • Store container starts, registers, and receives partitions
  • Partitions are assigned and Raft leaders are elected
  • Server container initializes without errors
  • REST endpoints respond as expected

No regressions were observed in the single-node deployment. Service discovery and inter-container communication function correctly under bridge networking.


ARM64 Compatibility Fix — wait-storage.sh

Problem

The original wait-storage.sh relied on gremlin-console.sh for storage readiness detection.

On ARM64 (Apple Silicon), this fails due to a Jansi native library crash.


Root Cause

  • gremlin-console.sh depends on Jansi, which is unstable on ARM64
  • The detection logic is triggered only when hugegraph.* environment variables are used
  • Volume-mounted configurations bypass this code path, masking the failure

Fix

Replaced the Gremlin Console detection with a lightweight PD REST health check.
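A hedged sketch of such a polling loop (the /v1/health path and the PD_REST / WAIT_STORAGE_TIMEOUT_S names come from this PR's discussion; defaults and exact structure here are assumptions, not the final script):

```shell
#!/usr/bin/env bash
# Sketch: poll PD's REST health endpoint instead of launching the
# Gremlin console, which crashes on ARM64 due to Jansi.

wait_for_pd() {
  local host_port="$1" timeout_s="$2"
  local url="http://${host_port}/v1/health"
  local deadline=$(( $(date +%s) + timeout_s ))
  until curl -fsS "$url" >/dev/null 2>&1; do
    if [ "$(date +%s)" -ge "$deadline" ]; then
      echo "ERROR: timed out waiting for PD at ${url}" >&2
      return 1
    fi
    echo "Waiting for PD at ${url} ..."
    sleep 5
  done
  echo "PD is healthy at ${url}"
}

# Usage (values matching the compose file in this PR):
#   wait_for_pd "${PD_REST:-pd:8620}" "${WAIT_STORAGE_TIMEOUT_S:-300}"
```

Unlike the Gremlin-based check, this needs only curl, so it behaves the same on ARM64 and x86_64.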

Cleanup

detect-storage.groovy is no longer required by the updated startup flow and can be removed.

Contributor

Copilot AI left a comment


Pull request overview

This PR migrates the single-node Docker Compose configuration from Linux-specific host networking to cross-platform bridge networking. The change addresses a critical issue where Docker Desktop on macOS and Windows doesn't support host networking properly, causing services to advertise unreachable addresses and preventing cluster initialization.

Changes:

  • Replaced host networking with bridge networking and explicit port mappings
  • Added comprehensive environment-based configuration for PD, Store, and Server through new entrypoint scripts
  • Implemented health-aware startup with PD REST endpoint polling in wait-storage.sh
  • Added volume mounts for persistent data and deprecated variable migration guards

Reviewed changes

Copilot reviewed 4 out of 5 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
docker/docker-compose.yml Migrated from host to bridge networking, added environment variables, updated healthchecks, exposed required ports
hugegraph-pd/hg-pd-dist/docker/docker-entrypoint.sh New comprehensive entrypoint with SPRING_APPLICATION_JSON configuration, deprecation guards, and required variable validation
hugegraph-store/hg-store-dist/docker/docker-entrypoint.sh New comprehensive entrypoint with SPRING_APPLICATION_JSON configuration, deprecation guards, and required variable validation
hugegraph-server/hugegraph-dist/docker/docker-entrypoint.sh Refactored to use environment variables for backend and PD configuration with deprecation guards
hugegraph-server/hugegraph-dist/src/assembly/static/bin/wait-storage.sh Replaced Gremlin-based storage detection with PD REST health endpoint polling, increased timeout to 300s


Comment on lines 34 to 95

     container_name: hg-pd
     hostname: pd
-    network_mode: host
     restart: unless-stopped
+    networks: [hg-net]
     environment:
       HG_PD_GRPC_HOST: pd
       HG_PD_GRPC_PORT: "8686"
       HG_PD_REST_PORT: "8620"
       HG_PD_RAFT_ADDRESS: pd:8610
       HG_PD_RAFT_PEERS_LIST: pd:8610
       HG_PD_INITIAL_STORE_LIST: store:8500
       HG_PD_DATA_PATH: /hugegraph-pd/pd_data
     ports:
       - "8620:8620"
       - "8686:8686"
     volumes:
       - hg-pd-data:/hugegraph-pd/pd_data
     healthcheck:
-      test: ["CMD", "curl", "-f", "http://localhost:8620"]
+      test: ["CMD-SHELL", "curl -fsS http://localhost:8620/v1/health >/dev/null || exit 1"]
       interval: 10s
       timeout: 5s
-      retries: 3
+      retries: 12
+      start_period: 20s

   store:
-    image: hugegraph/store
-    container_name: store
+    build:
+      context: ..
+      dockerfile: hugegraph-store/Dockerfile
+    image: hugegraph/store:${HUGEGRAPH_VERSION:-1.7.0}
+    container_name: hg-store
     hostname: store
-    network_mode: host
     restart: unless-stopped
+    networks: [hg-net]
     depends_on:
       pd:
         condition: service_healthy
     environment:
       HG_STORE_PD_ADDRESS: pd:8686
       HG_STORE_GRPC_HOST: store
       HG_STORE_GRPC_PORT: "8500"
       HG_STORE_REST_PORT: "8520"
       HG_STORE_RAFT_ADDRESS: store:8510
       HG_STORE_DATA_PATH: /hugegraph-store/storage
     ports:
       - "8520:8520"
       - "8500:8500"
       - "8510:8510"
     volumes:
       - hg-store-data:/hugegraph-store/storage
     healthcheck:
-      test: ["CMD", "curl", "-f", "http://localhost:8520"]
+      test: ["CMD-SHELL", "curl -fsS http://localhost:8520/v1/health >/dev/null && curl -fsS http://pd:8620/v1/health >/dev/null || exit 1"]
       interval: 10s
-      timeout: 5s
-      retries: 3
+      timeout: 10s
+      retries: 30
+      start_period: 30s

   server:
-    image: hugegraph/server
-    container_name: server
+    build:
+      context: ..
+      dockerfile: hugegraph-server/Dockerfile-hstore
+    image: hugegraph/server:${HUGEGRAPH_VERSION:-1.7.0}
+    container_name: hg-server

Copilot AI Feb 20, 2026


The container names have been changed from "pd", "store", "server" to "hg-pd", "hg-store", "hg-server". This is a breaking change for users who may reference these containers by name in scripts, monitoring tools, or documentation. While the service names (used for DNS resolution within the Docker network) remain unchanged, external references using docker commands will break. Consider documenting this breaking change in the PR description or release notes, or if backward compatibility is important, keep the original container names.

@bitflicker64
Contributor Author

Thank you for the review. I'll take care of the suggested adjustments and proceed with testing the 3-node cluster configuration next.

" || echo "Warning: Timeout waiting for storage, proceeding anyway..."
else
echo "No pd.peers configured, skipping storage wait..."
fi
Member


‼️ Critical: wait-storage check downgraded from actual graph connectivity test to PD health ping only

The PR deletes detect-storage.groovy and replaces the Gremlin-based storage readiness check with a simple curl http://${PD_REST}/v1/health + a fixed 10s sleep. This is a significant regression in correctness:

  • PD being healthy does not mean: Store has registered, partitions are assigned, or the server can actually read/write the backend
  • The old check used the Gremlin console to open the actual graph — a true end-to-end proof that storage is ready
  • On timeout the new code silently proceeds: echo "Warning: Timeout waiting for storage, proceeding anyway..." — the server can start in a broken state with no hard failure

Suggested approach: restore a functional storage readiness probe. At minimum, poll the store's partition/leader endpoint (e.g. /v1/health or /v1/partitions) and verify non-zero partition/leader count before proceeding. Fail hard on timeout rather than warning-and-continue.

Suggested change
fi
timeout "${WAIT_STORAGE_TIMEOUT_S}s" bash -c "
  until curl -fsS http://${PD_REST}/v1/health >/dev/null 2>&1 &&
        [ \"\$(curl -fsS http://${STORE_REST}/v1/partitions 2>/dev/null |
              python3 -c 'import sys,json; print(json.load(sys.stdin).get(\"partitionCount\", 0))' 2>/dev/null)\" -gt 0 ]; do
    echo 'Waiting for storage backend (PD + Store partitions)...'
    sleep 5
  done
  echo 'Storage backend is ready!'
" || { echo 'ERROR: Timeout waiting for storage backend'; exit 1; }

Contributor Author

@bitflicker64 bitflicker64 Feb 24, 2026


Partition-based readiness checks may be unreliable since partition assignment occurs asynchronously after wait-storage completes, and a properly registered Store can legitimately report partitionCount = 0 during normal initialization. Interpreting this as a failure condition could unintentionally block startup in otherwise healthy clusters. Would it make sense to consider separating validation into pre-startup checks (Store/PD availability) and post-startup checks (partition stabilization) instead?

Member


> Partition-based readiness checks may be unreliable since partition assignment occurs asynchronously after wait-storage completes, and a properly registered Store can legitimately report partitionCount = 0 during normal initialization. Interpreting this as a failure condition could unintentionally block startup in otherwise healthy clusters. Would it make sense to consider separating validation into pre-startup checks (Store/PD availability) and post-startup checks (partition stabilization) instead?

Your point is right: partitionCount can legitimately be 0 during early initialization, so using it as a strict pre-start gate can block healthy clusters.

The core concern I want to preserve is:

  1. Current readiness is too weak (PD /v1/health + fixed sleep only).
  2. Timeout currently does warning + continue, which allows starting in a broken state (no fail-fast).

Maybe we could split validation into pre-start hard gates vs post-start stabilization checks:

docker compose up
   |
   v
[Pre-start Gate]  (must pass, otherwise exit)
  - PD /v1/health == OK
  - Store /v1/health == OK
  - PD /v1/stores has at least 1 Up store
   |
   v
[Init-store with retries] (must succeed, otherwise exit)
   |
   v
[Start server]
   |
   v
[Post-start checks] (non-blocking)
  - observe partition/leader stabilization
  - warn if not converged yet, but do not treat as pre-start failure
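The pre-start gate above could be sketched roughly like this (a hedged illustration, not the final script: the /v1/health and /v1/stores paths come from this thread, while the JSON shape parsed by count_up_stores is an assumption):

```shell
#!/usr/bin/env bash
# Rough sketch of the proposed pre-start hard gate.
set -euo pipefail

# Count stores reported as "Up" in a PD /v1/stores JSON payload (stdin).
# The payload shape handled here is an assumption for illustration.
count_up_stores() {
  python3 -c '
import json, sys
data = json.load(sys.stdin)
stores = data.get("stores", []) if isinstance(data, dict) else data
print(sum(1 for s in stores if s.get("state") == "Up"))
'
}

# Hard gate: PD healthy, Store healthy, and at least one Up store in PD.
# Any failure returns non-zero so the caller can fail fast instead of
# starting the server in a broken state.
pre_start_gate() {
  local pd_rest="$1" store_rest="$2"
  curl -fsS "http://${pd_rest}/v1/health" >/dev/null || return 1
  curl -fsS "http://${store_rest}/v1/health" >/dev/null || return 1
  local up
  up=$(curl -fsS "http://${pd_rest}/v1/stores" | count_up_stores) || return 1
  [ "$up" -ge 1 ]
}

# Usage: pre_start_gate "pd:8620" "store:8520" || exit 1
```

Partition and leader convergence would then be observed after server start, as a non-blocking check, matching the split described above.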

Contributor Author


Thanks for the clarification; that matches what I observed during testing. I've updated my PR to separate pre-start readiness checks from post-start partition stabilization.

Contributor Author


In a follow-up commit I will add hard gates to the pre-start check (wait-storage.sh).

networks:
  hg-net:
    driver: bridge

Member

@imbajin imbajin Feb 22, 2026


⚠️ build: in quickstart compose can unexpectedly trigger local source builds — prefer pull-only defaults

Using build: together with image: does not always force a local build. In Compose, the usual behavior (depending on pull_policy) is to try pulling first, then fall back to building if pull is unavailable.

That said, this is still risky for a quickstart file:

  1. If the image/tag cannot be pulled, users unexpectedly need the full source tree and Docker build context.
  2. Startup becomes much slower because services may be built locally.
  3. Locally built images can differ from official release artifacts, reducing reproducibility.

Since docker/docker-compose.yml is intended for quickstart usage, I recommend keeping it pull-only by default and moving build: blocks to a dev-specific override (for example, docker-compose.dev.yml).

Suggested change
  pd:
    image: hugegraph/pd:${HUGEGRAPH_VERSION:-1.7.0}
    container_name: hg-pd
    hostname: pd

If local builds are needed for development, users can combine files explicitly, e.g.:
docker compose -f docker/docker-compose.yml -f docker/docker-compose.dev.yml up
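Such a dev override might look roughly like this (the file name is the one suggested above; the store/server Dockerfile paths appear in this PR's diff, while the pd Dockerfile path is an assumption):

```yaml
# docker-compose.dev.yml (hypothetical sketch): layered on top of the
# pull-only quickstart file, this restores local build contexts for
# development without affecting the default `docker compose up` path.
services:
  pd:
    build:
      context: ..
      dockerfile: hugegraph-pd/Dockerfile   # assumed path
  store:
    build:
      context: ..
      dockerfile: hugegraph-store/Dockerfile
  server:
    build:
      context: ..
      dockerfile: hugegraph-server/Dockerfile-hstore
```

With this split, quickstart users always pull released images, and only developers who explicitly pass both files trigger local builds.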

Contributor Author


I’ve changed the compose setup so it’s pull-only and no longer builds locally. Right now, I’m bind-mounting the updated entrypoint scripts because the current published images don’t include these changes.
Once this PR is done and new images are available, I can remove the mounts. I’ll follow up with a small cleanup PR to switch completely to the official images.

RegisterUtil.registerRocksDB()
RegisterUtil.registerHBase()

graph = HugeFactory.open('./conf/graphs/hugegraph.properties')
Member


‼️ Critical: deleting this script while related Docker references remain can break build workflows

This file is removed in this PR, but the server Docker build flow still references detect-storage.groovy in current Dockerfiles. That creates an inconsistency and can fail local build paths when Compose/build expects the file.

Please align this PR by either:

  1. removing stale COPY ... detect-storage.groovy references from server Dockerfiles, or
  2. keeping a temporary compatibility stub until those references are removed.


Labels

pd PD module size:L This PR changes 100-499 lines, ignoring generated files. store Store module

Projects

Status: In progress

Development

Successfully merging this pull request may close these issues.

[Bug] Single node Docker setup does not work on macOS and Windows because of host networking

2 participants