From b04e7ef9e49ef86e65e932172101ea15654d849d Mon Sep 17 00:00:00 2001
From: Katrina Van Meer
Date: Fri, 16 Jan 2026 15:37:42 -0700
Subject: [PATCH] docs: add postgres replication slot is active guide

---
 .../postgres/replication-slot-active.md            | 151 ++++++++++++++++++
 .../ingest-data/postgres/troubleshooting.md        |   1 +
 2 files changed, 152 insertions(+)
 create mode 100644 doc/user/content/ingest-data/postgres/replication-slot-active.md

diff --git a/doc/user/content/ingest-data/postgres/replication-slot-active.md b/doc/user/content/ingest-data/postgres/replication-slot-active.md
new file mode 100644
index 0000000000000..9638f8271250f
--- /dev/null
+++ b/doc/user/content/ingest-data/postgres/replication-slot-active.md
@@ -0,0 +1,151 @@
+---
+title: "Troubleshooting: Replication slot is active"
+description: "How to troubleshoot and resolve the replication slot is active error with PostgreSQL sources in Materialize"
+menu:
+  main:
+    parent: "pg-troubleshooting"
+    name: "Replication slot is active"
+    weight: 15
+---
+
+This guide helps you troubleshoot and resolve the "replication slot... is active"
+error that can occur with PostgreSQL sources in Materialize.
+
+## What this error means
+
+You may see an error like:
+
+```nofmt
+postgres: ERROR replication slot "materialize_1002f37c6eeb4c28b052fa3805d46baa" is active for PID 610
+```
+
+This error means that Materialize tried to use a replication slot that
+PostgreSQL considers already in use by another connection. Each replication
+slot can have only one active consumer at a time. The error is generally
+transient and indicates a race condition in connection management between
+Materialize and PostgreSQL.
+
+## Common causes
+
+- **PostgreSQL or Materialize maintenance**: During maintenance windows,
+  Materialize clusters or PostgreSQL instances may restart, temporarily
+  interrupting active replication connections. When the cluster comes back
+  online, it may attempt to reconnect to the replication slot before PostgreSQL
+  has fully released the previous connection.
+- **Cluster replica changes**: Adding replicas to your Materialize source
+  cluster can cause the ingestion process to restart on a different replica,
+  triggering this error when more than one replica attempts to use the same
+  replication slot.
+- **Source restarts or failover**: When a source cluster restarts or fails over
+  to another replica, the new replica may attempt to connect to the replication
+  slot before PostgreSQL has released the previous connection.
+- **Network interruptions**: Brief network disconnections can leave PostgreSQL
+  believing a connection is still active after Materialize has already moved on
+  to a new connection attempt.
+- **PostgreSQL connection cleanup delays**: PostgreSQL may not immediately
+  release replication slots when connections are terminated, especially if the
+  termination was not graceful.
+- **Multiple environments or sources**: Accidentally configuring multiple
+  Materialize sources to use the same replication slot (see the example query
+  after this list).
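+
+To rule out the last cause from the Materialize side, you can list the
+replication slot recorded for each PostgreSQL source. The exact system catalog
+layout can vary between Materialize versions; the query below is a sketch that
+assumes `mz_internal.mz_postgres_sources` exposes a `replication_slot` column:
+
+```mzsql
+-- List each PostgreSQL source and the replication slot it uses.
+-- Assumes mz_internal.mz_postgres_sources exposes a replication_slot column;
+-- adjust for your Materialize version if the catalog differs.
+SELECT s.name AS source_name, pg.replication_slot
+FROM mz_internal.mz_postgres_sources pg
+JOIN mz_sources s ON s.id = pg.id
+ORDER BY pg.replication_slot;
+```
+
+If you run multiple Materialize environments against the same PostgreSQL
+database, also compare slot names across environments: no two sources should
+share a slot.
+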
+## Diagnosing the issue
+
+### Check replication slot status in PostgreSQL
+
+Connect to your PostgreSQL database and run:
+
+```sql
+SELECT
+  slot_name,
+  active,
+  active_pid,
+  restart_lsn,
+  confirmed_flush_lsn
+FROM pg_replication_slots
+WHERE slot_name LIKE '%materialize%';
+```
+
+Look for:
+
+- **active = true**: The slot is currently in use
+- **active_pid**: The PostgreSQL backend process ID that is using the slot
+- Multiple slots with similar names, which might indicate configuration issues
+
+### Check for active connections
+
+Check which connections are using the replication slot:
+
+```sql
+SELECT
+  pid,
+  usename,
+  application_name,
+  client_addr,
+  state,
+  backend_start
+FROM pg_stat_activity
+WHERE pid IN (
+  SELECT active_pid
+  FROM pg_replication_slots
+  WHERE slot_name LIKE '%materialize%'
+);
+```
+
+## Resolution
+
+### Immediate fix: Wait for automatic recovery
+
+In most cases, Materialize will automatically reconnect and recover from this
+error. The error is typically transient and resolves itself within a few
+minutes as PostgreSQL releases the previous connection.
+
+### If automatic recovery fails: Terminate stale connections
+
+If the error persists and you've verified that there are no legitimate active
+connections, you can terminate the stale connection in PostgreSQL:
+
+{{< warning >}}
+Only terminate connections if you're certain they are stale. Terminating an
+active replication connection will interrupt data ingestion.
+{{< /warning >}}
+
+```sql
+-- First, identify the PID from the error message or from pg_replication_slots
+SELECT
+  slot_name,
+  active_pid
+FROM pg_replication_slots
+WHERE slot_name LIKE 'materialize%' AND active = true;
+
+-- Terminate the connection (replace ### with the actual PID)
+SELECT pg_terminate_backend(###);
+```
+
+After terminating the connection, Materialize should be able to reconnect to
+the replication slot.
+
+## Prevention
+
+**Best practices to avoid this error:**
+
+- **Avoid frequent replica changes**: Minimize adding or removing replicas from
+  source clusters during active replication. Plan cluster sizing changes during
+  maintenance windows when possible.
+- **Ensure stable network connectivity**: Maintain reliable network connections
+  between Materialize and PostgreSQL to prevent connection interruptions.
+- **Monitor source health**: Regularly check source status to detect and address
+  issues early (see also the status history example at the end of this guide):
+
+```mzsql
+SELECT *
+FROM mz_internal.mz_source_statuses;
+```
+
+- **Configure connection timeouts appropriately**: Ensure PostgreSQL settings
+  such as `wal_sender_timeout` and TCP keepalives allow disconnected replication
+  sessions to be cleaned up promptly.
+- **Use unique replication slots**: Verify that each Materialize source uses a
+  unique replication slot, and avoid reusing slot names across different sources
+  or environments.
+- **Plan for graceful restarts**: When performing maintenance that requires
+  source restarts, allow sufficient time for PostgreSQL to release connections
+  before restarting.
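+
+## Verifying recovery
+
+To confirm that an occurrence of this error was transient, review the recent
+status transitions for the affected source. The query below is a sketch that
+assumes `mz_internal.mz_source_status_history` records per-source transitions
+with `occurred_at`, `status`, and `error` columns:
+
+```mzsql
+-- Recent status transitions for PostgreSQL sources.
+-- Assumes mz_internal.mz_source_status_history exposes occurred_at, status,
+-- and error columns; adjust for your Materialize version if the catalog
+-- differs.
+SELECT h.occurred_at, s.name, h.status, h.error
+FROM mz_internal.mz_source_status_history h
+JOIN mz_sources s ON s.id = h.source_id
+WHERE s.type = 'postgres'
+ORDER BY h.occurred_at DESC
+LIMIT 20;
+```
+
+A brief `stalled` entry that mentions the replication slot, followed by a
+return to `running`, typically means Materialize recovered automatically and
+no manual intervention was needed.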
diff --git a/doc/user/content/ingest-data/postgres/troubleshooting.md b/doc/user/content/ingest-data/postgres/troubleshooting.md
index 7517066a27357..a7cdc5ac78e73 100644
--- a/doc/user/content/ingest-data/postgres/troubleshooting.md
+++ b/doc/user/content/ingest-data/postgres/troubleshooting.md
@@ -24,3 +24,4 @@ common questions about PostgreSQL sources, see the [FAQ](/ingest-data/postgres/f
 |-------|-------------|
 | [Slot overcompacted](/ingest-data/postgres/slot-overcompacted/) | Resolve errors when PostgreSQL removes WAL data before Materialize can read it |
 | [Connection Closed](/ingest-data/postgres/connection-closed/) | Resolve unexpected networking connection terminations between Materialize and PostgreSQL |
+| [Replication slot is active](/ingest-data/postgres/replication-slot-active/) | Resolve errors when Materialize attempts to use a replication slot that is already in use by another active connection |