From b04e7ef9e49ef86e65e932172101ea15654d849d Mon Sep 17 00:00:00 2001
From: Katrina Van Meer
Date: Fri, 16 Jan 2026 15:37:42 -0700
Subject: [PATCH] docs: add postgres replication slot is active guide

---
 .../postgres/replication-slot-active.md            | 151 ++++++++++++++++++
 .../ingest-data/postgres/troubleshooting.md        |   1 +
 2 files changed, 152 insertions(+)
 create mode 100644 doc/user/content/ingest-data/postgres/replication-slot-active.md

diff --git a/doc/user/content/ingest-data/postgres/replication-slot-active.md b/doc/user/content/ingest-data/postgres/replication-slot-active.md
new file mode 100644
index 0000000000000..9638f8271250f
--- /dev/null
+++ b/doc/user/content/ingest-data/postgres/replication-slot-active.md
@@ -0,0 +1,151 @@
+---
+title: "Troubleshooting: Replication slot is active"
+description: "How to troubleshoot and resolve the replication slot is active error with PostgreSQL sources in Materialize"
+menu:
+  main:
+    parent: "pg-troubleshooting"
+    name: "Replication slot is active"
+    weight: 15
+---
+
+This guide helps you troubleshoot and resolve the "replication slot... is active"
+error that can occur with PostgreSQL sources in Materialize.
+
+## What this error means
+
+You may see an error like:
+
+```nofmt
+postgres: ERROR replication slot "materialize_1002f37c6eeb4c28b052fa3805d46baa" is active for PID 610
+```
+
+This error means that Materialize tried to use a replication slot that
+PostgreSQL considers already in use by another connection. Each replication
+slot can have only one active consumer at a time. The error is generally
+transient and indicates a race condition in connection management between
+Materialize and PostgreSQL.
+
+## Common causes
+
+- **PostgreSQL or Materialize maintenance**: During maintenance windows,
+  Materialize clusters or PostgreSQL instances may restart, temporarily
+  interrupting active replication connections. When the cluster comes back
+  online, it may attempt to reconnect to the replication slot before PostgreSQL
+  has fully released the previous connection.
+- **Cluster replica changes**: Adding replicas to your Materialize source
+  cluster can cause the ingestion process to restart on a different replica,
+  triggering this error when more than one replica attempts to use the same
+  replication slot.
+- **Source restarts or failover**: When a source cluster restarts or fails over
+  to another replica, the new replica may attempt to connect to the replication
+  slot before PostgreSQL has released the previous connection.
+- **Network interruptions**: Brief network disconnections can leave PostgreSQL
+  believing a connection is still active after Materialize has already moved on
+  to a new connection attempt.
+- **PostgreSQL connection cleanup delays**: PostgreSQL may not immediately
+  release replication slots when connections are terminated, especially if the
+  termination was not graceful.
+- **Multiple environments or sources**: Accidentally configuring multiple
+  Materialize sources to use the same replication slot (see the example query
+  after this list).
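+
+To rule out the last cause from the Materialize side, you can list the
+replication slot recorded for each PostgreSQL source. The exact system catalog
+layout can vary between Materialize versions; the query below is a sketch that
+assumes `mz_internal.mz_postgres_sources` exposes a `replication_slot` column:
+
+```mzsql
+-- List each PostgreSQL source and the replication slot it uses.
+-- Assumes mz_internal.mz_postgres_sources exposes a replication_slot column;
+-- adjust for your Materialize version if the catalog differs.
+SELECT s.name AS source_name, pg.replication_slot
+FROM mz_internal.mz_postgres_sources pg
+JOIN mz_sources s ON s.id = pg.id
+ORDER BY pg.replication_slot;
+```
+
+If you run multiple Materialize environments against the same PostgreSQL
+database, also compare slot names across environments: no two sources should
+share a slot.
+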
+## Diagnosing the issue
+
+### Check replication slot status in PostgreSQL
+
+Connect to your PostgreSQL database and run:
+
+```sql
+SELECT
+  slot_name,
+  active,
+  active_pid,
+  restart_lsn,
+  confirmed_flush_lsn
+FROM pg_replication_slots
+WHERE slot_name LIKE '%materialize%';
+```
+
+Look for:
+
+- **active = true**: The slot is currently in use
+- **active_pid**: The PostgreSQL backend process ID that is using the slot
+- Multiple slots with similar names, which might indicate configuration issues
+
+### Check for active connections
+
+Check which connections are using the replication slot:
+
+```sql
+SELECT
+  pid,
+  usename,
+  application_name,
+  client_addr,
+  state,
+  backend_start
+FROM pg_stat_activity
+WHERE pid IN (
+  SELECT active_pid
+  FROM pg_replication_slots
+  WHERE slot_name LIKE '%materialize%'
+);
+```
+
+## Resolution
+
+### Immediate fix: Wait for automatic recovery
+
+In most cases, Materialize will automatically reconnect and recover from this
+error. The error is typically transient and resolves itself within a few
+minutes as PostgreSQL releases the previous connection.
+
+### If automatic recovery fails: Terminate stale connections
+
+If the error persists and you've verified that there are no legitimate active
+connections, you can terminate the stale connection in PostgreSQL:
+
+{{< warning >}}
+Only terminate connections if you're certain they are stale. Terminating an
+active replication connection will interrupt data ingestion.
+{{< /warning >}}
+
+```sql
+-- First, identify the PID from the error message or from pg_replication_slots
+SELECT
+  slot_name,
+  active_pid
+FROM pg_replication_slots
+WHERE slot_name LIKE 'materialize%' AND active = true;
+
+-- Terminate the connection (replace ### with the actual PID)
+SELECT pg_terminate_backend(###);
+```
+
+After terminating the connection, Materialize should be able to reconnect to
+the replication slot.
+
+## Prevention
+
+**Best practices to avoid this error:**
+
+- **Avoid frequent replica changes**: Minimize adding or removing replicas from
+  source clusters during active replication. Plan cluster sizing changes during
+  maintenance windows when possible.
+- **Ensure stable network connectivity**: Maintain reliable network connections
+  between Materialize and PostgreSQL to prevent connection interruptions.
+- **Monitor source health**: Regularly check source status to detect and address
+  issues early (see also the status history example at the end of this guide):
+
+```mzsql
+SELECT *
+FROM mz_internal.mz_source_statuses;
+```
+
+- **Configure connection timeouts appropriately**: Ensure PostgreSQL settings
+  such as `wal_sender_timeout` and TCP keepalives allow disconnected replication
+  sessions to be cleaned up promptly.
+- **Use unique replication slots**: Verify that each Materialize source uses a
+  unique replication slot, and avoid reusing slot names across different sources
+  or environments.
+- **Plan for graceful restarts**: When performing maintenance that requires
+  source restarts, allow sufficient time for PostgreSQL to release connections
+  before restarting.
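+
+## Verifying recovery
+
+To confirm that an occurrence of this error was transient, review the recent
+status transitions for the affected source. The query below is a sketch that
+assumes `mz_internal.mz_source_status_history` records per-source transitions
+with `occurred_at`, `status`, and `error` columns:
+
+```mzsql
+-- Recent status transitions for PostgreSQL sources.
+-- Assumes mz_internal.mz_source_status_history exposes occurred_at, status,
+-- and error columns; adjust for your Materialize version if the catalog
+-- differs.
+SELECT h.occurred_at, s.name, h.status, h.error
+FROM mz_internal.mz_source_status_history h
+JOIN mz_sources s ON s.id = h.source_id
+WHERE s.type = 'postgres'
+ORDER BY h.occurred_at DESC
+LIMIT 20;
+```
+
+A brief `stalled` entry that mentions the replication slot, followed by a
+return to `running`, typically means Materialize recovered automatically and
+no manual intervention was needed.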
diff --git a/doc/user/content/ingest-data/postgres/troubleshooting.md b/doc/user/content/ingest-data/postgres/troubleshooting.md
index 7517066a27357..a7cdc5ac78e73 100644
--- a/doc/user/content/ingest-data/postgres/troubleshooting.md
+++ b/doc/user/content/ingest-data/postgres/troubleshooting.md
@@ -24,3 +24,4 @@ common questions about PostgreSQL sources, see the [FAQ](/ingest-data/postgres/f
 |-------|-------------|
 | [Slot overcompacted](/ingest-data/postgres/slot-overcompacted/) | Resolve errors when PostgreSQL removes WAL data before Materialize can read it |
 | [Connection Closed](/ingest-data/postgres/connection-closed/) | Resolve unexpected networking connection terminations between Materialize and PostgreSQL |
+| [Replication slot is active](/ingest-data/postgres/replication-slot-active/) | Resolve errors when Materialize attempts to use a replication slot that is already in use by another active connection |