Copying Postgres to DuckDB Without Losing Your Mind (Hands-On Stack)

This directory turns the blog playbook into a reproducible project. Spin up Postgres + DuckDB + a 150-line Polars script, then continuously sync live rows from Postgres into DuckDB without Kafka, Debezium, or Airflow.

Why this stack?

Realistic demo: Postgres table with updated_at trigger, soft deletes (deleted_at), and fake SaaS data ready on container start.
Minimal moving parts: A single Python process batsched rows to Parquet, loads them into DuckDB via INSERT OR REPLACE, and tracks checkpoints in SQLite.
Crash safe: Last sync timestamp + row count live in SQLite. When the container restarts, the loop resumes exactly where it stopped.
Schema aware: New Postgres columns are detected automatically and added to DuckDB before inserts land.
Docker native: docker compose up -d --build launches the full environment; helpers show how to seed changes and inspect DuckDB.

Stack layout

postgres-duckdb-sync/
├── docker-compose.yml          # Postgres + syncer service
├── app/
│   ├── Dockerfile
│   ├── requirements.txt
│   ├── src/sync.py             # 150-line incremental sync script
│   └── scripts/seed_changes.py # Helper to simulate live inserts/updates/deletes
├── postgres/init/              # Schema, trigger, seed rows
├── lake/                       # DuckDB file volume (warehouse.duckdb)
├── parquet/                    # Rolling Parquet drops for each batch
└── state/                      # SQLite checkpoint + bootstrap flag

Quick start

Build + launch
```
cd postgres-duckdb-sync
docker compose up -d --build
```
- Postgres listens on localhost:5437 with user/password analyst.
- The syncer container loops every 30 seconds (SYNC_INTERVAL_SECONDS).
Watch the sync loop
```
docker compose logs -f syncer
```
First run performs the bootstrap (SELECT * dump → Parquet → DuckDB). Subsequent loops only copy rows with updated_at newer than the last checkpoint.
Generate live change events
```
docker compose exec syncer python scripts/seed_changes.py
```
This inserts one row, updates one, and soft-deletes another with deleted_at = NOW(). The sync loop picks up the delta on the next interval.

Query DuckDB

docker compose exec syncer python - <<'PY'
import duckdb
con = duckdb.connect('/app/lake/warehouse.duckdb')
print(con.sql("SELECT id, email, plan, lifetime_value FROM analytics_subscriptions WHERE deleted_at IS NULL ORDER BY updated_at DESC LIMIT 5"))
PY

DuckDB stays <100 ms on millions of rows because it only appends columnar Parquet batches.

How it implements the blog checklist

Timestamp column + trigger: postgres/init/01_schema.sql adds updated_at TIMESTAMPTZ DEFAULT NOW() and a trigger that touches it on every update. deleted_at turns hard deletes into tombstones that analytics can filter out.
Incremental queries: app/src/sync.py pulls WHERE updated_at > :last_sync in 10k-row chunks (BATCH_SIZE). Batch size is configurable via env.
Parquet or bust: Each chunk becomes a Parquet file inside parquet/. DuckDB ingests via INSERT OR REPLACE ... read_parquet(?), so re-running the same batch is idempotent.
Checkpoints in SQLite: state/checkpoints.db stores (last_sync_ts, cumulative_row_count). Crashes at 3 AM resume on the next loop with no manual fixups.
Soft deletes: Because Postgres keeps tombstones, DuckDB always knows which rows are active. Analytics queries simply filter WHERE deleted_at IS NULL.
Schema drift guardrails: Before each insert, the sync script compares information_schema.columns against DuckDB’s pragma_table_info; missing columns trigger ALTER TABLE ... ADD COLUMN inside DuckDB.
Bootstrap vs incremental: The presence of state/bootstrap.done toggles the mode. Bootstrap streams SELECT * FROM schema.table ORDER BY updated_at through Polars, drops one Parquet file, loads it once, and then switches to incremental forever.
Polars + DuckDB speed: Polars streams the Postgres cursor straight into columnar batches (no pandas). DuckDB ingests Parquet directly—no CSV/JSON to slow things down.
Idempotent loads: DuckDB table analytics_subscriptions declares PRIMARY KEY(id) so INSERT OR REPLACE handles reruns and duplicates gracefully.
No orchestration sprawl: Everything runs as a single long-lived process inside the syncer container. Swap the entrypoint for cron if you’d rather schedule exact start times.

Environment variables

Adjustable via docker-compose.yml or env files:

Variable	Description	Default
`POSTGRES_*`	Connection details + table/schema names	`analyst` / `analytics` / `subscriptions`
`POSTGRES_PRIMARY_KEY`	Column used for `INSERT OR REPLACE`	`id`
`BATCH_SIZE`	Rows per incremental query	`10000`
`DUCKDB_PATH`	Where the DuckDB file lives inside the container	`/app/lake/warehouse.duckdb`
`DUCKDB_TABLE`	Target table name inside DuckDB	`analytics_subscriptions`
`PARQUET_DIR`	Parquet drop zone	`/app/parquet`
`STATE_DIR`	Checkpoint + bootstrap flag	`/app/state`
`SYNC_INTERVAL_SECONDS`	How often the loop reruns when `--loop` is enabled	`30`

Cleaning up / rerunning

docker compose down -v  # remove containers + Postgres volume
docker compose up -d --build

Deleting the state/ directory resets checkpoints so the next run performs a fresh bootstrap.

Extending the pattern

Replace the seed data with your production table by pointing POSTGRES_* at your database.
Batch export Parquet files to S3/MinIO for cheap backups—the sync already writes them.
Integrate with DuckDB’s read_parquet('parquet/*.parquet') to replay history or validate row counts.
Add a second cron/compose service that runs python src/sync.py on a schedule instead of looping forever.

Ten minutes, zero message queues, and you can demo a resilient Postgres → DuckDB sync end to end.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Copying Postgres to DuckDB Without Losing Your Mind (Hands-On Stack)

Why this stack?

Stack layout

Quick start

How it implements the blog checklist

Environment variables

Cleaning up / rerunning

Extending the pattern

FilesExpand file tree

README.md

Latest commit

History

README.md

File metadata and controls

Copying Postgres to DuckDB Without Losing Your Mind (Hands-On Stack)

Why this stack?

Stack layout

Quick start

How it implements the blog checklist

Environment variables

Cleaning up / rerunning

Extending the pattern