Skip to content

relytcloud/pg_ducklake

Repository files navigation

pg_ducklake

PostgreSQL Extension for DuckLake

dockerhub License

This project is under high development and is not yet ready for production use.

pg_ducklake brings a native datalake experience into PostgreSQL, powered by DuckLake (a DuckDB lakehouse format with SQL catalog metadata and open Parquet data files).

Key Features

  • Managed DuckLake tables: Create/Write/Query DuckLake tables in PostgreSQL via SQL (e.g., psql/JDBC).
  • DuckDB compatibility: tables created by pg_ducklake are directly queryable from DuckDB clients.
  • Cloud storage: store data files in AWS S3 (or GCS, R2) to decouple storage and compute for serverless analytics.
  • Fast analytics: columnar storage + DuckDB vectorized execution, with hybrid queries over PostgreSQL heap tables supported.

See it in action

Your first Data Lake in PostgreSQL

-- Or use AWS S3 as data storage.
-- SET ducklake.default_table_path = 's3://my-bucket/prefix/';

CREATE TABLE my_table (
    id INT,
    name TEXT,
    age INT
) USING ducklake;

INSERT INTO my_table VALUES (1, 'Alice', 25), (2, 'Bob', 30);

SELECT * FROM my_table;

Access your data with DuckDB

INSTALL ducklake;
LOAD ducklake;
ATTACH 'ducklake:postgres:dbname=postgres host=localhost' AS my_ducklake (METADATA_SCHEMA 'ducklake');
SELECT * FROM my_ducklake.public.my_table;

For cloud storage (AWS S3 or Azure Blob Storage), see the Secrets Management guide.

Quick Start

Docker

Run PostgreSQL with pg_ducklake pre-installed in a docker container:

docker run -d -e POSTGRES_PASSWORD=duckdb --name pgducklake pgducklake/pgducklake:18-main

docker exec -it pgducklake psql

Compile from source

Requirements:

  • PostgreSQL: 14, 15, 16, 17, 18
  • Operating Systems: Ubuntu 22.04-24.04, macOS
git clone https://github.com/relytcloud/pg_ducklake
cd pg_ducklake
# (Optional) install pg_duckdb
# make install-pg_duckdb
make install

See compilation guide for detailed instructions.

Usecases

Convert a PostgreSQL heap table into a DuckLake table

This pattern performs a one-time ETL copy from row-store (PostgreSQL heap) tables to DuckLake (column-store) tables for fast analytics, while OLTP continues to use the original heap tables.

-- Create a PostgreSQL row-store (heap) table.
CREATE TABLE row_store_table AS
SELECT i AS id, 'hello pg_ducklake' AS msg
FROM generate_series(1, 10000) AS i;

-- Create a DuckLake column-store table via ETL.
CREATE TABLE col_store_table USING ducklake AS
SELECT *
FROM row_store_table;

-- Run analytics against the converted table.
SELECT max(id) FROM col_store_table;

Load an external dataset

External datasets (e.g., CSV/Parquet) can be ingested with DuckDB readers and materialized as tables for analytics.

CREATE TABLE titanic USING ducklake AS
SELECT * FROM read_csv('https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv');

SELECT "Pclass", "Sex", COUNT(*), AVG("Survived") AS survival_rate
FROM titanic
GROUP BY "Pclass", "Sex";

Roadmap

Features

  • CREATE / CREATE_TABLE_AS for DuckLake tables
  • INSERT / SELECT / DELETE / UPDATE for DuckLake tables
  • Online schema evolution (ADD COLUMN / DROP COLUMN / type promotion)
  • Time-travel queries
  • Partitioned tables
  • Read-only pg_ducklake tables referencing shared DuckLake datasets (e.g., frozen DuckLake)
  • Table maintenance (e.g., compaction / GC) via PostgreSQL (e.g., VACUUM or UDFs) [^]
  • HTAP support for incremental row-store → column-store conversion (PostgreSQL heap → DuckLake)
  • Complex types

[^]: Table maintenance can be carried out by standalone DuckDB clients (preferable, since it is serverless and avoids burdening the PostgreSQL server); pg_ducklake still plans to expose these operations for ease of use.

Performance

  • Native inlined (heap) table for small writes
  • Better transaction concurrency model (based on PostgreSQL XID)
  • Faster metadata operations via PostgreSQL native functions (e.g., SPI)

Docs

  • Access control behavior for DuckLake tables [^]

[^]: DuckLake tables are exposed via PostgreSQL table access methods (AM), so PostgreSQL table/column privileges may already apply; the current behavior and gaps will be reviewed and documented. DuckLake itself relies on its metadata service for ACL management.

Contributing

We welcome contributions! Please see:

Acknowledgments

This project is built with pg_duckdb and ducklake.