The DuckDB Lineage Extension automatically captures fine-grained lineage of queries at the physical plan level.
It tracks which input tuples contribute to each output row and generates SPJUA-level lineage blocks for complete queries.
This extension allows users to capture full lineage without rewriting queries or modifying workflow.
- Download the extension:
python3 scripts/download_extension.py- Use the extension in DuckDB (allow unsigned extensions):
import duckdb
con = duckdb.connect(config={'allow_unsigned_extensions': True})
con.execute("LOAD 'lineage';")- Enable lineage capture:
PRAGMA set_lineage(True);- Execute any query as usual:
SELECT c.name, SUM(o.value) AS total_spend
FROM customer c
JOIN orders o USING (cid)
GROUP BY c.name;- Disable lineage capture:
PRAGMA set_lineage(False);- Access captured lineage through SPJUA blocks:
lineage_edges = con.execute("""
SELECT *
FROM read_block(
(SELECT max(query_id) FROM lineage_meta())
);
""").fetchdf()Lineage capture should be extended to handle:
UNIONANY JOIN,DEPENDENT JOIN,OPTIONAL JOININTERSECTUNNESTSAMPLEWINDOWDISTINCTRECURSIVE CTE
Short-circuit logic (e.g., semi joins) must be disabled to ensure full lineage capture.
For each SPJUA portion of a query:
-
Collect operator-level lineage for all operators in that SPJUA.
-
Compose a single SPJUA block with:
- One column per base table accessed
- One column for output tuple IDs
- Rows encoding full input-to-output tuple mapping
- Refactor
CreateJoinAggBlocksandInitGlobalLineageBuffto support multiple SPJUA blocks per query. - Ensure recursive handling for nested queries and subqueries.
- Ensure all input annotations are captured regardless of operator-specific optimizations.
- Validate input-to-output lineage for each new operator.
- Check SPJUA block consistency.
Test the extension on complex queries:
Test Cases:
-
Unions and Intersections
- Queries using
UNIONorINTERSECToperators. - Confirm lineage is captured for each branch.
- Validate combined SPJUA block correctly represents input-to-output mapping for all branches.
- Queries using
-
Recursive CTEs
- Queries using recursive common table expressions.
- Ensure lineage capture works for each recursion iteration.
- Confirm SPJUA blocks reflect lineage across recursive steps.
Validation Steps:
-
Manual Comparison
- Write equivalent SQL queries with
LIST()andUNNEST()rewrites to manually track lineage. - Compare manual lineage vs extension output.
- Write equivalent SQL queries with
-
SPJUA Block Inspection
- Use:
SELECT *
FROM read_block(
(SELECT max(query_id) FROM lineage_meta))
);-
Ensure:
- Each table accessed has a corresponding column.
- Output tuple mapping is complete and matches expected input-to-output relationships.
-
Edge Cases
- Queries with mark/semi/anti joins, and delim joins.
- Disable short-circuit logic.
- Validate SPJUA block consistency even when some operators do not contribute output rows.