A DuckDB extension implementing Julian Hyde's "Measures in SQL" paper (arXiv:2406.00251).
Yardstick adds measure-aware SQL to DuckDB. Measures are aggregations that know how to re-aggregate themselves when the query context changes. This enables:
- Percent of total calculations without CTEs or window functions
- Year-over-year comparisons with simple syntax
- Drill-down analytics that automatically adjust aggregation context
-- Load the extension
INSTALL yardstick FROM community;
LOAD yardstick;
-- Create the sales table
CREATE TABLE sales (
id INTEGER PRIMARY KEY,
year INTEGER,
region VARCHAR(50),
amount DECIMAL(10, 2)
);
-- Insert sample data
INSERT INTO sales (id, year, region, amount) VALUES
(1, 2023, 'North', 15000.00),
(2, 2023, 'North', 22000.00),
(3, 2023, 'South', 18000.00),
(4, 2023, 'South', 12000.00),
(5, 2023, 'East', 25000.00),
(6, 2023, 'West', 19000.00),
(7, 2024, 'North', 28000.00),
(8, 2024, 'North', 31000.00),
(9, 2024, 'South', 21000.00),
(10, 2024, 'South', 16000.00),
(11, 2024, 'East', 33000.00),
(12, 2024, 'East', 29000.00),
(13, 2024, 'West', 24000.00),
(14, 2024, 'West', 27000.00);
-- Create a view with measures
CREATE VIEW sales_v AS
SELECT
year,
region,
SUM(amount) AS MEASURE revenue,
COUNT(*) AS MEASURE order_count
FROM sales;
-- Query with AGGREGATE() and AT modifiers (SEMANTIC prefix required)
SEMANTIC SELECT
year,
region,
AGGREGATE(revenue) AS revenue,
AGGREGATE(revenue) AT (ALL region) AS year_total,
AGGREGATE(revenue) / AGGREGATE(revenue) AT (ALL region) AS pct_of_year
FROM sales_v;
-- Variance from the global average
SEMANTIC SELECT
region,
AGGREGATE(revenue) AS revenue,
AGGREGATE(revenue) AT (ALL) / 4.0 AS expected_if_equal, -- 4 regions
AGGREGATE(revenue) - (AGGREGATE(revenue) AT (ALL) / 4.0) AS variance
FROM sales_v;
-- Nested percentages (% of year, and that year's % of total)
SEMANTIC SELECT
year,
region,
AGGREGATE(revenue) AS revenue,
100.0 * AGGREGATE(revenue) / AGGREGATE(revenue) AT (ALL region) AS pct_of_year,
100.0 * AGGREGATE(revenue) AT (ALL region) / AGGREGATE(revenue) AT (ALL) AS year_pct_of_total
FROM sales_v;
-- Compare 2024 performance to 2023 baseline for each region
SEMANTIC SELECT
region,
AGGREGATE(revenue) AT (SET year = 2024) AS rev_2024,
AGGREGATE(revenue) AT (SET year = 2023) AS rev_2023,
AGGREGATE(revenue) AT (SET year = 2024) - AGGREGATE(revenue) AT (SET year = 2023) AS growth
FROM sales_v;
-- Filter to specific segments
SEMANTIC SELECT
year,
AGGREGATE(revenue) AS total_revenue,
AGGREGATE(revenue) AT (SET region = 'North') AS north_revenue,
AGGREGATE(revenue) AT (SET region IN ('North', 'South')) AS north_south_combined
FROM sales_v;CREATE VIEW view_name AS
SELECT
dimension1,
dimension2,
AGG(expr) AS MEASURE measure_name
FROM table;Yardstick automatically handles the grouping. All DuckDB aggregate functions are supported, including COUNT(DISTINCT) with some restrictions (see below).
Queries using AGGREGATE() must use the SEMANTIC prefix:
SEMANTIC SELECT
dimensions,
AGGREGATE(measure_name) [AT modifier]
FROM view_name;| Modifier | Description | Example |
|---|---|---|
AT (ALL) |
Grand total across all dimensions | AGGREGATE(revenue) AT (ALL) |
AT (ALL dim) |
Total excluding specific dimension | AGGREGATE(revenue) AT (ALL region) |
AT (ALL expr) |
Total excluding ad hoc dimension | AGGREGATE(revenue) AT (ALL MONTH(date)) |
AT (SET dim = val) |
Fix dimension to specific value | AGGREGATE(revenue) AT (SET year = 2022) |
AT (SET dim = expr) |
Fix dimension to expression | AGGREGATE(revenue) AT (SET year = year - 1) |
AT (SET expr = val) |
Fix ad hoc dimension to value | AGGREGATE(revenue) AT (SET MONTH(date) = 6) |
AT (WHERE cond) |
Pre-aggregation filter | AGGREGATE(revenue) AT (WHERE region = 'US') |
AT (VISIBLE) |
Use query's WHERE clause | AGGREGATE(revenue) AT (VISIBLE) |
COUNT(DISTINCT) is a non-decomposable aggregate: you can't sum distinct counts from subsets without double-counting. Yardstick handles this by deferring evaluation to query time via correlated subqueries.
-- Define a COUNT(DISTINCT) measure
CREATE VIEW orders_v AS
SELECT year, region, COUNT(DISTINCT customer_id) AS MEASURE unique_customers
FROM orders;
-- Basic query works
SEMANTIC SELECT year, region, AGGREGATE(unique_customers) FROM orders_v;
-- AT (WHERE) works (just adds a filter)
SEMANTIC SELECT year, AGGREGATE(unique_customers) AT (WHERE region = 'US') FROM orders_v;Restrictions: AT modifiers that require re-aggregation will error:
-- These will return an error:
AGGREGATE(unique_customers) AT (ALL) -- can't compute grand total from subset totals
AGGREGATE(unique_customers) AT (ALL region) -- can't remove dimension and re-aggregate
AGGREGATE(unique_customers) AT (SET year = year - 1) -- can't compare across contextsPrerequisites:
- CMake 3.5+
- C++17 compiler
- Cargo
make # builds Rust library and DuckDB extension
make test # runs testsThe extension will be at build/release/extension/yardstick/yardstick.duckdb_extension
See LIMITATIONS.md for known issues and workarounds.
Key limitations:
- Non-decomposable aggregates (COUNT DISTINCT, MEDIAN, PERCENTILE) cannot use AGGREGATE()
- Window function measures not supported
COUNT(DISTINCT)works but cannot be used withAT (ALL),AT (ALL dim), orAT (SET)modifiers
"I used this to integrate into a copilotkit chat interface serving graphs, works really well for the llm." - JFox, DuckDB Discord
- Julian Hyde, "Measures in SQL" (2024). arXiv:2406.00251
- DuckDB Extension Template
MIT