tyakovenko/queryEfficiencyStudy

TPC-H Query Efficiency Study

Benchmarking SQLite query execution time and memory usage across 19 TPC-H standard business intelligence queries — with statistically robust multi-run methodology and process-level isolation.


The "Why"

Standard database benchmarks often report single-run metrics, which are skewed by OS page caching, cold starts, and cross-query memory contamination in long-running processes. This study addresses that gap by:

  • Running each TPC-H query 10–40 times across multiple isolated collection sessions
  • Using subprocess-per-query isolation to prevent RSS memory pollution between queries
  • Capturing both wall-clock time and process RSS memory per execution
  • Aggregating into summary statistics and visualizations for direct cross-query comparison

The result is a reproducible, statistically grounded picture of how SQLite handles each of 19 real-world OLAP query patterns.


Tech Stack

| Category | Tool | Why |
| --- | --- | --- |
| Database | SQLite | Lightweight, file-based engine; ideal for reproducible local benchmarks without server overhead |
| Memory Profiling | psutil | Captures OS-level RSS (Resident Set Size) rather than Python heap only — catches SQLite's own buffer usage |
| Batch Automation | Bash + Python subprocess | Spawns a fresh Python process per query so each RSS baseline starts clean |
| Data Processing | pandas | Multi-run CSV merging, zero-value filtering, and describe() summary statistics |
| Visualization | matplotlib | Per-query execution bar charts and cross-query summary stat comparisons |
| Query Standard | TPC-H | Industry-standard OLAP benchmark; 22 parameterized business intelligence queries across 8 relational tables |

Key Features

  • Multi-run statistical profiling — Each query was executed 10–40 times across 4 independent collection runs, yielding datasets with 50–120+ observations per query for robust mean and variance estimates.

  • Process-level isolation — automateBash.py is invoked as a fresh subprocess by execLoop.bash for each query. Every process starts with a clean heap and fresh RSS baseline, scoping memory deltas to the target query alone.

  • Dual-metric capture — Each execution records time.perf_counter() wall-clock time and a psutil RSS delta. SQLite's .stats output and EXPLAIN QUERY PLAN are also captured per run for deeper diagnostics.

  • Memory noise filtering — combineFiles.py drops zero-valued memory observations (cache-hit runs that return a near-zero RSS delta) and deduplicates readings before computing statistics, reducing measurement noise.

  • Cross-query comparison — Summary stats (mean, std, min, 25%, 50%, 75%, max) saved to time_allStats.csv and mem_allStats.csv for direct comparison across all 19 queries.
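The dual-metric loop described above can be sketched as follows. This is a minimal illustration of the technique, not the exact code in automateBash.py — function and variable names here are assumptions:

```python
import sqlite3
import time

import psutil  # third-party: pip install psutil


def run_query(db_path: str, sql: str, n_runs: int) -> list[tuple[float, int]]:
    """Execute `sql` n_runs times, recording wall-clock time and RSS delta.

    Illustrative sketch of the dual-metric capture loop; the real harness
    also records .stats and EXPLAIN QUERY PLAN output per run.
    """
    proc = psutil.Process()
    conn = sqlite3.connect(db_path)
    samples = []
    for _ in range(n_runs):
        rss_before = proc.memory_info().rss
        start = time.perf_counter()
        conn.execute(sql).fetchall()  # materialize the full result set
        elapsed = time.perf_counter() - start
        rss_delta = proc.memory_info().rss - rss_before
        samples.append((elapsed, rss_delta))
    conn.close()
    return samples
```

Because RSS is an OS-level counter, the delta can legitimately be zero or negative on cache-hit runs — which is exactly the noise the filtering step removes downstream.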


Architecture

```mermaid
flowchart TD
    A[TPC-H SQL files\ntpchQueriesSQLlTE/] --> B[execLoop.bash]
    B -->|"spawns subprocess\nper query"| C[automateBash.py]
    C --> D[(tpch.db\nSQLite)]
    C -->|N executions| E[Raw CSVs\nfinalMetrics/]
    E --> F[combineFiles.py\nMerge + Filter + Stats]
    F --> G[finalTimes/ & finalMem/\nper-query CSVs]
    F --> H[time_allStats.csv\nmem_allStats.csv]
    G --> I[exploreData.py\nVisualization]
    I --> J[timeGraphs/*.png\nSummary *.png]
```

Results Overview

All times in seconds; collected over 50–120 runs per query on a local SQLite database.

| Query | Description | Mean (s) | Std (s) | Notes |
| --- | --- | --- | --- | --- |
| Q9 | Product Type Profit | 32.67 | 4.16 | Slowest — 8-table join across full lineitem scan |
| Q13 | Customer Distribution | 23.73 | 2.99 | Correlated subquery over orders with LIKE filter |
| Q8 | National Market Share | 12.50 | 1.89 | Multi-join with region predicate pushdown |
| Q21 | Suppliers Who Kept Orders Waiting | 10.29 | 1.83 | Multiple correlated EXISTS subqueries |
| Q1 | Pricing Summary | 9.59 | 1.36 | Full lineitem scan with multi-column aggregation |
| Q7 | Volume Shipping | 7.49 | 0.99 | Cross-nation join with year extraction |
| Q5 | Local Supplier Volume | 4.85 | 0.49 | 6-table join with region and year filter |
| Q3 | Shipping Priority | 3.48 | 1.24 | Unshipped orders revenue ranking |
| Q19 | Discounted Revenue | 3.34 | 1.02 | Disjunctive OR predicate on lineitem |
| Q10 | Returned Item Reporting | 2.99 | 0.40 | Customer revenue from flagged returns |
| Q14 | Promotion Effect | 2.77 | 0.46 | Conditional SUM over date-filtered lineitem |
| Q18 | Large Volume Customer | 2.73 | 0.96 | HAVING clause over grouped subquery |
| Q12 | Shipping Modes and Order Priority | 2.50 | 0.46 | JOIN on shipmode with CASE aggregation |
| Q15 | Top Supplier | 2.46 | 0.45 | Subquery creates ranked supplier view |
| Q6 | Forecasting Revenue Change | 2.13 | 0.34 | Simple lineitem scan, discount range filter |
| Q4 | Order Priority Checking | 1.52 | 0.34 | EXISTS subquery per order row |
| Q11 | Important Stock Identification | 1.01 | 0.21 | Subquery computes warehouse-wide threshold |
| Q16 | Parts/Supplier Relationship | 0.50 | 0.08 | NOT IN with indexed part filter — fastest join |
| Q2 | Minimum Cost Supplier | 0.47 | 0.50 | Highly selective subquery; fastest overall |

Full statistics in analysis/time_allStats.csv and analysis/mem_allStats.csv. Per-query bar charts in analysis/timeGraphs/ and analysis/*.png.


Challenges & Solutions

Memory Measurement Pollution

The problem: The original automate.py script ran all queries sequentially inside a single Python process. Because SQLite caches database pages and Python's memory allocator does not return RSS to the OS between queries, the RSS delta measured for query N included residual allocations from queries 1 through N-1. Memory readings were therefore systematically inflated for later-running queries, making cross-query comparison unreliable.

The solution: The harness was refactored into automateBash.py, a CLI-invocable module that accepts <db> <sql> <n_runs> <out_csv> as arguments. execLoop.bash invokes it as a separate python subprocess for each query. Every process starts with a fresh heap and a clean RSS baseline, so the memory delta is scoped to that query's executions only.
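The CLI surface of such a harness can be sketched with argparse. This mirrors the `<db> <sql> <n_runs> <out_csv>` invocation described above; whether automateBash.py actually uses argparse (versus raw sys.argv indexing) is an assumption:

```python
import argparse


def parse_args(argv: list[str]) -> argparse.Namespace:
    """Parse the <db> <sql> <n_runs> <out_csv> positional arguments.

    Sketch of the harness's command-line interface; the real script's
    internals may differ.
    """
    parser = argparse.ArgumentParser(description="Per-query benchmark harness")
    parser.add_argument("db", help="path to the SQLite database (tpch.db)")
    parser.add_argument("sql", help="path to one TPC-H .sql file")
    parser.add_argument("n_runs", type=int, help="number of executions of the query")
    parser.add_argument("out_csv", help="destination CSV for raw metrics")
    return parser.parse_args(argv)
```

Keeping every parameter on the command line is what makes the per-query subprocess pattern possible: the bash driver only needs to build four arguments and spawn `python automateBash.py …`.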

The trade-off: Each subprocess adds ~1–2 seconds of Python interpreter startup time. Since the target queries run for 0.5–45 seconds, this startup overhead is negligible relative to the measurement signal and well within run-to-run variance.


Repository Structure

queryEfficiencyStudy/
├── tpchQueriesSQLlTE/          # 22 TPC-H queries adapted for SQLite syntax
│   └── querySource.md          # Business meaning of each query
├── baseline_tests_python/      # Exploratory single-run scripts using tracemalloc (not used in final data)
├── automate.py                 # First-generation multi-run harness (single-process; deprecated)
├── automateBash.py             # CLI-driven harness — use this for new data collection
├── execLoop.bash               # Bash driver: spawns automateBash.py once per query
├── finalMetrics/               # Raw CSVs from 4 collection runs (run1, run3, run4, 10run)
└── analysis/
    ├── combineFiles.py         # Merges runs, filters memory noise, computes summary stats
    ├── exploreData.py          # Generates per-query bar charts and cross-query stat plots
    ├── finalTimes/             # Merged execution time CSVs (one file per query)
    ├── finalMem/               # Merged + filtered memory CSVs (one file per query)
    ├── timeGraphs/             # Bar chart PNGs per query
    ├── time_allStats.csv       # Summary statistics table for execution times
    └── mem_allStats.csv        # Summary statistics table for memory usage

Running the Benchmark

Prerequisites

```shell
pip install pandas matplotlib psutil
```

You will also need a tpch.db SQLite database. Generate one using the TPC-H dbgen tool.

Collect data for a single query

```shell
python automateBash.py /path/to/tpch.db tpchQueriesSQLlTE/q1.sql 30 finalMetrics/run4/q1.csv
```

Run all queries via the bash loop

```shell
# Edit execLoop.bash to set correct paths, then:
bash execLoop.bash
```

Merge runs and analyze

```shell
cd analysis/
python combineFiles.py   # merges runs → finalTimes/, finalMem/, *_allStats.csv
python exploreData.py    # generates charts → timeGraphs/*.png, *.png
```

About

Analyze the efficiency of standardized SQL queries on different systems.
