tyakovenko/queryEfficiencyStudy

TPC-H Query Efficiency Study

Benchmarking SQLite query execution time and memory usage across 19 TPC-H standard business intelligence queries — with statistically robust multi-run methodology and process-level isolation.


The "Why"

Standard database benchmarks often report single-run metrics, which are skewed by OS page caching, cold starts, and cross-query memory contamination in long-running processes. This study addresses that gap by:

  • Running each TPC-H query 10–40 times across multiple isolated collection sessions
  • Using subprocess-per-query isolation to prevent RSS memory pollution between queries
  • Capturing both wall-clock time and process RSS memory per execution
  • Aggregating into summary statistics and visualizations for direct cross-query comparison

The result is a reproducible, statistically grounded picture of how SQLite handles each of 19 real-world OLAP query patterns.


Tech Stack

| Category | Tool | Why |
| --- | --- | --- |
| Database | SQLite | Lightweight, file-based engine; ideal for reproducible local benchmarks without server overhead |
| Memory Profiling | psutil | Captures OS-level RSS (Resident Set Size) rather than Python heap only — catches SQLite's own buffer usage |
| Batch Automation | Bash + Python subprocess | Spawns a fresh Python process per query so each RSS baseline starts clean |
| Data Processing | pandas | Multi-run CSV merging, zero-value filtering, and describe() summary statistics |
| Visualization | matplotlib | Per-query execution bar charts and cross-query summary stat comparisons |
| Query Standard | TPC-H | Industry-standard OLAP benchmark; 22 parameterized business intelligence queries across 8 relational tables |

Key Features

  • Multi-run statistical profiling — Each query was executed 10–40 times across 4 independent collection runs, yielding datasets with 50–120+ observations per query for robust mean and variance estimates.

  • Process-level isolation — automateBash.py is invoked as a fresh subprocess by execLoop.bash for each query. Every process starts with a clean heap and fresh RSS baseline, scoping memory deltas to the target query alone.

  • Dual-metric capture — Each execution records time.perf_counter() wall-clock time and a psutil RSS delta. SQLite's .stats output and EXPLAIN QUERY PLAN are also captured per run for deeper diagnostics.

  • Memory noise filtering — combineFiles.py drops zero-valued memory observations (cache-hit runs that return a near-zero RSS delta) and deduplicates readings before computing statistics, reducing measurement noise.

  • Cross-query comparison — Summary stats (mean, std, min, 25%, 50%, 75%, max) saved to time_allStats.csv and mem_allStats.csv for direct comparison across all 19 queries.
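The dual-metric loop described above can be sketched as follows. This is a minimal illustration of the technique, not the exact code in automateBash.py — function and variable names here are assumptions:

```python
import sqlite3
import time

import psutil  # third-party: pip install psutil


def run_query(db_path: str, sql: str, n_runs: int) -> list[tuple[float, int]]:
    """Execute `sql` n_runs times, recording wall-clock time and RSS delta.

    Illustrative sketch of the dual-metric capture loop; the real harness
    also records .stats and EXPLAIN QUERY PLAN output per run.
    """
    proc = psutil.Process()
    conn = sqlite3.connect(db_path)
    samples = []
    for _ in range(n_runs):
        rss_before = proc.memory_info().rss
        start = time.perf_counter()
        conn.execute(sql).fetchall()  # materialize the full result set
        elapsed = time.perf_counter() - start
        rss_delta = proc.memory_info().rss - rss_before
        samples.append((elapsed, rss_delta))
    conn.close()
    return samples
```

Because RSS is an OS-level counter, the delta can legitimately be zero or negative on cache-hit runs — which is exactly the noise the filtering step removes downstream.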


Architecture

```mermaid
flowchart TD
    A[TPC-H SQL files\ntpchQueriesSQLlTE/] --> B[execLoop.bash]
    B -->|"spawns subprocess\nper query"| C[automateBash.py]
    C --> D[(tpch.db\nSQLite)]
    C -->|N executions| E[Raw CSVs\nfinalMetrics/]
    E --> F[combineFiles.py\nMerge + Filter + Stats]
    F --> G[finalTimes/ & finalMem/\nper-query CSVs]
    F --> H[time_allStats.csv\nmem_allStats.csv]
    G --> I[exploreData.py\nVisualization]
    I --> J[timeGraphs/*.png\nSummary *.png]
```

Results Overview

All times in seconds; collected over 50–120 runs per query on a local SQLite database.

| Query | Description | Mean (s) | Std (s) | Notes |
| --- | --- | --- | --- | --- |
| Q9 | Product Type Profit | 32.67 | 4.16 | Slowest — 8-table join across full lineitem scan |
| Q13 | Customer Distribution | 23.73 | 2.99 | Correlated subquery over orders with LIKE filter |
| Q8 | National Market Share | 12.50 | 1.89 | Multi-join with region predicate pushdown |
| Q21 | Suppliers Who Kept Orders Waiting | 10.29 | 1.83 | Multiple correlated EXISTS subqueries |
| Q1 | Pricing Summary | 9.59 | 1.36 | Full lineitem scan with multi-column aggregation |
| Q7 | Volume Shipping | 7.49 | 0.99 | Cross-nation join with year extraction |
| Q5 | Local Supplier Volume | 4.85 | 0.49 | 6-table join with region and year filter |
| Q3 | Shipping Priority | 3.48 | 1.24 | Unshipped orders revenue ranking |
| Q19 | Discounted Revenue | 3.34 | 1.02 | Disjunctive OR predicate on lineitem |
| Q10 | Returned Item Reporting | 2.99 | 0.40 | Customer revenue from flagged returns |
| Q14 | Promotion Effect | 2.77 | 0.46 | Conditional SUM over date-filtered lineitem |
| Q18 | Large Volume Customer | 2.73 | 0.96 | HAVING clause over grouped subquery |
| Q12 | Shipping Modes and Order Priority | 2.50 | 0.46 | JOIN on shipmode with CASE aggregation |
| Q15 | Top Supplier | 2.46 | 0.45 | Subquery creates ranked supplier view |
| Q6 | Forecasting Revenue Change | 2.13 | 0.34 | Simple lineitem scan, discount range filter |
| Q4 | Order Priority Checking | 1.52 | 0.34 | EXISTS subquery per order row |
| Q11 | Important Stock Identification | 1.01 | 0.21 | Subquery computes warehouse-wide threshold |
| Q16 | Parts/Supplier Relationship | 0.50 | 0.08 | NOT IN with indexed part filter — fastest join |
| Q2 | Minimum Cost Supplier | 0.47 | 0.50 | Highly selective subquery; fastest overall |

Full statistics in analysis/time_allStats.csv and analysis/mem_allStats.csv. Per-query bar charts in analysis/timeGraphs/ and analysis/*.png.


Challenges & Solutions

Memory Measurement Pollution

The problem: The original automate.py script ran all queries sequentially inside a single Python process. Because SQLite caches database pages and Python's memory allocator does not return RSS to the OS between queries, the RSS delta measured for query N included residual allocations from queries 1 through N-1. Memory readings were therefore systematically inflated for later-running queries, making cross-query comparison unreliable.

The solution: The harness was refactored into automateBash.py, a CLI-invocable module that accepts <db> <sql> <n_runs> <out_csv> as arguments. execLoop.bash invokes it as a separate python subprocess for each query. Every process starts with a fresh heap and a clean RSS baseline, so the memory delta is scoped to that query's executions only.
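The CLI surface of such a harness can be sketched with argparse. This mirrors the `<db> <sql> <n_runs> <out_csv>` invocation described above; whether automateBash.py actually uses argparse (versus raw sys.argv indexing) is an assumption:

```python
import argparse


def parse_args(argv: list[str]) -> argparse.Namespace:
    """Parse the <db> <sql> <n_runs> <out_csv> positional arguments.

    Sketch of the harness's command-line interface; the real script's
    internals may differ.
    """
    parser = argparse.ArgumentParser(description="Per-query benchmark harness")
    parser.add_argument("db", help="path to the SQLite database (tpch.db)")
    parser.add_argument("sql", help="path to one TPC-H .sql file")
    parser.add_argument("n_runs", type=int, help="number of executions of the query")
    parser.add_argument("out_csv", help="destination CSV for raw metrics")
    return parser.parse_args(argv)
```

Keeping every parameter on the command line is what makes the per-query subprocess pattern possible: the bash driver only needs to build four arguments and spawn `python automateBash.py …`.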

The trade-off: Each subprocess adds ~1–2 seconds of Python interpreter startup time. Since the target queries run for 0.5–45 seconds, this startup overhead is negligible relative to the measurement signal and well within run-to-run variance.


Repository Structure

queryEfficiencyStudy/
├── tpchQueriesSQLlTE/          # 22 TPC-H queries adapted for SQLite syntax
│   └── querySource.md          # Business meaning of each query
├── baseline_tests_python/      # Exploratory single-run scripts using tracemalloc (not used in final data)
├── automate.py                 # First-generation multi-run harness (single-process; deprecated)
├── automateBash.py             # CLI-driven harness — use this for new data collection
├── execLoop.bash               # Bash driver: spawns automateBash.py once per query
├── finalMetrics/               # Raw CSVs from 4 collection runs (run1, run3, run4, 10run)
└── analysis/
    ├── combineFiles.py         # Merges runs, filters memory noise, computes summary stats
    ├── exploreData.py          # Generates per-query bar charts and cross-query stat plots
    ├── finalTimes/             # Merged execution time CSVs (one file per query)
    ├── finalMem/               # Merged + filtered memory CSVs (one file per query)
    ├── timeGraphs/             # Bar chart PNGs per query
    ├── time_allStats.csv       # Summary statistics table for execution times
    └── mem_allStats.csv        # Summary statistics table for memory usage

Running the Benchmark

Prerequisites

```shell
pip install pandas matplotlib psutil
```

You will also need a tpch.db SQLite database. Generate one using the TPC-H dbgen tool.

Collect data for a single query

```shell
python automateBash.py /path/to/tpch.db tpchQueriesSQLlTE/q1.sql 30 finalMetrics/run4/q1.csv
```

Run all queries via the bash loop

```shell
# Edit execLoop.bash to set correct paths, then:
bash execLoop.bash
```

Merge runs and analyze

```shell
cd analysis/
python combineFiles.py   # merges runs → finalTimes/, finalMem/, *_allStats.csv
python exploreData.py    # generates charts → timeGraphs/*.png, *.png
```

About

Analyze the efficiency of standardized SQL queries on different systems.
