SemBench is a benchmark targeting a novel class of systems: semantic query processing engines. These systems rely on the generative and reasoning capabilities of state-of-the-art large language models (LLMs). They extend SQL with semantic operators: operators configured by natural language instructions and evaluated via LLMs, enabling users to perform a wide range of operations on multimodal data.
SemBench introduces diversity across three key dimensions: scenarios, modalities, and operators. The scenarios range from movie review analysis to medical question answering. Within these scenarios, we cover different data modalities, including tables, text, images, and audio. Finally, the queries involve a diverse set of operators, including semantic filter, join, mapping, ranking, and classification operators.
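To make the operator concept concrete, here is a minimal sketch of a semantic filter built from scratch: an LLM evaluates a natural-language predicate for each row. This is illustrative only; the model name and prompt format are arbitrary choices, and real engines add batching, caching, and optimization on top of this pattern.

```python
# Minimal sketch of a semantic filter (illustrative only; model name and
# prompt format are arbitrary). Real engines batch, cache, and optimize.
import pandas as pd
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set

def sem_filter(df: pd.DataFrame, column: str, instruction: str) -> pd.DataFrame:
    """Keep rows for which the LLM answers 'yes' to the instruction."""
    def keep(value: str) -> bool:
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{
                "role": "user",
                "content": f"{instruction}\n\nText: {value}\n\nAnswer yes or no.",
            }],
        )
        return response.choices[0].message.content.strip().lower().startswith("yes")

    return df[df[column].map(keep)]

reviews = pd.DataFrame({"text": ["Loved every minute!", "A tedious, joyless slog."]})
print(sem_filter(reviews, "text", "Does this review express positive sentiment?"))
```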
Currently, SemBench is evaluated on three academic systems (LOTUS, Palimpzest, and ThalamusDB) and one industrial system, Google BigQuery. Although these results reflect a snapshot of systems under continuous development, our study offers crucial insights into their current strengths and weaknesses, illuminating promising directions for future research.
We understand that every system is under rapid development, which is why we maintain an online leaderboard. We encourage you to submit your system's results and participate in the benchmark. Please reach out to discuss how to contribute your results to the leaderboard.
We understand that downloading datasets, generating databases, and setting up environments can be tedious—especially in the systems area, where each system often requires a unique setup. SemBench automates all of these steps for you! It automatically downloads datasets and generates multi-modal databases. Setting up environments for SemBench and four supported systems (LOTUS, Palimpzest, ThalamusDB, and BigQuery) takes just one script.
So, enjoy using SemBench! We believe a good benchmark should minimize user effort—and SemBench is designed exactly for that.
Modern data systems increasingly need to process and reason over multi-modal data, combining traditional relational data with images, audio, and text. SemBench provides a standardized evaluation framework with:
- 5 Real-World Scenarios: wildlife monitoring, medical diagnosis, sentiment analysis of movie reviews, question answering, and e-commerce product analysis
- Multi-modal Queries: Complex semantic operations across multi-modal databases spanning tables, text, images, and audio
- System-agnostic Design: Extensible, with out-of-the-box support for LOTUS, Palimpzest, ThalamusDB, CAESURA, BigQuery, and FlockMTL (a DuckDB extension)
- Comprehensive Metrics: Quality (precision/recall/F1, relative error, ...), cost (monetary cost, token consumption), and efficiency (execution time)
- Rich Visualizations: Automated generation of performance comparisons
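As a rough illustration of the quality dimension, set-based metrics such as precision, recall, and F1 compare a system's result rows against gold-standard rows. The snippet below is a simplified sketch, not SemBench's actual evaluator (which lives in src/evaluator/):

```python
# Simplified sketch of set-based quality metrics over row identifiers.
# SemBench's actual evaluator (src/evaluator/) may differ in detail.
def quality_metrics(predicted: set, gold: set) -> dict:
    true_pos = len(predicted & gold)
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

print(quality_metrics(predicted={1, 2, 3}, gold={2, 3, 4}))
# precision = recall = f1 = 2/3
```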
| Scenario | #Queries | Mod: Table | Mod: Text | Mod: Image | Mod: Audio | Op: Filter | Op: Join | Op: Map | Op: Rank | Op: Classify | Size: Text | Size: Image | Size: Audio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Movie | 10 | ✓ | ✓ | -- | -- | 4 | 3 | -- | 2 | 1 | 1,375,738 | -- | -- |
| Wildlife | 10 | ✓ | -- | ✓ | ✓ | 17 | -- | -- | -- | -- | -- | 8,718 | 650 |
| E-Commerce | 14 | ✓ | ✓ | ✓ | -- | 12 | 9 | 3 | 1 | 2 | 44,446 | 44,446 | -- |
| MMQA | 11 | ✓ | ✓ | ✓ | -- | 5 | 3 | 4 | -- | -- | 5,000 | 1,000 | -- |
| Medical | 10 | ✓ | ✓ | ✓ | ✓ | 12 | -- | -- | -- | 1 | 1,200 | 10,012 | 336 |
| Total | 55 | ✓ | ✓ | ✓ | ✓ | 49 | 15 | 7 | 3 | 4 | 1,426,384 | 64,176 | 986 |
```
SemBench/
├── files/                  # Use case data and queries
│   ├── {use_case}/
│   │   ├── data/           # Multi-modal datasets
│   │   ├── query/          # Natural language & SQL queries
│   │   ├── metrics/        # System performance results
│   │   └── raw_results/    # Query execution outputs
├── figures/                # Performance visualizations
├── src/                    # Core implementation
│   ├── runner/             # System-specific implementations
│   ├── scenario/           # Use case logic & evaluation
│   ├── evaluator/          # Quality assessment framework
│   └── run.py              # Main benchmark orchestrator
└── README.md
```
**Wildlife Monitoring**

Data Modalities: Tables, Images, Audio
Studies animal species detection and co-occurrence using camera traps and audio recorders. Features queries about species identification, location-based analysis, and cross-modal correlation between visual and audio evidence.
Key Features: Image-based species recognition, audio call classification, spatial-temporal analysis
Queries: 10 queries ranging from simple counts to complex multi-way joins
→ Detailed documentation
**Medical Diagnosis**

Data Modalities: Tables, Text, Images, Audio
Comprehensive EHR analysis combining patient demographics, symptom descriptions, chest X-rays, and lung sound recordings. Evaluates disease diagnosis and multi-modal health assessment capabilities.
Key Features: Disease detection from multiple modalities, co-occurrence analysis, health status correlation
Queries: 10 queries focusing on diagnostic accuracy and patient profiling
→ Detailed documentation
**Movie Reviews**

Data Modalities: Tables, Text
Movie review sentiment analysis testing systems' understanding of emotional tone in textual content. Includes sentiment classification, comparison, and aggregation operations.
Key Features: Sentiment classification, review comparison, positivity ratio calculation
Queries: 10 queries spanning filtering, joins, and aggregation operations
→ Detailed documentation
**MMQA (Multi-Modal Question Answering)**

Data Modalities: Tables, Text, Images
Based on the standard MultiModalQA dataset, testing question-answering capabilities across combined textual and visual information sources.
Key Features: Cross-modal question answering, information synthesis, knowledge reasoning
**E-Commerce Product Analysis**

Data Modalities: Tables, Text, Images
Based on the Amazon fashion product dataset, this scenario analyzes multi-modal product information.
Key Features: Cross-modal product information analysis
SemBench supports evaluation of multiple multi-modal data systems:
- LOTUS: Semantic operators optimized to reduce cost while providing accuracy guarantees
- Palimpzest: Semantic operators using cost-based optimization
- ThalamusDB: Semantic operators optimized with approximate query processing
- CAESURA: LLM-Based Multi-Modal Query Planner
- FlockMTL: An open-source DuckDB extension that integrates LLM capabilities into SQL
- BigQuery: Google's analytics data warehouse
Each system implements a standardized runner interface, enabling fair comparison across different architectural approaches. SemBench can also be easily extended to support additional systems.
Please refer to the .env.example file and create a corresponding .env file to configure the necessary credentials or API keys for using Gemini, OpenAI, or models provided by other platforms.
For the list of supported models for each system, please consult the corresponding repository for detailed information.
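Before launching a run, you can quickly check that your credentials are visible to the benchmark process; the variable names below are illustrative, so match them to whatever your .env defines:

```python
# Sanity check: are the API keys visible? Variable names are illustrative;
# use the keys defined in your own .env file.
import os

for key in ("OPENAI_API_KEY", "GEMINI_API_KEY"):
    print(f"{key}: {'set' if os.environ.get(key) else 'MISSING'}")
```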
```bash
bash scripts/setup_environment.sh
```

Note:

- SemBench automatically downloads the required datasets from Google Drive and constructs multi-modal databases according to the specified scale factor.
- The databases included in the repository are provided for demonstration purposes only. Before running your own experiments, please delete the existing `files/{scenario}/data` directory and execute the scripts to regenerate the data.
```bash
# Run a specific system on a specific use case and queries
python3 src/run.py --systems lotus --use-cases movie --queries 1 3 --model gemini-2.5-flash --scale-factor 2000

# Run the full evaluation on a use case
python3 src/run.py --systems lotus --use-cases movie --model gemini-2.5-flash --scale-factor 2000

# Compare multiple systems
python3 src/run.py --systems lotus thalamusdb --use-cases movie --model gemini-2.5-flash --scale-factor 2000

# Execute repeated experiments for error bars
# (please configure the script file first)
cd scripts
./repeat_experiment.sh

# Generate performance visualizations
python3 src/plot.py

# Generate the LaTeX table used in our paper
python3 src/table_brick_design.py

# Generate an analysis report
python3 src/scripts/analysis.py
```

Results are organized as:
- Query Results: `files/{scenario}/raw_results/{system}/Q{n}.csv`
- Performance Metrics: `files/{scenario}/metrics/{system}.json`
- Visualizations: `figures/{scenario}/`
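For ad-hoc inspection, the metrics files can be loaded directly. The sketch below assumes only the path layout above; the internal JSON schema depends on the evaluator:

```python
# Peek at one system's metrics for one scenario. The path follows the layout
# above; the JSON's internal structure depends on the evaluator.
import json
from pathlib import Path

metrics = json.loads(Path("files/movie/metrics/lotus.json").read_text())
print(sorted(metrics.keys()))  # top-level metric names
```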
SemBench provides bar charts for every performance metric (monetary cost, latency, and result quality), a Pareto plot for the cost-quality trade-off, and a comprehensive LaTeX table comparing all metrics.
The modular architecture supports easy extension:
- Add New Use Cases: Implement a scenario-specific runner and evaluator
- Support New Systems: Create a system-specific runner inheriting from `GenericRunner` (see the sketch below)
- Custom Metrics: Extend the evaluation framework with domain-specific quality measures
- Additional Queries: Add query definitions in natural language and SQL formats
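As a sketch of the "Support New Systems" step, a new system plugs in by subclassing the runner base class. The import path and method names below are assumptions; consult src/runner/ for the actual interface:

```python
# Hypothetical runner skeleton for a new system. The import path and method
# names are assumptions; the actual contract is defined under src/runner/.
import pandas as pd
from runner.generic_runner import GenericRunner  # assumed module path

class MySystemRunner(GenericRunner):
    def setup(self, use_case: str) -> None:
        """Connect to the system and load the use case's multi-modal data."""
        raise NotImplementedError

    def run_query(self, query: str) -> pd.DataFrame:
        """Translate the benchmark query into the system's dialect, execute
        it, and return the results in a common tabular format."""
        raise NotImplementedError
```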
We welcome contributions for new scenarios, systems, metrics, and more semantic queries!
If you use this benchmark, or otherwise found our work valuable, please cite 📒:
```bibtex
@misc{lao2025sembenchbenchmarksemanticquery,
  title={SemBench: A Benchmark for Semantic Query Processing Engines},
  author={Jiale Lao and Andreas Zimmerer and Olga Ovcharenko and Tianji Cong and Matthew Russo and Gerardo Vitagliano and Michael Cochez and Fatma Özcan and Gautam Gupta and Thibaud Hottelier and H. V. Jagadish and Kris Kissel and Sebastian Schelter and Andreas Kipf and Immanuel Trummer},
  year={2025},
  eprint={2511.01716},
  archivePrefix={arXiv},
  primaryClass={cs.DB},
  url={https://arxiv.org/abs/2511.01716},
}
```
SemBench enables systematic evaluation of multi-modal data systems across diverse, realistic scenarios. It is built for researchers and practitioners working at the intersection of databases, AI, and multi-modal data processing.
