SemBench is a benchmark targeting a novel class of systems: semantic query processing engines. These systems rely on the generative and reasoning capabilities of state-of-the-art large language models (LLMs). They extend SQL with semantic operators: operators configured by natural language instructions and evaluated via LLMs, enabling users to perform a wide range of operations on multimodal data.
SemBench introduces diversity across three key dimensions: scenarios, modalities, and operators. The scenarios range from movie review analysis to medical question answering. Within these scenarios, we cover different data modalities, including tables, text, images, and audio. Finally, the queries involve a diverse set of operators, including semantic filter, join, mapping, ranking, and classification operators.
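To make the operator concept concrete, here is a minimal sketch of a semantic filter built from scratch: an LLM evaluates a natural-language predicate for each row. This is illustrative only; the model name and prompt format are arbitrary choices, and real engines add batching, caching, and optimization on top of this pattern.

```python
# Minimal sketch of a semantic filter (illustrative only; model name and
# prompt format are arbitrary). Real engines batch, cache, and optimize.
import pandas as pd
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set

def sem_filter(df: pd.DataFrame, column: str, instruction: str) -> pd.DataFrame:
    """Keep rows for which the LLM answers 'yes' to the instruction."""
    def keep(value: str) -> bool:
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{
                "role": "user",
                "content": f"{instruction}\n\nText: {value}\n\nAnswer yes or no.",
            }],
        )
        return response.choices[0].message.content.strip().lower().startswith("yes")

    return df[df[column].map(keep)]

reviews = pd.DataFrame({"text": ["Loved every minute!", "A tedious, joyless slog."]})
print(sem_filter(reviews, "text", "Does this review express positive sentiment?"))
```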
Currently, SemBench is evaluated on three academic systems (LOTUS, Palimpzest, and ThalamusDB) and one industrial system, Google BigQuery. Although these results reflect a snapshot of systems under continuous development, our study offers crucial insights into their current strengths and weaknesses, illuminating promising directions for future research.
We understand that every system is under rapid development, which is why we maintain an online leaderboard. We encourage you to submit your system's results and participate in the benchmark. Please reach out to discuss how to contribute your results to the leaderboard.
We understand that downloading datasets, generating databases, and setting up environments can be tedious—especially in the systems area, where each system often requires a unique setup. SemBench automates all of these steps for you! It automatically downloads datasets and generates multi-modal databases. Setting up environments for SemBench and four supported systems (LOTUS, Palimpzest, ThalamusDB, and BigQuery) takes just one script.
So, enjoy using SemBench! We believe a good benchmark should minimize user effort—and SemBench is designed exactly for that.
Modern data systems increasingly need to process and reason over multi-modal data, combining traditional relational data with images, audio, and text. SemBench provides a standardized evaluation framework with:
- 5 Real-World Scenarios: wildlife monitoring, medical diagnosis, sentiment analysis of movie reviews, question answering, and e-commerce product analysis
- Multi-modal Queries: Complex semantic operations across multi-modal databases spanning tables, text, images, and audio
- System-agnostic Design: Extensible, with out-of-the-box support for LOTUS, Palimpzest, ThalamusDB, CAESURA, BigQuery, and FlockMTL (a DuckDB extension)
- Comprehensive Metrics: Quality (precision/recall/F1, relative error, ...), cost (monetary cost, token consumption), and efficiency (execution time)
- Rich Visualizations: Automated generation of performance comparisons
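As a rough illustration of the quality dimension, set-based metrics such as precision, recall, and F1 compare a system's result rows against gold-standard rows. The snippet below is a simplified sketch, not SemBench's actual evaluator (which lives in src/evaluator/):

```python
# Simplified sketch of set-based quality metrics over row identifiers.
# SemBench's actual evaluator (src/evaluator/) may differ in detail.
def quality_metrics(predicted: set, gold: set) -> dict:
    true_pos = len(predicted & gold)
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

print(quality_metrics(predicted={1, 2, 3}, gold={2, 3, 4}))
# precision = recall = f1 = 2/3
```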
| Scenario | #Queries | Mod: Table | Mod: Text | Mod: Image | Mod: Audio | Op: Filter | Op: Join | Op: Map | Op: Rank | Op: Classify | Size: Text | Size: Image | Size: Audio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Movie | 10 | ✓ | ✓ | -- | -- | 4 | 3 | -- | 2 | 1 | 1,375,738 | -- | -- |
| Wildlife | 10 | ✓ | -- | ✓ | ✓ | 17 | -- | -- | -- | -- | -- | 8,718 | 650 |
| E-Commerce | 14 | ✓ | ✓ | ✓ | -- | 12 | 9 | 3 | 1 | 2 | 44,446 | 44,446 | -- |
| MMQA | 11 | ✓ | ✓ | ✓ | -- | 5 | 3 | 4 | -- | -- | 5,000 | 1,000 | -- |
| Medical | 10 | ✓ | ✓ | ✓ | ✓ | 12 | -- | -- | -- | 1 | 1,200 | 10,012 | 336 |
| Total | 55 | ✓ | ✓ | ✓ | ✓ | 49 | 15 | 7 | 3 | 4 | 1,426,384 | 64,176 | 986 |
```
SemBench/
├── files/                  # Use case data and queries
│   ├── {use_case}/
│   │   ├── data/           # Multi-modal datasets
│   │   ├── query/          # Natural language & SQL queries
│   │   ├── metrics/        # System performance results
│   │   └── raw_results/    # Query execution outputs
├── figures/                # Performance visualizations
├── src/                    # Core implementation
│   ├── runner/             # System-specific implementations
│   ├── scenario/           # Use case logic & evaluation
│   ├── evaluator/          # Quality assessment framework
│   └── run.py              # Main benchmark orchestrator
└── README.md
```
**Wildlife Monitoring**

Data Modalities: Tables, Images, Audio
Studies animal species detection and co-occurrence using camera traps and audio recorders. Features queries about species identification, location-based analysis, and cross-modal correlation between visual and audio evidence.
Key Features: Image-based species recognition, audio call classification, spatial-temporal analysis
Queries: 10 queries ranging from simple counts to complex multi-way joins
→ Detailed documentation
**Medical Diagnosis**

Data Modalities: Tables, Text, Images, Audio
Comprehensive EHR analysis combining patient demographics, symptom descriptions, chest X-rays, and lung sound recordings. Evaluates disease diagnosis and multi-modal health assessment capabilities.
Key Features: Disease detection from multiple modalities, co-occurrence analysis, health status correlation
Queries: 10 queries focusing on diagnostic accuracy and patient profiling
→ Detailed documentation
**Movie Reviews**

Data Modalities: Tables, Text
Movie review sentiment analysis testing systems' understanding of emotional tone in textual content. Includes sentiment classification, comparison, and aggregation operations.
Key Features: Sentiment classification, review comparison, positivity ratio calculation
Queries: 10 queries spanning filtering, joins, and aggregation operations
→ Detailed documentation
**MMQA (Multi-Modal Question Answering)**

Data Modalities: Tables, Text, Images
Based on the standard MultiModalQA dataset, testing question-answering capabilities across combined textual and visual information sources.
Key Features: Cross-modal question answering, information synthesis, knowledge reasoning
**E-Commerce Product Analysis**

Data Modalities: Tables, Text, Images
Based on the Amazon fashion product dataset, this scenario analyzes multi-modal product information.
Key Features: Cross-modal product information analysis
SemBench supports evaluation of multiple multi-modal data systems:
- LOTUS: Semantic operators optimized to reduce cost while providing accuracy guarantees
- Palimpzest: Semantic operators using cost-based optimization
- ThalamusDB: Semantic operators optimized with approximate query processing
- CAESURA: LLM-Based Multi-Modal Query Planner
- FlockMTL: An open-source DuckDB extension that integrates LLM capabilities into SQL
- BigQuery: Google's analytics data warehouse
Each system implements a standardized runner interface, enabling fair comparison across different architectural approaches. SemBench can also be easily extended to support additional systems.
Please refer to the .env.example file and create a corresponding .env file to configure the necessary credentials or API keys for using Gemini, OpenAI, or models provided by other platforms.
For the list of supported models for each system, please consult the corresponding repository for detailed information.
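Before launching a run, you can quickly check that your credentials are visible to the benchmark process; the variable names below are illustrative, so match them to whatever your .env defines:

```python
# Sanity check: are the API keys visible? Variable names are illustrative;
# use the keys defined in your own .env file.
import os

for key in ("OPENAI_API_KEY", "GEMINI_API_KEY"):
    print(f"{key}: {'set' if os.environ.get(key) else 'MISSING'}")
```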
```bash
bash scripts/setup_environment.sh
```

Note:

- SemBench automatically downloads the required datasets from Google Drive and constructs multi-modal databases according to the specified scale factor.
- The databases included in the repository are provided for demonstration purposes only. Before running your own experiments, please delete the existing `files/{scenario}/data` directory and execute the scripts to regenerate the data.
```bash
# Run a specific system on a specific use case and queries
python3 src/run.py --systems lotus --use-cases movie --queries 1 3 --model gemini-2.5-flash --scale-factor 2000

# Run the full evaluation on a use case
python3 src/run.py --systems lotus --use-cases movie --model gemini-2.5-flash --scale-factor 2000

# Compare multiple systems
python3 src/run.py --systems lotus thalamusdb --use-cases movie --model gemini-2.5-flash --scale-factor 2000

# Execute repeated experiments for error bars
# (please configure the script file first)
cd scripts
./repeat_experiment.sh

# Generate performance visualizations
python3 src/plot.py

# Generate the LaTeX table used in our paper
python3 src/table_brick_design.py

# Generate an analysis report
python3 src/scripts/analysis.py
```

Results are organized as:
- Query Results: `files/{scenario}/raw_results/{system}/Q{n}.csv`
- Performance Metrics: `files/{scenario}/metrics/{system}.json`
- Visualizations: `figures/{scenario}/`
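For ad-hoc inspection, the metrics files can be loaded directly. The sketch below assumes only the path layout above; the internal JSON schema depends on the evaluator:

```python
# Peek at one system's metrics for one scenario. The path follows the layout
# above; the JSON's internal structure depends on the evaluator.
import json
from pathlib import Path

metrics = json.loads(Path("files/movie/metrics/lotus.json").read_text())
print(sorted(metrics.keys()))  # top-level metric names
```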
SemBench provides bar charts for every performance metric (monetary cost, latency, and result quality), a Pareto plot for the cost-quality trade-off, and a comprehensive LaTeX table comparing all metrics.
The modular architecture supports easy extension:
- Add New Use Cases: Implement a scenario-specific runner and evaluator
- Support New Systems: Create a system-specific runner inheriting from `GenericRunner` (see the sketch below)
- Custom Metrics: Extend the evaluation framework with domain-specific quality measures
- Additional Queries: Add query definitions in natural language and SQL formats
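As a sketch of the "Support New Systems" step, a new system plugs in by subclassing the runner base class. The import path and method names below are assumptions; consult src/runner/ for the actual interface:

```python
# Hypothetical runner skeleton for a new system. The import path and method
# names are assumptions; the actual contract is defined under src/runner/.
import pandas as pd
from runner.generic_runner import GenericRunner  # assumed module path

class MySystemRunner(GenericRunner):
    def setup(self, use_case: str) -> None:
        """Connect to the system and load the use case's multi-modal data."""
        raise NotImplementedError

    def run_query(self, query: str) -> pd.DataFrame:
        """Translate the benchmark query into the system's dialect, execute
        it, and return the results in a common tabular format."""
        raise NotImplementedError
```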
We welcome contributions for new scenarios, systems, metrics, and more semantic queries!
If you use this benchmark, or otherwise found our work valuable, please cite 📒:
```bibtex
@misc{lao2025sembenchbenchmarksemanticquery,
  title={SemBench: A Benchmark for Semantic Query Processing Engines},
  author={Jiale Lao and Andreas Zimmerer and Olga Ovcharenko and Tianji Cong and Matthew Russo and Gerardo Vitagliano and Michael Cochez and Fatma Özcan and Gautam Gupta and Thibaud Hottelier and H. V. Jagadish and Kris Kissel and Sebastian Schelter and Andreas Kipf and Immanuel Trummer},
  year={2025},
  eprint={2511.01716},
  archivePrefix={arXiv},
  primaryClass={cs.DB},
  url={https://arxiv.org/abs/2511.01716},
}
```
SemBench enables systematic evaluation of multi-modal data systems across diverse, realistic scenarios. It is built for researchers and practitioners working at the intersection of databases, AI, and multi-modal data processing.
