Welcome to the Soccer Analytics Engineering repository. This project focuses on building a high-performance, normalized analytical database from StatsBomb Open Data using DuckDB.
This repository provides a robust ETL pipeline that transforms raw StatsBomb JSON files into a structured, queryable DuckDB database. It is designed for high-performance tactical analysis, including advanced metrics like Expected Threat (xT) and spatial mapping.
- High Performance: Leverages DuckDB's columnar storage for lightning-fast queries with optimized single-pass ETL (3-4x faster builds).
- Normalized Schema: 15 core tables with full referential integrity, ENUM types for categorical data, and composite indexes.
- Data Enrichment: Integrated player canonicalization, 360 tracking data support, and dynamic lineup tracking.
- Comprehensive Testing: Validated with a suite of ~100 data integrity and quality tests.
- Optimized Storage: Removed redundant JSON columns, using extracted coordinate columns for better performance.
The following diagram illustrates the data flow from raw StatsBomb JSONs to the final analytical database.
graph TD
subgraph "Data Sources 📁"
SB[StatsBomb Open Data]
COMP[competitions.json]
MATCH[matches/*.json]
EVENT[events/*.json]
LINEUP[lineups/*.json]
THREE60[three-sixty/*.json]
end
subgraph "ETL Pipeline ⚙️"
BUILD[build.py]
SCHEMA[schema/ package]
end
subgraph "Storage 💾"
DB[(stats.duckdb)]
end
subgraph "Verification 🛡️"
TESTS[pytest suite]
end
SB --> COMP
SB --> MATCH
SB --> EVENT
COMP --> BUILD
MATCH --> BUILD
EVENT --> BUILD
LINEUP --> BUILD
THREE60 --> BUILD
SCHEMA --> BUILD
BUILD --> DB
DB --> TESTS
Follow these steps to initialize your environment and build the database.
- Python 3.12+
- Git
# Initialize and activate virtual environment
python3 -m venv venv
source venv/bin/activate # macOS/Linux
# venv\Scripts\activate # Windows
# Install dependencies
pip install -r requirements.txt# Clone the StatsBomb open-data repository (this takes a few minutes)
git clone --depth 1 https://github.com/statsbomb/open-data.git# Run the build script
python3 build.pyThis will generate stats.duckdb in the root directory.
Once built, you can query the database directly using any DuckDB client or Python.
SELECT
player,
team,
COUNT(*) as goals
FROM events
WHERE shot_outcome = 'Goal'
GROUP BY 1, 2
ORDER BY 3 DESC
LIMIT 10;SELECT
team,
COUNT(*) as total_passes,
ROUND(SUM(CASE WHEN pass_outcome IS NULL THEN 1 ELSE 0 END) * 100.0 / COUNT(*), 2) as completion_pct
FROM events
WHERE type = 'Pass'
GROUP BY 1
ORDER BY 3 DESC;Find player positions during a specific shot:
SELECT
e.player,
e.type,
p.location_x,
p.location_y,
p.teammate
FROM events e
JOIN three_sixty_positions p ON e.id = p.event_uuid
WHERE e.type = 'Shot' AND e.match_id = 3788741;For deep dives into the engineering and data structure, refer to the following:
- 📑 db_spec.md: Detailed schema, table definitions, and indexing strategy.
- 🧪 tests.md: Overview of the testing suite and data quality checks.
The repository includes a comprehensive test suite to ensure data integrity.
# Run all tests
pytest -vThe tests cover:
- Schema Validation: Correctness of tables and types.
- Referential Integrity: PK/FK consistency across all 15 tables.
- Data Quality: Coordinate bounds, xG ranges, and event sequences.
This repository was built off of the work done by the NYU MSDS Fall 2025 Soccer capstone team.
This project is licensed under the MIT License. Note: Usage of StatsBomb Open Data is subject to their CC BY-NC 4.0 license (non-commercial use only).