A reproducible ETL pipeline that transforms production data from the Equinor Volve oil field dataset into a structured SQLite database for educational purposes and SQL learning.
This project demonstrates the advantages of structured databases over Excel-only data storage by converting raw Volve production data (Excel format) into a queryable SQLite database. The focus is on education: teaching SQL, data engineering concepts, and database design to students and professionals.
✅ Star schema design with dimension and fact tables
✅ SQLAlchemy Core for transparent SQL operations (not ORM)
✅ Reproducible ETL pipeline from Excel to SQLite
✅ Data validation with referential integrity checks
✅ Beginner-friendly code with extensive documentation
✅ Ready for analysis with pandas, polars, or duckdb
- 7 wells from the Volve field
- 15,634 daily production records (2007-2016)
- 526 monthly production records (2007-2016)
- Database size: 3.2 MB
- wells: Master dimension table (7 wellbores)
- daily_production: Daily measurements (15,634 records, 2007-2016)
- monthly_production: Monthly aggregates (526 records, 2007-2016)
All tables are related via npd_wellbore_code foreign keys.
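A quick way to verify those relationships is a referential-integrity check: count fact rows whose foreign key has no match in the dimension table. A minimal sketch using only the standard library (paths and column names as described above):

import sqlite3

conn = sqlite3.connect('database/volve.db')

# Count daily_production rows whose npd_wellbore_code has no matching
# wells row; 0 means the fact-to-dimension link is intact.
orphans = conn.execute("""
    SELECT COUNT(*)
    FROM daily_production d
    LEFT JOIN wells w ON d.npd_wellbore_code = w.npd_wellbore_code
    WHERE w.npd_wellbore_code IS NULL
""").fetchone()[0]
print(f"Orphaned daily_production rows: {orphans}")
conn.close()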
# Using uv (recommended)
uv sync
# Or using pip
pip install -r requirements.txt

Download the Volve production data Excel file from the Equinor Data Portal and place it at:
data/production/Volve production data.xlsx
# Create database schema
uv run python scripts/transform/create_tables.py
# Load data from Excel
uv run python scripts/transform/load_data.py

Query the resulting database with standard tools, for example pandas:

import pandas as pd
import sqlite3
conn = sqlite3.connect('database/volve.db')
# Get production summary
df = pd.read_sql("""
SELECT
w.wellbore_name,
SUM(d.oil_volume) as total_oil,
SUM(d.gas_volume) as total_gas
FROM wells w
JOIN daily_production d ON w.npd_wellbore_code = d.npd_wellbore_code
WHERE d.oil_volume IS NOT NULL
GROUP BY w.wellbore_name
ORDER BY total_oil DESC
""", conn)
print(df)
conn.close()

This project provides automated Parquet exports of all database tables. Parquet files are columnar, compressed, and ideal for high-performance analytics directly in the browser (via DuckDB-WASM) or in Python, without needing to manage a local SQLite file.
# Export all tables to parquet/ directory
uv run python scripts/export/parquet_export.py

Exported files are stored in the parquet/ directory:
- parquet/wells.parquet
- parquet/daily_production.parquet
- parquet/monthly_production.parquet
- parquet/README.md (metadata summary)
If hosted on GitHub, these files are automatically deployed to GitHub Pages and can be accessed via URL (useful for DuckDB-WASM):
https://<username>.github.io/<repo>/parquet/daily_production.parquet
See parquet/README.md for comprehensive usage examples with DuckDB-WASM, Pandas, and Polars.
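As a sketch of the remote workflow, DuckDB's httpfs extension can query a hosted Parquet file over HTTPS. The URL below is the placeholder pattern from above, not a live endpoint:

import duckdb

conn = duckdb.connect()
conn.execute("INSTALL httpfs")  # enables reading files over HTTP(S)
conn.execute("LOAD httpfs")

# Aggregate directly from the hosted Parquet file; no local download needed.
df = conn.execute("""
    SELECT npd_wellbore_code, SUM(oil_volume) AS total_oil
    FROM read_parquet('https://<username>.github.io/<repo>/parquet/daily_production.parquet')
    GROUP BY npd_wellbore_code
    ORDER BY total_oil DESC
""").fetchdf()
print(df)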
volve-db/
├── data/
│ └── production/
│ └── Volve production data.xlsx # Source data (download)
├── database/
│ └── volve.db # SQLite database (generated)
├── parquet/
│ └── *.parquet # Columnar exports (generated)
├── scripts/
│ ├── explore/
│ │ ├── analyze_production_data.py # Data exploration
│ │ └── PRODUCTION_DATA_FINDINGS.md # Analysis results
│ ├── export/
│ │ ├── export_constants.py # Export configuration
│ │ ├── export_utils.py # Validation utilities
│ │ └── parquet_export.py # Export implementation
│ └── transform/
│ ├── constants.py # Reusable column definitions
│ ├── create_tables.py # Schema creation
│ ├── load_data.py # ETL pipeline
│ ├── SCHEMA_DOCUMENTATION.md # Database schema docs
│ ├── USAGE_EXAMPLES.md # Query examples
│ └── README.md # Transformation guide
├── pyproject.toml # Dependencies (uv)
└── README.md # This file
A sample query computing total oil production per well:

SELECT
w.wellbore_name,
ROUND(SUM(d.oil_volume), 2) as total_oil
FROM wells w
JOIN daily_production d ON w.npd_wellbore_code = d.npd_wellbore_code
WHERE d.oil_volume IS NOT NULL
GROUP BY w.wellbore_name
ORDER BY total_oil DESC;

See USAGE_EXAMPLES.md for comprehensive query examples, including:
- Production summaries by well
- Time-series analysis
- Well performance comparison
- Pandas, Polars, and DuckDB examples
- Visualization with matplotlib
- SCHEMA_DOCUMENTATION.md - Complete database schema with ER diagrams
- USAGE_EXAMPLES.md - SQL queries and analysis examples
- PRODUCTION_DATA_FINDINGS.md - Source data analysis
This project is designed for learning SQL and data engineering, not as a production application:
# ✅ Good: Explicit SQL with Core
from sqlalchemy import create_engine, text

engine = create_engine("sqlite:///database/volve.db")
query = text("""
SELECT wellbore_name, SUM(oil_volume) as total
FROM wells w
JOIN daily_production d ON w.npd_wellbore_code = d.npd_wellbore_code
GROUP BY wellbore_name
""")
with engine.connect() as conn:
    result = conn.execute(query)
# ❌ Avoid: ORM hides the SQL
session.query(Well).join(DailyProduction)...

Why? Students can see and learn SQL directly. The ORM abstracts away the very concepts we're trying to teach.
- Minimal dependencies - Only essential packages
- Explicit SQL - Show the SQL, don't hide it
- Educational code - Readable, well-commented, type-hinted
- Reproducible - Anyone can recreate the database
- Beginner-friendly - Clear variable names, comprehensive docs
import pandas as pd
import sqlite3
conn = sqlite3.connect('database/volve.db')
# Monthly production trends
df = pd.read_sql("""
SELECT
strftime('%Y-%m', date) as month,
SUM(oil_volume) as oil,
SUM(gas_volume) as gas
FROM daily_production
WHERE oil_volume IS NOT NULL
GROUP BY month
ORDER BY month
""", conn)
df['month'] = pd.to_datetime(df['month'])
df = df.set_index('month')
# Plot (matplotlib is an optional dependency)
import matplotlib.pyplot as plt

df.plot(figsize=(12, 6), title='Volve Field Production')
plt.show()

conn.close()

Polars reads the same database through a connection URI:

import polars as pl
# read_database_uri requires a driver such as connectorx
df = pl.read_database_uri(
    "SELECT * FROM daily_production",
    uri="sqlite:///database/volve.db"
)
summary = (
df.group_by('npd_wellbore_code')
.agg([
pl.col('oil_volume').sum(),
pl.col('gas_volume').sum()
])
)
print(summary)

DuckDB can attach the SQLite database directly and query it in place:

import duckdb
conn = duckdb.connect()
conn.execute("ATTACH 'database/volve.db' AS volve (TYPE SQLITE)")
result = conn.execute("""
SELECT
w.wellbore_name,
PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY d.oil_volume) as median_oil
FROM volve.wells w
JOIN volve.daily_production d ON w.npd_wellbore_code = d.npd_wellbore_code
GROUP BY w.wellbore_name
""").fetchdf()
print(result)

Core dependencies:

- Python 3.12+
- uv - Fast package manager
- SQLAlchemy - Database toolkit (Core only)
- pandas - Data manipulation and Excel reading
- SQLite - Lightweight, portable database
Optional for analysis:
- matplotlib - Plotting
- polars or duckdb - Fast analytics
The database uses a star schema design:
wells (dimension)
├── 7 unique wellbores
└── Static attributes: codes, names, field, facility
daily_production (fact)
├── 15,634 daily measurements
├── Operational metrics: pressure, temperature, choke
└── Production volumes: oil, gas, water
monthly_production (fact)
├── 526 monthly aggregates
└── Production volumes in Sm3
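Expressed as DDL, the relationships look roughly like this. This is a sketch only: column names beyond those used in the query examples, and all types, are assumptions; create_tables.py and SCHEMA_DOCUMENTATION.md hold the real schema.

import sqlite3

conn = sqlite3.connect(':memory:')  # illustration only
conn.executescript("""
    -- Dimension: one row per wellbore (columns abbreviated)
    CREATE TABLE wells (
        npd_wellbore_code INTEGER PRIMARY KEY,
        wellbore_name     TEXT
    );
    -- Fact: daily measurements keyed back to the dimension
    CREATE TABLE daily_production (
        date              TEXT,
        npd_wellbore_code INTEGER REFERENCES wells(npd_wellbore_code),
        oil_volume        REAL,
        gas_volume        REAL
    );
""")
conn.close()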
See SCHEMA_DOCUMENTATION.md for:
- Complete table schemas with all columns
- ER diagrams and relationship details
- Data types, constraints, and indexes
- ETL pipeline documentation
daily_production:

- Completeness: ~60% (operational metrics have ~40% missing values)
- Missing sensors: Not all wells had downhole pressure/temperature monitoring
- Water injection: Only populated for injection wells (expected)

monthly_production:

- Completeness: ~60% (production volumes have ~40% missing values)
- Gas injection: 99.8% missing (rarely used)
- Non-producing periods: NULL values indicate shut-in months
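These completeness figures are easy to reproduce, because SQL's COUNT(column) skips NULLs while COUNT(*) counts every row. A minimal sketch for one column (the monthly column name is assumed to match the daily table):

import sqlite3

conn = sqlite3.connect('database/volve.db')

# COUNT(oil_volume) excludes NULLs, so the ratio is the column's completeness.
total, pct = conn.execute("""
    SELECT COUNT(*),
           ROUND(100.0 * COUNT(oil_volume) / COUNT(*), 1)
    FROM monthly_production
""").fetchone()
print(f"monthly_production: {total} rows, oil_volume {pct}% populated")
conn.close()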
This is an educational project. Contributions that improve clarity, add documentation, or enhance the learning experience are welcome.
Please maintain:
- SQLAlchemy Core approach (not ORM)
- Explicit SQL visibility
- Beginner-friendly code style
- Comprehensive comments and docstrings
Ideas for contributions:

- Add example analysis notebooks
- Create visualization examples
- Add data validation tests
- Production decline curve analysis
- Water breakthrough analysis
- Well performance comparison
- Time-series forecasting examples
This project processes data from the Equinor Volve dataset. Please refer to the original license terms:
https://cdn.equinor.com/files/h61q9gi9/global/de6532f6134b9a953f6c41bac47a0c055a3712d3.pdf
- Equinor for releasing the Volve field dataset as open data
- The dataset is provided for research and educational purposes
Educational Note: This project demonstrates database design principles, ETL pipelines, and SQL concepts. It's intentionally kept simple to focus on learning rather than production-grade features.