68 changes: 68 additions & 0 deletions README.md
@@ -59,6 +59,74 @@ The webserver provides an intuitive interface with four main pages:
4. **Webserver** → serves UI and provides API access to database
5. **Grafana** → visualizes real-time data from database with embedded dashboards

## Archive Data

The database can be initialized with archive CML data using two methods:

### Method 1: CSV Files (Default, Fast)

Pre-generated CSV files included in the repository:
- **728 CML sublinks** (364 unique CML IDs) covering the Berlin area
- **~1.5M data rows** at 5-minute intervals over 7 days
- **Gzip-compressed** (~7.6 MB total, included in the repo)
- **Loads in ~3 seconds** via PostgreSQL COPY

Files are located in `/database/archive_data/` and loaded automatically on first database startup.
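
As a quick sanity check, the row counts can be queried once the database container is up. This is a minimal sketch: the `myuser` credentials come from `database/Dockerfile`, and the database name is assumed to default to the user name since no `POSTGRES_DB` is set there.

```sh
# Count the loaded metadata and time-series rows (table names from database/init_archive_data.sh)
docker compose exec database psql -U myuser -d myuser \
  -c "SELECT (SELECT COUNT(*) FROM cml_metadata) AS metadata_rows, (SELECT COUNT(*) FROM cml_data) AS data_rows;"
```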

### Method 2: Load from NetCDF (For Larger/Higher Resolution Archives)

Load data directly from the full 3-month NetCDF archive with a configurable time range:

#### Default: 7 Days at 10-Second Resolution (~44M rows, ~5 minutes)

```sh
# Rebuild parser if needed
docker compose build parser

# Start database
docker compose up -d database

# Load last 7 days from NetCDF
docker compose run --rm -e DB_HOST=database parser python /app/parser/parse_netcdf_archive.py
```
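
The loader prints batch-by-batch progress while it runs. Once it finishes, the loaded window can be inspected from `psql` (same hedged user/database-name assumptions as in the CSV section above):

```sh
# Show the time span and row count of the data inserted by the NetCDF loader
docker compose exec database psql -U myuser -d myuser \
  -c "SELECT MIN(time) AS start_time, MAX(time) AS end_time, COUNT(*) AS total_rows FROM cml_data;"
```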

#### Custom Time Range

Use `ARCHIVE_MAX_DAYS` to control how much data to load:

```sh
# Load last 14 days (~88M rows, ~10 minutes)
docker compose run --rm -e DB_HOST=database -e ARCHIVE_MAX_DAYS=14 parser python /app/parser/parse_netcdf_archive.py

# Load full 3 months (~579M rows, ~1 hour)
docker compose run --rm -e DB_HOST=database -e ARCHIVE_MAX_DAYS=0 parser python /app/parser/parse_netcdf_archive.py
```

**Note**: Set `ARCHIVE_MAX_DAYS=0` to disable the time limit and load the entire dataset. Larger datasets require more database memory (at least 4 GB of RAM is recommended for the full 3-month archive).
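
The row count scales roughly linearly with the window: 728 sublinks at 10-second resolution is about 728 × 8640 rows per day, which matches the figures above. A quick back-of-the-envelope check:

```sh
# Rough row-count estimate for a custom ARCHIVE_MAX_DAYS window
DAYS=14
echo "$(( 728 * 8640 * DAYS )) rows"   # ≈ 88M for 14 days, ≈ 44M for 7 days, ≈ 579M for the full ~3 months
```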

**Features**:
- Auto-downloads the 3-month NetCDF file (~209 MB) on first run
- **10-second resolution** (vs. 5-minute for the CSV method)
- **Automatic timestamp shifting** so the data ends at the current time
- **Progress reporting** with batch-by-batch status (~155K rows/sec)
- PostgreSQL COPY for maximum performance
- Configurable time window to balance demo realism against load time

The NetCDF file is downloaded to `parser/example_data/openMRG_cmls_20150827_3months.nc` and gitignored.
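
Because of the timestamp shifting, the newest sample should sit close to the wall clock after a load. This can be verified with a quick query (same hedged user/database-name assumptions as above):

```sh
# The lag between "now" and the newest sample should be small after a NetCDF load
docker compose exec database psql -U myuser -d myuser \
  -c "SELECT NOW() - MAX(time) AS lag_behind_now FROM cml_data;"
```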

### Managing Archive Data

To regenerate CSV archive data:
```sh
python mno_data_source_simulator/generate_archive.py
```

To reload archive data (either method):
```sh
docker compose down -v # Remove volumes
docker compose up -d # Restart with fresh database
```

## Storage Backend

The webserver supports multiple storage backends for received files:
3 changes: 2 additions & 1 deletion database/Dockerfile
@@ -6,6 +6,7 @@ ENV POSTGRES_USER=myuser
ENV POSTGRES_PASSWORD=mypassword

# Copy the initialization script to set up the database schema
COPY init.sql /docker-entrypoint-initdb.d/
# Files are executed in alphabetical order, so we prefix with numbers
COPY init.sql /docker-entrypoint-initdb.d/01-init-schema.sql

EXPOSE 5432
Binary file added database/archive_data/data_archive.csv.gz
Binary file added database/archive_data/metadata_archive.csv.gz
52 changes: 52 additions & 0 deletions database/init_archive_data.sh
@@ -0,0 +1,52 @@
#!/bin/bash
set -e

# This script loads archive data into the database during initialization
# It runs after the schema is created (init.sql) but before the database
# accepts external connections.

echo "Loading archive data into database..."

ARCHIVE_DATA_DIR="/docker-entrypoint-initdb.d/archive_data"

# Check if archive data exists (should be included in the repo)
if [ ! -f "$ARCHIVE_DATA_DIR/metadata_archive.csv.gz" ] || [ ! -f "$ARCHIVE_DATA_DIR/data_archive.csv.gz" ]; then
    echo "Warning: Archive data files not found. Skipping archive data load."
    echo "Hint: Run 'python mno_data_source_simulator/generate_archive.py' to generate archive data."
    exit 0
fi

# Load metadata first (required for foreign key references)
echo "Loading metadata archive..."
psql -v ON_ERROR_STOP=1 --username "$POSTGRES_USER" --dbname "$POSTGRES_DB" <<-EOSQL
\COPY cml_metadata FROM PROGRAM 'gunzip -c $ARCHIVE_DATA_DIR/metadata_archive.csv.gz' WITH (FORMAT csv, HEADER true);
EOSQL

METADATA_COUNT=$(psql -v ON_ERROR_STOP=1 --username "$POSTGRES_USER" --dbname "$POSTGRES_DB" -t -c "SELECT COUNT(*) FROM cml_metadata;")
echo "Loaded $METADATA_COUNT metadata records"

# Load time-series data using COPY for maximum speed
echo "Loading time-series archive data (this may take 10-30 seconds)..."
START_TIME=$(date +%s)

psql -v ON_ERROR_STOP=1 --username "$POSTGRES_USER" --dbname "$POSTGRES_DB" <<-EOSQL
\COPY cml_data (time, cml_id, sublink_id, tsl, rsl) FROM PROGRAM 'gunzip -c $ARCHIVE_DATA_DIR/data_archive.csv.gz' WITH (FORMAT csv, HEADER true);
EOSQL

END_TIME=$(date +%s)
DURATION=$((END_TIME - START_TIME))

DATA_COUNT=$(psql -v ON_ERROR_STOP=1 --username "$POSTGRES_USER" --dbname "$POSTGRES_DB" -t -c "SELECT COUNT(*) FROM cml_data;")
echo "Loaded $DATA_COUNT data records in $DURATION seconds"

# Display time range of loaded data
psql -v ON_ERROR_STOP=1 --username "$POSTGRES_USER" --dbname "$POSTGRES_DB" <<-EOSQL
SELECT
'Archive data time range:' as info,
MIN(time) as start_time,
MAX(time) as end_time,
COUNT(*) as total_rows
FROM cml_data;
EOSQL

echo "Archive data successfully loaded!"
3 changes: 3 additions & 0 deletions docker-compose.yml
@@ -40,6 +40,9 @@ services:
    build: ./database
    ports:
      - "5432:5432"
    volumes:
      - ./database/archive_data:/docker-entrypoint-initdb.d/archive_data:ro
      - ./database/init_archive_data.sh:/docker-entrypoint-initdb.d/99-load-archive.sh:ro

  processor:
    build: ./processor
4 changes: 3 additions & 1 deletion grafana/provisioning/dashboards/dashboards.yml
@@ -6,6 +6,8 @@ providers:
    folder: ''
    type: file
    disableDeletion: false
    editable: true
    editable: false
    updateIntervalSeconds: 10
    allowUiUpdates: false
    options:
      path: /etc/grafana/provisioning/dashboards/definitions
115 changes: 109 additions & 6 deletions grafana/provisioning/dashboards/definitions/cml-realtime.json
@@ -15,7 +15,7 @@
}
]
},
"editable": true,
"editable": false,
"fiscalYearStartMonth": 0,
"graphTooltip": 0,
"id": 1,
@@ -156,7 +156,7 @@
},
"format": "time_series",
"rawQuery": true,
"rawSql": "SELECT\n time AS \"time\",\n sublink_id AS \"metric\",\n rsl AS \"value\"\nFROM cml_data\nWHERE cml_id = '${cml_id}'\n AND time >= $__timeFrom()::timestamptz\n AND time <= $__timeTo()::timestamptz\nORDER BY \"time\" ASC",
"rawSql": "SELECT\n time AS \"time\",\n sublink_id AS \"metric\",\n rsl AS \"value\"\nFROM cml_data\nWHERE cml_id = '${cml_id}'\n AND '${aggregation}' = 'RAW'\n AND time >= $__timeFrom()::timestamptz\n AND time <= $__timeTo()::timestamptz\nUNION ALL\nSELECT\n time_bucket((CASE WHEN '${interval}' = 'auto' THEN (${__interval_ms} || ' milliseconds') ELSE '${interval}' END)::interval, time) AS \"time\",\n sublink_id AS \"metric\",\n CASE\n WHEN '${aggregation}' = 'MEDIAN' THEN PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY rsl)\n WHEN '${aggregation}' = 'AVG' THEN AVG(rsl)\n WHEN '${aggregation}' = 'MIN' THEN MIN(rsl)\n WHEN '${aggregation}' = 'MAX' THEN MAX(rsl)\n WHEN '${aggregation}' = 'STDDEV' THEN STDDEV(rsl)\n END AS \"value\"\nFROM cml_data\nWHERE cml_id = '${cml_id}'\n AND '${aggregation}' <> 'RAW'\n AND time >= $__timeFrom()::timestamptz\n AND time <= $__timeTo()::timestamptz\nGROUP BY time_bucket((CASE WHEN '${interval}' = 'auto' THEN (${__interval_ms} || ' milliseconds') ELSE '${interval}' END)::interval, time), sublink_id\nORDER BY \"time\" ASC",
"refId": "A"
}
],
@@ -244,7 +244,7 @@
},
"format": "time_series",
"rawQuery": true,
"rawSql": "SELECT\n time AS \"time\",\n sublink_id AS \"metric\",\n tsl AS \"value\"\nFROM cml_data\nWHERE cml_id = '${cml_id}'\n AND time >= $__timeFrom()::timestamptz\n AND time <= $__timeTo()::timestamptz\nORDER BY \"time\" ASC",
"rawSql": "SELECT\n time AS \"time\",\n sublink_id AS \"metric\",\n tsl AS \"value\"\nFROM cml_data\nWHERE cml_id = '${cml_id}'\n AND '${aggregation}' = 'RAW'\n AND time >= $__timeFrom()::timestamptz\n AND time <= $__timeTo()::timestamptz\nUNION ALL\nSELECT\n time_bucket((CASE WHEN '${interval}' = 'auto' THEN (${__interval_ms} || ' milliseconds') ELSE '${interval}' END)::interval, time) AS \"time\",\n sublink_id AS \"metric\",\n CASE\n WHEN '${aggregation}' = 'MEDIAN' THEN PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY tsl)\n WHEN '${aggregation}' = 'AVG' THEN AVG(tsl)\n WHEN '${aggregation}' = 'MIN' THEN MIN(tsl)\n WHEN '${aggregation}' = 'MAX' THEN MAX(tsl)\n WHEN '${aggregation}' = 'STDDEV' THEN STDDEV(tsl)\n END AS \"value\"\nFROM cml_data\nWHERE cml_id = '${cml_id}'\n AND '${aggregation}' <> 'RAW'\n AND time >= $__timeFrom()::timestamptz\n AND time <= $__timeTo()::timestamptz\nGROUP BY time_bucket((CASE WHEN '${interval}' = 'auto' THEN (${__interval_ms} || ' milliseconds') ELSE '${interval}' END)::interval, time), sublink_id\nORDER BY \"time\" ASC",
"refId": "A"
}
],
@@ -392,18 +392,121 @@
"query": "SELECT DISTINCT cml_id::text FROM cml_metadata ORDER BY cml_id",
"refresh": 1,
"regex": "",
"skipUrlSync": false,
"skipUrlSync": true,
"sort": 0,
"tagValuesQuery": "",
"tagsQuery": "",
"type": "query",
"useTags": false
},
{
"current": {
"selected": true,
"text": "1min",
"value": "1 minute"
},
"description": "Time aggregation interval for downsampling data",
"hide": 0,
"includeAll": false,
"label": "Interval",
"multi": false,
"name": "interval",
"options": [
{
"selected": false,
"text": "Auto",
"value": "auto"
},
{
"selected": true,
"text": "1min",
"value": "1 minute"
},
{
"selected": false,
"text": "5min",
"value": "5 minutes"
},
{
"selected": false,
"text": "15min",
"value": "15 minutes"
},
{
"selected": false,
"text": "1h",
"value": "1 hour"
},
{
"selected": false,
"text": "6h",
"value": "6 hours"
},
{
"selected": false,
"text": "1d",
"value": "1 day"
}
],
"query": "Auto : auto,1min : 1 minute,5min : 5 minutes,15min : 15 minutes,1h : 1 hour,6h : 6 hours,1d : 1 day",
"queryValue": "",
"skipUrlSync": true,
"type": "custom"
},
{
"current": {
"selected": true,
"text": "Mean",
"value": "AVG"
},
"description": "Aggregation function for downsampling",
"hide": 0,
"includeAll": false,
"label": "Aggregation",
"multi": false,
"name": "aggregation",
"options": [
{
"selected": true,
"text": "Mean",
"value": "AVG"
},
{
"selected": false,
"text": "Raw (no aggregation)",
"value": "RAW"
},
{
"selected": false,
"text": "Min",
"value": "MIN"
},
{
"selected": false,
"text": "Max",
"value": "MAX"
},
{
"selected": false,
"text": "Median",
"value": "MEDIAN"
},
{
"selected": false,
"text": "StdDev",
"value": "STDDEV"
}
],
"query": "Mean : AVG,Raw (no aggregation) : RAW,Min : MIN,Max : MAX,Median : MEDIAN,StdDev : STDDEV",
"queryValue": "",
"skipUrlSync": true,
"type": "custom"
}
]
},
"time": {
"from": "2015-08-27T00:00:00Z",
"to": "2015-08-27T00:30:00Z"
"from": "now-7d",
"to": "now"
},
"timepicker": {},
"timezone": "",