This project is an AI Observability Proof of Concept (POC) designed to detect, visualize, alert on, and automatically remediate Data Drift in Machine Learning models. It simulates a production environment using a standard industry stack (Prometheus + Grafana) coupled with Evidently AI for statistical monitoring.
This project goes beyond simple detection to demonstrate a Closed-Loop MLOps Lifecycle:
- Tiered Alerting Strategy:
  - Warning (PSI > 0.15): Early signal for proactive monitoring.
  - Critical (PSI > 0.30): Actionable alerts connected to incident response channels.
- Automated Remediation:
  - One-click Auto-Retrain simulation directly from the dashboard.
  - Visualizes the complete CI/CD pipeline: Cluster Init -> Data Fetch -> AutoML -> Deployment.
- Deep-Dive Diagnostics:
  - Dual-Metric Analysis: Compares PSI against Jensen-Shannon Divergence for robust drift confirmation (see the sketch after this list).
  - Data Quality Gates: Integrated schema validation (Nulls, Types, Ranges) to distinguish "Bad Data" from "Real Drift".
  - Explainability: Feature importance tracking (SHAP values) to pinpoint root causes.
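To make the dual-metric check concrete, here is a minimal sketch of how PSI and Jensen-Shannon divergence can be computed for a single numeric feature and mapped onto the tiered thresholds. The helper names and the 10-bin histogram are illustrative assumptions, not the exact implementation in monitor_service.py.

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

WARNING_PSI = 0.15   # early-warning threshold
CRITICAL_PSI = 0.30  # actionable incident threshold

def _binned_probs(reference, current, bins=10):
    """Histogram both samples on bin edges derived from the reference data."""
    edges = np.histogram_bin_edges(reference, bins=bins)
    ref_counts, _ = np.histogram(reference, bins=edges)
    cur_counts, _ = np.histogram(current, bins=edges)
    # A small epsilon avoids division by zero and log(0) in empty bins.
    eps = 1e-6
    ref_p = (ref_counts + eps) / (ref_counts.sum() + eps * len(ref_counts))
    cur_p = (cur_counts + eps) / (cur_counts.sum() + eps * len(cur_counts))
    return ref_p, cur_p

def psi(reference, current, bins=10):
    """Population Stability Index between reference and current samples."""
    ref_p, cur_p = _binned_probs(reference, current, bins)
    return float(np.sum((cur_p - ref_p) * np.log(cur_p / ref_p)))

def js_divergence(reference, current, bins=10):
    """Jensen-Shannon divergence (squared JS distance) between the samples."""
    ref_p, cur_p = _binned_probs(reference, current, bins)
    return float(jensenshannon(ref_p, cur_p) ** 2)

def drift_level(psi_value):
    """Map a PSI value onto the POC's tiered alerting levels."""
    if psi_value > CRITICAL_PSI:
        return "CRITICAL"
    if psi_value > WARNING_PSI:
        return "WARNING"
    return "OK"
```

Requiring both metrics to move before confirming drift reduces false positives caused by a single noisy statistic.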
The system is fully containerized and orchestrated via Docker Compose:
- model_service: The inference engine hosting the trained Random Forest model.
- monitor_service: The observability core. It runs Evidently AI to compare live inference data against the training baseline.
  - Calculates PSI (Population Stability Index).
  - Calculates Jensen-Shannon Divergence.
  - Performs Data Quality checks (Nulls, Type mismatches).
  - Exposes metrics at /metrics for Prometheus (a minimal exposition sketch follows this list).
- prometheus: Time-series database that scrapes metrics from the monitor_service.
- grafana: Visualization dashboard connected to Prometheus with configured Alert Rules.
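For illustration, the sketch below shows how such a service can publish its gauges with the prometheus_client library, reusing the psi and js_divergence helpers from the earlier sketch. The metric names, port, and file paths are assumptions for this sketch rather than the identifiers used in monitor_service.py.

```python
import time
import pandas as pd
from prometheus_client import Gauge, start_http_server

# psi() and js_divergence() are the helpers defined in the previous sketch.
# Gauge names below are illustrative, not necessarily the POC's exact ones.
PSI_GAUGE = Gauge("data_drift_psi", "PSI vs. training baseline", ["feature"])
JSD_GAUGE = Gauge("data_drift_js_divergence", "JS divergence vs. training baseline", ["feature"])
NULL_GAUGE = Gauge("data_quality_null_ratio", "Share of nulls in the live window", ["feature"])

def update_metrics(reference: pd.DataFrame, current: pd.DataFrame) -> None:
    """Recompute drift and data-quality gauges for each monitored feature."""
    for feature in reference.columns:
        ref_col = reference[feature].dropna()
        cur_col = current[feature].dropna()
        PSI_GAUGE.labels(feature=feature).set(psi(ref_col, cur_col))
        JSD_GAUGE.labels(feature=feature).set(js_divergence(ref_col, cur_col))
        NULL_GAUGE.labels(feature=feature).set(current[feature].isna().mean())

if __name__ == "__main__":
    start_http_server(8000)                      # serves /metrics (port is illustrative)
    reference = pd.read_csv("reference.csv")     # training baseline (illustrative path)
    while True:
        current = pd.read_csv("live_window.csv") # latest inference window (illustrative path)
        update_metrics(reference, current)
        time.sleep(60)
```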
- Docker & Docker Compose
- Python 3.9+
```bash
# 1. Build and start all services
docker-compose up --build -d

# 2. Access the Dashboard
# Open http://localhost:3000 (Grafana)
# Default credentials: admin / admin
```

We utilize a tiered alerting strategy to prevent alert fatigue and ensure appropriate responses (a sketch of the corresponding Prometheus rules follows the table):
| Level | Threshold (PSI) | Notification Channel | Response SLA |
|---|---|---|---|
| WARNING | > 0.15 | Slack (#ml-alerts) | 4 Hours |
| CRITICAL | > 0.30 | PagerDuty + Slack | 15 Minutes |
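As a sketch of what the rules in alerts.yml could look like, the two tiers map directly onto Prometheus alerting rules. This assumes the PSI gauge is exported under the data_drift_psi name from the earlier sketch, and the 5-minute hold period is also an assumption.

```yaml
groups:
  - name: drift_alerts
    rules:
      - alert: DRIFT_WARNING
        expr: data_drift_psi > 0.15        # warning tier from the table above
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "PSI above warning threshold for {{ $labels.feature }}"
      - alert: CRITICAL_DRIFT_DETECTED
        expr: data_drift_psi > 0.30        # critical tier from the table above
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "PSI above critical threshold for {{ $labels.feature }}"
```

Routing the warning tier to Slack and the critical tier to PagerDuty happens outside these rules, in the Grafana or Alertmanager notification configuration.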
Scenario: CRITICAL_DRIFT_DETECTED (PSI > 0.30) fires.
- Acknowledge: On-call engineer acknowledges the PagerDuty alert.
- Verify: Check the Grafana "Data Quality" panel.
  - Is it a data engineering issue? (e.g., a sudden influx of Nulls).
  - Is it genuine drift? (e.g., shifts in petal_length).
- Impact Analysis: Check the "Model Accuracy" metric.
  - If Accuracy < 90%: Pause Batch Inferences.
- Remediation:
  - Option A (Auto): Click "Trigger Auto-Retrain" in the dashboard to execute the Airflow DAG (see the sketch below for how such a trigger can call Airflow).
  - Option B (Manual): Roll back to the previous stable model version if the alert was caused by a bad deployment.
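As a sketch of how the "Trigger Auto-Retrain" button can kick off the DAG, the snippet below calls Airflow's stable REST API. The Airflow URL, DAG id, and credentials are placeholders, not the POC's actual values.

```python
import requests

AIRFLOW_URL = "http://airflow:8080"    # placeholder host
DAG_ID = "auto_retrain_pipeline"       # placeholder DAG id

def trigger_retrain(reason: str) -> str:
    """Start a new DAG run via Airflow's stable REST API and return its run id."""
    response = requests.post(
        f"{AIRFLOW_URL}/api/v1/dags/{DAG_ID}/dagRuns",
        json={"conf": {"trigger_reason": reason}},
        auth=("admin", "admin"),       # placeholder credentials
        timeout=10,
    )
    response.raise_for_status()
    return response.json()["dag_run_id"]

if __name__ == "__main__":
    run_id = trigger_retrain("CRITICAL_DRIFT_DETECTED: PSI > 0.30")
    print(f"Retraining started, dag_run_id={run_id}")
```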
- project_files/docker-compose.yml: Main orchestration file (a minimal sketch follows this list).
- project_files/monitor_service.py: Core logic for drift detection using Evidently.
- project_files/prometheus.yml: Scrape configurations.
- project_files/alerts.yml: Prometheus alerting rules definition.
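For orientation, a minimal docker-compose.yml sketch with the four services is shown below. The build contexts, images, and volume paths are illustrative assumptions; only the service names and the Grafana port are taken from this document.

```yaml
version: "3.8"
services:
  model_service:
    build: ./model_service       # inference API serving the Random Forest model
  monitor_service:
    build: ./monitor_service     # Evidently-based drift checks, exposes /metrics
    depends_on:
      - model_service
  prometheus:
    image: prom/prometheus
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - ./alerts.yml:/etc/prometheus/alerts.yml
    depends_on:
      - monitor_service
  grafana:
    image: grafana/grafana
    ports:
      - "3000:3000"              # dashboard at http://localhost:3000
    depends_on:
      - prometheus
```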
modules = ["nodejs-20", "web", "postgresql-16"] run = "npm run dev" hidden = [".config", ".git", "generated-icon.png", "node_modules", "dist"]
[nix] channel = "stable-24_05"
[deployment] deploymentTarget = "static" build = ["npm", "run", "build"] publicDir = "dist/public"
[[ports]] localPort = 5000 externalPort = 80
[[ports]] localPort = 40075 externalPort = 3000
[env] PORT = "5000"
[workflows] runButton = "Project"
[[workflows.workflow]] name = "Project" mode = "parallel" author = "agent"
[[workflows.workflow.tasks]] task = "workflow.run" args = "Start application"
[[workflows.workflow]] name = "Start application" author = "agent"
[[workflows.workflow.tasks]] task = "shell.exec" args = "npm run dev:client" waitForPort = 5000
[agent] mockupState = "MOCKUP"