
# MLOps Drift Detection System

## 📋 Overview

This project is an AI Observability Proof of Concept (POC) designed to detect, visualize, alert, and automatically remediate Data Drift in Machine Learning models. It simulates a production environment using a standard industry stack (Prometheus + Grafana) coupled with Evidently AI for statistical monitoring.

## 🌟 Key Features (Operational Excellence)

This project goes beyond simple detection to demonstrate a Closed-Loop MLOps Lifecycle:

1. **Tiered Alerting Strategy**
   - **Warning (PSI > 0.15):** early-warning signal for proactive monitoring.
   - **Critical (PSI > 0.30):** actionable alerts routed to incident-response channels.
2. **Automated Remediation**
   - One-click Auto-Retrain simulation directly from the dashboard.
   - Visualizes the complete CI/CD pipeline: Cluster Init -> Data Fetch -> AutoML -> Deployment.
3. **Deep-Dive Diagnostics**
   - **Dual-Metric Analysis:** compares PSI vs. Jensen-Shannon Divergence for robust drift confirmation.
   - **Data Quality Gates:** integrated schema validation (Nulls, Types, Ranges) to distinguish "Bad Data" from "Real Drift".
   - **Explainability:** feature-importance tracking (SHAP values) to pinpoint root causes.
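The two drift metrics above can be sketched numerically. Below is a minimal, illustrative PSI / Jensen-Shannon computation in Python, assuming histogram buckets derived from the training baseline; the function names are hypothetical and not those used in `monitor_service.py`:

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

def psi(expected: np.ndarray, actual: np.ndarray, bins: int = 10) -> float:
    """Population Stability Index between a baseline and a live sample."""
    edges = np.histogram_bin_edges(expected, bins=bins)
    e_cnt, _ = np.histogram(expected, bins=edges)
    a_cnt, _ = np.histogram(actual, bins=edges)
    # Convert counts to proportions; clip to avoid log(0) on empty buckets.
    e_pct = np.clip(e_cnt / e_cnt.sum(), 1e-6, None)
    a_pct = np.clip(a_cnt / max(a_cnt.sum(), 1), 1e-6, None)
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))

def js_divergence(expected: np.ndarray, actual: np.ndarray, bins: int = 10) -> float:
    """Jensen-Shannon divergence over the same histogram buckets."""
    edges = np.histogram_bin_edges(expected, bins=bins)
    e_cnt, _ = np.histogram(expected, bins=edges)
    a_cnt, _ = np.histogram(actual, bins=edges)
    # scipy returns the JS *distance* (the square root of the divergence).
    return float(jensenshannon(e_cnt, a_cnt) ** 2)

def alert_level(psi_value: float) -> str:
    """Map a PSI value onto the tiered thresholds described above."""
    if psi_value > 0.30:
        return "CRITICAL"
    if psi_value > 0.15:
        return "WARNING"
    return "OK"
```

For intuition: a 1.5-sigma mean shift in a feature pushes PSI well past the 0.30 critical threshold, while resampling from the unchanged baseline stays near zero.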

## 🏗 Architecture

The system is fully containerized and orchestrated via Docker Compose:

1. **`model_service`**: the inference engine hosting the trained Random Forest model.
2. **`monitor_service`**: the observability core. It runs Evidently AI to compare live inference data against the training baseline:
   - calculates PSI (Population Stability Index);
   - calculates Jensen-Shannon Divergence;
   - performs data quality checks (Nulls, type mismatches);
   - exposes metrics at `/metrics` for Prometheus.
3. **`prometheus`**: time-series database that scrapes metrics from `monitor_service`.
4. **`grafana`**: visualization dashboard connected to Prometheus, with preconfigured alert rules.

## 🚀 Getting Started

### Prerequisites

- Docker & Docker Compose
- Python 3.9+

### Running the Stack

```bash
# 1. Build and start all services
docker-compose up --build -d

# 2. Access the Dashboard
# Open http://localhost:3000 (Grafana)
# Default credentials: admin / admin
```

## 🚦 Alerting & Triage Protocol

### Alert Levels

We use a tiered alerting strategy to prevent alert fatigue and to match the response to the severity:

| Level    | Threshold (PSI) | Notification Channel | Response SLA |
|----------|-----------------|----------------------|--------------|
| WARNING  | > 0.15          | Slack (#ml-alerts)   | 4 Hours      |
| CRITICAL | > 0.30          | PagerDuty + Slack    | 15 Minutes   |
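These two thresholds map directly onto Prometheus alerting rules. A hedged sketch of what `project_files/alerts.yml` might contain; the `feature_psi` metric name and the `for:` durations are assumptions, so consult the actual file for the exported names:

```yaml
groups:
  - name: drift_alerts
    rules:
      - alert: DRIFT_WARNING
        expr: feature_psi > 0.15
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "PSI above warning threshold on {{ $labels.feature }}"
      - alert: CRITICAL_DRIFT_DETECTED
        expr: feature_psi > 0.30
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Critical drift on {{ $labels.feature }}, page on-call"
```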

### Triage Routine (Standard Operating Procedure)

Scenario: CRITICAL_DRIFT_DETECTED (PSI > 0.30) fires.

1. **Acknowledge:** the on-call engineer acknowledges the PagerDuty alert.
2. **Verify:** check the Grafana "Data Quality" panel.
   - Is it a data engineering issue? (e.g., a sudden influx of Nulls.)
   - Is it genuine drift? (e.g., a shift in `petal_length`.)
3. **Impact Analysis:** check the "Model Accuracy" metric.
   - If accuracy < 90%: pause batch inferences.
4. **Remediation:**
   - **Option A (Auto):** click "Trigger Auto-Retrain" in the dashboard to execute the Airflow DAG.
   - **Option B (Manual):** roll back to the previous stable model version if the drift was caused by a bad deployment.
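Option A can also be scripted. A minimal sketch of triggering a retrain run through the Airflow 2 stable REST API; the host, credentials, and DAG id (`model_retrain`) are hypothetical placeholders, not values taken from this project:

```python
import requests

AIRFLOW_URL = "http://airflow:8080"  # hypothetical host inside the compose network
DAG_ID = "model_retrain"             # hypothetical DAG id

def dag_runs_endpoint(base_url: str, dag_id: str) -> str:
    """Airflow 2 stable REST API path for creating a DAG run."""
    return f"{base_url}/api/v1/dags/{dag_id}/dagRuns"

def trigger_retrain(reason: str) -> str:
    """POST a new DAG run; returns the server-assigned dag_run_id."""
    resp = requests.post(
        dag_runs_endpoint(AIRFLOW_URL, DAG_ID),
        json={"conf": {"reason": reason}},
        auth=("admin", "admin"),  # placeholder credentials
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()["dag_run_id"]
```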

## 🛠 Project Structure

- `project_files/docker-compose.yml`: main orchestration file.
- `project_files/monitor_service.py`: core drift-detection logic using Evidently.
- `project_files/prometheus.yml`: scrape configuration.
- `project_files/alerts.yml`: Prometheus alerting-rule definitions.
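For orientation, a sketch of the kind of scrape configuration `project_files/prometheus.yml` likely contains; the job name, rule-file path, and monitor port are assumptions:

```yaml
global:
  scrape_interval: 15s

rule_files:
  - /etc/prometheus/alerts.yml

scrape_configs:
  - job_name: monitor_service
    static_configs:
      - targets: ["monitor_service:8000"]  # assumed container port
```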

