This project is an AI Observability Proof of Concept (POC) designed to detect, visualize, alert on, and automatically remediate Data Drift in Machine Learning models. It simulates a production environment using a standard industry stack (Prometheus + Grafana) coupled with Evidently AI for statistical monitoring.
This project goes beyond simple detection to demonstrate a Closed-Loop MLOps Lifecycle:
- Tiered Alerting Strategy:
  - Warning (PSI > 0.15): Early signal for proactive monitoring.
  - Critical (PSI > 0.30): Actionable alerts connected to incident response channels.
- Automated Remediation:
  - One-click Auto-Retrain simulation directly from the dashboard.
  - Visualizes the complete CI/CD pipeline: Cluster Init -> Data Fetch -> AutoML -> Deployment.
- Deep-Dive Diagnostics:
  - Dual-Metric Analysis: Compares PSI against Jensen-Shannon Divergence for robust drift confirmation (see the sketch after this list).
  - Data Quality Gates: Integrated schema validation (Nulls, Types, Ranges) to distinguish "Bad Data" from "Real Drift".
  - Explainability: Feature importance tracking (SHAP values) to pinpoint root causes.
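To make the dual-metric check concrete, here is a minimal sketch of how PSI and Jensen-Shannon divergence can be computed for a single numeric feature and mapped onto the tiered thresholds. The helper names and the 10-bin histogram are illustrative assumptions, not the exact implementation in monitor_service.py.

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

WARNING_PSI = 0.15   # early-warning threshold
CRITICAL_PSI = 0.30  # actionable incident threshold

def _binned_probs(reference, current, bins=10):
    """Histogram both samples on bin edges derived from the reference data."""
    edges = np.histogram_bin_edges(reference, bins=bins)
    ref_counts, _ = np.histogram(reference, bins=edges)
    cur_counts, _ = np.histogram(current, bins=edges)
    # A small epsilon avoids division by zero and log(0) in empty bins.
    eps = 1e-6
    ref_p = (ref_counts + eps) / (ref_counts.sum() + eps * len(ref_counts))
    cur_p = (cur_counts + eps) / (cur_counts.sum() + eps * len(cur_counts))
    return ref_p, cur_p

def psi(reference, current, bins=10):
    """Population Stability Index between reference and current samples."""
    ref_p, cur_p = _binned_probs(reference, current, bins)
    return float(np.sum((cur_p - ref_p) * np.log(cur_p / ref_p)))

def js_divergence(reference, current, bins=10):
    """Jensen-Shannon divergence (squared JS distance) between the samples."""
    ref_p, cur_p = _binned_probs(reference, current, bins)
    return float(jensenshannon(ref_p, cur_p) ** 2)

def drift_level(psi_value):
    """Map a PSI value onto the POC's tiered alerting levels."""
    if psi_value > CRITICAL_PSI:
        return "CRITICAL"
    if psi_value > WARNING_PSI:
        return "WARNING"
    return "OK"
```

Requiring both metrics to move before confirming drift reduces false positives caused by a single noisy statistic.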
The system is fully containerized and orchestrated via Docker Compose:
- model_service: The inference engine hosting the trained Random Forest model.
- monitor_service: The observability core. It runs Evidently AI to compare live inference data against the training baseline.
  - Calculates PSI (Population Stability Index).
  - Calculates Jensen-Shannon Divergence.
  - Performs Data Quality checks (Nulls, Type mismatches).
  - Exposes metrics at /metrics for Prometheus (a minimal exposition sketch follows this list).
- prometheus: Time-series database that scrapes metrics from the monitor_service.
- grafana: Visualization dashboard connected to Prometheus with configured Alert Rules.
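For illustration, the sketch below shows how such a service can publish its gauges with the prometheus_client library, reusing the psi and js_divergence helpers from the earlier sketch. The metric names, port, and file paths are assumptions for this sketch rather than the identifiers used in monitor_service.py.

```python
import time
import pandas as pd
from prometheus_client import Gauge, start_http_server

# psi() and js_divergence() are the helpers defined in the previous sketch.
# Gauge names below are illustrative, not necessarily the POC's exact ones.
PSI_GAUGE = Gauge("data_drift_psi", "PSI vs. training baseline", ["feature"])
JSD_GAUGE = Gauge("data_drift_js_divergence", "JS divergence vs. training baseline", ["feature"])
NULL_GAUGE = Gauge("data_quality_null_ratio", "Share of nulls in the live window", ["feature"])

def update_metrics(reference: pd.DataFrame, current: pd.DataFrame) -> None:
    """Recompute drift and data-quality gauges for each monitored feature."""
    for feature in reference.columns:
        ref_col = reference[feature].dropna()
        cur_col = current[feature].dropna()
        PSI_GAUGE.labels(feature=feature).set(psi(ref_col, cur_col))
        JSD_GAUGE.labels(feature=feature).set(js_divergence(ref_col, cur_col))
        NULL_GAUGE.labels(feature=feature).set(current[feature].isna().mean())

if __name__ == "__main__":
    start_http_server(8000)                      # serves /metrics (port is illustrative)
    reference = pd.read_csv("reference.csv")     # training baseline (illustrative path)
    while True:
        current = pd.read_csv("live_window.csv") # latest inference window (illustrative path)
        update_metrics(reference, current)
        time.sleep(60)
```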
- Docker & Docker Compose
- Python 3.9+
```bash
# 1. Build and start all services
docker-compose up --build -d

# 2. Access the Dashboard
# Open http://localhost:3000 (Grafana)
# Default credentials: admin / admin
```

We utilize a tiered alerting strategy to prevent alert fatigue and ensure appropriate responses (a sketch of the corresponding Prometheus rules follows the table):
| Level | Threshold (PSI) | Notification Channel | Response SLA |
|---|---|---|---|
| WARNING | > 0.15 | Slack (#ml-alerts) | 4 Hours |
| CRITICAL | > 0.30 | PagerDuty + Slack | 15 Minutes |
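As a sketch of what the rules in alerts.yml could look like, the two tiers map directly onto Prometheus alerting rules. This assumes the PSI gauge is exported under the data_drift_psi name from the earlier sketch, and the 5-minute hold period is also an assumption.

```yaml
groups:
  - name: drift_alerts
    rules:
      - alert: DRIFT_WARNING
        expr: data_drift_psi > 0.15        # warning tier from the table above
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "PSI above warning threshold for {{ $labels.feature }}"
      - alert: CRITICAL_DRIFT_DETECTED
        expr: data_drift_psi > 0.30        # critical tier from the table above
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "PSI above critical threshold for {{ $labels.feature }}"
```

Routing the warning tier to Slack and the critical tier to PagerDuty happens outside these rules, in the Grafana or Alertmanager notification configuration.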
Scenario: CRITICAL_DRIFT_DETECTED (PSI > 0.30) fires.
- Acknowledge: On-call engineer acknowledges the PagerDuty alert.
- Verify: Check the Grafana "Data Quality" panel.
  - Is it a data engineering issue? (e.g., a sudden influx of Nulls).
  - Is it genuine drift? (e.g., shifts in petal_length).
- Impact Analysis: Check the "Model Accuracy" metric.
  - If Accuracy < 90%: Pause Batch Inferences.
- Remediation:
  - Option A (Auto): Click "Trigger Auto-Retrain" in the dashboard to execute the Airflow DAG (see the sketch below for how such a trigger can call Airflow).
  - Option B (Manual): Roll back to the previous stable model version if the alert was caused by a bad deployment.
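As a sketch of how the "Trigger Auto-Retrain" button can kick off the DAG, the snippet below calls Airflow's stable REST API. The Airflow URL, DAG id, and credentials are placeholders, not the POC's actual values.

```python
import requests

AIRFLOW_URL = "http://airflow:8080"    # placeholder host
DAG_ID = "auto_retrain_pipeline"       # placeholder DAG id

def trigger_retrain(reason: str) -> str:
    """Start a new DAG run via Airflow's stable REST API and return its run id."""
    response = requests.post(
        f"{AIRFLOW_URL}/api/v1/dags/{DAG_ID}/dagRuns",
        json={"conf": {"trigger_reason": reason}},
        auth=("admin", "admin"),       # placeholder credentials
        timeout=10,
    )
    response.raise_for_status()
    return response.json()["dag_run_id"]

if __name__ == "__main__":
    run_id = trigger_retrain("CRITICAL_DRIFT_DETECTED: PSI > 0.30")
    print(f"Retraining started, dag_run_id={run_id}")
```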
- project_files/docker-compose.yml: Main orchestration file (a minimal sketch follows this list).
- project_files/monitor_service.py: Core logic for drift detection using Evidently.
- project_files/prometheus.yml: Scrape configurations.
- project_files/alerts.yml: Prometheus alerting rules definition.
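For orientation, a minimal docker-compose.yml sketch with the four services is shown below. The build contexts, images, and volume paths are illustrative assumptions; only the service names and the Grafana port are taken from this document.

```yaml
version: "3.8"
services:
  model_service:
    build: ./model_service       # inference API serving the Random Forest model
  monitor_service:
    build: ./monitor_service     # Evidently-based drift checks, exposes /metrics
    depends_on:
      - model_service
  prometheus:
    image: prom/prometheus
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - ./alerts.yml:/etc/prometheus/alerts.yml
    depends_on:
      - monitor_service
  grafana:
    image: grafana/grafana
    ports:
      - "3000:3000"              # dashboard at http://localhost:3000
    depends_on:
      - prometheus
```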
modules = ["nodejs-20", "web", "postgresql-16"] run = "npm run dev" hidden = [".config", ".git", "generated-icon.png", "node_modules", "dist"]
[nix] channel = "stable-24_05"
[deployment] deploymentTarget = "static" build = ["npm", "run", "build"] publicDir = "dist/public"
[[ports]] localPort = 5000 externalPort = 80
[[ports]] localPort = 40075 externalPort = 3000
[env] PORT = "5000"
[workflows] runButton = "Project"
[[workflows.workflow]] name = "Project" mode = "parallel" author = "agent"
[[workflows.workflow.tasks]] task = "workflow.run" args = "Start application"
[[workflows.workflow]] name = "Start application" author = "agent"
[[workflows.workflow.tasks]] task = "shell.exec" args = "npm run dev:client" waitForPort = 5000
[agent] mockupState = "MOCKUP"