
# 🚨 PatternAlarm

Streaming analytics pipeline for fraud detection: Kafka ingests 10K+ events/minute, Flink processes streaming aggregates across 3 domains, and Spark ML scores with 97.5% accuracy. Includes a 79x performance optimization case study.

Part of my Scalefine.ai portfolio, exploring streaming patterns for real-time ML.

Dashboard


## 🎯 Context

This is a sandbox project exploring how streaming architectures handle fraud detection patterns. The goal was to wire together Kafka → Flink → Spark ML end-to-end and see where the bottlenecks actually are.

What I wanted to learn:

- Can Flink keep up with 10K events/min while calling an ML API?
- How bad is Spark job overhead for real-time inference? (Spoiler: very bad; see the performance case study below.)
- What does it cost to run MSK + ECS + RDS for a streaming pipeline?

These questions came from my experience in gaming analytics, where streaming pipelines often hit scalability walls under high-velocity data. By simulating fraud across fintech, gaming, and e-commerce, I aimed to uncover practical trade-offs in cost, performance, and reliability that go beyond textbook setups.


## 📊 Capacity & Cost

To evaluate scalability, I ran load tests across different API Gateway configurations. The ML inference layer turned out to be the bottleneck, not Flink. Here's how throughput scales with API Gateway instances:

| API Gateway Instances | Throughput | Monthly Cost | Cost per 1M events |
|---|---|---|---|
| 1 instance | ~3,700/min | ~$300 | ~$0.05 |
| 3 instances | ~12,000/min | ~$450 | ~$0.025 |
| 10 instances | ~35,000/min | ~$800 | ~$0.015 |

Measured during load tests. Auto-scales to zero when idle.


## ✨ Results

| Metric | Value | Notes |
|---|---|---|
| Throughput | 3.7K events/min (1 API instance) | Scales horizontally |
| ML Accuracy | 97.5% (F1: 97.27%) | 10-class RandomForest |
| Detection Latency | < 3 seconds | End-to-end |
| Batch Optimization | 79x faster | See case study below |

These metrics validate the pipeline's viability for real-time fraud intervention, where low latency is critical. The high ML accuracy, combined with sub-3-second detection, provides a blueprint for production systems handling diverse event streams.

📖 The 79x improvement was the interesting part; write-up here.


πŸ—οΈ Architecture

The architecture follows a layered approach: event ingestion via Kafka, stream processing in Flink, and ML scoring through a FastAPI service running Spark. Data flows across the three domains with built-in fault tolerance and horizontal scalability at the inference layer.

Architecture

### Services

| Service | Purpose | Tech |
|---|---|---|
| `event-simulator/` | Lambda function generating fictive transactions with configurable fraud %. Triggered by dashboard, injects into Kafka topics. | Python, AWS Lambda |
| `flink-processor/` | Processes Kafka streams from 3 domains (fintech/ecommerce/gaming). Bronze → Silver (shared features) → Gold (ML scoring). Saves fraud alerts to PostgreSQL. | Scala, Flink, ECS Fargate |
| `api-gateway/` | Serves ML predictions, velocity analytics, and fraud alerts with related transactions. Loads trained model from S3. This was the bottleneck. | Python, FastAPI, Spark ML |
| `dashboard/` | UI to trigger transaction pipeline, monitor fraud status, and visualize real-time charts. | Java, Spring Boot, Chart.js |
| `airflow/` | Productionizes model training pipeline: extract features → train model → validate → save to S3. | Python, Airflow, EMR Serverless |
| `notebook/` | Preliminary model development: cross-domain RandomForest achieving 97.5% accuracy on 10 fraud types. | PySpark, Jupyter |
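To make the event simulator's role concrete, here is a minimal sketch of generating fictive transactions with a configurable fraud percentage. The field names and fraud types here are illustrative assumptions, not the repo's actual schema:

```python
import json
import random
import uuid

# Hypothetical fraud labels; the real simulator's taxonomy may differ.
FRAUD_TYPES = ["account_takeover", "chargeback_fraud", "money_laundering"]

def make_event(domain: str, fraud_pct: float, rng: random.Random) -> dict:
    """Generate one fictive transaction; fraud_pct is the 0-1 fraud probability."""
    is_fraud = rng.random() < fraud_pct
    return {
        "event_id": str(uuid.uuid4()),
        "domain": domain,  # fintech / ecommerce / gaming
        "amount": round(rng.uniform(1, 500), 2),
        "is_fraud": is_fraud,
        "fraud_type": rng.choice(FRAUD_TYPES) if is_fraud else None,
    }

def simulate(domain: str, n: int, fraud_pct: float, seed: int = 42) -> list[dict]:
    """Batch of n events, seeded for reproducible load tests."""
    rng = random.Random(seed)
    return [make_event(domain, fraud_pct, rng) for _ in range(n)]

events = simulate("gaming", 1000, fraud_pct=0.05)
print(json.dumps(events[0], indent=2))
print(sum(e["is_fraud"] for e in events), "fraudulent of", len(events))
```

In the real service, each dict would be serialized to JSON and produced to the matching Kafka topic by the Lambda handler.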

### Tech Stack

| Layer | Technology | Why This Choice |
|---|---|---|
| Ingestion | AWS Lambda → MSK (Kafka) | Scalable data pipelines, multi-topic architecture |
| Stream Processing | Apache Flink (ECS Fargate) | Real-time analytics with exactly-once semantics |
| ML Scoring | Spark MLlib (RandomForest) | Production MLOps patterns, batch-optimized inference |
| Storage | PostgreSQL (RDS) + Redis | Time-series patterns, sub-100ms query caching |
| API | FastAPI (async) | High-throughput model serving |
| Dashboard | Spring Boot + Thymeleaf + Chart.js | Real-time visualization |
| Orchestration | Apache Airflow + EMR Serverless | MLOps workflow automation |
| Infrastructure | Terraform (IaC) | Reproducible cloud architecture |
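The "sub-100ms query caching" in the storage layer is the classic cache-aside pattern. A minimal sketch, using an in-process dict with TTLs as a stand-in for Redis (the key names, TTL, and query are assumptions for illustration):

```python
import time

cache: dict[str, tuple[float, object]] = {}  # key -> (expiry timestamp, value)
TTL_SECONDS = 30.0

def expensive_query(key: str) -> list[int]:
    """Stand-in for a PostgreSQL velocity-analytics query."""
    time.sleep(0.05)  # pretend the database round-trip takes ~50ms
    return [len(key)]

def cached_query(key: str) -> object:
    """Cache-aside: serve from cache if fresh, else query and populate."""
    now = time.monotonic()
    hit = cache.get(key)
    if hit and hit[0] > now:
        return hit[1]          # fresh cache hit, no database call
    value = expensive_query(key)
    cache[key] = (now + TTL_SECONDS, value)
    return value

cached_query("velocity:gaming")  # miss: hits the database
cached_query("velocity:gaming")  # hit: served from cache
```

With Redis the dict becomes `GET`/`SETEX` calls, and the TTL moves into the `SETEX` expiry.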

Similar stack to what I've used in gaming analytics roles. More projects →


## 📸 Screenshots

To illustrate the pipeline in action, here are key visuals from the dashboard, ML workflow, and cost tracking.

### Live Dashboard

Real-time fraud alerts with velocity graph and severity indicators.

Dashboard Screenshot

### ML Pipeline (Airflow)

Orchestrated training pipeline: feature extraction → model training → validation (EMR Serverless).

Airflow DAG

### Model Performance

97.5% accuracy with per-class breakdown across 10 fraud types.

Model Metrics

### Infrastructure Costs

Optimized from $26/day to ~$10/day with auto-scaling.

AWS Costs


## 🔬 Performance Case Study: 79x Throughput Improvement

### The Problem

During load testing, ML predictions were timing out under backlog pressure. Individual API calls took ~950ms each, limiting throughput to 63 predictions/minute, far below the 10K/min target.

### Investigation

Isolated the API Gateway and profiled each step:

```
⏱️ [5] createDataFrame:  412ms (43%)
⏱️ [6] model.transform:   89ms  (9%)
⏱️ [7] collect:          773ms (81%)  ← BOTTLENECK
```

**Finding:** 81% of the time was fixed overhead (Spark job setup), not actual inference. Scaling Flink or adding API instances wouldn't fix this; the problem was architectural.
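The per-step breakdown above comes from timing each stage of the request. A generic way to produce that kind of report (a sketch, not the repo's actual instrumentation; the sleeps stand in for real Spark calls):

```python
import time
from contextlib import contextmanager

timings: dict[str, float] = {}

@contextmanager
def step(name: str):
    """Record wall-clock milliseconds for one named pipeline step."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[name] = (time.perf_counter() - start) * 1000

# Stand-ins for the real createDataFrame / model.transform / collect calls:
with step("createDataFrame"):
    time.sleep(0.02)
with step("model.transform"):
    time.sleep(0.005)
with step("collect"):
    time.sleep(0.04)

total = sum(timings.values())
for name, ms in timings.items():
    print(f"⏱️ {name}: {ms:.0f}ms ({ms / total:.0%})")
```

Wrapping each suspect call like this is what pinpointed `collect` as the dominant cost.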

### Solution: Batch Processing

Switched from async single requests to synchronous batch processing.

**Before:**

```scala
// 1 request = 1 Spark job = 950ms
AsyncDataStream.unorderedWait(aggregates, fraudScoringAsyncFunction, ...)
```

**After:**

```scala
// 100 requests = 1 Spark job = 1200ms total = 12ms each
aggregates.process(new FraudScoringBatchFunction(batchSize = 100))
```
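The gain is pure amortization of Spark's fixed job-setup cost. Fitting a simple fixed-plus-marginal latency model to the two measured points (1 request at 950ms, 100 requests at 1200ms; the linearity between those points is an assumption):

```python
# Two measured points from the load test:
#   1 request    -> 950 ms per Spark job
#   100 requests -> 1200 ms per Spark job
single_ms, batch_ms, batch_size = 950.0, 1200.0, 100

per_row = (batch_ms - single_ms) / (batch_size - 1)  # marginal cost per row
fixed = single_ms - per_row                          # Spark job setup overhead

def latency_per_prediction(n: int) -> float:
    """Effective per-prediction latency when n rows share one Spark job."""
    return (fixed + n * per_row) / n

print(f"fixed overhead ~{fixed:.0f} ms, per-row ~{per_row:.1f} ms")
print(f"batch of 100: {latency_per_prediction(100):.0f} ms per prediction")
print(f"speedup: {latency_per_prediction(1) / latency_per_prediction(100):.0f}x")
```

The model says nearly all of the 950ms is setup, so packing 100 rows into one job drops the effective latency to 12ms, the 79x in the headline.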

### Results

| Metric | Before | After | Improvement |
|---|---|---|---|
| Latency/prediction | 950ms | 12ms | 79x faster |
| Throughput | 63/min | 3,700/min | 59x higher |
| 10K/min target | ❌ | ✅ (with 3 API instances) | |

### Capacity Planning (Post-Optimization)

| Configuration | Throughput | Supported Load |
|---|---|---|
| 1 API instance, single requests | 63/min | ❌ None |
| 1 API instance, batch 100 | 3,700/min | ✅ MINI |
| 3 API instances, batch 100 | 12,000/min | ✅ NORMAL |
| 10 API instances, batch 100 | 35,000/min | ✅ PEAK |
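With the measured per-instance batched throughput, sizing for a load target is simple division (a naive linear estimate; the measured numbers deviate slightly from linear):

```python
import math

PER_INSTANCE = 3_700  # measured events/min for one batched API instance

def instances_needed(target_per_min: int) -> int:
    """Naive linear-scaling estimate of API instances for a target load."""
    return math.ceil(target_per_min / PER_INSTANCE)

print(instances_needed(10_000))  # the 10K/min target needs 3 instances
```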

This optimization not only met the 10K/min target but also underscored the value of rethinking inference patterns in streaming ML pipelines.

💡 **Takeaway:** Profile before scaling. This applies to any Spark-based inference.


## 🎯 Detection Results

In a NORMAL load test simulating 10K events per minute, the pipeline generated 16 alerts, catching 40.6% of true fraud cases across various types. Critical and high-severity incidents were prioritized effectively, showcasing the system's ability to surface actionable insights in real time.

| Metric | Value |
|---|---|
| Alerts generated | 16 |
| True fraud detected | 13/32 (40.6%) |
| Critical severity | 5 |
| High severity | 9 |

Alert Types Detected:

- suspicious_activity (8)
- account_takeover (3)
- chargeback_fraud (2)
- money_laundering (2)
- structuring (1)
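The table gives recall directly. If the 13 caught fraud cases are all among the 16 alerts (an assumption, which would imply 3 false positives), precision follows too:

```python
alerts = 16          # alerts generated
true_positives = 13  # true fraud cases caught
actual_fraud = 32    # fraud cases the simulator injected

recall = true_positives / actual_fraud
# Assumes the 3 remaining alerts are false positives (not confirmed in the text):
precision = true_positives / alerts

print(f"recall    = {recall:.1%}")  # 40.6%, matching the table
print(f"precision = {precision:.1%}")
```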

## 🚀 Quick Start

Getting started is straightforward, whether for local development or full AWS deployment.

### Prerequisites

- AWS CLI configured
- Terraform >= 1.0
- Docker + Docker Compose
- Python 3.11+
- Java 17

### 1. Local Development (Docker Compose)

```bash
# Clone
git clone https://github.com/acourreg/patternalarm.git
cd patternalarm

# Start local stack (Kafka, PostgreSQL, Redis)
cd scripts
cp config.example.conf config.conf  # Edit with your settings
docker-compose up -d

# Run setup scripts in order
./1-create-kafka-topics.sh
./2-rds-schema.sql        # Apply to local PostgreSQL
./3-upload-training-data.sh
```

### 2. Deploy to AWS (Terraform)

```bash
cd infra/terraform
terraform init
terraform apply

# Then run the same scripts against AWS resources
cd ../scripts
./1-create-kafka-topics.sh
./3-upload-training-data.sh
```

### 3. Run Load Test

```bash
# From dashboard UI or CLI
curl -X POST "https://<dashboard>/api/test/execute" \
  -d "domain=gaming&loadLevel=normal"
```

## 💰 Cost Optimization

Managing costs was a key focus during development, as streaming setups can quickly rack up bills. From real billing data over six days, I optimized from an initial $26/day down to about $10/day through targeted strategies.

| Metric | Value |
|---|---|
| Total (6 days) | $92.88 |
| Average daily | $15.48 |
| Optimized daily | ~$10/day |
| Monthly estimate | ~$300 |

Cost breakdown: MSK (Kafka) ~40%, ECS ~25%, RDS ~20%, Other ~15%

Savings tactics:

- ECS services scale to 0 when idle
- MSK paused between tests
- NAT Gateway minimized
- EMR Serverless for Airflow jobs (pay per use)

Detailed cost breakdown in my blog post on cloud cost patterns.


## 📚 Lessons Learned

Building this project reinforced several principles in streaming and ML engineering:

1. **Batch > single requests for Spark ML**: fixed overhead dominates individual inference. This shift alone unlocked massive throughput gains.
2. **Profile before scaling**: adding instances wouldn't have fixed the root bottleneck. The 79x improvement came from an architectural change, not more resources.
3. **MSK is expensive**: consider Redpanda or self-hosted Kafka for dev environments.
4. **Auto-scale aggressively**: cloud costs compound faster than expected.
5. **EMR Serverless**: perfect for sporadic ML training vs. always-on clusters.

πŸ› οΈ What's Inside

At its core, the project integrates streaming, data, cloud, and MLOps tools β€” with batch optimization being the most challenging (and rewarding) aspect.

| Area | What I Used | Notes |
|---|---|---|
| Streaming | Kafka (MSK), Flink (Scala), Spark ML | Multi-topic, windowing, exactly-once semantics |
| Data | PostgreSQL, Redis, FastAPI | Time-series patterns, caching layer |
| Cloud | AWS (MSK, RDS, ECS Fargate, EMR Serverless) | Terraform for everything |
| MLOps | Airflow, EMR Serverless, S3 model registry | Training pipeline, not just notebooks |
| The Hard Part | Batch optimization for Spark inference | See case study; this took a while to figure out |

More context on my background: scalefine.ai/about


## 🔗 Related


## 📄 License

MIT


## 🙋 Author

**Aurélien Courreges-Clercq**
Building streaming pipelines and ML systems.

Scalefine.ai · LinkedIn · GitHub
