Streaming analytics pipeline for fraud detection: Kafka ingests 10K+ events/minute, Flink computes streaming aggregates across 3 domains, and Spark ML scores them with 97.5% accuracy. Includes a 79x performance optimization case study.
Part of my Scalefine.ai portfolio, exploring streaming patterns for real-time ML.
This is a sandbox project exploring how streaming architectures handle fraud detection patterns. The goal was to wire together Kafka → Flink → Spark ML end-to-end and see where the bottlenecks actually are.
What I wanted to learn:
- Can Flink keep up with 10K events/min while calling an ML API?
- How bad is Spark job overhead for real-time inference? (Spoiler: very bad; see the Performance Case Study below)
- What does it cost to run MSK + ECS + RDS for a streaming pipeline?
These questions came from my experience in gaming analytics, where streaming pipelines often hit scalability walls under high-velocity data. By simulating fraud across fintech, gaming, and e-commerce, I aimed to uncover practical trade-offs in cost, performance, and reliability that go beyond textbook setups.
To evaluate scalability, I ran load tests across different API Gateway configurations. The ML inference layer turned out to be the bottleneck β not Flink. Here's how throughput scales with API Gateway instances:
| API Gateway Instances | Throughput | Monthly Cost | Cost per 1M events |
|---|---|---|---|
| 1 instance | ~3,700/min | ~$300 | ~$0.05 |
| 3 instances | ~12,000/min | ~$450 | ~$0.025 |
| 10 instances | ~35,000/min | ~$800 | ~$0.015 |
Measured during load tests. Auto-scales to zero when idle.
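For reference, the load tests boiled down to an async client hammering the scoring API and reporting events/minute. Below is a minimal sketch of that idea; the endpoint path, payload fields, and batch size are illustrative, not the project's actual contract.

```python
import asyncio
import time

import aiohttp

API_URL = "http://localhost:8000/predict/batch"  # hypothetical endpoint path


async def send_batch(session: aiohttp.ClientSession, payload: dict) -> int:
    # Fire one batch request and return the HTTP status.
    async with session.post(API_URL, json=payload) as resp:
        return resp.status


async def run(total_batches: int = 50, batch_size: int = 100) -> None:
    # Illustrative payload shape; the real event schema has more fields.
    payload = {"records": [{"domain": "gaming", "amount": 42.0}] * batch_size}
    start = time.perf_counter()
    async with aiohttp.ClientSession() as session:
        statuses = await asyncio.gather(
            *(send_batch(session, payload) for _ in range(total_batches))
        )
    elapsed = time.perf_counter() - start
    events = total_batches * batch_size
    print(f"{events} events in {elapsed:.1f}s -> {events / elapsed * 60:,.0f} events/min, "
          f"{statuses.count(200)}/{total_batches} OK")


if __name__ == "__main__":
    asyncio.run(run())
```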
| Metric | Value | Notes |
|---|---|---|
| Throughput | 3.7K events/min (1 API instance) | Scales horizontally |
| ML Accuracy | 97.5% (F1: 97.27%) | 10-class RandomForest |
| Detection Latency | < 3 seconds | End-to-end |
| Batch Optimization | 79x faster | See case study below |
These metrics validate the pipeline's viability for real-time fraud intervention, where low latency is critical. The high ML accuracy, combined with sub-3-second detection, provides a blueprint for production systems handling diverse event streams.
The 79x improvement was the interesting part; write-up here.
The architecture follows a layered approach: event ingestion via Kafka, stream processing in Flink, and ML scoring through a FastAPI service running Spark. Data flows seamlessly across domains, with built-in fault tolerance and horizontal scalability at the inference layer.
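To make the scoring layer concrete, here is a minimal sketch of what a FastAPI + Spark ML service looks like in its original one-request-one-job form (the pattern the case study later replaces). The S3 path, schema fields, and route are placeholders, not the exact ones in `api-gateway/`.

```python
from fastapi import FastAPI
from pydantic import BaseModel
from pyspark.ml import PipelineModel
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("fraud-scoring").getOrCreate()
# Hypothetical S3 path; the real artifact location is produced by the Airflow pipeline.
model = PipelineModel.load("s3a://patternalarm-models/fraud-rf/latest")

app = FastAPI()


class Transaction(BaseModel):
    transaction_id: str
    domain: str
    amount: float
    velocity_1h: int


@app.post("/predict")
def predict(txn: Transaction):
    # One request = one DataFrame = one Spark job (the ~950ms pattern profiled below).
    df = spark.createDataFrame([txn.dict()])
    row = model.transform(df).select("prediction", "probability").collect()[0]
    return {"transaction_id": txn.transaction_id, "prediction": float(row["prediction"])}
```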
| Service | Purpose | Tech |
|---|---|---|
| `event-simulator/` | Lambda function generating synthetic transactions with a configurable fraud %. Triggered by the dashboard, injects into Kafka topics (sketched after this table). | Python, AWS Lambda |
| `flink-processor/` | Processes Kafka streams from 3 domains (fintech/ecommerce/gaming). Bronze → Silver (shared features) → Gold (ML scoring). Saves fraud alerts to PostgreSQL. | Scala, Flink, ECS Fargate |
| `api-gateway/` | Serves ML predictions, velocity analytics, and fraud alerts with related transactions. Loads the trained model from S3. This was the bottleneck. | Python, FastAPI, Spark ML |
| `dashboard/` | UI to trigger the transaction pipeline, monitor fraud status, and visualize real-time charts. | Java, Spring Boot, Chart.js |
| `airflow/` | Productionizes the model training pipeline: extract features → train model → validate → save to S3. | Python, Airflow, EMR Serverless |
| `notebook/` | Preliminary model development: a cross-domain RandomForest achieving 97.5% accuracy on 10 fraud types. | PySpark, Jupyter |
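As a rough sketch of the `event-simulator/` idea, here is what a fraud-injecting Lambda producer might look like. It assumes kafka-python and an environment-driven fraud rate; the topic naming, event fields, and variable names are illustrative.

```python
import json
import os
import random
import uuid

from kafka import KafkaProducer  # kafka-python; the real Lambda may use a different client

FRAUD_RATE = float(os.environ.get("FRAUD_RATE", "0.05"))  # configurable fraud percentage

producer = KafkaProducer(
    bootstrap_servers=os.environ["KAFKA_BROKERS"],
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)


def handler(event, context):
    # Dashboard passes the domain and event count in the invocation payload (assumed shape).
    domain = event.get("domain", "gaming")
    count = int(event.get("count", 1000))
    for _ in range(count):
        producer.send(f"transactions-{domain}", {
            "transaction_id": str(uuid.uuid4()),
            "domain": domain,
            "amount": round(random.uniform(1, 500), 2),
            "is_fraud": random.random() < FRAUD_RATE,  # ground-truth label kept for evaluation
        })
    producer.flush()
    return {"sent": count, "domain": domain}
```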
| Layer | Technology | Why This Choice |
|---|---|---|
| Ingestion | AWS Lambda → MSK (Kafka) | Scalable data pipelines, multi-topic architecture |
| Stream Processing | Apache Flink (ECS Fargate) | Real-time analytics with exactly-once semantics |
| ML Scoring | Spark MLlib (RandomForest) | Production MLOps patterns, batch-optimized inference |
| Storage | PostgreSQL (RDS) + Redis | Time-series patterns, sub-100ms query caching (sketch after this table) |
| API | FastAPI (async) | High-throughput model serving |
| Dashboard | Spring Boot + Thymeleaf + Chart.js | Real-time visualization |
| Orchestration | Apache Airflow + EMR Serverless | MLOps workflow automation |
| Infrastructure | Terraform (IaC) | Reproducible cloud architecture |
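For the caching layer, the dashboard's velocity queries are the hot path, so a read-through cache in front of PostgreSQL keeps them fast. A minimal sketch of that pattern follows; the SQL helper, key layout, and TTL are assumptions, not the project's exact code.

```python
import json

import redis

cache = redis.Redis(host="localhost", port=6379, decode_responses=True)


def velocity_stats(domain: str, window_minutes: int = 60) -> dict:
    """Read-through cache in front of the (hypothetical) PostgreSQL velocity query."""
    key = f"velocity:{domain}:{window_minutes}"
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)
    stats = query_velocity_from_postgres(domain, window_minutes)  # assumed helper hitting RDS
    cache.setex(key, 30, json.dumps(stats))  # short TTL keeps dashboard reads well under 100ms
    return stats
```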
Similar stack to what I've used in gaming analytics roles. More projects →
To illustrate the pipeline in action, here are key visuals from the dashboard, ML workflow, and cost tracking.
Real-time fraud alerts with velocity graph and severity indicators.
Orchestrated training pipeline: feature extraction → model training → validation (EMR Serverless).
97.5% accuracy with per-class breakdown across 10 fraud types.
Optimized from $26/day to ~$10/day with auto-scaling.
During load testing, ML predictions were timing out under backlog pressure. Individual API calls took ~950ms each, limiting throughput to 63 predictions/minute, far below the 10K/min target.
Isolated the API Gateway and profiled each step:
```
⏱️ [5] createDataFrame: 412ms (43%)
⏱️ [6] model.transform:  89ms  (9%)
⏱️ [7] collect:         773ms (81%)  ← BOTTLENECK
```
Finding: 81% of time was fixed overhead (Spark job setup), not actual inference. Scaling Flink or adding API instances wouldn't fix this; the problem was architectural.
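The profiling itself was nothing fancy: a timer wrapped around each stage of the prediction path, roughly as below. This is a sketch assuming the `spark`, `model`, `rows`, and `schema` objects from the service sketch earlier; the step labels match the timings above.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label: str):
    # Print wall-clock time for the wrapped block in milliseconds.
    start = time.perf_counter()
    yield
    print(f"⏱️ {label}: {(time.perf_counter() - start) * 1000:.0f}ms")


# Inside the prediction path (spark/model/rows/schema as in the earlier service sketch):
with timed("[5] createDataFrame"):
    df = spark.createDataFrame(rows, schema)
with timed("[6] model.transform"):
    scored = model.transform(df)
with timed("[7] collect"):
    results = scored.select("prediction", "probability").collect()
```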
Switched from async single requests to synchronous batch processing:

Before:

```scala
// 1 request = 1 Spark job = 950ms
AsyncDataStream.unorderedWait(aggregates, fraudScoringAsyncFunction, ...)
```

After:

```scala
// 100 requests = 1 Spark job = 1200ms total = 12ms each
aggregates.process(new FraudScoringBatchFunction(batchSize = 100))
```

| Metric | Before | After | Improvement |
|---|---|---|---|
| Latency/prediction | 950ms | 12ms | 79x faster |
| Throughput | 63/min | 3,700/min | 59x higher |
| 10K/min target | ❌ | ✅ (with 3 API instances) | ✅ |
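On the API side, the batch pattern boils down to accepting a list of records and paying the Spark job overhead once per batch instead of once per record. A hedged sketch, building on the earlier service sketch (field names and route are illustrative):

```python
from typing import List

from fastapi import FastAPI
from pydantic import BaseModel

# `spark` and `model` as initialised in the earlier service sketch.


class Record(BaseModel):
    transaction_id: str
    domain: str
    amount: float


app = FastAPI()


@app.post("/predict/batch")
def predict_batch(records: List[Record]):
    # One DataFrame, one Spark job, one collect for the whole batch,
    # so the fixed job-setup overhead is amortised across N records.
    df = spark.createDataFrame([r.dict() for r in records])
    rows = model.transform(df).select("transaction_id", "prediction").collect()
    return [
        {"transaction_id": r["transaction_id"], "prediction": float(r["prediction"])}
        for r in rows
    ]
```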
| Configuration | Throughput | Supported Load |
|---|---|---|
| 1 API instance, single requests | 63/min | ❌ None |
| 1 API instance, batch 100 | 3,700/min | ✅ MINI |
| 3 API instances, batch 100 | 12,000/min | ✅ NORMAL |
| 10 API instances, batch 100 | 35,000/min | ✅ PEAK |
This optimization not only met the 10K/min target but also underscored the value of rethinking inference patterns in streaming ML pipelines.
💡 Takeaway: Profile before scaling. This applies to any Spark-based inference.
In a NORMAL load test simulating 10K events per minute, the pipeline generated 16 alerts, catching 40.6% of true fraud cases across various types. Critical and high-severity incidents were prioritized effectively, showcasing the system's ability to surface actionable insights in real time.
| Metric | Value |
|---|---|
| Alerts generated | 16 |
| True fraud detected | 13/32 (40.6%) |
| Critical severity | 5 |
| High severity | 9 |
Alert Types Detected:
- `suspicious_activity` (8)
- `account_takeover` (3)
- `chargeback_fraud` (2)
- `money_laundering` (2)
- `structuring` (1)
Getting started is straightforward, whether for local development or full AWS deployment.
- AWS CLI configured
- Terraform >= 1.0
- Docker + Docker Compose
- Python 3.11+
- Java 17
Local development:

```bash
# Clone
git clone https://github.com/acourreg/patternalarm.git
cd patternalarm

# Start local stack (Kafka, PostgreSQL, Redis)
cd scripts
cp config.example.conf config.conf   # Edit with your settings
docker-compose up -d

# Run setup scripts in order
./1-create-kafka-topics.sh
./2-rds-schema.sql            # Apply to local PostgreSQL
./3-upload-training-data.sh
```

AWS deployment:

```bash
cd infra/terraform
terraform init
terraform apply

# Then run the same scripts against AWS resources
cd ../scripts
./1-create-kafka-topics.sh
./3-upload-training-data.sh
```

Trigger a load test:

```bash
# From dashboard UI or CLI
curl -X POST "https://<dashboard>/api/test/execute" \
  -d "domain=gaming&loadLevel=normal"
```

Managing costs was a key focus during development, as streaming setups can quickly rack up bills. Based on real billing data over six days, I optimized from an initial $26/day down to about $10/day through targeted strategies.
| Metric | Value |
|---|---|
| Total (6 days) | $92.88 |
| Average daily | $15.48 |
| Optimized daily | ~$10/day |
| Monthly estimate | ~$300 |
Cost breakdown: MSK (Kafka) ~40%, ECS ~25%, RDS ~20%, Other ~15%
Savings tactics:
- ECS services scale to 0 when idle (see the sketch after this list)
- MSK paused between tests
- NAT Gateway minimized
- EMR Serverless for Airflow jobs (pay per use)
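Scale-to-zero is just a desired-count update on the ECS service. I trigger something along these lines when a test run finishes; the cluster and service names below are placeholders, not the deployed resource names.

```python
import boto3

ecs = boto3.client("ecs")

# Hypothetical cluster/service names; scale the API gateway down to zero tasks when idle.
ecs.update_service(
    cluster="patternalarm-cluster",
    service="api-gateway",
    desiredCount=0,
)
```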
Detailed cost breakdown in my blog post on cloud cost patterns.
Building this project reinforced several principles in streaming and ML engineering:
- Batch > single requests for Spark ML: fixed overhead dominates individual inference. This shift alone unlocked massive throughput gains.
- Profile before scaling: adding instances wouldn't have fixed the root bottleneck. The 79x improvement came from an architectural change, not more resources.
- MSK is expensive: consider Redpanda or self-hosted Kafka for dev environments.
- Auto-scale aggressively: cloud costs compound faster than expected.
- EMR Serverless: perfect for sporadic ML training vs. always-on clusters.
At its core, the project integrates streaming, data, cloud, and MLOps tooling, with batch optimization being the most challenging (and rewarding) part.
| Area | What I Used | Notes |
|---|---|---|
| Streaming | Kafka (MSK), Flink (Scala), Spark ML | Multi-topic, windowing, exactly-once semantics |
| Data | PostgreSQL, Redis, FastAPI | Time-series patterns, caching layer |
| Cloud | AWS (MSK, RDS, ECS Fargate, EMR Serverless) | Terraform for everything |
| MLOps | Airflow, EMR Serverless, S3 model registry | Training pipeline, not just notebooks |
| The Hard Part | Batch optimization for Spark inference | See case study; this took a while to figure out |
More context on my background: scalefine.ai/about
- Scalefine Blog: deep dives on Kafka, Flink, and production ML
MIT
AurΓ©lien Courreges-Clercq
Building streaming pipelines and ML systems.
Scalefine.ai Β· LinkedIn Β· GitHub