
AWS Fraud Detection Data Pipeline (Terraform Deployment)

Overview

This repository contains Infrastructure-as-Code (IaC) definitions written in Terraform to deploy a fully automated, end-to-end data pipeline for fraud detection. The pipeline ingests raw data into S3, validates schema, processes and transforms data using AWS Glue, triggers ML predictions using Amazon SageMaker, and stores enriched fraud-classified records into DynamoDB.

The Terraform configuration provisions all required AWS resources, including:

  • Amazon S3 (Landing, Processed, Curated buckets)
  • AWS Lambda for schema validation and model inference
  • AWS Glue Crawler + ETL Job
  • Amazon SageMaker Model, Endpoint Configuration, Endpoint
  • Amazon DynamoDB table for final results
  • IAM Roles & Policies for Glue, Lambda, and SageMaker

Terraform does not deploy code scripts (Python files, Glue scripts, etc.). Instead, users must upload these ZIP/script files to S3 beforehand and supply their S3 paths in terraform.tfvars.

1. Repository Structure

terraform/
│ main.tf
│ variables.tf
│ outputs.tf
│ terraform.tfvars (user-created)
│
├── s3.tf                # Landing / Processed / Curated S3 buckets
├── iam.tf               # IAM roles + policies
├── lambda.tf            # Lambda functions + S3 triggers
├── glue.tf              # Glue crawler + ETL job
├── dynamodb.tf          # Final DynamoDB table
└── sagemaker.tf         # SageMaker model + endpoint

2. Prerequisites

2.1 Required Tools

You must install the following before deploying:

  • Terraform 1.5+
  • AWS CLI v2+
  • Python 3.12 (optional), for creating Lambda ZIPs manually
  • ZIP utility, to compress Lambda code for upload
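
If you prefer Python over a standalone ZIP utility, the following is a minimal packaging sketch. It assumes the handler code for each function lives in a local folder; the folder names schema_validator/ and inference_lambda/ are only illustrative:

# build_lambda_zips.py -- package Lambda source folders into deployment ZIPs
import zipfile
from pathlib import Path

def build_zip(source_dir: str, output_zip: str) -> None:
    # Add every file under source_dir to the archive, keeping paths relative to it
    source = Path(source_dir)
    with zipfile.ZipFile(output_zip, "w", zipfile.ZIP_DEFLATED) as zf:
        for path in source.rglob("*"):
            if path.is_file():
                zf.write(path, path.relative_to(source))

build_zip("schema_validator", "schema_validator.zip")
build_zip("inference_lambda", "inference_lambda.zip")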

2.2 AWS Permissions Needed

The AWS account or IAM user running Terraform must have:

  • S3: CreateBucket, PutBucketPolicy, PutObject
  • IAM: CreateRole, AttachRolePolicy, PassRole
  • Lambda: CreateFunction, AddPermission
  • Glue: CreateCrawler, CreateJob
  • SageMaker: CreateModel, CreateEndpoint
  • DynamoDB: CreateTable

You may use AdministratorAccess if working in a sandbox environment.

3. Upload Required Artifacts Before Deployment

Terraform expects you to pre-upload:

3.1 Lambda Code ZIPs

You must upload:
schema_validator.zip
inference_lambda.zip
lambda_layer.zip

to an S3 deployment bucket you control.

Example:
s3://my-lambda-code-bucket/schema_validator.zip
s3://my-lambda-code-bucket/inference_lambda.zip
s3://my-lambda-code-bucket/lambda_layer.zip
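
A minimal upload sketch using boto3 (the bucket name matches the example above; adjust it to your environment). The same pattern works for the Glue script in 3.2 and the model artifact in 3.3:

import boto3

s3 = boto3.client("s3")
for zip_name in ["schema_validator.zip", "inference_lambda.zip", "lambda_layer.zip"]:
    # Upload each deployment package to the S3 bucket referenced in terraform.tfvars
    s3.upload_file(zip_name, "my-lambda-code-bucket", zip_name)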

3.2 Glue Script

Upload your ETL script (e.g., etl_transform.py) to S3:
s3://my-glue-script-bucket/etl/etl_transform.py

3.3 SageMaker Model Artifact

Upload your trained model:

model.tar.gz (must contain XGBoost model + metadata)

Example:
s3://my-model-artifacts/fraud/model.tar.gz

Alternatively, train the model in SageMaker using the provided Jupyter notebook: FraudDetectionXGB.ipynb
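
Either way, before uploading you can sanity-check that the archive contains the expected model files. A minimal sketch (the exact file names inside model.tar.gz depend on how the model was packaged):

import tarfile

# List the contents of the model archive without extracting it
with tarfile.open("model.tar.gz", "r:gz") as tar:
    for member in tar.getmembers():
        print(member.name, member.size)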

4. Terraform Configuration

4.1 Create terraform.tfvars

Terraform uses variables defined in variables.tf. You must create a new file:

terraform.tfvars

Example Template:

aws_region = "us-east-1"

landing_bucket     = "my-landing-bucket"
processed_bucket   = "my-processed-bucket"
curated_bucket     = "my-curated-bucket"

lambda_code_bucket      = "my-lambda-code-bucket"
schema_lambda_s3_key    = "schema_validator.zip"
inference_lambda_s3_key = "inference_lambda.zip"

glue_script_bucket = "my-glue-script-bucket"
glue_script_key    = "etl/etl_transform.py"

model_artifact_bucket = "my-model-artifacts"
model_artifact_key    = "fraud/model.tar.gz"

dynamodb_table_name = "fraud-detection-results"

Update all values to match your environment.

5. Deployment Instructions

Follow these steps exactly.

Step 1 — Initialize Terraform

Run inside the terraform/ directory:

terraform init

This:

  • downloads the AWS provider
  • sets up the Terraform working directory

Step 2 — Validate Configuration

terraform validate

If you see:

Success! The configuration is valid.

you may continue.

Step 3 — Review Deployment Plan

terraform plan

You should see resources such as:

  • 3 S3 buckets
  • 2 Lambda functions
  • 1 Glue crawler
  • 1 Glue job
  • 1 SageMaker model + endpoint
  • DynamoDB table
  • IAM roles

Verify everything looks correct.

Step 4 — Deploy

terraform apply

Confirm with:

yes

Terraform will now provision the entire pipeline.

Expected deployment time:

  • IAM roles: immediate
  • S3 buckets: immediate
  • DynamoDB table: immediate
  • Lambda functions: < 1 min
  • Glue job & crawler: < 30 sec
  • SageMaker endpoint: 6–10 minutes

Once finished, Terraform will output:

Apply complete! Resources: XX added, 0 changed, 0 destroyed.

6. Post-Deployment Verification

6.1 Upload a test file to the Landing Bucket

Upload a CSV file to the landing bucket created by Terraform, for example:
s3://my-landing-bucket/input.csv

The schema-validation Lambda should trigger automatically and:

  1. Validate the schema
  2. Write the processed file to the processed bucket

Within an hour (depending on your schedule):

  • The Glue Crawler updates the schema
  • The Glue Job transforms the data and writes curated output

The final curated file triggers the inference Lambda, which:

  • calls the SageMaker fraud-detection endpoint
  • writes the final record to DynamoDB
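
To confirm that results reached the final table, you can scan it with boto3. A minimal sketch (the table name comes from dynamodb_table_name in terraform.tfvars; the item attributes depend on what the inference Lambda writes):

import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("fraud-detection-results")

# Fetch a few items to confirm the pipeline wrote fraud-classified records
response = table.scan(Limit=10)
for item in response.get("Items", []):
    print(item)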

7. Redeploying Updated Code

Terraform manages infrastructure only, not code.

If you update Lambda Python code:

  1. ZIP it again
  2. Upload the new ZIP to the same S3 path
  3. Run:
terraform apply

Terraform will detect that the file was updated in S3 and update the Lambda function.

8. Destroying All Resources

To clean up all AWS resources:
terraform destroy

Confirm with:

yes

This deletes:

  • Buckets
  • Lambdas
  • IAM roles
  • Glue jobs
  • SageMaker endpoint
  • DynamoDB table

Member list:

  • 66102010179
  • 66102010250
  • 66102010251
  • 66102010252
