A production-ready, load-aware database migration and synchronization system using AWS Database Migration Service (DMS) with intelligent orchestration that automatically pauses and resumes replication based on source database load.
This system enables secure database migration from on-premises to AWS while minimizing business impact during peak operational periods. It automatically monitors source database load and pauses/resumes DMS replication tasks to ensure critical business operations are not affected.
This solution supports heterogeneous and homogeneous database migrations across different database engines:
- Homogeneous Migrations: Migrate between the same database engines (e.g., PostgreSQL to PostgreSQL, MySQL to MySQL, Oracle to Oracle, SQL Server to SQL Server)
- Heterogeneous Migrations: Migrate between different database engines with automatic schema conversion support:
  - Oracle to PostgreSQL/MySQL/Aurora
  - SQL Server to PostgreSQL/MySQL/Aurora
  - MySQL to PostgreSQL/Aurora PostgreSQL
  - PostgreSQL to MySQL/Aurora MySQL
- Cloud-Native Targets: Support for AWS-managed database services including Amazon RDS, Amazon Aurora, and Amazon Redshift
- Version Compatibility: Handles migrations across different database versions with automatic compatibility adjustments
- Load-Aware Orchestration: Automatically pauses replication during high-load periods
- Secure Connectivity: Encrypted VPN tunnel for data transit
- Intelligent Retry Logic: Automatic retry with exponential backoff for transient failures
- Admin Notifications: Critical alerts when manual intervention is required
- Idempotent Operations: Safe handling of duplicate control requests
- Full Load + CDC: Complete initial migration followed by continuous change replication
.
├── src/
│   ├── lambda/
│   │   └── dms_controller/         # Lambda function for DMS task control
│   │       ├── handler.py
│   │       ├── requirements.txt
│   │       └── package.sh          # Deployment package script
│   └── monitoring_agent/           # On-premises monitoring agent
│       └── agent.py
├── infrastructure/
│   ├── terraform/                  # Infrastructure as Code
│   │   ├── main.tf                 # Main infrastructure definitions
│   │   ├── variables.tf            # Variable declarations
│   │   ├── outputs.tf              # Output definitions
│   │   ├── terraform.tfvars.example
│   │   └── README.md               # Detailed Terraform documentation
│   └── step_functions/             # Step Functions state machine
│       ├── state_machine.json      # State machine definition
│       └── README.md               # State machine documentation
├── docs/
│   ├── DEPLOYMENT.md               # Deployment guide
│   ├── OPERATIONS.md               # Operations runbook
│   └── TROUBLESHOOTING.md          # Troubleshooting guide
├── tests/                          # Test suite
│   ├── test_dms_controller.py
│   └── test_monitoring_agent.py
├── requirements.txt                # Python dependencies
├── requirements-dev.txt            # Development dependencies
└── README.md
- AWS Account with permissions to create VPC, DMS, Lambda, Step Functions, CloudWatch, SNS, and IAM resources
- Python 3.11+ for development and monitoring agent
- Terraform 1.0+ for infrastructure deployment
- AWS CLI configured with appropriate credentials
- On-Premises Infrastructure:
  - Public IP address for VPN endpoint
  - Source database (PostgreSQL, MySQL, Oracle, or SQL Server)
  - Server for monitoring agent installation
- Prepare Lambda Deployment Package
    cd src/lambda/dms_controller
    ./package.sh        # Linux/Mac
    # or
    .\package.ps1       # Windows
- Configure Infrastructure
    cd infrastructure/terraform
    cp terraform.tfvars.example terraform.tfvars
    # Edit terraform.tfvars with your configuration
- Deploy Infrastructure
    terraform init
    terraform plan
    terraform apply
- Configure On-Premises VPN
  - Download VPN configuration from AWS Console
  - Configure your VPN device with tunnel IPs and pre-shared keys
  - Verify tunnel status:
    aws ec2 describe-vpn-connections --vpn-connection-ids <vpn-id>
- Install Monitoring Agent
    # On your on-premises server
    pip install -r requirements.txt
    # Configure agent with AWS credentials from Terraform outputs
    python src/monitoring_agent/agent.py
- Verify Deployment
  - Check VPN tunnel status (both tunnels should be UP)
  - Verify DMS endpoints can connect to databases
  - Confirm CloudWatch alarms are created
  - Test Lambda function manually
  - Subscribe to Admin Alert SNS topic
For detailed deployment instructions, see docs/DEPLOYMENT.md.
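The contents of `src/monitoring_agent/agent.py` are not reproduced in this README, so the following is only a minimal sketch of how an agent of this kind might build and send its CloudWatch payload. The namespace and metric names come from the metric list in this README; the function names `build_metric_data` and `publish` are hypothetical, and collecting the values from the source database is left out.

```python
from datetime import datetime, timezone

# Namespace inferred from the Custom/OnPremDB/* metric names in this README.
NAMESPACE = "Custom/OnPremDB"


def build_metric_data(active_connections, cpu_percent, timestamp=None):
    """Build the MetricData list for a single PutMetricData call.

    Hypothetical helper: the real agent may collect and name metrics differently.
    """
    ts = timestamp or datetime.now(timezone.utc)
    return [
        {
            "MetricName": "ActiveConnections",
            "Value": float(active_connections),
            "Unit": "Count",
            "Timestamp": ts,
        },
        {
            "MetricName": "CPUUtilization",
            "Value": float(cpu_percent),
            "Unit": "Percent",
            "Timestamp": ts,
        },
    ]


def publish(metric_data, client=None):
    """Send the payload to CloudWatch.

    boto3 is imported lazily so the payload builder can be exercised
    without AWS credentials or the SDK installed.
    """
    if client is None:
        import boto3

        client = boto3.client("cloudwatch")
    client.put_metric_data(Namespace=NAMESPACE, MetricData=metric_data)
```

Passing a client explicitly keeps the function easy to test with a stub; in normal use `publish(build_metric_data(42, 17.5))` would rely on the credentials configured on the on-premises server.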
Alarm Thresholds (adjust based on your database capacity):
- high_load_threshold: Connection count to trigger DMS stop (default: 70% of max connections)
- low_load_threshold: Connection count to trigger DMS start (default: 30% of max connections)
DMS Settings:
- dms_instance_class: Instance size (t3.medium, r5.large, r5.xlarge, etc.)
- dms_allocated_storage: Storage in GB (minimum 50)
- dms_table_mappings: JSON rules for table selection/filtering
Monitoring:
- metric_collection_interval: How often to collect metrics (default: 60 seconds)
- alarm_evaluation_periods: Consecutive periods before triggering (High: 2, Low: 3)
See infrastructure/terraform/README.md for complete configuration reference.
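To make the threshold and evaluation-period settings concrete, here is a small sketch of how the default percentages translate into connection counts and how a "consecutive periods" rule behaves. It is a simplified model for reasoning about the settings, not CloudWatch's actual alarm evaluation; the function names `thresholds` and `breaches` are hypothetical.

```python
def thresholds(max_connections, high_pct=70, low_pct=30):
    """Return (high, low) connection-count thresholds from the percentage defaults."""
    return (max_connections * high_pct // 100, max_connections * low_pct // 100)


def breaches(samples, threshold, periods, above=True):
    """Return True if the most recent `periods` samples are all beyond the threshold.

    above=True models the high-load alarm (samples greater than the threshold),
    above=False models the low-load alarm (samples less than the threshold).
    """
    if len(samples) < periods:
        return False
    recent = samples[-periods:]
    if above:
        return all(s > threshold for s in recent)
    return all(s < threshold for s in recent)


high, low = thresholds(200)                           # (140, 60) for 200 max connections
print(breaches([100, 150, 160], high, periods=2))     # last two samples above 140
print(breaches([50, 40, 30], low, periods=3, above=False))
```

The gap between the two thresholds acts as hysteresis: a load level between 30% and 70% triggers neither alarm, so replication does not flap between paused and running.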
┌─────────────────────────────────────────────────────────────────────┐
│                       On-Premises Data Center                       │
│  ┌──────────────────┐                    ┌─────────────────────┐    │
│  │ Source Database  │◄───────────────────│ Monitoring Agent    │    │
│  │ (Production)     │                    │ (Metrics Collector) │    │
│  └────────┬─────────┘                    └──────────┬──────────┘    │
└───────────┼─────────────────────────────────────────┼───────────────┘
            │ VPN Tunnel                               │ CloudWatch
            ▼                                         ▼
┌─────────────────────────────────────────────────────────────────────┐
│                              AWS Cloud                              │
│  ┌──────────────────┐    ┌─────────────┐   ┌──────────────────┐     │
│  │ DMS Replication  │───►│ Target      │   │ CloudWatch       │     │
│  │ Instance         │    │ Database    │   │ Alarms           │     │
│  └──────────────────┘    └─────────────┘   └────────┬─────────┘     │
│                                                     │               │
│                                                     ▼               │
│                                            ┌──────────────────┐     │
│                                            │ Step Functions   │     │
│                                            │ State Machine    │     │
│                                            └────────┬─────────┘     │
│                                                     │               │
│                                                     ▼               │
│                                            ┌──────────────────┐     │
│                                            │ DMS Controller   │     │
│                                            │ Lambda           │     │
│                                            └──────────────────┘     │
└─────────────────────────────────────────────────────────────────────┘
- High Load Detected:
  - Monitoring agent publishes metrics to CloudWatch
  - Metrics exceed threshold → HighLoadAlarm triggers
  - Alarm publishes to SNS → Step Functions starts
  - Lambda stops DMS task → Replication pauses
- Low Load Detected:
  - Metrics fall below threshold → LowLoadAlarm triggers
  - Alarm publishes to SNS → Step Functions starts
  - Lambda starts DMS task → Replication resumes
- Failure Handling:
  - Lambda operation fails → Step Functions retries (3 attempts)
  - All retries fail → Admin alert sent via SNS
  - Administrator investigates and resolves manually
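The flow above implies two pieces of logic that are easy to get wrong: treating duplicate stop/start requests as harmless, and spacing out retries. The sketch below illustrates both under stated assumptions. It is not the code in `handler.py`; the names `decide` and `retry_delays` are hypothetical, only the `running`, `stopped`, `starting`, and `stopping` task statuses are handled explicitly, and the 2-second interval and backoff rate of 2.0 are illustrative values (only the 3 attempts come from this README).

```python
def decide(action, task_status):
    """Map a requested action and the current DMS task status to a controller decision.

    Returns:
        "noop"  - the task is already in, or moving to, the desired state
        "retry" - the task is mid-transition in the opposite direction; try again later
        "call"  - invoke the DMS start or stop API
    """
    if action not in ("start", "stop"):
        raise ValueError(f"unknown action: {action!r}")
    desired, pending, opposite_pending = {
        "stop": ("stopped", "stopping", "starting"),
        "start": ("running", "starting", "stopping"),
    }[action]
    if task_status in (desired, pending):
        return "noop"  # duplicate request, nothing to do
    if task_status == opposite_pending:
        return "retry"  # let the in-flight transition finish first
    return "call"


def retry_delays(interval_seconds=2, backoff_rate=2.0, max_attempts=3):
    """Delays before each retry, following the Step Functions Retry model
    (IntervalSeconds multiplied by BackoffRate after every attempt)."""
    return [interval_seconds * backoff_rate ** i for i in range(max_attempts)]


print(decide("stop", "running"))   # call
print(decide("stop", "stopped"))   # noop
print(retry_delays())              # [2.0, 4.0, 8.0]
```

Returning "noop" for a duplicate request is what makes the operation idempotent: a second HighLoadAlarm notification for an already stopped task succeeds without touching DMS.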
CloudWatch Dashboards:
- DMS replication lag and throughput
- Source database load metrics
- Lambda execution metrics
- Step Functions execution status
Key Metrics to Monitor:
- Custom/OnPremDB/ActiveConnections: Current connection count
- Custom/OnPremDB/CPUUtilization: Database CPU usage
- AWS/DMS/CDCLatencySource: Replication lag
- AWS/Lambda/Errors: Lambda function errors
- AWS/States/ExecutionsFailed: Step Functions failures
Alarms:
- VPN tunnel down
- DMS replication lag exceeds threshold
- Lambda function errors
- Step Functions execution failures
For detailed operations procedures, see docs/OPERATIONS.md.
VPN Connection Not Establishing:
- Verify on-premises VPN device configuration
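- Check that the on-premises firewall allows IPsec traffic (UDP 500 for IKE, UDP 4500 for NAT traversal, IP protocol 50 for ESP)
- If using dynamic routing, verify that the BGP ASN configured for the customer gateway in AWS matches the ASN of your on-premises device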
DMS Task Fails to Start/Stop:
- Check CloudWatch Logs: /aws/lambda/<environment>-dms-controller
- Verify IAM permissions are correct
- Check DMS task state in console
Monitoring Agent Not Publishing Metrics:
- Verify AWS credentials are configured correctly
- Check network connectivity to CloudWatch API
- Review agent logs for errors
High Replication Lag:
- Check source database load during replication
- Verify DMS instance size is appropriate
- Review table mappings for unnecessary tables
For comprehensive troubleshooting, see docs/TROUBLESHOOTING.md.
- Create and activate a virtual environment:
    python -m venv venv
    source venv/bin/activate  # On Windows: venv\Scripts\activate
- Install dependencies:
    pip install -r requirements-dev.txt
- Configure AWS credentials:
    aws configure

    # Run all tests
    pytest
    # Run with coverage
    pytest --cov=src --cov-report=html
    # Run specific test file
    pytest tests/test_dms_controller.py
    # Run property-based tests only
    pytest -k "property"

This project uses both unit tests and property-based tests:
- Unit Tests: Verify specific examples and edge cases
- Property Tests: Verify universal properties across all inputs (using Hypothesis)
Each property test runs a minimum of 100 iterations to ensure comprehensive coverage.
- Secrets Management: Use AWS Secrets Manager for database passwords
- Least Privilege: All IAM roles follow least-privilege principle
- Encryption: All data in transit encrypted via VPN tunnel
- Network Isolation: DMS instance in private subnet
- Access Keys: Rotate monitoring agent credentials regularly
- DMS Controller Lambda: Only DMS start/stop/describe operations
- Monitoring Agent: Only CloudWatch PutMetricData
- Step Functions: Only Lambda invoke and SNS publish
- DMS Instance: Only source and target database access
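As an illustration of what the least-privilege rules above could look like as IAM policy documents, here is a sketch that builds two of them as Python dictionaries. The actual policies are defined in the Terraform code and may differ; the function names and the example ARN are hypothetical, and granting the describe call on `"*"` is a conservative assumption rather than something this README states.

```python
import json


def controller_policy(task_arn):
    """Policy for the DMS Controller Lambda: start, stop, and describe only."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [
                    "dms:StartReplicationTask",
                    "dms:StopReplicationTask",
                ],
                "Resource": task_arn,  # scoped to the single replication task
            },
            {
                "Effect": "Allow",
                "Action": "dms:DescribeReplicationTasks",
                "Resource": "*",  # assumption: describe is not scoped per task here
            },
        ],
    }


def agent_policy(namespace="Custom/OnPremDB"):
    """Policy for the monitoring agent: PutMetricData limited to one namespace."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": "cloudwatch:PutMetricData",
                "Resource": "*",
                "Condition": {"StringEquals": {"cloudwatch:namespace": namespace}},
            }
        ],
    }


# Placeholder ARN for illustration only.
example_arn = "arn:aws:dms:us-east-1:123456789012:task:EXAMPLE"
print(json.dumps(controller_policy(example_arn), indent=2))
```

Restricting `PutMetricData` with the `cloudwatch:namespace` condition key means leaked agent credentials can only write metrics into the agent's own namespace.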
- Use appropriate DMS instance class for workload
- Stop DMS instance when not in use (development only)
- Use single-AZ deployment for non-production
- Monitor CloudWatch Logs retention periods
- Review alarm evaluation periods to reduce false positives
- Deployment Guide: docs/DEPLOYMENT.md
- Operations Runbook: docs/OPERATIONS.md
- Troubleshooting: docs/TROUBLESHOOTING.md