feat: Add ECS Fargate infrastructure and deployment configuration #362

Open: e9e4e5f0faef wants to merge 16 commits into stage from feat/ecs-fargate-migration

@e9e4e5f0faef (Collaborator) commented Jan 24, 2026

Description

This PR adds infrastructure, CI/CD, testing, and operational tooling to deploy addons-server on AWS ECS Fargate. The full lifecycle has been validated: deploy, smoke test (8/8), read-only Django healthcheck (5/5), teardown -- repeatable across multiple cycles.

Files changed (18):

| Category | File | Purpose |
| --- | --- | --- |
| Docker | `Dockerfile.ecs` | ECS-optimised image (non-root, tini, health check) |
| Docker | `docker/docker-entrypoint.sh` | Multi-mode entrypoint (web/worker/versioncheck/manage) with --need-app fast-fail |
| CI/CD | `.github/workflows/build-and-push.yml` | GitHub Actions for ECR build/push with OIDC auth |
| CI/CD | `.github/workflows/validate.yml` | PR gate: ruff lint/format, YAML validation, Python syntax checks (path-filtered) |
| Pulumi | `infra/pulumi/__main__.py` | IaC program: VPC, ECR, Fargate, ElastiCache, Scheduled Tasks, VPC peering, SG hardening, autoscaling, task roles |
| Pulumi | `infra/pulumi/config.stage.yaml` | Stage environment config (~157 resources including autoscaling) |
| Pulumi | `infra/pulumi/Pulumi.yaml` | Project definition |
| Pulumi | `infra/pulumi/Pulumi.stage.yaml` | Stack config (aws:region=us-west-2) |
| Pulumi | `infra/pulumi/README.md` | Setup guide |
| Pulumi | `infra/pulumi/requirements.txt` | Python deps (tb_pulumi v0.0.16, Python 3.13+) |
| CI/CD | `.github/workflows/deploy-stage.yml` | Stage deploy workflow (preview-only, safety scaling / scale-to-0 handling as applicable) |
| Script | `infra/scripts/guardduty-cleanup.sh` | Cleanup GuardDuty auto-provisioned VPC artefacts after pulumi destroy -- tag-gated, dry-run, retry backoff |
| Test | `infra/tests/smoke_test.py` | RO integration test: connectivity, secrets, DNS, NAT (8 checks, env-var driven) |
| Test | `infra/tests/.env.example` | Example env vars for running smoke test |
| Test | `infra/tests/Dockerfile` | Lightweight image for running smoke test as ECS task |
| App | `src/olympia/amo/management/commands/ro_healthcheck.py` | RO Django healthcheck: enforced read-only MySQL, sanitised output, validates settings/DB/cache/broker/ES |
| Config | `settings_local_stage.py` | Stage settings: all endpoints via Secrets Manager (10 secrets), fixed CACHE_PREFIX, CORS, slave DB, ES hostname |
| Config | `.gitignore` | Exclude Pulumi output files and local analysis docs |
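
For reviewers unfamiliar with the "dry-run, retry backoff" behaviour listed for guardduty-cleanup.sh, here is a minimal Python sketch of the same idea. The real script is shell. The function names, the delay values, and the choice of `RuntimeError` as the transient error are assumptions for illustration, not taken from the script.

```python
import time
from typing import Callable, Iterator


def backoff_delays(attempts: int, base: float = 1.0, cap: float = 30.0) -> Iterator[float]:
    """Yield one exponentially growing delay per attempt, capped at `cap` seconds."""
    for attempt in range(attempts):
        yield min(cap, base * (2 ** attempt))


def delete_with_retry(
    delete: Callable[[], None],
    *,
    dry_run: bool = True,
    attempts: int = 4,
    base: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call `delete` until it succeeds; return 'skipped', 'deleted' or 'failed'."""
    if dry_run:
        return "skipped"  # dry-run never touches the resource
    for delay in backoff_delays(attempts, base):
        try:
            delete()
            return "deleted"
        except RuntimeError:
            # Assumed transient failure (e.g. a dependency is still detaching).
            sleep(delay)
    return "failed"
```

Defaulting `dry_run` to `True` means a destructive run has to be requested explicitly, which is the safer default for a cleanup tool.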

Context

Migrates ATN from EC2/Ansible to ECS Fargate, as discussed with @Sancus.

Networking:

  • New VPC 10.100.0.0/16 with public/private subnets across 3 AZs (approved by Andrei)
  • VPC peering to default VPC with routes on correct custom route tables (workaround for tb_pulumi routing to default RT)
  • DNS resolution enabled across peering; return route and SG rules Pulumi-managed
  • Default VPC SG rules configurable via tb:network:DefaultVpcIngressRules in config
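
To illustrate how a 10.100.0.0/16 VPC could be split into public and private subnets across 3 AZs, here is a sketch using the standard-library `ipaddress` module. The /20 subnet size and the AZ names are assumptions; the actual layout is whatever infra/pulumi/config.stage.yaml defines.

```python
import ipaddress

VPC_CIDR = ipaddress.ip_network("10.100.0.0/16")  # from the PR description
AZS = ["us-west-2a", "us-west-2b", "us-west-2c"]  # assumed AZ names
SUBNET_PREFIX = 20  # assumed subnet size, not taken from the PR


def plan_subnets(vpc, azs, new_prefix):
    """Assign one public and one private subnet per AZ, in CIDR order."""
    blocks = vpc.subnets(new_prefix=new_prefix)  # generator of equal-sized blocks
    plan = {}
    for tier in ("public", "private"):
        for az in azs:
            plan[(tier, az)] = next(blocks)
    return plan


if __name__ == "__main__":
    for (tier, az), cidr in plan_subnets(VPC_CIDR, AZS, SUBNET_PREFIX).items():
        print(f"{tier:8} {az}  {cidr}")
```

Because every block comes from the same generator, the subnets cannot overlap, and the unused remainder of the /16 stays free for later tiers.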

Security:

  • Separate ALB and container SGs with dynamic source_security_group_id wiring (accounts-repo pattern)
  • Egress: all protocols (not just TCP -- avoids DNS/UDP issues)
  • PassRole scoped to specific cron roles (no Resource: * wildcard)
  • All internal endpoints (Redis, ES, DB, broker, cache) moved to Secrets Manager -- zero hardcoded metadata in settings file
  • Resource tagging: managed_by, owner, repository, service, lifecycle on all resources
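
As a sketch of what "PassRole scoped to specific cron roles" can look like as a policy statement, the helper below builds one and rejects wildcards. The role ARNs in the usage note are hypothetical, and the `iam:PassedToService` condition is an extra restriction assumed here, not something this PR states.

```python
import json


def pass_role_statement(cron_role_arns):
    """Build an iam:PassRole statement limited to the given role ARNs."""
    if not cron_role_arns or any(arn == "*" for arn in cron_role_arns):
        raise ValueError("PassRole must list explicit role ARNs, not a wildcard")
    return {
        "Effect": "Allow",
        "Action": "iam:PassRole",
        "Resource": sorted(cron_role_arns),
        # Assumption: additionally restrict which service may receive the role.
        "Condition": {
            "StringEquals": {"iam:PassedToService": "ecs-tasks.amazonaws.com"}
        },
    }


def policy_document(statements):
    """Wrap statements in a standard IAM policy document and serialise it."""
    return json.dumps({"Version": "2012-10-17", "Statement": statements}, indent=2)
```

For example, `policy_document([pass_role_statement(["arn:aws:iam::123456789012:role/example-cron-task"])])` yields a document whose `Resource` list contains only that one (hypothetical) role.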

ECS services + autoscaling:

  • Web, Worker, Versioncheck services with separate task role (task_role_arn) for runtime boto3
  • 16 cron jobs as EventBridge Scheduled Tasks
  • Target-tracking autoscaling (CPU + memory) per service; desired_count omitted so autoscaling owns the count
  • ACM cert wired for HTTPS on both ALBs
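
For context on the per-service CPU and memory target tracking, this sketch builds policy definitions in the request shape of the Application Auto Scaling `PutScalingPolicy` API. The cluster name, policy names, and target percentages are assumptions; the values actually deployed live in the Pulumi config.

```python
def target_tracking_policies(cluster, service, cpu_target=60.0, memory_target=70.0):
    """Return one CPU and one memory target-tracking policy for an ECS service."""
    resource_id = f"service/{cluster}/{service}"
    metrics = {
        "cpu": ("ECSServiceAverageCPUUtilization", cpu_target),
        "memory": ("ECSServiceAverageMemoryUtilization", memory_target),
    }
    return [
        {
            "PolicyName": f"{service}-{name}-target-tracking",  # hypothetical naming
            "ServiceNamespace": "ecs",
            "ResourceId": resource_id,
            "ScalableDimension": "ecs:service:DesiredCount",
            "PolicyType": "TargetTrackingScaling",
            "TargetTrackingScalingPolicyConfiguration": {
                "TargetValue": target,
                "PredefinedMetricSpecification": {"PredefinedMetricType": metric},
            },
        }
        for name, (metric, target) in metrics.items()
    ]
```

Both policies act on the same `ecs:service:DesiredCount` dimension, which is why the service definition leaves `desired_count` unset: a fixed count in IaC would be reapplied on every deploy and undo whatever the autoscaler had chosen.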

IAM and secrets:

  • OIDC role with strict trust policy (aud, iss, sub, job_workflow_ref)
  • Secrets IAM path reconciled: atn/stage/* policy on execution roles, task roles, and cron roles
  • 10 secrets in Secrets Manager (8 pre-existing + 2 new: celery_result_backend, elasticsearch_host)
  • AWS_ROLE_ARN repo variable set
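
The four claims named in the trust policy (aud, iss, sub, job_workflow_ref) are all fields of the GitHub Actions OIDC token. The sketch below shows the comparison in plain Python. It illustrates the semantics only and is not the IAM policy itself; the organisation, repository, and ref in the usage note are hypothetical.

```python
GITHUB_ISSUER = "https://token.actions.githubusercontent.com"


def mismatched_claims(claims, expected):
    """Return the names of expected claims that are missing or differ."""
    return [name for name, value in expected.items() if claims.get(name) != value]


def is_trusted(claims, expected):
    """A token is trusted only if every expected claim matches exactly."""
    return not mismatched_claims(claims, expected)
```

A caller would pass something like `{"iss": GITHUB_ISSUER, "aud": "sts.amazonaws.com", "sub": "repo:example-org/example-repo:ref:refs/heads/stage", "job_workflow_ref": "example-org/example-repo/.github/workflows/build-and-push.yml@refs/heads/stage"}` as `expected`. Pinning `job_workflow_ref` in addition to `sub` means a different workflow file in the same repository cannot assume the role.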

Settings fixes (settings_local_stage.py):

  • DJANGO_SETTINGS_MODULE=settings_local_stage (was settings which loaded localhost defaults)
  • CACHE_PREFIX -> CACHE_KEY_PREFIX (NameError fix)
  • cors_endpoint_overrides removed (function doesn't exist in Thunderbird fork)
  • Slave DB pointed to RDS endpoint from secret (was private DNS name unreachable from ECS VPC)
  • ES hostname corrected (stage domain didn't exist; using shared amo-tb domain)
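
Several of these fixes come down to settings silently falling back to localhost defaults. One way to guard against that is to fail fast when an injected value is missing, sketched below. The variable names are hypothetical and do not reflect the actual keys used in settings_local_stage.py.

```python
import os

# Hypothetical names; the real settings file defines its own keys.
REQUIRED = (
    "DATABASE_HOST",
    "REPLICA_DATABASE_HOST",
    "ELASTICSEARCH_HOST",
    "CELERY_BROKER_URL",
)


def load_endpoints(env=os.environ, required=REQUIRED):
    """Return the required endpoint values, or raise naming what is missing."""
    missing = [name for name in required if not env.get(name)]
    if missing:
        # Report variable names only, never their values.
        raise RuntimeError("missing required settings: " + ", ".join(missing))
    return {name: env[name] for name in required}
```

Raising at import time turns a misconfigured task into an immediate, visible startup failure instead of a container that boots and then talks to localhost.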

Post-deploy validation

Infrastructure smoke test (8/8): TCP connectivity to all backends from ECS private subnets.
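
A TCP connectivity check of this kind can be as small as the sketch below. It is not the code in infra/tests/smoke_test.py; the function name, the default timeout, and the injectable `connect` parameter are assumptions made for illustration.

```python
import socket
import time


def tcp_check(host, port, timeout=3.0, connect=socket.create_connection):
    """Return (ok, elapsed_ms) for a single TCP connect attempt."""
    start = time.monotonic()
    try:
        # The socket is closed as soon as the handshake succeeds.
        with connect((host, port), timeout=timeout):
            ok = True
    except OSError:
        # Covers refused connections, timeouts and DNS resolution failures.
        ok = False
    return ok, round((time.monotonic() - start) * 1000)
```

Catching `OSError` is enough because `socket.timeout`, `socket.gaierror` and `ConnectionRefusedError` are all subclasses of it.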

RO Django healthcheck (5/5): Real app image booted in ECS Fargate, Django loaded settings_local_stage, all backends connected:

| Check | Result |
| --- | --- |
| Django settings import | Pass (settings_local_stage) |
| MySQL database (read-only, enforced transaction_read_only) | Pass (241,480 addons, 198ms) |
| Cache backend (Memcached) | Pass (0ms) |
| Celery broker (RabbitMQ, cross-VPC) | Pass (21ms) |
| Elasticsearch 5.6 (VPC endpoint, HTTPS) | Pass (49ms) |
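
The pass/latency reporting above can be produced by a small runner like the sketch below. It is not the ro_healthcheck command itself; the names are hypothetical, and reducing errors to the exception class name is one assumed way to keep output sanitised.

```python
import time


def run_checks(checks):
    """Run (name, callable) pairs; return (name, status, detail, elapsed_ms) tuples."""
    results = []
    for name, check in checks:
        start = time.monotonic()
        try:
            detail = check()
            status = "Pass"
        except Exception as exc:
            # Keep only the exception type: messages may embed hosts or credentials.
            detail = type(exc).__name__
            status = "Fail"
        elapsed_ms = round((time.monotonic() - start) * 1000)
        results.append((name, status, detail, elapsed_ms))
    return results


def summary(results):
    """Format the pass count as 'passed/total', e.g. '5/5'."""
    passed = sum(1 for _, status, _, _ in results if status == "Pass")
    return f"{passed}/{len(results)}"
```

Each check runs in isolation, so one failing backend does not prevent the remaining checks from reporting.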

Stack cleanly destroyed after validation (157 resources, zero errors). Full lifecycle proven across multiple deploy/test/destroy cycles.

Remaining follow-ups (separate from this PR):

  • Scale up services (versioncheck first, then web, then worker -- after merge)
  • OpenSearch, Memcached, EFS components (SOW items, deferred pending decisions)
  • CI/CD deploy workflow (in progress)
  • Production environment config

Testing

# Pulumi preview (requires Python 3.13+)
cd infra/pulumi
python3.13 -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
pulumi stack select thunderbird/thunderbird-addons/stage
pulumi preview  # Shows ~157 resources

# Docker build
docker build -f Dockerfile.ecs -t addons-server:test .

# Smoke test (env-var driven, runs as ECS task after pulumi up)
cd infra/tests
docker build -t atn-smoke-test .

# RO healthcheck (runs inside real app image)
# python manage.py ro_healthcheck
# See analysis.md for full ECS run-task commands

Checklist

  • Add a description of the changes introduced in this PR
  • The change has been successfully run locally (Pulumi preview passes)
  • Add tests to cover the changes (smoke test 8/8, RO healthcheck 5/5)
  • Screenshots -- N/A, no UI changes

@e9e4e5f0faef force-pushed the feat/ecs-fargate-migration branch from 60d4f86 to 2ee8f25 on January 24, 2026 01:42
@e9e4e5f0faef self-assigned this on Jan 25, 2026
@e9e4e5f0faef force-pushed the feat/ecs-fargate-migration branch from 699facf to c54436f on January 31, 2026 17:07
@e9e4e5f0faef mentioned this pull request on Feb 3, 2026 (3 tasks)
@e9e4e5f0faef force-pushed the feat/ecs-fargate-migration branch from 8dff9f9 to 65b3600 on February 14, 2026 17:32
@e9e4e5f0faef force-pushed the feat/ecs-fargate-migration branch from 65b3600 to 34a5a24 on February 14, 2026 18:20
2 participants