Infrastructure Spec

Sage.ai Infrastructure Specification

Document Version: 3.0 Last Modified: 2025-12-26 Author: Sam Target Audience: DevOps, Infrastructure Team

1. Infrastructure Overview

1.1 Architecture Diagram

graph TD
    A[Internet] --> B[CloudFront CDN]
    B --> C[S3 Frontend Static Files]
    A --> D[Application Load Balancer]
    D --> E[ECS Fargate Backend]
    E --> F[RDS PostgreSQL 18]
    E --> G[ElastiCache Valkey 8.x]
    E --> H[External APIs]
    H --> I[Anthropic Claude]
    H --> J[Binance WebSocket]
    H --> J2[Gate.io WebSocket]
    H --> K[Alternative.me]

1.2 Technology Stack

interface InfrastructureStack {
  compute: {
    service: "AWS ECS Fargate";
    version: "1.4.0";
  };
  database: {
    service: "AWS RDS PostgreSQL";
    version: "18";
    reason: "5년 LTS 지원 (2030년까지), JSON 처리 30% 빠름, 쿼리 최적화 개선";
  };
  cache: {
    service: "AWS ElastiCache Valkey";
    version: "8.x";
    reason: "100% Redis 호환, Linux Foundation OSS, 라이센스 안정성";
  };
  storage: {
    service: "AWS S3";
  };
  cdn: {
    service: "AWS CloudFront";
  };
  loadBalancer: {
    service: "AWS ALB";
  };
  iac: {
    tool: "Pulumi";
    version: "3.x";
    language: "TypeScript";
    reason: "TypeScript 풀스택 통일, 타입 안정성, Terraform보다 빠른 반복";
  };
  cicd: {
    platform: "GitHub Actions";
  };
  monitoring: {
    services: ["Sentry", "CloudWatch"];
  };
  logging: {
    service: "CloudWatch Logs";
  };
}

1.3 Pulumi vs Terraform 선택 근거

Pulumi를 선택한 이유:

항목	Pulumi (TypeScript)	Terraform (HCL)
언어 통일	Backend/Frontend/IaC 모두 TypeScript	HCL은 별도 학습 필요
타입 안정성	IDE 자동완성, 컴파일 타임 검증	런타임 에러 가능성
재사용성	TypeScript 함수/클래스로 추상화 가능	모듈 시스템 제한적
개발 속도	Vite처럼 빠른 반복 (`pulumi up`)	Plan/Apply 2단계 필수
팀 학습곡선	기존 TypeScript 지식 활용	새로운 DSL 학습
디버깅	TypeScript 디버거 사용 가능	HCL 디버깅 어려움

Trade-off:

Terraform: 더 성숙한 생태계, 더 많은 레퍼런스
Pulumi: 더 빠른 개발, TypeScript 풀스택 일관성

결론: MVP 3개월 개발 속도와 TypeScript 풀스택 일관성을 위해 Pulumi 선택

2. Compute Layer

2.1 ECS Cluster Configuration

resource "aws_ecs_cluster" "sage" {
  name = "sage-cluster"

  setting {
    name  = "containerInsights"
    value = "enabled"
  }
}

2.2 Task Definition (Backend)

resource "aws_ecs_task_definition" "backend" {
  family                   = "sage-backend"
  network_mode             = "awsvpc"
  requires_compatibilities = ["FARGATE"]
  cpu                      = "512"   # 0.5 vCPU (MVP)
  memory                   = "1024"  # 1 GB (MVP)
  execution_role_arn       = aws_iam_role.ecs_execution.arn
  task_role_arn            = aws_iam_role.ecs_task.arn

  container_definitions = jsonencode([
    {
      name      = "backend"
      image     = "${aws_ecr_repository.backend.repository_url}:latest"
      cpu       = 512
      memory    = 1024
      essential = true

      portMappings = [
        {
          containerPort = 3000
          protocol      = "tcp"
        }
      ]

      environment = [
        { name = "NODE_ENV", value = "production" },
        { name = "DATABASE_URL", value = "postgresql://..." },
        { name = "VALKEY_URL", value = "valkey://..." }
      ]

      secrets = [
        {
          name      = "ANTHROPIC_API_KEY"
          valueFrom = "${aws_secretsmanager_secret.anthropic_key.arn}"
        }
      ]

      logConfiguration = {
        logDriver = "awslogs"
        options = {
          "awslogs-group"         = "/ecs/sage-backend"
          "awslogs-region"        = "us-west-2"
          "awslogs-stream-prefix" = "ecs"
        }
      }

      healthCheck = {
        command     = ["CMD-SHELL", "curl -f http://localhost:3000/health || exit 1"]
        interval    = 30
        timeout     = 5
        retries     = 3
        startPeriod = 60
      }
    }
  ])
}

2.3 Service Configuration

resource "aws_ecs_service" "backend" {
  name            = "sage-backend"
  cluster         = aws_ecs_cluster.sage.id
  task_definition = aws_ecs_task_definition.backend.arn
  desired_count   = 2  # MVP: 2 tasks

  launch_type = "FARGATE"

  network_configuration {
    subnets         = aws_subnet.private[*].id
    security_groups = [aws_security_group.backend.id]
  }

  load_balancer {
    target_group_arn = aws_lb_target_group.backend.arn
    container_name   = "backend"
    container_port   = 3000
  }

  lifecycle {
    ignore_changes = [desired_count]
  }
}

2.4 Auto Scaling

resource "aws_appautoscaling_target" "backend" {
  max_capacity       = 10  # Max 10 tasks
  min_capacity       = 2   # Min 2 tasks
  resource_id        = "service/${aws_ecs_cluster.sage.name}/${aws_ecs_service.backend.name}"
  scalable_dimension = "ecs:service:DesiredCount"
  service_namespace  = "ecs"
}

resource "aws_appautoscaling_policy" "backend_cpu" {
  name               = "backend-cpu-scaling"
  policy_type        = "TargetTrackingScaling"
  resource_id        = aws_appautoscaling_target.backend.resource_id
  scalable_dimension = aws_appautoscaling_target.backend.scalable_dimension
  service_namespace  = aws_appautoscaling_target.backend.service_namespace

  target_tracking_scaling_policy_configuration {
    target_value       = 70.0  # Target 70% CPU
    predefined_metric_specification {
      predefined_metric_type = "ECSServiceAverageCPUUtilization"
    }
    scale_in_cooldown  = 300
    scale_out_cooldown = 60
  }
}

3. Database Layer

3.1 RDS PostgreSQL Instance

resource "aws_db_instance" "postgres" {
  identifier     = "sage-postgres"
  engine         = "postgres"
  engine_version = "18.1"

  instance_class    = "db.t3.micro"  # MVP: Free tier
  allocated_storage = 20               # 20 GB (MVP)
  storage_type      = "gp3"
  storage_encrypted = true

  db_name  = "sage"
  username = "sage_admin"
  password = random_password.db_password.result

  vpc_security_group_ids = [aws_security_group.rds.id]
  db_subnet_group_name   = aws_db_subnet_group.main.name

  backup_retention_period = 7
  backup_window          = "03:00-04:00"
  maintenance_window     = "mon:04:00-mon:05:00"

  enabled_cloudwatch_logs_exports = ["postgresql", "upgrade"]

  deletion_protection = true
  skip_final_snapshot = false
  final_snapshot_identifier = "sage-postgres-final-${timestamp()}"

  tags = {
    Name        = "sage-postgres"
    Environment = "production"
  }
}

3.2 Read Replica (Phase 2+)

resource "aws_db_instance" "postgres_replica" {
  count = var.enable_read_replica ? 1 : 0

  identifier          = "sage-postgres-replica"
  replicate_source_db = aws_db_instance.postgres.id

  instance_class = "db.t3.micro"
}

4. Cache Layer

4.1 ElastiCache Valkey Cluster (Redis-Compatible)

resource "aws_elasticache_replication_group" "valkey" {
  replication_group_id       = "sage-valkey"
  replication_group_description = "Sage.ai Valkey cluster (Redis-compatible)"

  engine         = "valkey"
  engine_version = "8.0"
  node_type      = "cache.t3.micro"  # MVP: Free tier

  num_cache_clusters = 2  # 1 primary + 1 replica

  port                     = 6379
  parameter_group_name     = "default.valkey8"
  subnet_group_name        = aws_elasticache_subnet_group.main.name
  security_group_ids       = [aws_security_group.valkey.id]

  automatic_failover_enabled = true
  at_rest_encryption_enabled = true
  transit_encryption_enabled = true

  maintenance_window = "sun:05:00-sun:06:00"
  snapshot_window    = "03:00-04:00"
  snapshot_retention_limit = 5

  tags = {
    Name        = "sage-valkey"
    Environment = "production"
  }
}

5. Storage & CDN

5.1 S3 Bucket (Frontend)

resource "aws_s3_bucket" "frontend" {
  bucket = "sage-frontend-prod"

  tags = {
    Name        = "sage-frontend"
    Environment = "production"
  }
}

resource "aws_s3_bucket_versioning" "frontend" {
  bucket = aws_s3_bucket.frontend.id

  versioning_configuration {
    status = "Enabled"
  }
}

resource "aws_s3_bucket_public_access_block" "frontend" {
  bucket = aws_s3_bucket.frontend.id

  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}

5.2 CloudFront Distribution

resource "aws_cloudfront_distribution" "frontend" {
  origin {
    domain_name = aws_s3_bucket.frontend.bucket_regional_domain_name
    origin_id   = "S3-sage-frontend"

    s3_origin_config {
      origin_access_identity = aws_cloudfront_origin_access_identity.frontend.cloudfront_access_identity_path
    }
  }

  enabled             = true
  is_ipv6_enabled     = true
  default_root_object = "index.html"

  aliases = ["app.sage.ai", "www.sage.ai"]

  default_cache_behavior {
    allowed_methods  = ["GET", "HEAD", "OPTIONS"]
    cached_methods   = ["GET", "HEAD"]
    target_origin_id = "S3-sage-frontend"

    forwarded_values {
      query_string = false
      cookies {
        forward = "none"
      }
    }

    viewer_protocol_policy = "redirect-to-https"
    min_ttl                = 0
    default_ttl            = 3600    # 1 hour
    max_ttl                = 86400   # 1 day

    compress = true
  }

  custom_error_response {
    error_code         = 404
    response_code      = 200
    response_page_path = "/index.html"
  }

  restrictions {
    geo_restriction {
      restriction_type = "none"
    }
  }

  viewer_certificate {
    acm_certificate_arn      = aws_acm_certificate.frontend.arn
    ssl_support_method       = "sni-only"
    minimum_protocol_version = "TLSv1.2_2021"
  }

  tags = {
    Name        = "sage-frontend-cdn"
    Environment = "production"
  }
}

6. Load Balancer

6.1 Application Load Balancer

resource "aws_lb" "main" {
  name               = "sage-alb"
  internal           = false
  load_balancer_type = "application"
  security_groups    = [aws_security_group.alb.id]
  subnets            = aws_subnet.public[*].id

  enable_deletion_protection = true

  tags = {
    Name        = "sage-alb"
    Environment = "production"
  }
}

resource "aws_lb_target_group" "backend" {
  name        = "sage-backend-tg"
  port        = 3000
  protocol    = "HTTP"
  vpc_id      = aws_vpc.main.id
  target_type = "ip"

  health_check {
    path                = "/health"
    protocol            = "HTTP"
    matcher             = "200"
    interval            = 30
    timeout             = 5
    healthy_threshold   = 2
    unhealthy_threshold = 3
  }

  deregistration_delay = 30
}

resource "aws_lb_listener" "https" {
  load_balancer_arn = aws_lb.main.arn
  port              = "443"
  protocol          = "HTTPS"
  ssl_policy        = "ELBSecurityPolicy-TLS-1-2-2017-01"
  certificate_arn   = aws_acm_certificate.backend.arn

  default_action {
    type             = "forward"
    target_group_arn = aws_lb_target_group.backend.arn
  }
}

7. Networking

7.1 VPC Configuration (상세 다이어그램)

graph TB
    subgraph VPC["VPC (10.0.0.0/16)"]
        subgraph PublicSubnets["Public Subnets (AZ-a, AZ-b)"]
            PubSub1["Public Subnet 1<br/>10.0.0.0/24<br/>(us-west-2a)"]
            PubSub2["Public Subnet 2<br/>10.0.1.0/24<br/>(us-west-2b)"]
            ALB["Application Load Balancer<br/>(443, 80)"]
            NAT1["NAT Gateway 1"]
            NAT2["NAT Gateway 2"]
        end

        subgraph PrivateSubnets["Private Subnets (AZ-a, AZ-b)"]
            PrivSub1["Private Subnet 1<br/>10.0.10.0/24<br/>(us-west-2a)"]
            PrivSub2["Private Subnet 2<br/>10.0.11.0/24<br/>(us-west-2b)"]
            ECS1["ECS Task 1<br/>(Backend)"]
            ECS2["ECS Task 2<br/>(Backend)"]
        end

        subgraph DatabaseSubnets["Database Subnets (AZ-a, AZ-b)"]
            DBSub1["DB Subnet 1<br/>10.0.20.0/24<br/>(us-west-2a)"]
            DBSub2["DB Subnet 2<br/>10.0.21.0/24<br/>(us-west-2b)"]
            RDS["RDS PostgreSQL 18<br/>(Primary)"]
            Valkey["ElastiCache Valkey 8.x<br/>(Primary + Replica)"]
        end
    end

    Internet["Internet"] --> IGW["Internet Gateway"]
    IGW --> PubSub1
    IGW --> PubSub2

    PubSub1 --> NAT1
    PubSub2 --> NAT2
    PubSub1 --> ALB
    PubSub2 --> ALB

    ALB --> ECS1
    ALB --> ECS2

    PrivSub1 --> NAT1
    PrivSub2 --> NAT2
    PrivSub1 --> ECS1
    PrivSub2 --> ECS2

    ECS1 --> RDS
    ECS2 --> RDS
    ECS1 --> Valkey
    ECS2 --> Valkey

    DBSub1 --> RDS
    DBSub2 --> RDS
    DBSub1 --> Valkey
    DBSub2 --> Valkey

VPC 설계 원칙:

다중 AZ: 고가용성을 위해 2개 AZ 사용 (us-west-2a, us-west-2b)
3-Tier 서브넷: Public (ALB, NAT) / Private (ECS) / Database (RDS, Valkey)
보안 계층화: 인터넷 → Public → Private → Database (점진적 보안 강화)
NAT Gateway 이중화: 각 AZ마다 NAT Gateway (단일 장애점 제거)

resource "aws_vpc" "main" {
  cidr_block           = "10.0.0.0/16"
  enable_dns_hostnames = true
  enable_dns_support   = true

  tags = {
    Name = "sage-vpc"
  }
}

resource "aws_subnet" "public" {
  count             = 2
  vpc_id            = aws_vpc.main.id
  cidr_block        = "10.0.${count.index}.0/24"
  availability_zone = data.aws_availability_zones.available.names[count.index]

  map_public_ip_on_launch = true

  tags = {
    Name = "sage-public-${count.index + 1}"
  }
}

resource "aws_subnet" "private" {
  count             = 2
  vpc_id            = aws_vpc.main.id
  cidr_block        = "10.0.${count.index + 10}.0/24"
  availability_zone = data.aws_availability_zones.available.names[count.index]

  tags = {
    Name = "sage-private-${count.index + 1}"
  }
}

7.2 Security Groups

interface SecurityGroups {
  alb: {
    ingress: {
      http: "Port 80 from 0.0.0.0/0";
      https: "Port 443 from 0.0.0.0/0";
    };
    egress: "All traffic";
  };
  backend: {
    ingress: "Port 3000 from ALB security group";
    egress: "All traffic";
  };
  rds: {
    ingress: "Port 5432 from Backend security group";
    egress: "None";
  };
  valkey: {
    ingress: "Port 6379 from Backend security group";
    egress: "None";
  };
}

8. CI/CD Pipeline

8.1 Backend Deployment Flow

graph LR
    A[Git Push to main] --> B[GitHub Actions]
    B --> C[Run Tests]
    C --> D[Build Docker Image]
    D --> E[Push to ECR]
    E --> F[Update ECS Service]
    F --> G[Rolling Deployment]

8.2 Backend Deployment Workflow

name: Deploy Backend

on:
  push:
    branches: [main]
    paths: ['apps/backend/**']

jobs:
  build-and-deploy:
    runs-on: ubuntu-latest

    steps:
      - name: Checkout code
        uses: actions/checkout@v4

      - name: Configure AWS credentials
        uses: aws-actions/configure-aws-credentials@v4
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: us-west-2

      - name: Login to Amazon ECR
        id: login-ecr
        uses: aws-actions/amazon-ecr-login@v2

      - name: Build, tag, and push image
        env:
          ECR_REGISTRY: ${{ steps.login-ecr.outputs.registry }}
          ECR_REPOSITORY: sage-backend
          IMAGE_TAG: ${{ github.sha }}
        run: |
          docker build -t $ECR_REGISTRY/$ECR_REPOSITORY:$IMAGE_TAG apps/backend
          docker tag $ECR_REGISTRY/$ECR_REPOSITORY:$IMAGE_TAG $ECR_REGISTRY/$ECR_REPOSITORY:latest
          docker push $ECR_REGISTRY/$ECR_REPOSITORY:$IMAGE_TAG
          docker push $ECR_REGISTRY/$ECR_REPOSITORY:latest

      - name: Deploy to ECS
        run: |
          aws ecs update-service \
            --cluster sage-cluster \
            --service sage-backend \
            --force-new-deployment \
            --region us-west-2

8.3 Frontend Deployment Workflow

name: Deploy Frontend

on:
  push:
    branches: [main]
    paths: ['apps/frontend/**']

jobs:
  build-and-deploy:
    runs-on: ubuntu-latest

    steps:
      - uses: actions/checkout@v4

      - uses: actions/setup-node@v4
        with:
          node-version: '20'

      - uses: pnpm/action-setup@v2
        with:
          version: 8

      - name: Install dependencies
        run: pnpm install

      - name: Build
        run: pnpm run build
        working-directory: apps/frontend

      - name: Deploy to S3
        run: |
          aws s3 sync apps/frontend/dist s3://sage-frontend-prod --delete

      - name: Invalidate CloudFront
        run: |
          aws cloudfront create-invalidation \
            --distribution-id ${{ secrets.CLOUDFRONT_DISTRIBUTION_ID }} \
            --paths "/*"

9. Monitoring & Logging

9.1 CloudWatch Alarms

interface CloudWatchAlarms {
  ecsCpuHigh: {
    metric: "CPUUtilization";
    threshold: "80%";
    evaluationPeriods: 2;
    period: "5 minutes";
    action: "SNS notification";
  };
  rdsCpuHigh: {
    metric: "CPUUtilization";
    threshold: "80%";
    evaluationPeriods: 2;
    period: "5 minutes";
    action: "SNS notification";
  };
  valkeyMemoryHigh: {
    metric: "DatabaseMemoryUsagePercentage";
    threshold: "80%";
    evaluationPeriods: 1;
    period: "5 minutes";
    action: "SNS notification";
  };
}

9.2 Sentry Configuration

import * as Sentry from '@sentry/node';

Sentry.init({
  dsn: process.env.SENTRY_DSN,
  environment: process.env.NODE_ENV,
  tracesSampleRate: 0.1,  // 10% of transactions
  integrations: [
    new Sentry.Integrations.Http({ tracing: true }),
    new Sentry.Integrations.Prisma({ client: prisma })
  ]
});

9.3 Discord 알림 전략

채널별 분류:

interface DiscordChannels {
  errors: {
    webhook: "https://discord.com/api/webhooks/.../errors";
    triggers: [
      "Sentry error rate > 5% for 5 minutes",
      "ECS task health check failed",
      "Database connection pool > 90%"
    ];
    severity: "🔴 CRITICAL";
  };
  performance: {
    webhook: "https://discord.com/api/webhooks/.../performance";
    triggers: [
      "API response time P95 > 500ms for 10 minutes",
      "SSE first token > 5s",
      "Valkey memory > 80%"
    ];
    severity: "🟠 WARNING";
  };
  businessMetrics: {
    webhook: "https://discord.com/api/webhooks/.../business";
    triggers: [
      "Daily active users milestone reached",
      "Shadow portfolio performance update",
      "Market alert sent to users"
    ];
    severity: "🟢 INFO";
  };
}

Discord 메시지 포맷:

// 예시: 에러 알림
async function sendErrorAlert(error: Error, context: any) {
  await axios.post(DISCORD_ERROR_WEBHOOK, {
    embeds: [{
      title: "🔴 Production Error Detected",
      color: 0xFF0000,
      fields: [
        { name: "Error", value: error.message, inline: false },
        { name: "Service", value: "ECS Backend", inline: true },
        { name: "Environment", value: "production", inline: true },
        { name: "Timestamp", value: new Date().toISOString(), inline: false },
        { name: "Sentry Link", value: `https://sentry.io/.../${error.id}`, inline: false }
      ]
    }]
  });
}

커스텀 메트릭 → Discord:

// CloudWatch Custom Metric → Lambda → Discord
interface CustomMetrics {
  chatResponseTime: "histogram → Discord if P95 > 500ms";
  tokenUsage: "counter → Discord daily summary";
  hallucinationRate: "counter → Discord if rate > 1%";
  queueSize: "gauge → Discord if size > 1000";
}

10. Cost Optimization

10.1 Monthly Cost Estimate (MVP)

interface MonthlyCostEstimate {
  ecsFargate: {
    resource: "2 tasks x 0.5 vCPU x 1 GB";
    cost: "$30";
  };
  rds: {
    resource: "db.t3.micro, 20 GB storage";
    cost: "$15";
  };
  elastiCache: {
    resource: "cache.t3.micro x 2 nodes";
    cost: "$20";
  };
  alb: {
    resource: "1 ALB";
    cost: "$20";
  };
  s3: {
    resource: "10 GB storage + 1M requests";
    cost: "$5";
  };
  cloudFront: {
    resource: "100 GB transfer";
    cost: "$10";
  };
  dataTransfer: {
    resource: "100 GB egress";
    cost: "$10";
  };
  cloudWatchLogs: {
    resource: "10 GB logs";
    cost: "$5";
  };
  total: "$115/month";
}

10.2 Cost Saving Strategies

interface CostSavingStrategies {
  reservedInstances: {
    phase: "Phase 2+";
    savings: {
      rds: "40% reduction";
      elastiCache: "30% reduction";
    };
  };
  s3Lifecycle: {
    policy: "Logs older than 30 days move to Glacier";
  };
  cloudFrontOptimization: {
    strategy: "Increase TTL to reduce origin requests";
  };
}

11. Disaster Recovery

11.1 Backup Strategy

interface BackupStrategy {
  rds: {
    frequency: "Daily";
    retention: "7 days";
    rpo: "24 hours";
    rto: "1 hour";
  };
  s3: {
    method: "Versioning";
    retention: "Indefinite";
    rpo: "Immediate";
    rto: "Immediate";
  };
  valkey: {
    frequency: "Daily snapshot";
    retention: "5 days";
    rpo: "24 hours";
    rto: "30 minutes";
  };
}

11.2 Recovery Procedures

# RDS Restore
aws rds restore-db-instance-from-db-snapshot \
  --db-instance-identifier sage-postgres-restored \
  --db-snapshot-identifier sage-postgres-snapshot-2024-01-15

# S3 Restore (Version)
aws s3api list-object-versions \
  --bucket sage-frontend-prod \
  --prefix index.html

aws s3api get-object \
  --bucket sage-frontend-prod \
  --key index.html \
  --version-id <version-id> \
  index.html

12. Security

12.1 IAM Roles

# ECS Execution Role (pull images, write logs)
resource "aws_iam_role" "ecs_execution" {
  name = "sage-ecs-execution-role"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Action = "sts:AssumeRole"
      Effect = "Allow"
      Principal = {
        Service = "ecs-tasks.amazonaws.com"
      }
    }]
  })
}

# ECS Task Role (access AWS services from app)
resource "aws_iam_role" "ecs_task" {
  name = "sage-ecs-task-role"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Action = "sts:AssumeRole"
      Effect = "Allow"
      Principal = {
        Service = "ecs-tasks.amazonaws.com"
      }
    }]
  })
}

12.2 Secrets Management

resource "aws_secretsmanager_secret" "anthropic_key" {
  name = "sage/anthropic-api-key"

  tags = {
    Name = "sage-anthropic-key"
  }
}

resource "aws_secretsmanager_secret_version" "anthropic_key" {
  secret_id     = aws_secretsmanager_secret.anthropic_key.id
  secret_string = var.anthropic_api_key
}

12.3 API Key 관리 전략

중앙화된 API Key 관리:

// API Key 저장소 구조
interface APIKeyStore {
  anthropic: {
    storage: "AWS Secrets Manager";
    rotation: "Manual (Phase 2: Automatic)";
    access: "ECS Task Role only";
  };
  binance: {
    storage: "None (Public WebSocket)";
    type: "WebSocket";
    endpoint: "wss://stream.binance.com:9443/ws";
  };
  gateio: {
    storage: "None (Public WebSocket)";
    type: "WebSocket";
    endpoint: "wss://api.gateio.ws/ws/v4/";
  };
  discord: {
    storage: "AWS Secrets Manager";
    type: "Webhook URL";
    channels: ["#errors", "#performance", "#business-metrics"];
  };
}

보안 원칙:

최소 권한: ECS Task Role만 접근 가능
감사 로그: CloudTrail로 모든 Secret 접근 기록
암호화: KMS로 암호화된 상태로 저장
버전 관리: Secret 변경 시 이전 버전 7일 유지

13. Scaling Strategy

13.1 Scaling Phases

interface ScalingPhases {
  phase1_mvp: {
    mau: "5,000";
    ecs: "2 tasks (0.5 vCPU, 1 GB each)";
    rds: "db.t3.micro (1 vCPU, 1 GB)";
    valkey: "cache.t3.micro x2";
  };
  phase2_growth: {
    mau: "50,000";
    ecs: "5-10 tasks (auto-scaling)";
    rds: "db.t3.small (2 vCPU, 2 GB) + Read Replica";
    valkey: "cache.t3.small x2";
  };
  phase3_scale: {
    mau: "100,000+";
    ecs: "10-20 tasks";
    rds: "db.r6g.large (2 vCPU, 16 GB) + 2 Read Replicas";
    valkey: "cache.r6g.large x2";
    multiRegion: "US + Asia";
  };
}

Appendix A: Terraform Commands

# Initialize
terraform init

# Plan
terraform plan -out=tfplan

# Apply
terraform apply tfplan

# Destroy (careful!)
terraform destroy

Appendix B: AWS CLI Useful Commands

# List ECS tasks
aws ecs list-tasks --cluster sage-cluster

# Describe task
aws ecs describe-tasks --cluster sage-cluster --tasks <task-id>

# View logs
aws logs tail /ecs/sage-backend --follow

Appendix C: Environment Variables

# AWS
export AWS_REGION=us-west-2
export AWS_ACCOUNT_ID=123456789012

# Terraform
export TF_VAR_anthropic_api_key="sk-ant-..."
export TF_VAR_db_password="..."

Document Version: 3.0 Last Updated: 2025-12-26 Infrastructure: AWS ECS Fargate + RDS PostgreSQL 18 + ElastiCache Valkey 8.x IaC: Pulumi (TypeScript) Maintainer: Sam (dev@5010.tech)

Changelog

v3.0 (2025-12-26):

IaC 도구 변경: Terraform → Pulumi (TypeScript 풀스택 통일)
PostgreSQL 16 → 18: 5-year LTS, JSON 30% 성능 향상
Redis 7.x → Valkey 8.x: 라이센스 안정성, Linux Foundation 프로젝트
VPC 다이어그램 상세화: 3-Tier 서브넷 구조 (Public/Private/Database)
API Key 관리 전략 추가: AWS Secrets Manager 중앙화
Discord 알림 시스템 추가: 에러/성능/비즈니스 메트릭별 채널 분류

Infrastructure Spec

Sage.ai Infrastructure Specification

1. Infrastructure Overview

1.1 Architecture Diagram

1.2 Technology Stack

1.3 Pulumi vs Terraform 선택 근거

2. Compute Layer

2.1 ECS Cluster Configuration

2.2 Task Definition (Backend)

2.3 Service Configuration

2.4 Auto Scaling

3. Database Layer

3.1 RDS PostgreSQL Instance

3.2 Read Replica (Phase 2+)

4. Cache Layer

4.1 ElastiCache Valkey Cluster (Redis-Compatible)

5. Storage & CDN

5.1 S3 Bucket (Frontend)

5.2 CloudFront Distribution

6. Load Balancer

6.1 Application Load Balancer

7. Networking

7.1 VPC Configuration (상세 다이어그램)

7.2 Security Groups

8. CI/CD Pipeline

8.1 Backend Deployment Flow

8.2 Backend Deployment Workflow

8.3 Frontend Deployment Workflow

9. Monitoring & Logging

9.1 CloudWatch Alarms

9.2 Sentry Configuration

9.3 Discord 알림 전략

10. Cost Optimization

10.1 Monthly Cost Estimate (MVP)

10.2 Cost Saving Strategies

11. Disaster Recovery

11.1 Backup Strategy

11.2 Recovery Procedures

12. Security

12.1 IAM Roles

12.2 Secrets Management

12.3 API Key 관리 전략

13. Scaling Strategy

13.1 Scaling Phases

Appendix A: Terraform Commands

Appendix B: AWS CLI Useful Commands

Appendix C: Environment Variables

Changelog

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

🏠 Home

🏛️ Architecture

📋 Technical Specs

📦 Product

🏢 Business & Strategy

🚀 Operations

🤖 AI Guides (Legacy)

🔗 Repositories

Clone this wiki locally