Sev3: High memory usage and OOMs in Octopets API (GET /api/listings/{id}) #82

@gderossilive

Incident: INC0010025 (ServiceNow)
Severity: 3
Investigation window (UTC): 2026-03-05T14:36:00Z → 2026-03-05T16:36:00Z
Target: Container App rg-octopets-demo-lab/octopetsapi

Impact summary:

  • 182 failed requests (HTTP 500) on GET /api/listings/{id:int} within the last 2h
  • Exceptions dominated by System.OutOfMemoryException (182)
  • Container memory sustained ~79–89% around 16:22–16:33Z, aligned with alert firing at 16:31Z
  • 5xx throughput ~6/min during peak window; response time elevated ~780–860 ms

Application Insights evidence:
KQL 1 (Top failing operations):
requests
| where timestamp > ago(2h)
| summarize total=count(), failures=countif(success == false) by name, resultCode
| where failures > 0
| top 10 by failures desc
Results (last 2h):

  • GET /api/listings/{id:int}, 500: total=182, failures=182
  • GET /, 404: total=4, failures=4
  • GET /api/listings/, 499: total=1, failures=1

KQL 2 (Top exception types):
exceptions
| where timestamp > ago(2h)
| summarize count() by type
| top 10 by count_ desc
Results:

  • System.OutOfMemoryException: 182

KQL 3 (Sample failing request):
requests
| where timestamp > ago(2h) and success == false
| project timestamp, operation_Id, name, resultCode
| top 1 by timestamp desc
Sample:

  • 2026-03-05T16:34:02Z, operation_Id=d276a285f14c562542834892ec79fd76, name=GET /api/listings/{id:int}, resultCode=500

Azure Metrics (octopetsapi):

  • MemoryPercentage (avg): sustained at 74–79% from ~16:20Z to 16:33Z, peaking at ~80%; drops after 16:34Z
  • WorkingSetBytes (avg): ~859–889 MB between 16:20Z and 16:33Z (alert threshold 858,993,459 bytes = 0.8 GiB)
  • CpuPercentage (avg): ~16–28% during same period
  • Requests[statusCodeCategory=5xx] (avg/min): ~3–7.5 from 16:20Z–16:34Z
  • ResponseTime (avg ms): ~780–860 ms from 16:20Z–16:33Z
  • RestartCount: no increments observed in the last 2h
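
To make the threshold figure concrete, here is a minimal arithmetic sketch in Python. It assumes the alert is set at 80% of a 1 GiB limit (consistent with the 858,993,459-byte threshold and the previous '1Gi' setting) and that "MB" in the metric readout means 10^6 bytes. The helper name percent_over_threshold is hypothetical.

```python
# Assumption: threshold = 80% of a 1 GiB limit; "MB" = 10**6 bytes.
GIB = 2**30
threshold_bytes = int(0.8 * GIB)  # 858993459, matching the alert threshold in the issue


def percent_over_threshold(working_set_bytes: int, threshold: int = threshold_bytes) -> float:
    """Return how far a working-set value sits above the threshold, in percent."""
    return (working_set_bytes - threshold) / threshold * 100


# Reported WorkingSetBytes range during the incident window.
low, high = 859 * 10**6, 889 * 10**6

if __name__ == "__main__":
    print(f"threshold: {threshold_bytes} bytes")
    print(f"low end over threshold:  {percent_over_threshold(low):.4f}%")
    print(f"high end over threshold: {percent_over_threshold(high):.2f}%")
```

Under these assumptions the whole reported range sits within a few percent above the threshold, which fits the picture of tight headroom rather than a runaway climb.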

Suspected root cause (ranked):

  1. Memory leak or unbounded object allocation in GET /api/listings/{id:int} code path → corroborated by 182 System.OutOfMemoryException and coincident high WorkingSetBytes.
  2. Payload amplification or inefficient serialization (e.g., loading full related entities/images into memory) causing large transient allocations per request → aligns with elevated response time and 5xx bursts without CPU saturation.
  3. Insufficient container memory limit relative to request workload profile → memory hovering just above the alert threshold suggests headroom is tight.

Proposed fixes:
Code (assuming .NET API):

  • Stream responses and use pagination; avoid materializing large collections.
  • Ensure IDisposable patterns are followed for streams/db contexts; avoid ToList()/Include() on large graphs.
  • Cap result sizes (max page size) for GET /api/listings/{id}; lazy-load large fields (e.g., images) via separate endpoints.
  • Add guards and fallbacks to return 429/503 instead of OOM on pressure; instrument GC counters.
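
The capping and load-shedding ideas above can be sketched as follows. This is a language-neutral illustration written in Python, not the service's code (the issue only assumes a .NET API). The names clamp_page_size and admission_status, the cap of 50, the default of 20, and the 85% shed threshold are all hypothetical values chosen for the example.

```python
from typing import Optional

MAX_PAGE_SIZE = 50      # hypothetical upper bound on items per response
DEFAULT_PAGE_SIZE = 20  # hypothetical default when the client sends nothing usable
SHED_THRESHOLD = 0.85   # hypothetical fraction of the memory limit at which to shed load


def clamp_page_size(requested: Optional[int],
                    default: int = DEFAULT_PAGE_SIZE,
                    maximum: int = MAX_PAGE_SIZE) -> int:
    """Bound a client-supplied page size so one request cannot materialize an unbounded result."""
    if requested is None or requested < 1:
        return default
    return min(requested, maximum)


def admission_status(working_set_bytes: int,
                     limit_bytes: int,
                     shed_threshold: float = SHED_THRESHOLD) -> int:
    """Return 503 when memory use is at or above the shed threshold, else 200.

    Rejecting early with 503 is cheaper than starting work that may end in an
    out-of-memory failure and a 500.
    """
    if limit_bytes <= 0:
        raise ValueError("limit_bytes must be positive")
    return 503 if working_set_bytes / limit_bytes >= shed_threshold else 200


if __name__ == "__main__":
    print(clamp_page_size(10_000))                  # capped at MAX_PAGE_SIZE
    print(admission_status(889 * 10**6, 2**30))     # below the hypothetical 85% threshold
    print(admission_status(950 * 10**6, 2**30))     # above it, so the request is shed
```

In a real service the working-set reading would come from the runtime or the container's cgroup statistics, and the guard would typically sit in request middleware so it runs before any data is loaded.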

IaC/config (Container Apps):

  • Raise memory limit or add autoscaling based on MemoryPercentage. Example Bicep change:
    // external/octopets/apphost/infra/main.bicep (container resources section)
    container: {
      image: ''
      resources: {
        cpu: json('0.5') // Bicep has no decimal literals, so fractional CPU goes through json()
        memory: '1.5Gi' // was '1Gi'; check that this CPU/memory pair is allowed on the app's plan
      }
      env: [
        // consider ASPNETCORE_URLS and DOTNET_GCHeapHardLimitPercent if needed
      ]
    }
  • Configure scale rules to add replicas when MemoryPercentage > 70% sustained 5m.
  • Add Azure Monitor alerts for 5xx rate and ResponseTime alongside memory.
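
The "above 70% sustained for 5 minutes" condition can be read as: some unbroken run of samples above the threshold spans at least five minutes. Below is a minimal Python sketch of that reading. It illustrates the condition as written in this issue, not how Container Apps or its autoscaler evaluates rules internally. The function name breached_sustained and the sample data are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Iterable, Tuple


def breached_sustained(samples: Iterable[Tuple[datetime, float]],
                       threshold: float = 70.0,
                       window: timedelta = timedelta(minutes=5)) -> bool:
    """Return True if any unbroken run of samples above `threshold` spans at least `window`.

    `samples` must be (timestamp, percent) pairs in ascending time order.
    """
    run_start = None
    for ts, value in samples:
        if value > threshold:
            if run_start is None:
                run_start = ts          # a new run above the threshold begins here
            if ts - run_start >= window:
                return True
        else:
            run_start = None            # any dip at or below the threshold resets the run
    return False


if __name__ == "__main__":
    start = datetime(2026, 3, 5, 16, 20)
    # Hypothetical one-minute samples, all above 70%.
    steady = [(start + timedelta(minutes=i), 76.0) for i in range(6)]
    # Same series with a single dip at minute 3.
    dipped = [(ts, 60.0 if i == 3 else v) for i, (ts, v) in enumerate(steady)]
    print(breached_sustained(steady))
    print(breached_sustained(dipped))
```

Resetting on any dip keeps the rule from firing on short spikes, at the cost of reacting later when memory oscillates around the threshold.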

Next steps:

  • Reproduce locally/dev with representative payloads; collect dotnet-counters (GC HeapSize, LOH size) under load.
  • Implement streaming/pagination; add load test to validate memory stays <65% at p95.
  • Submit PR to update bicep with memory and scale adjustments.
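
For the load-test acceptance check (memory below 65% at p95), a minimal Python sketch is shown here. It uses the nearest-rank percentile definition, which is one of several common definitions and is an assumption of this example. The function names and the sample values are hypothetical.

```python
import math
from typing import Sequence


def percentile_nearest_rank(values: Sequence[float], p: int) -> float:
    """Return the p-th percentile of `values` using the nearest-rank method."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(values)
    # Integer product first, so exact cases such as 95 * 20 / 100 stay exact.
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[rank - 1]


def memory_within_budget(samples_percent: Sequence[float],
                         budget: float = 65.0,
                         p: int = 95) -> bool:
    """Return True if the p-th percentile of memory samples is strictly below `budget`."""
    return percentile_nearest_rank(samples_percent, p) < budget


if __name__ == "__main__":
    # Hypothetical MemoryPercentage samples from a load-test run.
    run = [58.0, 60.5, 61.0, 59.5, 62.0, 63.5, 60.0, 61.5, 64.0, 62.5]
    print(percentile_nearest_rank(run, 95))
    print(memory_within_budget(run))
```

A check like this could run at the end of the load test and fail the pipeline when the budget is exceeded, so the 65% target is enforced rather than eyeballed.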

References:

  • Alert fired: 2026-03-05T16:31:28Z (High Memory Usage - Octopets API)
  • Correlation sample: operation_Id d276a285f14c562542834892ec79fd76 (16:34:02Z)

This issue was created by sre-agent-demo--c3c0627e