
Octopets API: High memory usage causing OutOfMemoryException and 500s on GET /api/listings/{id:int} #86

@gderossilive

Description


Incident: INC0010028 (sys_id: TBA)
Severity: Sev3
Investigation window (UTC): 2026-03-05T14:44:12Z to 2026-03-05T16:44:12Z
Target: Container Apps octopetsapi (/subscriptions/06dbbc7b-2363-4dd4-9803-95d07f1a8d3e/resourceGroups/rg-octopets-demo-lab/providers/Microsoft.App/containerApps/octopetsapi)

Impact summary:

  • ~182 failed requests (500) on GET /api/listings/{id:int} within the last 2 hours
  • Sustained memory usage ~880–891 MB (~79% MemoryPercentage) during 16:20–16:41 UTC
  • System.OutOfMemoryException occurrences aligned with failing requests
  • 5xx request rate persisted during the memory spike; response times ~1.0–1.3s for 5xx

Evidence:
Application Insights KQL:

  1. Top failing operations (last 2h)
    requests
    | where timestamp > ago(2h) and success == false
    | summarize failures=count() by name, resultCode
    | top 10 by failures desc
    Results (summary):
  • GET /api/listings/{id:int} 500 → 182
  • GET / 404 → 5
  • GET /api/listings/ 499 → 1
  2. Top exceptions (last 2h)
    exceptions
    | where timestamp > ago(2h)
    | summarize exceptions=count() by type, outerMessage
    | top 10 by exceptions desc
    Results (summary):
  • System.OutOfMemoryException → 182 (outerMessage: "Exception of type 'System.OutOfMemoryException' was thrown.")
  3. Sample failing operation (correlated)
    requests
    | where timestamp > ago(2h) and success == false and name has "/api/listings"
    | project timestamp, name, resultCode, operation_Id, cloud_RoleName
    | top 1 by timestamp desc
    Sample:
  • 2026-03-05T16:34:02.083Z, GET /api/listings/{id:int}, 500, operation_Id=d276a285f14c562542834892ec79fd76, role=[cae-y6uqzjyatoawm]/octopetsapi
  4. Exception sample (correlated)
    exceptions
    | where timestamp > ago(2h)
    | project timestamp, type, outerMessage, operation_Id
    | top 1 by timestamp desc
    Sample:
  • 2026-03-05T16:34:03.103Z, System.OutOfMemoryException, operation_Id=d276a285f14c562542834892ec79fd76
  5. Memory-related traces (last 2h)
    traces
    | where timestamp > ago(2h)
    | where tostring(message) has_any ("OutOfMemory", "OOM", "memory", "heap")
    | summarize entries=count() by bin(timestamp, 15m)
    Results: memory-related trace entries clustered in the 16:15 and 16:30 UTC bins

Azure Metrics (Microsoft.App/containerapps):

  • MemoryPercentage (last 2h): sustained ~78–79% from 16:22 to 16:41 UTC, then dropped to ~5–6% from 16:44 onward
  • WorkingSetBytes: ~880–891 MB from 16:22 to 16:41 UTC, then ~98–109 MB from 16:44 onward
  • CpuPercentage: ~15–28% during spike; near 0 around 16:36–16:41 and after 16:44
  • Requests (5xx): average ~3 at 16:20, ~6 steady through ~16:35, then 0 after ~16:36
  • ResponseTime (5xx): ~1.0–1.3s during spike window
  • RestartCount: near 0; brief non-zero after 16:44 (0–1.5), suggesting a restart or scale event

Suspected root cause(s):

  1. High-probability: Memory-intensive path in GET /api/listings/{id:int} leading to System.OutOfMemoryException under load. Supported by aligned OOM exceptions, high working set (~890 MB), sustained high memory percentage, and concentrated 500s on that endpoint.
  2. Medium-probability: Unbounded object/materialization (e.g., deserializing large payloads, loading large related blobs/images) or inefficient DTO mapping causing large transient allocations and GC pressure.
  3. Lower-probability: container app memory limit misconfigured relative to the workload's peak usage; a brief restart/scale event observed after 16:42 reduced the memory footprint.

Proposed fixes:
Code:

  • Stream data for GET /api/listings/{id:int} instead of materializing full objects; use projection to minimal DTOs and avoid loading large related data eagerly.
  • Add defensive guards for large payloads (size caps) and lazy-load/async-stream related resources (images/blobs).
  • Review EF/ORM query includes; replace with Select projections; ensure pagination where relevant; validate that images/blob content is not embedded inline for the id path.
  • Introduce memory profiling (e.g., dotnet-counters, dotnet-gcdump) in staging to identify hotspots.
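The projection approach in the first three bullets can be sketched as below. Entity and type names (`Listing`, `ListingSummaryDto`, `AppDbContext`) are hypothetical, since the actual Octopets code paths are not attached yet; the point is that the `Select` projection runs in the database, so only the chosen columns are materialized and image/blob content is never pulled onto the heap for this path:

```csharp
// Fragment of a minimal-API Program.cs (requires Microsoft.EntityFrameworkCore).
using Microsoft.EntityFrameworkCore;

// Small DTO shaped for this endpoint only; expose a count, not the image bytes.
public record ListingSummaryDto(int Id, string Name, string? Description, int ImageCount);

app.MapGet("/api/listings/{id:int}", async (int id, AppDbContext db) =>
{
    var dto = await db.Listings
        .AsNoTracking()                       // read-only: skip change-tracking allocations
        .Where(l => l.Id == id)
        .Select(l => new ListingSummaryDto(
            l.Id,
            l.Name,
            l.Description,
            l.Images.Count))                  // translated to SQL COUNT; blobs stay in the DB
        .FirstOrDefaultAsync();

    return dto is null ? Results.NotFound() : Results.Ok(dto);
});
```

If clients do need image content, serve it from a separate endpoint that streams the blob (e.g. `Results.Stream`) instead of embedding it inline in the listing payload.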

IaC/config:

  • Explicitly set container memory requests/limits aligned to observed peaks (e.g., allocate >1GB if justified) and add autoscale policies based on MemoryPercentage to prevent saturation.
  • Add Azure Monitor alert for App Insights OutOfMemoryException count > N in 10m.
  • Configure health probes/timeouts to shed load gracefully during memory pressure.
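The IaC items above might look like the following Container Apps template fragment (Bicep). The 2Gi limit, 70% scale threshold, probe path/port, and replica counts are illustrative assumptions to be validated against the observed ~890 MB peak, not measured requirements:

```bicep
// Illustrative fragment of the octopetsapi container app definition.
resource octopetsapi 'Microsoft.App/containerApps@2023-05-01' = {
  name: 'octopetsapi'
  location: location
  properties: {
    managedEnvironmentId: environmentId
    template: {
      containers: [
        {
          name: 'octopetsapi'
          image: apiImage
          resources: {
            cpu: json('1.0')
            memory: '2Gi' // raised above the ~890 MB peak; confirm with profiling first
          }
          probes: [
            {
              type: 'Liveness'
              httpGet: { path: '/healthz', port: 8080 } // path/port assumed
              periodSeconds: 10
              failureThreshold: 3
            }
          ]
        }
      ]
      scale: {
        minReplicas: 1
        maxReplicas: 5
        rules: [
          {
            name: 'memory-utilization'
            custom: {
              type: 'memory' // KEDA memory scaler
              metadata: { type: 'Utilization', value: '70' }
            }
          }
        ]
      }
    }
  }
}
```

Note that memory/CPU scale rules in Container Apps do not scale to zero, so `minReplicas: 1` is deliberate here.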

Next steps:

  • Implement streaming/projection in the listings-by-id handler, add load test, and reprofile memory.
  • Adjust Container Apps memory limits/requests if code-level reductions are insufficient.
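For the reprofiling step, a first pass in staging could use the standard .NET diagnostics tools already mentioned under "Code"; the PID (1234) is a placeholder to replace with the actual API process:

```shell
# List candidate .NET processes to find the API's PID
dotnet-counters ps

# Watch GC heap size, allocation rate, and working set live (PID assumed)
dotnet-counters monitor --process-id 1234 --counters System.Runtime

# Capture a GC heap dump while memory is elevated, for offline analysis
dotnet-gcdump collect --process-id 1234 --output octopetsapi.gcdump
```

Comparing gcdumps taken before and during a load test against GET /api/listings/{id:int} should show which types dominate the large transient allocations.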

Please assign to API owners for remediation. Attach relevant code paths and diffs once identified.

This issue was created by sre-agent-demo--c3c0627e
Tracked by the SRE agent here
