
INC0010020: High memory usage and 500s on Octopets API (/api/listings/{id:int}) – OutOfMemoryException #80

@gderossilive

Description


Incident: INC0010020 (ServiceNow sys_id: TBD)
Severity: Sev3
Resource: /subscriptions/06dbbc7b-2363-4dd4-9803-95d07f1a8d3e/resourceGroups/rg-octopets-demo-lab/providers/Microsoft.App/containerApps/octopetsapi
App Insights: /subscriptions/06dbbc7b-2363-4dd4-9803-95d07f1a8d3e/resourceGroups/rg-octopets-demo-lab/providers/microsoft.insights/components/octopets_appinsights-y6uqzjyatoawm
Time window (UTC): 2026-03-05T14:24:16Z – 2026-03-05T16:24:16Z

Impact summary:

  • Spikes in 5xx responses on GET /api/listings/{id:int}
  • Average response time elevated (~0.8–0.86s) during the spike
  • OutOfMemoryException occurrences aligned with memory pressure

Evidence (KQL + key results):

  1. Requests volume/failures (15m bins)
    KQL:
    requests | where timestamp >= ago(2h) | summarize total=count(), failures=countif(success == false) by bin(timestamp, 15m) | order by timestamp asc
    Sample result:
    2026-03-05T16:15:00Z total=92 failures=68

  2. Top failing operations (2h)
    KQL:
    requests | where timestamp >= ago(2h) and success == false | summarize count() by resultCode, name | top 10 by count_ desc
    Sample result:

  • 500 GET /api/listings/{id:int} count=67
  • 499 GET /api/listings/ count=1
  3. Exceptions (2h)
    KQL:
    exceptions | where timestamp >= ago(2h) | summarize count() by type, outerMessage | top 10 by count_ desc
    Sample result:
  • System.OutOfMemoryException "Exception of type 'System.OutOfMemoryException' was thrown." count=68
  4. Sample failed request (redacted)
    KQL:
    requests | where timestamp >= ago(2h) and success == false | top 1 by timestamp desc | project timestamp, operation_Id, name, resultCode, duration, url, performanceBucket
    Sample result:
  • 2026-03-05T16:24:30.613Z opId=e6b2699d9a5bd41855a20bcdca2ee4ec GET /api/listings/{id:int} 500 duration=1064ms url=/api/listings/2 bucket=1sec-3sec
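  5. Correlating exceptions with failed requests (proposed, not yet run against this workspace)
    To confirm the OutOfMemoryException occurrences belong to the same operations as the 500s, a join on operation_Id along these lines could be used (sketch; column usage assumed from the standard App Insights schema):

    ```kql
    // Sketch: tie failed requests to OOM exceptions via operation_Id
    requests
    | where timestamp >= ago(2h) and success == false
    | join kind=inner (
        exceptions
        | where timestamp >= ago(2h) and type == "System.OutOfMemoryException"
      ) on operation_Id
    | summarize count() by name, resultCode
    ```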

Azure Metrics (octopetsapi):

  • Memory Working Set Bytes (avg): peaked at ~860–882 MB around 16:20–16:23Z (alert threshold ~859,993,459 bytes, ~0.86 GB)
  • Memory Percentage (avg): ~76–78% during spike
  • Requests (total): increased around 16:20Z; 5xx present during spike
  • ResponseTime (ms): ~786–860ms during spike
  • CpuPercentage: low/normal; no CPU saturation
  • RestartCount: no restarts observed in window

Suspected root cause (ranked):

  1. High probability: Memory leak or excessive object retention in GET /api/listings/{id:int}, leading to OutOfMemoryException and 500s under load (evidence: 68 OOM exceptions; memory near limit; failures concentrated on this endpoint).
  2. Medium: Inefficient payload construction (e.g., materializing large collections, unbounded caching) causing elevated working set during request handling.

Proposed fixes:
Code-level:

  • Audit /api/listings/{id} for unbounded allocations; avoid loading entire object graphs; stream results; dispose unmanaged resources; ensure no static caches retain per-request objects.
  • Add pagination/field selection where applicable; enforce size limits; consider compression only after memory is controlled.
  • Add defensive guards and timeouts; return 503 (or 429 when throttling) while the service is under expected memory pressure, rather than letting OOM surface as a generic 500; reserve 500 for genuinely unexpected failures.
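
As a concrete illustration of the first bullet, a minimal ASP.NET Core handler could project only the fields the response needs instead of materializing the full object graph. This is a hypothetical sketch: the DbContext, entity, and property names below are assumptions, not the actual Octopets code.

```csharp
// Hypothetical sketch only — AppDbContext, Listings, and the projected
// properties are placeholders; the real Octopets types may differ.
app.MapGet("/api/listings/{id:int}", async (int id, AppDbContext db) =>
{
    var listing = await db.Listings
        .AsNoTracking()                              // don't retain entities in the change tracker
        .Where(l => l.Id == id)
        .Select(l => new { l.Id, l.Name, l.Price })  // project only what the client needs
        .FirstOrDefaultAsync();

    return listing is null ? Results.NotFound() : Results.Ok(listing);
});
```

Projection plus AsNoTracking keeps per-request allocations bounded to the selected columns, which is the behavior the audit above should verify.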

IaC/config-level (non-prod proposal; to be PR’d and validated):

  • Container Apps template: increase memory limit from ~1Gi to 1.5–2Gi for API only if profiling indicates legitimate baseline need (short-term mitigation during testing). Example Bicep diff:
    • properties.template.containers[0].resources.memory: "1.5Gi"
  • Add Autoscaling based on MemoryPercentage > 70% to pre-empt pressure (KEDA/autoscale rules); cap max replicas appropriately.
  • Alerts: add MemoryPercentage early-warning at 70% (5m for 10m), and 5xx rate alert on API.
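
The memory bump and memory-based scaling could be sketched in the Container Apps template roughly as below (a fragment, not a complete resource definition; surrounding properties and API version elided, values taken from the proposal above and to be validated before any PR):

```bicep
// Sketch of the proposed template changes (fragment only)
template: {
  containers: [
    {
      name: 'octopetsapi'
      resources: {
        cpu: json('0.5')   // unchanged; CPU is not saturated
        memory: '1.5Gi'    // short-term bump from ~1Gi while profiling
      }
    }
  ]
  scale: {
    minReplicas: 1
    maxReplicas: 4         // cap replicas to contain cost
    rules: [
      {
        name: 'memory-pressure'
        custom: {
          type: 'memory'   // KEDA memory scaler
          metadata: {
            type: 'Utilization'
            value: '70'    // scale out above ~70% memory
          }
        }
      }
    ]
  }
}
```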

Next steps:

  • Reproduce locally with load test; run memory profiler; identify retention paths; fix code; PR with unit/integration tests.
  • Optional: raise memory limit and roll out canary to validate while code fix is prepared.

Please assign backend owners of /api/listings. We will update this issue with the ServiceNow sys_id once available.

This issue was created by sre-agent-demo--c3c0627e
Tracked by the SRE agent.
