Description
Incident: INC0010020 (ServiceNow sys_id: TBD)
Severity: Sev3
Resource: /subscriptions/06dbbc7b-2363-4dd4-9803-95d07f1a8d3e/resourceGroups/rg-octopets-demo-lab/providers/Microsoft.App/containerApps/octopetsapi
App Insights: /subscriptions/06dbbc7b-2363-4dd4-9803-95d07f1a8d3e/resourceGroups/rg-octopets-demo-lab/providers/microsoft.insights/components/octopets_appinsights-y6uqzjyatoawm
Time window (UTC): 2026-03-05T14:24:16Z – 2026-03-05T16:24:16Z
Impact summary:
- Spikes in 5xx responses on GET /api/listings/{id:int}
- Average response time elevated (~0.8–0.86s) during the spike
- OutOfMemoryException occurrences aligned with memory pressure
Evidence (KQL + key results):
- Requests volume/failures (15m bins)
KQL:
requests | where timestamp >= ago(2h) | summarize total=count(), failures=sumif(1, success==false) by bin(timestamp, 15m) | order by timestamp asc
Sample result:
2026-03-05T16:15:00Z total=92 failures=68
- Top failing operations (2h)
KQL:
requests | where timestamp >= ago(2h) and success == false | summarize count() by resultCode, name | top 10 by count_ desc
Sample result:
- 500 GET /api/listings/{id:int} count=67
- 499 GET /api/listings/ count=1
- Exceptions (2h)
KQL:
exceptions | where timestamp >= ago(2h) | summarize count() by type, outerMessage | top 10 by count_ desc
Sample result:
- System.OutOfMemoryException "Exception of type 'System.OutOfMemoryException' was thrown." count=68
- Sample failed request (redacted)
KQL:
requests | where timestamp >= ago(2h) and success == false | top 1 by timestamp desc | project timestamp, operation_Id, name, resultCode, duration, url, performanceBucket
Sample result:
- 2026-03-05T16:24:30.613Z opId=e6b2699d9a5bd41855a20bcdca2ee4ec GET /api/listings/{id:int} 500 duration=1064ms url=/api/listings/2 bucket=1sec-3sec
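If deeper correlation is needed, a follow-up query (not yet run; standard App Insights schema assumed) can confirm that the OOM exceptions share operation IDs with the failing requests:

```kusto
// Join failed requests to exceptions on operation_Id to verify the
// OutOfMemoryException occurrences line up with the 500s on this endpoint.
requests
| where timestamp >= ago(2h) and success == false
| join kind=inner (
    exceptions
    | where timestamp >= ago(2h)
    | project operation_Id, type
) on operation_Id
| summarize exceptionCount=count() by name, resultCode, type
| order by exceptionCount desc
```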
Azure Metrics (octopetsapi):
- Memory Working Set Bytes (avg): peaked ~860–882 MB around 16:20–16:23Z (alert threshold ~859,993,459 bytes, ~0.86 GB)
- Memory Percentage (avg): ~76–78% during spike
- Requests (total): increased around 16:20Z; 5xx present during spike
- ResponseTime (ms): ~786–860ms during spike
- CpuPercentage: low/normal; no CPU saturation
- RestartCount: no restarts observed in window
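If the container app's platform metrics are routed to Log Analytics (an assumption: this requires a diagnostic setting exporting to the AzureMetrics table), the memory peak above can be cross-checked from the same workspace:

```kusto
// Confirm the working-set peak from exported platform metrics.
// Resource names in AzureMetrics are typically uppercased.
AzureMetrics
| where TimeGenerated >= ago(2h)
| where Resource == "OCTOPETSAPI" and MetricName == "WorkingSetBytes"
| summarize avgBytes=avg(Average), maxBytes=max(Maximum) by bin(TimeGenerated, 5m)
| order by TimeGenerated asc
```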
Suspected root cause (ranked):
- High probability: Memory leak or excessive object retention in GET /api/listings/{id:int}, leading to OutOfMemoryException and 500s under load (evidence: 68 OOM exceptions; memory near limit; failures concentrated on this endpoint).
- Medium probability: Inefficient payload construction (e.g., materializing large collections, unbounded caching) causing elevated working set during request handling.
Proposed fixes:
Code-level:
- Audit /api/listings/{id} for unbounded allocations; avoid loading entire object graphs; stream results; dispose unmanaged resources; ensure no static caches retain per-request objects.
- Add pagination/field selection where applicable; enforce size limits; consider compression only after memory is controlled.
- Add defensive guards and timeouts; when memory pressure is detected, shed load with 503/429 instead of surfacing raw 500s from OOM, reserving 500 for genuinely unexpected failures.
IaC/config-level (non-prod proposal; to be PR’d and validated):
- Container Apps template: increase the API's memory limit from ~1Gi to 1.5–2Gi, but only if profiling indicates a legitimate baseline need (short-term mitigation during testing). Example Bicep diff:
- properties.template.containers[0].resources.memory: "1.5Gi"
- Add Autoscaling based on MemoryPercentage > 70% to pre-empt pressure (KEDA/autoscale rules); cap max replicas appropriately.
- Alerts: add MemoryPercentage early-warning at 70% (5m for 10m), and 5xx rate alert on API.
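A minimal sketch of the proposed template change, combining the memory bump and the memory-based scale rule (resource name and current values are assumptions to be confirmed against the actual template before the PR):

```bicep
// Hypothetical fragment of the octopetsapi container app template.
// Values below are proposals, not current settings.
template: {
  containers: [
    {
      name: 'octopetsapi'
      resources: {
        cpu: json('0.5')   // unchanged: CPU is not saturated
        memory: '1.5Gi'    // short-term bump from ~1Gi, pending profiling
      }
    }
  ]
  scale: {
    minReplicas: 1
    maxReplicas: 4         // cap replicas to bound cost
    rules: [
      {
        name: 'memory-pressure'
        custom: {
          type: 'memory'   // KEDA memory scaler
          metadata: {
            type: 'Utilization'
            value: '70'    // scale out above 70% memory utilization
          }
        }
      }
    ]
  }
}
```

Note the scale rule only pre-empts pressure from load; it will not mask a leak, so the code-level fix remains the priority.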
Next steps:
- Reproduce locally with load test; run memory profiler; identify retention paths; fix code; PR with unit/integration tests.
- Optional: raise memory limit and roll out canary to validate while code fix is prepared.
Please assign backend owners of /api/listings. We will update this issue with the ServiceNow sys_id once available.
This issue was created by sre-agent-demo--c3c0627e
Tracked by the SRE agent here