Description
Incident: INC0010025 (ServiceNow)
Severity: 3
Investigation window (UTC): 2026-03-05T14:36:00Z → 2026-03-05T16:36:00Z
Target: Container App rg-octopets-demo-lab/octopetsapi
Impact summary:
- 182 failed requests (500) on GET /api/listings/{id:int} within last 2h
- Exceptions dominated by System.OutOfMemoryException (182)
- Container memory sustained ~79–89% around 16:22–16:33Z, aligned with alert firing at 16:31Z
- 5xx throughput ~6/min during peak window; response time elevated ~780–860 ms
Application Insights evidence:
KQL 1 (Top failing operations):
requests
| where timestamp > ago(2h)
| summarize total=count(), failures=countif(success == false) by name, resultCode
| where failures > 0
| top 10 by failures desc
Results (last 2h):
- GET /api/listings/{id:int}, 500: total=182, failures=182
- GET /, 404: total=4, failures=4
- GET /api/listings/, 499: total=1, failures=1
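A quick follow-up sketch (standard Application Insights `requests` schema; not yet run against this resource) to check whether the 500s cluster around the 16:20Z–16:33Z memory plateau rather than spreading evenly over the window:

```kusto
requests
| where timestamp > ago(2h) and resultCode == "500"
| summarize failures = count() by bin(timestamp, 1m)
| render timechart  // for portal use; drop when querying via API
```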
KQL 2 (Top exception types):
exceptions
| where timestamp > ago(2h)
| summarize count() by type
| top 10 by count_ desc
Results:
- System.OutOfMemoryException: 182
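To confirm the OOMs map onto the failing endpoint rather than background work, the exceptions can be joined back to requests on `operation_Id` (a sketch against the standard App Insights schema):

```kusto
exceptions
| where timestamp > ago(2h) and type == "System.OutOfMemoryException"
| join kind=inner (
    requests
    | where timestamp > ago(2h)
  ) on operation_Id
| summarize oomCount = count() by name, resultCode
```

If essentially all 182 rows land on GET /api/listings/{id:int} with resultCode 500, that strengthens root-cause candidate 1 below.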
KQL 3 (Sample failing request):
requests
| where timestamp > ago(2h) and success == false
| project timestamp, operation_Id, name, resultCode
| top 1 by timestamp desc
Sample:
- 2026-03-05T16:34:02Z, operation_Id=d276a285f14c562542834892ec79fd76, name=GET /api/listings/{id:int}, resultCode=500
Azure Metrics (octopetsapi):
- MemoryPercentage (avg): sustained 74–79% from ~16:20Z–16:33Z with spikes up to ~79–80%; dips after 16:34Z
- WorkingSetBytes (avg): ~859–889 MB between 16:20Z–16:33Z (alert threshold 858,993,459 bytes ≈ 0.8 GiB)
- CpuPercentage (avg): ~16–28% during same period
- Requests[statusCodeCategory=5xx] (avg/min): ~3–7.5 from 16:20Z–16:34Z
- ResponseTime (avg ms): ~780–860 ms from 16:20Z–16:33Z
- RestartCount: no recent increments observed in last 2h
Suspected root cause (ranked):
- Memory leak or unbounded object allocation in GET /api/listings/{id:int} code path → corroborated by 182 System.OutOfMemoryException and coincident high WorkingSetBytes.
- Payload amplification or inefficient serialization (e.g., loading full related entities/images into memory) causing large transient allocations per request → aligns with elevated response time and 5xx bursts without CPU saturation.
- Insufficient container memory limit relative to request workload profile → memory hovering just above the alert threshold suggests headroom is tight.
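If default performance-counter collection is enabled for this Application Insights resource (an assumption — Container Apps workloads may only emit the platform metrics above), process memory can be trended from telemetry to distinguish a monotonic climb (leak) from per-request spikes (transient allocation):

```kusto
performanceCounters
| where timestamp > ago(2h) and name == "Private Bytes"
| summarize avgBytes = avg(value) by bin(timestamp, 1m)
| order by timestamp asc
```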
Proposed fixes:
Code (assuming .NET API):
- Stream responses and use pagination; avoid materializing large collections.
- Ensure IDisposable patterns are followed for streams/db contexts; avoid ToList()/Include() on large graphs.
- Cap result sizes (max page size) for GET /api/listings/{id}; lazy-load large fields (e.g., images) via separate endpoints.
- Add memory-pressure guards that shed load (429/503) before allocations fail with OOM; instrument GC counters (e.g., via dotnet-counters or EventCounters).
IaC/config (Container Apps):
- Raise memory limit or add autoscaling based on MemoryPercentage. Example bicep snippet tweak:
// external/octopets/apphost/infra/main.bicep (container resources section)
container: {
  image: ''
  resources: {
    cpu: json('0.5') // bicep has no decimal literals; json() is the conventional workaround
    memory: '1.5Gi' // was '1Gi'
  }
  env: [
    // consider ASPNETCORE_URLS, GC HeapHardLimitPercent if needed
  ]
}
- Configure scale rules to add replicas when MemoryPercentage > 70% sustained for 5m.
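A sketch of the corresponding scale block using the KEDA `memory` scaler (rule name, replica counts, and threshold are illustrative, not taken from the actual template):

```bicep
scale: {
  minReplicas: 1
  maxReplicas: 4
  rules: [
    {
      name: 'memory-scaler'
      custom: {
        type: 'memory'
        metadata: {
          type: 'Utilization'
          value: '70' // add replicas above 70% memory utilization
        }
      }
    }
  ]
}
```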
- Add Azure Monitor alerts for 5xx rate and ResponseTime alongside memory.
Next steps:
- Reproduce locally/dev with representative payloads; collect dotnet-counters (GC HeapSize, LOH size) under load.
- Implement streaming/pagination; add load test to validate memory stays <65% at p95.
- Submit PR to update bicep with memory and scale adjustments.
References:
- Alert fired: 2026-03-05T16:31:28Z (High Memory Usage - Octopets API)
- Correlation sample: operation_Id d276a285f14c562542834892ec79fd76 (16:34:02Z)
This issue was created by sre-agent-demo--c3c0627e
Tracked by the SRE agent here