Description
Incident: INC0010028 (sys_id: TBA)
Severity: Sev3
Investigation window (UTC): 2026-03-05T14:44:12Z to 2026-03-05T16:44:12Z
Target: Container Apps octopetsapi (/subscriptions/06dbbc7b-2363-4dd4-9803-95d07f1a8d3e/resourceGroups/rg-octopets-demo-lab/providers/Microsoft.App/containerApps/octopetsapi)
Impact summary:
- ~182 failed requests (500) on GET /api/listings/{id:int} within the last 2 hours
- Sustained memory usage ~880–891 MB (~79% MemoryPercentage) during 16:20–16:41 UTC
- System.OutOfMemoryException occurrences aligned with failing requests
- 5xx request rate persisted during the memory spike; response times ~1.0–1.3s for 5xx
Evidence:
Application Insights KQL:
- Top failing operations (last 2h)
requests
| where timestamp > ago(2h) and success == false
| summarize failures=count() by name, resultCode
| top 10 by failures desc
Results (summary):
- GET /api/listings/{id:int} 500 → 182
- GET / 404 → 5
- GET /api/listings/ 499 → 1
- Top exceptions (last 2h)
exceptions
| where timestamp > ago(2h)
| summarize exceptions=count() by type, outerMessage
| top 10 by exceptions desc
Results (summary):
- System.OutOfMemoryException → 182 (outerMessage: "Exception of type 'System.OutOfMemoryException' was thrown.")
- Sample failing operation (correlated)
requests
| where timestamp > ago(2h) and success == false and name has "/api/listings"
| project timestamp, name, resultCode, operation_Id, cloud_RoleName
| top 1 by timestamp desc
Sample:
- 2026-03-05T16:34:02.083Z, GET /api/listings/{id:int}, 500, operation_Id=d276a285f14c562542834892ec79fd76, role=[cae-y6uqzjyatoawm]/octopetsapi
- Exception sample (correlated)
exceptions
| where timestamp > ago(2h)
| project timestamp, type, outerMessage, operation_Id
| top 1 by timestamp desc
Sample:
- 2026-03-05T16:34:03.103Z, System.OutOfMemoryException, operation_Id=d276a285f14c562542834892ec79fd76
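The request/exception correlation shown in the two samples above can be confirmed in a single pass with a join on operation_Id. This is a sketch following the same Application Insights tables and columns used in the queries above:

```kusto
requests
| where timestamp > ago(2h) and success == false and name has "/api/listings"
| join kind=inner (
    exceptions
    | where timestamp > ago(2h)
    | project operation_Id, type, outerMessage
  ) on operation_Id
| project timestamp, name, resultCode, type, operation_Id
```

Each joined row ties a failed GET /api/listings/{id:int} request directly to its System.OutOfMemoryException, rather than relying on timestamp proximity alone.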
- Memory-related traces (last 2h)
traces
| where timestamp > ago(2h)
| where tostring(message) has_any ("OutOfMemory", "OOM", "memory", "heap")
| summarize entries=count() by bin(timestamp, 15m)
Results: matching trace entries concentrated in the 16:15 and 16:30 UTC bins
Azure Metrics (Microsoft.App/containerapps):
- MemoryPercentage (last 2h): sustained ~78–79% from 16:22 to 16:41 UTC, then dropped to ~5–6% from 16:44 onward
- WorkingSetBytes: ~880–891 MB from 16:22 to 16:41 UTC, then ~98–109 MB from 16:44 onward
- CpuPercentage: ~15–28% during spike; near 0 around 16:36–16:41 and after 16:44
- Requests (5xx): average ~3 at 16:20, ~6 steady through ~16:35, then 0 after ~16:36
- ResponseTime (5xx): ~1.0–1.3s during spike window
- RestartCount: near 0; brief non-zero after 16:44 (0–1.5), suggesting a restart or scale event
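If platform metrics are routed to a Log Analytics workspace via diagnostic settings, the memory plateau can also be reproduced alongside the App Insights evidence with a query like the following (the AzureMetrics table and its routing are assumptions about this environment's diagnostic configuration, not confirmed above):

```kusto
AzureMetrics
| where TimeGenerated > ago(2h)
| where MetricName == "MemoryPercentage"
| summarize avgMemoryPct = avg(Average) by bin(TimeGenerated, 5m)
| order by TimeGenerated asc
```

The expected shape is a sustained ~78–79% plateau from roughly 16:22 to 16:41 UTC, dropping to ~5–6% afterward, matching the portal metrics listed above.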
Suspected root cause(s):
- High-probability: Memory-intensive path in GET /api/listings/{id:int} leading to System.OutOfMemoryException under load. Supported by aligned OOM exceptions, high working set (~890 MB), sustained high memory percentage, and concentrated 500s on that endpoint.
- Medium-probability: Unbounded object/materialization (e.g., deserializing large payloads, loading large related blobs/images) or inefficient DTO mapping causing large transient allocations and GC pressure.
- Lower-probability: container app memory limit misconfigured relative to workload characteristics; a minor restart/scale event observed after ~16:42 reduced the memory footprint.
Proposed fixes:
Code:
- Stream data for GET /api/listings/{id:int} instead of materializing full objects; use projection to minimal DTOs and avoid loading large related data eagerly.
- Add defensive guards for large payloads (size caps) and lazy-load/async-stream related resources (images/blobs).
- Review EF/ORM query includes; replace with Select projections; ensure pagination where relevant; validate that images/blob content is not embedded inline for the id path.
- Introduce memory profiling (e.g., dotnet-counters, dotnet-gcdump) in staging to identify hotspots.
IaC/config:
- Explicitly set container memory requests/limits aligned to observed peaks (e.g., allocate >1GB if justified) and add autoscale policies based on MemoryPercentage to prevent saturation.
- Add Azure Monitor alert for App Insights OutOfMemoryException count > N in 10m.
- Configure health probes/timeouts to shed load gracefully during memory pressure.
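As a sketch for the proposed OOM alert, a log-based alert rule could evaluate a query like the one below on a 10-minute frequency and fire when oomCount exceeds the chosen threshold N (the window and threshold are placeholders from the proposal above, not measured values):

```kusto
exceptions
| where timestamp > ago(10m)
| where type == "System.OutOfMemoryException"
| summarize oomCount = count() by cloud_RoleName
```

Grouping by cloud_RoleName keeps the alert scoped per container app if other roles report to the same Application Insights resource.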
Next steps:
- Implement streaming/projection in the listings-by-id handler, add load test, and reprofile memory.
- Adjust Container Apps memory limits/requests if code-level reductions are insufficient.
Please assign to API owners for remediation. Attach relevant code paths and diffs once identified.
This issue was created by sre-agent-demo--c3c0627e
Tracked by the SRE agent here