Description
Incident: INC0010032 (sys_id: TBC)
Severity: 3
Service: Octopets API (Container Apps: /subscriptions/06dbbc7b-2363-4dd4-9803-95d07f1a8d3e/resourceGroups/rg-octopets-demo-lab/providers/Microsoft.App/containerApps/octopetsapi)
Time window (UTC): 2026-03-05 14:49:36 → 16:49:36
Alert context: “High Memory Usage - Octopets API” fired at 16:41:18Z and resolved at 16:48:24Z
Impact summary:
- 250 total requests in last 2h; 188 failures (75.2% failure rate); p95 duration ≈ 1175 ms
- Top failing operation: GET /api/listings/{id:int} returning 500 (failures=182)
Application Insights evidence:
- KQL: requests summary (last 2h)
  requests
  | where timestamp >= ago(2h)
  | summarize totalRequests=count(), failedRequests=countif(success == false), failureRatePct=100.0*countif(success == false)/count(), p95DurationMs=percentile(duration, 95)
  Result: total=250, failed=188 (75.2%), p95 ≈ 1175 ms
- KQL: top failing operations
  requests
  | where timestamp >= ago(2h)
  | where success == false
  | summarize failures=count() by operationName=name, resultCode
  | top 10 by failures desc
  Result: GET /api/listings/{id:int} → 500: 182; GET / → 404: 5; GET /api/listings/ → 499: 1
- KQL: exception types
  exceptions
  | where timestamp >= ago(2h)
  | summarize count() by type, outerType, operation_Name
  | top 10 by count_ desc
  Result: System.OutOfMemoryException count=182
- KQL: most recent exception sample
  exceptions
  | where timestamp >= ago(2h)
  | project timestamp, type, message, operation_Name, problemId, outerType
  | top 1 by timestamp desc
  Sample: 2026-03-05T16:34:03Z System.OutOfMemoryException at Octopets.Backend.Endpoints.ListingEndpoints.AReallyExpensiveOperation:10 (message redacted)
- KQL: memory-related traces frequency
  traces
  | where timestamp >= ago(2h)
  | where message has_any ("OutOfMemory", "memory", "GC", "OOM")
  | summarize count() by bin(timestamp, 5m)
  Result: spikes at 16:15 and 16:40 UTC
Platform metrics (Container Apps: Microsoft.App/containerApps):
- MemoryPercentage: sustained ~78–79% from ~16:22–16:41Z, then dropped to ~6% by 16:44Z (aligns with alert fire/resolve)
- WorkingSetBytes: ~880–891 MB during incident, dropped to ~100–111 MB after 16:44Z
- CpuPercentage: low overall (peaks ~28%), not CPU-bound
- ResponseTime: elevated during incident window (≈ 780–860 ms averages in minute bins)
- RestartCount: no significant restarts observed
Representative failing request:
KQL:
requests
| where timestamp >= ago(2h)
| where name == "GET /api/listings/{id:int}" and success == false
| project timestamp, operation_Id, name, resultCode, duration
| top 1 by timestamp desc
Sample: 2026-03-05T16:34:02Z, opId=d276a285f14c562542834892ec79fd76, 500, duration≈1020 ms
Suspected root cause (ranked):
- High-probability: unbounded memory usage in ListingEndpoints.AReallyExpensiveOperation leading to System.OutOfMemoryException and 500s. Evidence: 182 OOM exceptions; memory ~880–891MB near limit; failures concentrated on GET /api/listings/{id:int}; alert fired exactly during sustained high memory; response times elevated.
- Medium-probability: inefficient data materialization for listing details (e.g., large object graphs or non-streamed payloads) causing spikes under load. Evidence: endpoint-specific failures and long durations without CPU saturation.
Proposed fixes:
Code changes (C#):
- Audit AReallyExpensiveOperation for large allocations; replace with streaming/iterators and bounded buffers; avoid ToList()/ToArray() on large sequences.
- Add cancellation/timeout guards and chunked processing; ensure any caches have size limits and evictions.
- Validate JSON serialization to use streaming (System.Text.Json with Utf8JsonWriter) for large payloads.
- Add a defensive try/catch that maps OOM to 503 with a Retry-After header rather than 500 (temporary mitigation until the root-cause fix ships).
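The cache-bounding point above could look like the following sketch using Microsoft.Extensions.Caching.Memory; the BoundedListingCache class, Listing record, and unit sizing are illustrative assumptions, not code from the Octopets repo:

```csharp
using System;
using Microsoft.Extensions.Caching.Memory;

// Sketch of a size-bounded cache for listing details. With SizeLimit set,
// every entry must declare a Size; exceeding the budget triggers eviction.
public sealed class BoundedListingCache
{
    private readonly MemoryCache _cache = new(new MemoryCacheOptions
    {
        SizeLimit = 1024  // total "units"; tune to observed payload sizes
    });

    public Listing GetOrLoad(int id, Func<int, Listing> load) =>
        _cache.GetOrCreate(id, entry =>
        {
            entry.SetSize(1);                                    // charge one unit per listing
            entry.SetSlidingExpiration(TimeSpan.FromMinutes(5)); // evict idle entries
            return load(id);
        })!;
}

public sealed record Listing(int Id, string Name = "");
```

Bounding by size plus sliding expiration keeps the working set flat instead of growing with the key space, which is the failure mode the OOM evidence points at.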
Example diff (pseudo):
// Octopets.Backend/Endpoints/ListingEndpoints.cs
- var data = await repo.GetAllListingsAsync();
- var enriched = data.Select(x => ExpensiveExpand(x)).ToList();
+ await foreach (var item in repo.StreamListingsAsync(ct))
+ {
+     // ArrayPool<T>.Shared.Rent returns a plain T[]; it must be returned
+     // explicitly rather than disposed via `using`.
+     var buffer = ArrayPool<byte>.Shared.Rent(minSize);
+     try
+     {
+         var enriched = LightweightExpand(item, buffer);
+         await writer.WriteAsync(enriched, ct);
+     }
+     finally
+     {
+         ArrayPool<byte>.Shared.Return(buffer);
+     }
+ }
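For the temporary OOM-to-503 mitigation listed above, a hedged sketch of an ASP.NET Core middleware follows; the middleware name and the 30-second Retry-After value are illustrative assumptions, not existing Octopets code:

```csharp
using System;
using System.Threading.Tasks;
using Microsoft.AspNetCore.Http;

// Illustrative middleware: convert OutOfMemoryException into 503 + Retry-After
// so clients back off instead of retrying immediately against a starved instance.
public sealed class OomToServiceUnavailableMiddleware
{
    private readonly RequestDelegate _next;

    public OomToServiceUnavailableMiddleware(RequestDelegate next) => _next = next;

    public async Task InvokeAsync(HttpContext context)
    {
        try
        {
            await _next(context);
        }
        catch (OutOfMemoryException)
        {
            // Only rewrite the response if headers have not been flushed yet.
            if (!context.Response.HasStarted)
            {
                context.Response.Clear();
                context.Response.StatusCode = StatusCodes.Status503ServiceUnavailable;
                context.Response.Headers["Retry-After"] = "30"; // assumed back-off hint
            }
        }
    }
}

// Registration in Program.cs:
// app.UseMiddleware<OomToServiceUnavailableMiddleware>();
```

This is a stopgap: it improves client semantics during memory pressure but does not address the allocation pattern itself, so it should ship alongside, not instead of, the streaming fix.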
IaC/config changes (Bicep/path: external/octopets/apphost/infra/main.bicep):
- Ensure memory requests/limits set explicitly; consider raising memory limit slightly (e.g., +10–20%) only after code fix validated.
- Add Azure Monitor alerts for 5xx rate and OOM exceptions.
- Ensure Container Apps revisions have min/max replicas aligned with the workload; consider a memory-based KEDA scale rule if appropriate (Container Apps scaling is KEDA-based rather than HPA-based).
Next steps:
- Implement code optimizations above, add unit/integration tests to validate memory footprint.
- Add load test to reproduce and confirm memory behavior post-fix.
- After merge/deploy, monitor MemoryPercentage, 5xx, and exceptions for 24h.
This issue was created by sre-agent-demo--c3c0627e
Tracked by the SRE agent here