Incident Context
- ServiceNow Incident: INC0010021 (sys_id: TBD)
- Alert: High Response Time - Octopets API (Metric Alert)
- Fired: 2026-03-05T16:22:52Z
- Target: /subscriptions/06dbbc7b-2363-4dd4-9803-95d07f1a8d3e/resourceGroups/rg-octopets-demo-lab/providers/Microsoft.App/containerApps/octopetsapi
Investigation Window (UTC)
- Start: 2026-03-05T14:25:54Z
- End: 2026-03-05T16:25:54Z
Evidence (Application Insights)
- Requests summary (KQL):
```kusto
requests
| where timestamp >= ago(2h)
| summarize total=count(), failed=countif(success == false), avgDurationMs=avg(duration), p95DurationMs=percentile(duration, 95), p99DurationMs=percentile(duration, 99)
```
Result:
- total=92, failed=68 (≈74% failure rate), avgDurationMs≈968, p95≈1525 ms, p99≈1847 ms
- Top failing operations (KQL):
```kusto
requests
| where timestamp >= ago(2h) and success == false
| summarize fails=count() by name, resultCode
| top 10 by fails desc
```
Result:
- GET /api/listings/{id:int} → 500: 67
- GET /api/listings/ → 499: 1
- Exceptions (KQL):
```kusto
exceptions
| where timestamp >= ago(2h)
| summarize exc=count() by type, outerType, method, problemId
| top 10 by exc desc
```
Result:
- System.OutOfMemoryException (Octopets.Backend.Endpoints.ListingEndpoints.AReallyExpensiveOperation): 68
- Sample failing request (KQL):
```kusto
requests
| where timestamp >= ago(2h) and success == false
| project timestamp, name, resultCode, duration, operation_Id, operation_ParentId, cloud_RoleName, url
| top 1 by timestamp desc
```
Sample result (sensitive fields redacted where present):
- 2026-03-05T16:24:30Z GET /api/listings/{id:int} resultCode=500 duration≈1064ms opId=e6b2699d9a5bd41855a20bcdca2ee4ec role=[cae-y6uqzjyatoawm]/octopetsapi url=http://.../api/listings/2
Evidence (Azure Metrics – Microsoft.App/containerapps)
- ResponseTime (avg): sustained elevation 16:20–16:24Z ~786–859ms (alert threshold=700ms)
- Requests by statusCodeCategory: 5xx rose at 16:20Z (avg≈3), peaking 16:21–16:24Z (avg≈5–7.5); 2xx present concurrently; 0xx negligible
- CpuPercentage: increased 16:20–16:24Z (avg≈76→78%)
- MemoryPercentage: increased 16:20–16:24Z (avg≈39→58.5%)
- RestartCount: stable (no recent increases)
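To line the log data up with the metric window above, request failures and latency can be binned per minute in Application Insights (a sketch using the same tables as the queries above):
```kusto
requests
| where timestamp >= ago(2h)
| summarize failed = countif(success == false), p95ms = percentile(duration, 95) by bin(timestamp, 1m)
| order by timestamp asc
```
A failure count rising in step with the 16:20–16:24Z memory climb would further support the correlation.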
Suspected Root Cause (ranked)
- High-probability: Memory pressure in AReallyExpensiveOperation causing System.OutOfMemoryException, returning 500s and increasing latency. Evidence: 68 OutOfMemoryExceptions tied to Octopets.Backend.Endpoints.ListingEndpoints.AReallyExpensiveOperation; 67 failures on GET /api/listings/{id:int}; Memory% rising during the alert; no restarts.
- Medium-probability: Inefficient data processing on the listing-by-id path leading to excessive allocations and GC pressure, extending response times. Evidence: elevated p95/p99 durations; CPU% elevated without restarts.
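The OOM→500 link in the ranking above can be verified directly by joining exceptions to failed requests on operation_Id (a sketch; it assumes both tables share the operation context, which is the default App Insights correlation behavior):
```kusto
requests
| where timestamp >= ago(2h) and success == false
| join kind=inner (
    exceptions
    | where timestamp >= ago(2h)
    | project operation_Id, type, method
) on operation_Id
| summarize count() by name, resultCode, type
```
If the high-probability hypothesis holds, nearly all GET /api/listings/{id:int} 500s should pair with System.OutOfMemoryException rows.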
Proposed Fixes
Code
- Optimize AReallyExpensiveOperation: eliminate large in-memory aggregations; stream results; use paging; avoid ToList()/materializing whole collections; reuse buffers.
- Add cancellation and timeouts around the expensive call; return 429/503 under overload rather than surfacing a 500 when upstream constraints are hit.
- Guard against nulls and add defensive bounds on payload sizes; add metrics for allocation hotspots.
- Example (diff-style, C#):
```diff
--- a/Octopets.Backend/Endpoints/ListingEndpoints.cs
+++ b/Octopets.Backend/Endpoints/ListingEndpoints.cs
@@
-var data = await repository.GetAllAsync();
-var result = AReallyExpensiveOperation(data);
+// Add a cancellation token and size limits to avoid OOM:
+// stream in bounded chunks instead of materializing the full collection
+await foreach (var chunk in repository.StreamAsync(ct))
+{
+    foreach (var item in chunk)
+    {
+        // process item and write to the response incrementally
+    }
+}
```
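A minimal sketch of the timeout/shed-load guard described in the bullets above, using ASP.NET Core minimal APIs; IListingRepository and GetByIdAsync are illustrative names, not the actual Octopets types:
```csharp
// Sketch: bound the expensive path with a hard timeout and shed load on overload.
app.MapGet("/api/listings/{id:int}", async (int id, IListingRepository repo, CancellationToken ct) =>
{
    // Link the request's token so client disconnects still cancel the work
    using var cts = CancellationTokenSource.CreateLinkedTokenSource(ct);
    cts.CancelAfter(TimeSpan.FromSeconds(5)); // illustrative budget for the expensive call

    try
    {
        var listing = await repo.GetByIdAsync(id, cts.Token);
        return listing is null ? Results.NotFound() : Results.Ok(listing);
    }
    catch (OperationCanceledException)
    {
        // Timed out or overloaded: return 503 instead of an unhandled 500
        return Results.StatusCode(StatusCodes.Status503ServiceUnavailable);
    }
});
```
Returning 503 here makes overload visible to callers and load balancers, where an OOM-driven 500 looks like a generic server bug.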
IaC/Config (Container Apps)
- Increase the memory limit modestly and set requests/limits explicitly to prevent OOM and to throttle earlier, e.g. in external/octopets/apphost/infra/main.bicep (or the container module):
```yaml
containers:
  resources:
    cpu: 0.5       # example
    memory: 1.0Gi
```
- Add an environment variable setting a .NET GC hard heap limit to avoid process-wide OOM: name: DOTNET_GCHeapHardLimit, value: "0x30000000" (768 MiB; this variable takes a hexadecimal byte count — tune it to stay below the container limit).
- Create alerts:
  - MemoryPercentage > 75% for 5 m
  - OutOfMemoryException count > 1/min for 5 m
- Enable structured logging for allocations and request sizes around ListingEndpoints.
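The resource and environment settings above could look like the following Bicep fragment (a sketch: property names follow the Microsoft.App/containerApps template schema, while the surrounding resource definition and the octopetsImage reference are placeholders):
```bicep
// Illustrative container fragment for the Container App's template block
containers: [
  {
    name: 'octopetsapi'
    image: octopetsImage // placeholder for the actual image reference
    resources: {
      cpu: json('0.5')
      memory: '1.0Gi'
    }
    env: [
      {
        name: 'DOTNET_GCHeapHardLimit'
        value: '0x30000000' // 768 MiB in hex bytes, below the 1.0Gi container limit
      }
    ]
  }
]
```
Keeping the GC hard limit below the container limit means the runtime throws a catchable OutOfMemoryException before the platform kills the replica.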
Next Steps
- Implement code changes, add unit/integration tests for bounded memory behavior.
- Adjust container resources and deploy to staging; run load tests for GET /api/listings/{id:int}.
- Roll out gradually; monitor ResponseTime, 5xx, exceptions.
References
- Resource: /subscriptions/06dbbc7b-2363-4dd4-9803-95d07f1a8d3e/resourceGroups/rg-octopets-demo-lab/providers/Microsoft.App/containerApps/octopetsapi
- App Insights: /subscriptions/06dbbc7b-2363-4dd4-9803-95d07f1a8d3e/resourceGroups/rg-octopets-demo-lab/providers/microsoft.insights/components/octopets_appinsights-y6uqzjyatoawm
This issue was created by sre-agent-demo--c3c0627e