
Sev2: Octopets API high latency and 500s due to OutOfMemoryException in AReallyExpensiveOperation #79

@gderossilive

Description

Incident Context

  • ServiceNow Incident: INC0010021 (sys_id: TBD)
  • Alert: High Response Time - Octopets API (Metric Alert)
  • Fired: 2026-03-05T16:22:52Z
  • Target: /subscriptions/06dbbc7b-2363-4dd4-9803-95d07f1a8d3e/resourceGroups/rg-octopets-demo-lab/providers/Microsoft.App/containerApps/octopetsapi

Investigation Window (UTC)

  • Start: 2026-03-05T14:25:54Z
  • End: 2026-03-05T16:25:54Z

Evidence (Application Insights)

  1. Requests summary (KQL):
    requests
    | where timestamp >= ago(2h)
    | summarize total=count(), failed=countif(success==false), avgDurationMs=avg(duration), p95DurationMs=percentile(duration,95), p99DurationMs=percentile(duration,99)

Result:

  • total=92, failed=68 (≈74% failure rate), avgDurationMs≈968, p95≈1525 ms, p99≈1847 ms
  2. Top failing operations (KQL):
    requests
    | where timestamp >= ago(2h) and success==false
    | summarize fails=count() by name, resultCode
    | top 10 by fails desc

Result:

  • GET /api/listings/{id:int} → 500: 67
  • GET /api/listings/ → 499: 1
  3. Exceptions (KQL):
    exceptions
    | where timestamp >= ago(2h)
    | summarize exc=count() by type, outerType, method, problemId
    | top 10 by exc desc

Result:

  • System.OutOfMemoryException (Octopets.Backend.Endpoints.ListingEndpoints.AReallyExpensiveOperation): 68
  4. Sample failing request (KQL):
    requests
    | where timestamp >= ago(2h) and success==false
    | project timestamp, name, resultCode, duration, operation_Id, operation_ParentId, cloud_RoleName, url
    | top 1 by timestamp desc

Sample Result (sensitive fields redacted if any):

  • 2026-03-05T16:24:30Z GET /api/listings/{id:int} resultCode=500 duration≈1064ms opId=e6b2699d9a5bd41855a20bcdca2ee4ec role=[cae-y6uqzjyatoawm]/octopetsapi url=http://.../api/listings/2

Evidence (Azure Metrics – Microsoft.App/containerapps)

  • ResponseTime (avg): sustained elevation 16:20–16:24Z ~786–859ms (alert threshold=700ms)
  • Requests by statusCodeCategory: 5xx rose at 16:20Z (avg≈3), peaking 16:21–16:24Z (avg≈5–7.5); 2xx present concurrently; 0xx negligible
  • CpuPercentage: increased 16:20–16:24Z (avg≈76→78%)
  • MemoryPercentage: increased 16:20–16:24Z (avg≈39→58.5%)
  • RestartCount: stable (no recent increases)

Suspected Root Cause (ranked)

  1. High-probability: Memory pressure in AReallyExpensiveOperation causing System.OutOfMemoryException, returning 500s and increasing latency. Evidence: 68 OutOfMemoryExceptions tied to Octopets.Backend.Endpoints.ListingEndpoints.AReallyExpensiveOperation; 67 failures on GET /api/listings/{id:int}; Memory% rising during the alert; no restarts.
  2. Medium: Inefficient data processing in listing-by-id path leading to excessive allocations and GC pressure, extending response times. Evidence: elevated p95/p99 durations; CPU% elevated without restarts.

Proposed Fixes
Code

  • Optimize AReallyExpensiveOperation: eliminate large in-memory aggregations; stream results; use paging; avoid ToList()/materializing whole collections; reuse buffers.
  • Add cancellation and timeouts around the expensive call; return 429/503 under overload rather than a generic 500 when resource limits are hit.
  • Guard against nulls and add defensive bounds on payload sizes; add metrics for allocation hotspots.
  • Example (diff-style, C#):
    --- a/Octopets.Backend/Endpoints/ListingEndpoints.cs
    +++ b/Octopets.Backend/Endpoints/ListingEndpoints.cs
    @@
    -var data = await repository.GetAllAsync();
    -var result = AReallyExpensiveOperation(data);
    +await foreach (var chunk in repository.StreamAsync(ct))
    +{
    +    foreach (var item in chunk)
    +    {
    +        // process item and write to the response incrementally
    +    }
    +}
    +// add a cancellation token and size limits to avoid OOM
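The cancellation/timeout change above can be reduced to a self-contained sketch. This is a minimal console demo, not the actual Octopets handler: SimulatedExpensiveOperation stands in for AReallyExpensiveOperation, and the 100 ms budget is illustrative.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Minimal sketch: bound the expensive call with a per-request budget and
// map cancellation to a 503 instead of letting it surface as a 500.
class TimeoutGuardDemo
{
    // Stand-in for AReallyExpensiveOperation (hypothetical).
    static async Task<string> SimulatedExpensiveOperation(CancellationToken ct)
    {
        await Task.Delay(TimeSpan.FromSeconds(10), ct); // pretend heavy work
        return "ok";
    }

    static async Task<int> HandleRequest(TimeSpan budget)
    {
        using var cts = new CancellationTokenSource(budget); // cancels itself after 'budget'
        try
        {
            await SimulatedExpensiveOperation(cts.Token);
            return 200;
        }
        catch (OperationCanceledException)
        {
            return 503; // shed load explicitly rather than a generic 500
        }
    }

    static async Task Main()
    {
        // Budget (100 ms) is shorter than the simulated work (10 s), so we get 503.
        Console.WriteLine(await HandleRequest(TimeSpan.FromMilliseconds(100)));
    }
}
```

In the real endpoint, the budget token would also be linked to `HttpContext.RequestAborted` via `CancellationTokenSource.CreateLinkedTokenSource` so client disconnects cancel the work too.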

IaC/Config (Container Apps)

  • Increase the memory limit modestly and set CPU/memory resources explicitly so the app throttles before it OOMs. In external/octopets/apphost/infra/main.bicep (or the container module):
    containers:
    - resources:
        cpu: 0.5      # example
        memory: 1.0Gi
  • Add an env var so the .NET GC enforces a heap cap below the container limit (the runtime raises a catchable OutOfMemoryException instead of the container being OOM-killed). Note that DOTNET_GCHeapHardLimit takes a hex byte count, not a size suffix like "768m":
    name: DOTNET_GCHeapHardLimit, value: "0x30000000" (768 MiB; tune to stay below the container memory limit)
  • Create alerts:
    • MemoryPercentage > 75% for 5m
    • Exceptions (OutOfMemoryException) count > 1/min for 5m
  • Enable structured logging for allocations / request sizes around ListingEndpoints.

Next Steps

  • Implement code changes, add unit/integration tests for bounded memory behavior.
  • Adjust container resources and deploy to staging; run load tests for GET /api/listings/{id:int}.
  • Roll out gradually; monitor ResponseTime, 5xx, exceptions.
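The "bounded memory behavior" test mentioned above can be sketched with GC.GetAllocatedBytesForCurrentThread (available since .NET Core 3.0). The data source and sizes here are illustrative, not the Octopets repository API:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Sketch: verify the streaming path allocates far less than materializing
// the whole sequence with ToList(), as proposed for AReallyExpensiveOperation.
class BoundedMemoryCheck
{
    static IEnumerable<int> Source() => Enumerable.Range(0, 1_000_000);

    static long MeasureAllocations(Action work)
    {
        long before = GC.GetAllocatedBytesForCurrentThread();
        work();
        return GC.GetAllocatedBytesForCurrentThread() - before;
    }

    static void Main()
    {
        // Materializing 1M ints allocates several MB for the backing arrays.
        long materialized = MeasureAllocations(() => _ = Source().ToList());

        // Streaming keeps extra memory roughly constant (just the enumerator).
        long streamed = MeasureAllocations(() =>
        {
            long sum = 0;
            foreach (var item in Source()) sum += item;
        });

        Console.WriteLine(streamed < materialized); // expected: True
    }
}
```

A CI assertion like `streamed < materialized / 100` would catch regressions that reintroduce full materialization on the listing-by-id path.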

References

  • Resource: /subscriptions/06dbbc7b-2363-4dd4-9803-95d07f1a8d3e/resourceGroups/rg-octopets-demo-lab/providers/Microsoft.App/containerApps/octopetsapi
  • App Insights: /subscriptions/06dbbc7b-2363-4dd4-9803-95d07f1a8d3e/resourceGroups/rg-octopets-demo-lab/providers/microsoft.insights/components/octopets_appinsights-y6uqzjyatoawm

This issue was created by sre-agent-demo--c3c0627e
Tracked by the SRE agent here
