
Sev2: Octopets API high latency and 500s due to OutOfMemoryException in AReallyExpensiveOperation #79

@gderossilive

Description

Incident Context

  • ServiceNow Incident: INC0010021 (sys_id: TBD)
  • Alert: High Response Time - Octopets API (Metric Alert)
  • Fired: 2026-03-05T16:22:52Z
  • Target: /subscriptions/06dbbc7b-2363-4dd4-9803-95d07f1a8d3e/resourceGroups/rg-octopets-demo-lab/providers/Microsoft.App/containerApps/octopetsapi

Investigation Window (UTC)

  • Start: 2026-03-05T14:25:54Z
  • End: 2026-03-05T16:25:54Z

Evidence (Application Insights)

  1. Requests summary (KQL):
    requests
    | where timestamp >= ago(2h)
    | summarize total=count(), failed=countif(success==false), avgDurationMs=avg(duration), p95DurationMs=percentile(duration,95), p99DurationMs=percentile(duration,99)

Result:

  • total=92, failed=68 (≈74% failure rate), avgDurationMs≈968, p95≈1525 ms, p99≈1847 ms
  2. Top failing operations (KQL):
    requests
    | where timestamp >= ago(2h) and success==false
    | summarize fails=count() by name, resultCode
    | top 10 by fails desc

Result:

  • GET /api/listings/{id:int} → 500: 67
  • GET /api/listings/ → 499: 1
  3. Exceptions (KQL):
    exceptions
    | where timestamp >= ago(2h)
    | summarize exc=count() by type, outerType, method, problemId
    | top 10 by exc desc

Result:

  • System.OutOfMemoryException (Octopets.Backend.Endpoints.ListingEndpoints.AReallyExpensiveOperation): 68
  4. Sample failing request (KQL):
    requests
    | where timestamp >= ago(2h) and success==false
    | project timestamp, name, resultCode, duration, operation_Id, operation_ParentId, cloud_RoleName, url
    | top 1 by timestamp desc

Sample Result (sensitive fields redacted if any):

  • 2026-03-05T16:24:30Z GET /api/listings/{id:int} resultCode=500 duration≈1064ms opId=e6b2699d9a5bd41855a20bcdca2ee4ec role=[cae-y6uqzjyatoawm]/octopetsapi url=http://.../api/listings/2

Evidence (Azure Metrics – Microsoft.App/containerapps)

  • ResponseTime (avg): sustained elevation 16:20–16:24Z ~786–859ms (alert threshold=700ms)
  • Requests by statusCodeCategory: 5xx rose at 16:20Z (avg≈3), peaking 16:21–16:24Z (avg≈5–7.5); 2xx present concurrently; 0xx negligible
  • CpuPercentage: increased 16:20–16:24Z (avg≈76→78%)
  • MemoryPercentage: increased 16:20–16:24Z (avg≈39→58.5%)
  • RestartCount: stable (no recent increases)

Suspected Root Cause (ranked)

  1. High-probability: Memory pressure in AReallyExpensiveOperation causing System.OutOfMemoryException, returning 500s and increasing latency. Evidence: 68 OutOfMemoryExceptions tied to Octopets.Backend.Endpoints.ListingEndpoints.AReallyExpensiveOperation; 67 failures on GET /api/listings/{id:int}; Memory% rising during the alert; no restarts.
  2. Medium: Inefficient data processing in listing-by-id path leading to excessive allocations and GC pressure, extending response times. Evidence: elevated p95/p99 durations; CPU% elevated without restarts.

Proposed Fixes
Code

  • Optimize AReallyExpensiveOperation: eliminate large in-memory aggregations; stream results; use paging; avoid ToList()/materializing whole collections; reuse buffers.
  • Add cancellation and timeouts around the expensive call; return 429/503 under overload rather than a generic 500 when resource limits are hit.
  • Guard against nulls and add defensive bounds on payload sizes; add metrics for allocation hotspots.
  • Example (diff-style, C#):
    --- a/Octopets.Backend/Endpoints/ListingEndpoints.cs
    +++ b/Octopets.Backend/Endpoints/ListingEndpoints.cs
    @@
    -var data = await repository.GetAllAsync();
    -var result = AReallyExpensiveOperation(data);
    +await foreach (var chunk in repository.StreamAsync(ct))
    +{
    +    foreach (var item in chunk)
    +    {
    +        // process item and write to the response incrementally
    +    }
    +}
    +// add a cancellation token and size limits to avoid OOM
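The cancellation/timeout change above can be reduced to a self-contained sketch. This is a minimal console demo, not the actual Octopets handler: SimulatedExpensiveOperation stands in for AReallyExpensiveOperation, and the 100 ms budget is illustrative.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Minimal sketch: bound the expensive call with a per-request budget and
// map cancellation to a 503 instead of letting it surface as a 500.
class TimeoutGuardDemo
{
    // Stand-in for AReallyExpensiveOperation (hypothetical).
    static async Task<string> SimulatedExpensiveOperation(CancellationToken ct)
    {
        await Task.Delay(TimeSpan.FromSeconds(10), ct); // pretend heavy work
        return "ok";
    }

    static async Task<int> HandleRequest(TimeSpan budget)
    {
        using var cts = new CancellationTokenSource(budget); // cancels itself after 'budget'
        try
        {
            await SimulatedExpensiveOperation(cts.Token);
            return 200;
        }
        catch (OperationCanceledException)
        {
            return 503; // shed load explicitly rather than a generic 500
        }
    }

    static async Task Main()
    {
        // Budget (100 ms) is shorter than the simulated work (10 s), so we get 503.
        Console.WriteLine(await HandleRequest(TimeSpan.FromMilliseconds(100)));
    }
}
```

In the real endpoint, the budget token would also be linked to `HttpContext.RequestAborted` via `CancellationTokenSource.CreateLinkedTokenSource` so client disconnects cancel the work too.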

IaC/Config (Container Apps)

  • Increase the memory limit modestly and set CPU/memory resources explicitly so the app throttles before it OOMs. In external/octopets/apphost/infra/main.bicep (or the container module):
    containers:
    - resources:
        cpu: 0.5      # example
        memory: 1.0Gi
  • Add an env var so the .NET GC enforces a heap cap below the container limit (the runtime raises a catchable OutOfMemoryException instead of the container being OOM-killed). Note that DOTNET_GCHeapHardLimit takes a hex byte count, not a size suffix like "768m":
    name: DOTNET_GCHeapHardLimit, value: "0x30000000" (768 MiB; tune to stay below the container memory limit)
  • Create alerts:
    • MemoryPercentage > 75% for 5m
    • Exceptions (OutOfMemoryException) count > 1/min for 5m
  • Enable structured logging for allocations / request sizes around ListingEndpoints.

Next Steps

  • Implement code changes, add unit/integration tests for bounded memory behavior.
  • Adjust container resources and deploy to staging; run load tests for GET /api/listings/{id:int}.
  • Roll out gradually; monitor ResponseTime, 5xx, exceptions.
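The "bounded memory behavior" test mentioned above can be sketched with GC.GetAllocatedBytesForCurrentThread (available since .NET Core 3.0). The data source and sizes here are illustrative, not the Octopets repository API:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Sketch: verify the streaming path allocates far less than materializing
// the whole sequence with ToList(), as proposed for AReallyExpensiveOperation.
class BoundedMemoryCheck
{
    static IEnumerable<int> Source() => Enumerable.Range(0, 1_000_000);

    static long MeasureAllocations(Action work)
    {
        long before = GC.GetAllocatedBytesForCurrentThread();
        work();
        return GC.GetAllocatedBytesForCurrentThread() - before;
    }

    static void Main()
    {
        // Materializing 1M ints allocates several MB for the backing arrays.
        long materialized = MeasureAllocations(() => _ = Source().ToList());

        // Streaming keeps extra memory roughly constant (just the enumerator).
        long streamed = MeasureAllocations(() =>
        {
            long sum = 0;
            foreach (var item in Source()) sum += item;
        });

        Console.WriteLine(streamed < materialized); // expected: True
    }
}
```

A CI assertion like `streamed < materialized / 100` would catch regressions that reintroduce full materialization on the listing-by-id path.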

References

  • Resource: /subscriptions/06dbbc7b-2363-4dd4-9803-95d07f1a8d3e/resourceGroups/rg-octopets-demo-lab/providers/Microsoft.App/containerApps/octopetsapi
  • App Insights: /subscriptions/06dbbc7b-2363-4dd4-9803-95d07f1a8d3e/resourceGroups/rg-octopets-demo-lab/providers/microsoft.insights/components/octopets_appinsights-y6uqzjyatoawm

This issue was created by sre-agent-demo--c3c0627e
Tracked by the SRE agent here
