
Octopets API: OutOfMemoryExceptions and high memory usage causing 500s on GET /api/listings/{id:int} (INC0010032) #87

@gderossilive

Description


Incident: INC0010032 (sys_id: TBC)
Severity: 3
Service: Octopets API (Container Apps: /subscriptions/06dbbc7b-2363-4dd4-9803-95d07f1a8d3e/resourceGroups/rg-octopets-demo-lab/providers/Microsoft.App/containerApps/octopetsapi)
Time window (UTC): 2026-03-05 14:49:36 → 16:49:36
Alert context: “High Memory Usage - Octopets API” fired at 16:41:18Z and resolved at 16:48:24Z

Impact summary:

  • 250 total requests in last 2h; 188 failures (75.2% failure rate); p95 duration ≈ 1175 ms
  • Top failing operation: GET /api/listings/{id:int} returning 500 (failures=182)

Application Insights evidence:

  1. KQL: requests summary (last 2h)
    requests
    | where timestamp >= ago(2h)
    | summarize totalRequests=count(), failedRequests=countif(success==false), failureRatePct=100.0*countif(success==false)/count(), p95DurationMs=percentile(duration, 95)
    Result: total=250, failed=188 (75.2%), p95≈1175 ms

  2. KQL: top failing operations
    requests
    | where timestamp >= ago(2h)
    | where success == false
    | summarize failures=count() by operationName=name, resultCode
    | top 10 by failures desc
    Result: GET /api/listings/{id:int} → 500:182; GET / → 404:5; GET /api/listings/ → 499:1

  3. KQL: exception types
    exceptions
    | where timestamp >= ago(2h)
    | summarize count() by type, outerType, operation_Name
    | top 10 by count_ desc
    Result: System.OutOfMemoryException count=182

  4. KQL: most recent exception sample
    exceptions
    | where timestamp >= ago(2h)
    | project timestamp, type, message, operation_Name, problemId, outerType
    | top 1 by timestamp desc
    Sample: 2026-03-05T16:34:03Z System.OutOfMemoryException at Octopets.Backend.Endpoints.ListingEndpoints.AReallyExpensiveOperation:10 (message redacted)

  5. KQL: memory-related traces frequency
    traces
    | where timestamp >= ago(2h)
    | where message has_any ("OutOfMemory","memory","GC","OOM")
    | summarize count() by bin(timestamp, 5m)
    Result: spikes at 16:15 and 16:40 UTC

Platform metrics (Container Apps: Microsoft.App/containerapps):

  • MemoryPercentage: sustained ~78–79% from ~16:22–16:41Z, then dropped to ~6% by 16:44Z (aligns with alert fire/resolve)
  • WorkingSetBytes: ~880–891 MB during incident, dropped to ~100–111 MB after 16:44Z
  • CpuPercentage: low overall (peaks ~28%), not CPU-bound
  • ResponseTime: elevated during incident window (≈ 780–860 ms averages in minute bins)
  • RestartCount: no significant restarts observed

Representative failing request:
KQL:
requests
| where timestamp >= ago(2h)
| where name == "GET /api/listings/{id:int}" and success == false
| project timestamp, operation_Id, name, resultCode, duration
| top 1 by timestamp desc
Sample: 2026-03-05T16:34:02Z, opId=d276a285f14c562542834892ec79fd76, 500, duration≈1020 ms

Suspected root cause (ranked):

  1. High-probability: unbounded memory usage in ListingEndpoints.AReallyExpensiveOperation leading to System.OutOfMemoryException and 500s. Evidence: 182 OOM exceptions; memory ~880–891MB near limit; failures concentrated on GET /api/listings/{id:int}; alert fired exactly during sustained high memory; response times elevated.
  2. Medium: inefficient data/materialization for listing details (e.g., large object graphs or non-streamed payload) causing spikes under load. Evidence: endpoint-specific failures and long durations without CPU saturation.

Proposed fixes:
Code changes (C#):

  • Audit AReallyExpensiveOperation for large allocations; replace with streaming/iterators and bounded buffers; avoid ToList()/ToArray() on large sequences.
  • Add cancellation/timeout guards and chunked processing; ensure any caches have size limits and evictions.
  • Validate JSON serialization to use streaming (System.Text.Json with Utf8JsonWriter) for large payloads.
  • Add a defensive catch to map OOM to 503 with a Retry-After header rather than 500 (temporary mitigation until the root-cause fix ships).
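
The 503 mitigation in the last bullet could look like the following minimal ASP.NET Core middleware sketch. This is hypothetical and not from the Octopets codebase; the class name and the 30-second Retry-After value are illustrative:

```csharp
// Hypothetical sketch (not from the Octopets codebase): middleware that maps
// OutOfMemoryException to 503 + Retry-After instead of a bare 500.
using System.Threading.Tasks;
using Microsoft.AspNetCore.Http;

public sealed class OomToServiceUnavailableMiddleware
{
    private readonly RequestDelegate _next;

    public OomToServiceUnavailableMiddleware(RequestDelegate next) => _next = next;

    public async Task InvokeAsync(HttpContext context)
    {
        try
        {
            await _next(context);
        }
        // Only rewrite the response if headers have not been sent yet.
        catch (OutOfMemoryException) when (!context.Response.HasStarted)
        {
            context.Response.Clear();
            context.Response.StatusCode = StatusCodes.Status503ServiceUnavailable;
            context.Response.Headers["Retry-After"] = "30"; // illustrative backoff, in seconds
        }
    }
}
```

Registered early in the pipeline (e.g. `app.UseMiddleware<OomToServiceUnavailableMiddleware>()`); note this only softens the failure mode for clients and does not replace fixing the allocation itself.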

Example diff (pseudo):
// Octopets.Backend/Endpoints/ListingEndpoints.cs

- var data = await repo.GetAllListingsAsync();
- var enriched = data.Select(x => ExpensiveExpand(x)).ToList();
+ await foreach (var item in repo.StreamListingsAsync(ct))
+ {
+     var buffer = ArrayPool<byte>.Shared.Rent(minSize);
+     try
+     {
+         var enriched = LightweightExpand(item, buffer);
+         await writer.WriteAsync(enriched, ct);
+     }
+     finally
+     {
+         ArrayPool<byte>.Shared.Return(buffer);
+     }
+ }

IaC/config changes (Bicep/path: external/octopets/apphost/infra/main.bicep):

  • Ensure memory requests/limits set explicitly; consider raising memory limit slightly (e.g., +10–20%) only after code fix validated.
  • Add Azure Monitor alerts for 5xx rate and OOM exceptions.
  • Ensure Container Apps revisions have min/max replicas aligned with workload; consider a memory-based KEDA scale rule if appropriate (Container Apps uses KEDA scale rules rather than Kubernetes HPA).
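
A minimal Bicep sketch of the explicit resource limits and a memory scale rule follows. The CPU/memory values, replica counts, and the `octopetsImage` variable are illustrative assumptions; the actual container definition lives in external/octopets/apphost/infra/main.bicep:

```bicep
// Illustrative fragment for the octopetsapi container app template block.
// Values are placeholders; raise memory only after the code fix is validated.
template: {
  containers: [
    {
      name: 'octopetsapi'
      image: octopetsImage // assumed to be defined elsewhere in main.bicep
      resources: {
        cpu: json('0.5')
        memory: '1Gi'
      }
    }
  ]
  scale: {
    minReplicas: 1
    maxReplicas: 3
    rules: [
      {
        name: 'memory-scaler'
        custom: {
          type: 'memory' // KEDA memory scaler
          metadata: {
            type: 'Utilization'
            value: '70'
          }
        }
      }
    ]
  }
}
```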

Next steps:

  • Implement code optimizations above, add unit/integration tests to validate memory footprint.
  • Add load test to reproduce and confirm memory behavior post-fix.
  • After merge/deploy, monitor MemoryPercentage, 5xx, and exceptions for 24h.
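
For the 24h monitoring step, a query along these lines (assuming the same Application Insights schema used in the evidence above) tracks the hourly failure rate post-deploy:

```kusto
requests
| where timestamp >= ago(24h)
| summarize totalRequests=count(),
            failedRequests=countif(success == false),
            failureRatePct=100.0 * countif(success == false) / count()
  by bin(timestamp, 1h)
| order by timestamp asc
```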

This issue was created by sre-agent-demo--c3c0627e
Tracked by the SRE agent here
