
Sev3: High Response Time & 500s on Octopets API (INC0010029) #84

@gderossilive

Description


Incident context

  • Incident: INC0010029 (sys_id: pending; blocked by ServiceNow auth)
  • Alert: High Response Time - Octopets API (metric alert fired at 2026-03-05T16:41:50Z)
  • Resource: /subscriptions/06dbbc7b-2363-4dd4-9803-95d07f1a8d3e/resourceGroups/rg-octopets-demo-lab/providers/Microsoft.App/containerApps/octopetsapi
  • Investigation window (UTC): 2026-03-05T14:45:16Z to 2026-03-05T16:45:16Z

Evidence (Application Insights)
KQL 1 (Requests summary):

let startTime=ago(2h);
let endTime=now();
requests
| where timestamp between (startTime .. endTime)
| summarize total=count(), failed=countif(success==false), avgDuration=avg(duration), p95Duration=percentile(duration,95)
| project total, failed, avgDuration, p95Duration

Results: total=250, failed=188 (~75%), avgDuration≈916.9 ms, p95≈1175 ms.

KQL 2 (Top failing operations):

let startTime=ago(2h);
let endTime=now();
requests
| where timestamp between (startTime .. endTime)
| summarize total=count(), failures=countif(success==false), avgMs=avg(duration) by name, resultCode
| order by failures desc, total desc
| take 10

Top:

  • GET /api/listings/{id:int} → 500: total 182, failures 182, avg ≈1073 ms
  • GET / → 404: total 5, failures 5 (minor)
  • GET /api/listings/ → 499: total 1, failures 1
  • GET /api/listings/ → 200: total 62, failures 0, avg ≈528 ms

KQL 3 (Sample failing request):

let startTime=ago(2h);
let endTime=now();
requests
| where timestamp between (startTime .. endTime)
| where success == false
| project timestamp, name, resultCode, duration, operation_Id, operation_Name
| order by timestamp desc
| take 1

Sample: 2026-03-05T16:43:07Z, GET /, 404, duration 93.9 ms, operation_Id e7d3c0ec6789eac1a5ba3f178712c9d5 (non-sensitive fields only).

KQL 4 (Exceptions summary):

let startTime=ago(2h);
let endTime=now();
exceptions
| where timestamp between (startTime .. endTime)
| summarize count() by type, outerMessage
| order by count_ desc
| take 5

Top exception: System.OutOfMemoryException (count 182), message "Exception of type 'System.OutOfMemoryException' was thrown."
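
To pinpoint when the OOM storm began relative to the 16:41:50Z alert, a time-binned variant of KQL 4 would help (sketch, not yet run against this workspace):

let startTime=ago(2h);
let endTime=now();
exceptions
| where timestamp between (startTime .. endTime)
| where type == "System.OutOfMemoryException"
| summarize oomCount=count() by bin(timestamp, 5m)
| order by timestamp asc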

KQL 5 (Sample exception):

let startTime=ago(2h);
let endTime=now();
exceptions
| where timestamp between (startTime .. endTime)
| project timestamp, type, problemId, outerMessage, operation_Id
| order by timestamp desc
| take 1

Sample: 2026-03-05T16:34:03Z, type System.OutOfMemoryException, problemId "System.OutOfMemoryException at Octopets.Backend.Endpoints.ListingEndpoints.AReallyExpensiveOperation:10", operation_Id d276a285f14c562542834892ec79fd76.

KQL 6 (Traces):

let startTime=ago(2h);
let endTime=now();
traces
| where timestamp between (startTime .. endTime)
| where severityLevel >= 3 or tostring(message) contains_cs "error"
| summarize count() by severityLevel
| order by severityLevel desc

No error traces returned.

Azure Metrics (Container Apps: Microsoft.App/containerapps)

  • ResponseTime (Average): sustained elevation ~780–860 ms between 16:20–16:35; 16:35 shows 801.5 ms matching alert window.
  • Requests by statusCategory: 5xx increased concurrently (3–7.5/min); 2xx low; 4xx sporadic.
  • CpuPercentage: ~74–79% during the same window.
  • MemoryPercentage: reported 0 average (likely not instrumented/unsupported for .NET container; correlate with OOM exceptions instead).
  • RestartCount: 0; no restarts observed.
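
To confirm that the 500s on GET /api/listings/{id:int} share operations with the OOM exceptions (rather than merely coinciding in time), a join on operation_Id can be run (sketch, not yet executed):

let startTime=ago(2h);
let endTime=now();
requests
| where timestamp between (startTime .. endTime)
| where success == false
| join kind=inner (
    exceptions
    | where timestamp between (startTime .. endTime)
    | where type == "System.OutOfMemoryException"
  ) on operation_Id
| summarize correlated=count() by name, resultCode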

Suspected root causes (ranked)

  1. High memory consumption in AReallyExpensiveOperation causing System.OutOfMemoryException → leads to 500s on GET /api/listings/{id:int} and elevated response times.
    Evidence: 182 OOM exceptions; 182 500s for that endpoint; p95 ~1175 ms; CPU elevated but no restarts.
  2. Inefficient payload/query processing in listings endpoint (materializing large collections, lack of streaming/pagination) exacerbating memory pressure and latency.
    Evidence: High avg durations at that endpoint (>1s), contrast with successful GET /api/listings/ avg ~528 ms.
  3. Minor routing/404 noise (GET /) not contributory to the Sev3 signal.

Proposed fixes
Code (Octopets.Backend.Endpoints.ListingEndpoints):

  • Refactor AReallyExpensiveOperation to stream results (IAsyncEnumerable), avoid ToList()/full materialization, and introduce pagination (limit=50 default).
  • Implement defensive limits (max result size) and use cancellation tokens; add timeouts on downstream calls; prefer memory-efficient serializers.
  • Guard against OOM with size checks: when approaching resource limits, return 429/503 with a Retry-After header instead of failing with 500.

Example patch (pseudo-diff):

--- ListingEndpoints.cs
+++ ListingEndpoints.cs
@@
-public async Task<IResult> AReallyExpensiveOperation(int id)
+public async Task<IResult> AReallyEfficientOperation(int id, int limit = 50, CancellationToken ct = default)
 {
-    var items = await repo.GetListingsForId(id); // returns full list
-    var payload = JsonSerializer.Serialize(items); // large materialization
-    return Results.Ok(payload);
+    // Take(limit) must precede WithCancellation (Take requires System.Linq.Async);
+    // buffers only one page instead of the full collection.
+    var page = new List<Listing>(limit);
+    await foreach (var item in repo.StreamListingsForId(id).Take(limit).WithCancellation(ct))
+    {
+        page.Add(item);
+    }
+    return Results.Ok(page);
 }
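
The defensive-limit bullet above (429/503 instead of 500) could look like the following sketch; CountListingsForIdAsync and the 10,000-item cap are hypothetical, not existing Octopets code:

// Hypothetical guard: reject oversized requests early with 503 + Retry-After
// instead of attempting a materialization that risks OutOfMemoryException.
app.MapGet("/api/listings/{id:int}", async (int id, HttpContext http, IListingRepo repo, CancellationToken ct) =>
{
    const int MaxItems = 10_000; // assumed defensive cap; tune from memory profiling

    var estimated = await repo.CountListingsForIdAsync(id, ct); // hypothetical helper
    if (estimated > MaxItems)
    {
        http.Response.Headers.RetryAfter = "30"; // hint clients to back off
        return Results.StatusCode(StatusCodes.Status503ServiceUnavailable);
    }

    return Results.Ok(repo.StreamListingsForId(id)); // streamed, not materialized
});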

IaC/config (Container Apps bicep/yaml)

  • Increase memory limit for octopetsapi and set resource constraints (e.g., cpu=1, memory=2Gi) and enable autoscaling on CPU and concurrent requests.
  • Add Azure Monitor alerts for OOM exceptions and 5xx rate; keep existing response time alert.
  • Configure health probes and circuit breaker/retry limits to avoid compounded retries under memory pressure.
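
A minimal sketch of the resource and scale settings above, as a bicep fragment; property names follow the Microsoft.App/containerApps template schema, while the image parameter, /healthz path, and replica counts are assumptions to validate in staging:

// Fragment of the containerApps template block: raised memory, CPU/concurrency scaling
template: {
  containers: [
    {
      name: 'octopetsapi'
      image: octopetsImage // assumed parameter
      resources: {
        cpu: json('1.0')
        memory: '2Gi' // raised limit; validate headroom against OOM profile
      }
      probes: [
        {
          type: 'Liveness'
          httpGet: { path: '/healthz', port: 8080 } // assumed health endpoint
        }
      ]
    }
  ]
  scale: {
    minReplicas: 2
    maxReplicas: 10
    rules: [
      {
        name: 'http-concurrency'
        http: { metadata: { concurrentRequests: '50' } }
      }
    ]
  }
}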

Next steps

  • Implement code refactor; add unit/perf tests for large payloads.
  • Update IaC to set memory/resource limits and scaling; deploy to staging; run load test; observe App Insights.

Please assign appropriate owners and track remediation.


This issue was created by sre-agent-demo--c3c0627e
Tracked by the SRE agent here
