
Incident INC0010026: High memory usage and 500s on GET /api/listings/{id:int} in Octopets API #83

@gderossilive

Description


Incident context

  • Incident: INC0010026 (ServiceNow sys_id: pending – API access blocked, will update)
  • Severity: Sev3
  • Alert: High Memory Usage - Octopets API (resolved)
  • Affected resource: /subscriptions/06dbbc7b-2363-4dd4-9803-95d07f1a8d3e/resourceGroups/rg-octopets-demo-lab/providers/Microsoft.App/containerApps/octopetsapi
  • Time window (UTC): 2026-03-05T14:40:33Z → 2026-03-05T16:40:33Z
  • Alert fired: 2026-03-05T16:31:28Z, resolved: 2026-03-05T16:38:18Z

Impact summary

  • Spike of HTTP 5xx centered around 16:20–16:36Z on octopetsapi.
  • Endpoint most affected: GET /api/listings/{id:int} returning 500s; sample operation_Id d276a285f14c562542834892ec79fd76 at 2026-03-05T16:34:02Z.
  • Working set memory averaged ~0.88–0.89 GB during the event with MemoryPercentage ~76–79%, nearing the 80% alert threshold.
  • Response times elevated: 5xx ~1.05–1.37s; 2xx ~0.52–0.59s.
  • No container restarts recorded.

Evidence
Application Insights KQL and summary

  1. Failed request counts (last 2h)

     ```kusto
     requests
     | where timestamp > ago(2h)
     | where success == false
     | summarize failed=count() by name, resultCode
     | top 10 by failed desc
     ```

     Summary:
     • GET /api/listings/{id:int} → 500: 182
     • GET / → 404: 4
     • GET /api/listings/ → 499: 1

  2. Sample failing request

     ```kusto
     requests
     | where timestamp > ago(2h)
     | where success == false
     | project timestamp, name, resultCode, operation_Id, url
     | top 1 by timestamp desc
     ```

     Sample: GET /api/listings/{id:int} → 500, operation_Id d276a285f14c562542834892ec79fd76 at 2026-03-05T16:34:02Z (as cited in the impact summary above).

  3. Traces summary

     ```kusto
     traces
     | where timestamp > ago(2h)
     | summarize errors=countif(severityLevel >= 3), warnings=countif(severityLevel == 2)
     ```

     Summary: errors=0, warnings=7

Note: Exceptions aggregation query errored; please retrieve exception telemetry around the provided operation_Id for stack traces.
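As a starting point for that follow-up, a query along these lines should surface stack traces for the sample operation. This is a sketch against the standard Application Insights `exceptions` table schema; the time window is taken from the incident context above:

```kusto
// Pull exception telemetry correlated to the sample failing request.
exceptions
| where timestamp between (datetime(2026-03-05T16:20:00Z) .. datetime(2026-03-05T16:40:00Z))
| where operation_Id == "d276a285f14c562542834892ec79fd76"
| project timestamp, type, outerMessage, problemId, details
| order by timestamp desc
```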

Azure Metrics (octopetsapi) – last 2h

  • WorkingSetBytes: sustained ~670MB–891MB; peaks ~891MB between 16:20–16:40Z.
  • MemoryPercentage: ~76–79% across 16:20–16:40Z (alert threshold set at ~80%).
  • Requests by statusCodeCategory: sustained 5xx avg ~3–7/min from 16:20–16:35Z; 2xx also present; 4xx occasional.
  • ResponseTime: 5xx ~1.05–1.37s, 2xx ~0.52–0.59s during event window.
  • RestartCount: no increases observed.

Suspected root cause (ranked)

  1. Endpoint-specific defect in GET /api/listings/{id:int} causing 500s under load and elevated memory (likely unbounded in-memory object creation, inefficient serialization, or unhandled exception) – supported by high 5xx counts for this operation and near-threshold memory.
  2. Data access inefficiency (e.g., materializing large result sets or missing caching) increasing allocations and latency – supported by elevated response times and memory during requests.
  3. Logging/telemetry overhead or large payloads inflating per-request memory – only a few warnings were seen, but this remains possible given the memory pattern.

Proposed fixes
Code-level

  • Add robust error handling and input validation for GET /api/listings/{id:int}; return 404/400 instead of 500 on not-found/parse errors.
  • Implement response size controls: project to DTOs, avoid loading navigation graphs; enable pagination for listings.
  • Introduce caching for individual listing lookups with short TTL to reduce repeated heavy queries.
  • Add cancellation/timeout and defensive guards to external calls; ensure a pooled HttpClient (e.g., via IHttpClientFactory) is used.
  • Add a unit/integration test that issues repeated GET /api/listings/{id} calls and asserts that memory stays below target and no 500s occur.
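The validation, not-found handling, and short-TTL caching bullets above could be sketched roughly as follows. This is illustrative only: `IListingRepository`, `ListingDto`, and the injected `_cache`/`_repository` fields are assumptions, not names from the Octopets codebase.

```csharp
// Illustrative sketch - repository and DTO names are hypothetical.
[HttpGet("/api/listings/{id:int}")]
public async Task<IActionResult> GetListing(int id, CancellationToken ct)
{
    if (id <= 0)
        return BadRequest(new { error = "id must be a positive integer" });

    // Short-TTL cache entry to avoid repeated heavy lookups for hot listings.
    var dto = await _cache.GetOrCreateAsync($"listing:{id}", async entry =>
    {
        entry.AbsoluteExpirationRelativeToNow = TimeSpan.FromSeconds(30);
        // Project to a DTO so full navigation graphs are never materialized.
        return await _repository.GetListingDtoAsync(id, ct);
    });

    // Missing listing maps to 404 rather than bubbling up as a 500.
    return dto is null ? NotFound() : Ok(dto);
}
```

Returning 404/400 for the not-found and bad-input paths should remove most of the 500s attributed to this endpoint, while the DTO projection and cache address the allocation pressure.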

IaC/config

  • Review and, if appropriate, raise container memory limit or requests slightly (e.g., from 1.0Gi to 1.25Gi) to provide headroom while code is fixed.
  • Enable autoscaling based on concurrent requests and memory percentage (KEDA) with conservative max replicas.
  • Add Azure Monitor alerts for 5xx rate and p95 latency for octopetsapi; keep the existing memory alert.
  • Turn on sampling for verbose traces if currently high; keep exception telemetry.

Example diff snippets (illustrative)

  • .NET controller (ListingsController.cs) – current pattern to replace:

    ```csharp
    try { /* fetch */ }
    catch (Exception ex) { _logger.LogError(ex, "Failed"); return StatusCode(500, new { error = "Internal" }); }
    // Replace with specific catches returning 404/400; ensure DTO projections.
    ```

  • Bicep (external/octopets/apphost/infra/main.bicep):

    ```bicep
    // Container resources
    // resources: memory=1.25Gi, cpu=1.0
    // autoscale: memoryPercentage > 75 scale out to 2
    ```
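The comments above could expand to a fragment like the one below. Property names follow the Microsoft.App container app schema; the memory rule uses the KEDA-style custom scale rule supported by Container Apps, and the values mirror the proposed 1.25Gi limit and 75% scale-out trigger. Treat this as a sketch to validate against the actual template, not a drop-in change.

```bicep
// Illustrative fragment - surrounding resource and container name are assumptions.
resources: {
  cpu: json('1.0')
  memory: '1.25Gi'
}
// ...
scale: {
  minReplicas: 1
  maxReplicas: 2
  rules: [
    {
      name: 'memory-pressure'
      custom: {
        type: 'memory'
        metadata: {
          type: 'Utilization'
          value: '75'
        }
      }
    }
  ]
}
```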

Next steps

  • Owners: API team to debug opId d276a285f14c562542834892ec79fd76 in App Insights, capture stack trace, and patch GET /api/listings/{id:int}.
  • SRE to verify memory and 5xx trend post-patch and adjust memory/scale targets as needed.

This issue was created by sre-agent-demo--c3c0627e
Tracked by the SRE agent here
