
Incident INC0010026: High memory usage and 500s on GET /api/listings/{id:int} in Octopets API #83

@gderossilive

Description


Incident context

  • Incident: INC0010026 (ServiceNow sys_id: pending – API access blocked, will update)
  • Severity: Sev3
  • Alert: High Memory Usage - Octopets API (resolved)
  • Affected resource: /subscriptions/06dbbc7b-2363-4dd4-9803-95d07f1a8d3e/resourceGroups/rg-octopets-demo-lab/providers/Microsoft.App/containerApps/octopetsapi
  • Time window (UTC): 2026-03-05T14:40:33Z → 2026-03-05T16:40:33Z
  • Alert fired: 2026-03-05T16:31:28Z, resolved: 2026-03-05T16:38:18Z

Impact summary

  • Spike of HTTP 5xx centered around 16:20–16:36Z on octopetsapi.
  • Endpoint most affected: GET /api/listings/{id:int} returning 500s; sample operation_Id d276a285f14c562542834892ec79fd76 at 2026-03-05T16:34:02Z.
  • Working set memory averaged ~0.88–0.89 GB during the event with MemoryPercentage ~76–79%, nearing the 80% alert threshold.
  • Response times elevated: 5xx ~1.05–1.37s; 2xx ~0.52–0.59s.
  • No container restarts recorded.

Evidence
Application Insights KQL and summary

  1. Failed request counts (last 2h)

     ```kusto
     requests
     | where timestamp > ago(2h)
     | where success == false
     | summarize failed=count() by name, resultCode
     | top 10 by failed desc
     ```

     Summary:
     • GET /api/listings/{id:int} → 500: 182
     • GET / → 404: 4
     • GET /api/listings/ → 499: 1

  2. Sample failing request

     ```kusto
     requests
     | where timestamp > ago(2h)
     | where success == false
     | project timestamp, name, resultCode, operation_Id, url
     | top 1 by timestamp desc
     ```

     Sample: GET /api/listings/{id:int} → 500, operation_Id d276a285f14c562542834892ec79fd76 at 2026-03-05T16:34:02Z (as cited in the impact summary above).

  3. Traces summary

     ```kusto
     traces
     | where timestamp > ago(2h)
     | summarize errors=countif(severityLevel >= 3), warnings=countif(severityLevel == 2)
     ```

     Summary: errors=0, warnings=7

Note: Exceptions aggregation query errored; please retrieve exception telemetry around the provided operation_Id for stack traces.
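As a starting point for that follow-up, a query along these lines should surface stack traces for the sample operation. This is a sketch against the standard Application Insights `exceptions` table schema; the time window is taken from the incident context above:

```kusto
// Pull exception telemetry correlated to the sample failing request.
exceptions
| where timestamp between (datetime(2026-03-05T16:20:00Z) .. datetime(2026-03-05T16:40:00Z))
| where operation_Id == "d276a285f14c562542834892ec79fd76"
| project timestamp, type, outerMessage, problemId, details
| order by timestamp desc
```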

Azure Metrics (octopetsapi) – last 2h

  • WorkingSetBytes: sustained ~670MB–891MB; peaks ~891MB between 16:20–16:40Z.
  • MemoryPercentage: ~76–79% across 16:20–16:40Z (alert threshold set at ~80%).
  • Requests by statusCodeCategory: sustained 5xx avg ~3–7/min from 16:20–16:35Z; 2xx also present; 4xx occasional.
  • ResponseTime: 5xx ~1.05–1.37s, 2xx ~0.52–0.59s during event window.
  • RestartCount: no increases observed.

Suspected root cause (ranked)

  1. Endpoint-specific defect in GET /api/listings/{id:int} causing 500s under load and elevated memory (likely unbounded in-memory object creation, inefficient serialization, or unhandled exception) – supported by high 5xx counts for this operation and near-threshold memory.
  2. Data access inefficiency (e.g., materializing large result sets or missing caching) increasing allocations and latency – supported by elevated response times and memory during requests.
  3. Logging/telemetry overhead or large payloads inflating per-request memory – only a few warnings were seen, but this remains possible given the memory pattern.

Proposed fixes
Code-level

  • Add robust error handling and input validation for GET /api/listings/{id:int}; return 404/400 instead of 500 on not-found/parse errors.
  • Implement response size controls: project to DTOs, avoid loading navigation graphs; enable pagination for listings.
  • Introduce caching for individual listing lookups with short TTL to reduce repeated heavy queries.
  • Add cancellation/timeout and defensive guards to external calls; ensure a pooled HttpClient (e.g., via IHttpClientFactory) is used.
  • Add a unit/integration test that issues repeated GET /api/listings/{id} calls and asserts that memory stays below target and no 500s occur.
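The validation, not-found handling, and short-TTL caching bullets above could be sketched roughly as follows. This is illustrative only: `IListingRepository`, `ListingDto`, and the injected `_cache`/`_repository` fields are assumptions, not names from the Octopets codebase.

```csharp
// Illustrative sketch - repository and DTO names are hypothetical.
[HttpGet("/api/listings/{id:int}")]
public async Task<IActionResult> GetListing(int id, CancellationToken ct)
{
    if (id <= 0)
        return BadRequest(new { error = "id must be a positive integer" });

    // Short-TTL cache entry to avoid repeated heavy lookups for hot listings.
    var dto = await _cache.GetOrCreateAsync($"listing:{id}", async entry =>
    {
        entry.AbsoluteExpirationRelativeToNow = TimeSpan.FromSeconds(30);
        // Project to a DTO so full navigation graphs are never materialized.
        return await _repository.GetListingDtoAsync(id, ct);
    });

    // Missing listing maps to 404 rather than bubbling up as a 500.
    return dto is null ? NotFound() : Ok(dto);
}
```

Returning 404/400 for the not-found and bad-input paths should remove most of the 500s attributed to this endpoint, while the DTO projection and cache address the allocation pressure.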

IaC/config

  • Review and, if appropriate, raise container memory limit or requests slightly (e.g., from 1.0Gi to 1.25Gi) to provide headroom while code is fixed.
  • Enable autoscaling based on concurrent requests and memory percentage (KEDA) with conservative max replicas.
  • Add Azure Monitor alerts for 5xx rate and p95 latency for octopetsapi; keep the existing memory alert.
  • Turn on sampling for verbose traces if currently high; keep exception telemetry.

Example diff snippets (illustrative)

  • .NET controller (ListingsController.cs) – current pattern to replace:

    ```csharp
    try { /* fetch */ }
    catch (Exception ex) { _logger.LogError(ex, "Failed"); return StatusCode(500, new { error = "Internal" }); }
    // Replace with specific catches returning 404/400; ensure DTO projections.
    ```

  • Bicep (external/octopets/apphost/infra/main.bicep):

    ```bicep
    // Container resources
    // resources: memory=1.25Gi, cpu=1.0
    // autoscale: memoryPercentage > 75 scale out to 2
    ```
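The comments above could expand to a fragment like the one below. Property names follow the Microsoft.App container app schema; the memory rule uses the KEDA-style custom scale rule supported by Container Apps, and the values mirror the proposed 1.25Gi limit and 75% scale-out trigger. Treat this as a sketch to validate against the actual template, not a drop-in change.

```bicep
// Illustrative fragment - surrounding resource and container name are assumptions.
resources: {
  cpu: json('1.0')
  memory: '1.25Gi'
}
// ...
scale: {
  minReplicas: 1
  maxReplicas: 2
  rules: [
    {
      name: 'memory-pressure'
      custom: {
        type: 'memory'
        metadata: {
          type: 'Utilization'
          value: '75'
        }
      }
    }
  ]
}
```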

Next steps

  • Owners: API team to debug opId d276a285f14c562542834892ec79fd76 in App Insights, capture stack trace, and patch GET /api/listings/{id:int}.
  • SRE to verify memory and 5xx trend post-patch and adjust memory/scale targets as needed.

This issue was created by sre-agent-demo--c3c0627e
Tracked by the SRE agent here
