Description
Incident context
- Incident: INC0010026 (ServiceNow sys_id: pending – API access blocked, will update)
- Severity: Sev3
- Alert: High Memory Usage - Octopets API (resolved)
- Affected resource: /subscriptions/06dbbc7b-2363-4dd4-9803-95d07f1a8d3e/resourceGroups/rg-octopets-demo-lab/providers/Microsoft.App/containerApps/octopetsapi
- Time window (UTC): 2026-03-05T14:40:33Z → 2026-03-05T16:40:33Z
- Alert fired: 2026-03-05T16:31:28Z, resolved: 2026-03-05T16:38:18Z
Impact summary
- Spike of HTTP 5xx centered around 16:20–16:36Z on octopetsapi.
- Endpoint most affected: GET /api/listings/{id:int} returning 500s; sample operation_Id d276a285f14c562542834892ec79fd76 at 2026-03-05T16:34:02Z.
- Working set memory averaged ~0.88–0.89 GB during the event with MemoryPercentage ~76–79%, nearing the 80% alert threshold.
- Response times elevated: 5xx ~1.05–1.37s; 2xx ~0.52–0.59s.
- No container restarts recorded.
Evidence
Application Insights KQL and summary
- Failed request counts (last 2h)
KQL:
requests
| where timestamp > ago(2h)
| where success == false
| summarize failed=count() by name, resultCode
| top 10 by failed desc
Summary:
- GET /api/listings/{id:int} → 500: 182
- GET / → 404: 4
- GET /api/listings/ → 499: 1
- Sample failing request
KQL:
requests
| where timestamp > ago(2h)
| where success == false
| project timestamp, name, resultCode, operation_Id, url
| top 1 by timestamp desc
Sample:
- 2026-03-05T16:34:02Z, GET /api/listings/{id:int}, 500, opId d276a285f14c562542834892ec79fd76, url http://octopetsapi.../api/listings/3
- Traces summary
KQL:
traces
| where timestamp > ago(2h)
| summarize errors=countif(severityLevel >= 3), warnings=countif(severityLevel == 2)
Summary: errors=0, warnings=7
Note: the exceptions aggregation query errored; retrieve exception telemetry around the operation_Id above to obtain stack traces.
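As a starting point for that follow-up, a sketch of the query (assuming the standard Application Insights `exceptions` table; the operation_Id is the sample captured above):

```kusto
exceptions
| where timestamp > ago(2h)
| where operation_Id == "d276a285f14c562542834892ec79fd76"
| project timestamp, problemId, type, outerMessage, innermostMessage
| order by timestamp desc
```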
Azure Metrics (octopetsapi) – last 2h
- WorkingSetBytes: sustained ~670MB–891MB; peaks ~891MB between 16:20–16:40Z.
- MemoryPercentage: ~76–79% across 16:20–16:40Z (alert threshold set at ~80%).
- Requests by statusCodeCategory: sustained 5xx avg ~3–7/min from 16:20–16:35Z; 2xx also present; 4xx occasional.
- ResponseTime: 5xx ~1.05–1.37s, 2xx ~0.52–0.59s during event window.
- RestartCount: no increases observed.
Suspected root cause (ranked)
- Endpoint-specific defect in GET /api/listings/{id:int} causing 500s under load and elevated memory (likely unbounded in-memory object creation, inefficient serialization, or unhandled exception) – supported by high 5xx counts for this operation and near-threshold memory.
- Data access inefficiency (e.g., materializing large result sets or missing caching) increasing allocations and latency – supported by elevated response times and memory during requests.
- Logging/telemetry overhead or large payloads inflating memory per request – fewer warnings seen, but possible given memory pattern.
Proposed fixes
Code-level
- Add robust error handling and input validation for GET /api/listings/{id:int}; return 404/400 instead of 500 on not-found/parse errors.
- Implement response size controls: project to DTOs, avoid loading navigation graphs; enable pagination for listings.
- Introduce caching for individual listing lookups with short TTL to reduce repeated heavy queries.
- Add cancellation/timeouts and defensive guards to external calls; ensure HttpClient instances are pooled (e.g., via IHttpClientFactory).
- Add an integration/load test that issues repeated GET /api/listings/{id} requests and asserts that memory stays below target and no 500s occur.
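One possible shape for the first two code-level fixes, as a minimal sketch: `ListingDto` and the `_listings.FindByIdAsync` repository call are assumptions for illustration, not the actual Octopets code.

```csharp
// Hardened GET /api/listings/{id:int} (illustrative names throughout).
[HttpGet("{id:int}")]
public async Task<IActionResult> GetListing(int id, CancellationToken ct)
{
    // Validate input up front: bad ids become 400, not 500.
    if (id <= 0)
        return BadRequest(new { error = "id must be a positive integer" });

    // Hypothetical repository call that projects straight to a slim DTO,
    // avoiding materializing the full entity graph (and its allocations).
    ListingDto? listing = await _listings.FindByIdAsync(id, ct);

    // Missing rows become 404 instead of an unhandled exception → 500.
    return listing is null
        ? NotFound(new { error = $"listing {id} not found" })
        : Ok(listing);
}
```

With not-found and parse failures handled explicitly, any remaining 500s would point at a genuine defect worth a stack trace rather than expected control flow.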
IaC/config
- Review and, if appropriate, raise container memory limit or requests slightly (e.g., from 1.0Gi to 1.25Gi) to provide headroom while code is fixed.
- Enable autoscaling based on concurrent requests and memory percentage (KEDA) with conservative max replicas.
- Add Azure Monitor alerts for 5xx rate and p95 latency for octopetsapi; keep the existing memory alert.
- Turn on sampling for verbose traces if currently high; keep exception telemetry.
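The memory and scaling changes above could look roughly like this in Bicep. A sketch only: property names follow the Microsoft.App containerApps template schema, but the image reference and thresholds are assumptions, not values validated against the live app.

```bicep
// Fragment of the containerApps 'template' block (illustrative values)
template: {
  containers: [
    {
      name: 'octopetsapi'
      image: octopetsImage // assumed parameter
      resources: {
        cpu: json('1.0')
        memory: '1.25Gi' // headroom over the observed ~0.89 GB working set
      }
    }
  ]
  scale: {
    minReplicas: 1
    maxReplicas: 2 // conservative cap while the code fix lands
    rules: [
      {
        name: 'memory-75'
        custom: {
          type: 'memory'
          metadata: {
            type: 'Utilization'
            value: '75' // scale out before the 80% alert threshold
          }
        }
      }
      {
        name: 'http-concurrency'
        http: {
          metadata: {
            concurrentRequests: '30'
          }
        }
      }
    ]
  }
}
```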
Example diff snippets (illustrative)
- .NET controller (ListingsController.cs):
  // current: try { /* fetch */ } catch (Exception ex) { _logger.LogError(ex, "Failed"); return StatusCode(500, new { error = "Internal" }); }
  // replace with specific catches returning 404/400; ensure DTO projections.
- Bicep (external/octopets/apphost/infra/main.bicep):
  // Container resources
  // resources: memory=1.25Gi, cpu=1.0
  // autoscale: memoryPercentage > 75 scale out to 2
Next steps
- Owners: API team to debug opId d276a285f14c562542834892ec79fd76 in App Insights, capture stack trace, and patch GET /api/listings/{id:int}.
- SRE to verify memory and 5xx trend post-patch and adjust memory/scale targets as needed.
This issue was created by sre-agent-demo--c3c0627e
Tracked by the SRE agent here