Incident context
- Source: ServiceNow INC0010027 (sys_id: TBD – please update once available)
- Alert: High Response Time - Octopets API (metric alert, Sev2)
- Fired: 2026-03-05T16:22:52Z, Resolved: 2026-03-05T16:40:48Z
- Target resource: /subscriptions/06dbbc7b-2363-4dd4-9803-95d07f1a8d3e/resourceGroups/rg-octopets-demo-lab/providers/Microsoft.App/containerApps/octopetsapi
Investigation window (UTC)
- Start: 2026-03-05 14:43:50Z
- End: 2026-03-05 16:43:50Z
Evidence – Application Insights (no recent telemetry)
- Observation: No requests/exceptions returned for the last 2 hours. This suggests telemetry for the API is missing or disabled, or is routed to a different Application Insights instance.
- KQL executed:
- Summary

```kusto
let start = datetime(2026-03-05 14:43:50Z);
let end = datetime(2026-03-05 16:43:50Z);
requests
| where timestamp between (start .. end)
| summarize total=count(), failed=countif(success == false), avgDurationMs=avg(duration), p95DurationMs=percentile(duration, 95), p99DurationMs=percentile(duration, 99)
```

- Top failing operations

```kusto
let start = datetime(2026-03-05 14:43:50Z);
let end = datetime(2026-03-05 16:43:50Z);
requests
| where timestamp between (start .. end) and success == false
| summarize failures=count(), avgDurationMs=avg(duration), p95DurationMs=percentile(duration, 95) by operation_Name, resultCode
| top 10 by failures desc
```

- Exceptions

```kusto
let start = datetime(2026-03-05 14:43:50Z);
let end = datetime(2026-03-05 16:43:50Z);
exceptions
| where timestamp between (start .. end)
| summarize total=count() by type
| top 10 by total desc
```

- One sample failing request

```kusto
let start = datetime(2026-03-05 14:43:50Z);
let end = datetime(2026-03-05 16:43:50Z);
requests
| where timestamp between (start .. end) and success == false
| project timestamp, operation_Name, resultCode, duration, operation_Id, itemId
| top 1 by timestamp desc
```
Evidence – Azure Metrics (Container Apps: octopetsapi)
- ResponseTime (avg): spike ~786–859 ms from 16:20 to ~16:35, then normalized.
- Requests by status code category:
- 2xx: low but present during spike window; 5xx: small non-zero (peaked ~7.5 avg at 16:21) during spike; 4xx: occasional 0–1.
- CpuPercentage: sustained ~74–79% during spike window.
- MemoryPercentage: near 0% reported (likely an artifact; working set limits not approached).
- Resiliency metrics:
- ResiliencyRequestsPendingConnectionPool: increased sharply at 16:20 and sustained ~74–79 during spike window.
- ResiliencyRequestTimeouts/ResiliencyConnectTimeouts: brief non-zero during 16:20–16:25.
- Replicas: ~1 during spike; returned to 0/1 in late window.
- RestartCount: no notable increases.
Suspected root cause (ranked)
- Connection pool contention at the API’s outbound HTTP/DB layer causing queueing and longer response times. Evidence: ResiliencyRequestsPendingConnectionPool sustained ~74–79 during the same interval as ResponseTime spike; minor request/connection timeouts observed; CPU elevated to ~79% but not fully saturated.
- Transient downstream dependency latency (e.g., DB or external service) leading to slower responses and retries. Evidence: request/connection timeouts briefly non-zero; 5xx low but present during spike.
- Missing APM instrumentation/ingestion preventing richer diagnosis; lack of App Insights telemetry hindered endpoint/exception attribution.
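The top-ranked hypothesis (pool exhaustion causing queueing) can be illustrated with a toy model. This is a sketch only: the pool size, caller count, and service time below are invented for the demo, not measured from octopetsapi, and the "connection pool" is modeled as a plain Semaphore.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

// Toy model of the suspected failure mode: a fixed-size outbound connection
// pool (modeled as a Semaphore) with more concurrent callers than permits.
// Callers that cannot acquire a connection wait in line, so their observed
// latency is queueing time plus service time -- the same pattern as the
// sustained ResiliencyRequestsPendingConnectionPool metric during the spike.
public class PoolQueueingDemo {

    // Returns the worst observed end-to-end latency (ms) across all callers.
    static long worstLatencyMs(int poolSize, int callers, long serviceMs)
            throws InterruptedException {
        Semaphore pool = new Semaphore(poolSize);
        long[] latencies = new long[callers];
        ExecutorService ex = Executors.newFixedThreadPool(callers);
        for (int i = 0; i < callers; i++) {
            final int id = i;
            ex.submit(() -> {
                long t0 = System.nanoTime();
                try {
                    pool.acquire();              // queue here if pool is empty
                    Thread.sleep(serviceMs);     // hold the "connection"
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                } finally {
                    pool.release();
                    latencies[id] = (System.nanoTime() - t0) / 1_000_000;
                }
            });
        }
        ex.shutdown();
        ex.awaitTermination(30, TimeUnit.SECONDS);
        long worst = 0;
        for (long l : latencies) worst = Math.max(worst, l);
        return worst;
    }

    public static void main(String[] args) throws InterruptedException {
        // 8 callers sharing 2 connections at 100 ms per call: the last pair
        // waits through 3 earlier batches, so worst-case latency approaches
        // ~400 ms even though the downstream call itself takes only 100 ms.
        System.out.println("worst-case latency ms = "
                + worstLatencyMs(2, 8, 100));
    }
}
```

This is why the fixes below attack the problem from both ends: larger, well-tuned pools and earlier scale-out reduce queueing, while timeouts cap how long a queued request can hold a caller.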
Proposed fixes
Code-level
- If .NET: use a single shared HttpClient with SocketsHttpHandler and set:
- MaxConnectionsPerServer: 256 (start with 128 if low traffic; tune via load tests)
- PooledConnectionIdleTimeout: 2m, PooledConnectionLifetime: 5–10m
- Request timeout: 10–15s and per-try timeout 5s with Polly retry (2–3 attempts, jittered backoff) on transient errors only.
- If Java (Spring): tune HttpClient/RestTemplate or WebClient connection pool, and HikariCP for DB:
- maxPoolSize: 30–50; connectionTimeout: 5s; minimumIdle: 5–10; validation timeout short.
- Add explicit timeouts for all outbound calls and short-circuit with a circuit breaker around slow dependencies.
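As a concrete rendering of the timeout-and-retry bullets, here is a plain-Java sketch (JDK 11+ `java.net.http`, no Spring). The values mirror the suggestions above and should be tuned via load tests; in production a library policy (Polly on .NET, Resilience4j on Java) would replace the hand-rolled `withRetry` helper, which is hypothetical.

```java
import java.net.http.HttpClient;
import java.time.Duration;
import java.util.Random;
import java.util.concurrent.Callable;

// One shared client per process with an explicit 5 s connect timeout.
// Per-request timeouts are set on each request, e.g.:
//   HttpRequest.newBuilder(uri).timeout(Duration.ofSeconds(10)).build()
public class OutboundCallPolicy {
    static final HttpClient CLIENT = HttpClient.newBuilder()
            .connectTimeout(Duration.ofSeconds(5))
            .build();

    // Retry with jittered exponential backoff. NOTE: a production policy
    // should retry only transient errors (timeouts, 5xx), not every
    // exception; this sketch retries everything for brevity.
    static <T> T withRetry(Callable<T> call, int attempts, long baseDelayMs)
            throws Exception {
        Random rnd = new Random();
        for (int i = 0; ; i++) {
            try {
                return call.call();
            } catch (Exception e) {
                if (i + 1 >= attempts) throw e;           // retries exhausted
                long backoff = (baseDelayMs << i)          // base * 2^i
                        + rnd.nextInt((int) baseDelayMs);  // plus jitter
                Thread.sleep(backoff);
            }
        }
    }
}
```

A caller would wrap each outbound request, e.g. `withRetry(() -> CLIENT.send(request, HttpResponse.BodyHandlers.ofString()), 3, 100)`, keeping total attempts low so retries cannot amplify load during an incident.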
IaC/config
- Container Apps:
- Scale-out policy: consider adding KEDA-based scaling on ResiliencyRequestsPendingConnectionPool and/or ResponseTime to add replicas when queueing grows.
- Ensure resource limits/requests match workload; CPU near 80% during spike suggests bumping CPU request/limit or enabling horizontal scale earlier.
- Observability:
- Wire Application Insights (connection string/instrumentation key) for octopetsapi so requests/exceptions are captured.
- Add Azure Monitor alerts for ResiliencyRequestsPendingConnectionPool and ResiliencyRequestTimeouts in addition to ResponseTime.
Action items
- Instrument octopetsapi with App Insights and verify telemetry flow.
- Implement connection pooling and timeout improvements as above.
- Add/adjust autoscale or KEDA triggers for queueing/latency.
- Create a load test to validate p95 latency under spike traffic.
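For the load-test action item, the p95 acceptance check itself is small; the sketch below shows only that check. The target call is a placeholder, and a real load test (k6, JMeter, Azure Load Testing) would add concurrency and ramp-up around it.

```java
import java.util.Arrays;

// Minimal p95 harness: run the call n times, sort the latencies, read the
// 95th percentile by the nearest-rank method. Fail the run if p95 exceeds
// the agreed SLO for the endpoint.
public class P95Check {
    static long p95Ms(Runnable call, int n) {
        long[] ms = new long[n];
        for (int i = 0; i < n; i++) {
            long t0 = System.nanoTime();
            call.run();                                 // the API call under test
            ms[i] = (System.nanoTime() - t0) / 1_000_000;
        }
        Arrays.sort(ms);
        return ms[(int) Math.ceil(0.95 * n) - 1];       // nearest-rank p95
    }

    public static void main(String[] args) {
        long p95 = p95Ms(() -> { /* placeholder: call octopetsapi here */ }, 100);
        System.out.println("p95 ms = " + p95);
    }
}
```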
Notes on IaC repo
- Repo detected: gderossilive/AzSreAgentLab. Infra appears to deploy via Bicep (see scripts/30-deploy-octopets.sh and external/octopets/apphost/infra/main.bicep). Consider adding app settings for AI connection string and autoscaling policies here.
Please triage by wiring up telemetry first, then tune connection pools and timeouts. Update this issue with the ServiceNow sys_id and link once available.
This issue was created by sre-agent-demo--c3c0627e