
Sev3: High response time spike on Octopets API (2026-03-05 16:20–16:35 UTC) #85

Description

Incident context

  • Source: ServiceNow INC0010027 (sys_id: TBD – please update once available)
  • Alert: High Response Time - Octopets API (metric alert, Sev2)
  • Fired: 2026-03-05T16:22:52Z, Resolved: 2026-03-05T16:40:48Z
  • Target resource: /subscriptions/06dbbc7b-2363-4dd4-9803-95d07f1a8d3e/resourceGroups/rg-octopets-demo-lab/providers/Microsoft.App/containerApps/octopetsapi

Investigation window (UTC)

  • Start: 2026-03-05 14:43:50Z
  • End: 2026-03-05 16:43:50Z

Evidence – Application Insights (no recent telemetry)

  • Observation: No requests or exceptions were returned for the last 2 hours, which suggests Application Insights telemetry is missing/disabled for the API or routed to a different Application Insights instance.
  • KQL executed:
  1. Summary
let start=datetime(2026-03-05 14:43:50Z); let end=datetime(2026-03-05 16:43:50Z);
requests
| where timestamp between (start .. end)
| summarize total=count(), failed=countif(success == false), avgDurationMs=avg(duration), p95DurationMs=percentile(duration, 95), p99DurationMs=percentile(duration, 99)
  2. Top failing operations
let start=datetime(2026-03-05 14:43:50Z); let end=datetime(2026-03-05 16:43:50Z);
requests
| where timestamp between (start .. end) and success == false
| summarize failures=count(), avgDurationMs=avg(duration), p95DurationMs=percentile(duration,95) by operation_Name, resultCode
| top 10 by failures desc
  3. Exceptions
let start=datetime(2026-03-05 14:43:50Z); let end=datetime(2026-03-05 16:43:50Z);
exceptions
| where timestamp between (start .. end)
| summarize total=count() by type
| top 10 by total desc
  4. One sample failing request
let start=datetime(2026-03-05 14:43:50Z); let end=datetime(2026-03-05 16:43:50Z);
requests
| where timestamp between (start .. end) and success == false
| project timestamp, operation_Name, resultCode, duration, operation_Id, itemId
| top 1 by timestamp desc

Evidence – Azure Metrics (Container Apps: octopetsapi)

  • ResponseTime (avg): spike ~786–859 ms from 16:20 to ~16:35, then normalized.
  • Requests by status code category:
    • 2xx: low but present during the spike window.
    • 5xx: small but non-zero during the spike (peaked ~7.5 avg at 16:21).
    • 4xx: occasional 0–1.
  • CpuPercentage: sustained ~74–79% during spike window.
  • MemoryPercentage: reported near 0% (likely a reporting artifact; working-set limits were not approached).
  • Resiliency metrics:
    • ResiliencyRequestsPendingConnectionPool: increased sharply at 16:20 and sustained at ~74–79 pending requests during the spike window.
    • ResiliencyRequestTimeouts/ResiliencyConnectTimeouts: briefly non-zero during 16:20–16:25.
  • Replicas: ~1 during the spike; replica count returned to 0–1 late in the window.
  • RestartCount: no notable increases.

Suspected root cause (ranked)

  1. Connection pool contention at the API’s outbound HTTP/DB layer causing queueing and longer response times. Evidence: ResiliencyRequestsPendingConnectionPool sustained ~74–79 during the same interval as ResponseTime spike; minor request/connection timeouts observed; CPU elevated to ~79% but not fully saturated.
  2. Transient downstream dependency latency (e.g., DB or external service) leading to slower responses and retries. Evidence: request/connection timeouts briefly non-zero; 5xx low but present during spike.
  3. Missing APM instrumentation/ingestion prevented richer diagnosis; the lack of Application Insights telemetry hindered endpoint/exception attribution.

Proposed fixes

Code-level

  • If .NET: use a single shared HttpClient with SocketsHttpHandler (see the C# sketch after this list) and set:
    • MaxConnectionsPerServer: 256 (start with 128 if traffic is low; tune via load tests)
    • PooledConnectionIdleTimeout: 2m; PooledConnectionLifetime: 5–10m
    • Request timeout: 10–15s and per-try timeout 5s, with a Polly retry (2–3 attempts, jittered backoff) on transient errors only.
  • If Java (Spring): tune the HttpClient/RestTemplate or WebClient connection pool, and HikariCP for the DB:
    • maxPoolSize: 30–50; connectionTimeout: 5s; minimumIdle: 5–10; keep the validation timeout short.
  • Add explicit timeouts for all outbound calls and short-circuit with a circuit breaker around slow dependencies.
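
A minimal sketch of the .NET recommendation above, assuming .NET 8 with the Polly and Polly.Extensions.Http packages; the downstream URL is a placeholder, not the actual dependency:

```csharp
using System;
using System.Net.Http;
using System.Threading;
using Polly;
using Polly.Extensions.Http;

// One shared handler/client for the whole process; connections are pooled here.
var handler = new SocketsHttpHandler
{
    MaxConnectionsPerServer     = 256,                      // start at 128 for low traffic; tune via load tests
    PooledConnectionIdleTimeout = TimeSpan.FromMinutes(2),
    PooledConnectionLifetime    = TimeSpan.FromMinutes(10), // recycling also picks up DNS changes
};
var client = new HttpClient(handler) { Timeout = TimeSpan.FromSeconds(15) }; // overall deadline

// Per-try timeout (5s) wrapped by a jittered exponential-backoff retry on transient errors only.
var perTryTimeout = Policy.TimeoutAsync<HttpResponseMessage>(TimeSpan.FromSeconds(5));
var jitter = new Random();
var retry = HttpPolicyExtensions
    .HandleTransientHttpError()                             // HttpRequestException, 5xx, 408
    .WaitAndRetryAsync(2, attempt =>
        TimeSpan.FromMilliseconds(200 * Math.Pow(2, attempt) + jitter.Next(0, 100)));
var policy = Policy.WrapAsync(retry, perTryTimeout);

// Hypothetical downstream call showing how each outbound request goes through the policy.
var response = await policy.ExecuteAsync(
    ct => client.GetAsync("https://downstream.example.com/health", ct),
    CancellationToken.None);
Console.WriteLine((int)response.StatusCode);
```

With this shape, the 15s HttpClient.Timeout is the outer deadline while Polly enforces the 5s per-try budget, so a slow dependency fails fast instead of holding a pooled connection while the queue grows.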

IaC/config

  • Container Apps:
    • Scale-out policy: consider adding KEDA-based scaling on ResiliencyRequestsPendingConnectionPool and/or ResponseTime to add replicas when queueing grows.
    • Ensure resource limits/requests match workload; CPU near 80% during spike suggests bumping CPU request/limit or enabling horizontal scale earlier.
  • Observability:
    • Wire Application Insights (connection string/instrumentation key) for octopetsapi so requests/exceptions are captured (a sketch follows this list).
    • Add Azure Monitor alerts for ResiliencyRequestsPendingConnectionPool and ResiliencyRequestTimeouts in addition to ResponseTime.
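
A minimal sketch of the instrumentation wiring, assuming octopetsapi is ASP.NET Core with the Microsoft.ApplicationInsights.AspNetCore package; the /healthz endpoint is hypothetical:

```csharp
using Microsoft.AspNetCore.Builder;
using Microsoft.AspNetCore.Http;
using Microsoft.Extensions.DependencyInjection;

var builder = WebApplication.CreateBuilder(args);

// With no arguments this reads the connection string from configuration
// (e.g. an APPLICATIONINSIGHTS_CONNECTION_STRING app setting on the Container App)
// and enables request, dependency, and exception telemetry automatically.
builder.Services.AddApplicationInsightsTelemetry();

var app = builder.Build();
app.MapGet("/healthz", () => Results.Ok()); // hypothetical endpoint to verify telemetry flow
app.Run();
```

Once telemetry flows, the KQL above should start returning rows, and the queueing hypothesis can be cross-checked against dependency durations.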

Action items

  • Instrument octopetsapi with App Insights and verify telemetry flow.
  • Implement connection pooling and timeout improvements as above.
  • Add/adjust autoscale or KEDA triggers for queueing/latency.
  • Create a load test to validate p95 latency under spike-like traffic (a rough sketch follows).
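
A rough p95 smoke-test sketch in C#; the target URL, request volume, and concurrency below are placeholders, and a dedicated tool (e.g. Azure Load Testing or k6) is preferable for the real validation:

```csharp
using System;
using System.Diagnostics;
using System.Linq;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

var client = new HttpClient();
var url = "https://octopetsapi.example.com/api/pets"; // placeholder target, not the real endpoint
const int totalRequests = 500;                        // placeholder volume
var gate = new SemaphoreSlim(50);                     // cap concurrent in-flight requests at 50

var tasks = Enumerable.Range(0, totalRequests).Select(async _ =>
{
    await gate.WaitAsync();
    var sw = Stopwatch.StartNew();
    try { await client.GetAsync(url); }
    catch (HttpRequestException) { /* count failures as latency samples too */ }
    finally { gate.Release(); }
    return sw.Elapsed.TotalMilliseconds;
});

// Sort the measured durations and read off the 95th percentile.
var durations = (await Task.WhenAll(tasks)).OrderBy(d => d).ToArray();
Console.WriteLine($"p95: {durations[(int)(durations.Length * 0.95)]:F0} ms");
```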

Notes on IaC repo

  • Repo detected: gderossilive/AzSreAgentLab. Infra appears to deploy via Bicep (see scripts/30-deploy-octopets.sh and external/octopets/apphost/infra/main.bicep). Consider adding app settings for the Application Insights connection string and the autoscaling policies there.

Please triage by adding telemetry first, then tune connection pooling and timeouts. Update this issue with the ServiceNow sys_id and link once available.

This issue was created by sre-agent-demo--c3c0627e
