
Sev3: High response time spike on Octopets API (2026-03-05 16:20–16:35 UTC) #85

Description

Incident context

  • Source: ServiceNow INC0010027 (sys_id: TBD – please update once available)
  • Alert: High Response Time - Octopets API (metric alert, Sev2)
  • Fired: 2026-03-05T16:22:52Z, Resolved: 2026-03-05T16:40:48Z
  • Target resource: /subscriptions/06dbbc7b-2363-4dd4-9803-95d07f1a8d3e/resourceGroups/rg-octopets-demo-lab/providers/Microsoft.App/containerApps/octopetsapi

Investigation window (UTC)

  • Start: 2026-03-05 14:43:50Z
  • End: 2026-03-05 16:43:50Z

Evidence – Application Insights (no recent telemetry)

  • Observation: No requests or exceptions were returned for the last 2 hours, which suggests Application Insights telemetry is missing/disabled for the API or routed to a different Application Insights instance.
  • KQL executed:
  1. Summary
let start=datetime(2026-03-05 14:43:50Z); let end=datetime(2026-03-05 16:43:50Z);
requests
| where timestamp between (start .. end)
| summarize total=count(), failed=countif(success == false), avgDurationMs=avg(duration), p95DurationMs=percentile(duration, 95), p99DurationMs=percentile(duration, 99)
  2. Top failing operations
let start=datetime(2026-03-05 14:43:50Z); let end=datetime(2026-03-05 16:43:50Z);
requests
| where timestamp between (start .. end) and success == false
| summarize failures=count(), avgDurationMs=avg(duration), p95DurationMs=percentile(duration,95) by operation_Name, resultCode
| top 10 by failures desc
  3. Exceptions
let start=datetime(2026-03-05 14:43:50Z); let end=datetime(2026-03-05 16:43:50Z);
exceptions
| where timestamp between (start .. end)
| summarize total=count() by type
| top 10 by total desc
  4. One sample failing request
let start=datetime(2026-03-05 14:43:50Z); let end=datetime(2026-03-05 16:43:50Z);
requests
| where timestamp between (start .. end) and success == false
| project timestamp, operation_Name, resultCode, duration, operation_Id, itemId
| top 1 by timestamp desc

Evidence – Azure Metrics (Container Apps: octopetsapi)

  • ResponseTime (avg): spike ~786–859 ms from 16:20 to ~16:35, then normalized.
  • Requests by status code category:
    • 2xx: low but present during the spike window.
    • 5xx: small but non-zero during the spike (peaked ~7.5 avg at 16:21).
    • 4xx: occasional 0–1.
  • CpuPercentage: sustained ~74–79% during spike window.
  • MemoryPercentage: reported near 0% (likely a reporting artifact; working-set limits were not approached).
  • Resiliency metrics:
    • ResiliencyRequestsPendingConnectionPool: increased sharply at 16:20 and sustained at ~74–79 pending requests during the spike window.
    • ResiliencyRequestTimeouts/ResiliencyConnectTimeouts: briefly non-zero during 16:20–16:25.
  • Replicas: ~1 during the spike; replica count returned to 0–1 late in the window.
  • RestartCount: no notable increases.

Suspected root cause (ranked)

  1. Connection pool contention at the API’s outbound HTTP/DB layer causing queueing and longer response times. Evidence: ResiliencyRequestsPendingConnectionPool sustained ~74–79 during the same interval as ResponseTime spike; minor request/connection timeouts observed; CPU elevated to ~79% but not fully saturated.
  2. Transient downstream dependency latency (e.g., DB or external service) leading to slower responses and retries. Evidence: request/connection timeouts briefly non-zero; 5xx low but present during spike.
  3. Missing APM instrumentation/ingestion prevented richer diagnosis; the lack of Application Insights telemetry hindered endpoint/exception attribution.

Proposed fixes

Code-level

  • If .NET: use a single shared HttpClient with SocketsHttpHandler (see the C# sketch after this list) and set:
    • MaxConnectionsPerServer: 256 (start with 128 if traffic is low; tune via load tests)
    • PooledConnectionIdleTimeout: 2m; PooledConnectionLifetime: 5–10m
    • Request timeout: 10–15s and per-try timeout 5s, with a Polly retry (2–3 attempts, jittered backoff) on transient errors only.
  • If Java (Spring): tune the HttpClient/RestTemplate or WebClient connection pool, and HikariCP for the DB:
    • maxPoolSize: 30–50; connectionTimeout: 5s; minimumIdle: 5–10; keep the validation timeout short.
  • Add explicit timeouts for all outbound calls and short-circuit with a circuit breaker around slow dependencies.
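
A minimal sketch of the .NET recommendation above, assuming .NET 8 with the Polly and Polly.Extensions.Http packages; the downstream URL is a placeholder, not the actual dependency:

```csharp
using System;
using System.Net.Http;
using System.Threading;
using Polly;
using Polly.Extensions.Http;

// One shared handler/client for the whole process; connections are pooled here.
var handler = new SocketsHttpHandler
{
    MaxConnectionsPerServer     = 256,                      // start at 128 for low traffic; tune via load tests
    PooledConnectionIdleTimeout = TimeSpan.FromMinutes(2),
    PooledConnectionLifetime    = TimeSpan.FromMinutes(10), // recycling also picks up DNS changes
};
var client = new HttpClient(handler) { Timeout = TimeSpan.FromSeconds(15) }; // overall deadline

// Per-try timeout (5s) wrapped by a jittered exponential-backoff retry on transient errors only.
var perTryTimeout = Policy.TimeoutAsync<HttpResponseMessage>(TimeSpan.FromSeconds(5));
var jitter = new Random();
var retry = HttpPolicyExtensions
    .HandleTransientHttpError()                             // HttpRequestException, 5xx, 408
    .WaitAndRetryAsync(2, attempt =>
        TimeSpan.FromMilliseconds(200 * Math.Pow(2, attempt) + jitter.Next(0, 100)));
var policy = Policy.WrapAsync(retry, perTryTimeout);

// Hypothetical downstream call showing how each outbound request goes through the policy.
var response = await policy.ExecuteAsync(
    ct => client.GetAsync("https://downstream.example.com/health", ct),
    CancellationToken.None);
Console.WriteLine((int)response.StatusCode);
```

With this shape, the 15s HttpClient.Timeout is the outer deadline while Polly enforces the 5s per-try budget, so a slow dependency fails fast instead of holding a pooled connection while the queue grows.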

IaC/config

  • Container Apps:
    • Scale-out policy: consider adding KEDA-based scaling on ResiliencyRequestsPendingConnectionPool and/or ResponseTime to add replicas when queueing grows.
    • Ensure resource limits/requests match workload; CPU near 80% during spike suggests bumping CPU request/limit or enabling horizontal scale earlier.
  • Observability:
    • Wire Application Insights (connection string/instrumentation key) for octopetsapi so requests/exceptions are captured (a sketch follows this list).
    • Add Azure Monitor alerts for ResiliencyRequestsPendingConnectionPool and ResiliencyRequestTimeouts in addition to ResponseTime.
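
A minimal sketch of the instrumentation wiring, assuming octopetsapi is ASP.NET Core with the Microsoft.ApplicationInsights.AspNetCore package; the /healthz endpoint is hypothetical:

```csharp
using Microsoft.AspNetCore.Builder;
using Microsoft.AspNetCore.Http;
using Microsoft.Extensions.DependencyInjection;

var builder = WebApplication.CreateBuilder(args);

// With no arguments this reads the connection string from configuration
// (e.g. an APPLICATIONINSIGHTS_CONNECTION_STRING app setting on the Container App)
// and enables request, dependency, and exception telemetry automatically.
builder.Services.AddApplicationInsightsTelemetry();

var app = builder.Build();
app.MapGet("/healthz", () => Results.Ok()); // hypothetical endpoint to verify telemetry flow
app.Run();
```

Once telemetry flows, the KQL above should start returning rows, and the queueing hypothesis can be cross-checked against dependency durations.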

Action items

  • Instrument octopetsapi with App Insights and verify telemetry flow.
  • Implement connection pooling and timeout improvements as above.
  • Add/adjust autoscale or KEDA triggers for queueing/latency.
  • Create a load test to validate p95 latency under spike-like traffic (a rough sketch follows).
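
A rough p95 smoke-test sketch in C#; the target URL, request volume, and concurrency below are placeholders, and a dedicated tool (e.g. Azure Load Testing or k6) is preferable for the real validation:

```csharp
using System;
using System.Diagnostics;
using System.Linq;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

var client = new HttpClient();
var url = "https://octopetsapi.example.com/api/pets"; // placeholder target, not the real endpoint
const int totalRequests = 500;                        // placeholder volume
var gate = new SemaphoreSlim(50);                     // cap concurrent in-flight requests at 50

var tasks = Enumerable.Range(0, totalRequests).Select(async _ =>
{
    await gate.WaitAsync();
    var sw = Stopwatch.StartNew();
    try { await client.GetAsync(url); }
    catch (HttpRequestException) { /* count failures as latency samples too */ }
    finally { gate.Release(); }
    return sw.Elapsed.TotalMilliseconds;
});

// Sort the measured durations and read off the 95th percentile.
var durations = (await Task.WhenAll(tasks)).OrderBy(d => d).ToArray();
Console.WriteLine($"p95: {durations[(int)(durations.Length * 0.95)]:F0} ms");
```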

Notes on IaC repo

  • Repo detected: gderossilive/AzSreAgentLab. Infra appears to deploy via Bicep (see scripts/30-deploy-octopets.sh and external/octopets/apphost/infra/main.bicep). Consider adding app settings for the Application Insights connection string and the autoscaling policies there.

Please triage by adding telemetry first, then tune connection pooling and timeouts. Update this issue with the ServiceNow sys_id and link once available.

This issue was created by sre-agent-demo--c3c0627e
