
[WIP] Fix 5xx response spike in Octopets API #35

Closed
Copilot wants to merge 1 commit into main from copilot/investigate-octopets-api-5xx

Conversation


Copilot AI commented Jan 28, 2026

Fix Plan for Octopets API 5xx Spike (INC0010065)

Investigation Summary

  • Explored repository structure and external/octopets code
  • Analyzed current telemetry configuration (OpenTelemetry + App Insights SDK)
  • Reviewed backend endpoints and error handling
  • Examined deployment scripts and infrastructure setup
  • Identified ServiceDefaults with built-in resilience patterns

Code Changes Needed

  • Add global exception handler to catch and log unhandled exceptions with correlation IDs
  • Enhance error handling in endpoints with structured logging
  • Add explicit exception handling for repository operations
  • Verify and fix Application Insights connection string configuration
  • Add CPU stress endpoint implementation (currently referenced but not implemented)
  • Configure proper sampling (OTEL_TRACES_SAMPLER already set to always_on in deployment script)
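The first two items (a global exception handler that logs unhandled exceptions with correlation IDs, plus structured logging) could be sketched roughly as below for an ASP.NET Core minimal API. This is an illustration only: the logger category, response shape, and wiring into the actual Octopets `Program.cs` are assumptions, not existing code.

```csharp
// Minimal sketch: global exception handler with correlation-ID logging.
using Microsoft.AspNetCore.Builder;
using Microsoft.AspNetCore.Diagnostics;
using Microsoft.AspNetCore.Http;
using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.Logging;

var builder = WebApplication.CreateBuilder(args);
var app = builder.Build();

app.UseExceptionHandler(errorApp =>
{
    errorApp.Run(async context =>
    {
        var feature = context.Features.Get<IExceptionHandlerFeature>();
        // TraceIdentifier ties this log line back to the failed request.
        var correlationId = context.TraceIdentifier;

        var logger = context.RequestServices
            .GetRequiredService<ILoggerFactory>()
            .CreateLogger("GlobalExceptionHandler");
        logger.LogError(feature?.Error,
            "Unhandled exception, correlationId={CorrelationId}", correlationId);

        context.Response.StatusCode = StatusCodes.Status500InternalServerError;
        await context.Response.WriteAsJsonAsync(new
        {
            error = "Internal server error",
            correlationId
        });
    });
});

app.Run();
```

With the App Insights SDK configured, exceptions logged through `ILogger` should then surface in the `exceptions`/`traces` tables that the incident queries found empty.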

IaC/Configuration Changes

  • Create new Bicep template for enhanced metric alerts (5xx rate, response time, resiliency metrics)
  • Add script to verify Application Insights configuration
  • Document resilience configuration already present in ServiceDefaults
  • Add deployment validation script
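The metric-alert item could take a shape like the following Bicep fragment, keyed to the `Requests`/`statusCodeCategory` metric the incident evidence already uses. The alert name, threshold, and parameter wiring are placeholders to adapt when integrating with `main.bicep`.

```bicep
// Sketch: Azure Monitor metric alert on 5xx request count for the
// octopetsapi container app. Names and threshold are placeholders.
param containerAppId string // resource ID of octopetsapi

resource http5xxAlert 'Microsoft.Insights/metricAlerts@2018-03-01' = {
  name: 'octopetsapi-http-5xx-rate'
  location: 'global'
  properties: {
    severity: 1
    enabled: true
    scopes: [containerAppId]
    evaluationFrequency: 'PT1M'
    windowSize: 'PT5M'
    criteria: {
      'odata.type': 'Microsoft.Azure.Monitor.SingleResourceMultipleMetricCriteria'
      allOf: [
        {
          name: 'Http5xx'
          metricNamespace: 'Microsoft.App/containerApps'
          metricName: 'Requests'
          dimensions: [
            {
              name: 'statusCodeCategory'
              operator: 'Include'
              values: ['5xx']
            }
          ]
          operator: 'GreaterThan'
          threshold: 10
          timeAggregation: 'Total'
          criterionType: 'StaticThresholdCriterion'
        }
      ]
    }
    actions: []
  }
}
```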

Testing & Validation

  • Create load test scenario to reproduce 5xx spike
  • Validate telemetry flow to Application Insights
  • Test exception logging and correlation
  • Verify resilience patterns (circuit breaker, retry, timeout)
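For the load-test item, a throwaway generator along these lines may be enough to reproduce the spike; the target URL and concurrency are assumptions, and this should only be run against a non-production environment.

```csharp
// Sketch: fire 200 concurrent requests and count 5xx responses.
using System;
using System.Linq;
using System.Net.Http;
using System.Threading.Tasks;

var client = new HttpClient { Timeout = TimeSpan.FromSeconds(5) };
// Placeholder URL: substitute the real octopetsapi endpoint.
var target = new Uri("https://octopetsapi.example.internal/api/pets");

var tasks = Enumerable.Range(0, 200).Select(async _ =>
{
    try
    {
        var response = await client.GetAsync(target);
        return (int)response.StatusCode;
    }
    catch (Exception)
    {
        return 0; // timeout or connection failure
    }
});

var codes = await Task.WhenAll(tasks);
var serverErrors = codes.Count(c => c >= 500);
Console.WriteLine($"total={codes.Length} 5xx={serverErrors}");
```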

Security

  • Run CodeQL checker before completion
  • Review for any security vulnerabilities in error handling
Original prompt

This section details the original issue you should resolve

<issue_title>Sev1: Octopets API 5xx spike around 11:09–11:14 UTC (INC0010065)</issue_title>
<issue_description>Incident context

  • Incident: INC0010065 (sys_id: PENDING – need from ServiceNow)
  • Alert: Azure Monitor metric alert "HTTP 5xx - Octopets API" fired at 2026-01-28T11:12:53Z
  • Impact: Elevated 5xx responses and high response times on octopetsapi

Investigation window (UTC)

  • Start: 2026-01-28T09:23:52Z
  • End: 2026-01-28T11:23:52Z

Application Insights evidence (resource: octopets_appinsights-y6uqzjyatoawm)

  • KQL executed (ZERO RESULTS across these; likely missing/disabled SDK telemetry):
    1. requests | where timestamp >= ago(2h) | summarize totalRequests=count(), failedRequests=countif(success==false), successRate=100.0*countif(success==true)/count() by bin(timestamp, 5m) | order by timestamp asc
    2. requests | where timestamp >= ago(2h) and success == false | summarize failures=count() by name, resultCode | top 15 by failures desc
    3. exceptions | where timestamp >= ago(2h) | summarize count() by type, outerMessage | top 15 by count_ desc
    4. traces | where timestamp >= ago(2h) and severityLevel >= 3 | summarize count() by message | top 20 by count_ desc
    5. requests | where timestamp >= ago(2h) and success == false | project timestamp, name, url, resultCode, operation_Id, cloud_RoleName | order by timestamp desc | take 1
  • Actionable note: App Insights does not appear to be ingesting backend telemetry; please confirm the SDK/connection string is configured and emitting.

Azure Metrics (octopetsapi: /subscriptions/06dbbc7b-2363-4dd4-9803-95d07f1a8d3e/resourceGroups/rg-octopets-demo-lab/providers/Microsoft.App/containerApps/octopetsapi)

  • Requests (split by statusCodeCategory):
    • 5xx: spike begins ~11:09, peaks ~11:13, then drops to 0 by ~11:15
    • 2xx: low volume during spike; minor activity at 11:06–11:14
  • ResponseTime (ms, by statusCodeCategory):
    • 5xx: ~1100–1400ms during 11:09–11:14
    • 2xx: varied; one high value at 11:09 (~167ms) then normal single-digit values
  • CpuPercentage: baseline ~57%, increases to ~79% from 11:11 onward, sustained
  • MemoryPercentage: 0 throughout (likely metric not populated or misreported)
  • RestartCount: 0 (no replica restarts observed)

Suspected root cause(s)

  1. Backend dependency errors/timeouts causing 5xx responses (highest probability). Evidence: sharp 5xx spike with elevated response times for 5xx; no restarts; CPU increased but not pegged; lack of App Insights exceptions suggests missing telemetry rather than absence of errors.
  2. Connection pool saturation or upstream ejection under load (medium). Evidence: elevated latency concurrent with 5xx; available resiliency metrics (PendingConnectionPool, RequestTimeouts) should be checked/alerted; not currently queried due to time.
  3. Misconfigured telemetry/monitoring (lower, but impactful). Evidence: App Insights queries returned zero; SDK/config may be disabled, hindering diagnosis.

Proposed fixes (code + IaC/config)

Code:

  • Add structured exception handling and map dependency failures to appropriate 5xx/4xx responses with detailed logs (correlation IDs). Implement timeout/retry with jitter and a circuit breaker (e.g., Polly for .NET, or equivalent resilience patterns in the chosen stack).
  • Review connection pool sizing and request concurrency; set sane timeouts (connect/read) to avoid long hangs cascading into 5xx.

IaC/config:

  • Ensure the Application Insights SDK is enabled with the correct connection string in octopetsapi and that sampling is not configured to drop all telemetry; emit requests, exceptions, and traces.
  • Enable/monitor resiliency metrics (ResiliencyRequestTimeouts, ResiliencyRequestsPendingConnectionPool) with alerts; add a latency alert using ResponseTime for 5xx.
  • Consider health probe tuning (readiness/liveness) so traffic is shed from unhealthy replicas quickly.
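The timeout/retry-with-jitter/circuit-breaker combination proposed above can be sketched with Polly. The values below mirror the issue's suggestions (a few seconds of timeout, at most 2 retries) rather than measured settings, and assume the `Polly` and `Polly.Extensions.Http` packages.

```csharp
// Sketch: per-attempt timeout, retry with jitter, and circuit breaker
// for outbound HTTP dependencies, using Polly.
using System;
using System.Net.Http;
using Polly;
using Polly.Extensions.Http;

static class OctopetsResilience
{
    public static IAsyncPolicy<HttpResponseMessage> Build()
    {
        var jitter = new Random();

        // Handles 5xx, 408, and HttpRequestException; retries at most twice.
        var retry = HttpPolicyExtensions
            .HandleTransientHttpError()
            .WaitAndRetryAsync(2, attempt =>
                TimeSpan.FromMilliseconds(200 * attempt + jitter.Next(0, 100)));

        // Opens after 5 consecutive handled failures; sheds load for 30s.
        var breaker = HttpPolicyExtensions
            .HandleTransientHttpError()
            .CircuitBreakerAsync(
                handledEventsAllowedBeforeBreaking: 5,
                durationOfBreak: TimeSpan.FromSeconds(30));

        // Per-attempt timeout so a slow dependency cannot hang the request.
        var timeout = Policy.TimeoutAsync<HttpResponseMessage>(TimeSpan.FromSeconds(3));

        // Outermost first: retry wraps the breaker, which wraps the timeout.
        return Policy.WrapAsync(retry, breaker, timeout);
    }
}
```

In a .NET app this policy would typically be attached to a named `HttpClient` via `AddPolicyHandler`, so every outbound call to the failing dependency gets the same guardrails.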

Concrete next steps

  • Verify App Insights instrumentation in backend; add basic exception + dependency telemetry.
  • Add guards around external calls; apply timeout (e.g., 3–5s), max retries (<=2), and circuit breaker.
  • Create/adjust alerts for resiliency metrics and 5xx rate; add a dashboard for Requests split, ResponseTime, CPU.
  • Run load test to reproduce spike and validate fixes.

Repository context

  • Connected repo: https://github.com/gderossilive/AzSreAgentLab
  • IaC scripts: scripts/30-deploy-octopets.sh calls external/octopets/apphost/infra/main.bicep; propose adding App Insights settings/variables and verifying telemetry config.

Please assign backend owners to instrument and implement resilience, then validate under load. Attach logs/exceptions in follow-up once telemetry is enabled.

This issue was created by sre-agent-demo--c3c0627e
Tracked by the SRE agent [here](https://portal.azure.com/?feature.customPortal=false&feature.canmodifystamps=true&feature.fastmanifest=false&nocdn=force&websitesextension_loglevel=verbose&Microsoft_Azure_PaasServerless=beta&microsoft_azure_paasserverless_assettypeoptions=%7...

  • Fixes gderossilive/AzSreAgentLab#34
