[WIP] Fix misconfigured HTTP 5xx alert for Octopets API#18

Closed
Copilot wants to merge 1 commit into main from
copilot/fix-octopets-api-http-5xx-alert

Conversation

Copilot AI commented Jan 26, 2026

Fix Octopets API HTTP 5xx Alert and Scale-to-Zero Issue

Analysis

The incident (INC0010043) revealed two critical issues:

  1. Misconfigured Alert: The alert named "HTTP 5xx - Octopets API" actually monitors 2xx requests with a GreaterThan 10 threshold, not 5xx errors
  2. Scale-to-Zero: The replica count was reported as 0 during the incident window, suggesting the app scaled down completely

Planned Changes

  • Investigate existing infrastructure deployment structure
  • Create or update main.bicep for Octopets infrastructure with:
    • Container App configuration with minReplicas >= 1
    • Properly configured 5xx metric alert (statusCodeCategory='5xx')
    • Readiness/liveness probes
    • Application Insights integration
  • Update deployment scripts if needed
  • Verify the alert configuration matches the correct pattern from octopets-az-monitor-alerts.bicep
  • Test the deployment

Scope

Minimal changes to:

  • Create/update infrastructure Bicep templates
  • Fix alert criteria to monitor actual 5xx errors
  • Set minReplicas to prevent scale-to-zero during business hours
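As a hedged sketch of the alert fix (the resource name, scope reference, API version, and threshold below are illustrative placeholders, not the actual contents of `external/octopets/apphost/infra/main.bicep`), the corrected criterion would pin the `Requests` metric to the `5xx` status-code category:

```bicep
// Hypothetical sketch: metric alert corrected to watch real 5xx responses.
// containerApp, the threshold, and the action group are placeholders.
resource http5xxAlert 'Microsoft.Insights/metricAlerts@2018-03-01' = {
  name: 'HTTP 5xx - Octopets API'
  location: 'global'
  properties: {
    severity: 1
    enabled: true
    scopes: [containerApp.id]            // the Octopets Container App
    evaluationFrequency: 'PT1M'
    windowSize: 'PT5M'
    criteria: {
      'odata.type': 'Microsoft.Azure.Monitor.SingleResourceMultipleMetricCriteria'
      allOf: [
        {
          name: 'Http5xx'
          metricNamespace: 'microsoft.app/containerapps'
          metricName: 'Requests'
          dimensions: [
            {
              name: 'statusCodeCategory'
              operator: 'Include'
              values: ['5xx']            // was effectively '2xx' — the root cause
            }
          ]
          operator: 'GreaterThan'
          threshold: 10                  // example value; tune per SLO
          timeAggregation: 'Total'
        }
      ]
    }
    actions: []                          // wire up the real action group here
  }
}
```

The dimension filter is what distinguishes this from the misconfigured alert: without it, the alert fires (or fails to fire) on whatever category the criteria happen to reference.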

Original prompt

This section details the original issue you should resolve

<issue_title>Octopets API: Sev1 alert "HTTP 5xx" fired/resolved; metrics suggest misconfigured alert and potential scale-to-zero/cold start</issue_title>
<issue_description>Incident: INC0010043 (ServiceNow) — need sys_id to link once available.
Azure Alert: "HTTP 5xx - Octopets API" (Metric Alert)

Evidence (UTC, last 2h):

  1. Requests (statusCodeCategory split)
    • 2xx averages: 16:30 ~8, 16:31 ~9.5, 16:32 ~3; then 0 from 16:33–16:40.
    • No series returned for 5xx (empty), and alert payload shows criteria on 2xx Requests with operator GreaterThan threshold=10, which does not match the alert title.
  2. ResponseTime (by statusCodeCategory)
    • 2xx avg response time: 16:30 ~629ms; 16:31 ~529ms; 16:32 ~508ms.
  3. RestartCount
    • Flat at 1 throughout; no new restarts during the window.
  4. Replicas
    • Reported 0 across the period (suggests scale-to-zero or misread of metric).
  5. ResiliencyRequestTimeouts
    • No data.

Assessment:

  • The alert name suggests monitoring 5xx errors, but criteria in the alert context show 2xx Requests metric with GreaterThan 10 and observed metricValue=0, indicating a likely misconfiguration of the alert.
  • The drop to zero 2xx requests from 16:33 onward could indicate the app was not serving requests (possible scale-to-zero/cold start, an ingress issue, or an upstream dependency failure returning non-2xx). The absence of 5xx metrics implies either that errors were not surfaced as 5xx at the Container Apps metric source or that there was no traffic.

Proposed fixes:
A) Alerting (IaC/config):

  • Update the metric alert to target Requests with statusCodeCategory='5xx' (Total) above threshold X in window Y, or use error percentage (5xx/total) > N%.
  • Ensure dimensions split by statusCodeCategory and optionally statusCode.
    Path (IaC): external/octopets/apphost/infra/main.bicep (metric alert resource under Microsoft.Insights/metricAlerts) — adjust criteria accordingly.
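If the error-percentage variant is preferred over a raw 5xx count, one hedged option is a log-based alert. This sketch assumes workspace-based Application Insights (the standard `AppRequests` table); the workspace reference, role name, and thresholds are placeholders:

```bicep
// Hypothetical alternative: fire when 5xx responses exceed N% of traffic.
// logAnalyticsWorkspace, the role name, and thresholds are placeholders.
resource errorRateAlert 'Microsoft.Insights/scheduledQueryRules@2023-03-15-preview' = {
  name: 'Error rate - Octopets API'
  location: resourceGroup().location
  properties: {
    severity: 1
    enabled: true
    scopes: [logAnalyticsWorkspace.id]
    evaluationFrequency: 'PT5M'
    windowSize: 'PT5M'
    criteria: {
      allOf: [
        {
          query: '''
            AppRequests
            | where AppRoleName == "octopetsapi"
            | summarize total = count(), errors = countif(toint(ResultCode) >= 500)
            | extend errorPct = iff(total == 0, 0.0, 100.0 * errors / total)
            | project errorPct
          '''
          timeAggregation: 'Maximum'
          metricMeasureColumn: 'errorPct'
          operator: 'GreaterThan'
          threshold: 5                   // fire above 5% errors; tune per SLO
          failingPeriods: {
            numberOfEvaluationPeriods: 1
            minFailingPeriodsToAlert: 1
          }
        }
      ]
    }
  }
}
```

The ratio form avoids false alarms during traffic spikes (where an absolute 5xx count would trip even at a healthy error rate), at the cost of being noisier when total traffic is near zero.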

B) Reliability/serving:

  • Confirm Container App scale settings; set minReplicas >= 1 to avoid scale-to-zero during business hours.
    Path: external/octopets/apphost/infra/main.bicep → container app configuration.scale.minReplicas: 1.
  • Add/verify readiness/liveness probes and ingress configuration.
  • Ensure Application Insights is enabled with dependency/exception telemetry; add structured error logging in Octopets API so 5xx are observable.
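The serving-side changes above can be sketched in Bicep as follows. The image reference, port, probe paths, and replica bounds are assumptions for illustration, not the actual values in `main.bicep`:

```bicep
// Hypothetical sketch of the Container App fixes: minReplicas >= 1 plus
// readiness/liveness probes. Image, port, and probe paths are placeholders.
resource octopetsApi 'Microsoft.App/containerApps@2023-05-01' = {
  name: 'octopetsapi'
  location: resourceGroup().location
  properties: {
    // ...environmentId, ingress, secrets, etc. elided...
    template: {
      containers: [
        {
          name: 'octopetsapi'
          image: octopetsImage           // placeholder image reference
          probes: [
            {
              type: 'Readiness'
              httpGet: { path: '/health/ready', port: 8080 }
              initialDelaySeconds: 5
              periodSeconds: 10
            }
            {
              type: 'Liveness'
              httpGet: { path: '/health/live', port: 8080 }
              initialDelaySeconds: 10
              periodSeconds: 30
            }
          ]
        }
      ]
      scale: {
        minReplicas: 1                   // prevent scale-to-zero / cold starts
        maxReplicas: 3
      }
    }
  }
}
```

With `minReplicas: 1`, the `Replicas` metric should no longer drop to 0 as observed during the incident window, and the probes give the platform a signal to restart or withhold traffic from an unhealthy replica instead of silently serving nothing.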

C) Code improvements:

  • Implement global exception handling to return consistent 5xx with actionable logs and correlation IDs.
  • Add timeout/circuit breaker around downstream calls (e.g., HTTP client policies) to prevent hangs that may produce 5xx.

Requested actions:

  • Fix metric alert criteria to monitor 5xx or error rate
  • Set minReplicas >= 1 for octopetsapi (business hours)
  • Verify probes/ingress and enable App Insights telemetry
  • Add error handling and resilience policies in API code

Please link commits/PRs and reference this issue. Once sys_id for INC0010043 is provided, we'll update the incident with the GitHub link and resolve it.

This issue was created by sre-agent-demo--c3c0627e
Tracked by the SRE agent here
</issue_description>

Comments on the Issue (you are @copilot in this section)

  • Fixes gderossilive/AzSreAgentLab#17

