gderossilive/AzSreAgentLab

Azure SRE Agent Lab Environment

A complete lab environment for testing Azure SRE Agent (Preview) against a realistic workload, deployed to the Sweden Central region.

🎯 Overview

This lab deploys:

  • Octopets Sample Application: A .NET Aspire application (React frontend + ASP.NET Core backend) running on Azure Container Apps
  • Azure SRE Agent: AI-powered reliability assistant configured with High access scoped to the Octopets resource group only
  • Autonomous Health Monitoring: Scheduled health checks with statistical anomaly detection and Teams notifications

Included Demos

  1. ServiceNow Incident Automation (demos/ServiceNowAzureResourceHandler/)

    • End-to-end automated incident response (Azure Monitor → ServiceNow → SRE Agent → GitHub)
    • See ServiceNow Demo for details
  2. Azure Health Check with Teams Alerts (demos/AzureHealthCheck/)

    • Scheduled autonomous monitoring of Azure resources (Container Apps, VMs, AKS, App Service)
    • Statistical anomaly detection using MAD/z-score analysis
    • Cost monitoring, Azure Advisor integration, dependency health checks
    • Adaptive Card alerts sent to Microsoft Teams
    • Demo flow diagram: demos/AzureHealthCheck/README.md#demo-flow-end-to-end
    • See Health Check Demo for details
  3. Proactive Reliability (App Service Slot Swap) (demos/ProactiveReliabilityAppService/)

  4. Grocery SRE Demo (Container Apps + Managed Grafana + MCP) (demos/GrocerySreDemo/)

    • Sample “grocery” app (API + web) with an optional Loki log pipeline and optional Grafana MCP integration
    • Useful for demos that involve querying Grafana dashboards/logs via MCP (e.g., from agents)
  5. Grubify Incident Lab (azd-based, 3 personas) (demos/GrubifyIncidentLab/)

    • Standalone azd up demo: deploys Grubify (Node.js food ordering app) with intentional memory leak
    • Three acts: IT Ops (autonomous diagnosis/remediation), Developer (code analysis → GitHub issue), Workflow (issue triage)
    • Bicep infrastructure, knowledge base, subagents, and response plan deployed via post-provision hook
    • See Grubify Incident Lab Demo for details

Demo → scripts map

| Demo | Demo folder | Key config files | Related scripts (run order) |
| --- | --- | --- | --- |
| Azure Health Check (scheduled anomaly detection → Teams) | demos/AzureHealthCheck/ | demos/AzureHealthCheck/README.md, demos/AzureHealthCheck/azurehealthcheck-subagent-simple.yaml | scripts/70-test-teams-webhook.sh → scripts/71-send-sample-anomaly.sh → (optional) scripts/60-generate-traffic.sh |
| ServiceNow Incident Automation (Azure Monitor alerts → ServiceNow incident → SRE Agent subagent) | demos/ServiceNowAzureResourceHandler/ | demos/ServiceNowAzureResourceHandler/README.md, demos/ServiceNowAzureResourceHandler/servicenow-subagent-simple.yaml, demos/ServiceNowAzureResourceHandler/servicenow-logic-app.bicep, demos/ServiceNowAzureResourceHandler/octopets-service-now-alerts.bicep | scripts/50-deploy-logic-app.sh → scripts/50-deploy-alert-rules.sh → scripts/63-enable-memory-errors.sh (or scripts/61-enable-cpu-stress.sh) → scripts/60-generate-traffic.sh → verify with scripts/61-check-memory.sh → cleanup: scripts/64-disable-memory-errors.sh / scripts/62-disable-cpu-stress.sh |
| Proactive Reliability (App Service slot swap → expected rollback) | demos/ProactiveReliabilityAppService/ | demos/ProactiveReliabilityAppService/README.md, demos/ProactiveReliabilityAppService/demo-config.json, demos/ProactiveReliabilityAppService/SubAgents/ | demos/ProactiveReliabilityAppService/scripts/01-setup-demo.sh → demos/ProactiveReliabilityAppService/scripts/02-run-demo.sh → (optional) demos/ProactiveReliabilityAppService/scripts/03-reset-demo.sh |
| Grocery SRE Demo (Container Apps + Managed Grafana + MCP) | demos/GrocerySreDemo/ | demos/GrocerySreDemo/README.md, demos/GrocerySreDemo/demo-config.json | demos/GrocerySreDemo/scripts/01-setup-demo.sh → demos/GrocerySreDemo/scripts/02-build-and-deploy-containers.sh → demos/GrocerySreDemo/scripts/03-smoke-and-trigger.sh |
| Grubify Incident Lab (azd-based, 3 personas) | demos/GrubifyIncidentLab/ | demos/GrubifyIncidentLab/README.md, demos/GrubifyIncidentLab/azure.yaml | cd demos/GrubifyIncidentLab && azd up → demos/GrubifyIncidentLab/scripts/break-app.sh |

📋 Architecture

┌─────────────────────────────────────────────────────────────┐
│ Resource Group: rg-octopets-lab (Sweden Central)            │
├─────────────────────────────────────────────────────────────┤
│ • Container Apps Environment                                │
│ • Azure Container Registry                                  │
│ • Log Analytics Workspace                                   │
│ • Application Insights                                      │
│ • Backend Container App (ASP.NET Core)                      │
│ • Frontend Container App (React/Nginx)                      │
└─────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────┐
│ Resource Group: rg-sre-agent-lab (Sweden Central)           │
├─────────────────────────────────────────────────────────────┤
│ • SRE Agent Resource                                        │
│ • Managed Identity (with scoped permissions)                │
│ • Log Analytics Workspace                                   │
│ • Application Insights                                      │
└─────────────────────────────────────────────────────────────┘

RBAC Configuration

The SRE Agent's managed identity has High access with these roles scoped only to the Octopets resource group (OCTOPETS_RG_NAME):

  • Contributor - Enables remediation actions
  • Reader - Read access to resources
  • Log Analytics Contributor - Access to logs for diagnostics
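
The role assignments above could be reproduced by hand with the Azure CLI. The following is a minimal sketch, not the lab's actual deployment code (that lives in scripts/40-deploy-sre-agent.sh and is not shown in this README). The helper names rg_scope and assign_lab_roles and the PRINCIPAL_ID variable are hypothetical. It builds a resource-group scope string and assigns the three roles at that scope only.

```bash
# Sketch only: assign the three lab roles at resource-group scope.
# rg_scope and assign_lab_roles are hypothetical helper names.

# Build the ARM scope string for a resource group.
rg_scope() {
  local subscription_id="$1" resource_group="$2"
  printf '/subscriptions/%s/resourceGroups/%s\n' "$subscription_id" "$resource_group"
}

# Assign each role to the agent's managed identity (by object ID) at that scope.
assign_lab_roles() {
  local principal_id="$1" scope="$2" role
  for role in "Contributor" "Reader" "Log Analytics Contributor"; do
    az role assignment create \
      --assignee-object-id "$principal_id" \
      --assignee-principal-type ServicePrincipal \
      --role "$role" \
      --scope "$scope"
  done
}

# Usage (after `source scripts/load-env.sh`; PRINCIPAL_ID is an assumed variable
# holding the managed identity's object ID):
#   assign_lab_roles "$PRINCIPAL_ID" "$(rg_scope "$AZURE_SUBSCRIPTION_ID" "$OCTOPETS_RG_NAME")"
```

Because the scope ends at the resource group, none of these roles apply to other resource groups in the subscription.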

🚀 Quick Start

Prerequisites

  • Azure CLI (az)
  • Bash shell
  • Azure subscription with permissions to create resources and role assignments
  • Dev container environment (included)

Notes:

  • Local Docker is not required; container images are built remotely using Azure Container Registry (az acr build).
  • For hosted load generation with Azure Load Testing (Apache JMeter), see docs/azure-load-testing-jmeter.md.

Permissions needed (common working combinations):

  • Ability to run subscription-scope deployments that create resource groups (az deployment sub create)
  • Ability to create RBAC role assignments scoped to the Octopets resource group and the SRE Agent resource group
  • Typically: Owner, or Contributor + User Access Administrator at the required scopes

Region requirement:

  • swedencentral (SRE Agent preview constraint)

Security constraints:

  • Never commit .env (or any secrets)
  • Do not grant the SRE Agent subscription-wide permissions; scope access to the target resource group(s) only

Deployment

  1. Configure Environment

    # Copy the template, then edit .env with your Azure settings
    cp .env.example .env

    Minimum variables for the happy path:

    • AZURE_TENANT_ID, AZURE_SUBSCRIPTION_ID, AZURE_LOCATION
    • OCTOPETS_ENV_NAME
    • SRE_AGENT_RG_NAME, SRE_AGENT_NAME, SRE_AGENT_ACCESS_LEVEL

    Notes:

    • OCTOPETS_RG_NAME and SRE_AGENT_TARGET_RESOURCE_GROUPS are auto-populated by scripts/30-deploy-octopets.sh.
    • OCTOPETS_API_URL and OCTOPETS_FE_URL are auto-populated by scripts/31-deploy-octopets-containers.sh.
  2. Authenticate

    source scripts/load-env.sh
    scripts/20-az-login.sh
  3. Deploy Octopets Infrastructure

    scripts/30-deploy-octopets.sh

    This deploys infrastructure via Azure CLI + Bicep at subscription scope and sets OCTOPETS_RG_NAME in your .env. It also sets SRE_AGENT_TARGET_RESOURCE_GROUPS in your .env to the same RG name.

  4. Build and Deploy Containers

    scripts/31-deploy-octopets-containers.sh

    This uses ACR remote builds (az acr build) and updates .env with OCTOPETS_API_URL and OCTOPETS_FE_URL.

    Verification:

    • Open the OCTOPETS_FE_URL in a browser; the frontend should load.

  4b. (Optional) OpenAPI smoke test

    scripts/32-openapi-smoke-test.sh
  5. Ensure SRE Agent reference repo is present

    # Only needed if external/sre-agent is missing
    scripts/10-clone-repos.sh
  6. Deploy SRE Agent

    scripts/40-deploy-sre-agent.sh
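
Before running the sequence above, it can help to confirm that the minimum variables are set and that the region matches the lab's constraint. This is a minimal sketch under the assumption that scripts/load-env.sh has already exported the .env values into the current shell. The helpers missing_vars and preflight are hypothetical and not part of the repo.

```bash
# Sketch only: report which of the minimum .env variables are unset or empty.
# missing_vars and preflight are hypothetical helper names.
missing_vars() {
  local name value
  for name in "$@"; do
    # Indirect lookup: read the variable whose name is stored in $name.
    eval "value=\${$name:-}"
    [ -n "$value" ] || printf '%s\n' "$name"
  done
}

preflight() {
  local missing
  missing="$(missing_vars AZURE_TENANT_ID AZURE_SUBSCRIPTION_ID AZURE_LOCATION \
    OCTOPETS_ENV_NAME SRE_AGENT_RG_NAME SRE_AGENT_NAME SRE_AGENT_ACCESS_LEVEL)"
  if [ -n "$missing" ]; then
    # Unquoted on purpose so the names print space-separated on one line.
    echo "Missing in .env:" $missing >&2
    return 1
  fi
  if [ "$AZURE_LOCATION" != "swedencentral" ]; then
    echo "AZURE_LOCATION is '$AZURE_LOCATION'; this lab expects swedencentral" >&2
    return 1
  fi
  echo "preflight OK"
}

# Usage:
#   source scripts/load-env.sh && preflight
```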

Fresh environment (new deployment)

To create a fresh deployment (new resource groups), start with a fresh .env and new names:

rm -f .env
cp .env.example .env

# Edit these before deploying (examples)
scripts/set-dotenv-value.sh "OCTOPETS_ENV_NAME" "octopets-lab-$(date +%Y%m%d%H%M%S)"
scripts/set-dotenv-value.sh "SRE_AGENT_RG_NAME" "rg-sre-agent-lab-$(date +%Y%m%d%H%M%S)"
scripts/set-dotenv-value.sh "SRE_AGENT_NAME" "sre-agent-lab-$(date +%Y%m%d%H%M%S)"

Then follow the same happy-path deployment sequence above.
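
The three commands above each call date separately, so the suffixes can differ by a second if the clock ticks between them. That is harmless, but a shared suffix makes the resources easier to correlate. A minimal sketch that computes the timestamp once follows; lab_names and apply_lab_names are hypothetical helpers, and only scripts/set-dotenv-value.sh comes from the repo.

```bash
# Sketch only: derive all three names from one suffix so they match.
# lab_names prints KEY=VALUE pairs for a given suffix.
lab_names() {
  local suffix="$1"
  printf 'OCTOPETS_ENV_NAME=octopets-lab-%s\n' "$suffix"
  printf 'SRE_AGENT_RG_NAME=rg-sre-agent-lab-%s\n' "$suffix"
  printf 'SRE_AGENT_NAME=sre-agent-lab-%s\n' "$suffix"
}

# Write each pair into .env with the repo's helper script.
apply_lab_names() {
  local suffix="$1" key value
  lab_names "$suffix" | while IFS='=' read -r key value; do
    scripts/set-dotenv-value.sh "$key" "$value"
  done
}

# Usage:
#   apply_lab_names "$(date +%Y%m%d%H%M%S)"
```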

  7. [Optional] Deploy ServiceNow Integration Demo
    # See demos/ServiceNowAzureResourceHandler/README.md for complete instructions
    # Requires ServiceNow developer instance and credentials in .env
    scripts/50-deploy-logic-app.sh
    scripts/50-deploy-alert-rules.sh

📁 Project Structure

.
├── .env                    # Environment configuration (not in git)
├── .env.example            # Template for environment variables
├── specs/
│   ├── EnvSetupSpecs.md    # Complete lab specification
│   └── IncidentAutomationServiceNow.md  # ServiceNow demo spec
├── scripts/
│   ├── 10-clone-repos.sh   # Bootstrap external repos (optional)
│   ├── 11-vendor-external-repo.sh  # Vendor another external repo into external/ (modifiable, no nested .git)
│   ├── 20-az-login.sh      # Azure authentication
│   ├── 30-deploy-octopets.sh        # Deploy Octopets infrastructure (Azure CLI + Bicep, subscription scope)
│   ├── 31-deploy-octopets-containers.sh  # Build & deploy containers (ACR remote builds, no Docker)
│   ├── 32-openapi-smoke-test.sh     # OpenAPI sanity checks for Octopets API
│   ├── 40-deploy-sre-agent.sh       # Deploy SRE Agent
│   ├── 50-deploy-alert-rules.sh     # Deploy ServiceNow integration
│   ├── load-env.sh         # Load environment variables
│   └── set-dotenv-value.sh          # Update .env values
├── demos/
│   ├── ServiceNowAzureResourceHandler/
│   │   ├── README.md       # ServiceNow demo execution guide
│   │   ├── servicenow-subagent-simple.yaml  # SRE Agent subagent
│   │   └── octopets-service-now-alerts.bicep       # Alert rules template
│   ├── AzureHealthCheck/
│   │   ├── README.md       # Health check setup guide
│   │   └── azurehealthcheck-subagent-simple.yaml  # Health monitoring subagent
│   ├── ProactiveReliabilityAppService/
│   │   ├── README.md       # App Service slot swap demo guide
│   │   ├── demo-config.json  # Output from setup script (safe to commit)
│   │   ├── SubAgents/      # Portal YAML templates (placeholders)
│   │   └── scripts/        # Setup/run/reset scripts
│   ├── GrocerySreDemo/
│   │   ├── README.md       # Grocery SRE demo execution guide
│   │   ├── infrastructure/ # Bicep templates for optional components (Loki, MCP)
│   │   └── scripts/        # Deploy/test scripts
│   └── GrubifyIncidentLab/
│       ├── README.md       # Grubify incident lab guide (3 personas)
│       ├── azure.yaml      # azd template entry point
│       ├── infrastructure/ # Bicep (subscription-scoped main + 6 modules)
│       ├── knowledge/      # SRE knowledge base (4 MD runbooks)
│       ├── sre-config/     # Subagent YAMLs + GitHub MCP connector
│       ├── scripts/        # post-provision.sh, break-app.sh, helpers
│       └── src/            # Grubify app source (clone from GitHub)
└── external/
    ├── octopets/           # Octopets sample app
    └── sre-agent/          # SRE Agent reference repo

🤖 Copilot prompts

Reusable prompt templates live under .github/prompts/.

External repositories (vendored copies)

The directories under external/ are vendored snapshots of upstream GitHub repositories.

  • They are not Git submodules (no .gitmodules), and the vendored folders typically do not contain a .git/ directory.
  • Each vendored repo includes an ORIGIN.md documenting the upstream URL and why it was copied (and, for Octopets, what was modified).
  • Updating a vendored repo is a manual process (re-vendor a pinned upstream ref and apply any lab-specific changes).

scripts/10-clone-repos.sh is provided as a convenience for fresh workspaces: it clones the upstream repos into external/ only if the target directory does not already exist. In this repo, external/ is already present and tracked in Git.
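
The "clone only if the target directory does not already exist" behavior described above can be sketched as follows. This is an illustration, not the contents of scripts/10-clone-repos.sh. The function name clone_if_missing is hypothetical, the shallow clone (--depth 1) is a choice made here, and the upstream URL is left as a placeholder because each vendored repo's ORIGIN.md is the place that records it.

```bash
# Sketch only: clone a repo into a target directory unless it already exists.
# clone_if_missing is a hypothetical name.
clone_if_missing() {
  local url="$1" target="$2"
  if [ -d "$target" ]; then
    # Leave existing (vendored, possibly modified) content untouched.
    echo "skip: $target already exists"
    return 0
  fi
  git clone --depth 1 "$url" "$target"
}

# Usage (the URL is a placeholder; see the repo's ORIGIN.md for the real upstream):
#   clone_if_missing "<upstream-url>" external/sre-agent
```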

🔧 Configuration

Environment Variables

Key variables in .env:

# Azure Context
AZURE_TENANT_ID=<your-tenant-id>
AZURE_SUBSCRIPTION_ID=<your-subscription-id>
AZURE_LOCATION=swedencentral

# Proactive Reliability Demo (App Service slot swap)
PROACTIVE_DEMO_RG_NAME=rg-sre-proactive-demo
PROACTIVE_DEMO_APP_SERVICE_NAME=<unique-app-service-name>

# Octopets Application
OCTOPETS_ENV_NAME=octopets-lab
OCTOPETS_RG_NAME=<OCTOPETS_RESOURCE_GROUP_NAME>

# Optional outputs captured after scripts/31-deploy-octopets-containers.sh
OCTOPETS_API_URL=<OCTOPETS_API_URL>
OCTOPETS_FE_URL=<OCTOPETS_FE_URL>

# SRE Agent
SRE_AGENT_RG_NAME=rg-sre-agent-lab
SRE_AGENT_NAME=sre-agent-lab
SRE_AGENT_ACCESS_LEVEL=High
SRE_AGENT_TARGET_RESOURCE_GROUPS=<OCTOPETS_RESOURCE_GROUP_NAME>

# ServiceNow Webhook URL (auto-populated after deploying the Logic App)
SERVICENOW_WEBHOOK_URL=<AUTO_POPULATED_AFTER_ALERT_DEPLOYMENT>

# ServiceNow Integration (Optional - for demo)
SERVICENOW_INSTANCE=dev12345
SERVICENOW_USERNAME=admin
SERVICENOW_PASSWORD=<password>
INCIDENT_NOTIFICATION_EMAIL=your-email@example.com

# Teams Webhook (Power Automate)
TEAMS_WEBHOOK_URL=<YOUR_TEAMS_WEBHOOK_URL>
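
Several of these values are written by scripts rather than by hand. To illustrate what an "update or add a KEY=VALUE line" helper does, here is a minimal sketch. It is not the repo's scripts/set-dotenv-value.sh, whose implementation is not shown in this README; the function name and the extra file argument are assumptions made so the example is self-contained.

```bash
# Sketch only: upsert KEY=VALUE in a dotenv-style file.
# Note: an updated key moves to the end of the file in this simple version.
set_dotenv_value() {
  local file="$1" key="$2" value="$3" tmp
  tmp="$(mktemp)"
  if [ -f "$file" ]; then
    # Keep every line that does not assign this key.
    # grep exits non-zero when nothing is left, which is fine here.
    grep -v "^${key}=" "$file" > "$tmp" || true
  fi
  # Append the new assignment, then replace the original file.
  printf '%s=%s\n' "$key" "$value" >> "$tmp"
  mv "$tmp" "$file"
}

# Usage:
#   set_dotenv_value .env OCTOPETS_ENV_NAME octopets-lab
```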

🛠️ Technical Details

Deployment Approach

This lab uses a Docker-free deployment strategy:

  • Infrastructure: Deployed via Bicep templates at subscription scope
  • Container Builds: Remote builds using Azure Container Registry Tasks (az acr build)
  • Container Deployment: Azure Container Apps

This approach avoids the Docker Desktop dependency that .NET Aspire applications typically have.

Modified Dockerfiles

The original Octopets Dockerfiles were modified to work with the project root as the build context:

  • backend/Dockerfile: Updated COPY paths for servicedefaults
  • frontend/Dockerfile: Updated COPY paths for package.json and nginx.conf
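
The effect of using the project root as the build context can be seen in a remote build command. The sketch below is a hedged illustration of the approach, not the contents of scripts/31-deploy-octopets-containers.sh. The image name octopets-backend, the helper names, and the assumption that the registry login server is <name>.azurecr.io (the Azure public cloud default) are all introduced here.

```bash
# Sketch only: remote-build an image in ACR from the project root, then roll it out.
# image_ref and build_and_deploy_backend are hypothetical helper names.

# Compose a fully qualified image reference for an ACR registry.
image_ref() {
  local registry="$1" repo="$2" tag="$3"
  printf '%s.azurecr.io/%s:%s\n' "$registry" "$repo" "$tag"
}

build_and_deploy_backend() {
  local registry="$1" resource_group="$2" app_name="$3" tag="$4"
  # Build context is "." (the project root); --file points at the Dockerfile inside it,
  # which is why the COPY paths in the Dockerfile are relative to the root.
  az acr build --registry "$registry" --image "octopets-backend:$tag" \
    --file backend/Dockerfile .
  # Point the Container App at the freshly built image.
  az containerapp update --name "$app_name" --resource-group "$resource_group" \
    --image "$(image_ref "$registry" octopets-backend "$tag")"
}

# Usage (all four arguments are placeholders):
#   build_and_deploy_backend <acr-name> "$OCTOPETS_RG_NAME" <backend-app-name> v1
```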

📚 Reference Documentation

🎭 ServiceNow Incident Automation Demo

The lab includes an optional demo that showcases automated incident management with ServiceNow:

What it demonstrates:

  • Azure Monitor detects memory leak or CPU stress in Octopets backend
  • ServiceNow incident automatically created via webhook
  • SRE Agent investigates using Log Analytics and metrics
  • GitHub issue created with root cause analysis
  • ServiceNow incident updated with resolution details
  • Microsoft Teams notifications sent to channels
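
In the lab, the incident is created by the Logic App defined in demos/ServiceNowAzureResourceHandler/servicenow-logic-app.bicep, whose internals are not shown here. To make the "incident automatically created" step concrete, the sketch below creates an incident directly through the ServiceNow Table API with curl. That the Logic App uses this same endpoint is an assumption, and the helper names and the urgency value are illustrative.

```bash
# Sketch only: create an incident through the ServiceNow Table API.
# incident_url and create_incident are hypothetical helper names.

# Build the Table API endpoint for an instance name such as dev12345.
incident_url() {
  printf 'https://%s.service-now.com/api/now/table/incident\n' "$1"
}

create_incident() {
  local short_description="$1"
  # The description is embedded in JSON as-is, so it must not contain double quotes.
  curl -sS -X POST "$(incident_url "$SERVICENOW_INSTANCE")" \
    -u "$SERVICENOW_USERNAME:$SERVICENOW_PASSWORD" \
    -H "Content-Type: application/json" \
    -H "Accept: application/json" \
    -d "{\"short_description\": \"$short_description\", \"urgency\": \"2\"}"
}

# Usage (after `source scripts/load-env.sh`):
#   create_incident "Octopets backend memory above threshold"
```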

Quick Start:

# 1. Sign up for ServiceNow developer instance (free)
# Visit: https://developer.servicenow.com/dev.do

# 2. Configure credentials in .env
scripts/set-dotenv-value.sh "SERVICENOW_INSTANCE" "dev12345"
scripts/set-dotenv-value.sh "SERVICENOW_USERNAME" "admin"
scripts/set-dotenv-value.sh "SERVICENOW_PASSWORD" "your-password"
scripts/set-dotenv-value.sh "INCIDENT_NOTIFICATION_EMAIL" "your-email@example.com"

# 3. Deploy Logic App webhook (writes SERVICENOW_WEBHOOK_URL into .env)
scripts/50-deploy-logic-app.sh

# 4. Deploy alert rules and action group
scripts/50-deploy-alert-rules.sh

# 5. Configure SRE Agent subagent (Azure Portal)
# Copy YAML from: demos/ServiceNowAzureResourceHandler/servicenow-subagent-simple.yaml

# 6. Run the demo
# See: demos/ServiceNowAzureResourceHandler/README.md for complete step-by-step instructions

Components:

  • 4 Azure Monitor Alert Rules: Memory (80%, 90%) and error rate (10, 50 per min) thresholds
  • Optional CPU alert: Deploy separately via scripts/65-deploy-cpu-alert.sh
  • ServiceNow Action Group: Webhook integration for incident creation
  • SRE Agent Subagent: Automated investigation and remediation workflow with Teams notifications
  • Expected Duration: 5-15 minutes end-to-end

Documentation: demos/ServiceNowAzureResourceHandler/README.md

Testing Scenarios

The Octopets backend supports two independent stress testing scenarios:

Memory Stress Testing (allocates 1 GB of memory):

# Enable memory stress
./scripts/63-enable-memory-errors.sh

# Generate traffic to trigger allocation
./scripts/60-generate-traffic.sh 20

# Disable after testing
./scripts/64-disable-memory-errors.sh

CPU Stress Testing (burns CPU for 500ms per request):

# Enable CPU stress
./scripts/61-enable-cpu-stress.sh

# Generate traffic to trigger CPU burn
./scripts/60-generate-traffic.sh 50

# Disable after testing
./scripts/62-disable-cpu-stress.sh

Combined Testing (both scenarios simultaneously):

# Enable both flags
./scripts/63-enable-memory-errors.sh
./scripts/61-enable-cpu-stress.sh

# Generate traffic
./scripts/60-generate-traffic.sh 30

# Disable both
./scripts/64-disable-memory-errors.sh
./scripts/62-disable-cpu-stress.sh

These scenarios are useful for:

  • Testing Azure Monitor alert rules
  • Validating SRE Agent anomaly detection
  • Demonstrating auto-remediation workflows
  • Training on incident response

πŸ₯ Azure Health Check Demo

The lab includes an autonomous health monitoring demo that uses statistical analysis to detect anomalies and send intelligent alerts to Microsoft Teams:

What it demonstrates:

  • Scheduled health checks (daily, every 6h, every 12h) across multiple Azure resource types
  • Statistical anomaly detection using Median Absolute Deviation (MAD) and z-score analysis
  • Cost anomaly detection (>50% spike vs 7-day average)
  • Azure Advisor recommendations integration (security, performance, cost, reliability)
  • Resource dependency health monitoring
  • Week-over-week trend analysis
  • Auto-remediation suggestions based on detected anomalies
  • Microsoft Teams notifications with rich Adaptive Cards

Supported Resource Types:

  • Azure Container Apps
  • Virtual Machines
  • Azure Kubernetes Service (AKS)
  • App Service (Web Apps, Function Apps)

Detection Methods:

  • Statistical Analysis: MAD/z-score ≥3 for metrics over 24h window
  • Cost Monitoring: Daily cost spikes >50% vs 7-day average
  • Azure Advisor: High/Critical recommendations
  • Dependency Health: Degraded/Unavailable linked resources
  • Week-over-Week: 30% performance degradation vs same time last week
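
To show what a MAD-based score looks like in practice, here is a minimal sketch of a robust z-score. It is an illustration of the general technique, not the subagent's actual computation, which this README does not show. The constant 0.6745 is the usual factor that makes MAD comparable to a standard deviation for normally distributed data; the function names and the threshold check are assumptions based on the "≥3" rule above.

```bash
# Sketch only: robust z-score using the Median Absolute Deviation (MAD).

# median: read numbers (one per line) on stdin, print their median.
median() {
  sort -n | awk '{ v[NR] = $1 }
    END {
      if (NR == 0) exit 1
      if (NR % 2) print v[(NR + 1) / 2]
      else print (v[NR / 2] + v[NR / 2 + 1]) / 2
    }'
}

# robust_zscore CURRENT: baseline samples on stdin, score for CURRENT on stdout.
robust_zscore() {
  local current="$1" samples med mad
  samples="$(cat)"
  med="$(printf '%s\n' "$samples" | median)"
  # MAD = median of the absolute deviations from the median.
  mad="$(printf '%s\n' "$samples" |
    awk -v m="$med" '{ d = $1 - m; print (d < 0 ? -d : d) }' | median)"
  awk -v x="$current" -v m="$med" -v mad="$mad" 'BEGIN {
    if (mad == 0) { print "MAD is zero; score undefined" > "/dev/stderr"; exit 1 }
    printf "%.2f\n", 0.6745 * (x - m) / mad
  }'
}

# is_anomaly SCORE: succeeds when |SCORE| >= 3.
is_anomaly() {
  awk -v z="$1" 'BEGIN { a = (z < 0 ? -z : z); exit (a >= 3) ? 0 : 1 }'
}

# Usage: baseline values in MB, current value 1200
#   printf '500\n510\n520\n530\n540\n' | robust_zscore 1200
```

Because the median and MAD are barely moved by a single extreme sample, one earlier spike in the 24h window does not inflate the baseline the way a mean and standard deviation would.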

Quick Start:

# 1. Create Teams webhook via Power Automate
# Follow: demos/AzureHealthCheck/README.md (Power Automate Setup)

# 2. Configure webhook URL in .env
scripts/set-dotenv-value.sh "TEAMS_WEBHOOK_URL" "https://prod-xx.logic.azure.com:443/workflows/..."

# 3. Test webhook connectivity
scripts/70-test-teams-webhook.sh
scripts/71-send-sample-anomaly.sh

# 4. Upload subagent to Azure Portal
# Navigate to: Azure Portal → rg-sre-agent-lab → sre-agent-lab → Subagent Builder
# Upload: demos/AzureHealthCheck/azurehealthcheck-subagent-simple.yaml
# Trigger: Scheduled (cron: 0 0 * * * for daily at midnight)

# 5. Configure Teams connector in SRE Agent
# Navigate to: Connectors → Add Microsoft Teams
# Name: AzureHealthAlerts
# Webhook URL: (from .env TEAMS_WEBHOOK_URL)

# 6. Test manual execution
# Click "Run Now" in subagent → Monitor Execution History

Teams Message Features:

  • Alert severity badges (Critical/High/Medium) with color coding
  • Resource details (type, name, resource group, location, health status)
  • Anomaly metrics with z-scores, baselines, min/max values, trend indicators (↑↓→)
  • Week-over-week change percentages
  • Top 3 Azure Advisor recommendations with categories
  • Dependency health status for linked resources
  • Analysis summary (root cause hypothesis, impact assessment, recommended actions)
  • Auto-remediation options (scale-up/out, rightsizing, rollback suggestions)
  • Action buttons (View in Portal, Metrics Dashboard, Logs, Cost Analysis, Advisor Recommendations)

Adaptive Card Format:

{
  "type": "AdaptiveCard",
  "version": "1.4",
  "body": [
    {
      "type": "TextBlock",
      "text": "🔴 Critical Alert: Container App Memory Anomaly",
      "weight": "Bolder",
      "size": "Large",
      "color": "Attention"
    },
    {
      "type": "FactSet",
      "facts": [
        {"title": "Resource", "value": "octopetsapi"},
        {"title": "Z-Score", "value": "4.2"},
        {"title": "Current", "value": "1.2 GB"},
        {"title": "Baseline", "value": "512 MB"}
      ]
    }
  ]
}
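
A card like the one above can be sent to the webhook with curl. The sketch below wraps the card in a message envelope with an Adaptive Card attachment, which is a common shape for Teams workflows. Whether the flow behind TEAMS_WEBHOOK_URL expects this envelope is an assumption, since a Power Automate HTTP trigger accepts whatever schema the flow defines. The helper names are hypothetical; scripts/71-send-sample-anomaly.sh is the lab's own way to send a sample.

```bash
# Sketch only: wrap an Adaptive Card in a message envelope and POST it to the webhook.
# wrap_card and send_card are hypothetical helper names.

# Read card JSON on stdin, print the envelope on stdout.
wrap_card() {
  local card_json
  card_json="$(cat)"
  printf '{"type":"message","attachments":[{"contentType":"application/vnd.microsoft.card.adaptive","content":%s}]}\n' "$card_json"
}

# POST the wrapped card from a file to TEAMS_WEBHOOK_URL.
send_card() {
  local card_file="$1"
  wrap_card < "$card_file" |
    curl -sS -X POST "$TEAMS_WEBHOOK_URL" \
      -H "Content-Type: application/json" \
      --data-binary @-
}

# Usage (card.json holds the Adaptive Card shown above):
#   source scripts/load-env.sh && send_card card.json
```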

Monitoring Metrics:

  • Container Apps: Memory (WorkingSetBytes), CPU, Requests, Replicas
  • VMs: CPU %, Available Memory, Disk I/O, Network I/O
  • AKS: Node CPU/Memory %, Pod counts
  • App Service: Memory, CPU, HTTP 5xx errors, Response time
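
To inspect one of these metrics by hand, the Azure CLI can query Azure Monitor directly. This is a minimal sketch for the Container Apps WorkingSetBytes metric. The helper names are hypothetical, and the app name octopetsapi is taken from the sample card above, so treat it as an assumption about your deployment.

```bash
# Sketch only: list recent WorkingSetBytes values for a Container App.
# containerapp_id and working_set_last_hour are hypothetical helper names.

# Build the ARM resource ID of a Container App.
containerapp_id() {
  local sub="$1" rg="$2" app="$3"
  printf '/subscriptions/%s/resourceGroups/%s/providers/Microsoft.App/containerApps/%s\n' \
    "$sub" "$rg" "$app"
}

# Show per-minute average working set for the last hour as a table.
working_set_last_hour() {
  az monitor metrics list \
    --resource "$(containerapp_id "$AZURE_SUBSCRIPTION_ID" "$OCTOPETS_RG_NAME" "$1")" \
    --metric WorkingSetBytes \
    --interval PT1M \
    --aggregation Average \
    --offset 1h \
    --output table
}

# Usage (after `source scripts/load-env.sh`):
#   working_set_last_hour octopetsapi
```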

Scheduled Trigger Options:

  • Daily at midnight: 0 0 * * *
  • Every 6 hours: 0 */6 * * *
  • Every 12 hours: 0 */12 * * *
  • Business hours only (9 AM Mon-Fri): 0 9 * * 1-5

Documentation: demos/AzureHealthCheck/README.md

Configuration Variables:

# Teams Integration
TEAMS_WEBHOOK_URL="https://prod-xx.logic.azure.com:443/workflows/..."  # Power Automate webhook

Note: This uses the Power Automate "When a HTTP request is received" trigger, not a traditional Teams Incoming Webhook.

🧪 Testing the Lab

1. Verify Octopets Application

Access the frontend at the URL from deployment output:

source scripts/load-env.sh
echo "Frontend: $OCTOPETS_FE_URL"
echo "Backend: $OCTOPETS_API_URL"

2. Configure SRE Agent

  1. Navigate to Azure Portal → Resource Groups → rg-sre-agent-lab → sre-agent-lab
  2. Configure Azure Monitor as the incident platform
  3. Set up workflows and monitoring rules
  4. Start in Review mode before switching to Autonomous

3. Create Test Alert

  1. Create an Azure Monitor alert rule in rg-octopets-lab
  2. Trigger the alert (e.g., CPU threshold)
  3. Verify the SRE Agent ingests the incident
  4. Confirm the agent can diagnose and suggest remediation with Contributor permissions
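
Step 1 can also be done from the command line. The sketch below creates a metric alert on the backend's working set rather than CPU, because WorkingSetBytes is the Container Apps metric named elsewhere in this README. The alert name, the 80% figure, and the memory limit you pass in are all assumptions to adapt, and the helper names are hypothetical.

```bash
# Sketch only: create a memory alert on a Container App via the Azure CLI.
# pct_of and create_memory_alert are hypothetical helper names.

# pct_of PERCENT BYTES: integer percentage of a byte count.
pct_of() {
  echo $(( $2 * $1 / 100 ))
}

# create_memory_alert APP_RESOURCE_ID MEMORY_LIMIT_BYTES
create_memory_alert() {
  local app_id="$1" limit_bytes="$2" threshold
  threshold="$(pct_of 80 "$limit_bytes")"
  az monitor metrics alert create \
    --name "octopets-memory-test" \
    --resource-group "$OCTOPETS_RG_NAME" \
    --scopes "$app_id" \
    --condition "avg WorkingSetBytes > $threshold" \
    --window-size 5m \
    --evaluation-frequency 1m \
    --severity 2 \
    --description "Test alert: working set above 80% of the assumed memory limit"
}

# Usage (1073741824 bytes = 1 GiB, an assumed container memory limit):
#   create_memory_alert "<container-app-resource-id>" 1073741824
```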

🔐 Security Considerations

  • SRE Agent has Contributor access only to rg-octopets-lab, not the entire subscription
  • Managed identity follows least-privilege principle
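  • Azure access uses Azure-native authentication (Azure CLI login and the agent's managed identity), so no Azure keys are stored in .env. The optional demos do keep ServiceNow credentials and the Teams webhook URL in .env, which is why it must never be committed.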

🧹 Cleanup

To delete all lab resources:

source scripts/load-env.sh

# Delete resource groups
az group delete -n rg-octopets-lab --yes --no-wait
az group delete -n rg-sre-agent-lab --yes --no-wait
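
The commands above use the default group names. If your .env holds different names (for example, after a fresh deployment with timestamped names), a variant that reads them from the environment avoids deleting the wrong group. This is a minimal sketch; cleanup_lab and the DRY_RUN switch are hypothetical.

```bash
# Sketch only: delete the resource groups named in .env instead of hardcoded names.
# cleanup_lab and DRY_RUN are hypothetical.
cleanup_lab() {
  local rg
  for rg in "${OCTOPETS_RG_NAME:-}" "${SRE_AGENT_RG_NAME:-}"; do
    # Skip variables that are unset or empty.
    [ -n "$rg" ] || continue
    if [ "${DRY_RUN:-0}" = "1" ]; then
      echo "would delete $rg"
    else
      az group delete -n "$rg" --yes --no-wait
    fi
  done
}

# Usage:
#   source scripts/load-env.sh
#   DRY_RUN=1 cleanup_lab   # preview
#   cleanup_lab             # delete
```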

🍱 Grocery SRE Demo: Container Apps inventory + cleanup

The Grocery SRE demo typically uses a dedicated resource group (commonly rg-grocery-sre-demo) and deploys several Azure Container Apps.

Keep (core app)

  • ca-api-* (example: ca-api-pu3vvmgkrke3q) - Grocery API (externally reachable)
  • ca-web-* (example: ca-web-pu3vvmgkrke3q) - Grocery Web (externally reachable)

Keep (if you use Grafana MCP from outside Azure / from MCP clients)

  • ca-mcp-amg-proxy - Managed Identity + Streamable HTTP MCP endpoint (externally reachable, path /mcp)

Optional (only if you still use Loki-backed dashboards/log queries)

  • ca-loki - Loki service (externally reachable); deleting this breaks the Loki datasource + any Loki panels until redeployed

Safe to delete (optional / debug / one-off runners)

  • ca-mcp-amg - stdio-only MCP “pivot” (no ingress); not required if you use ca-mcp-amg-proxy
  • ca-mcp-amg-debug - troubleshooting variant of ca-mcp-amg
  • ca-amg-loki-q-* - one-off Loki query runner apps created by the query script

Delete examples:

# Remove stdio pivot + debug apps
az containerapp delete -g rg-grocery-sre-demo -n ca-mcp-amg -y
az containerapp delete -g rg-grocery-sre-demo -n ca-mcp-amg-debug -y

# Remove an old one-shot Loki query runner (example name)
az containerapp delete -g rg-grocery-sre-demo -n ca-amg-loki-q-260130095250 -y

# OPTIONAL: remove Loki (only if you don't need Loki dashboards/log queries)
az containerapp delete -g rg-grocery-sre-demo -n ca-loki -y
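
When many one-off query runners have accumulated, deleting them by name one at a time gets tedious. The sketch below lists the Container Apps in the resource group, keeps only names with the ca-amg-loki-q- prefix, and deletes each one. The helper names are hypothetical, and it is worth running the list and filter part alone first to review what would be removed.

```bash
# Sketch only: delete every one-off Loki query runner in a resource group.
# select_query_runners and delete_query_runners are hypothetical helper names.

# Keep only app names that start with the query-runner prefix.
select_query_runners() {
  grep '^ca-amg-loki-q-' || true
}

delete_query_runners() {
  local rg="$1" name
  az containerapp list -g "$rg" --query "[].name" -o tsv |
    select_query_runners |
    while read -r name; do
      az containerapp delete -g "$rg" -n "$name" -y
    done
}

# Preview:
#   az containerapp list -g rg-grocery-sre-demo --query "[].name" -o tsv | select_query_runners
# Delete:
#   delete_query_runners rg-grocery-sre-demo
```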

📝 Notes

  • Region: Sweden Central is one of 4 regions supporting Azure SRE Agent (eastus2, swedencentral, uksouth, australiaeast)
  • Access Model: High access provides full remediation capabilities but is scoped to a single resource group
  • Incident Platform: Azure Monitor integration requires additional configuration in the SRE Agent portal

🀝 Contributing

This is a lab environment. For issues or improvements:

  1. Check the reference repositories for upstream updates
  2. Review specs/specs.md for design decisions
  3. Test changes in a non-production subscription

📄 License

This lab environment references:


Last Updated: December 2025
Lab Version: 1.0
Supported Regions: swedencentral (configured), eastus2, uksouth, australiaeast
