Reduce noise from MachineOutOfCompliance alerts by dustman9000 · Pull Request #2604 · openshift/managed-cluster-config

dustman9000 · 2025-12-06T00:05:29Z

Summary

This PR splits the MachineOutOfComplianceSRE alert into two separate alerts to reduce noise while maintaining visibility into compliance-monkey failures.

Problem

The current alert fires when any machine is more than 28 days old. However, when many machines age out simultaneously (e.g., all created during management cluster (MC) provisioning), compliance-monkey replaces them at a rate of 1 machine per 15 minutes. This creates a queue in which some machines exceed 28 days while waiting their turn, triggering critical alerts even though the automation is working correctly.

Solution

Two-tier alert system:

MachineOutOfComplianceSRE (critical, >35 days, 1h for):
- Fires when ANY machine exceeds 35 days old
- Indicates a clear compliance-monkey failure requiring immediate attention
- 7 days of buffer beyond the 28-day threshold allows adequate time for queue processing
MachineOutOfComplianceSREWarning (warning, >5 machines >28 days, 4h for):
- Fires when multiple machines (>5) are >28 days old for 4+ hours
- Indicates a queue backup that warrants monitoring
- Expected behavior when many machines age out simultaneously
- Provides visibility without generating critical pages
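
As a rough illustration of the two tiers described above, the sketch below evaluates both conditions against a snapshot of machine ages. It is not the PromQL in this PR: the function name and its input are hypothetical, and the `for` durations (1h and 4h) are not modeled.

```python
DAY = 24 * 60 * 60  # seconds in one day

CRITICAL_AGE = 35 * DAY      # 3024000 s, the new critical threshold
WARNING_AGE = 28 * DAY       # 2419200 s, the original threshold
WARNING_MIN_MACHINES = 5     # warning needs MORE than this many machines


def evaluate_alerts(machine_ages_seconds):
    """Return (critical, warning) for a snapshot of machine ages in seconds.

    Hypothetical helper for illustration only; it ignores the `for` clauses.
    """
    # Critical: any single machine older than 35 days.
    critical = any(age > CRITICAL_AGE for age in machine_ages_seconds)
    # Warning: more than 5 machines older than 28 days.
    over_28_days = sum(1 for age in machine_ages_seconds if age > WARNING_AGE)
    warning = over_28_days > WARNING_MIN_MACHINES
    return critical, warning
```

Under this model, six machines at 29 days raise only the warning, while a single machine at 36 days raises only the critical alert.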

Benefits

- Reduced noise: Normal queue processing no longer triggers critical alerts
- Better signal-to-noise: Critical alerts reserved for true failures (>35 days)
- Maintained visibility: Warning alerts for queue backlogs provide awareness
- Clear escalation path: Warning → Critical based on severity and duration

Testing

- Alert logic validated against current MC cluster state showing 72 machines >21 days old
- At 1 machine/15min, queue processes within 18 hours, well under the 35-day threshold
- Warning alert would fire for large backlogs, critical only for stuck machines
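
The 18-hour figure can be checked with a few lines of arithmetic. This is a sketch that assumes all 72 machines are queued at the same moment and are replaced strictly one per 15 minutes; `queue_drain_time` is a hypothetical helper, not part of compliance-monkey.

```python
from datetime import timedelta


def queue_drain_time(machines, interval=timedelta(minutes=15)):
    """Time to replace `machines` machines at one replacement per `interval`."""
    return machines * interval


drain = queue_drain_time(72)           # 72 machines from the cluster snapshot
buffer = timedelta(days=35 - 28)       # gap between the 28-day and 35-day thresholds

print(drain)            # time needed to work through the queue
print(drain < buffer)   # whether the queue drains inside the 7-day buffer
```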

References

Original alert: https://issues.redhat.com/browse/OSD-17905
compliance-monkey workload: https://issues.redhat.com/browse/OSD-17902

Split the MachineOutOfComplianceSRE alert into two separate alerts to reduce noise while maintaining visibility:

1. MachineOutOfComplianceSRE (critical): Fires when ANY machine is >35 days old, indicating a clear compliance-monkey failure that requires immediate attention.
2. MachineOutOfComplianceSREWarning (warning): Fires when >5 machines are >28 days old for 4+ hours, indicating a queue backup that may warrant monitoring but is expected behavior when many machines age out simultaneously.

This change addresses the issue where compliance-monkey processes machines at 1 per 15 minutes, causing some machines to exceed 28 days while waiting in the normal replacement queue. The new thresholds provide better signal-to-noise ratio for on-call responders.

openshift-ci · 2025-12-06T00:05:56Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dustman9000


Needs approval from an approver in each of these files:

~~deploy/sre-prometheus/OWNERS~~ [dustman9000]
~~hack/OWNERS~~ [dustman9000]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

openshift-ci · 2025-12-06T00:09:55Z

@dustman9000: all tests passed!


joshbranham · 2025-12-08T21:47:20Z

deploy/sre-prometheus/management-cluster/100-machine-out-of-compliance.PrometheusRule.yaml

-      expr: (time() - mapi_machine_created_timestamp_seconds) > 2419200
-      for: 60m
+      # Fires when ANY machine exceeds 35 days old, indicating compliance-monkey failed to replace it.
+      expr: (time() - mapi_machine_created_timestamp_seconds) > 3024000


Isn't there a requirement to replace machines older than 28 days? If so, an alert that only fires at 35 days means we are already out of compliance by the time it fires.

joshbranham · 2025-12-19T21:56:44Z

We now have jitter in compliance-monkey, so machines that share the same creation time won't actually be phased out at the same time. We may therefore not need these changes anymore, but I will defer to you 👍

openshift-ci bot requested review from boranx and rogbas December 6, 2025 00:05

openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Dec 6, 2025

joshbranham reviewed Dec 8, 2025

dustman9000 wants to merge 1 commit into openshift:master from dustman9000:reduce-machine-compliance-alert-noise