Reduce noise from MachineOutOfCompliance alerts #2604
dustman9000 wants to merge 1 commit into openshift:master
Split the MachineOutOfComplianceSRE alert into two separate alerts to reduce noise while maintaining visibility:

1. MachineOutOfComplianceSRE (critical): Fires when ANY machine is >35 days old, indicating a clear compliance-monkey failure that requires immediate attention.
2. MachineOutOfComplianceSREWarning (warning): Fires when >5 machines are >28 days old for 4+ hours, indicating a queue backup that may warrant monitoring but is expected behavior when many machines age out simultaneously.

This change addresses the issue where compliance-monkey processes machines at 1 per 15 minutes, causing some machines to exceed 28 days while waiting in the normal replacement queue. The new thresholds provide a better signal-to-noise ratio for on-call responders.
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dustman9000

@dustman9000: all tests passed!
expr: (time() - mapi_machine_created_timestamp_seconds) > 2419200
for: 60m
# Fires when ANY machine exceeds 35 days old, indicating compliance-monkey failed to replace it.
expr: (time() - mapi_machine_created_timestamp_seconds) > 3024000
Isn't there a requirement to replace machines older than 28 days? If so, firing at 35 days means we are already out of compliance by the time the alert fires.
We now have jitter introduced in compliance-monkey, so machines that share a creation time won't all be phased out at the same time. We may therefore not need these changes anymore, but I will defer to you 👍
Summary
This PR splits the MachineOutOfComplianceSRE alert into two separate alerts to reduce noise while maintaining visibility into compliance-monkey failures.
Problem
The current alert fires when any machine is >28 days old. However, when many machines age out simultaneously (e.g., all created during MC provisioning), compliance-monkey processes them at 1 machine per 15 minutes. This creates a queue where some machines exceed 28 days while waiting their turn, triggering critical alerts even though the automation is working correctly.
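To make the queueing effect concrete, here is a rough back-of-the-envelope estimate. It assumes, purely for illustration, that $N$ machines cross the 28-day mark at the same instant and that compliance-monkey replaces them strictly one at a time at the stated rate. $N$, $k$, $t_k$, and $T$ are symbols introduced here, not names from the PR.

```latex
% Assumed model: one replacement every 15 minutes, started when the batch ages out.
% Approximate wait of the k-th machine in the queue, and time to drain the whole batch:
t_k \approx 15\,k \ \text{min}, \qquad T(N) \approx 15\,N \ \text{min} = \frac{N}{4}\ \text{h}

% Example: a batch of N = 40 machines
T(40) \approx \frac{40}{4}\ \text{h} = 10\ \text{h}

% Batch size needed for queueing alone to span the 7 days between 28 and 35 days:
N \approx 7\ \text{d} \times 24\ \tfrac{\text{h}}{\text{d}} \times 4\ \tfrac{\text{machines}}{\text{h}} = 672
```

Under these assumptions, even a moderately sized batch leaves some machines past 28 days for hours while the automation is healthy, whereas reaching 35 days through queueing alone would take a batch of several hundred machines.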
Solution
Two-tier alert system:
MachineOutOfComplianceSRE (critical, >35 days, 1h for):
MachineOutOfComplianceSREWarning (warning, >5 machines >28 days, 4h for):
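As a hedged sketch of what the warning tier could look like as a Prometheus rule: the expression below is reconstructed from the description above (>5 machines, >28 days, held for 4h) and is not copied from the PR diff, so the use of `count()` and the label layout are assumptions. The metric name and the 28-day threshold in seconds come from the existing rule.

```yaml
# Hypothetical sketch; the rule actually proposed in the PR may differ.
- alert: MachineOutOfComplianceSREWarning
  # 2419200 s = 28 days * 86400 s/day (the critical tier uses 3024000 s = 35 days).
  # count() returns how many machine series are older than 28 days;
  # the outer comparison keeps the result only when that number exceeds 5.
  expr: count((time() - mapi_machine_created_timestamp_seconds) > 2419200) > 5
  # Must hold continuously for 4 hours before the alert fires.
  for: 4h
  labels:
    severity: warning
```

Note that a bare `count()` aggregates over every series the rule evaluates; if the rule ran against metrics from more than one cluster, a `by (...)` clause on a suitable label would be needed, and which label that is would be an assumption here.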
Benefits
Testing
References