metrics: add PD_leader_service_stuck alert for etcd-leader without PD service#10321
liyishuai wants to merge 5 commits into tikv:master from …
Conversation
… service

When service_member_role drops to 0 and never recovers while etcd_server_is_leader stays stable, no existing alert fires:

- PD_leader_lease_drop_without_failover requires service_member_role == 1 at eval time and changes >= 2 (a persistent drop gives changes == 1, value = 0)
- PD_leader_change detects TSO-save handoff; with no active PD leader, no saves are emitted and the count stays below threshold
- All cluster-health alerts rely on pd_cluster_status/pd_regions_status, which are only emitted by the active PD leader

Add PD_leader_service_stuck (critical, for: 1m), which fires when the etcd leader node's PD service layer is not serving as PD leader. Normal failovers are naturally excluded: when etcd leadership transfers, the departing node's etcd_server_is_leader drops to 0, making the join condition false without any extra suppression logic.

Add two promtool unit tests:

- pd-leader-service-stuck: positive case, fires after 1m of stuck state
- pd-leader-service-stuck-suppressed-by-failover: normal failover with matching etcd+PD leadership transfer stays silent

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Yishuai Li <yishuai.li@pingcap.com>
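The commit message describes the rule's semantics but not its expression. As a hedged illustration only, the described join could take a shape like the following; the actual expression, label matchers, and label set live in metrics/alertmanager/pd.rules.yml and may well differ:

```yaml
# Illustrative sketch, not the committed rule: the exact expression and
# label matchers in metrics/alertmanager/pd.rules.yml may differ.
- alert: PD_leader_service_stuck
  expr: etcd_server_is_leader{job="pd"} == 1 and on(instance) service_member_role{job="pd", service="PD"} == 0
  for: 1m
  labels:
    level: critical
  annotations:
    summary: PD leader service is stuck in non-leader state
```

Under this shape, when etcd leadership transfers, the left-hand side drops to 0 on the departing node, so the `and` join produces no sample there; that is what keeps normal failovers silent without extra suppression logic.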
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull request has been approved by: … The full list of commands accepted by this bot can be found here. Needs approval from an approver in each of these files. Approvers can indicate their approval by writing …
Hi @liyishuai. Thanks for your PR. I'm waiting for a tikv member to verify that this patch is reasonable to test. If it is, they should reply with … Once the patch is verified, the new status will be reflected by the … label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
No actionable comments were generated in the recent review. 🎉
📝 Walkthrough

Adds a new Prometheus alert rule PD_leader_service_stuck that fires when an embedded etcd leader's PD service layer is non-leader for 1m, and adds multiple alert-test scenarios covering trigger, normal failover suppression, and staggered failover (some scenarios duplicated).
Pull request overview
Adds a new PD alert to detect a previously silent failure mode where the embedded etcd leader remains stable but the PD service layer is stuck in a non-leader state (no PD leader serving requests).
Changes:
- Add PD_leader_service_stuck (critical, for: 1m) alert rule based on service_member_role == 0 on the etcd leader.
- Add promtool unit tests covering both the positive (stuck) case and a failover-suppressed (negative) case.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| metrics/alertmanager/pd.rules.yml | Adds the new PD_leader_service_stuck alert rule and associated labels/annotations. |
| tests/alertmanager/pd.rules.test.yml | Adds promtool tests validating the alert fires for persistent PD service loss and stays silent during normal failover. |
Signed-off-by: Yishuai Li <yishuai.li@pingcap.com> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
🧹 Nitpick comments (1)
tests/alertmanager/pd.rules.test.yml (1)
145-163: Assert suppression during the failover window, not only at 12m.
`exp_alerts: []` here only proves the alert is absent long after the handoff. It would still miss a transient page around minute 5-6, which is the exact behavior the `for: 1m` in `metrics/alertmanager/pd.rules.yml:156-169` is meant to suppress. Please add evals near the transition, and ideally one staggered handoff case where `service_member_role` goes to `0` slightly before `etcd_server_is_leader` flips, so the no-false-positive failover behavior is actually locked in.

Example of tightening this test:

```diff
 alert_rule_test:
+  - eval_time: 5m45s
+    alertname: PD_leader_service_stuck
+    exp_alerts: []
   - eval_time: 12m
     alertname: PD_leader_service_stuck
     exp_alerts: []
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@tests/alertmanager/pd.rules.test.yml` around lines 145-163: the test currently only checks at eval_time 12m; add additional alert_rule_test entries that evaluate during and immediately after the failover window (e.g., around 5m, 5m30s, 6m) to assert suppression while the handoff is occurring; include at least one staggered handoff scenario where the series 'service_member_role{job="pd",service="PD",instance="pd-1"}' flips to 0 slightly before 'etcd_server_is_leader{job="pd",instance="pd-1"}' flips (and the corresponding pd-2 series flips on), and set exp_alerts: [] for those evals to ensure PD_leader_service_stuck (and the rule's for: 1m behavior) does not fire during the transient.
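The staggered-handoff scenario the reviewer asks for could look roughly like this in promtool's unit-test format; the series values, timings, and label sets below are illustrative assumptions, not the committed fixture:

```yaml
# Illustrative promtool scenario (not the committed fixture): pd-1's PD
# role drops 30s before etcd leadership moves, and the alert stays silent.
- interval: 15s
  input_series:
    - series: 'etcd_server_is_leader{job="pd",instance="pd-1"}'
      values: '1x22 0x26'     # etcd leadership leaves pd-1 at ~5m30s
    - series: 'etcd_server_is_leader{job="pd",instance="pd-2"}'
      values: '0x22 1x26'
    - series: 'service_member_role{job="pd",service="PD",instance="pd-1"}'
      values: '1x20 0x28'     # PD role drops at ~5m, 30s earlier
  alert_rule_test:
    - eval_time: 5m15s        # inside the transient window
      alertname: PD_leader_service_stuck
      exp_alerts: []
    - eval_time: 12m          # long after the handoff
      alertname: PD_leader_service_stuck
      exp_alerts: []
```

The eval at 5m15s is the one that actually exercises the transient: at a 15s interval the 30s stagger yields only two consecutive true evaluations, short of the four that `for: 1m` requires.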
…_service_stuck

Add alert_rule_test entries at 5m, 5m30s, and 6m to the instant-failover suppression test to assert the rule stays silent while and immediately after the handoff occurs.

Add pd-leader-service-stuck-staggered-failover: pd-1's service_member_role flips to 0 thirty seconds before etcd_server_is_leader transfers. The transient window (2 evaluations × 15s = 30s) is shorter than for: 1m's required 4 consecutive evaluations, so the alert must not fire.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Yishuai Li <yishuai.li@pingcap.com>
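The evaluation arithmetic in this commit message can be sanity-checked with a small sketch. This mirrors the reasoning above, not any Prometheus internals; the function and constant names are illustrative:

```python
# Sketch of the pending/firing bookkeeping behind `for: 1m` at a 15s
# evaluation interval; this mirrors the commit's reasoning, not Prometheus code.
EVAL_INTERVAL_S = 15
FOR_DURATION_S = 60  # for: 1m

def fires(true_window_s: int) -> bool:
    """Does a condition that stays true for `true_window_s` reach firing?"""
    consecutive_true_evals = true_window_s // EVAL_INTERVAL_S
    required_evals = FOR_DURATION_S // EVAL_INTERVAL_S  # 4 consecutive evals
    return consecutive_true_evals >= required_evals

print(fires(30))  # staggered-handoff transient: only 2 of the required 4 evals
print(fires(60))  # persistent stuck state: reaches the 1m boundary
```

This is why the staggered test must stay silent while the positive test fires exactly at the for: 1m boundary.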
## Test Report: PD_leader_service_stuck Alert

Date: 2026-03-10

### Alert Overview

This alert fires when the embedded etcd leader node's PD service layer is not serving as PD leader for more than 1 minute: a silent failure mode where the cluster has no PD leader yet no existing alert triggers.

Expression: …

### Test Environment

Required Components: …

### Failpoints Used

### Test Procedure

#### Step 1: Build PD with Failpoint Support

```shell
cd /path/to/pd
make pd-server-failpoint
```

#### Step 2: Start a 3-Node PD Cluster

```shell
# Clean up any existing cluster
pkill -9 -f "pd-server|tiup.*playground"
rm -rf ~/.tiup/data/alert-test

# Start 3-node cluster with failpoint-enabled binary
tiup playground --pd 3 --pd.binpath $(pwd)/bin/pd-server \
  --kv 0 --db 0 --tiflash 0 --without-monitor \
  --tag alert-test > /tmp/tiup-playground.log 2>&1 &

# Wait for cluster to stabilize (20-30 seconds)
sleep 25
```

Verify cluster and identify leader:

```shell
# Check all nodes' metrics to find leader
for port in 2379 2382 2384; do
  echo "=== Port $port ==="
  curl -s "http://127.0.0.1:$port/metrics" | \
    grep -E "^etcd_server_is_leader|^service_member_role"
done
```

Expected output (pd-1 at port 2382 is leader):

#### Step 3: Configure Prometheus

Create `/tmp/prometheus.yml`:

```yaml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
rule_files:
  - /path/to/pd/metrics/alertmanager/pd.rules.yml
scrape_configs:
  - job_name: pd
    static_configs:
      - targets: ["127.0.0.1:2379", "127.0.0.1:2382", "127.0.0.1:2384"]
    relabel_configs:
      - source_labels: [__address__]
        target_label: instance
        regex: '127.0.0.1:2379'
        replacement: 'pd-0'
      - source_labels: [__address__]
        target_label: instance
        regex: '127.0.0.1:2382'
        replacement: 'pd-1'
      - source_labels: [__address__]
        target_label: instance
        regex: '127.0.0.1:2384'
        replacement: 'pd-2'
```

Start Prometheus:

```shell
prometheus --config.file=/tmp/prometheus.yml \
  --storage.tsdb.path=/tmp/prometheus-data \
  > /tmp/prometheus.log 2>&1 &

# Wait for data collection
sleep 15
```

#### Step 4: Trigger the Alert

```shell
#!/usr/bin/env bash
set -euo pipefail

# PD leader (adjust if different node is leader)
PD_LEADER="http://127.0.0.1:2382"
FP_SKIP="github.com/tikv/pd/pkg/election/skipGrantLeader"
FP_EXIT="github.com/tikv/pd/server/exitCampaignLeader"

# Get leader's member_id
MEMBER_ID=$(curl -s "$PD_LEADER/pd/api/v1/leader" | \
  python3 -c "import json,sys; print(json.load(sys.stdin)['member_id'])")

echo "=== Leader member_id: $MEMBER_ID ==="
curl -s "$PD_LEADER/metrics" | grep -E "^etcd_server_is_leader|^service_member_role"

# Block ALL nodes from campaigning (pause = indefinite block)
echo ""
echo "Setting skipGrantLeader (pause) on all nodes..."
for port in 2379 2382 2384; do
  curl -s -X PUT "http://127.0.0.1:$port/pd/api/v1/fail/$FP_SKIP" -d 'pause'
done
echo "Done"

# Force leader to step down
echo ""
echo "Triggering exitCampaignLeader for member $MEMBER_ID..."
curl -s -X PUT "$PD_LEADER/pd/api/v1/fail/$FP_EXIT" \
  -d "return(\"$MEMBER_ID\")"
echo "Done"

# Verify stuck state
sleep 2
echo ""
echo "=== Stuck state achieved ==="
curl -s "$PD_LEADER/metrics" | grep -E "^etcd_server_is_leader|^service_member_role"

# Hold stuck state for >1m to trigger alert
echo ""
echo "Holding stuck state for 70s (alert fires after 60s)..."
sleep 70

echo ""
echo "=== Final state ==="
curl -s "$PD_LEADER/metrics" | grep -E "^etcd_server_is_leader|^service_member_role"
```

#### Step 5: Verify Alert Firing

Check via Prometheus UI:

```shell
# On macOS
open http://localhost:9090/alerts

# On Linux
xdg-open http://localhost:9090/alerts
```

Look for the PD_leader_service_stuck alert.

Check via Prometheus API:

```shell
curl -s http://127.0.0.1:9090/api/v1/alerts | python3 -c "
import json, sys
data = json.load(sys.stdin)
alerts = [a for a in data['data']['alerts']
          if a['labels']['alertname'] == 'PD_leader_service_stuck']
for a in alerts:
    print(f\"Instance: {a['labels']['instance']}\")
    print(f\"State: {a['state']}\")
    print(f\"Value: {a['annotations']['value']}\")
"
```

Expected output:

### Observed Behavior

When the stuck condition is triggered:

Important note: …

### Alert State Progression

### Test Coverage (promtool unit tests)

Run all tests:

```shell
promtool test rules tests/alertmanager/pd.rules.test.yml
# → SUCCESS
```

All 7 test groups pass (15 total …

### Clean Up

```shell
# Stop cluster and Prometheus
pkill -9 -f "pd-server|tiup.*playground|prometheus"
rm -rf ~/.tiup/data/alert-test /tmp/prometheus-data
```

### References
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@tests/alertmanager/pd.rules.test.yml`:
- Around line 191-206: Add an explicit sample at the 5m15s boundary inside the
alert_rule_test block to pin the for:1m suppression window: insert an entry with
eval_time: 5m15s, alertname: PD_leader_service_stuck and exp_alerts: [] so the
test exercises the transient window itself (this ensures rules shortened to
for:15s would be caught); update the sequence near the existing eval_time: 5m
and 5m30s entries in the alert_rule_test case.
Signed-off-by: Yishuai Li <yishuai.li@pingcap.com> Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@tests/alertmanager/pd.rules.test.yml`:
- Around line 129-143: The test for PD_leader_service_stuck in alert_rule_test
currently only checks that the alert eventually fires; update the test to assert
the exact 1m "for" boundary by adding a no-alert check just before the 1-minute
boundary and a firing check at the 1-minute mark: under the
PD_leader_service_stuck case (eval_time/exp_alerts) add an eval_time entry
(e.g., 59s or 59s since start) asserting no alerts with the same labels, and
another eval_time entry at 1m asserting the alert is firing with
exp_labels/exp_annotations as shown, so the test enforces the 1m hold time
precisely.
- Around line 155-156: The fixture currently includes a series for
service_member_role with instance="pd-2" and many leading zeros; instead model
this metric as absent until pd-2 takes leadership by removing that series (or
replacing it with an absent/empty series) so the metric is truly omitted before
takeover; apply the same change to the duplicate occurrence referenced in the
comment (the other service_member_role{instance="pd-2"} fixture).
…accuracy
Add eval_time: 3m45s (PENDING, no alert) and eval_time: 4m (exactly at
for:1m boundary, FIRING) to the pd-leader-service-stuck test so the suite
enforces the hold time precisely rather than only checking a late eval.
Remove the service_member_role{instance="pd-2"} series with leading zeros
from pd-leader-service-stuck-suppressed-by-failover and
pd-leader-service-stuck-staggered-failover. In practice the metric is only
initialized when a node first wins PD leadership (server.go:1797 Set(1));
followers that have never been leader do not export it at all. Keeping the
series absent before takeover makes the fixture match the real metric
cardinality.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Yishuai Li <yishuai.li@pingcap.com>
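The fixture change above leans on how PromQL joins treat absent series: a series with no samples simply contributes nothing to a vector match, so modeling a never-was-leader follower as absent (rather than padding it with zeros) matches both the real exporter behavior and the join semantics. A sketch, under the assumption that the rule joins the two metrics on `instance` (the committed expression may differ):

```
# Assumed join shape: if service_member_role is absent for an instance,
# `and on(instance)` produces no result there -- absence behaves as
# "condition not met", never as 0 and never as a match.
etcd_server_is_leader == 1
  and on(instance)
service_member_role{service="PD"} == 0
```

Under this shape, removing the zero-padded pd-2 series cannot change any test verdict; it only makes the fixture's cardinality honest.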
Codecov Report

✅ All modified and coverable lines are covered by tests.

Additional details and impacted files:

```
@@            Coverage Diff             @@
##           master   #10321      +/-   ##
==========================================
+ Coverage   78.78%   78.84%   +0.06%
==========================================
  Files         527      527
  Lines       70916    70920       +4
==========================================
+ Hits        55870    55917      +47
+ Misses      11026    10986      -40
+ Partials     4020     4017       -3
```

Flags with carried forward coverage won't be shown.
✅ Actions performed: review triggered.
@coderabbitai resume
✅ Actions performed: reviews resumed.
```yaml
annotations:
  description: 'cluster: ENV_LABELS_ENV, instance: {{ $labels.instance }}, PD service is not the PD leader while being the embedded etcd leader; values:{{ $value }}'
  value: '{{ $value }}'
  summary: PD leader service is stuck in non-leader state
```
How about adding a summary to show the etcd leader is normal?
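One way to act on this suggestion (the wording below is purely illustrative, not a committed change):

```yaml
# Illustrative rewording only: makes explicit that etcd leadership is healthy.
summary: PD service is not the PD leader even though this node holds healthy etcd leadership
```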
What problem does this PR solve?
Issue Number: Close #10320
When service_member_role drops to 0 and never recovers, while etcd_server_is_leader stays stable and the process has not restarted, no existing alert fires. The cluster has no PD leader serving requests, but the entire alert suite is silent (see issue for the full per-rule analysis).
Add PD_leader_service_stuck (level: critical, for: 1m): fires when the etcd-leader node's PD service layer is not serving as PD leader for a sustained period. Normal failovers are naturally excluded: when etcd leadership transfers, the departing node's etcd_server_is_leader drops to 0, making the join condition false without any extra suppression logic.

Two promtool unit tests are added:

- pd-leader-service-stuck (positive): service drops at minute 3, stays down, fires at eval_time: 6m
- pd-leader-service-stuck-suppressed-by-failover (negative): pd-1 loses both PD and etcd leadership to pd-2; no alert fires

Check List
Tests
Code changes
Side effects
Release note
Summary by CodeRabbit
New Features
Tests