
metrics: add PD_leader_service_stuck alert for etcd-leader without PD service #10321

Open
liyishuai wants to merge 5 commits into tikv:master from liyishuai:metrics/pd-leader-service-stuck

Conversation

@liyishuai
Contributor

@liyishuai liyishuai commented Mar 10, 2026

What problem does this PR solve?

Issue Number: Close #10320

When service_member_role drops to 0 and never recovers — while etcd_server_is_leader stays stable and the process has not restarted — no existing alert fires. The cluster has no PD leader serving requests, but the entire alert suite is silent (see issue for the full per-rule analysis).

What is changed and how does it work?

Add PD_leader_service_stuck (level: critical, for: 1m):

(service_member_role{job="pd",service="PD"} == 0)
and on(instance,job) (etcd_server_is_leader{job="pd"} == 1)

Fires when the etcd-leader node's PD service layer is not serving as PD leader for a sustained period. Normal failovers are naturally excluded: when etcd leadership transfers, the departing node's etcd_server_is_leader drops to 0, making the join condition false without any extra suppression logic.

Two promtool unit tests are added:

  • pd-leader-service-stuck — positive: service drops at minute 3, stays down, fires at eval_time: 6m
  • pd-leader-service-stuck-suppressed-by-failover — negative: pd-1 loses both PD and etcd leadership to pd-2; no alert fires

Check List

Tests

  • Unit test

Code changes

  • No code (alert rule + promtool tests only)

Side effects

  • None

Release note

Add `PD_leader_service_stuck` alert (critical) that fires when the embedded etcd leader node's PD service layer is not serving as PD leader for more than 1 minute, covering a previously undetected failure mode where no existing alert would trigger.

Summary by CodeRabbit

  • New Features

    • Added a critical alert that detects an unresponsive PD leader service, with severity, summary and description.
  • Tests

    • Added many alert test scenarios covering service outage, normal failover suppression, staggered failover, and repeated duplicated scenarios that effectively double some tests.

… service

When service_member_role drops to 0 and never recovers while
etcd_server_is_leader stays stable, no existing alert fires:
- PD_leader_lease_drop_without_failover requires service_member_role==1
  at eval time and changes>=2 (persistent drop gives changes==1, value=0)
- PD_leader_change detects TSO-save handoff; with no active PD leader,
  no saves are emitted and the count stays below threshold
- All cluster-health alerts rely on pd_cluster_status/pd_regions_status,
  which are only emitted by the active PD leader

Add PD_leader_service_stuck (critical, for:1m) that fires when the etcd
leader node's PD service layer is not serving as PD leader. Normal
failovers are naturally excluded: when etcd leadership transfers, the
departing node's etcd_server_is_leader drops to 0, making the join
condition false without any extra suppression logic.

Add two promtool unit tests:
- pd-leader-service-stuck: positive case fires after 1m of stuck state
- pd-leader-service-stuck-suppressed-by-failover: normal failover with
  matching etcd+PD leadership transfer stays silent

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Yishuai Li <yishuai.li@pingcap.com>
Copilot AI review requested due to automatic review settings March 10, 2026 04:37
@ti-chi-bot ti-chi-bot bot added release-note Denotes a PR that will be considered when it comes time to generate release notes. dco-signoff: yes Indicates the PR's author has signed the dco. labels Mar 10, 2026
@ti-chi-bot
Contributor

ti-chi-bot bot commented Mar 10, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign overvenus for approval. For more information see the Code Review Process.
Please ensure that each of them provides their approval before proceeding.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ti-chi-bot ti-chi-bot bot added contribution This PR is from a community contributor. needs-ok-to-test Indicates a PR created by contributors and need ORG member send '/ok-to-test' to start testing. labels Mar 10, 2026
@ti-chi-bot
Contributor

ti-chi-bot bot commented Mar 10, 2026

Hi @liyishuai. Thanks for your PR.

I'm waiting for a tikv member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@ti-chi-bot ti-chi-bot bot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Mar 10, 2026
@coderabbitai

coderabbitai bot commented Mar 10, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 48d0063d-3ae6-42e6-9048-d008d9f9b725

📥 Commits

Reviewing files that changed from the base of the PR and between e03fe8e and f318359.

📒 Files selected for processing (1)
  • tests/alertmanager/pd.rules.test.yml

📝 Walkthrough

Walkthrough

Adds a new Prometheus alert rule PD_leader_service_stuck that fires when an embedded etcd leader's PD service layer is non-leader for 1m, and adds multiple alert-test scenarios covering trigger, normal failover suppression, and staggered failover (some scenarios duplicated).

Changes

Cohort / File(s) Summary
Alert Rule
metrics/alertmanager/pd.rules.yml
Insert PD_leader_service_stuck alert with expression (service_member_role{job="pd",service="PD"} == 0) and on(instance,job) (etcd_server_is_leader{job="pd"} == 1), for: 1m, labels (env, level: critical) and accompanying summary/description/value.
Alert Tests
tests/alertmanager/pd.rules.test.yml
Add multiple alert-test scenarios: a triggering case and several suppression/failover cases (including staggered failover), with input_series, eval_times, and expected alert/no-alert assertions. Two scenarios are duplicated, increasing test coverage and count.

Sequence Diagram(s)

(Skipped — change is a new alert rule plus tests; no new multi-component sequential control flow diagram generated.)

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

Suggested reviewers

  • JmPotato
  • bufferflies
  • okJiang

Poem

🐇
I sniffed the metrics, soft and quick,
When etcd leads but PD stays sick.
One minute waits, the bell will peep,
A hoppity alert wakes from its sleep,
Hooray — the watcher guards the creek!

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
  • Title check: Passed. The title clearly and concisely describes the main change: adding a new Prometheus alert rule for detecting when an etcd-leader node's PD service is stuck in non-leader state.
  • Description check: Passed. The PR description thoroughly addresses the template requirements: it links issue #10320, explains the problem and solution clearly, describes the changes in detail with the PromQL expression, includes the commit message, and provides release notes documenting the new alert.
  • Linked Issues check: Passed. All coding requirements from issue #10320 are met: the alert rule is implemented with the correct expression and 1m duration, unit tests validate both the positive case (service stuck) and negative cases (normal failovers), and test coverage confirms the alert fires as expected.
  • Out of Scope Changes check: Passed. All changes are scoped to the requirements: alert rule addition, promtool unit tests, and release notes. No unrelated code modifications detected.
  • Docstring Coverage check: Passed. No functions found in the changed files; skipping the docstring coverage check.



Copilot AI left a comment


Pull request overview

Adds a new PD alert to detect a previously silent failure mode where the embedded etcd leader remains stable but the PD service layer is stuck in a non-leader state (no PD leader serving requests).

Changes:

  • Add PD_leader_service_stuck (critical, for: 1m) alert rule based on service_member_role==0 on the etcd leader.
  • Add promtool unit tests covering both the positive (stuck) case and a failover-suppressed (negative) case.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

File Description
metrics/alertmanager/pd.rules.yml Adds the new PD_leader_service_stuck alert rule and associated labels/annotations.
tests/alertmanager/pd.rules.test.yml Adds promtool tests validating the alert fires for persistent PD service loss and stays silent during normal failover.

Signed-off-by: Yishuai Li <yishuai.li@pingcap.com>

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

@coderabbitai coderabbitai bot left a comment


🧹 Nitpick comments (1)
tests/alertmanager/pd.rules.test.yml (1)

145-163: Assert suppression during the failover window, not only at 12m.

exp_alerts: [] here only proves the alert is absent long after the handoff. It would still miss a transient page around minute 5-6, which is the exact behavior the for: 1m in metrics/alertmanager/pd.rules.yml:156-169 is meant to suppress. Please add evals near the transition, and ideally one staggered handoff case where service_member_role goes 0 slightly before etcd_server_is_leader flips, so the no-false-positive failover behavior is actually locked in.

Example of tightening this test
   alert_rule_test:
+    - eval_time: 5m45s
+      alertname: PD_leader_service_stuck
+      exp_alerts: []
     - eval_time: 12m
       alertname: PD_leader_service_stuck
       exp_alerts: []
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/alertmanager/pd.rules.test.yml` around lines 145 - 163, The test
currently only checks at eval_time 12m; add additional alert_rule_test entries
that evaluate during and immediately after the failover window (e.g., around 5m,
5m30s, 6m) to assert suppression while the handoff is occurring; include at
least one staggered handoff scenario where the series
'service_member_role{job="pd",service="PD",instance="pd-1"}' flips to 0 slightly
before 'etcd_server_is_leader{job="pd",instance="pd-1"}' flips (and
corresponding pd-2 series flips on) and set exp_alerts: [] for those evals to
ensure the rule PD_leader_service_stuck (and the rule’s for: 1m behavior) does
not fire during the transient.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@tests/alertmanager/pd.rules.test.yml`:
- Around line 145-163: The test currently only checks at eval_time 12m; add
additional alert_rule_test entries that evaluate during and immediately after
the failover window (e.g., around 5m, 5m30s, 6m) to assert suppression while the
handoff is occurring; include at least one staggered handoff scenario where the
series 'service_member_role{job="pd",service="PD",instance="pd-1"}' flips to 0
slightly before 'etcd_server_is_leader{job="pd",instance="pd-1"}' flips (and
corresponding pd-2 series flips on) and set exp_alerts: [] for those evals to
ensure the rule PD_leader_service_stuck (and the rule’s for: 1m behavior) does
not fire during the transient.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 5cc9def5-f11b-42fd-b942-10e2121166be

📥 Commits

Reviewing files that changed from the base of the PR and between 6ee40bd and b595d41.

📒 Files selected for processing (1)
  • tests/alertmanager/pd.rules.test.yml

…_service_stuck

Add alert_rule_test entries at 5m, 5m30s, and 6m to the instant-failover
suppression test to assert the rule stays silent while and immediately after
the handoff is occurring.

Add pd-leader-service-stuck-staggered-failover: pd-1 service_member_role
flips to 0 thirty seconds before etcd_server_is_leader transfers. The
transient window (2 evaluations × 15s = 30s) is shorter than for:1m's
required 4 consecutive evaluations, so the alert must not fire.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Yishuai Li <yishuai.li@pingcap.com>
@ti-chi-bot ti-chi-bot bot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Mar 10, 2026
@liyishuai
Contributor Author

liyishuai commented Mar 10, 2026

Test Report: PD_leader_service_stuck Alert

Date: 2026-03-10
Alert: PD_leader_service_stuck
PR: #10321

Alert Overview

This alert fires when the embedded etcd leader node's PD service layer is not serving as PD leader for more than 1 minute — a silent failure mode where the cluster has no PD leader yet no existing alert triggers.

Expression:

(service_member_role{job="pd",service="PD"} == 0)
and on(instance,job) (etcd_server_is_leader{job="pd"} == 1)
for: 1m

How service_member_role behaves:
The metric is only exported by a node that has been PD leader at least once. When the leader steps down, server.go sets it to 0 via a deferred call; the metric persists in the registry and reads 0 until the process restarts or the node wins leadership again. Follower nodes that have never been leader do not export the metric at all, so the alert fires precisely on the node that lost PD leadership while remaining the etcd raft leader.
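This join behavior can be modeled in a few lines. The following is a simplified Python model of the rule's `and on(instance, job)` semantics, not PD or Prometheus code, and the series values are hypothetical:

```python
# Simplified model of the PromQL join `and on(instance, job)` used by the
# alert expression; NOT Prometheus or PD code. Series values are examples.

def stuck_instances(service_member_role, etcd_is_leader):
    """Instances where service_member_role == 0 and, on the same
    instance, etcd_server_is_leader == 1."""
    not_serving = {inst for inst, v in service_member_role.items() if v == 0}
    etcd_leaders = {inst for inst, v in etcd_is_leader.items() if v == 1}
    return not_serving & etcd_leaders

# Stuck case: pd-1 lost PD leadership but remains etcd leader. pd-0 and
# pd-2 have never led PD, so they do not export service_member_role at all.
assert stuck_instances(
    {"pd-1": 0},
    {"pd-0": 0, "pd-1": 1, "pd-2": 0},
) == {"pd-1"}

# Normal failover: pd-1's etcd_server_is_leader drops to 0 as leadership
# moves to pd-2, so the join is empty and nothing fires.
assert stuck_instances(
    {"pd-1": 0, "pd-2": 1},
    {"pd-0": 0, "pd-1": 0, "pd-2": 1},
) == set()
```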

Test Environment

Required Components

  • PD server with failpoint support: make pd-server-failpoint
  • Prometheus ≥ 2.x
  • tiup (for cluster orchestration)

Failpoints Used

  • skipGrantLeader (github.com/tikv/pd/pkg/election/skipGrantLeader): blocks all PD nodes from campaigning; the pause action blocks indefinitely
  • exitCampaignLeader (github.com/tikv/pd/server/exitCampaignLeader): forces the named PD leader to step down

Test Procedure

Step 1: Build PD with Failpoint Support

cd /path/to/pd
make pd-server-failpoint

Step 2: Start a 3-Node PD Cluster

# Clean up any existing cluster
pkill -9 -f "pd-server|tiup.*playground"
rm -rf ~/.tiup/data/alert-test

# Start 3-node cluster with failpoint-enabled binary
tiup playground --pd 3 --pd.binpath $(pwd)/bin/pd-server \
  --kv 0 --db 0 --tiflash 0 --without-monitor \
  --tag alert-test > /tmp/tiup-playground.log 2>&1 &

# Wait for cluster to stabilize (20-30 seconds)
sleep 25

Verify cluster and identify leader:

# Check all nodes' metrics to find leader
for port in 2379 2382 2384; do
  echo "=== Port $port ==="
  curl -s "http://127.0.0.1:$port/metrics" | \
    grep -E "^etcd_server_is_leader|^service_member_role"
done

Expected output (pd-1 at port 2382 is leader):

=== Port 2379 ===
etcd_server_is_leader 0

=== Port 2382 ===
etcd_server_is_leader 1
service_member_role{service="PD"} 1

=== Port 2384 ===
etcd_server_is_leader 0

Step 3: Configure Prometheus

Create /tmp/prometheus.yml:

global:
  scrape_interval: 15s
  evaluation_interval: 15s

rule_files:
  - /path/to/pd/metrics/alertmanager/pd.rules.yml

scrape_configs:
  - job_name: pd
    static_configs:
      - targets: ["127.0.0.1:2379", "127.0.0.1:2382", "127.0.0.1:2384"]
    relabel_configs:
      - source_labels: [__address__]
        target_label: instance
        regex: '127.0.0.1:2379'
        replacement: 'pd-0'
      - source_labels: [__address__]
        target_label: instance
        regex: '127.0.0.1:2382'
        replacement: 'pd-1'
      - source_labels: [__address__]
        target_label: instance
        regex: '127.0.0.1:2384'
        replacement: 'pd-2'

Start Prometheus:

prometheus --config.file=/tmp/prometheus.yml \
  --storage.tsdb.path=/tmp/prometheus-data \
  > /tmp/prometheus.log 2>&1 &

# Wait for data collection
sleep 15

Step 4: Trigger the Alert

#!/usr/bin/env bash
set -euo pipefail

# PD leader (adjust if different node is leader)
PD_LEADER="http://127.0.0.1:2382"
FP_SKIP="github.com/tikv/pd/pkg/election/skipGrantLeader"
FP_EXIT="github.com/tikv/pd/server/exitCampaignLeader"

# Get leader's member_id
MEMBER_ID=$(curl -s "$PD_LEADER/pd/api/v1/leader" | \
  python3 -c "import json,sys; print(json.load(sys.stdin)['member_id'])")

echo "=== Leader member_id: $MEMBER_ID ==="
curl -s "$PD_LEADER/metrics" | grep -E "^etcd_server_is_leader|^service_member_role"

# Block ALL nodes from campaigning (pause = indefinite block)
echo ""
echo "Setting skipGrantLeader (pause) on all nodes..."
for port in 2379 2382 2384; do
  curl -s -X PUT "http://127.0.0.1:$port/pd/api/v1/fail/$FP_SKIP" -d 'pause'
done
echo "Done"

# Force leader to step down
echo ""
echo "Triggering exitCampaignLeader for member $MEMBER_ID..."
curl -s -X PUT "$PD_LEADER/pd/api/v1/fail/$FP_EXIT" \
  -d "return(\"$MEMBER_ID\")"
echo "Done"

# Verify stuck state
sleep 2
echo ""
echo "=== Stuck state achieved ==="
curl -s "$PD_LEADER/metrics" | grep -E "^etcd_server_is_leader|^service_member_role"

# Hold stuck state for >1m to trigger alert
echo ""
echo "Holding stuck state for 70s (alert fires after 60s)..."
sleep 70

echo ""
echo "=== Final state ==="
curl -s "$PD_LEADER/metrics" | grep -E "^etcd_server_is_leader|^service_member_role"

Step 5: Verify Alert Firing

Check via Prometheus UI:

# On macOS
open http://localhost:9090/alerts

# On Linux
xdg-open http://localhost:9090/alerts

Look for PD_leader_service_stuck with state "firing".

Check via Prometheus API:

curl -s http://127.0.0.1:9090/api/v1/alerts | python3 -c "
import json, sys
data = json.load(sys.stdin)
alerts = [a for a in data['data']['alerts']
  if a['labels']['alertname'] == 'PD_leader_service_stuck']
for a in alerts:
  print(f\"Instance: {a['labels']['instance']}\")
  print(f\"State: {a['state']}\")
  print(f\"Value: {a['annotations']['value']}\")
"

Expected output:

Instance: pd-1
State: firing
Value: 0

Observed Behavior

When the stuck condition is triggered:

t=0s    service_member_role=1  etcd_server_is_leader=1  (normal leader)
t+2s    service_member_role=0  etcd_server_is_leader=1  ← stuck condition
t+60s   ALERT FIRES (for: 1m satisfied)

Important note: the pause action blocks PD campaigns indefinitely, but in a healthy cluster the embedded etcd raft layer (independent of PD application logic) may eventually elect a new leader after roughly 40-70 s due to its heartbeat timeout. In the real customer scenario (FRM-3233), the issue persists for minutes, long enough for the alert to fire and be actionable.

Alert State Progression

  • t=0: Inactive (condition not yet true at the first evaluation)
  • t+15s: Pending (condition true, accumulating toward the for duration)
  • t+60s: Firing (condition true for ≥ 1 minute, i.e. 4 consecutive 15 s evaluations)
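The progression above follows from the 15 s evaluation interval and for: 1m. A minimal model (an idealized evaluator sketch, not Prometheus internals):

```python
# Idealized model of Prometheus's `for:` hold-time behavior at a 15 s
# evaluation interval; NOT Prometheus internals.

def alert_states(condition_by_eval, for_seconds=60, interval=15):
    """Map per-evaluation condition results to inactive/pending/firing."""
    needed = for_seconds // interval  # 60 / 15 = 4 consecutive true evals
    states, streak = [], 0
    for cond in condition_by_eval:
        streak = streak + 1 if cond else 0
        if streak == 0:
            states.append("inactive")
        elif streak < needed:
            states.append("pending")
        else:
            states.append("firing")
    return states

# Condition becomes true at the 2nd evaluation and stays true: the alert
# fires on the 4th consecutive true evaluation.
assert alert_states([False, True, True, True, True, True]) == \
    ["inactive", "pending", "pending", "pending", "firing", "firing"]

# A 30 s transient (2 evaluations, as in the staggered-failover test)
# never reaches the 4-evaluation threshold and never fires.
assert alert_states([False, True, True, False, False]) == \
    ["inactive", "pending", "pending", "inactive", "inactive"]
```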

Test Coverage (promtool unit tests)

Run all tests:

promtool test rules tests/alertmanager/pd.rules.test.yml
# → SUCCESS

All 7 test groups pass (15 total eval_time assertions across all alerts):

  • pd-leader-service-stuck: service drops and never recovers while etcd stays stable; fires at 4m (pending at 3m45s)
  • pd-leader-service-stuck-suppressed-by-failover: service and etcd leadership transfer simultaneously; silent at 5m, 5m30s, 6m, and 12m
  • pd-leader-service-stuck-staggered-failover: service drops 30 s before etcd transfers (transient window shorter than for: 1m); silent at 5m, 5m15s, 5m30s, 6m, and 12m

Clean Up

# Stop cluster and Prometheus
pkill -9 -f "pd-server|tiup.*playground|prometheus"
rm -rf ~/.tiup/data/alert-test /tmp/prometheus-data

References


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tests/alertmanager/pd.rules.test.yml`:
- Around line 191-206: Add an explicit sample at the 5m15s boundary inside the
alert_rule_test block to pin the for:1m suppression window: insert an entry with
eval_time: 5m15s, alertname: PD_leader_service_stuck and exp_alerts: [] so the
test exercises the transient window itself (this ensures rules shortened to
for:15s would be caught); update the sequence near the existing eval_time: 5m
and 5m30s entries in the alert_rule_test case.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 16857d64-2260-453e-9292-640220537e5e

📥 Commits

Reviewing files that changed from the base of the PR and between b595d41 and 80a0304.

📒 Files selected for processing (1)
  • tests/alertmanager/pd.rules.test.yml

Signed-off-by: Yishuai Li <yishuai.li@pingcap.com>

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tests/alertmanager/pd.rules.test.yml`:
- Around line 129-143: The test for PD_leader_service_stuck in alert_rule_test
currently only checks that the alert eventually fires; update the test to assert
the exact 1m "for" boundary by adding a no-alert check just before the 1-minute
boundary and a firing check at the 1-minute mark: under the
PD_leader_service_stuck case (eval_time/exp_alerts) add an eval_time entry
(e.g., 59s or 59s since start) asserting no alerts with the same labels, and
another eval_time entry at 1m asserting the alert is firing with
exp_labels/exp_annotations as shown, so the test enforces the 1m hold time
precisely.
- Around line 155-156: The fixture currently includes a series for
service_member_role with instance="pd-2" and many leading zeros; instead model
this metric as absent until pd-2 takes leadership by removing that series (or
replacing it with an absent/empty series) so the metric is truly omitted before
takeover; apply the same change to the duplicate occurrence referenced in the
comment (the other service_member_role{instance="pd-2"} fixture).

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 7ed5393b-6cc6-4316-950b-4f6b195ff1b1

📥 Commits

Reviewing files that changed from the base of the PR and between 80a0304 and e03fe8e.

📒 Files selected for processing (1)
  • tests/alertmanager/pd.rules.test.yml

…accuracy

Add eval_time: 3m45s (PENDING, no alert) and eval_time: 4m (exactly at
for:1m boundary, FIRING) to the pd-leader-service-stuck test so the suite
enforces the hold time precisely rather than only checking a late eval.

Remove the service_member_role{instance="pd-2"} series with leading zeros
from pd-leader-service-stuck-suppressed-by-failover and
pd-leader-service-stuck-staggered-failover. In practice the metric is only
initialized when a node first wins PD leadership (server.go:1797 Set(1));
followers that have never been leader do not export it at all. Keeping the
series absent before takeover makes the fixture match the real metric
cardinality.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Yishuai Li <yishuai.li@pingcap.com>
@codecov

codecov bot commented Mar 10, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 78.84%. Comparing base (c1f3166) to head (f318359).
⚠️ Report is 4 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master   #10321      +/-   ##
==========================================
+ Coverage   78.78%   78.84%   +0.06%     
==========================================
  Files         527      527              
  Lines       70916    70920       +4     
==========================================
+ Hits        55870    55917      +47     
+ Misses      11026    10986      -40     
+ Partials     4020     4017       -3     
Flag Coverage Δ
unittests 78.84% <ø> (+0.06%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.


@coderabbitai

coderabbitai bot commented Mar 10, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@liyishuai
Contributor Author

@coderabbitai resume

@coderabbitai

coderabbitai bot commented Mar 10, 2026

✅ Actions performed

Reviews resumed.

annotations:
description: 'cluster: ENV_LABELS_ENV, instance: {{ $labels.instance }}, PD service is not the PD leader while being the embedded etcd leader; values:{{ $value }}'
value: '{{ $value }}'
summary: PD leader service is stuck in non-leader state
Contributor


How about adding a summary to show the etcd leader is normal?



Development

Successfully merging this pull request may close these issues.

metrics: add alert for PD etcd-leader node stuck in non-PD-leader state

3 participants