metrics: add PD_leader_service_stuck alert for etcd-leader without PD service#10321
liyishuai wants to merge 5 commits into tikv:master from …
Conversation
… service

When service_member_role drops to 0 and never recovers while etcd_server_is_leader stays stable, no existing alert fires:

- PD_leader_lease_drop_without_failover requires service_member_role == 1 at eval time and changes >= 2 (a persistent drop gives changes == 1, value = 0)
- PD_leader_change detects TSO-save handoff; with no active PD leader, no saves are emitted and the count stays below threshold
- All cluster-health alerts rely on pd_cluster_status/pd_regions_status, which are only emitted by the active PD leader

Add PD_leader_service_stuck (critical, for: 1m), which fires when the etcd leader node's PD service layer is not serving as PD leader. Normal failovers are naturally excluded: when etcd leadership transfers, the departing node's etcd_server_is_leader drops to 0, making the join condition false without any extra suppression logic.

Add two promtool unit tests:

- pd-leader-service-stuck: positive case, fires after 1m of stuck state
- pd-leader-service-stuck-suppressed-by-failover: normal failover with matching etcd+PD leadership transfer stays silent

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Yishuai Li <yishuai.li@pingcap.com>
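The commit message describes the rule's semantics but not its expression. As a hedged illustration only, the described join could take a shape like the following; the actual expression, label matchers, and label set live in metrics/alertmanager/pd.rules.yml and may well differ:

```yaml
# Illustrative sketch, not the committed rule: the exact expression and
# label matchers in metrics/alertmanager/pd.rules.yml may differ.
- alert: PD_leader_service_stuck
  expr: etcd_server_is_leader{job="pd"} == 1 and on(instance) service_member_role{job="pd", service="PD"} == 0
  for: 1m
  labels:
    level: critical
  annotations:
    summary: PD leader service is stuck in non-leader state
```

Under this shape, when etcd leadership transfers, the left-hand side drops to 0 on the departing node, so the `and` join produces no sample there; that is what keeps normal failovers silent without extra suppression logic.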
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull request has been approved by: … The full list of commands accepted by this bot can be found here. Needs approval from an approver in each of these files. Approvers can indicate their approval by writing …
Hi @liyishuai. Thanks for your PR. I'm waiting for a tikv member to verify that this patch is reasonable to test. If it is, they should reply with … Once the patch is verified, the new status will be reflected by the … label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
No actionable comments were generated in the recent review. 🎉
📝 Walkthrough

Adds a new Prometheus alert rule PD_leader_service_stuck that fires when an embedded etcd leader's PD service layer is non-leader for 1m, and adds multiple alert-test scenarios covering trigger, normal failover suppression, and staggered failover (some scenarios duplicated).
Pull request overview
Adds a new PD alert to detect a previously silent failure mode where the embedded etcd leader remains stable but the PD service layer is stuck in a non-leader state (no PD leader serving requests).
Changes:
- Add PD_leader_service_stuck (critical, for: 1m) alert rule based on service_member_role == 0 on the etcd leader.
- Add promtool unit tests covering both the positive (stuck) case and a failover-suppressed (negative) case.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| metrics/alertmanager/pd.rules.yml | Adds the new PD_leader_service_stuck alert rule and associated labels/annotations. |
| tests/alertmanager/pd.rules.test.yml | Adds promtool tests validating the alert fires for persistent PD service loss and stays silent during normal failover. |
Signed-off-by: Yishuai Li <yishuai.li@pingcap.com> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
🧹 Nitpick comments (1)
tests/alertmanager/pd.rules.test.yml (1)
145-163: Assert suppression during the failover window, not only at 12m.
`exp_alerts: []` here only proves the alert is absent long after the handoff. It would still miss a transient page around minute 5-6, which is the exact behavior the `for: 1m` in `metrics/alertmanager/pd.rules.yml:156-169` is meant to suppress. Please add evals near the transition, and ideally one staggered handoff case where `service_member_role` goes to `0` slightly before `etcd_server_is_leader` flips, so the no-false-positive failover behavior is actually locked in.

Example of tightening this test:

```diff
 alert_rule_test:
+  - eval_time: 5m45s
+    alertname: PD_leader_service_stuck
+    exp_alerts: []
   - eval_time: 12m
     alertname: PD_leader_service_stuck
     exp_alerts: []
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@tests/alertmanager/pd.rules.test.yml` around lines 145-163: the test currently only checks at eval_time 12m; add additional alert_rule_test entries that evaluate during and immediately after the failover window (e.g., around 5m, 5m30s, 6m) to assert suppression while the handoff is occurring; include at least one staggered handoff scenario where the series 'service_member_role{job="pd",service="PD",instance="pd-1"}' flips to 0 slightly before 'etcd_server_is_leader{job="pd",instance="pd-1"}' flips (and the corresponding pd-2 series flips on), and set exp_alerts: [] for those evals to ensure PD_leader_service_stuck (and the rule's for: 1m behavior) does not fire during the transient.
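The staggered-handoff scenario the reviewer asks for could look roughly like this in promtool's unit-test format; the series values, timings, and label sets below are illustrative assumptions, not the committed fixture:

```yaml
# Illustrative promtool scenario (not the committed fixture): pd-1's PD
# role drops 30s before etcd leadership moves, and the alert stays silent.
- interval: 15s
  input_series:
    - series: 'etcd_server_is_leader{job="pd",instance="pd-1"}'
      values: '1x22 0x26'     # etcd leadership leaves pd-1 at ~5m30s
    - series: 'etcd_server_is_leader{job="pd",instance="pd-2"}'
      values: '0x22 1x26'
    - series: 'service_member_role{job="pd",service="PD",instance="pd-1"}'
      values: '1x20 0x28'     # PD role drops at ~5m, 30s earlier
  alert_rule_test:
    - eval_time: 5m15s        # inside the transient window
      alertname: PD_leader_service_stuck
      exp_alerts: []
    - eval_time: 12m          # long after the handoff
      alertname: PD_leader_service_stuck
      exp_alerts: []
```

The eval at 5m15s is the one that actually exercises the transient: at a 15s interval the 30s stagger yields only two consecutive true evaluations, short of the four that `for: 1m` requires.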
…_service_stuck

Add alert_rule_test entries at 5m, 5m30s, and 6m to the instant-failover suppression test to assert the rule stays silent while and immediately after the handoff occurs.

Add pd-leader-service-stuck-staggered-failover: pd-1's service_member_role flips to 0 thirty seconds before etcd_server_is_leader transfers. The transient window (2 evaluations × 15s = 30s) is shorter than for: 1m's required 4 consecutive evaluations, so the alert must not fire.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Yishuai Li <yishuai.li@pingcap.com>
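The evaluation arithmetic in this commit message can be sanity-checked with a small sketch. This mirrors the reasoning above, not any Prometheus internals; the function and constant names are illustrative:

```python
# Sketch of the pending/firing bookkeeping behind `for: 1m` at a 15s
# evaluation interval; this mirrors the commit's reasoning, not Prometheus code.
EVAL_INTERVAL_S = 15
FOR_DURATION_S = 60  # for: 1m

def fires(true_window_s: int) -> bool:
    """Does a condition that stays true for `true_window_s` reach firing?"""
    consecutive_true_evals = true_window_s // EVAL_INTERVAL_S
    required_evals = FOR_DURATION_S // EVAL_INTERVAL_S  # 4 consecutive evals
    return consecutive_true_evals >= required_evals

print(fires(30))  # staggered-handoff transient: only 2 of the required 4 evals
print(fires(60))  # persistent stuck state: reaches the 1m boundary
```

This is why the staggered test must stay silent while the positive test fires exactly at the for: 1m boundary.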
## Test Report: PD_leader_service_stuck Alert

Date: 2026-03-10

### Alert Overview

This alert fires when the embedded etcd leader node's PD service layer is not serving as PD leader for more than 1 minute: a silent failure mode where the cluster has no PD leader yet no existing alert triggers.

Expression: …

### Test Environment

Required Components: …

### Failpoints Used

### Test Procedure

#### Step 1: Build PD with Failpoint Support

```shell
cd /path/to/pd
make pd-server-failpoint
```

#### Step 2: Start a 3-Node PD Cluster

```shell
# Clean up any existing cluster
pkill -9 -f "pd-server|tiup.*playground"
rm -rf ~/.tiup/data/alert-test

# Start 3-node cluster with failpoint-enabled binary
tiup playground --pd 3 --pd.binpath $(pwd)/bin/pd-server \
  --kv 0 --db 0 --tiflash 0 --without-monitor \
  --tag alert-test > /tmp/tiup-playground.log 2>&1 &

# Wait for cluster to stabilize (20-30 seconds)
sleep 25
```

Verify cluster and identify leader:

```shell
# Check all nodes' metrics to find leader
for port in 2379 2382 2384; do
  echo "=== Port $port ==="
  curl -s "http://127.0.0.1:$port/metrics" | \
    grep -E "^etcd_server_is_leader|^service_member_role"
done
```

Expected output (pd-1 at port 2382 is leader):

#### Step 3: Configure Prometheus

Create `/tmp/prometheus.yml`:

```yaml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
rule_files:
  - /path/to/pd/metrics/alertmanager/pd.rules.yml
scrape_configs:
  - job_name: pd
    static_configs:
      - targets: ["127.0.0.1:2379", "127.0.0.1:2382", "127.0.0.1:2384"]
    relabel_configs:
      - source_labels: [__address__]
        target_label: instance
        regex: '127.0.0.1:2379'
        replacement: 'pd-0'
      - source_labels: [__address__]
        target_label: instance
        regex: '127.0.0.1:2382'
        replacement: 'pd-1'
      - source_labels: [__address__]
        target_label: instance
        regex: '127.0.0.1:2384'
        replacement: 'pd-2'
```

Start Prometheus:

```shell
prometheus --config.file=/tmp/prometheus.yml \
  --storage.tsdb.path=/tmp/prometheus-data \
  > /tmp/prometheus.log 2>&1 &

# Wait for data collection
sleep 15
```

#### Step 4: Trigger the Alert

```shell
#!/usr/bin/env bash
set -euo pipefail

# PD leader (adjust if different node is leader)
PD_LEADER="http://127.0.0.1:2382"
FP_SKIP="github.com/tikv/pd/pkg/election/skipGrantLeader"
FP_EXIT="github.com/tikv/pd/server/exitCampaignLeader"

# Get leader's member_id
MEMBER_ID=$(curl -s "$PD_LEADER/pd/api/v1/leader" | \
  python3 -c "import json,sys; print(json.load(sys.stdin)['member_id'])")

echo "=== Leader member_id: $MEMBER_ID ==="
curl -s "$PD_LEADER/metrics" | grep -E "^etcd_server_is_leader|^service_member_role"

# Block ALL nodes from campaigning (pause = indefinite block)
echo ""
echo "Setting skipGrantLeader (pause) on all nodes..."
for port in 2379 2382 2384; do
  curl -s -X PUT "http://127.0.0.1:$port/pd/api/v1/fail/$FP_SKIP" -d 'pause'
done
echo "Done"

# Force leader to step down
echo ""
echo "Triggering exitCampaignLeader for member $MEMBER_ID..."
curl -s -X PUT "$PD_LEADER/pd/api/v1/fail/$FP_EXIT" \
  -d "return(\"$MEMBER_ID\")"
echo "Done"

# Verify stuck state
sleep 2
echo ""
echo "=== Stuck state achieved ==="
curl -s "$PD_LEADER/metrics" | grep -E "^etcd_server_is_leader|^service_member_role"

# Hold stuck state for >1m to trigger alert
echo ""
echo "Holding stuck state for 70s (alert fires after 60s)..."
sleep 70

echo ""
echo "=== Final state ==="
curl -s "$PD_LEADER/metrics" | grep -E "^etcd_server_is_leader|^service_member_role"
```

#### Step 5: Verify Alert Firing

Check via Prometheus UI:

```shell
# On macOS
open http://localhost:9090/alerts

# On Linux
xdg-open http://localhost:9090/alerts
```

Look for the PD_leader_service_stuck alert.

Check via Prometheus API:

```shell
curl -s http://127.0.0.1:9090/api/v1/alerts | python3 -c "
import json, sys
data = json.load(sys.stdin)
alerts = [a for a in data['data']['alerts']
          if a['labels']['alertname'] == 'PD_leader_service_stuck']
for a in alerts:
    print(f\"Instance: {a['labels']['instance']}\")
    print(f\"State: {a['state']}\")
    print(f\"Value: {a['annotations']['value']}\")
"
```

Expected output:

### Observed Behavior

When the stuck condition is triggered:

Important note: …

### Alert State Progression

### Test Coverage (promtool unit tests)

Run all tests:

```shell
promtool test rules tests/alertmanager/pd.rules.test.yml
# → SUCCESS
```

All 7 test groups pass (15 total …

### Clean Up

```shell
# Stop cluster and Prometheus
pkill -9 -f "pd-server|tiup.*playground|prometheus"
rm -rf ~/.tiup/data/alert-test /tmp/prometheus-data
```

### References
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@tests/alertmanager/pd.rules.test.yml`:
- Around line 191-206: Add an explicit sample at the 5m15s boundary inside the
alert_rule_test block to pin the for:1m suppression window: insert an entry with
eval_time: 5m15s, alertname: PD_leader_service_stuck and exp_alerts: [] so the
test exercises the transient window itself (this ensures rules shortened to
for:15s would be caught); update the sequence near the existing eval_time: 5m
and 5m30s entries in the alert_rule_test case.
Signed-off-by: Yishuai Li <yishuai.li@pingcap.com> Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@tests/alertmanager/pd.rules.test.yml`:
- Around line 129-143: The test for PD_leader_service_stuck in alert_rule_test
currently only checks that the alert eventually fires; update the test to assert
the exact 1m "for" boundary by adding a no-alert check just before the 1-minute
boundary and a firing check at the 1-minute mark: under the
PD_leader_service_stuck case (eval_time/exp_alerts) add an eval_time entry
(e.g., 59s or 59s since start) asserting no alerts with the same labels, and
another eval_time entry at 1m asserting the alert is firing with
exp_labels/exp_annotations as shown, so the test enforces the 1m hold time
precisely.
- Around line 155-156: The fixture currently includes a series for
service_member_role with instance="pd-2" and many leading zeros; instead model
this metric as absent until pd-2 takes leadership by removing that series (or
replacing it with an absent/empty series) so the metric is truly omitted before
takeover; apply the same change to the duplicate occurrence referenced in the
comment (the other service_member_role{instance="pd-2"} fixture).
…accuracy
Add eval_time: 3m45s (PENDING, no alert) and eval_time: 4m (exactly at
for:1m boundary, FIRING) to the pd-leader-service-stuck test so the suite
enforces the hold time precisely rather than only checking a late eval.
Remove the service_member_role{instance="pd-2"} series with leading zeros
from pd-leader-service-stuck-suppressed-by-failover and
pd-leader-service-stuck-staggered-failover. In practice the metric is only
initialized when a node first wins PD leadership (server.go:1797 Set(1));
followers that have never been leader do not export it at all. Keeping the
series absent before takeover makes the fixture match the real metric
cardinality.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Yishuai Li <yishuai.li@pingcap.com>
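The fixture change above leans on how PromQL joins treat absent series: a series with no samples simply contributes nothing to a vector match, so modeling a never-was-leader follower as absent (rather than padding it with zeros) matches both the real exporter behavior and the join semantics. A sketch, under the assumption that the rule joins the two metrics on `instance` (the committed expression may differ):

```
# Assumed join shape: if service_member_role is absent for an instance,
# `and on(instance)` produces no result there -- absence behaves as
# "condition not met", never as 0 and never as a match.
etcd_server_is_leader == 1
  and on(instance)
service_member_role{service="PD"} == 0
```

Under this shape, removing the zero-padded pd-2 series cannot change any test verdict; it only makes the fixture's cardinality honest.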
Codecov Report

✅ All modified and coverable lines are covered by tests.

Additional details and impacted files:

```
@@            Coverage Diff             @@
##           master   #10321      +/-   ##
==========================================
+ Coverage   78.78%   78.84%   +0.06%
==========================================
  Files         527      527
  Lines       70916    70920       +4
==========================================
+ Hits        55870    55917      +47
+ Misses      11026    10986      -40
+ Partials     4020     4017       -3
```

Flags with carried forward coverage won't be shown.
✅ Actions performed: review triggered.
@coderabbitai resume
✅ Actions performed: reviews resumed.
```yaml
annotations:
  description: 'cluster: ENV_LABELS_ENV, instance: {{ $labels.instance }}, PD service is not the PD leader while being the embedded etcd leader; values:{{ $value }}'
  value: '{{ $value }}'
  summary: PD leader service is stuck in non-leader state
```
How about adding a summary to show the etcd leader is normal?
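One way to act on this suggestion (the wording below is purely illustrative, not a committed change):

```yaml
# Illustrative rewording only: makes explicit that etcd leadership is healthy.
summary: PD service is not the PD leader even though this node holds healthy etcd leadership
```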
What problem does this PR solve?
Issue Number: Close #10320
When service_member_role drops to 0 and never recovers, while etcd_server_is_leader stays stable and the process has not restarted, no existing alert fires. The cluster has no PD leader serving requests, but the entire alert suite is silent (see issue for the full per-rule analysis).
Add PD_leader_service_stuck (level: critical, for: 1m): fires when the etcd-leader node's PD service layer is not serving as PD leader for a sustained period. Normal failovers are naturally excluded: when etcd leadership transfers, the departing node's etcd_server_is_leader drops to 0, making the join condition false without any extra suppression logic.

Two promtool unit tests are added:

- pd-leader-service-stuck (positive): service drops at minute 3, stays down, fires at eval_time: 6m
- pd-leader-service-stuck-suppressed-by-failover (negative): pd-1 loses both PD and etcd leadership to pd-2; no alert fires

Check List
Tests
Code changes
Side effects
Release note
Summary by CodeRabbit
New Features
Tests