*: migrate simple failpoint.Inject to failpoint.InjectCall by JmPotato · Pull Request #10300 · tikv/pd

JmPotato · 2026-03-05T04:30:39Z

What problem does this PR solve?

Issue Number: ref #10297

What is changed and how does it work?

Migrate 19 simple failpoint.Inject() call sites to the newer
failpoint.InjectCall() / failpoint.EnableCall() APIs. This eliminates
closure bodies from production code and consolidates test injection logic
in test files with type-safe callbacks.

Categories migrated:

Empty/nil callbacks (5 sites): Inject("name", func(){}) / nil → bare InjectCall("name")
Sleep-only callbacks (12 sites): move time.Sleep to test-side EnableCall
Panic callbacks (2 sites): move panic to test-side EnableCall

For "pause" failpoints, the pattern is converted from
failpoint.Enable(path, "pause") to a channel-based
failpoint.EnableCall(path, func() { <-ch }) approach.

Migrate 19 simple failpoint.Inject() call sites to failpoint.InjectCall() / EnableCall() APIs.

- Empty/nil callbacks (5 sites): bare InjectCall("name")
- Sleep-only callbacks (12 sites): move time.Sleep to test-side EnableCall
- Panic callbacks (2 sites): move panic to test-side EnableCall
- "pause" failpoints converted to channel-based EnableCall pattern

Check List

Tests

Unit test
Integration test

Release note

None.

Summary by CodeRabbit

Tests
- Refactored failpoint infrastructure to use call-based invocation patterns instead of direct injection.
- Modernized test synchronization mechanisms to utilize channel-based blocking instead of pause-based approaches.
Chores
- Simplified failpoint triggers by removing hardcoded delays from production code paths.

Migrate 19 simple `failpoint.Inject()` call sites to the newer `failpoint.InjectCall()` / `failpoint.EnableCall()` APIs. This eliminates closure bodies from production code and consolidates test injection logic in test files with type-safe callbacks. Categories migrated: - Empty/nil callbacks (5 sites): bare InjectCall with no args - Sleep-only callbacks (12 sites): move time.Sleep to test-side EnableCall - Panic callbacks (2 sites): move panic to test-side EnableCall For "pause" failpoints, the pattern is converted from `failpoint.Enable(path, "pause")` to a channel-based `failpoint.EnableCall(path, func() { <-ch })` approach. Signed-off-by: JmPotato <github@ipotato.me>

coderabbitai · 2026-03-05T04:31:13Z

📝 Walkthrough

Walkthrough

This pull request systematically migrates failpoint invocations across the codebase from using failpoint.Inject() with embedded callbacks to the newer failpoint.InjectCall() and failpoint.EnableCall() APIs. Production code shifts from inline sleep logic to simplified failpoint calls, while test code replaces static return values with function-based callbacks or channel-based synchronization mechanisms.

Changes

Cohort / File(s)	Summary
Client & TSO Dispatch `client/clients/tso/client.go`, `pkg/tso/tso.go`	Replaced `failpoint.Inject()` with embedded sleep logic with `failpoint.InjectCall()` for delay-related failpoints, removing inline sleep implementations.
Election & Leadership `pkg/election/leadership.go`, `pkg/election/leadership_test.go`	Changed failpoint invocations from `failpoint.Inject()` to `failpoint.InjectCall()` in leadership code; test files introduce channel-based synchronization to control failpoint release timing.
GC State Management `pkg/gc/gc_state_manager.go`, `pkg/gc/gc_state_manager_test.go`	Transitioned from `failpoint.Inject()` to `failpoint.InjectCall()` for GC state finish detection; tests now use channel-driven synchronization for pause/resume control.
Keyspace & Scheduling `pkg/keyspace/tso_keyspace_group.go`, `pkg/mcs/scheduling/server/cluster.go`, `pkg/schedule/checker/checker_controller.go`, `pkg/schedule/operator/operator_controller.go`	Replaced failpoint invocations with `failpoint.InjectCall()`, removing embedded sleep delays while preserving failpoint triggering mechanism.
Schedule Checker Tests `pkg/schedule/checker/rule_checker_test.go`, `pkg/schedule/operator/operator_controller_test.go`	Changed failpoint activation from static `Enable()` returns to `EnableCall()` with sleep function callbacks, introducing deliberate test delays.
Storage Layer `pkg/storage/kv/etcd_kv.go`, `pkg/storage/storage_tso_test.go`	Removed inline sleep logic from failpoint.Inject calls and replaced with `failpoint.InjectCall()`; test files now use `EnableCall()` with sleep callbacks.
Server Core `server/api/region.go`, `server/cluster/cluster.go`, `server/grpc_service.go`, `server/server.go`	Replaced multiple `failpoint.Inject()` invocations with embedded delays with `failpoint.InjectCall()` calls across region, cluster, and gRPC service paths.
Server Tests `server/cluster/cluster_test.go`, `tests/server/api/region_test.go`, `tests/server/api/version_test.go`, `tests/server/cluster/cluster_test.go`, `tests/server/member/member_test.go`, `tests/server/server_test.go`, `tests/server/tso/tso_test.go`, `tests/server/watch/leader_watch_test.go`	Transitioned from static failpoint returns to `EnableCall()` with sleep callbacks; several tests introduce channel-based synchronization for controlled failpoint release.
Integration Tests `tests/integrations/client/client_test.go`, `tests/integrations/mcs/router/server_test.go`, `tests/integrations/mcs/tso/api_test.go`, `tests/integrations/mcs/tso/keyspace_group_manager_test.go`, `tests/integrations/tso/client_test.go`	Changed failpoint activation from fixed returns to `EnableCall()` with sleep delays; keyspace group manager test adds channel-based synchronization.
Tools & CLI Tests `tools/pd-ctl/tests/keyspace/keyspace_group_test.go`, `tools/pd-ctl/tests/keyspace/keyspace_test.go`	Replaced static failpoint returns with `EnableCall()` functions introducing 2-second sleep delays; added `time` import.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~30 minutes

Possibly related issues

test: migrate failpoint.Inject to failpoint.InjectCall for cleaner test injection #10297: Directly implements the proposed failpoint API migration by replacing numerous failpoint.Inject() usages with failpoint.InjectCall() and failpoint.EnableCall() across the same files and failpoint sites.

Suggested labels

size/XL, lgtm, approved

Suggested reviewers

rleungx
bufferflies

Poem

🐰 Failpoints now flow with elegant grace,
Old Inject patterns fade without trace,
InjectCall brings clarity bright,
Channel waits for the perfect timing's might,
Tests synchronize, production runs light! ✨

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 8.33% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title clearly describes the main change: migrating simple failpoint.Inject calls to failpoint.InjectCall, which is the primary objective across all modified files.
Description check	✅ Passed	The description includes Issue Number, explains what was changed and how, provides details about the migration categories, and includes a commit message and release note section.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 7

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tests/integrations/mcs/tso/api_test.go`:
- Line 137: The failpoint enabled with
failpoint.EnableCall("github.com/tikv/pd/server/delayStartServerLoop", ...)
should be disabled in a deferred call to guarantee cleanup on test failure:
immediately after calling failpoint.EnableCall, add a defer
failpoint.Disable("github.com/tikv/pd/server/delayStartServerLoop") so the
failpoint is always cleared, and then remove the later explicit
failpoint.Disable call that currently appears at the end of the test.

In `@tests/integrations/mcs/tso/keyspace_group_manager_test.go`:
- Around line 466-467: The failpoint enable call for
"github.com/tikv/pd/pkg/keyspace/pauseFinishSplitBeforeTxn" uses a channel ch
but lacks a fail-safe cleanup; change the code to call failpoint.EnableCall(...)
and immediately defer
failpoint.Disable("github.com/tikv/pd/pkg/keyspace/pauseFinishSplitBeforeTxn")
so the failpoint is disabled on any early return/panic, and then remove the
later unconditional cleanup that disables the same failpoint to avoid
double-handling; reference the ch variable and the EnableCall / Disable symbols
to locate and update the test.

In `@tests/server/cluster/cluster_test.go`:
- Line 880: The failpoint opened with
failpoint.EnableCall("github.com/tikv/pd/server/cluster/versionChangeConcurrency",
...) isn't disabled if an assertion between lines 880–894 fails; immediately
after calling failpoint.EnableCall(...) add a deferred cleanup like defer
failpoint.Disable("github.com/tikv/pd/server/cluster/versionChangeConcurrency")
so the failpoint is always disabled even on test failures, mirroring the pattern
used in TestRegionStatistics and ensuring proper cleanup around the enabled
failpoint.

In `@tests/server/member/member_test.go`:
- Around line 242-243: The failpoint enabled via
failpoint.EnableCall("github.com/tikv/pd/server/raftclusterIsBusy", ...) is not
guaranteed to be disabled on early test exit; after successfully enabling the
failpoint (the call creating ch and calling failpoint.EnableCall) immediately
register cleanup with defer: defer
failpoint.Disable("github.com/tikv/pd/server/raftclusterIsBusy") and also defer
close(ch) (or otherwise signal the goroutine) so the failpoint is always
disabled and the channel closed even if the test fails early.

In `@tests/server/server_test.go`:
- Around line 225-226: The test enables a failpoint with
failpoint.EnableCall("github.com/tikv/pd/server/delayProcess", func() { <-ch })
and creates a channel ch but does not guarantee cleanup on test exit; wrap the
enable + channel creation with a deferred cleanup that uses a sync.Once to both
disable the failpoint
(failpoint.Disable("github.com/tikv/pd/server/delayProcess")) and close ch so
that regardless of assertion failures or panics the failpoint is disabled and
the blocked goroutines are released; place the sync.Once and the defer
immediately after creating ch and calling failpoint.EnableCall to ensure the
cleanup always runs.

In `@tests/server/tso/tso_test.go`:
- Line 138: Enable the failpoint with
failpoint.EnableCall("github.com/tikv/pd/pkg/tso/delaySyncTimestamp", ...) and
immediately register cleanup to disable it (e.g., defer
failpoint.DisableCall("github.com/tikv/pd/pkg/tso/delaySyncTimestamp") or
t.Cleanup(func(){
failpoint.DisableCall("github.com/tikv/pd/pkg/tso/delaySyncTimestamp") })) right
after the EnableCall succeeds so any early test failures won't leak the
failpoint into other tests.

In `@tools/pd-ctl/tests/keyspace/keyspace_test.go`:
- Line 55: The failpoint is enabled via failpoint.EnableCall (wrapped by
re.NoError) but not guaranteed to be disabled if the test fails early;
immediately after enabling the failpoint (the call to failpoint.EnableCall),
register a teardown that disables it (use failpoint.DisableCall or
failpoint.Disable) via t.Cleanup(...) or defer to ensure the failpoint is always
turned off for that test; update the test that calls failpoint.EnableCall to add
this cleanup registration right after the successful enable so later tests are
not polluted.

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: b1456bfe-b696-4dfc-a1e0-23bdec775603

📥 Commits

Reviewing files that changed from the base of the PR and between 4253bee and c242147.

📒 Files selected for processing (33)

client/clients/tso/client.go
pkg/election/leadership.go
pkg/election/leadership_test.go
pkg/gc/gc_state_manager.go
pkg/gc/gc_state_manager_test.go
pkg/keyspace/tso_keyspace_group.go
pkg/mcs/scheduling/server/cluster.go
pkg/schedule/checker/checker_controller.go
pkg/schedule/checker/rule_checker_test.go
pkg/schedule/operator/operator_controller.go
pkg/schedule/operator/operator_controller_test.go
pkg/storage/kv/etcd_kv.go
pkg/storage/storage_tso_test.go
pkg/tso/tso.go
server/api/region.go
server/cluster/cluster.go
server/cluster/cluster_test.go
server/grpc_service.go
server/server.go
tests/integrations/client/client_test.go
tests/integrations/mcs/router/server_test.go
tests/integrations/mcs/tso/api_test.go
tests/integrations/mcs/tso/keyspace_group_manager_test.go
tests/integrations/tso/client_test.go
tests/server/api/region_test.go
tests/server/api/version_test.go
tests/server/cluster/cluster_test.go
tests/server/member/member_test.go
tests/server/server_test.go
tests/server/tso/tso_test.go
tests/server/watch/leader_watch_test.go
tools/pd-ctl/tests/keyspace/keyspace_group_test.go
tools/pd-ctl/tests/keyspace/keyspace_test.go

coderabbitai · 2026-03-05T04:41:56Z

tests/integrations/mcs/tso/api_test.go

 func TestTSOServerStartFirst(t *testing.T) {
 	re := require.New(t)
-	re.NoError(failpoint.Enable("github.com/tikv/pd/server/delayStartServerLoop", `return(true)`))
+	re.NoError(failpoint.EnableCall("github.com/tikv/pd/server/delayStartServerLoop", func() { time.Sleep(2 * time.Second) }))


⚠️ Potential issue | 🟡 Minor

Defer failpoint disable to ensure cleanup on test failure.

The failpoint.Disable at line 198 is not deferred. If any assertion fails between lines 137 and 198, the failpoint remains enabled and could affect subsequent tests. Based on learnings: "Keep failpoints enabled only for tests; disable immediately after."

🛡️ Proposed fix to defer disable

re.NoError(failpoint.EnableCall("github.com/tikv/pd/server/delayStartServerLoop", func() { time.Sleep(2 * time.Second) })) + defer failpoint.Disable("github.com/tikv/pd/server/delayStartServerLoop") ctx, cancel := context.WithCancel(context.Background())

Then remove the explicit disable at line 198:

- re.NoError(failpoint.Disable("github.com/tikv/pd/server/delayStartServerLoop")) }

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

re.NoError(failpoint.EnableCall("github.com/tikv/pd/server/delayStartServerLoop", func() { time.Sleep(2 * time.Second) }))

re.NoError(failpoint.EnableCall("github.com/tikv/pd/server/delayStartServerLoop", func() { time.Sleep(2 * time.Second) }))

defer failpoint.Disable("github.com/tikv/pd/server/delayStartServerLoop")

ctx, cancel := context.WithCancel(context.Background())

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@tests/integrations/mcs/tso/api_test.go` at line 137, The failpoint enabled with failpoint.EnableCall("github.com/tikv/pd/server/delayStartServerLoop", ...) should be disabled in a deferred call to guarantee cleanup on test failure: immediately after calling failpoint.EnableCall, add a defer failpoint.Disable("github.com/tikv/pd/server/delayStartServerLoop") so the failpoint is always cleared, and then remove the later explicit failpoint.Disable call that currently appears at the end of the test.

coderabbitai · 2026-03-05T04:41:57Z

tests/integrations/mcs/tso/keyspace_group_manager_test.go

+	ch := make(chan struct{})
+	re.NoError(failpoint.EnableCall("github.com/tikv/pd/pkg/keyspace/pauseFinishSplitBeforeTxn", func() { <-ch }))


⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

# First, find and read the test file around lines 466-525 cat -n tests/integrations/mcs/tso/keyspace_group_manager_test.go | sed -n '450,530p'

Repository: tikv/pd

Length of output: 3923

Use defer to make failpoint cleanup failure-safe.

The test has multiple early-exit paths (panics from Must* functions and re.NoError() assertions at lines 471, 481, 487, 491, 497, 500, 503–505, 514–516, 519–520). If the test fails before reaching line 524–525, the failpoint remains enabled and contaminates subsequent tests.

Suggested fix

ch := make(chan struct{}) re.NoError(failpoint.EnableCall("github.com/tikv/pd/pkg/keyspace/pauseFinishSplitBeforeTxn", func() { <-ch })) +defer func() { + select { + case <-ch: + default: + close(ch) + } + re.NoError(failpoint.Disable("github.com/tikv/pd/pkg/keyspace/pauseFinishSplitBeforeTxn")) +}() // Create the keyspace group `oldID` with keyspaces [444, 555, 666]. oldID := suite.allocID()

Then remove the unconditional cleanup at lines 524–525.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

ch := make(chan struct{})

re.NoError(failpoint.EnableCall("github.com/tikv/pd/pkg/keyspace/pauseFinishSplitBeforeTxn", func() { <-ch }))

ch := make(chan struct{})

re.NoError(failpoint.EnableCall("github.com/tikv/pd/pkg/keyspace/pauseFinishSplitBeforeTxn", func() { <-ch }))

defer func() {

select {

case <-ch:

default:

close(ch)

}

re.NoError(failpoint.Disable("github.com/tikv/pd/pkg/keyspace/pauseFinishSplitBeforeTxn"))

}()

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@tests/integrations/mcs/tso/keyspace_group_manager_test.go` around lines 466 - 467, The failpoint enable call for "github.com/tikv/pd/pkg/keyspace/pauseFinishSplitBeforeTxn" uses a channel ch but lacks a fail-safe cleanup; change the code to call failpoint.EnableCall(...) and immediately defer failpoint.Disable("github.com/tikv/pd/pkg/keyspace/pauseFinishSplitBeforeTxn") so the failpoint is disabled on any early return/panic, and then remove the later unconditional cleanup that disables the same failpoint to avoid double-handling; reference the ch variable and the EnableCall / Disable symbols to locate and update the test.

coderabbitai · 2026-03-05T04:41:57Z

tests/server/cluster/cluster_test.go

 	store := newMetaStore(storeID, "mock://tikv-1:1", "2.1.0", metapb.StoreState_Up, getTestDeployPath(storeID))
 	var wg sync.WaitGroup
-	re.NoError(failpoint.Enable("github.com/tikv/pd/server/cluster/versionChangeConcurrency", `return(true)`))
+	re.NoError(failpoint.EnableCall("github.com/tikv/pd/server/cluster/versionChangeConcurrency", func() { time.Sleep(500 * time.Millisecond) }))


⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

# First, locate and inspect the test file around lines 880-895 head -n 900 tests/server/cluster/cluster_test.go | tail -n 30

Repository: tikv/pd

Length of output: 1186

🏁 Script executed:

# Get more context around the failpoint enable/disable sed -n '870,900p' tests/server/cluster/cluster_test.go

Repository: tikv/pd

Length of output: 1206

🏁 Script executed:

# Let's look at the broader function context to see if there's a defer pattern or cleanup strategy sed -n '850,900p' tests/server/cluster/cluster_test.go

Repository: tikv/pd

Length of output: 1769

🏁 Script executed:

# Check if there are other tests in this file that use defer for failpoint cleanup rg -A 2 "failpoint.EnableCall" tests/server/cluster/cluster_test.go

Repository: tikv/pd

Length of output: 217

🏁 Script executed:

# Check if other tests in the file use defer for failpoint cleanup rg -B 5 "failpoint.Disable" tests/server/cluster/cluster_test.go | head -60

Repository: tikv/pd

Length of output: 2553

🏁 Script executed:

# Search across the entire codebase for failpoint defer patterns rg -A 1 "defer.*failpoint.Disable" --type go | head -30

Repository: tikv/pd

Length of output: 33

🏁 Script executed:

# Look at the test function name and surrounding context to get complete picture sed -n '860,905p' tests/server/cluster/cluster_test.go

Repository: tikv/pd

Length of output: 1606

Defer the failpoint disable to ensure cleanup even if an assertion fails.

The test has multiple assertions between the failpoint.EnableCall at line 880 and the failpoint.Disable at line 894. If any assertion fails (lines 888–889), the disable never executes, leaving the failpoint enabled for subsequent tests. Use a deferred cleanup immediately after enabling the failpoint, as shown in similar tests in this file (e.g., TestRegionStatistics).

Suggested fix

re.NoError(failpoint.EnableCall("github.com/tikv/pd/server/cluster/versionChangeConcurrency", func() { time.Sleep(500 * time.Millisecond) })) +defer func() { + re.NoError(failpoint.Disable("github.com/tikv/pd/server/cluster/versionChangeConcurrency")) +}()

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@tests/server/cluster/cluster_test.go` at line 880, The failpoint opened with failpoint.EnableCall("github.com/tikv/pd/server/cluster/versionChangeConcurrency", ...) isn't disabled if an assertion between lines 880–894 fails; immediately after calling failpoint.EnableCall(...) add a deferred cleanup like defer failpoint.Disable("github.com/tikv/pd/server/cluster/versionChangeConcurrency") so the failpoint is always disabled even on test failures, mirroring the pattern used in TestRegionStatistics and ensuring proper cleanup around the enabled failpoint.

coderabbitai · 2026-03-05T04:41:57Z

tests/server/member/member_test.go

+	ch := make(chan struct{})
+	re.NoError(failpoint.EnableCall("github.com/tikv/pd/server/raftclusterIsBusy", func() { <-ch }))


⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

cat -n tests/server/member/member_test.go | sed -n '230,260p'

Repository: tikv/pd

Length of output: 1197

🏁 Script executed:

cat -n tests/server/member/member_test.go | sed -n '220,250p'

Repository: tikv/pd

Length of output: 1276

Guard failpoint teardown with defer in this test.

Cleanup currently depends on reaching the tail. Early failures at lines 244–247 can leave the failpoint enabled and block goroutines, violating the requirement to disable failpoints immediately after test execution.

Suggested fix

ch := make(chan struct{}) re.NoError(failpoint.EnableCall("github.com/tikv/pd/server/raftclusterIsBusy", func() { <-ch })) +defer func() { + select { + case <-ch: + default: + close(ch) + } + re.NoError(failpoint.Disable("github.com/tikv/pd/server/raftclusterIsBusy")) +}() post(t, re, addr1+"/pd/api/v1/leader/resign", "") leader2 := waitLeaderChange(re, cluster, leader1) t.Log("leader2:", leader2) re.NotEqual(leader1, leader2) -close(ch) -re.NoError(failpoint.Disable("github.com/tikv/pd/server/raftclusterIsBusy"))

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

ch := make(chan struct{})

re.NoError(failpoint.EnableCall("github.com/tikv/pd/server/raftclusterIsBusy", func() { <-ch }))

ch := make(chan struct{})

re.NoError(failpoint.EnableCall("github.com/tikv/pd/server/raftclusterIsBusy", func() { <-ch }))

defer func() {

select {

case <-ch:

default:

close(ch)

}

re.NoError(failpoint.Disable("github.com/tikv/pd/server/raftclusterIsBusy"))

}()

post(t, re, addr1+"/pd/api/v1/leader/resign", "")

leader2 := waitLeaderChange(re, cluster, leader1)

t.Log("leader2:", leader2)

re.NotEqual(leader1, leader2)

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@tests/server/member/member_test.go` around lines 242 - 243, The failpoint enabled via failpoint.EnableCall("github.com/tikv/pd/server/raftclusterIsBusy", ...) is not guaranteed to be disabled on early test exit; after successfully enabling the failpoint (the call creating ch and calling failpoint.EnableCall) immediately register cleanup with defer: defer failpoint.Disable("github.com/tikv/pd/server/raftclusterIsBusy") and also defer close(ch) (or otherwise signal the goroutine) so the failpoint is always disabled and the channel closed even if the test fails early.

coderabbitai · 2026-03-05T04:41:57Z

tests/server/server_test.go

+	ch := make(chan struct{})
+	re.NoError(failpoint.EnableCall("github.com/tikv/pd/server/delayProcess", func() { <-ch }))


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash # Inspect the test function containing lines 225-260 to understand full context sed -n '180,280p' tests/server/server_test.go

Repository: tikv/pd

Length of output: 2934

🏁 Script executed:

#!/bin/bash # Confirm: are there any function-level defer statements for this test? sed -n '1,225p' tests/server/server_test.go | grep -n "func\|defer" | tail -20

Repository: tikv/pd

Length of output: 665

🏁 Script executed:

#!/bin/bash # Inspect the exact line numbers: find TestGRPCRateLimit and pinpoint failpoint operations grep -n "failpoint.EnableCall\|failpoint.Disable\|close(ch)" tests/server/server_test.go | grep -A5 -B5 "delayProcess"

Repository: tikv/pd

Length of output: 239

🏁 Script executed:

#!/bin/bash # Final confirmation: check what assertions are between failpoint enable and cleanup sed -n '226,261p' tests/server/server_test.go

Repository: tikv/pd

Length of output: 1099

Guarantee failpoint and channel cleanup on every test exit path

At Line 226, the failpoint enable is not guarded by deferred cleanup. Multiple assertions follow (lines 251–252), and if any fails, the failpoint remains enabled and the channel remains unclosed. This causes:

Failpoint state to leak and pollute subsequent tests

Goroutines spawned at lines 227–250 to remain blocked on <-ch

Wrap cleanup with defer and sync.Once to ensure it runs regardless of assertion failures:

Proposed fix

ch := make(chan struct{}) + var releaseOnce sync.Once + release := func() { + releaseOnce.Do(func() { + close(ch) + re.NoError(failpoint.Disable("github.com/tikv/pd/server/delayProcess")) + }) + } re.NoError(failpoint.EnableCall("github.com/tikv/pd/server/delayProcess", func() { <-ch })) + defer release() var wg sync.WaitGroup wg.Add(2) go func() { defer wg.Done() resp, err := grpcPDClient.GetRegion(context.Background(), &pdpb.GetRegionRequest{ Header: &pdpb.RequestHeader{ClusterId: leaderServer.GetClusterID()}, RegionKey: []byte(""), }) re.Empty(resp.GetHeader().GetError()) if err != nil { errCh <- err.Error() } else { okCh <- struct{}{} } }() grpcPDClient1, conn1 := testutil.MustNewGrpcClient(re, addr) defer conn1.Close() go func() { defer wg.Done() resp, err := grpcPDClient1.GetRegion(context.Background(), &pdpb.GetRegionRequest{ Header: &pdpb.RequestHeader{ClusterId: leaderServer.GetClusterID()}, RegionKey: []byte(""), }) re.Empty(resp.GetHeader().GetError()) if err != nil { errCh <- err.Error() } else { okCh <- struct{}{} } }() errStr := <-errCh re.Contains(errStr, "rate limit exceeded") - close(ch) - re.NoError(failpoint.Disable("github.com/tikv/pd/server/delayProcess")) + release() <-okCh wg.Wait()

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@tests/server/server_test.go` around lines 225 - 226, The test enables a failpoint with failpoint.EnableCall("github.com/tikv/pd/server/delayProcess", func() { <-ch }) and creates a channel ch but does not guarantee cleanup on test exit; wrap the enable + channel creation with a deferred cleanup that uses a sync.Once to both disable the failpoint (failpoint.Disable("github.com/tikv/pd/server/delayProcess")) and close ch so that regardless of assertion failures or panics the failpoint is disabled and the blocked goroutines are released; place the sync.Once and the defer immediately after creating ch and calling failpoint.EnableCall to ensure the cleanup always runs.

coderabbitai · 2026-03-05T04:41:57Z

tests/server/tso/tso_test.go

 	}

-	re.NoError(failpoint.Enable("github.com/tikv/pd/pkg/tso/delaySyncTimestamp", `return(true)`))
+	re.NoError(failpoint.EnableCall("github.com/tikv/pd/pkg/tso/delaySyncTimestamp", func() { time.Sleep(time.Second) }))


⚠️ Potential issue | 🟠 Major

Register failpoint disable immediately after enabling

Line 138 enables a failpoint but cleanup is only executed at Line 156. Any earlier fatal assertion can leak the failpoint into subsequent tests.

Suggested fix

re.NoError(failpoint.EnableCall("github.com/tikv/pd/pkg/tso/delaySyncTimestamp", func() { time.Sleep(time.Second) })) +defer func() { + re.NoError(failpoint.Disable("github.com/tikv/pd/pkg/tso/delaySyncTimestamp")) +}() ... -re.NoError(failpoint.Disable("github.com/tikv/pd/pkg/tso/delaySyncTimestamp"))

Based on learnings: Keep failpoints enabled only for tests; disable immediately after (`make failpoint-disable` or `make clean-test`) to avoid polluting the codebase.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@tests/server/tso/tso_test.go` at line 138, Enable the failpoint with failpoint.EnableCall("github.com/tikv/pd/pkg/tso/delaySyncTimestamp", ...) and immediately register cleanup to disable it (e.g., defer failpoint.DisableCall("github.com/tikv/pd/pkg/tso/delaySyncTimestamp") or t.Cleanup(func(){ failpoint.DisableCall("github.com/tikv/pd/pkg/tso/delaySyncTimestamp") })) right after the EnableCall succeeds so any early test failures won't leak the failpoint into other tests.

coderabbitai · 2026-03-05T04:41:57Z

tools/pd-ctl/tests/keyspace/keyspace_test.go

 	re.NoError(failpoint.Enable("github.com/tikv/pd/pkg/keyspace/acceleratedAllocNodes", `return(true)`))
 	re.NoError(failpoint.Enable("github.com/tikv/pd/pkg/tso/fastGroupSplitPatroller", `return(true)`))
-	re.NoError(failpoint.Enable("github.com/tikv/pd/server/delayStartServerLoop", `return(true)`))
+	re.NoError(failpoint.EnableCall("github.com/tikv/pd/server/delayStartServerLoop", func() { time.Sleep(2 * time.Second) }))


⚠️ Potential issue | 🟠 Major

Use cleanup-based failpoint teardown to avoid cross-test pollution

Line 55 enables a failpoint, but teardown depends on reaching Line 119. If any re.NoError fails earlier, this failpoint can remain enabled and affect later tests. Please register disable with t.Cleanup (or immediate defer) at enable time.

Suggested fix

re.NoError(failpoint.Enable("github.com/tikv/pd/pkg/keyspace/acceleratedAllocNodes", `return(true)`)) re.NoError(failpoint.Enable("github.com/tikv/pd/pkg/tso/fastGroupSplitPatroller", `return(true)`)) re.NoError(failpoint.EnableCall("github.com/tikv/pd/server/delayStartServerLoop", func() { time.Sleep(2 * time.Second) })) +t.Cleanup(func() { + re.NoError(failpoint.Disable("github.com/tikv/pd/server/delayStartServerLoop")) + re.NoError(failpoint.Disable("github.com/tikv/pd/pkg/tso/fastGroupSplitPatroller")) + re.NoError(failpoint.Disable("github.com/tikv/pd/pkg/keyspace/acceleratedAllocNodes")) +}) ... -re.NoError(failpoint.Disable("github.com/tikv/pd/pkg/keyspace/acceleratedAllocNodes")) -re.NoError(failpoint.Disable("github.com/tikv/pd/pkg/tso/fastGroupSplitPatroller")) -re.NoError(failpoint.Disable("github.com/tikv/pd/server/delayStartServerLoop"))

Based on learnings: Keep failpoints enabled only for tests; disable immediately after (`make failpoint-disable` or `make clean-test`) to avoid polluting the codebase.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@tools/pd-ctl/tests/keyspace/keyspace_test.go` at line 55, The failpoint is enabled via failpoint.EnableCall (wrapped by re.NoError) but not guaranteed to be disabled if the test fails early; immediately after enabling the failpoint (the call to failpoint.EnableCall), register a teardown that disables it (use failpoint.DisableCall or failpoint.Disable) via t.Cleanup(...) or defer to ensure the failpoint is always turned off for that test; update the test that calls failpoint.EnableCall to add this cleanup registration right after the successful enable so later tests are not polluted.

YuhaoZhang00 · 2026-03-09T08:02:51Z

pkg/election/leadership_test.go

@@ -125,8 +125,10 @@ func TestExitWatch(t *testing.T) {
 	re.NoError(failpoint.Enable("github.com/tikv/pd/pkg/utils/etcdutil/fastTick", "return(true)"))


Does this pr intend to exhaustively cover the migration of all failpoint.Injects? Found one missing here

This PR only covers three "simple" patterns: empty/nil callbacks, sleep-only callbacks, and panic-only callbacks. The fastTick failpoints here modify local variables inside the closure (e.g., unhealthyTimeout = 5 * time.Second), which requires a different migration strategy (passing pointers via InjectCall). Will handle these in a follow-up PR.

ti-chi-bot · 2026-03-09T08:11:48Z

@YuhaoZhang00: adding LGTM is restricted to approvers and reviewers in OWNERS files.

Details

In response to this:

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

ti-chi-bot · 2026-03-09T08:11:49Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: YuhaoZhang00
Once this PR has been reviewed and has the lgtm label, please assign bufferflies for approval. For more information see the Code Review Process.
Please ensure that each of them provides their approval before proceeding.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

JmPotato · 2026-03-09T15:41:49Z

/retest

codecov · 2026-03-09T15:51:08Z

Codecov Report

❌ Patch coverage is 94.73684% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 78.83%. Comparing base (525524f) to head (c242147).
⚠️ Report is 11 commits behind head on master.

Additional details and impacted files

@@            Coverage Diff             @@
##           master   #10300      +/-   ##
==========================================
+ Coverage   78.78%   78.83%   +0.05%     
==========================================
  Files         525      527       +2     
  Lines       70824    70890      +66     
==========================================
+ Hits        55796    55889      +93     
+ Misses      11004    10995       -9     
+ Partials     4024     4006      -18

Flag	Coverage Δ
unittests	`78.83% <94.73%> (+0.05%)`	⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

ti-chi-bot bot added release-note-none Denotes a PR that doesn't merit a release note. dco-signoff: yes Indicates the PR's author has signed the dco. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Mar 5, 2026

coderabbitai bot reviewed Mar 5, 2026

View reviewed changes

YuhaoZhang00 reviewed Mar 9, 2026

View reviewed changes

YuhaoZhang00 approved these changes Mar 9, 2026

View reviewed changes

		ch := make(chan struct{})
		re.NoError(failpoint.EnableCall("github.com/tikv/pd/pkg/keyspace/pauseFinishSplitBeforeTxn", func() { <-ch }))

-	ch := make(chan struct{})
-	re.NoError(failpoint.EnableCall("github.com/tikv/pd/server/raftclusterIsBusy", func() { <-ch }))
+	ch := make(chan struct{})
+	re.NoError(failpoint.EnableCall("github.com/tikv/pd/server/raftclusterIsBusy", func() { <-ch }))
+	defer func() {
+		select {
+		case <-ch:
+		default:
+			close(ch)
+		}
+		re.NoError(failpoint.Disable("github.com/tikv/pd/server/raftclusterIsBusy"))
+	}()
+	post(t, re, addr1+"/pd/api/v1/leader/resign", "")
+	leader2 := waitLeaderChange(re, cluster, leader1)
+	t.Log("leader2:", leader2)
+	re.NotEqual(leader1, leader2)

		@@ -125,8 +125,10 @@ func TestExitWatch(t *testing.T) {
		re.NoError(failpoint.Enable("github.com/tikv/pd/pkg/utils/etcdutil/fastTick", "return(true)"))

Conversation

JmPotato commented Mar 5, 2026 • edited by coderabbitai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What problem does this PR solve?

What is changed and how does it work?

Check List

Release note

Summary by CodeRabbit

Uh oh!

coderabbitai bot commented Mar 5, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Possibly related issues

Suggested labels

Suggested reviewers

Poem

❌ Failed checks (1 warning)

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Mar 5, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Mar 5, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Mar 5, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Mar 5, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Mar 5, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Mar 5, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Mar 5, 2026

Choose a reason for hiding this comment

Uh oh!

YuhaoZhang00 Mar 9, 2026

Choose a reason for hiding this comment

Uh oh!

JmPotato Mar 9, 2026

Choose a reason for hiding this comment

Uh oh!

ti-chi-bot bot commented Mar 9, 2026

Uh oh!

ti-chi-bot bot commented Mar 9, 2026

Uh oh!

JmPotato commented Mar 9, 2026

Uh oh!

codecov bot commented Mar 9, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

JmPotato commented Mar 5, 2026 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Mar 5, 2026 •

edited

Loading

codecov bot commented Mar 9, 2026 •

edited

Loading