Skip to content

*: migrate simple failpoint.Inject to failpoint.InjectCall#10300

Open
JmPotato wants to merge 1 commit intotikv:masterfrom
JmPotato:migrate-inject-call
Open

*: migrate simple failpoint.Inject to failpoint.InjectCall#10300
JmPotato wants to merge 1 commit intotikv:masterfrom
JmPotato:migrate-inject-call

Conversation

@JmPotato
Copy link
Member

@JmPotato JmPotato commented Mar 5, 2026

What problem does this PR solve?

Issue Number: ref #10297

What is changed and how does it work?

Migrate 19 simple failpoint.Inject() call sites to the newer
failpoint.InjectCall() / failpoint.EnableCall() APIs. This eliminates
closure bodies from production code and consolidates test injection logic
in test files with type-safe callbacks.

Categories migrated:

  • Empty/nil callbacks (5 sites): Inject("name", func(){}) / nil → bare InjectCall("name")
  • Sleep-only callbacks (12 sites): move time.Sleep to test-side EnableCall
  • Panic callbacks (2 sites): move panic to test-side EnableCall

For "pause" failpoints, the pattern is converted from
failpoint.Enable(path, "pause") to a channel-based
failpoint.EnableCall(path, func() { <-ch }) approach.

Migrate 19 simple failpoint.Inject() call sites to failpoint.InjectCall() / EnableCall() APIs.

- Empty/nil callbacks (5 sites): bare InjectCall("name")
- Sleep-only callbacks (12 sites): move time.Sleep to test-side EnableCall
- Panic callbacks (2 sites): move panic to test-side EnableCall
- "pause" failpoints converted to channel-based EnableCall pattern

Check List

Tests

  • Unit test
  • Integration test

Release note

None.

Summary by CodeRabbit

  • Tests

    • Refactored failpoint infrastructure to use call-based invocation patterns instead of direct injection.
    • Modernized test synchronization mechanisms to utilize channel-based blocking instead of pause-based approaches.
  • Chores

    • Simplified failpoint triggers by removing hardcoded delays from production code paths.

Migrate 19 simple `failpoint.Inject()` call sites to the newer
`failpoint.InjectCall()` / `failpoint.EnableCall()` APIs. This
eliminates closure bodies from production code and consolidates
test injection logic in test files with type-safe callbacks.

Categories migrated:
- Empty/nil callbacks (5 sites): bare InjectCall with no args
- Sleep-only callbacks (12 sites): move time.Sleep to test-side EnableCall
- Panic callbacks (2 sites): move panic to test-side EnableCall

For "pause" failpoints, the pattern is converted from
`failpoint.Enable(path, "pause")` to a channel-based
`failpoint.EnableCall(path, func() { <-ch })` approach.

Signed-off-by: JmPotato <github@ipotato.me>
@ti-chi-bot ti-chi-bot bot added release-note-none Denotes a PR that doesn't merit a release note. dco-signoff: yes Indicates the PR's author has signed the dco. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Mar 5, 2026
@coderabbitai
Copy link

coderabbitai bot commented Mar 5, 2026

📝 Walkthrough

Walkthrough

This pull request systematically migrates failpoint invocations across the codebase from using failpoint.Inject() with embedded callbacks to the newer failpoint.InjectCall() and failpoint.EnableCall() APIs. Production code shifts from inline sleep logic to simplified failpoint calls, while test code replaces static return values with function-based callbacks or channel-based synchronization mechanisms.

Changes

Cohort / File(s) Summary
Client & TSO Dispatch
client/clients/tso/client.go, pkg/tso/tso.go
Replaced failpoint.Inject() with embedded sleep logic with failpoint.InjectCall() for delay-related failpoints, removing inline sleep implementations.
Election & Leadership
pkg/election/leadership.go, pkg/election/leadership_test.go
Changed failpoint invocations from failpoint.Inject() to failpoint.InjectCall() in leadership code; test files introduce channel-based synchronization to control failpoint release timing.
GC State Management
pkg/gc/gc_state_manager.go, pkg/gc/gc_state_manager_test.go
Transitioned from failpoint.Inject() to failpoint.InjectCall() for GC state finish detection; tests now use channel-driven synchronization for pause/resume control.
Keyspace & Scheduling
pkg/keyspace/tso_keyspace_group.go, pkg/mcs/scheduling/server/cluster.go, pkg/schedule/checker/checker_controller.go, pkg/schedule/operator/operator_controller.go
Replaced failpoint invocations with failpoint.InjectCall(), removing embedded sleep delays while preserving failpoint triggering mechanism.
Schedule Checker Tests
pkg/schedule/checker/rule_checker_test.go, pkg/schedule/operator/operator_controller_test.go
Changed failpoint activation from static Enable() returns to EnableCall() with sleep function callbacks, introducing deliberate test delays.
Storage Layer
pkg/storage/kv/etcd_kv.go, pkg/storage/storage_tso_test.go
Removed inline sleep logic from failpoint.Inject calls and replaced with failpoint.InjectCall(); test files now use EnableCall() with sleep callbacks.
Server Core
server/api/region.go, server/cluster/cluster.go, server/grpc_service.go, server/server.go
Replaced multiple failpoint.Inject() invocations with embedded delays with failpoint.InjectCall() calls across region, cluster, and gRPC service paths.
Server Tests
server/cluster/cluster_test.go, tests/server/api/region_test.go, tests/server/api/version_test.go, tests/server/cluster/cluster_test.go, tests/server/member/member_test.go, tests/server/server_test.go, tests/server/tso/tso_test.go, tests/server/watch/leader_watch_test.go
Transitioned from static failpoint returns to EnableCall() with sleep callbacks; several tests introduce channel-based synchronization for controlled failpoint release.
Integration Tests
tests/integrations/client/client_test.go, tests/integrations/mcs/router/server_test.go, tests/integrations/mcs/tso/api_test.go, tests/integrations/mcs/tso/keyspace_group_manager_test.go, tests/integrations/tso/client_test.go
Changed failpoint activation from fixed returns to EnableCall() with sleep delays; keyspace group manager test adds channel-based synchronization.
Tools & CLI Tests
tools/pd-ctl/tests/keyspace/keyspace_group_test.go, tools/pd-ctl/tests/keyspace/keyspace_test.go
Replaced static failpoint returns with EnableCall() functions introducing 2-second sleep delays; added time import.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~30 minutes

Possibly related issues

Suggested labels

size/XL, lgtm, approved

Suggested reviewers

  • rleungx
  • bufferflies

Poem

🐰 Failpoints now flow with elegant grace,
Old Inject patterns fade without trace,
InjectCall brings clarity bright,
Channel waits for the perfect timing's might,
Tests synchronize, production runs light!

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 8.33% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly describes the main change: migrating simple failpoint.Inject calls to failpoint.InjectCall, which is the primary objective across all modified files.
Description check ✅ Passed The description includes Issue Number, explains what was changed and how, provides details about the migration categories, and includes a commit message and release note section.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 7

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tests/integrations/mcs/tso/api_test.go`:
- Line 137: The failpoint enabled with
failpoint.EnableCall("github.com/tikv/pd/server/delayStartServerLoop", ...)
should be disabled in a deferred call to guarantee cleanup on test failure:
immediately after calling failpoint.EnableCall, add a defer
failpoint.Disable("github.com/tikv/pd/server/delayStartServerLoop") so the
failpoint is always cleared, and then remove the later explicit
failpoint.Disable call that currently appears at the end of the test.

In `@tests/integrations/mcs/tso/keyspace_group_manager_test.go`:
- Around line 466-467: The failpoint enable call for
"github.com/tikv/pd/pkg/keyspace/pauseFinishSplitBeforeTxn" uses a channel ch
but lacks a fail-safe cleanup; change the code to call failpoint.EnableCall(...)
and immediately defer
failpoint.Disable("github.com/tikv/pd/pkg/keyspace/pauseFinishSplitBeforeTxn")
so the failpoint is disabled on any early return/panic, and then remove the
later unconditional cleanup that disables the same failpoint to avoid
double-handling; reference the ch variable and the EnableCall / Disable symbols
to locate and update the test.

In `@tests/server/cluster/cluster_test.go`:
- Line 880: The failpoint opened with
failpoint.EnableCall("github.com/tikv/pd/server/cluster/versionChangeConcurrency",
...) isn't disabled if an assertion between lines 880–894 fails; immediately
after calling failpoint.EnableCall(...) add a deferred cleanup like defer
failpoint.Disable("github.com/tikv/pd/server/cluster/versionChangeConcurrency")
so the failpoint is always disabled even on test failures, mirroring the pattern
used in TestRegionStatistics and ensuring proper cleanup around the enabled
failpoint.

In `@tests/server/member/member_test.go`:
- Around line 242-243: The failpoint enabled via
failpoint.EnableCall("github.com/tikv/pd/server/raftclusterIsBusy", ...) is not
guaranteed to be disabled on early test exit; after successfully enabling the
failpoint (the call creating ch and calling failpoint.EnableCall) immediately
register cleanup with defer: defer
failpoint.Disable("github.com/tikv/pd/server/raftclusterIsBusy") and also defer
close(ch) (or otherwise signal the goroutine) so the failpoint is always
disabled and the channel closed even if the test fails early.

In `@tests/server/server_test.go`:
- Around line 225-226: The test enables a failpoint with
failpoint.EnableCall("github.com/tikv/pd/server/delayProcess", func() { <-ch })
and creates a channel ch but does not guarantee cleanup on test exit; wrap the
enable + channel creation with a deferred cleanup that uses a sync.Once to both
disable the failpoint
(failpoint.Disable("github.com/tikv/pd/server/delayProcess")) and close ch so
that regardless of assertion failures or panics the failpoint is disabled and
the blocked goroutines are released; place the sync.Once and the defer
immediately after creating ch and calling failpoint.EnableCall to ensure the
cleanup always runs.

In `@tests/server/tso/tso_test.go`:
- Line 138: Enable the failpoint with
failpoint.EnableCall("github.com/tikv/pd/pkg/tso/delaySyncTimestamp", ...) and
immediately register cleanup to disable it (e.g., defer
failpoint.DisableCall("github.com/tikv/pd/pkg/tso/delaySyncTimestamp") or
t.Cleanup(func(){
failpoint.DisableCall("github.com/tikv/pd/pkg/tso/delaySyncTimestamp") })) right
after the EnableCall succeeds so any early test failures won't leak the
failpoint into other tests.

In `@tools/pd-ctl/tests/keyspace/keyspace_test.go`:
- Line 55: The failpoint is enabled via failpoint.EnableCall (wrapped by
re.NoError) but not guaranteed to be disabled if the test fails early;
immediately after enabling the failpoint (the call to failpoint.EnableCall),
register a teardown that disables it (use failpoint.DisableCall or
failpoint.Disable) via t.Cleanup(...) or defer to ensure the failpoint is always
turned off for that test; update the test that calls failpoint.EnableCall to add
this cleanup registration right after the successful enable so later tests are
not polluted.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: b1456bfe-b696-4dfc-a1e0-23bdec775603

📥 Commits

Reviewing files that changed from the base of the PR and between 4253bee and c242147.

📒 Files selected for processing (33)
  • client/clients/tso/client.go
  • pkg/election/leadership.go
  • pkg/election/leadership_test.go
  • pkg/gc/gc_state_manager.go
  • pkg/gc/gc_state_manager_test.go
  • pkg/keyspace/tso_keyspace_group.go
  • pkg/mcs/scheduling/server/cluster.go
  • pkg/schedule/checker/checker_controller.go
  • pkg/schedule/checker/rule_checker_test.go
  • pkg/schedule/operator/operator_controller.go
  • pkg/schedule/operator/operator_controller_test.go
  • pkg/storage/kv/etcd_kv.go
  • pkg/storage/storage_tso_test.go
  • pkg/tso/tso.go
  • server/api/region.go
  • server/cluster/cluster.go
  • server/cluster/cluster_test.go
  • server/grpc_service.go
  • server/server.go
  • tests/integrations/client/client_test.go
  • tests/integrations/mcs/router/server_test.go
  • tests/integrations/mcs/tso/api_test.go
  • tests/integrations/mcs/tso/keyspace_group_manager_test.go
  • tests/integrations/tso/client_test.go
  • tests/server/api/region_test.go
  • tests/server/api/version_test.go
  • tests/server/cluster/cluster_test.go
  • tests/server/member/member_test.go
  • tests/server/server_test.go
  • tests/server/tso/tso_test.go
  • tests/server/watch/leader_watch_test.go
  • tools/pd-ctl/tests/keyspace/keyspace_group_test.go
  • tools/pd-ctl/tests/keyspace/keyspace_test.go

func TestTSOServerStartFirst(t *testing.T) {
re := require.New(t)
re.NoError(failpoint.Enable("github.com/tikv/pd/server/delayStartServerLoop", `return(true)`))
re.NoError(failpoint.EnableCall("github.com/tikv/pd/server/delayStartServerLoop", func() { time.Sleep(2 * time.Second) }))
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Defer failpoint disable to ensure cleanup on test failure.

The failpoint.Disable at line 198 is not deferred. If any assertion fails between lines 137 and 198, the failpoint remains enabled and could affect subsequent tests. Based on learnings: "Keep failpoints enabled only for tests; disable immediately after."

🛡️ Proposed fix to defer disable
 	re.NoError(failpoint.EnableCall("github.com/tikv/pd/server/delayStartServerLoop", func() { time.Sleep(2 * time.Second) }))
+	defer failpoint.Disable("github.com/tikv/pd/server/delayStartServerLoop")
 	ctx, cancel := context.WithCancel(context.Background())

Then remove the explicit disable at line 198:

-	re.NoError(failpoint.Disable("github.com/tikv/pd/server/delayStartServerLoop"))
 }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
re.NoError(failpoint.EnableCall("github.com/tikv/pd/server/delayStartServerLoop", func() { time.Sleep(2 * time.Second) }))
re.NoError(failpoint.EnableCall("github.com/tikv/pd/server/delayStartServerLoop", func() { time.Sleep(2 * time.Second) }))
defer failpoint.Disable("github.com/tikv/pd/server/delayStartServerLoop")
ctx, cancel := context.WithCancel(context.Background())
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/integrations/mcs/tso/api_test.go` at line 137, The failpoint enabled
with failpoint.EnableCall("github.com/tikv/pd/server/delayStartServerLoop", ...)
should be disabled in a deferred call to guarantee cleanup on test failure:
immediately after calling failpoint.EnableCall, add a defer
failpoint.Disable("github.com/tikv/pd/server/delayStartServerLoop") so the
failpoint is always cleared, and then remove the later explicit
failpoint.Disable call that currently appears at the end of the test.

Comment on lines +466 to +467
ch := make(chan struct{})
re.NoError(failpoint.EnableCall("github.com/tikv/pd/pkg/keyspace/pauseFinishSplitBeforeTxn", func() { <-ch }))
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

# First, find and read the test file around lines 466-525
cat -n tests/integrations/mcs/tso/keyspace_group_manager_test.go | sed -n '450,530p'

Repository: tikv/pd

Length of output: 3923


Use defer to make failpoint cleanup failure-safe.

The test has multiple early-exit paths (panics from Must* functions and re.NoError() assertions at lines 471, 481, 487, 491, 497, 500, 503–505, 514–516, 519–520). If the test fails before reaching line 524–525, the failpoint remains enabled and contaminates subsequent tests.

Suggested fix
 ch := make(chan struct{})
 re.NoError(failpoint.EnableCall("github.com/tikv/pd/pkg/keyspace/pauseFinishSplitBeforeTxn", func() { <-ch }))
+defer func() {
+	select {
+	case <-ch:
+	default:
+		close(ch)
+	}
+	re.NoError(failpoint.Disable("github.com/tikv/pd/pkg/keyspace/pauseFinishSplitBeforeTxn"))
+}()
 
 // Create the keyspace group `oldID` with keyspaces [444, 555, 666].
 oldID := suite.allocID()

Then remove the unconditional cleanup at lines 524–525.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
ch := make(chan struct{})
re.NoError(failpoint.EnableCall("github.com/tikv/pd/pkg/keyspace/pauseFinishSplitBeforeTxn", func() { <-ch }))
ch := make(chan struct{})
re.NoError(failpoint.EnableCall("github.com/tikv/pd/pkg/keyspace/pauseFinishSplitBeforeTxn", func() { <-ch }))
defer func() {
select {
case <-ch:
default:
close(ch)
}
re.NoError(failpoint.Disable("github.com/tikv/pd/pkg/keyspace/pauseFinishSplitBeforeTxn"))
}()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/integrations/mcs/tso/keyspace_group_manager_test.go` around lines 466 -
467, The failpoint enable call for
"github.com/tikv/pd/pkg/keyspace/pauseFinishSplitBeforeTxn" uses a channel ch
but lacks a fail-safe cleanup; change the code to call failpoint.EnableCall(...)
and immediately defer
failpoint.Disable("github.com/tikv/pd/pkg/keyspace/pauseFinishSplitBeforeTxn")
so the failpoint is disabled on any early return/panic, and then remove the
later unconditional cleanup that disables the same failpoint to avoid
double-handling; reference the ch variable and the EnableCall / Disable symbols
to locate and update the test.

store := newMetaStore(storeID, "mock://tikv-1:1", "2.1.0", metapb.StoreState_Up, getTestDeployPath(storeID))
var wg sync.WaitGroup
re.NoError(failpoint.Enable("github.com/tikv/pd/server/cluster/versionChangeConcurrency", `return(true)`))
re.NoError(failpoint.EnableCall("github.com/tikv/pd/server/cluster/versionChangeConcurrency", func() { time.Sleep(500 * time.Millisecond) }))
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

# First, locate and inspect the test file around lines 880-895
head -n 900 tests/server/cluster/cluster_test.go | tail -n 30

Repository: tikv/pd

Length of output: 1186


🏁 Script executed:

# Get more context around the failpoint enable/disable
sed -n '870,900p' tests/server/cluster/cluster_test.go

Repository: tikv/pd

Length of output: 1206


🏁 Script executed:

# Let's look at the broader function context to see if there's a defer pattern or cleanup strategy
sed -n '850,900p' tests/server/cluster/cluster_test.go

Repository: tikv/pd

Length of output: 1769


🏁 Script executed:

# Check if there are other tests in this file that use defer for failpoint cleanup
rg -A 2 "failpoint.EnableCall" tests/server/cluster/cluster_test.go

Repository: tikv/pd

Length of output: 217


🏁 Script executed:

# Check if other tests in the file use defer for failpoint cleanup
rg -B 5 "failpoint.Disable" tests/server/cluster/cluster_test.go | head -60

Repository: tikv/pd

Length of output: 2553


🏁 Script executed:

# Search across the entire codebase for failpoint defer patterns
rg -A 1 "defer.*failpoint.Disable" --type go | head -30

Repository: tikv/pd

Length of output: 33


🏁 Script executed:

# Look at the test function name and surrounding context to get complete picture
sed -n '860,905p' tests/server/cluster/cluster_test.go

Repository: tikv/pd

Length of output: 1606


Defer the failpoint disable to ensure cleanup even if an assertion fails.

The test has multiple assertions between the failpoint.EnableCall at line 880 and the failpoint.Disable at line 894. If any assertion fails (lines 888–889), the disable never executes, leaving the failpoint enabled for subsequent tests. Use a deferred cleanup immediately after enabling the failpoint, as shown in similar tests in this file (e.g., TestRegionStatistics).

Suggested fix
re.NoError(failpoint.EnableCall("github.com/tikv/pd/server/cluster/versionChangeConcurrency", func() { time.Sleep(500 * time.Millisecond) }))
+defer func() {
+	re.NoError(failpoint.Disable("github.com/tikv/pd/server/cluster/versionChangeConcurrency"))
+}()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/server/cluster/cluster_test.go` at line 880, The failpoint opened with
failpoint.EnableCall("github.com/tikv/pd/server/cluster/versionChangeConcurrency",
...) isn't disabled if an assertion between lines 880–894 fails; immediately
after calling failpoint.EnableCall(...) add a deferred cleanup like defer
failpoint.Disable("github.com/tikv/pd/server/cluster/versionChangeConcurrency")
so the failpoint is always disabled even on test failures, mirroring the pattern
used in TestRegionStatistics and ensuring proper cleanup around the enabled
failpoint.

Comment on lines +242 to +243
ch := make(chan struct{})
re.NoError(failpoint.EnableCall("github.com/tikv/pd/server/raftclusterIsBusy", func() { <-ch }))
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

cat -n tests/server/member/member_test.go | sed -n '230,260p'

Repository: tikv/pd

Length of output: 1197


🏁 Script executed:

cat -n tests/server/member/member_test.go | sed -n '220,250p'

Repository: tikv/pd

Length of output: 1276


Guard failpoint teardown with defer in this test.

Cleanup currently depends on reaching the tail. Early failures at lines 244–247 can leave the failpoint enabled and block goroutines, violating the requirement to disable failpoints immediately after test execution.

Suggested fix
 ch := make(chan struct{})
 re.NoError(failpoint.EnableCall("github.com/tikv/pd/server/raftclusterIsBusy", func() { <-ch }))
+defer func() {
+	select {
+	case <-ch:
+	default:
+		close(ch)
+	}
+	re.NoError(failpoint.Disable("github.com/tikv/pd/server/raftclusterIsBusy"))
+}()
 post(t, re, addr1+"/pd/api/v1/leader/resign", "")
 leader2 := waitLeaderChange(re, cluster, leader1)
 t.Log("leader2:", leader2)
 re.NotEqual(leader1, leader2)
-close(ch)
-re.NoError(failpoint.Disable("github.com/tikv/pd/server/raftclusterIsBusy"))
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
ch := make(chan struct{})
re.NoError(failpoint.EnableCall("github.com/tikv/pd/server/raftclusterIsBusy", func() { <-ch }))
ch := make(chan struct{})
re.NoError(failpoint.EnableCall("github.com/tikv/pd/server/raftclusterIsBusy", func() { <-ch }))
defer func() {
select {
case <-ch:
default:
close(ch)
}
re.NoError(failpoint.Disable("github.com/tikv/pd/server/raftclusterIsBusy"))
}()
post(t, re, addr1+"/pd/api/v1/leader/resign", "")
leader2 := waitLeaderChange(re, cluster, leader1)
t.Log("leader2:", leader2)
re.NotEqual(leader1, leader2)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/server/member/member_test.go` around lines 242 - 243, The failpoint
enabled via failpoint.EnableCall("github.com/tikv/pd/server/raftclusterIsBusy",
...) is not guaranteed to be disabled on early test exit; after successfully
enabling the failpoint (the call creating ch and calling failpoint.EnableCall)
immediately register cleanup with defer: defer
failpoint.Disable("github.com/tikv/pd/server/raftclusterIsBusy") and also defer
close(ch) (or otherwise signal the goroutine) so the failpoint is always
disabled and the channel closed even if the test fails early.

Comment on lines +225 to +226
ch := make(chan struct{})
re.NoError(failpoint.EnableCall("github.com/tikv/pd/server/delayProcess", func() { <-ch }))
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Inspect the test function containing lines 225-260 to understand full context
sed -n '180,280p' tests/server/server_test.go

Repository: tikv/pd

Length of output: 2934


🏁 Script executed:

#!/bin/bash
# Confirm: are there any function-level defer statements for this test?
sed -n '1,225p' tests/server/server_test.go | grep -n "func\|defer" | tail -20

Repository: tikv/pd

Length of output: 665


🏁 Script executed:

#!/bin/bash
# Inspect the exact line numbers: find TestGRPCRateLimit and pinpoint failpoint operations
grep -n "failpoint.EnableCall\|failpoint.Disable\|close(ch)" tests/server/server_test.go | grep -A5 -B5 "delayProcess"

Repository: tikv/pd

Length of output: 239


🏁 Script executed:

#!/bin/bash
# Final confirmation: check what assertions are between failpoint enable and cleanup
sed -n '226,261p' tests/server/server_test.go

Repository: tikv/pd

Length of output: 1099


Guarantee failpoint and channel cleanup on every test exit path

At Line 226, the failpoint enable is not guarded by deferred cleanup. Multiple assertions follow (lines 251–252), and if any fails, the failpoint remains enabled and the channel remains unclosed. This causes:

  • Failpoint state to leak and pollute subsequent tests
  • Goroutines spawned at lines 227–250 to remain blocked on <-ch

Wrap cleanup with defer and sync.Once to ensure it runs regardless of assertion failures:

Proposed fix
 	ch := make(chan struct{})
+	var releaseOnce sync.Once
+	release := func() {
+		releaseOnce.Do(func() {
+			close(ch)
+			re.NoError(failpoint.Disable("github.com/tikv/pd/server/delayProcess"))
+		})
+	}
 	re.NoError(failpoint.EnableCall("github.com/tikv/pd/server/delayProcess", func() { <-ch }))
+	defer release()
 	var wg sync.WaitGroup
 	wg.Add(2)
 	go func() {
 		defer wg.Done()
 		resp, err := grpcPDClient.GetRegion(context.Background(), &pdpb.GetRegionRequest{
 			Header:    &pdpb.RequestHeader{ClusterId: leaderServer.GetClusterID()},
 			RegionKey: []byte(""),
 		})
 		re.Empty(resp.GetHeader().GetError())
 		if err != nil {
 			errCh <- err.Error()
 		} else {
 			okCh <- struct{}{}
 		}
 	}()
 	grpcPDClient1, conn1 := testutil.MustNewGrpcClient(re, addr)
 	defer conn1.Close()
 	go func() {
 		defer wg.Done()
 		resp, err := grpcPDClient1.GetRegion(context.Background(), &pdpb.GetRegionRequest{
 			Header:    &pdpb.RequestHeader{ClusterId: leaderServer.GetClusterID()},
 			RegionKey: []byte(""),
 		})
 		re.Empty(resp.GetHeader().GetError())
 		if err != nil {
 			errCh <- err.Error()
 		} else {
 			okCh <- struct{}{}
 		}
 	}()
 	errStr := <-errCh
 	re.Contains(errStr, "rate limit exceeded")
-	close(ch)
-	re.NoError(failpoint.Disable("github.com/tikv/pd/server/delayProcess"))
+	release()
 	<-okCh
 	wg.Wait()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/server/server_test.go` around lines 225 - 226, The test enables a
failpoint with failpoint.EnableCall("github.com/tikv/pd/server/delayProcess",
func() { <-ch }) and creates a channel ch but does not guarantee cleanup on test
exit; wrap the enable + channel creation with a deferred cleanup that uses a
sync.Once to both disable the failpoint
(failpoint.Disable("github.com/tikv/pd/server/delayProcess")) and close ch so
that regardless of assertion failures or panics the failpoint is disabled and
the blocked goroutines are released; place the sync.Once and the defer
immediately after creating ch and calling failpoint.EnableCall to ensure the
cleanup always runs.

}

re.NoError(failpoint.Enable("github.com/tikv/pd/pkg/tso/delaySyncTimestamp", `return(true)`))
re.NoError(failpoint.EnableCall("github.com/tikv/pd/pkg/tso/delaySyncTimestamp", func() { time.Sleep(time.Second) }))
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Register failpoint disable immediately after enabling

Line 138 enables a failpoint but cleanup is only executed at Line 156. Any earlier fatal assertion can leak the failpoint into subsequent tests.

Suggested fix
 re.NoError(failpoint.EnableCall("github.com/tikv/pd/pkg/tso/delaySyncTimestamp", func() { time.Sleep(time.Second) }))
+defer func() {
+	re.NoError(failpoint.Disable("github.com/tikv/pd/pkg/tso/delaySyncTimestamp"))
+}()
...
-re.NoError(failpoint.Disable("github.com/tikv/pd/pkg/tso/delaySyncTimestamp"))
Based on learnings: Keep failpoints enabled only for tests; disable immediately after (`make failpoint-disable` or `make clean-test`) to avoid polluting the codebase.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/server/tso/tso_test.go` at line 138, Enable the failpoint with
failpoint.EnableCall("github.com/tikv/pd/pkg/tso/delaySyncTimestamp", ...) and
immediately register cleanup to disable it (e.g., defer
failpoint.DisableCall("github.com/tikv/pd/pkg/tso/delaySyncTimestamp") or
t.Cleanup(func(){
failpoint.DisableCall("github.com/tikv/pd/pkg/tso/delaySyncTimestamp") })) right
after the EnableCall succeeds so any early test failures won't leak the
failpoint into other tests.

re.NoError(failpoint.Enable("github.com/tikv/pd/pkg/keyspace/acceleratedAllocNodes", `return(true)`))
re.NoError(failpoint.Enable("github.com/tikv/pd/pkg/tso/fastGroupSplitPatroller", `return(true)`))
re.NoError(failpoint.Enable("github.com/tikv/pd/server/delayStartServerLoop", `return(true)`))
re.NoError(failpoint.EnableCall("github.com/tikv/pd/server/delayStartServerLoop", func() { time.Sleep(2 * time.Second) }))
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Use cleanup-based failpoint teardown to avoid cross-test pollution

Line 55 enables a failpoint, but teardown depends on reaching Line 119. If any re.NoError fails earlier, this failpoint can remain enabled and affect later tests. Please register disable with t.Cleanup (or immediate defer) at enable time.

Suggested fix
 re.NoError(failpoint.Enable("github.com/tikv/pd/pkg/keyspace/acceleratedAllocNodes", `return(true)`))
 re.NoError(failpoint.Enable("github.com/tikv/pd/pkg/tso/fastGroupSplitPatroller", `return(true)`))
 re.NoError(failpoint.EnableCall("github.com/tikv/pd/server/delayStartServerLoop", func() { time.Sleep(2 * time.Second) }))
+t.Cleanup(func() {
+	re.NoError(failpoint.Disable("github.com/tikv/pd/server/delayStartServerLoop"))
+	re.NoError(failpoint.Disable("github.com/tikv/pd/pkg/tso/fastGroupSplitPatroller"))
+	re.NoError(failpoint.Disable("github.com/tikv/pd/pkg/keyspace/acceleratedAllocNodes"))
+})
 ...
-re.NoError(failpoint.Disable("github.com/tikv/pd/pkg/keyspace/acceleratedAllocNodes"))
-re.NoError(failpoint.Disable("github.com/tikv/pd/pkg/tso/fastGroupSplitPatroller"))
-re.NoError(failpoint.Disable("github.com/tikv/pd/server/delayStartServerLoop"))
Based on learnings: Keep failpoints enabled only for tests; disable immediately after (`make failpoint-disable` or `make clean-test`) to avoid polluting the codebase.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tools/pd-ctl/tests/keyspace/keyspace_test.go` at line 55, The failpoint is
enabled via failpoint.EnableCall (wrapped by re.NoError) but not guaranteed to
be disabled if the test fails early; immediately after enabling the failpoint
(the call to failpoint.EnableCall), register a teardown that disables it (use
failpoint.DisableCall or failpoint.Disable) via t.Cleanup(...) or defer to
ensure the failpoint is always turned off for that test; update the test that
calls failpoint.EnableCall to add this cleanup registration right after the
successful enable so later tests are not polluted.

@@ -125,8 +125,10 @@ func TestExitWatch(t *testing.T) {
re.NoError(failpoint.Enable("github.com/tikv/pd/pkg/utils/etcdutil/fastTick", "return(true)"))

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this pr intend to exhaustively cover the migration of all failpoint.Injects? Found one missing here

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR only covers three "simple" patterns: empty/nil callbacks, sleep-only callbacks, and panic-only callbacks. The fastTick failpoints here modify local variables inside the closure (e.g., unhealthyTimeout = 5 * time.Second), which requires a different migration strategy (passing pointers via InjectCall). Will handle these in a follow-up PR.

@ti-chi-bot
Copy link
Contributor

ti-chi-bot bot commented Mar 9, 2026

@YuhaoZhang00: adding LGTM is restricted to approvers and reviewers in OWNERS files.

Details

In response to this:

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@ti-chi-bot
Copy link
Contributor

ti-chi-bot bot commented Mar 9, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: YuhaoZhang00
Once this PR has been reviewed and has the lgtm label, please assign bufferflies for approval. For more information see the Code Review Process.
Please ensure that each of them provides their approval before proceeding.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@JmPotato
Copy link
Member Author

JmPotato commented Mar 9, 2026

/retest

@codecov
Copy link

codecov bot commented Mar 9, 2026

Codecov Report

❌ Patch coverage is 94.73684% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 78.83%. Comparing base (525524f) to head (c242147).
⚠️ Report is 11 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master   #10300      +/-   ##
==========================================
+ Coverage   78.78%   78.83%   +0.05%     
==========================================
  Files         525      527       +2     
  Lines       70824    70890      +66     
==========================================
+ Hits        55796    55889      +93     
+ Misses      11004    10995       -9     
+ Partials     4024     4006      -18     
Flag Coverage Δ
unittests 78.83% <94.73%> (+0.05%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

dco-signoff: yes Indicates the PR's author has signed the dco. release-note-none Denotes a PR that doesn't merit a release note. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants