tests: fix flaky TestTransferLeaderForScheduler #10306

Open
JmPotato wants to merge 2 commits into tikv:master from JmPotato:fix-10305

Conversation

JmPotato (Member) commented Mar 5, 2026

What problem does this PR solve?

Issue Number: Close #10305

What is changed and how does it work?

Replace `time.Sleep(time.Second)` followed by a direct `re.True(IsPrepared())`
assertion with `testutil.Eventually` to properly wait for the raft cluster to
become prepared after leader transfer. This matches the pattern already used
in the third leader transfer block of the same test.

Under CI resource pressure, a fixed 1-second sleep may not be enough for the
new leader to fully prepare, causing the subsequent scheduler count check to
time out.

Check List

Tests

  • Unit test

Release note

None.

Summary by CodeRabbit

  • Tests
    • Replaced fixed timing in cluster tests with a polling-based wait for scheduler readiness, improving reliability of leader-transfer scenarios and timing-sensitive assertions.

Close tikv#10305

Signed-off-by: JmPotato <github@ipotato.me>

ti-chi-bot bot added the release-note-none, dco-signoff: yes, and size/XS labels Mar 5, 2026

coderabbitai bot commented Mar 5, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
📥 Commits

Reviewing files that changed from the base of the PR and between bf20634 and 79bf08f.

📒 Files selected for processing (1)
  • tests/server/cluster/cluster_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • tests/server/cluster/cluster_test.go

📝 Walkthrough

Replaces a fixed sleep and immediate readiness check after region heartbeat with a polling-based testutil.Eventually wait around AreSchedulersInitialized, applied to two transfer-leader scenarios in TestTransferLeaderForScheduler to poll for scheduler initialization before assertions.

Changes

Cohort: Test synchronization
File(s): tests/server/cluster/cluster_test.go
Summary: Replaced fixed sleep and immediate readiness check after region heartbeat with polling testutil.Eventually around AreSchedulersInitialized for transfer-leader test paths; timing-sensitive assertions now wait for scheduler initialization.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Suggested labels

size/S, lgtm

Suggested reviewers

  • lhy1024

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name | Status | Explanation | Resolution
Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name | Status | Explanation
Title check | ✅ Passed | The title accurately summarizes the main change: fixing a flaky test by replacing a fixed sleep with eventual polling.
Description check | ✅ Passed | The description follows the template with issue reference, detailed explanation of the change and its rationale, and appropriate checklist selections.
Linked Issues check | ✅ Passed | The PR addresses issue #10305 by replacing fixed sleep with testutil.Eventually polling for scheduler initialization, matching the linked issue's requirement.
Out of Scope Changes check | ✅ Passed | All changes are directly related to fixing the flaky test reported in #10305; no unrelated modifications are present.

okJiang (Member) commented Mar 6, 2026

/retest

ti-chi-bot bot added the needs-1-more-lgtm label Mar 6, 2026

ti-chi-bot bot (Contributor) commented Mar 6, 2026

[LGTM Timeline notifier]

Timeline:

  • 2026-03-06 06:59:38 UTC: ☑️ agreed by okJiang.

ti-chi-bot bot added the approved label Mar 6, 2026

codecov bot commented Mar 6, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 78.89%. Comparing base (525524f) to head (79bf08f).
⚠️ Report is 11 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master   #10306      +/-   ##
==========================================
+ Coverage   78.78%   78.89%   +0.11%     
==========================================
  Files         525      527       +2     
  Lines       70824    70920      +96     
==========================================
+ Hits        55796    55950     +154     
+ Misses      11004    10964      -40     
+ Partials     4024     4006      -18     
Flag Coverage Δ
unittests 78.89% <ø> (+0.11%) ⬆️

@YuhaoZhang00

Eventually(IsPrepared()) is a good change to have.

That said, it does not fully explain issue #10305. The reported failure is at the scheduler-count check on line 1658. If the new leader were not yet prepared, the old test would already have failed at the preceding re.True(leaderServer.GetRaftCluster().IsPrepared()) on line 1655. So the problem is not simply that IsPrepared() is still false.

A more targeted fix would be to increase the wait at line 1658/1659:

    testutil.Eventually(re, func() bool {
        return len(schedulersController.GetSchedulerNames()) == schedulersNum
    }, testutil.WithWaitFor(60*time.Second))

Replace `IsPrepared()` with `AreSchedulersInitialized()` in the
post-leader-transfer wait. `IsPrepared()` only indicates that the
coordinator's prepare checker has collected enough region info, but the
Run() goroutine still needs to call `InitSchedulers(true)` afterwards
to actually register schedulers. Under CI resource pressure, the gap
between `IsPrepared()` becoming true and `InitSchedulers` completing
can exceed the default 20-second Eventually timeout.

`AreSchedulersInitialized()` is set at the end of `InitSchedulers`,
so waiting on it directly ensures all schedulers are registered before
the subsequent scheduler-count assertion.

Signed-off-by: kaijikikou <kaijikikou@gmail.com>
Signed-off-by: JmPotato <github@ipotato.me>

ti-chi-bot bot (Contributor) commented Mar 9, 2026

@JmPotato: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name | Commit | Details | Required | Rerun command
pull-unit-test-next-gen-3 | 79bf08f | link | true | /test pull-unit-test-next-gen-3

ti-chi-bot bot (Contributor) commented Mar 10, 2026

@YuhaoZhang00: adding LGTM is restricted to approvers and reviewers in OWNERS files.

ti-chi-bot bot (Contributor) commented Mar 10, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: okJiang, YuhaoZhang00

Labels

approved, dco-signoff: yes, needs-1-more-lgtm, release-note-none, size/XS

Development

Successfully merging this pull request may close these issues.

Flaky test: TestTransferLeaderForScheduler

3 participants