tests: stabilize TestSwitchModeDuringWorkload #10316

Open
okJiang wants to merge 1 commit into tikv:master from okJiang:codex/flaky-10154-switch-mode-workload

okJiang (Member) commented Mar 9, 2026

What problem does this PR solve?

Issue Number: ref #10154

TestSwitchModeDuringWorkload/pd-to-standalone is flaky.

Root-cause evidence chain

  • Issue #10154 ("TestSwitchModeDuringWorkload is unstable") and recent rerun logs both fail with "Condition never satisfied." at:
    • pkg/utils/testutil/testutil.go:68
    • tests/integrations/mcs/resourcemanager/resource_manager_test.go:492
  • CI job logs (22856930709 / 66299693231) show transient RM discovery/connectivity errors around the switch window, for example repeated resource manager errors with connection refused, while the test has already started counting post-switch successes.
  • The current test gates only on leader/service-url checks before setting switched=true, but the controller may remain in a transient degraded mode until the first successful token-bucket response after the endpoint switch.
  • Therefore okAfter can stay below the threshold within the bounded wait even though the switch eventually recovers.

Historical analog

What is changed and how does it work?

  • In TestSwitchModeDuringWorkload, after the switch and the leader/service routing checks, and before enabling post-switch counting, add an explicit wait:
    • testutil.Eventually(... !rgController.IsDegraded() ...)
  • This keeps the test intent unchanged (it must still make post-switch progress), but removes the race where counting starts while the controller is still transiently degraded.

Risk

  • Low risk; test-only change.
  • Slightly longer wait in rare slow environments (bounded to 30s) to avoid false negatives.

Verification

  • Focused verification (pass):
    • cd tests/integrations && make gotest GOTEST_ARGS='-tags without_dashboard ./mcs/resourcemanager -run TestSwitchModeDuringWorkload -count=3'
    • Result: ok github.com/tikv/pd/tests/integrations/mcs/resourcemanager 431.983s
  • Baseline verification (has unrelated pre-existing failures in this run):
    • make basic-test
    • Result: failed with flakes/noise outside the touched scope in this environment:
      • pkg/gctuner: goleak (unexpected goroutine in memory_limit_tuner)
      • pkg/storage/endpoint: TestDataPhysicalRepresentation expected path mismatch (/pd/0/... vs /pd/<cluster-id>/...)

Summary by CodeRabbit

  • Tests
    • Improved test stability by adding synchronization checks when switching resource manager deployment modes, ensuring proper controller recovery before test continuation.

Signed-off-by: okjiang <819421878@qq.com>

ti-chi-bot bot (Contributor) commented Mar 9, 2026

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

ti-chi-bot bot (Contributor) commented Mar 9, 2026

Adding the "do-not-merge/release-note-label-needed" label because no release-note block was detected, please follow our release note process to remove it.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@ti-chi-bot ti-chi-bot bot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. dco-signoff: yes Indicates the PR's author has signed the dco. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels Mar 9, 2026

ti-chi-bot bot (Contributor) commented Mar 9, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign siddontang for approval. For more information see the Code Review Process.
Please ensure that each of them provides their approval before proceeding.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ti-chi-bot ti-chi-bot bot added the size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. label Mar 9, 2026

coderabbitai bot commented Mar 9, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 5c81c775-7e67-4437-ae77-5820de15f0ce

📥 Commits

Reviewing files that changed from the base of the PR and between 58252c5 and de35208.

📒 Files selected for processing (1)
  • tests/integrations/mcs/resourcemanager/resource_manager_test.go

Walkthrough

This change adds a recovery verification step to a resource manager integration test. After switching deployment modes, the test now waits for the ResourceGroupController to transition from a degraded state by polling its status for up to 30 seconds, ensuring the controller has stabilized before proceeding.

Changes

Cohort / File(s): Resource Manager Test Recovery Check (tests/integrations/mcs/resourcemanager/resource_manager_test.go)
Summary: Added polling mechanism to wait until ResourceGroupController exits degraded state after deployment mode switch, using 50ms tick intervals with 30-second timeout.

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~5 minutes

Possibly related PRs

  • tikv/pd#10283 — Modifies the same test file to adjust waiting logic for ResourceGroupController recovery after deployment mode switching.

Suggested labels

release-note-none, lgtm, approved

Suggested reviewers

  • lhy1024

Poem

🐰 A controller stumbles, feeling quite degraded,
But patience and polling leave it unabated,
Thirty seconds tick by, fifty mills in grace,
Until stability blooms in this test's embrace! ✨

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage: ⚠️ Warning
    • Explanation: Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%.
    • Resolution: Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)

  • Title check: ✅ Passed
    • Explanation: The title clearly and concisely identifies the main change: stabilizing a flaky test. It accurately summarizes the primary objective without being vague or misleading.
  • Description check: ✅ Passed
    • Explanation: The description covers all required template sections: problem statement with issue reference, root-cause analysis, detailed explanation of the fix, risk assessment, and verification results. It is comprehensive and well-structured.


codecov bot commented Mar 9, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 78.88%. Comparing base (c1f3166) to head (de35208).
⚠️ Report is 4 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master   #10316      +/-   ##
==========================================
+ Coverage   78.78%   78.88%   +0.10%     
==========================================
  Files         527      527              
  Lines       70916    70920       +4     
==========================================
+ Hits        55870    55945      +75     
+ Misses      11026    10975      -51     
+ Partials     4020     4000      -20     
Flag Coverage Δ
unittests 78.88% <ø> (+0.10%) ⬆️

@okJiang okJiang marked this pull request as ready for review March 10, 2026 02:33
@ti-chi-bot ti-chi-bot bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 10, 2026

ti-chi-bot bot (Contributor) commented Mar 10, 2026

@okJiang: The following test failed. Say /retest to rerun all failed tests, or /retest-required to rerun all mandatory failed tests:

  • Test name: pull-integration-realcluster-test
    • Commit: de35208
    • Required: true
    • Rerun command: /test pull-integration-realcluster-test

Labels

  • dco-signoff: yes (Indicates the PR's author has signed the dco.)
  • do-not-merge/release-note-label-needed (Indicates that a PR should not merge because it's missing one of the release note labels.)
  • size/XS (Denotes a PR that changes 0-9 lines, ignoring generated files.)