Skip to content

*: upgrade testify from v1.10.0 to v1.11.1#10295

Open
JmPotato wants to merge 17 commits intotikv:masterfrom
JmPotato:update-testify
Open

*: upgrade testify from v1.10.0 to v1.11.1#10295
JmPotato wants to merge 17 commits intotikv:masterfrom
JmPotato:update-testify

Conversation

@JmPotato
Copy link
Member

@JmPotato JmPotato commented Mar 4, 2026

What problem does this PR solve?

Issue Number: Close #10164

What is changed and how does it work?

Remove the `replace` directive that pinned testify to v1.10.0 and fix
all Eventually condition functions that used testify assertions
(re.NoError, re.Equal, ReadGetJSON, etc.) inside the condition, which
permanently marks tests as failed on first check with testify v1.11+.

testify v1.11 changed `Eventually` to check the condition immediately
before the first tick. This exposed an anti-pattern: using testify
assertions (which call t.Errorf()) inside Eventually condition functions.
When the condition fails on the first immediate check, the test is
permanently marked as failed even if the condition eventually succeeds.

The key changes are:
- Remove `replace` directives in root, tools, and tests/integrations
  go.mod files
- Upgrade client/go.mod from v1.9.0 to v1.11.1
- Add Try* utility functions to pkg/utils/testutil/api_check.go
  (TryReadGetJSON, TryCheckGetJSON, TryCheckPostJSON, etc.) that return
  errors instead of using testify assertions
- Replace all testify assertions inside Eventually condition functions
  with pure Go boolean expressions or Try* variants across ~50 test
  files

Check List

Tests

  • Unit test
  • Integration test

Release note

None.

Summary by CodeRabbit

  • Tests

    • Improved resilience by converting many immediate assertions into guarded checks inside retry/polling loops, reducing flakiness and deferring failures to eventual checks.
    • Added numerous error-returning "Try*" / "try*" helpers and non-panicking variants for HTTP, CLI and request helpers to enable graceful retries and clearer polling logic.
    • Tightened select verifications to ensure errors propagate through both synchronous returns and asynchronous callbacks.
  • Chores

    • Upgraded testify to v1.11.1 and removed temporary version-pinning workarounds.

Remove the `replace` directive that pinned testify to v1.10.0 and fix
all Eventually condition functions that used testify assertions
(re.NoError, re.Equal, ReadGetJSON, etc.) inside the condition, which
permanently marks tests as failed on first check with testify v1.11+.

The key changes are:
- Remove `replace` directives in root, tools, and tests/integrations
  go.mod files
- Upgrade client/go.mod from v1.9.0 to v1.11.1
- Add Try* utility functions to pkg/utils/testutil/api_check.go
  (TryReadGetJSON, TryCheckGetJSON, TryCheckPostJSON, etc.) that return
  errors instead of using testify assertions
- Replace all testify assertions inside Eventually condition functions
  with pure Go boolean expressions or Try* variants across ~50 test
  files

Issue Number: close tikv#10164

Signed-off-by: JmPotato <github@ipotato.me>
@ti-chi-bot ti-chi-bot bot added release-note-none Denotes a PR that doesn't merit a release note. dco-signoff: yes Indicates the PR's author has signed the dco. labels Mar 4, 2026
@ti-chi-bot
Copy link
Contributor

ti-chi-bot bot commented Mar 4, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign cabinfeverb for approval. For more information see the Code Review Process.
Please ensure that each of them provides their approval before proceeding.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ti-chi-bot ti-chi-bot bot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label Mar 4, 2026
@coderabbitai
Copy link

coderabbitai bot commented Mar 4, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

Tests and test utilities were changed to use non-panicking, error-returning "try" helpers and guarded predicates inside Eventually loops; multiple temporary testify replace directives were removed, normalizing the testify dependency version across modules and tightening some test error-propagation checks.

Changes

Cohort / File(s) Summary
Dependency management
client/go.mod, go.mod, tests/integrations/go.mod, tools/go.mod
Removed temporary replace pins for github.com/stretchr/testify and normalized/upgraded testify usage across modules.
Test utility helpers
pkg/utils/testutil/api_check.go, tools/pd-ctl/tests/helper.go, tests/server/apiv2/handlers/testutil.go, tests/testutil.go
Added error-first helpers (TryReadGetJSON, TryCheckGetJSON, TryCheckPostJSON, TryCheckPatchJSON, TryExec, tryCheckResp, loadKeyspaceGroupByIDNoAssert, etc.) to support Eventually predicates without panicking.
TSO / client tests
client/clients/tso/stream_test.go, tests/integrations/tso/..., pkg/tso/keyspace_group_manager_test.go, tests/integrations/tso/server_test.go
Converted in-predicate assertions to guarded returns; added tryRequest and try-style consistency helpers; tightened error propagation checks in TSO stream test.
Server API & HTTP tests
tests/server/api/*.go, tests/server/apiv2/handlers/affinity_test.go, tests/server/apiv2/handlers/testutil.go, tests/server/api/*
Replaced Read/CheckGetJSON usages with Try* variants, added body/status guards, switched some equality checks to reflect.DeepEqual, centralized HTTP helpers.
MCS / scheduling / resource-manager tests
pkg/mcs/..., tests/integrations/mcs/scheduling/*, tests/integrations/mcs/resourcemanager/*, tests/integrations/mcs/router/*
Reworked Eventually predicates to return false on transient errors, introduced try-style helpers and nil/error guards instead of inline assertions.
Integration & cluster tests
tests/integrations/*, tests/server/cluster/cluster_test.go, tests/server/member/member_test.go
Replaced in-predicate require/assert calls with guarded early-return patterns and try-check helpers (e.g., tryCheckMemberList); added channel coordination replacing sleep in one cluster test.
pkg unit test updates
pkg/election/leadership_test.go, pkg/schedule/scatter/region_scatterer_test.go, pkg/storage/leveldb_backend_test.go, pkg/utils/etcdutil/etcdutil_test.go, pkg/mcs/resourcemanager/server/manager_test.go
Uniform change from require/assert inside Eventually to guarded-return predicates; safer nil/error handling in polling loops.
pd-ctl / tools tests
tools/pd-ctl/tests/*, tools/pd-ctl/tests/helper.go
Added TryExec and migrated MustExec/Must* usage in retry contexts to Try*/error-returning patterns; relaxed in-predicate assertions to guarded retries.
Small production tweak
server/cluster/cluster.go
Replaced a failpoint sleep with a failpoint.InjectCall to avoid artificial 500ms delay during version-change concurrency testing.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

Suggested labels

lgtm, approved

Suggested reviewers

  • rleungx
  • okJiang

Poem

🐰
I hop through tests with careful paws,
Swap sudden fails for patient laws.
Return false, wait — let Eventually sigh,
Retries and calm make CI rest high.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 30.65% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The PR title directly and specifically describes the main change: upgrading testify from v1.10.0 to v1.11.1, which aligns with the primary objective of the changeset.
Description check ✅ Passed The PR description comprehensively covers the problem being solved (Issue #10164), explains the changes made (removing replace directives, upgrading dependencies, adding Try* utilities), and details the impact across ~50 test files.
Linked Issues check ✅ Passed The PR successfully addresses Issue #10164 by removing testify v1.10.0 pinning, upgrading to v1.11.1, fixing Eventually assertion anti-patterns across ~50 test files, and adding Try* utility functions to enable error-based control flow instead of assertions.
Out of Scope Changes check ✅ Passed All changes are scoped to the testify upgrade objective: dependency updates in go.mod files, new Try* utility functions for error-first testing patterns, and systematic replacement of testify assertions within Eventually condition functions across test files.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 5

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (6)
tools/pd-ctl/tests/member/member_test.go (1)

53-53: ⚠️ Potential issue | 🟠 Major

Duplicate defer cluster.Destroy() call.

Line 40 already has defer cluster.Destroy(). This duplicate defer at line 53 will cause Destroy() to be called twice when the function exits, which could lead to double-close errors or undefined behavior.

Proposed fix
 	client := cluster.GetServer("pd1").GetEtcdClient()
-	defer cluster.Destroy()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tools/pd-ctl/tests/member/member_test.go` at line 53, The test contains a
duplicate deferred call to cluster.Destroy() (the variable cluster and its
Destroy() method are deferred twice); remove the redundant defer so Destroy() is
only scheduled once—leave the original defer cluster.Destroy() (the one at line
~40) and delete the second defer cluster.Destroy() at line ~53 to avoid double
destruction and potential double-close errors.
tests/server/member/member_test.go (1)

112-137: ⚠️ Potential issue | 🟡 Minor

Remove unused checkMemberList function.

The checkMemberList function at lines 112-137 is dead code and not called anywhere in the codebase. It has been superseded by tryCheckMemberList in the same file. Remove the function to maintain code cleanliness.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/server/member/member_test.go` around lines 112 - 137, Delete the unused
function checkMemberList from the file (the entire function block starting with
func checkMemberList(...)) since it's dead code and replaced by
tryCheckMemberList; remove any now-unused imports that were only required by
checkMemberList (e.g., io, json, pdpb, errors, or tests.TestDialClient if they
become unused), and run go vet/go build to ensure no remaining references to
checkMemberList exist.
tools/pd-ctl/tests/config/config_test.go (1)

660-668: ⚠️ Potential issue | 🟡 Minor

Tighten the Eventually success condition before exiting.

On Line 665, !strings.Contains(..., "404") can return true for non-JSON/transient responses, then Line 667 fails outside the retry loop. Keep JSON validation inside the predicate.

🔧 Suggested fix
 	var group placement.RuleGroup
 	testutil.Eventually(re, func() bool { // wait for the config to be synced to the scheduling server
 		output, err = tests.ExecuteCommand(cmd, "-u", pdAddr, "config", "placement-rules", "rule-group", "show", placement.DefaultGroupID)
 		if err != nil {
 			return false
 		}
-		return !strings.Contains(string(output), "404")
+		if strings.Contains(string(output), "404") {
+			return false
+		}
+		if json.Unmarshal(output, &group) != nil {
+			return false
+		}
+		return group.ID == placement.DefaultGroupID
 	})
-	re.NoError(json.Unmarshal(output, &group), string(output))
 	re.Equal(placement.RuleGroup{ID: placement.DefaultGroupID}, group)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tools/pd-ctl/tests/config/config_test.go` around lines 660 - 668, The
Eventually predicate is too lax—it only checks for absence of "404" and lets
invalid or non-JSON responses escape the retry loop causing json.Unmarshal and
re.Equal on group to fail; update the testutil.Eventually predicate (the
anonymous func passed to testutil.Eventually that calls tests.ExecuteCommand
with cmd and pdAddr) to attempt json.Unmarshal(output, &group) and verify the
unmarshaled value equals placement.RuleGroup{ID: placement.DefaultGroupID}
inside the predicate, returning true only when unmarshalling succeeds and the
group matches; leave the external re.NoError/re.Equal assertions out of the
outer scope since the predicate will now ensure valid JSON and the expected
group before exiting.
pkg/mcs/discovery/register_test.go (1)

73-92: ⚠️ Potential issue | 🟠 Major

Avoid time.Sleep inside Eventually predicate—can make the test extremely slow.

At lines 89–92, testutil.Eventually calls tryGetKeyAfterLeaseExpired, which sleeps for DefaultLeaseInSeconds (5 seconds) plus 500 ms on every predicate evaluation. If Eventually retries N times, the total sleep is multiplied by N, potentially pushing this test into the 30–60+ second range, especially given the 3-iteration loop context.

Refactor to compute a deadline once before Eventually, then return false until that deadline passes, before doing a fast client.Get poll:

Restructure with deadline gate
+func tryGetKey(ctx context.Context, client *clientv3.Client, key string) (string, error) {
+    resp, err := client.Get(ctx, key)
+    if err != nil {
+        return "", err
+    }
+    if len(resp.Kvs) == 0 {
+        return "", nil
+    }
+    return string(resp.Kvs[0].Value), nil
+}

 ...
         <-etcd.Server.ReadyNotify()
+        // Wait once for the lease-expiry window, then poll quickly.
+        expireAt := time.Now().Add(DefaultLeaseInSeconds*time.Second + 500*time.Millisecond)
         testutil.Eventually(re, func() bool {
-            val, err := tryGetKeyAfterLeaseExpired(ctx, client, sr.key)
+            if time.Now().Before(expireAt) {
+                return false
+            }
+            val, err := tryGetKey(ctx, client, sr.key)
             return err == nil && val == "127.0.0.1:2"
         })
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/mcs/discovery/register_test.go` around lines 73 - 92, The predicate
passed to testutil.Eventually must not call tryGetKeyAfterLeaseExpired (which
sleeps DefaultLeaseInSeconds+500ms); instead compute a single deadline before
calling testutil.Eventually (e.g. deadline :=
time.Now().Add(time.Duration(DefaultLeaseInSeconds+500)*time.Millisecond)) and
have the predicate return false until time.Now() >= deadline, after which do a
fast non-sleeping client.Get loop (or a quick single Get) to check sr.key and
compare the value to "127.0.0.1:2"; replace the current call to
tryGetKeyAfterLeaseExpired inside testutil.Eventually with this deadline-gated
check so retries don’t multiply the sleep.
pkg/utils/etcdutil/etcdutil_test.go (1)

193-201: ⚠️ Potential issue | 🟠 Major

Always verify member convergence via list check, not removal RPC success.

The predicate currently returns true as soon as RemoveEtcdMember succeeds without confirming the member list reflects the removal. Since the test wraps this in Eventually specifically to tolerate removal timing issues (member shutdown can cause transient errors), it's safer to always validate end-state via member list rather than trust the RPC return value.

Suggested fix
 testutil.Eventually(re, func() bool {
-    _, err := RemoveEtcdMember(client1, memIDToRemove)
-    if err == nil {
-        return true
-    }
-    // Verify if the member was actually removed by checking the member list
-    listResp, listErr := ListEtcdMembers(client1.Ctx(), client1)
-    return listErr == nil && len(listResp.Members) == 1
+    _, _ = RemoveEtcdMember(client1, memIDToRemove)
+    // Always verify convergence via member list (covers both nil and non-nil remove errors).
+    listResp, listErr := ListEtcdMembers(client1.Ctx(), client1)
+    return listErr == nil && len(listResp.Members) == 1
 })
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/utils/etcdutil/etcdutil_test.go` around lines 193 - 201, The predicate in
the test currently returns true immediately when RemoveEtcdMember succeeds
without verifying the cluster state; change the predicate used in
testutil.Eventually to always call ListEtcdMembers(client1.Ctx(), client1) and
only return true when the returned member list reflects that memIDToRemove is
gone (e.g., len(listResp.Members) == 1 or the specific member ID is absent),
treating RemoveEtcdMember errors as transient and not as a success condition;
update the code around RemoveEtcdMember, ListEtcdMembers, client1 and
memIDToRemove so the end-state check (member list) determines success.
tests/server/api/rule_test.go (1)

548-559: ⚠️ Potential issue | 🟠 Major

Return false when the expected rule is missing.

In the success path, the predicate currently returns true if no rule with GroupID == "e" is present, which can make this test pass even when the target rule is absent.

🔧 Suggested fix
 		if testCase.success {
 			testutil.Eventually(re, func() bool {
 				if testutil.TryReadGetJSON(tests.TestDialClient, url, &resp) != nil {
 					return false
 				}
 				for _, r := range resp {
 					if r.GroupID == "e" {
 						return compareRule(r, &rule)
 					}
 				}
-				return true
+				return false
 			})
 		} else {
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/server/api/rule_test.go` around lines 548 - 559, The predicate passed
to testutil.Eventually incorrectly returns true when no rule with GroupID == "e"
is found, allowing the success path to pass incorrectly; update the closure used
in the testCase.success branch (the anonymous func passed to testutil.Eventually
that calls testutil.TryReadGetJSON and iterates resp) so that if no r with
r.GroupID == "e" is found it returns false (and only returns the result of
compareRule(r, &rule) when a matching rule is present), ensuring Eventually
keeps polling until the expected rule appears and compareRule returns true.
🧹 Nitpick comments (7)
tests/integrations/tso/consistency_test.go (1)

139-141: Preserve stream close/send errors for timeout diagnostics.

At Line 140, dropping CloseSend() errors makes eventual timeout root-cause analysis harder. Consider tracking a lastErr in outer scope (also for Line 136/142 failures) and assert it after Eventually fails.

💡 Suggested refactor
-	var resp *tsopb.TsoResponse
+	var (
+		resp    *tsopb.TsoResponse
+		lastErr error
+	)
 	testutil.Eventually(re, func() bool {
 		tsoClient, err := suite.tsoClient.Tso(ctx)
 		if err != nil {
+			lastErr = err
 			return false
 		}
 		defer func() {
-			_ = tsoClient.CloseSend()
+			if err := tsoClient.CloseSend(); err != nil && lastErr == nil {
+				lastErr = err
+			}
 		}()
 		if err := tsoClient.Send(req); err != nil {
+			lastErr = err
 			return false
 		}
 		resp, err = tsoClient.Recv()
-		return err == nil && resp != nil
+		if err != nil || resp == nil {
+			lastErr = err
+			return false
+		}
+		lastErr = nil
+		return true
 	})
+	re.NoError(lastErr)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/integrations/tso/consistency_test.go` around lines 139 - 141, The defer
currently swallows errors from tsoClient.CloseSend(), which loses useful failure
context; introduce an outer-scope variable (e.g., lastErr) and assign
CloseSend()'s error to it instead of discarding, also set lastErr for the other
failure points referenced (the checks at the other failing calls around the
Eventually block), and update the test teardown/assertion so that if the
Eventually assertion fails you surface lastErr (assert or t.Log/fatal with
lastErr) to preserve the root-cause error for timeout diagnostics; reference the
tsoClient.CloseSend() call, the Eventually assertion, and the new lastErr
variable when making changes.
client/clients/tso/stream_test.go (1)

361-365: Prefer accepting either error path in the Eventually predicate.

At Line 365, requiring both processErr and callbackErr can make this check brittle if closure semantics return only one error path (sync return or callback). Consider loosening to either-path success.

Suggested change
-		return processErr != nil && callbackErr != nil
+		return processErr != nil || callbackErr != nil
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@client/clients/tso/stream_test.go` around lines 361 - 365, The test's
Eventually predicate currently requires both processErr and callbackErr to be
non-nil, which is brittle; update the predicate used around
s.stream.processRequests (referencing processErr and callbackErr and the
callback passed to stream.processRequests) to accept either error path by
returning processErr != nil || callbackErr != nil so the test succeeds if the
error is delivered synchronously or via the callback.
tests/integrations/mcs/tso/keyspace_group_manager_test.go (1)

242-244: Please validate with failpoint-aware test targets.

Run touched tests with make basic-test / make gotest (not plain go test) so failpoints are correctly managed.

Based on learnings: Applies to **/*_test.go : Tests must handle failpoints using github.com/pingcap/failpoint. Do not run go test directly unless failpoints are enabled/disabled correctly. Use make test, make basic-test, or make gotest commands to auto enable/disable failpoints.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/integrations/mcs/tso/keyspace_group_manager_test.go` around lines 242 -
244, The test in tests/integrations/mcs/tso/keyspace_group_manager_test.go
relies on failpoints and must be executed with failpoint-aware targets; run it
via make basic-test or make gotest (not plain go test). Ensure the test file
imports github.com/pingcap/failpoint and that failpoints are enabled/disabled
properly (e.g., enable required failpoints in TestMain or setup and call
failpoint.Disable in cleanup) so the branch guarded by the if err != nil || tam
== nil check behaves deterministically under test; re-run the touched *_test.go
with the make targets to validate the fix.
tests/server/apiv2/handlers/testutil.go (1)

230-252: Refactor recommendation: deduplicate keyspace-group load paths.

tryLoadKeyspaceGroupByID and TryLoadKeyspaceGroupByID currently duplicate request/read/unmarshal logic. Consolidating them will reduce drift risk.

♻️ Proposed refactor
-func tryLoadKeyspaceGroupByID(server *tests.TestServer, id uint32) (*endpoint.KeyspaceGroup, error) {
+func tryLoadKeyspaceGroupByID(server *tests.TestServer, id uint32) (*endpoint.KeyspaceGroup, int, error) {
 	httpReq, err := http.NewRequest(http.MethodGet, server.GetAddr()+keyspaceGroupsPrefix+fmt.Sprintf("/%d", id), http.NoBody)
 	if err != nil {
-		return nil, err
+		return nil, 0, err
 	}
 	resp, err := tests.TestDialClient.Do(httpReq)
 	if err != nil {
-		return nil, err
+		return nil, 0, err
 	}
 	defer resp.Body.Close()
 	data, err := io.ReadAll(resp.Body)
 	if err != nil {
-		return nil, err
+		return nil, resp.StatusCode, err
 	}
 	if resp.StatusCode != http.StatusOK {
-		return nil, fmt.Errorf("unexpected status code: %d", resp.StatusCode)
+		return nil, resp.StatusCode, fmt.Errorf("unexpected status code: %d", resp.StatusCode)
 	}
 	var kg endpoint.KeyspaceGroup
 	if err := json.Unmarshal(data, &kg); err != nil {
-		return nil, err
+		return nil, resp.StatusCode, err
 	}
-	return &kg, nil
+	return &kg, resp.StatusCode, nil
 }
 
 func MustLoadKeyspaceGroupByID(re *require.Assertions, server *tests.TestServer, id uint32) *endpoint.KeyspaceGroup {
 	var kg *endpoint.KeyspaceGroup
 	testutil.Eventually(re, func() bool {
-		var err error
-		kg, err = tryLoadKeyspaceGroupByID(server, id)
+		var err error
+		kg, _, err = tryLoadKeyspaceGroupByID(server, id)
 		return err == nil && kg != nil
 	})
 	return kg
 }
 
 func TryLoadKeyspaceGroupByID(re *require.Assertions, server *tests.TestServer, id uint32) (*endpoint.KeyspaceGroup, int) {
-	httpReq, err := http.NewRequest(http.MethodGet, server.GetAddr()+keyspaceGroupsPrefix+fmt.Sprintf("/%d", id), http.NoBody)
-	re.NoError(err)
-	resp, err := tests.TestDialClient.Do(httpReq)
-	re.NoError(err)
-	defer resp.Body.Close()
-	data, err := io.ReadAll(resp.Body)
-	re.NoError(err)
-	if resp.StatusCode != http.StatusOK {
-		return nil, resp.StatusCode
-	}
-
-	var kg endpoint.KeyspaceGroup
-	re.NoError(json.Unmarshal(data, &kg))
-	return &kg, resp.StatusCode
+	kg, code, err := tryLoadKeyspaceGroupByID(server, id)
+	if err != nil {
+		return nil, code
+	}
+	return kg, code
 }

Also applies to: 255-270

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/server/apiv2/handlers/testutil.go` around lines 230 - 252,
tryLoadKeyspaceGroupByID and TryLoadKeyspaceGroupByID duplicate the same HTTP
request, response read and json.Unmarshal logic; extract the shared logic into a
single helper (e.g., a new function like doGetAndUnmarshal or loadJSONInto) that
takes the request URL (or http.Request) and a target interface{} to unmarshal
into, then reimplement both tryLoadKeyspaceGroupByID and
TryLoadKeyspaceGroupByID to call that helper and return the resulting
*endpoint.KeyspaceGroup; update error propagation so behavior remains identical.
tools/pd-ctl/tests/helper.go (1)

108-123: Include command output in JSON unmarshal errors for better debuggability.

When JSON decoding fails, callers who pass a struct argument lose the raw output that typically explains the failure. The current implementation returns ("", err) without the output context. Since most callers in scheduler_test.go don't capture output when passing a struct for unmarshal, wrapping the error with the actual output would significantly aid debugging.

Proposed fix
 func TryExec(cmd *cobra.Command, args []string, v any) (string, error) {
     output, err := ExecuteCommand(cmd, args...)
     if err != nil {
         return "", err
     }
     if v == nil {
         return string(output), nil
     }
-    if err := json.Unmarshal(output, v); err != nil {
-        return "", err
+    if err := json.Unmarshal(output, v); err != nil {
+        return "", fmt.Errorf("unmarshal pd-ctl output: %w: %s", err, string(output))
     }
     return "", nil
 }

Requires adding fmt to imports.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tools/pd-ctl/tests/helper.go` around lines 108 - 123, TryExec's JSON
unmarshal branch currently returns the unmodified error without the command
output, losing useful debug info; update TryExec to wrap the json.Unmarshal
error with the raw command output (e.g. using fmt.Errorf("json unmarshal failed:
%w; output: %s", err, string(output))) so callers receive the output in the
error message, and add fmt to the imports; reference the TryExec function and
the json.Unmarshal error branch when making this change.
tests/testutil.go (1)

463-562: Please run failpoint-safe test commands for this change set.

Given these retry/Eventually changes in test infrastructure, validate with failpoint-aware targets (make check plus the narrowest make basic-test/make gotest scopes for touched packages).

Based on learnings, "Before PR: run make check and the narrowest relevant go test (with tags). At minimum run make basic-test for touched packages."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/testutil.go` around lines 463 - 562, Run failpoint-aware test targets
to validate the test changes: execute make check, then run the narrowest
package-level tests (e.g., make basic-test or make gotest for the packages that
contain TestDialClient, testutil.Eventually, TryCheckGetJSON, CheckPostJSON and
related testhelpers). Ensure these commands use the failpoint-enabled test
configuration (the project’s failpoint build/test flags or tags) so the
retry/Eventually changes are exercised under failpoints.
tests/server/api/region_test.go (1)

449-455: Use the current response variable in this polling block.

This predicate fetches into r8 but calls r4.Adjust(). That looks like a stale copy-paste and makes the check harder to read.

🔧 Suggested cleanup
 	testutil.Eventually(re, func() bool {
 		if err := testutil.TryReadGetJSON(tests.TestDialClient, url, r8); err != nil {
 			return false
 		}
-		r4.Adjust()
+		r8.Adjust()
 		return r8.Count == 1 && len(r8.Regions) > 0 && r.GetID() == r8.Regions[0].ID
 	})
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/server/api/region_test.go` around lines 449 - 455, The polling
predicate is reading the response into r8 but mistakenly calls r4.Adjust(),
which is a stale copy/paste; update the predicate to call r8.Adjust() so the
check uses the current response variable (inside the testutil.Eventually closure
that calls testutil.TryReadGetJSON and then compares r8.Count, r8.Regions, and
r.GetID()).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tests/integrations/tso/server_test.go`:
- Around line 128-164: tryRequest can block forever on tsoClient.Recv/Send; wrap
each attempt in a short per-attempt context with timeout (e.g.
context.WithTimeout(ctx, shortDuration)) and use that derived context when
calling suite.pdClient.Tso(ctx) / suite.tsoClient.Tso(ctx) and any gRPC ops;
ensure you call cancel() in a defer, keep the existing CloseSend() cleanup, and
apply the same timeout-wrapped context for both the legacy (pdpb.TsoRequest) and
non-legacy (tsopb.TsoRequest) branches so each Send/Recv is bounded and the
Eventually loop cannot hang indefinitely.

In `@tests/server/api/api_test.go`:
- Around line 945-948: The test is unmarshalling a stale `output` variable
instead of the current HTTP response body for the polling iteration, which can
hide failures; update the assertion to read and unmarshal the fresh response
body (the bytes you just got from the request in this loop, e.g., the current
`resp.Body`/`body` variable) into `p` and then assert p.Action == "removing" &&
p.Progress == 0.0 && p.CurrentSpeed == 0.0 && p.LeftSeconds == math.MaxFloat64
so the check uses the latest response rather than the previously captured
`output`.

In `@tests/server/api/store_test.go`:
- Around line 528-546: The function tryRequestStatusBody leaks resp.Body when
io.ReadAll returns an error; update tryRequestStatusBody to ensure resp.Body is
always closed (use defer resp.Body.Close() immediately after client.Do succeeds)
and handle any Close() error if needed before returning the read error, so every
early return after client.Do cleans up the response body.

In `@tools/pd-ctl/tests/config/config_test.go`:
- Around line 856-870: The polling predicate used with testutil.Eventually (in
checkLoadRuleBundle) only checks length/presence which can hide
stale-but-same-length data; change the predicate to validate bundle contents
inside Eventually: in the anonymous func passed to testutil.Eventually (the code
that calls ExecuteCommand, os.ReadFile and json.Unmarshal into bundles), perform
a deterministic deep comparison between bundles and expectValues (e.g.,
reflect.DeepEqual or the test helper used by assertBundles) and return true only
when they match; keep all error cases returning false so the retry continues,
and remove the separate assertBundles call outside Eventually (or keep it as a
final assertion but ensure the predicate already guarantees content equality).

In `@tools/pd-ctl/tests/scheduler/scheduler_test.go`:
- Around line 385-392: The Eventually loop calls json.Unmarshal into a reused
result map without checking errors, which can leave stale data from prior
iterations and cause false positives; update the closure used by
testutil.Eventually (the one invoking tests.ExecuteCommand and json.Unmarshal)
to reinitialize or reset the result destination (e.g., create a fresh local map
variable) at the start of each iteration and capture the error returned by
json.Unmarshal, returning false on unmarshal failure so only successfully
parsed, current command output can satisfy the predicate. Ensure the change is
applied to the same pattern found around testutil.Eventually usages that call
tests.ExecuteCommand and json.Unmarshal.

---

Outside diff comments:
In `@pkg/mcs/discovery/register_test.go`:
- Around line 73-92: The predicate passed to testutil.Eventually must not call
tryGetKeyAfterLeaseExpired (which sleeps DefaultLeaseInSeconds+500ms); instead
compute a single deadline before calling testutil.Eventually (e.g. deadline :=
time.Now().Add(time.Duration(DefaultLeaseInSeconds+500)*time.Millisecond)) and
have the predicate return false until time.Now() >= deadline, after which do a
fast non-sleeping client.Get loop (or a quick single Get) to check sr.key and
compare the value to "127.0.0.1:2"; replace the current call to
tryGetKeyAfterLeaseExpired inside testutil.Eventually with this deadline-gated
check so retries don’t multiply the sleep.

In `@pkg/utils/etcdutil/etcdutil_test.go`:
- Around line 193-201: The predicate in the test currently returns true
immediately when RemoveEtcdMember succeeds without verifying the cluster state;
change the predicate used in testutil.Eventually to always call
ListEtcdMembers(client1.Ctx(), client1) and only return true when the returned
member list reflects that memIDToRemove is gone (e.g., len(listResp.Members) ==
1 or the specific member ID is absent), treating RemoveEtcdMember errors as
transient and not as a success condition; update the code around
RemoveEtcdMember, ListEtcdMembers, client1 and memIDToRemove so the end-state
check (member list) determines success.

In `@tests/server/api/rule_test.go`:
- Around line 548-559: The predicate passed to testutil.Eventually incorrectly
returns true when no rule with GroupID == "e" is found, allowing the success
path to pass incorrectly; update the closure used in the testCase.success branch
(the anonymous func passed to testutil.Eventually that calls
testutil.TryReadGetJSON and iterates resp) so that if no r with r.GroupID == "e"
is found it returns false (and only returns the result of compareRule(r, &rule)
when a matching rule is present), ensuring Eventually keeps polling until the
expected rule appears and compareRule returns true.

In `@tests/server/member/member_test.go`:
- Around line 112-137: Delete the unused function checkMemberList from the file
(the entire function block starting with func checkMemberList(...)) since it's
dead code and replaced by tryCheckMemberList; remove any now-unused imports that
were only required by checkMemberList (e.g., io, json, pdpb, errors, or
tests.TestDialClient if they become unused), and run go vet/go build to ensure
no remaining references to checkMemberList exist.

In `@tools/pd-ctl/tests/config/config_test.go`:
- Around line 660-668: The Eventually predicate is too lax—it only checks for
absence of "404" and lets invalid or non-JSON responses escape the retry loop
causing json.Unmarshal and re.Equal on group to fail; update the
testutil.Eventually predicate (the anonymous func passed to testutil.Eventually
that calls tests.ExecuteCommand with cmd and pdAddr) to attempt
json.Unmarshal(output, &group) and verify the unmarshaled value equals
placement.RuleGroup{ID: placement.DefaultGroupID} inside the predicate,
returning true only when unmarshalling succeeds and the group matches; leave the
external re.NoError/re.Equal assertions out of the outer scope since the
predicate will now ensure valid JSON and the expected group before exiting.

In `@tools/pd-ctl/tests/member/member_test.go`:
- Line 53: The test contains a duplicate deferred call to cluster.Destroy() (the
variable cluster and its Destroy() method are deferred twice); remove the
redundant defer so Destroy() is only scheduled once—leave the original defer
cluster.Destroy() (the one at line ~40) and delete the second defer
cluster.Destroy() at line ~53 to avoid double destruction and potential
double-close errors.

---

Nitpick comments:
In `@client/clients/tso/stream_test.go`:
- Around line 361-365: The test's Eventually predicate currently requires both
processErr and callbackErr to be non-nil, which is brittle; update the predicate
used around s.stream.processRequests (referencing processErr and callbackErr and
the callback passed to stream.processRequests) to accept either error path by
returning processErr != nil || callbackErr != nil so the test succeeds if the
error is delivered synchronously or via the callback.

In `@tests/integrations/mcs/tso/keyspace_group_manager_test.go`:
- Around line 242-244: The test in
tests/integrations/mcs/tso/keyspace_group_manager_test.go relies on failpoints
and must be executed with failpoint-aware targets; run it via make basic-test or
make gotest (not plain go test). Ensure the test file imports
github.com/pingcap/failpoint and that failpoints are enabled/disabled properly
(e.g., enable required failpoints in TestMain or setup and call
failpoint.Disable in cleanup) so the branch guarded by the if err != nil || tam
== nil check behaves deterministically under test; re-run the touched *_test.go
with the make targets to validate the fix.

In `@tests/integrations/tso/consistency_test.go`:
- Around line 139-141: The defer currently swallows errors from
tsoClient.CloseSend(), which loses useful failure context; introduce an
outer-scope variable (e.g., lastErr) and assign CloseSend()'s error to it
instead of discarding, also set lastErr for the other failure points referenced
(the checks at the other failing calls around the Eventually block), and update
the test teardown/assertion so that if the Eventually assertion fails you
surface lastErr (assert or t.Log/fatal with lastErr) to preserve the root-cause
error for timeout diagnostics; reference the tsoClient.CloseSend() call, the
Eventually assertion, and the new lastErr variable when making changes.

In `@tests/server/api/region_test.go`:
- Around line 449-455: The polling predicate is reading the response into r8 but
mistakenly calls r4.Adjust(), which is a stale copy/paste; update the predicate
to call r8.Adjust() so the check uses the current response variable (inside the
testutil.Eventually closure that calls testutil.TryReadGetJSON and then compares
r8.Count, r8.Regions, and r.GetID()).

In `@tests/server/apiv2/handlers/testutil.go`:
- Around line 230-252: tryLoadKeyspaceGroupByID and TryLoadKeyspaceGroupByID
duplicate the same HTTP request, response read and json.Unmarshal logic; extract
the shared logic into a single helper (e.g., a new function like
doGetAndUnmarshal or loadJSONInto) that takes the request URL (or http.Request)
and a target interface{} to unmarshal into, then reimplement both
tryLoadKeyspaceGroupByID and TryLoadKeyspaceGroupByID to call that helper and
return the resulting *endpoint.KeyspaceGroup; update error propagation so
behavior remains identical.

In `@tests/testutil.go`:
- Around line 463-562: Run failpoint-aware test targets to validate the test
changes: execute make check, then run the narrowest package-level tests (e.g.,
make basic-test or make gotest for the packages that contain TestDialClient,
testutil.Eventually, TryCheckGetJSON, CheckPostJSON and related testhelpers).
Ensure these commands use the failpoint-enabled test configuration (the
project’s failpoint build/test flags or tags) so the retry/Eventually changes
are exercised under failpoints.

In `@tools/pd-ctl/tests/helper.go`:
- Around line 108-123: TryExec's JSON unmarshal branch currently returns the
unmodified error without the command output, losing useful debug info; update
TryExec to wrap the json.Unmarshal error with the raw command output (e.g. using
fmt.Errorf("json unmarshal failed: %w; output: %s", err, string(output))) so
callers receive the output in the error message, and add fmt to the imports;
reference the TryExec function and the json.Unmarshal error branch when making
this change.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 762c67aa-0c1e-4db7-b55d-2ebfdf84ad4d

📥 Commits

Reviewing files that changed from the base of the PR and between 4253bee and 17f3956.

⛔ Files ignored due to path filters (4)
  • client/go.sum is excluded by !**/*.sum
  • go.sum is excluded by !**/*.sum
  • tests/integrations/go.sum is excluded by !**/*.sum
  • tools/go.sum is excluded by !**/*.sum
📒 Files selected for processing (47)
  • client/clients/tso/stream_test.go
  • client/go.mod
  • go.mod
  • pkg/election/leadership_test.go
  • pkg/mcs/discovery/register_test.go
  • pkg/mcs/resourcemanager/server/manager_test.go
  • pkg/schedule/scatter/region_scatterer_test.go
  • pkg/storage/leveldb_backend_test.go
  • pkg/tso/keyspace_group_manager_test.go
  • pkg/utils/etcdutil/etcdutil_test.go
  • pkg/utils/testutil/api_check.go
  • tests/integrations/client/client_test.go
  • tests/integrations/client/http_client_test.go
  • tests/integrations/client/router_client_test.go
  • tests/integrations/go.mod
  • tests/integrations/mcs/resourcemanager/resource_manager_test.go
  • tests/integrations/mcs/router/server_test.go
  • tests/integrations/mcs/scheduling/api_test.go
  • tests/integrations/mcs/scheduling/config_test.go
  • tests/integrations/mcs/scheduling/rule_test.go
  • tests/integrations/mcs/scheduling/server_test.go
  • tests/integrations/mcs/tso/keyspace_group_manager_test.go
  • tests/integrations/mcs/tso/server_test.go
  • tests/integrations/realcluster/scheduler_test.go
  • tests/integrations/tso/consistency_test.go
  • tests/integrations/tso/server_test.go
  • tests/server/api/admin_test.go
  • tests/server/api/api_test.go
  • tests/server/api/diagnostic_test.go
  • tests/server/api/min_resolved_ts_test.go
  • tests/server/api/region_test.go
  • tests/server/api/rule_test.go
  • tests/server/api/scheduler_test.go
  • tests/server/api/store_test.go
  • tests/server/apiv2/handlers/affinity_test.go
  • tests/server/apiv2/handlers/testutil.go
  • tests/server/cluster/cluster_test.go
  • tests/server/config/config_test.go
  • tests/server/member/member_test.go
  • tests/testutil.go
  • tools/go.mod
  • tools/pd-ctl/tests/config/config_test.go
  • tools/pd-ctl/tests/helper.go
  • tools/pd-ctl/tests/keyspace/keyspace_group_test.go
  • tools/pd-ctl/tests/keyspace/keyspace_test.go
  • tools/pd-ctl/tests/member/member_test.go
  • tools/pd-ctl/tests/scheduler/scheduler_test.go
💤 Files with no reviewable changes (3)
  • tools/go.mod
  • tests/integrations/go.mod
  • go.mod

- Fix confusing-naming: rename tryLoadKeyspaceGroupByID to
  loadKeyspaceGroupByIDNoAssert
- Fix staticcheck S1008: simplify if-return patterns
- Fix unparam: remove unused expectedCode parameter from tryCheckResp
- Fix unused: delete checkMemberList and mustGetAffinityGroup that
  were superseded by try* variants
- Fix nil panic in TestResourceGroupController: add nil check before
  calling err.Error() inside Eventually

Issue Number: ref tikv#10164

Signed-off-by: JmPotato <github@ipotato.me>
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (1)
tests/server/api/api_test.go (1)

945-947: ⚠️ Potential issue | 🟠 Major

Use current response body instead of stale output in this polling iteration.

Line 945 unmarshals output from outside this request path, so this branch can pass/fail against stale data instead of the response just fetched. That can mask real regressions.

💡 Suggested fix
 			defer resp.Body.Close()
 			if resp.StatusCode != http.StatusOK {
 				return false
 			}
+			body, err := io.ReadAll(resp.Body)
+			if err != nil {
+				return false
+			}
 			// is not prepared
-			if json.Unmarshal(output, &p) != nil {
+			if json.Unmarshal(body, &p) != nil {
 				return false
 			}
 			return p.Action == "removing" && p.Progress == 0.0 && p.CurrentSpeed == 0.0 && p.LeftSeconds == math.MaxFloat64
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/server/api/api_test.go` around lines 945 - 947, The test is
unmarshalling the stale variable `output` instead of the current request's
response bytes, so replace `json.Unmarshal(output, &p)` with
`json.Unmarshal(<currentResponseBody>, &p)` using the local variable holding the
latest response in this polling iteration (the byte slice produced by the
current request), and propagate/check the returned error from json.Unmarshal;
locate this change around the polling loop where `output`, `p`, and the HTTP
response/body are handled (search for `output`, `p`, and `json.Unmarshal` in the
test) and update the call to use the fresh response body variable.
🧹 Nitpick comments (1)
tests/server/member/member_test.go (1)

117-124: Wrap helper errors with operation context.

return err here loses key context in Eventually logs. Wrapping these failures makes retries debuggable.

🧩 Proposed refactor
 	res, err := tests.TestDialClient.Get(addr)
 	if err != nil {
-		return err
+		return errors.Wrap(err, "get /pd/api/v1/members")
 	}
 	defer res.Body.Close()
 	buf, err := io.ReadAll(res.Body)
 	if err != nil {
-		return err
+		return errors.Wrap(err, "read members response body")
 	}
@@
 	data := &pdpb.GetMembersResponse{}
 	if err = json.Unmarshal(buf, data); err != nil {
-		return err
+		return errors.Wrap(err, "unmarshal members response")
 	}

As per coding guidelines, "Wrap errors with github.com/pingcap/errors (errors.Wrap/Annotate) or fmt.Errorf("...: %w", err)."

Also applies to: 129-131

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/server/member/member_test.go` around lines 117 - 124, The helper
currently returns raw errors from response operations (res.Body.Close(),
io.ReadAll) which lose context; update the return statements in the helper in
tests/server/member/member_test.go to wrap errors (e.g., use fmt.Errorf("closing
response body: %w", err) and fmt.Errorf("reading response body: %w", err) or
errors.Wrap/Annotate) so Eventually logs show operation context, and apply the
same wrapping to the other occurrences referenced around the helper (the
subsequent returns at the other read/close points).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tests/server/member/member_test.go`:
- Around line 132-146: The test can false-pass because the nested loops only
validate URLs when names randomly match; change the validation to build a map of
returned members keyed by member.GetName() (from data.GetMembers()), assert the
map size equals len(configs), then for each cfg in configs look up cfg.Name in
that map and return an error if missing; for each found member verify
len(member.ClientUrls)==1 && member.ClientUrls[0]==cfg.ClientUrls and
len(member.PeerUrls)==1 && member.PeerUrls[0]==cfg.PeerUrls, returning
descriptive errors on mismatch.

---

Duplicate comments:
In `@tests/server/api/api_test.go`:
- Around line 945-947: The test is unmarshalling the stale variable `output`
instead of the current request's response bytes, so replace
`json.Unmarshal(output, &p)` with `json.Unmarshal(<currentResponseBody>, &p)`
using the local variable holding the latest response in this polling iteration
(the byte slice produced by the current request), and propagate/check the
returned error from json.Unmarshal; locate this change around the polling loop
where `output`, `p`, and the HTTP response/body are handled (search for
`output`, `p`, and `json.Unmarshal` in the test) and update the call to use the
fresh response body variable.

---

Nitpick comments:
In `@tests/server/member/member_test.go`:
- Around line 117-124: The helper currently returns raw errors from response
operations (res.Body.Close(), io.ReadAll) which lose context; update the return
statements in the helper in tests/server/member/member_test.go to wrap errors
(e.g., use fmt.Errorf("closing response body: %w", err) and fmt.Errorf("reading
response body: %w", err) or errors.Wrap/Annotate) so Eventually logs show
operation context, and apply the same wrapping to the other occurrences
referenced around the helper (the subsequent returns at the other read/close
points).

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 710ff17b-cce5-4019-a501-9a75ce9dbb0d

📥 Commits

Reviewing files that changed from the base of the PR and between 17f3956 and 1027e83.

📒 Files selected for processing (6)
  • pkg/utils/testutil/api_check.go
  • tests/integrations/mcs/resourcemanager/resource_manager_test.go
  • tests/server/api/api_test.go
  • tests/server/apiv2/handlers/affinity_test.go
  • tests/server/apiv2/handlers/testutil.go
  • tests/server/member/member_test.go
🚧 Files skipped from review as they are similar to previous changes (3)
  • tests/server/apiv2/handlers/affinity_test.go
  • pkg/utils/testutil/api_check.go
  • tests/server/apiv2/handlers/testutil.go

The function was superseded by tests.TryExec and
tests.ExecuteCommand calls inside Eventually.

Issue Number: ref tikv#10164

Signed-off-by: JmPotato <github@ipotato.me>
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

♻️ Duplicate comments (1)
tools/pd-ctl/tests/scheduler/scheduler_test.go (1)

385-392: ⚠️ Potential issue | 🟠 Major

Handle json.Unmarshal errors and reset decode destinations in retry loops.

At Line 390, Line 568, Line 588, and Line 903, decode errors are ignored while reusing destination variables across retries. That can preserve stale values and let predicates pass on invalid current output.

Suggested fix (apply same pattern to all affected blocks)
 result := make(map[string]any)
 testutil.Eventually(re, func() bool {
-    output, err := tests.ExecuteCommand(cmd, []string{"-u", pdAddr, "scheduler", "describe", "balance-leader-scheduler"}...)
-    if err != nil {
-        return false
-    }
-    json.Unmarshal(output, &result)
+    output, err := tests.ExecuteCommand(cmd, []string{"-u", pdAddr, "scheduler", "describe", "balance-leader-scheduler"}...)
+    if err != nil {
+        return false
+    }
+    result = make(map[string]any)
+    if err := json.Unmarshal(output, &result); err != nil {
+        return false
+    }
     return len(result) != 0 && result["status"] == "paused" && result["summary"] == ""
 }, testutil.WithWaitFor(30*time.Second))
 testutil.Eventually(re, func() bool {
-    output, err := tests.ExecuteCommand(cmd, []string{"-u", pdAddr, "scheduler", "config", "balance-range-scheduler"}...)
-    if err != nil {
-        return false
-    }
-    json.Unmarshal(output, &rangeConf)
+    output, err := tests.ExecuteCommand(cmd, []string{"-u", pdAddr, "scheduler", "config", "balance-range-scheduler"}...)
+    if err != nil {
+        return false
+    }
+    rangeConf = nil
+    if err := json.Unmarshal(output, &rangeConf); err != nil {
+        return false
+    }
     if len(rangeConf) == 0 {
         return false
     }
#!/bin/bash
set -euo pipefail

FILE="tools/pd-ctl/tests/scheduler/scheduler_test.go"

# Show Eventually blocks that still call json.Unmarshal without local error handling nearby.
rg -n 'testutil\.Eventually|json\.Unmarshal\(' "$FILE" -C 3

# Inspect the previously flagged ranges directly.
sed -n '380,395p' "$FILE"
sed -n '558,592p' "$FILE"
sed -n '891,908p' "$FILE"

Also applies to: 563-574, 584-590, 896-907

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tools/pd-ctl/tests/scheduler/scheduler_test.go` around lines 385 - 392, The
Eventually blocks calling tests.ExecuteCommand then json.Unmarshal into the
reused variable result ignore decode errors and can leave stale values passing
the predicate; update each failing loop (the testutil.Eventually closure around
tests.ExecuteCommand for cmd and scheduler "describe" calls) to reset or
reinitialize the decode destination (e.g., create a fresh local map/struct
inside the closure) and check the error returned by json.Unmarshal, returning
false on unmarshal error so the retry continues; ensure you handle and log or
surface the unmarshal error locally rather than letting previous contents of
result be used in the predicate.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@tools/pd-ctl/tests/scheduler/scheduler_test.go`:
- Around line 385-392: The Eventually blocks calling tests.ExecuteCommand then
json.Unmarshal into the reused variable result ignore decode errors and can
leave stale values passing the predicate; update each failing loop (the
testutil.Eventually closure around tests.ExecuteCommand for cmd and scheduler
"describe" calls) to reset or reinitialize the decode destination (e.g., create
a fresh local map/struct inside the closure) and check the error returned by
json.Unmarshal, returning false on unmarshal error so the retry continues;
ensure you handle and log or surface the unmarshal error locally rather than
letting previous contents of result be used in the predicate.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 59e99d44-e1f3-4c0e-90af-ab2b30b486fa

📥 Commits

Reviewing files that changed from the base of the PR and between 1027e83 and 13808b5.

📒 Files selected for processing (1)
  • tools/pd-ctl/tests/scheduler/scheduler_test.go

Fix remaining Eventually anti-patterns and timing issues found in CI:

- scheduler_test.go (server/api): wrap scheduler existence check in
  Eventually with TryReadGetJSON/TryCheckPostJSON to handle API not
  ready in microservice mode
- scheduler_test.go (tools): move status/timeout/ranges assertions
  inside Eventually condition using plain Go comparisons
- config_test.go: rewrite assertTTLConfig to use checkTTLConfig with
  plain Go comparisons inside Eventually, fixing race where only one
  field was checked in Eventually while all fields were asserted outside
- rule_test.go: extend Eventually condition to verify rule content
  (not just count) since count stays at 2 during rule transitions
- tso_keyspace_group_test.go: add non-asserting getKeyspaceGroup method
  and use it in all Eventually blocks instead of tryGetKeyspaceGroup
  which uses re.NoError internally
- operator_test.go: wrap operator add command in Eventually to handle
  scheduling server not ready in microservice mode

Issue Number: ref tikv#10164

Signed-off-by: JmPotato <github@ipotato.me>
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

♻️ Duplicate comments (1)
tools/pd-ctl/tests/scheduler/scheduler_test.go (1)

385-392: ⚠️ Potential issue | 🟠 Major

Unchecked json.Unmarshal in Eventually predicates can produce stale-state false positives.

Line 390, Line 568, Line 601, and Line 916 ignore unmarshal errors while reusing destination variables across retries. A bad JSON response can leave prior values and incorrectly satisfy the predicate.

✅ Proposed fix (use `TryExec` + fresh destination per retry)
@@
 	result := make(map[string]any)
 	testutil.Eventually(re, func() bool {
-		output, err := tests.ExecuteCommand(cmd, []string{"-u", pdAddr, "scheduler", "describe", "balance-leader-scheduler"}...)
-		if err != nil {
+		curr := make(map[string]any)
+		if _, err := tests.TryExec(cmd, []string{"-u", pdAddr, "scheduler", "describe", "balance-leader-scheduler"}, &curr); err != nil {
 			return false
 		}
-		json.Unmarshal(output, &result)
+		result = curr
 		return len(result) != 0 && result["status"] == "paused" && result["summary"] == ""
 	}, testutil.WithWaitFor(30*time.Second))
@@
 	testutil.Eventually(re, func() bool {
-		output, err := tests.ExecuteCommand(cmd, []string{"-u", pdAddr, "scheduler", "config", "balance-range-scheduler"}...)
-		if err != nil {
+		curr := []map[string]any{}
+		if _, err := tests.TryExec(cmd, []string{"-u", pdAddr, "scheduler", "config", "balance-range-scheduler"}, &curr); err != nil {
 			return false
 		}
-		json.Unmarshal(output, &rangeConf)
+		rangeConf = curr
 		if len(rangeConf) == 0 {
 			return false
 		}
@@
 	testutil.Eventually(re, func() bool {
-		output, err := tests.ExecuteCommand(cmd, []string{"-u", pdAddr, "scheduler", "config", "balance-range-scheduler"}...)
-		if err != nil {
+		curr := []map[string]any{}
+		if _, err := tests.TryExec(cmd, []string{"-u", pdAddr, "scheduler", "config", "balance-range-scheduler"}, &curr); err != nil {
 			return false
 		}
-		json.Unmarshal(output, &rangeConf)
+		rangeConf = curr
 		return len(rangeConf) == 2
 	})
@@
 	checkSchedulerDescribeCommand := func(schedulerName, expectedStatus, expectedSummary string) {
 		result := make(map[string]any)
 		testutil.Eventually(re, func() bool {
-			output, err := tests.ExecuteCommand(cmd, []string{"-u", pdAddr, "scheduler", "describe", schedulerName}...)
-			if err != nil {
+			curr := make(map[string]any)
+			if _, err := tests.TryExec(cmd, []string{"-u", pdAddr, "scheduler", "describe", schedulerName}, &curr); err != nil {
 				return false
 			}
-			json.Unmarshal(output, &result)
+			result = curr
 			suite.T().Log(result)
 			return len(result) != 0 && expectedStatus == result["status"] && expectedSummary == result["summary"]
 		}, testutil.WithTickInterval(50*time.Millisecond))
 	}
#!/bin/bash
set -euo pipefail

# Verify the remaining Eventually+ExecuteCommand+json.Unmarshal sites with unchecked unmarshal errors.
f="tools/pd-ctl/tests/scheduler/scheduler_test.go"

echo "=== candidate lines ==="
rg -n 'testutil\.Eventually|ExecuteCommand\(|json\.Unmarshal\(' "$f" -C 2

echo
echo "=== focused contexts ==="
sed -n '380,395p' "$f"
sed -n '560,605p' "$f"
sed -n '905,920p' "$f"

Also applies to: 563-592, 596-603, 909-919

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tools/pd-ctl/tests/scheduler/scheduler_test.go` around lines 385 - 392, The
predicate in testutil.Eventually uses tests.ExecuteCommand + json.Unmarshal into
a reused variable (result) and ignores Unmarshal errors, which can reuse stale
data and cause false positives; change the predicate to use tests.TryExec (or
ExecuteCommand but capture and return errors) and allocate a fresh local
variable for unmarshalling on each retry, check and fail the predicate if
json.Unmarshal returns an error, and only then evaluate fields like
result["status"]/"summary" (apply this pattern to the predicates around
testutil.Eventually, tests.ExecuteCommand, json.Unmarshal and the reused result
variable for the scheduler tests referencing cmd and pdAddr).
🧹 Nitpick comments (2)
tests/integrations/mcs/keyspace/tso_keyspace_group_test.go (1)

563-584: Add context when propagating helper errors.

getKeyspaceGroup returns raw errors from multiple operations; wrapping them will make CI/debug triage much faster.

♻️ Proposed refactor
 func (suite *keyspaceGroupTestSuite) getKeyspaceGroup(id uint32) (*endpoint.KeyspaceGroup, int, error) {
 	httpReq, err := http.NewRequest(http.MethodGet, suite.server.GetAddr()+keyspaceGroupsPrefix+fmt.Sprintf("/%d", id), http.NoBody)
 	if err != nil {
-		return nil, 0, err
+		return nil, 0, fmt.Errorf("create get keyspace group request: %w", err)
 	}
 	resp, err := tests.TestDialClient.Do(httpReq)
 	if err != nil {
-		return nil, 0, err
+		return nil, 0, fmt.Errorf("send get keyspace group request: %w", err)
 	}
 	defer resp.Body.Close()
 	kg := &endpoint.KeyspaceGroup{}
 	if resp.StatusCode == http.StatusOK {
 		bodyBytes, err := io.ReadAll(resp.Body)
 		if err != nil {
-			return nil, resp.StatusCode, err
+			return nil, resp.StatusCode, fmt.Errorf("read get keyspace group response body: %w", err)
 		}
 		if err := json.Unmarshal(bodyBytes, kg); err != nil {
-			return nil, resp.StatusCode, err
+			return nil, resp.StatusCode, fmt.Errorf("decode get keyspace group response: %w", err)
 		}
 	}
 	return kg, resp.StatusCode, nil
 }
As per coding guidelines, "Wrap errors with `github.com/pingcap/errors` (`errors.Wrap`/`Annotate`) or `fmt.Errorf(\"...: %w\", err)`."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/integrations/mcs/keyspace/tso_keyspace_group_test.go` around lines 563
- 584, The getKeyspaceGroup helper currently returns raw errors from
http.NewRequest, tests.TestDialClient.Do, io.ReadAll, and json.Unmarshal; update
getKeyspaceGroup to wrap those errors with context (using fmt.Errorf("...: %w",
err) or github.com/pingcap/errors.Wrap/Annotate) so callers see which operation
failed and the id involved — wrap the error from http.NewRequest with the
request context and id, wrap the TestDialClient.Do error indicating the GET to
keyspaceGroupsPrefix and id, wrap io.ReadAll errors indicating reading the
response body for that id, and wrap json.Unmarshal errors indicating decoding
the KeyspaceGroup for that id.
tests/server/api/scheduler_test.go (1)

916-930: Make cancelled-job assertion resilient to job-id ordering assumptions.

Line 930 hardcodes scheduler[1] after sorting. This is brittle if job-id allocation baseline changes. Match the cancelled record by job-id (the request cancels job-id=1) instead of fixed index.

Proposed change
-		return scheduler[1]["status"] == "cancelled"
+		for _, job := range scheduler {
+			id, okID := job["job-id"].(float64)
+			status, okStatus := job["status"].(string)
+			if okID && okStatus && id == 1 && status == "cancelled" {
+				return true
+			}
+		}
+		return false
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/server/api/scheduler_test.go` around lines 916 - 930, The test
currently assumes the cancelled job is at scheduler[1] after sorting; instead
iterate the scheduler slice (the variable scheduler) and find the map entry
whose "job-id" (a float64 in the maps) equals 1 (or 1.0), then assert that that
entry's "status" == "cancelled". Replace the hardcoded scheduler[1] check with a
search over scheduler maps (checking map["job-id"] and then map["status"]) and
fail the test if no matching job-id is found or its status is not "cancelled".
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tests/integrations/mcs/scheduling/rule_test.go`:
- Around line 205-214: The eventual predicate currently only verifies
labelRules[1].ID == labelRule.ID but doesn't assert removal of the old "rule1";
update the predicate that inspects labelRules (in the test around the sorted
slice check using labelRules and labelRule.ID) to explicitly assert both that no
element in labelRules has ID == "rule1" and that one element equals labelRule.ID
(the expected "rule2"); keep the existing length check and sorting (sort.Slice
on labelRules) and return true only when length==2, "rule1" is absent, and
labelRule.ID is present.

In `@tools/pd-ctl/tests/operator/operator_test.go`:
- Around line 197-201: The range variable testCase is captured by the closure
passed to testutil.Eventually; create a per-iteration local copy (e.g., tc :=
testCase) immediately inside the loop and use tc in the closure and subsequent
calls (including tests.ExecuteCommand and any checks) instead of testCase to
avoid capturing the loop variable. Ensure the new local name (tc) replaces
testCase everywhere inside that iteration's closure and following statements.

---

Duplicate comments:
In `@tools/pd-ctl/tests/scheduler/scheduler_test.go`:
- Around line 385-392: The predicate in testutil.Eventually uses
tests.ExecuteCommand + json.Unmarshal into a reused variable (result) and
ignores Unmarshal errors, which can reuse stale data and cause false positives;
change the predicate to use tests.TryExec (or ExecuteCommand but capture and
return errors) and allocate a fresh local variable for unmarshalling on each
retry, check and fail the predicate if json.Unmarshal returns an error, and only
then evaluate fields like result["status"]/"summary" (apply this pattern to the
predicates around testutil.Eventually, tests.ExecuteCommand, json.Unmarshal and
the reused result variable for the scheduler tests referencing cmd and pdAddr).

---

Nitpick comments:
In `@tests/integrations/mcs/keyspace/tso_keyspace_group_test.go`:
- Around line 563-584: The getKeyspaceGroup helper currently returns raw errors
from http.NewRequest, tests.TestDialClient.Do, io.ReadAll, and json.Unmarshal;
update getKeyspaceGroup to wrap those errors with context (using
fmt.Errorf("...: %w", err) or github.com/pingcap/errors.Wrap/Annotate) so
callers see which operation failed and the id involved — wrap the error from
http.NewRequest with the request context and id, wrap the TestDialClient.Do
error indicating the GET to keyspaceGroupsPrefix and id, wrap io.ReadAll errors
indicating reading the response body for that id, and wrap json.Unmarshal errors
indicating decoding the KeyspaceGroup for that id.

In `@tests/server/api/scheduler_test.go`:
- Around line 916-930: The test currently assumes the cancelled job is at
scheduler[1] after sorting; instead iterate the scheduler slice (the variable
scheduler) and find the map entry whose "job-id" (a float64 in the maps) equals
1 (or 1.0), then assert that that entry's "status" == "cancelled". Replace the
hardcoded scheduler[1] check with a search over scheduler maps (checking
map["job-id"] and then map["status"]) and fail the test if no matching job-id is
found or its status is not "cancelled".

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: fa9ab7b3-d03d-416b-9305-1bce3d632b38

📥 Commits

Reviewing files that changed from the base of the PR and between 13808b5 and 90f82cc.

📒 Files selected for processing (6)
  • tests/integrations/mcs/keyspace/tso_keyspace_group_test.go
  • tests/integrations/mcs/scheduling/rule_test.go
  • tests/server/api/scheduler_test.go
  • tests/server/config/config_test.go
  • tools/pd-ctl/tests/operator/operator_test.go
  • tools/pd-ctl/tests/scheduler/scheduler_test.go

Comment on lines +205 to +214
if len(labelRules) != 2 {
return false
}
sort.Slice(labelRules, func(i, j int) bool {
return labelRules[i].ID < labelRules[j].ID
})
// After the patch, "rule1" should be deleted and "rule2" should be present.
// We must check the content, not just the count, because the count stays at 2
// during the transition (old rules -> new rules).
return labelRules[1].ID == labelRule.ID
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Eventually predicate doesn’t explicitly assert rule1 removal.

At Line 214, the check only confirms rule2 is at index 1 after sorting. It does not directly assert that rule1 is gone, which is what Lines 211-213 claim.

Suggested predicate update
 	testutil.Eventually(re, func() bool {
 		labelRules = regionLabeler.GetAllLabelRules()
 		if len(labelRules) != 2 {
 			return false
 		}
-		sort.Slice(labelRules, func(i, j int) bool {
-			return labelRules[i].ID < labelRules[j].ID
-		})
-		// After the patch, "rule1" should be deleted and "rule2" should be present.
-		// We must check the content, not just the count, because the count stays at 2
-		// during the transition (old rules -> new rules).
-		return labelRules[1].ID == labelRule.ID
+		hasRule2 := false
+		hasRule1 := false
+		for _, r := range labelRules {
+			if r.ID == labelRule.ID {
+				hasRule2 = true
+			}
+			if r.ID == "rule1" {
+				hasRule1 = true
+			}
+		}
+		// After the patch, "rule1" should be deleted and "rule2" should be present.
+		return hasRule2 && !hasRule1
 	})
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/integrations/mcs/scheduling/rule_test.go` around lines 205 - 214, The
eventual predicate currently only verifies labelRules[1].ID == labelRule.ID but
doesn't assert removal of the old "rule1"; update the predicate that inspects
labelRules (in the test around the sorted slice check using labelRules and
labelRule.ID) to explicitly assert both that no element in labelRules has ID ==
"rule1" and that one element equals labelRule.ID (the expected "rule2"); keep
the existing length check and sorting (sort.Slice on labelRules) and return true
only when length==2, "rule1" is absent, and labelRule.ID is present.

Comment on lines 197 to +201
for _, testCase := range testCases {
output, err = tests.ExecuteCommand(cmd, testCase.cmd...)
re.NoError(err)
re.NotContains(string(output), "Failed")
testutil.Eventually(re, func() bool {
output, err = tests.ExecuteCommand(cmd, testCase.cmd...)
return err == nil && !strings.Contains(string(output), "Failed")
})
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "=== Checking Go version in go.mod ==="
head -5 go.mod

echo ""
echo "=== Reading the full context of the flagged code (lines 190-215) ==="
sed -n '190,215p' tools/pd-ctl/tests/operator/operator_test.go

Repository: tikv/pd

Length of output: 1374


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "=== Finding testutil.Eventually implementation ==="
find . -type f -name "*.go" | xargs grep -l "func Eventually"

echo ""
echo "=== Examining testutil.Eventually ==="
rg -n -A 15 'func Eventually\(' --type go

Repository: tikv/pd

Length of output: 1979


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "=== Full loop context with line numbers (lines 195-215) ==="
sed -n '195,215p' tools/pd-ctl/tests/operator/operator_test.go | cat -n -v

echo ""
echo "=== Go version check (should be 1.22+) ==="
grep '^go ' go.mod

echo ""
echo "=== Verify: Are output and err declared outside the loop? ==="
sed -n '180,200p' tools/pd-ctl/tests/operator/operator_test.go | head -30

Repository: tikv/pd

Length of output: 2248


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "=== Searching for variable declarations before the loop (output, err) ==="
sed -n '1,196p' tools/pd-ctl/tests/operator/operator_test.go | grep -n "var \|output\|err"

echo ""
echo "=== Confirm: Check if Eventually modifies outer scope vars ==="
sed -n '197,210p' tools/pd-ctl/tests/operator/operator_test.go | cat -n

Repository: tikv/pd

Length of output: 1204


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "=== Context: Full function signature and variable declarations ==="
sed -n '75,120p' tools/pd-ctl/tests/operator/operator_test.go | head -40

Repository: tikv/pd

Length of output: 1215


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "=== Check AGENTS.md for the exact guideline text ==="
cat AGENTS.md | grep -A 3 -B 3 "loop var"

echo ""
echo "=== Verify testCase is used outside the closure ==="
sed -n '197,212p' tools/pd-ctl/tests/operator/operator_test.go | cat -n

Repository: tikv/pd

Length of output: 1192


Capture loop variable in a local copy before using in closure.

testCase is captured directly in the closure on line 198. Repository guidelines explicitly require making a local copy for range variables used in closures: "Do not capture loop vars by pointer; copy inside loop (v := val). Range-in-closure/address rules enforced; make local copies."

Use a per-iteration copy (tc := testCase) and reference tc inside and after the closure.

🔧 Proposed fix
 for _, testCase := range testCases {
+	tc := testCase
 	testutil.Eventually(re, func() bool {
-		output, err = tests.ExecuteCommand(cmd, testCase.cmd...)
-		return err == nil && !strings.Contains(string(output), "Failed")
+		out, execErr := tests.ExecuteCommand(cmd, tc.cmd...)
+		return execErr == nil && !strings.Contains(string(out), "Failed")
 	})
-	output, err = tests.ExecuteCommand(cmd, testCase.show...)
+	output, err = tests.ExecuteCommand(cmd, tc.show...)
 	re.NoError(err)
-	re.Contains(string(output), testCase.expect)
+	re.Contains(string(output), tc.expect)
 	start := time.Now()
-	_, err = tests.ExecuteCommand(cmd, testCase.reset...)
+	_, err = tests.ExecuteCommand(cmd, tc.reset...)
 	re.NoError(err)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tools/pd-ctl/tests/operator/operator_test.go` around lines 197 - 201, The
range variable testCase is captured by the closure passed to
testutil.Eventually; create a per-iteration local copy (e.g., tc := testCase)
immediately inside the loop and use tc in the closure and subsequent calls
(including tests.ExecuteCommand and any checks) instead of testCase to avoid
capturing the loop variable. Ensure the new local name (tc) replaces testCase
everywhere inside that iteration's closure and following statements.

JmPotato added 7 commits March 4, 2026 23:59
…Change

- affinity_test.go: wrap group list check in Eventually with
  tryGetAllAffinityGroups to handle sync delay in microservice mode
- cluster_test.go: replace timing-based failpoint with deterministic
  channel-based synchronization using failpoint.InjectCall/EnableCall,
  eliminating the race between goroutine and main thread

Issue Number: ref tikv#10164

Signed-off-by: JmPotato <github@ipotato.me>
Signed-off-by: JmPotato <github@ipotato.me>
- rule_test.go: wrap region fit check in Eventually with TryReadGetJSON
  since rules may not have propagated to the scheduling server yet
- scheduler_test.go: rewrite checkStorePause to use Eventually with
  plain Go comparisons, ensuring store pause state has propagated to
  both the PD leader and the scheduling primary server

Issue Number: ref tikv#10164

Signed-off-by: JmPotato <github@ipotato.me>
…terTransferLeader

- TestGlobalMemoryTuner: increase require.Eventually timeout from 5s
  to 20s to accommodate slower GC cycles in CI environments.

- TestTSOStreamBasic: fix impossible condition that checks both
  processErr and callbackErr. When the stream is in closing state,
  processRequests returns an error without invoking the callback, so
  callbackErr is always nil. Only check processErr != nil.

- TestGetTSAfterTransferLeader: wrap GetTS call in testutil.Eventually
  to tolerate transient errors during TSO stream reconnection after
  leader transfer.

Signed-off-by: JmPotato <github@ipotato.me>
- TestConfig: wrap config show assertion in testutil.Eventually to wait
  for the scheduling microservice to sync config from etcd. After
  MustPutStore triggers config persistence, the scheduling server's
  watcher needs time to process the update.

- stream_test.go: rename unused callback parameter err to _err to
  satisfy revive unused-parameter lint rule.

Signed-off-by: JmPotato <github@ipotato.me>
- TestGlobalMemoryTuner: add explicit runtime.GC() calls inside
  Eventually callbacks to force GC cycles. The test relies on the
  tuner's finalizer-based callback which only fires after GC, but CI
  environments may not trigger automatic GC promptly enough from just
  the soft memory limit.

- TestHotRegionStorage: wrap read hot region assertions in
  testutil.Eventually. Read hot regions are processed asynchronously
  via CheckReadAsync, so they may not be flushed to storage by the
  time write hot regions are available.

Signed-off-by: JmPotato <github@ipotato.me>
…rage

The scheduling server's rule watcher deleteFn tried to load rule data
from etcd storage when processing DELETE watch events. However, by the
time the watcher processes the event, the rule has already been deleted
from storage by the PD leader, causing "unexpected end of JSON input"
errors. This left ghost rules in the scheduling server's cache,
corrupting the rule state for subsequent operations.

Fix by parsing groupID and ruleID directly from the etcd key (which
uses hex-encoded format) instead of loading from storage. The suspect
key ranges are obtained from the rule manager's cached copy.

Signed-off-by: JmPotato <github@ipotato.me>
@codecov
Copy link

codecov bot commented Mar 5, 2026

Codecov Report

❌ Patch coverage is 74.00000% with 13 lines in your changes missing coverage. Please review.
✅ Project coverage is 78.86%. Comparing base (525524f) to head (4710b48).
⚠️ Report is 3 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master   #10295      +/-   ##
==========================================
+ Coverage   78.78%   78.86%   +0.08%     
==========================================
  Files         525      527       +2     
  Lines       70824    70956     +132     
==========================================
+ Hits        55796    55959     +163     
+ Misses      11004    10998       -6     
+ Partials     4024     3999      -25     
Flag Coverage Δ
unittests 78.86% <74.00%> (+0.08%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@JmPotato
Copy link
Member Author

JmPotato commented Mar 5, 2026

/retest

@JmPotato
Copy link
Member Author

JmPotato commented Mar 5, 2026

Code review

Found 2 issues:

  1. TryReadGetJSON and TryCheckGetJSON have identical implementations (same signature, same body), making one of them redundant. The original ReadGetJSON and CheckGetJSON had different semantics (CheckGetJSON accepted variadic checkOpts), but both Try* variants just call tryCheckResp. Either TryCheckGetJSON should be removed in favor of TryReadGetJSON, or TryCheckGetJSON should accept options to mirror the original CheckGetJSON behavior. (CLAUDE.md says "Simplicity First: Make every change as simple as possible. Impact minimal code.")

// making it safe for use inside Eventually condition functions.
func TryReadGetJSON(client *http.Client, url string, data any) error {
resp, err := apiutil.GetJSON(client, url, nil)
if err != nil {
return err
}
return tryCheckResp(resp, data)
}
// TryReadGetJSONWithBody performs a GET request with body and unmarshals the JSON response.
// Unlike ReadGetJSONWithBody, it returns an error instead of using testify assertions.
func TryReadGetJSONWithBody(client *http.Client, url string, input []byte, data any) error {
resp, err := apiutil.GetJSON(client, url, input)
if err != nil {
return err
}
return tryCheckResp(resp, data)
}
// TryCheckGetJSON performs a GET request and checks the response status is OK.
// Unlike CheckGetJSON with StatusOK/ExtractJSON, it returns an error instead of using testify assertions.
func TryCheckGetJSON(client *http.Client, url string, data any) error {
resp, err := apiutil.GetJSON(client, url, nil)
if err != nil {
return err
}
return tryCheckResp(resp, data)
}

  1. tryRequestStatusBody does not use defer resp.Body.Close(). If io.ReadAll(resp.Body) returns an error at line 537, the function returns without closing resp.Body, leaking the HTTP connection. The standard pattern is defer resp.Body.Close() immediately after the client.Do call succeeds. (CLAUDE.md says "Cancel timers/tickers; close resources with defer and error checks.")

func tryRequestStatusBody(client *http.Client, method string, url string) (int, error) {
req, err := http.NewRequest(method, url, http.NoBody)
if err != nil {
return 0, err
}
resp, err := client.Do(req)
if err != nil {
return 0, err
}
_, err = io.ReadAll(resp.Body)
if err != nil {
return 0, err
}
err = resp.Body.Close()
if err != nil {
return 0, err
}

🤖 Generated with Claude Code

- If this code review was useful, please react with 👍. Otherwise, react with 👎.

JmPotato added 4 commits March 5, 2026 10:25
- Remove duplicate TryCheckGetJSON/TryReadGetJSONWithBody functions,
  consolidate all call sites to use TryReadGetJSON
- Fix resp.Body leak in tryRequestStatusBody by using defer
- Fix stale output variable in TestRemovingProgress by reading resp.Body
- Fix ignored json.Unmarshal errors in scheduler_test.go by checking
  return value and moving destination variables inside closures
- Add per-attempt timeout to tryRequest gRPC calls to prevent blocking
- Strengthen config_test.go predicates by moving content checks inside
  Eventually closures to prevent false passes on stale data
- Fix tryCheckMemberList false-pass by tracking matched members

Signed-off-by: JmPotato <github@ipotato.me>
The RegionFit response may contain more RuleFits than expected values.
Use >= instead of == to match the original behavior.

Signed-off-by: JmPotato <github@ipotato.me>
- requestStatusBody now wraps tryRequestStatusBody
- TryLoadKeyspaceGroupByID now wraps loadKeyspaceGroupByIDNoAssert
- tryGetKeyspaceGroup now wraps getKeyspaceGroup

Signed-off-by: JmPotato <github@ipotato.me>
Extract parseRuleStoreKey helper from the deleteFn closure so it can
be unit-tested. Add TestParseRuleStoreKey covering roundtrip with
Rule.StoreKey(), invalid format, and invalid hex inputs.

Signed-off-by: JmPotato <github@ipotato.me>
@JmPotato
Copy link
Member Author

JmPotato commented Mar 5, 2026

/retest

- register_test: remove sleep from tryGetKey since Eventually's 20s
  timeout covers the 5s lease expiry; rename for clarity
- watcher: use explicit ok bool return from parseRuleStoreKey to match
  original deleteFn behavior exactly (only skip on missing separator)
- api_check: remove dead TryCheckPatchJSON (zero call sites)

Signed-off-by: JmPotato <github@ipotato.me>
Unify the two remaining call sites to use Eventually polling instead
of blocking sleep, then delete the now-unused getKeyAfterLeaseExpired.

Signed-off-by: JmPotato <github@ipotato.me>
@ti-chi-bot
Copy link
Contributor

ti-chi-bot bot commented Mar 5, 2026

@JmPotato: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
pull-unit-test-next-gen-2 048c6b8 link true /test pull-unit-test-next-gen-2
pull-unit-test-next-gen-1 048c6b8 link true /test pull-unit-test-next-gen-1
pull-unit-test-next-gen-3 048c6b8 link true /test pull-unit-test-next-gen-3

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

dco-signoff: yes Indicates the PR's author has signed the dco. release-note-none Denotes a PR that doesn't merit a release note. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Fix flaky tests in Eventually assertions to prepare for testify v1.11.0 upgrade

1 participant