Skip to content

executor: use retry when not found job#66791

Open
bufferflies wants to merge 2 commits intopingcap:masterfrom
bufferflies:distribute_with_retry
Open

executor: use retry when not found job#66791
bufferflies wants to merge 2 commits intopingcap:masterfrom
bufferflies:distribute_with_retry

Conversation

@bufferflies
Copy link
Contributor

@bufferflies bufferflies commented Mar 9, 2026

What problem does this PR solve?

Issue Number: close #xxx

Problem Summary:

What changed and how does it work?

Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)
  • No need to test
    • I checked and no code files have been changed.

Side effects

  • Performance regression: Consumes more CPU
  • Performance regression: Consumes more Memory
  • Breaking backward compatibility

Documentation

  • Affects user behaviors
  • Contains syntax changes
  • Contains variable changes
  • Contains experimental features
  • Changes MySQL compatibility

Release note

Please refer to Release Notes Language Style Guide to write a quality release note.

None

Summary by CodeRabbit

  • Bug Fixes

    • More reliable scheduler job lookup with a retry loop (up to 3 attempts, 500ms intervals) and conditional inclusion of the scheduler job ID in outputs.
    • Improved resilience by replacing immediate failures with structured logging and non-fatal fallbacks when scheduler config is missing or malformed.
  • Tests

    • Updated test mocks to short-circuit the first scheduler config call to better simulate intermittent config availability.

Signed-off-by: bufferflies <1045931706@qq.com>
@pantheon-ai
Copy link

pantheon-ai bot commented Mar 9, 2026

Review Complete

Findings: 4 issues
Posted: 3
Duplicates/Skipped: 1

ℹ️ Learn more details on Pantheon AI.

@ti-chi-bot
Copy link

ti-chi-bot bot commented Mar 9, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign charlescheung96 for approval. For more information see the Code Review Process.
Please ensure that each of them provides their approval before proceeding.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ti-chi-bot ti-chi-bot bot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Mar 9, 2026
@tiprow
Copy link

tiprow bot commented Mar 9, 2026

Hi @bufferflies. Thanks for your PR.

PRs from untrusted users cannot be marked as trusted with /ok-to-test in this repo meaning untrusted PR authors can never trigger tests themselves. Collaborators can still trigger tests on the PR using /test all.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@coderabbitai
Copy link

coderabbitai bot commented Mar 9, 2026

📝 Walkthrough

Walkthrough

Added a retry loop in DistributeTableExec.Next to fetch a scheduler job ID via a new helper getSchedulerJob(ctx) (bool, float64) that queries scheduler config through infosync.GetSchedulerConfig. Errors are logged; when a job ID is found it's appended to the output chunk. Tests updated to allow an initial skip in the mock PD client.

Changes

Cohort / File(s) Summary
Scheduler job retrieval & retry
pkg/executor/distribute.go
Added getSchedulerJob(ctx context.Context) (bool, float64) and a retry loop (3 attempts, 500ms). Queries infosync.GetSchedulerConfig, validates/parses configs, locates matching job by alias/engine/rule with non-finished status, returns (true,id) or (false,-1). Switched error returns to structured logging; added time, logutil, and go.uber.org/zap imports.
Test: Mock PD client behavior
pkg/executor/distribute_table_test.go
Added skip and count fields to MockDistributePDCli. GetSchedulerConfig now short-circuits when count < skip (increments count and returns early). Tests initialize the mock with skip: 1 to bypass the first call and exercise the retry behavior.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰 I hop through logs with nimble feet,
Three gentle retries — a rhythmic beat,
I peek at configs, fetch the ID sweet,
If found, I tuck it in the chunk to meet,
Sniffing zap-lit trails — a tiny feat 🥕✨

🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (2 warnings)

Check name Status Explanation Resolution
Description check ⚠️ Warning The PR description is incomplete; it uses the repository template with unfilled sections, placeholder issue reference (#xxx), and unchecked test checkboxes despite code changes requiring testing. Complete the description by providing a concrete problem summary, explaining how the retry logic solves it, selecting appropriate test checkboxes (unit test was added per commit), and replacing the placeholder issue reference with an actual issue number.
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (1 passed)
Check name Status Explanation
Title check ✅ Passed The PR title clearly describes the main change: adding retry logic when a scheduler job is not found, which matches the code modifications introducing a retry loop in getSchedulerJob().

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pkg/executor/distribute.go`:
- Around line 79-89: Change getSchedulerJob to return (jobID float64, found
bool, err error) and update callers (including distributeTable/Next) to
propagate err immediately; only perform the retry loop when found==false (i.e.,
job not visible yet), do not retry on non-nil err, and when found==true safely
type-assert the "job-id" field with a checked assertion (avoid unchecked
job["job-id"].(float64) that can panic). After a successful found==true return,
append the job ID to chk (chk.AppendUint64) and otherwise return the appropriate
no-row response; ensure all error paths return the error instead of flattening
to -1.
- Around line 80-86: The retry loop using time.Sleep should be made
context-aware: replace time.Sleep(500 * time.Millisecond) with a select that
waits on either ctx.Done() or time.After(500*time.Millisecond) so
cancellation/timeout is respected; when ctx.Done() fires return or propagate
ctx.Err() (or otherwise abort the surrounding operation) instead of continuing
to sleep. Update the loop around e.getSchedulerJob(ctx) and jobID to
break/return promptly on ctx cancellation and only continue after the time.After
branch.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 8252403c-313e-4626-957e-4bccdab3fc16

📥 Commits

Reviewing files that changed from the base of the PR and between 571b516 and fb6c2cd.

📒 Files selected for processing (1)
  • pkg/executor/distribute.go

@codecov
Copy link

codecov bot commented Mar 9, 2026

Codecov Report

❌ Patch coverage is 73.33333% with 8 lines in your changes missing coverage. Please review.
✅ Project coverage is 77.0671%. Comparing base (571b516) to head (bab46cb).
⚠️ Report is 14 commits behind head on master.

Additional details and impacted files
@@               Coverage Diff                @@
##             master     #66791        +/-   ##
================================================
- Coverage   77.6762%   77.0671%   -0.6091%     
================================================
  Files          2008       1930        -78     
  Lines        549877     539707     -10170     
================================================
- Hits         427124     415937     -11187     
- Misses       121045     123752      +2707     
+ Partials       1708         18      -1690     
Flag Coverage Δ
integration 41.1214% <0.0000%> (-7.0660%) ⬇️
unit 76.0493% <73.3333%> (-0.1825%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Components Coverage Δ
dumpling 56.7974% <ø> (ø)
parser ∅ <ø> (∅)
br 48.6158% <ø> (-12.2724%) ⬇️
🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

configs, ok := config.([]any)
if !ok {
return errs.ErrClientProtoUnmarshal.FastGenByArgs(config)
logutil.Logger(ctx).Warn("get empty scheduler config")
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[P2] Warn log fires during expected transient "no config yet" window

Why: The config.([]any) type-assertion failure is logged at WARN level, but this is the normal "job not visible yet" condition that the surrounding 3-attempt retry loop explicitly expects. The same shape is handled silently (no warning) in show.go:2529-2533. This generates up to 3 spurious WARN entries per DISTRIBUTE TABLE invocation during normal eventual-consistency windows, polluting logs and potentially alarming operators.

Evidence: pkg/executor/distribute.go:105logutil.Logger(ctx).Warn("get empty scheduler config")
Compare: pkg/executor/show.go:2529 — same config.([]any) failure returns nil with no warning.

Suggested fix: Downgrade to Debug level (or drop entirely) for the transient not-found path; reserve Warn/Error for unexpected fetch failures.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
pkg/executor/distribute.go (1)

130-130: ⚠️ Potential issue | 🔴 Critical

Unchecked type assertion can panic.

The type assertion job["job-id"].(float64) will panic if the value is nil or not a float64. This was flagged in a previous review and remains unaddressed. If PD returns malformed data, the executor will crash instead of returning an error.

🛡️ Proposed fix with checked type assertion
 		if job["alias"] == alias && job["engine"] == e.engine && job["rule"] == e.rule && job["status"] != "finished" {
-			id := job["job-id"].(float64)
+			id, ok := job["job-id"].(float64)
+			if !ok {
+				logutil.Logger(ctx).Info("invalid job-id type in scheduler config", zap.Any("job-id", job["job-id"]))
+				continue
+			}
 			if id > jobID {
 				jobID = id
 			}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/executor/distribute.go` at line 130, The unchecked assertion
job["job-id"].(float64) can panic; change it to a safe check (e.g., use the
comma-ok form or a type switch) to validate the value from job["job-id"] before
using it, handle nil/mismatched types by returning an error, and convert
acceptable alternatives (int, int64, json.Number, or string) to the expected
numeric id; update the location where id is used (the id variable created from
job["job-id"]) so callers handle the returned error instead of risking a panic.
♻️ Duplicate comments (2)
pkg/executor/distribute.go (2)

94-97: ⚠️ Potential issue | 🟠 Major

Silent success when scheduler job lookup fails.

When jobID remains -1 after all retries (due to errors, empty configs, or the job not being visible), Next returns nil with an empty chunk. The statement appears to succeed but provides no JOB_ID to track or cancel the scheduled operation. This was flagged in a previous review.

Consider either:

  1. Returning an error when the job ID cannot be retrieved after retries, or
  2. At minimum, logging a warning so operators can diagnose missing job IDs.

As per coding guidelines, "Keep error handling actionable and contextual; avoid silently swallowing errors in Go code".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/executor/distribute.go` around lines 94 - 97, In the Next method in
distribute.go, do not silently succeed when jobID == -1; after all retries if
jobID remains -1, return a descriptive error instead of appending zero and
returning nil. Replace the current branch that appends a zero via
chk.AppendUint64(0, uint64(jobID)) and returns nil with logic that returns an
error (e.g., fmt.Errorf("failed to obtain scheduler job ID after %d retries"))
so callers can detect and handle the missing JOB_ID; if you prefer not to fail
hard, at minimum emit a contextual warning log with the same message before
returning an error or non-nil status.

106-113: ⚠️ Potential issue | 🟠 Major

Errors and transient "not found" cases are still conflated.

Both real failures (network errors, PD issues, context deadline) and the expected transient "job not visible yet" case return (false, -1). The caller cannot distinguish between:

  • A retriable "not found" condition (should retry)
  • A hard failure that should surface to the user

This was raised in a previous review. The suggested fix was to return (jobID, found, err) and only retry when !found && err == nil.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/executor/distribute.go` around lines 106 - 113, Change the function to
distinguish transient "not found" from real errors by updating its signature to
return (jobID int, found bool, err error) and adjust the logic around the
'config'/'configs' check: when fetching scheduler config fails (err != nil)
return -1, false, err; when the config isn't present or not a []any treat it as
not-found and return -1, false, nil; leave successful parsing to return the
jobID, true, nil. Update callers to handle the new (jobID, found, err) tuple and
ensure retry logic only retries when !found && err == nil; update the log
messages that use logutil.Logger(ctx).Info around the 'config'/'configs' checks
to reflect whether the result is an error or a not-found condition.
🧹 Nitpick comments (1)
pkg/executor/distribute_table_test.go (1)

131-131: Consider adding test coverage for failure scenarios.

The test only covers the happy path where the job is eventually found after one skip. Consider adding tests for:

  • Job never found after all retries (verifies behavior when jobID stays -1)
  • GetSchedulerConfig returning an error (verifies error handling)
  • Malformed job config (e.g., missing or invalid job-id field)

This would help validate the error handling paths in getSchedulerJob.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/executor/distribute_table_test.go` at line 131, Add tests exercising
failure paths for getSchedulerJob: (1) simulate MockDistributePDCli with skip
values that cause jobID to remain -1 across all retries and assert the function
returns the expected error or nil job; (2) mock GetSchedulerConfig to return an
error and assert getSchedulerJob surfaces that error; (3) provide a malformed
scheduler job config (missing/invalid "job-id") via the mock and assert the
parsing/validation path returns the expected error. Target functions/mocks:
getSchedulerJob, MockDistributePDCli (adjust its behavior), and the
GetSchedulerConfig response to validate jobID handling and error propagation.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Outside diff comments:
In `@pkg/executor/distribute.go`:
- Line 130: The unchecked assertion job["job-id"].(float64) can panic; change it
to a safe check (e.g., use the comma-ok form or a type switch) to validate the
value from job["job-id"] before using it, handle nil/mismatched types by
returning an error, and convert acceptable alternatives (int, int64,
json.Number, or string) to the expected numeric id; update the location where id
is used (the id variable created from job["job-id"]) so callers handle the
returned error instead of risking a panic.

---

Duplicate comments:
In `@pkg/executor/distribute.go`:
- Around line 94-97: In the Next method in distribute.go, do not silently
succeed when jobID == -1; after all retries if jobID remains -1, return a
descriptive error instead of appending zero and returning nil. Replace the
current branch that appends a zero via chk.AppendUint64(0, uint64(jobID)) and
returns nil with logic that returns an error (e.g., fmt.Errorf("failed to obtain
scheduler job ID after %d retries")) so callers can detect and handle the
missing JOB_ID; if you prefer not to fail hard, at minimum emit a contextual
warning log with the same message before returning an error or non-nil status.
- Around line 106-113: Change the function to distinguish transient "not found"
from real errors by updating its signature to return (jobID int, found bool, err
error) and adjust the logic around the 'config'/'configs' check: when fetching
scheduler config fails (err != nil) return -1, false, err; when the config isn't
present or not a []any treat it as not-found and return -1, false, nil; leave
successful parsing to return the jobID, true, nil. Update callers to handle the
new (jobID, found, err) tuple and ensure retry logic only retries when !found &&
err == nil; update the log messages that use logutil.Logger(ctx).Info around the
'config'/'configs' checks to reflect whether the result is an error or a
not-found condition.

---

Nitpick comments:
In `@pkg/executor/distribute_table_test.go`:
- Line 131: Add tests exercising failure paths for getSchedulerJob: (1) simulate
MockDistributePDCli with skip values that cause jobID to remain -1 across all
retries and assert the function returns the expected error or nil job; (2) mock
GetSchedulerConfig to return an error and assert getSchedulerJob surfaces that
error; (3) provide a malformed scheduler job config (missing/invalid "job-id")
via the mock and assert the parsing/validation path returns the expected error.
Target functions/mocks: getSchedulerJob, MockDistributePDCli (adjust its
behavior), and the GetSchedulerConfig response to validate jobID handling and
error propagation.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 1756c639-2c7e-4c22-9045-0d2e5c8429eb

📥 Commits

Reviewing files that changed from the base of the PR and between fb6c2cd and 39e03df.

📒 Files selected for processing (2)
  • pkg/executor/distribute.go
  • pkg/executor/distribute_table_test.go

Signed-off-by: bufferflies <1045931706@qq.com>
@bufferflies bufferflies force-pushed the distribute_with_retry branch from 39e03df to bab46cb Compare March 10, 2026 07:27
@ti-chi-bot
Copy link

ti-chi-bot bot commented Mar 10, 2026

[FORMAT CHECKER NOTIFICATION]

Notice: To remove the do-not-merge/needs-linked-issue label, please provide the linked issue number on one line in the PR body, for example: Issue Number: close #123 or Issue Number: ref #456.

📖 For more info, you can check the "Contribute Code" section in the development guide.


Notice: To remove the do-not-merge/needs-tests-checked label, please finished the tests then check the finished items in description.

For example:

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)
  • No code

📖 For more info, you can check the "Contribute Code" section in the development guide.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

♻️ Duplicate comments (2)
pkg/executor/distribute.go (2)

133-133: ⚠️ Potential issue | 🟠 Major

Unchecked type assertion can panic.

If PD returns a malformed job object where job["job-id"] is not a float64 (e.g., nil, string, or missing), this line will panic. Use a checked type assertion:

-			id := job["job-id"].(float64)
+			id, ok := job["job-id"].(float64)
+			if !ok {
+				logutil.Logger(ctx).Info("invalid job-id type", zap.Any("job-id", job["job-id"]))
+				continue
+			}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/executor/distribute.go` at line 133, Replace the unchecked assertion id
:= job["job-id"].(float64) with a safe extraction that checks existence and type
(e.g., v, ok := job["job-id"]; if !ok { handle error } then switch on v's type
to accept float64, float32, int, int64, json.Number or string and convert to the
required numeric form), and propagate or log an error when the key is missing or
cannot be converted so the code cannot panic; apply this change around the id
variable usage in distribute.go where job["job-id"] is read.

105-139: ⚠️ Potential issue | 🟠 Major

Error propagation still missing — errors flattened into the retry path.

The function returns (bool, float64) but still conflates fetch failures (line 109), malformed responses (lines 114, 121), and genuine "job not visible yet" into the same (false, -1) return. The caller cannot distinguish retry-able transient states from hard errors that should fail fast.

When retries exhaust, Next returns nil with an empty chunk — the statement silently succeeds without returning a JOB_ID, leaving users unable to track or cancel the job.

As flagged in prior reviews, consider returning (float64, bool, error) so:

  • err != nil → propagate immediately, no retry
  • !found && err == nil → transient, retry
  • found → success

As per coding guidelines, "Keep error handling actionable and contextual; avoid silently swallowing errors in Go code".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/executor/distribute.go` around lines 105 - 139, Change getSchedulerJob on
DistributeTableExec to return an error in addition to the found flag and job ID
(e.g., signature -> getSchedulerJob(ctx context.Context) (float64, bool, error)
or (jobID float64, found bool, err error)); on infosync.GetSchedulerConfig
failures and on malformed responses (invalid type assertions) return a non-nil
err immediately instead of collapsing to (false, -1), and only return (0, false,
nil) for a genuine “not found yet” transient case; update callers (e.g., Next)
to treat err != nil as fatal, found==false && err==nil as retryable, and
found==true as success so job-not-visible vs hard errors are distinguished.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@pkg/executor/distribute.go`:
- Line 133: Replace the unchecked assertion id := job["job-id"].(float64) with a
safe extraction that checks existence and type (e.g., v, ok := job["job-id"]; if
!ok { handle error } then switch on v's type to accept float64, float32, int,
int64, json.Number or string and convert to the required numeric form), and
propagate or log an error when the key is missing or cannot be converted so the
code cannot panic; apply this change around the id variable usage in
distribute.go where job["job-id"] is read.
- Around line 105-139: Change getSchedulerJob on DistributeTableExec to return
an error in addition to the found flag and job ID (e.g., signature ->
getSchedulerJob(ctx context.Context) (float64, bool, error) or (jobID float64,
found bool, err error)); on infosync.GetSchedulerConfig failures and on
malformed responses (invalid type assertions) return a non-nil err immediately
instead of collapsing to (false, -1), and only return (0, false, nil) for a
genuine “not found yet” transient case; update callers (e.g., Next) to treat err
!= nil as fatal, found==false && err==nil as retryable, and found==true as
success so job-not-visible vs hard errors are distinguished.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 438ac2c4-132c-4800-9c56-672d122c950d

📥 Commits

Reviewing files that changed from the base of the PR and between 39e03df and bab46cb.

📒 Files selected for processing (2)
  • pkg/executor/distribute.go
  • pkg/executor/distribute_table_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • pkg/executor/distribute_table_test.go

@ti-chi-bot
Copy link

ti-chi-bot bot commented Mar 10, 2026

@bufferflies: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
idc-jenkins-ci-tidb/unit-test bab46cb link true /test unit-test

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

do-not-merge/needs-linked-issue do-not-merge/needs-tests-checked release-note-none Denotes a PR that doesn't merit a release note. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant