executor: use retry when not found job#66791
executor: use retry when not found job#66791bufferflies wants to merge 2 commits intopingcap:masterfrom
Conversation
Signed-off-by: bufferflies <1045931706@qq.com>
|
Review Complete Findings: 4 issues ℹ️ Learn more details on Pantheon AI. |
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
|
Hi @bufferflies. Thanks for your PR. PRs from untrusted users cannot be marked as trusted with I understand the commands that are listed here. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
📝 WalkthroughWalkthroughAdded a retry loop in DistributeTableExec.Next to fetch a scheduler job ID via a new helper getSchedulerJob(ctx) (bool, float64) that queries scheduler config through infosync.GetSchedulerConfig. Errors are logged; when a job ID is found it's appended to the output chunk. Tests updated to allow an initial skip in the mock PD client. Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Poem
🚥 Pre-merge checks | ✅ 1 | ❌ 2❌ Failed checks (2 warnings)
✅ Passed checks (1 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pkg/executor/distribute.go`:
- Around line 79-89: Change getSchedulerJob to return (jobID float64, found
bool, err error) and update callers (including distributeTable/Next) to
propagate err immediately; only perform the retry loop when found==false (i.e.,
job not visible yet), do not retry on non-nil err, and when found==true safely
type-assert the "job-id" field with a checked assertion (avoid unchecked
job["job-id"].(float64) that can panic). After a successful found==true return,
append the job ID to chk (chk.AppendUint64) and otherwise return the appropriate
no-row response; ensure all error paths return the error instead of flattening
to -1.
- Around line 80-86: The retry loop using time.Sleep should be made
context-aware: replace time.Sleep(500 * time.Millisecond) with a select that
waits on either ctx.Done() or time.After(500*time.Millisecond) so
cancellation/timeout is respected; when ctx.Done() fires return or propagate
ctx.Err() (or otherwise abort the surrounding operation) instead of continuing
to sleep. Update the loop around e.getSchedulerJob(ctx) and jobID to
break/return promptly on ctx cancellation and only continue after the time.After
branch.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: 8252403c-313e-4626-957e-4bccdab3fc16
📒 Files selected for processing (1)
pkg/executor/distribute.go
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## master #66791 +/- ##
================================================
- Coverage 77.6762% 77.0671% -0.6091%
================================================
Files 2008 1930 -78
Lines 549877 539707 -10170
================================================
- Hits 427124 415937 -11187
- Misses 121045 123752 +2707
+ Partials 1708 18 -1690
Flags with carried forward coverage won't be shown. Click here to find out more.
🚀 New features to boost your workflow:
|
pkg/executor/distribute.go
Outdated
| configs, ok := config.([]any) | ||
| if !ok { | ||
| return errs.ErrClientProtoUnmarshal.FastGenByArgs(config) | ||
| logutil.Logger(ctx).Warn("get empty scheduler config") |
There was a problem hiding this comment.
[P2] Warn log fires during expected transient "no config yet" window
Why: The config.([]any) type-assertion failure is logged at WARN level, but this is the normal "job not visible yet" condition that the surrounding 3-attempt retry loop explicitly expects. The same shape is handled silently (no warning) in show.go:2529-2533. This generates up to 3 spurious WARN entries per DISTRIBUTE TABLE invocation during normal eventual-consistency windows, polluting logs and potentially alarming operators.
Evidence: pkg/executor/distribute.go:105 — logutil.Logger(ctx).Warn("get empty scheduler config")
Compare: pkg/executor/show.go:2529 — same config.([]any) failure returns nil with no warning.
Suggested fix: Downgrade to Debug level (or drop entirely) for the transient not-found path; reserve Warn/Error for unexpected fetch failures.
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
pkg/executor/distribute.go (1)
130-130:⚠️ Potential issue | 🔴 CriticalUnchecked type assertion can panic.
The type assertion
job["job-id"].(float64)will panic if the value isnilor not afloat64. This was flagged in a previous review and remains unaddressed. If PD returns malformed data, the executor will crash instead of returning an error.🛡️ Proposed fix with checked type assertion
if job["alias"] == alias && job["engine"] == e.engine && job["rule"] == e.rule && job["status"] != "finished" { - id := job["job-id"].(float64) + id, ok := job["job-id"].(float64) + if !ok { + logutil.Logger(ctx).Info("invalid job-id type in scheduler config", zap.Any("job-id", job["job-id"])) + continue + } if id > jobID { jobID = id }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/executor/distribute.go` at line 130, The unchecked assertion job["job-id"].(float64) can panic; change it to a safe check (e.g., use the comma-ok form or a type switch) to validate the value from job["job-id"] before using it, handle nil/mismatched types by returning an error, and convert acceptable alternatives (int, int64, json.Number, or string) to the expected numeric id; update the location where id is used (the id variable created from job["job-id"]) so callers handle the returned error instead of risking a panic.
♻️ Duplicate comments (2)
pkg/executor/distribute.go (2)
94-97:⚠️ Potential issue | 🟠 MajorSilent success when scheduler job lookup fails.
When
jobIDremains-1after all retries (due to errors, empty configs, or the job not being visible),Nextreturnsnilwith an empty chunk. The statement appears to succeed but provides noJOB_IDto track or cancel the scheduled operation. This was flagged in a previous review.Consider either:
- Returning an error when the job ID cannot be retrieved after retries, or
- At minimum, logging a warning so operators can diagnose missing job IDs.
As per coding guidelines, "Keep error handling actionable and contextual; avoid silently swallowing errors in Go code".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/executor/distribute.go` around lines 94 - 97, In the Next method in distribute.go, do not silently succeed when jobID == -1; after all retries if jobID remains -1, return a descriptive error instead of appending zero and returning nil. Replace the current branch that appends a zero via chk.AppendUint64(0, uint64(jobID)) and returns nil with logic that returns an error (e.g., fmt.Errorf("failed to obtain scheduler job ID after %d retries")) so callers can detect and handle the missing JOB_ID; if you prefer not to fail hard, at minimum emit a contextual warning log with the same message before returning an error or non-nil status.
106-113:⚠️ Potential issue | 🟠 MajorErrors and transient "not found" cases are still conflated.
Both real failures (network errors, PD issues, context deadline) and the expected transient "job not visible yet" case return
(false, -1). The caller cannot distinguish between:
- A retriable "not found" condition (should retry)
- A hard failure that should surface to the user
This was raised in a previous review. The suggested fix was to return
(jobID, found, err)and only retry when!found && err == nil.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/executor/distribute.go` around lines 106 - 113, Change the function to distinguish transient "not found" from real errors by updating its signature to return (jobID int, found bool, err error) and adjust the logic around the 'config'/'configs' check: when fetching scheduler config fails (err != nil) return -1, false, err; when the config isn't present or not a []any treat it as not-found and return -1, false, nil; leave successful parsing to return the jobID, true, nil. Update callers to handle the new (jobID, found, err) tuple and ensure retry logic only retries when !found && err == nil; update the log messages that use logutil.Logger(ctx).Info around the 'config'/'configs' checks to reflect whether the result is an error or a not-found condition.
🧹 Nitpick comments (1)
pkg/executor/distribute_table_test.go (1)
131-131: Consider adding test coverage for failure scenarios.The test only covers the happy path where the job is eventually found after one skip. Consider adding tests for:
- Job never found after all retries (verifies behavior when
jobIDstays-1)GetSchedulerConfigreturning an error (verifies error handling)- Malformed job config (e.g., missing or invalid
job-idfield)This would help validate the error handling paths in
getSchedulerJob.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/executor/distribute_table_test.go` at line 131, Add tests exercising failure paths for getSchedulerJob: (1) simulate MockDistributePDCli with skip values that cause jobID to remain -1 across all retries and assert the function returns the expected error or nil job; (2) mock GetSchedulerConfig to return an error and assert getSchedulerJob surfaces that error; (3) provide a malformed scheduler job config (missing/invalid "job-id") via the mock and assert the parsing/validation path returns the expected error. Target functions/mocks: getSchedulerJob, MockDistributePDCli (adjust its behavior), and the GetSchedulerConfig response to validate jobID handling and error propagation.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Outside diff comments:
In `@pkg/executor/distribute.go`:
- Line 130: The unchecked assertion job["job-id"].(float64) can panic; change it
to a safe check (e.g., use the comma-ok form or a type switch) to validate the
value from job["job-id"] before using it, handle nil/mismatched types by
returning an error, and convert acceptable alternatives (int, int64,
json.Number, or string) to the expected numeric id; update the location where id
is used (the id variable created from job["job-id"]) so callers handle the
returned error instead of risking a panic.
---
Duplicate comments:
In `@pkg/executor/distribute.go`:
- Around line 94-97: In the Next method in distribute.go, do not silently
succeed when jobID == -1; after all retries if jobID remains -1, return a
descriptive error instead of appending zero and returning nil. Replace the
current branch that appends a zero via chk.AppendUint64(0, uint64(jobID)) and
returns nil with logic that returns an error (e.g., fmt.Errorf("failed to obtain
scheduler job ID after %d retries")) so callers can detect and handle the
missing JOB_ID; if you prefer not to fail hard, at minimum emit a contextual
warning log with the same message before returning an error or non-nil status.
- Around line 106-113: Change the function to distinguish transient "not found"
from real errors by updating its signature to return (jobID int, found bool, err
error) and adjust the logic around the 'config'/'configs' check: when fetching
scheduler config fails (err != nil) return -1, false, err; when the config isn't
present or not a []any treat it as not-found and return -1, false, nil; leave
successful parsing to return the jobID, true, nil. Update callers to handle the
new (jobID, found, err) tuple and ensure retry logic only retries when !found &&
err == nil; update the log messages that use logutil.Logger(ctx).Info around the
'config'/'configs' checks to reflect whether the result is an error or a
not-found condition.
---
Nitpick comments:
In `@pkg/executor/distribute_table_test.go`:
- Line 131: Add tests exercising failure paths for getSchedulerJob: (1) simulate
MockDistributePDCli with skip values that cause jobID to remain -1 across all
retries and assert the function returns the expected error or nil job; (2) mock
GetSchedulerConfig to return an error and assert getSchedulerJob surfaces that
error; (3) provide a malformed scheduler job config (missing/invalid "job-id")
via the mock and assert the parsing/validation path returns the expected error.
Target functions/mocks: getSchedulerJob, MockDistributePDCli (adjust its
behavior), and the GetSchedulerConfig response to validate jobID handling and
error propagation.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: 1756c639-2c7e-4c22-9045-0d2e5c8429eb
📒 Files selected for processing (2)
pkg/executor/distribute.gopkg/executor/distribute_table_test.go
Signed-off-by: bufferflies <1045931706@qq.com>
39e03df to
bab46cb
Compare
|
[FORMAT CHECKER NOTIFICATION] Notice: To remove the 📖 For more info, you can check the "Contribute Code" section in the development guide. Notice: To remove the For example:
📖 For more info, you can check the "Contribute Code" section in the development guide. |
There was a problem hiding this comment.
♻️ Duplicate comments (2)
pkg/executor/distribute.go (2)
133-133:⚠️ Potential issue | 🟠 MajorUnchecked type assertion can panic.
If PD returns a malformed job object where
job["job-id"]is not afloat64(e.g.,nil, string, or missing), this line will panic. Use a checked type assertion:- id := job["job-id"].(float64) + id, ok := job["job-id"].(float64) + if !ok { + logutil.Logger(ctx).Info("invalid job-id type", zap.Any("job-id", job["job-id"])) + continue + }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/executor/distribute.go` at line 133, Replace the unchecked assertion id := job["job-id"].(float64) with a safe extraction that checks existence and type (e.g., v, ok := job["job-id"]; if !ok { handle error } then switch on v's type to accept float64, float32, int, int64, json.Number or string and convert to the required numeric form), and propagate or log an error when the key is missing or cannot be converted so the code cannot panic; apply this change around the id variable usage in distribute.go where job["job-id"] is read.
105-139:⚠️ Potential issue | 🟠 MajorError propagation still missing — errors flattened into the retry path.
The function returns
(bool, float64)but still conflates fetch failures (line 109), malformed responses (lines 114, 121), and genuine "job not visible yet" into the same(false, -1)return. The caller cannot distinguish retry-able transient states from hard errors that should fail fast.When retries exhaust,
Nextreturnsnilwith an empty chunk — the statement silently succeeds without returning aJOB_ID, leaving users unable to track or cancel the job.As flagged in prior reviews, consider returning
(float64, bool, error)so:
err != nil→ propagate immediately, no retry!found && err == nil→ transient, retryfound→ successAs per coding guidelines, "Keep error handling actionable and contextual; avoid silently swallowing errors in Go code".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/executor/distribute.go` around lines 105 - 139, Change getSchedulerJob on DistributeTableExec to return an error in addition to the found flag and job ID (e.g., signature -> getSchedulerJob(ctx context.Context) (float64, bool, error) or (jobID float64, found bool, err error)); on infosync.GetSchedulerConfig failures and on malformed responses (invalid type assertions) return a non-nil err immediately instead of collapsing to (false, -1), and only return (0, false, nil) for a genuine “not found yet” transient case; update callers (e.g., Next) to treat err != nil as fatal, found==false && err==nil as retryable, and found==true as success so job-not-visible vs hard errors are distinguished.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@pkg/executor/distribute.go`:
- Line 133: Replace the unchecked assertion id := job["job-id"].(float64) with a
safe extraction that checks existence and type (e.g., v, ok := job["job-id"]; if
!ok { handle error } then switch on v's type to accept float64, float32, int,
int64, json.Number or string and convert to the required numeric form), and
propagate or log an error when the key is missing or cannot be converted so the
code cannot panic; apply this change around the id variable usage in
distribute.go where job["job-id"] is read.
- Around line 105-139: Change getSchedulerJob on DistributeTableExec to return
an error in addition to the found flag and job ID (e.g., signature ->
getSchedulerJob(ctx context.Context) (float64, bool, error) or (jobID float64,
found bool, err error)); on infosync.GetSchedulerConfig failures and on
malformed responses (invalid type assertions) return a non-nil err immediately
instead of collapsing to (false, -1), and only return (0, false, nil) for a
genuine “not found yet” transient case; update callers (e.g., Next) to treat err
!= nil as fatal, found==false && err==nil as retryable, and found==true as
success so job-not-visible vs hard errors are distinguished.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: 438ac2c4-132c-4800-9c56-672d122c950d
📒 Files selected for processing (2)
pkg/executor/distribute.gopkg/executor/distribute_table_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
- pkg/executor/distribute_table_test.go
|
@bufferflies: The following test failed, say
Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
What problem does this PR solve?
Issue Number: close #xxx
Problem Summary:
What changed and how does it work?
Check List
Tests
Side effects
Documentation
Release note
Please refer to Release Notes Language Style Guide to write a quality release note.
Summary by CodeRabbit
Bug Fixes
Tests