Conversation
Signed-off-by: tongjian <1045931706@qq.com>
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughAdds defensive input validation and type checks across multiple scheduler handlers, tightens HTTP error responses and control flow to avoid panics/nil derefs, and introduces HTTP/unit tests covering invalid input scenarios. Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 1 | ❌ 2❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (1 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
server/api/scheduler.go (1)
145-151:⚠️ Potential issue | 🟠 MajorMalformed
timeoutvalues still fall back to30m.
apiutil.CollectStringOptionreturnsErrOptionNotExistboth whentimeoutis absent and when it is present but not a string (pkg/utils/apiutil/apiutil.go:337-345). With this branch, a payload like{"timeout":123}still appends the default and creates the scheduler instead of returning 400. Check raw key presence ininputbefore applying the default.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@server/api/scheduler.go` around lines 145 - 151, The branch treating ErrOptionNotExist as "use default" is incorrect because apiutil.CollectStringOption returns ErrOptionNotExist both for a missing key and for a key with a non-string value; update the logic in the scheduler handler around apiutil.CollectStringOption("timeout", input, collector) to first check raw presence of the "timeout" key in input (e.g., via a map lookup on input) and only call collector(defaultTimeout) when the key is absent; if the key exists but CollectStringOption returns ErrOptionNotExist (meaning the value is malformed), return a 400 using h.r.JSON with the error message instead of falling back to default; keep the existing behavior for other errors.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pkg/schedule/schedulers/balance_range.go`:
- Around line 351-353: The early-nil-start guard in checkJob is letting an
invalid running job be marked finished but then allows checkJob to continue and
reach prepare/Schedule; fix this by detecting job == nil or job.Start == nil,
calling finishLocked(job) to mark it finished/clean up, and then immediately
return (stop further processing) from checkJob so it cannot proceed to prepare
or Schedule; update the branch around job == nil || job.Start == nil in checkJob
to perform finishLocked and return immediately.
In `@pkg/schedule/schedulers/evict_leader.go`:
- Around line 176-182: The helper currently conflates "store already configured"
with "we paused it", causing rollbacks in updateConfig to resume stores that
this request never paused; change the helper used in evict_leader logic (the
function that checks conf.StoreIDWithRanges and calls
conf.cluster.PauseLeaderTransfer) to return separate flags (e.g., exists and
pausedByRequest or a small result struct) instead of a single boolean, set
pausedByRequest only when conf.cluster.PauseLeaderTransfer(id, constant.In)
succeeds, and update all callers (including resumeLeaderTransferIfExist and the
rollback paths in updateConfig) to only call resumeLeaderTransfer when
pausedByRequest is true (leave stores that merely existed untouched).
In `@pkg/schedule/schedulers/grant_leader.go`:
- Around line 268-293: The error branches in the ranges parsing block always
call handler.config.cluster.ResumeLeaderTransfer(id, constant.Out) which can
unpause existing stores; change the logic in
pkg/schedule/schedulers/grant_leader.go so ResumeLeaderTransfer is only called
if this request previously paused the store. Introduce a local boolean (e.g.,
paused) set to true where PauseLeaderTransfer is invoked for new stores (the
existing PauseLeaderTransfer call around lines ~257-263) and then check that
paused flag before calling handler.config.cluster.ResumeLeaderTransfer(id,
constant.Out) in the ranges validation error/default branches inside the ranges
handling switch.
---
Outside diff comments:
In `@server/api/scheduler.go`:
- Around line 145-151: The branch treating ErrOptionNotExist as "use default" is
incorrect because apiutil.CollectStringOption returns ErrOptionNotExist both for
a missing key and for a key with a non-string value; update the logic in the
scheduler handler around apiutil.CollectStringOption("timeout", input,
collector) to first check raw presence of the "timeout" key in input (e.g., via
a map lookup on input) and only call collector(defaultTimeout) when the key is
absent; if the key exists but CollectStringOption returns ErrOptionNotExist
(meaning the value is malformed), return a 400 using h.r.JSON with the error
message instead of falling back to default; keep the existing behavior for other
errors.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: d0e9d009-8d30-4d17-8018-37955c835928
📒 Files selected for processing (12)
pkg/schedule/schedulers/balance_range.gopkg/schedule/schedulers/balance_range_test.gopkg/schedule/schedulers/evict_leader.gopkg/schedule/schedulers/evict_leader_test.gopkg/schedule/schedulers/grant_hot_region.gopkg/schedule/schedulers/grant_hot_region_test.gopkg/schedule/schedulers/grant_leader.gopkg/schedule/schedulers/grant_leader_test.gopkg/schedule/schedulers/scheduler_controller.gopkg/schedule/schedulers/transfer_witness_leader.gopkg/schedule/schedulers/transfer_witness_leader_test.goserver/api/scheduler.go
| if job == nil || job.Start == nil { | ||
| return true | ||
| } |
There was a problem hiding this comment.
Nil-start jobs are still eligible for one scheduling round.
This avoids the panic, but checkJob keeps going after finishLocked (Lines 470-496). A running job restored without Start will be marked finished here and can still reach prepare/Schedule once in the same tick. Return immediately after finishing that invalid job state.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@pkg/schedule/schedulers/balance_range.go` around lines 351 - 353, The
early-nil-start guard in checkJob is letting an invalid running job be marked
finished but then allows checkJob to continue and reach prepare/Schedule; fix
this by detecting job == nil or job.Start == nil, calling finishLocked(job) to
mark it finished/clean up, and then immediately return (stop further processing)
from checkJob so it cannot proceed to prepare or Schedule; update the branch
around job == nil || job.Start == nil in checkJob to perform finishLocked and
return immediately.
| if _, exist := conf.StoreIDWithRanges[id]; exist { | ||
| return true, nil | ||
| } | ||
| if err := conf.cluster.PauseLeaderTransfer(id, constant.In); err != nil { | ||
| return false, err | ||
| } | ||
| return true, nil | ||
| return false, nil |
There was a problem hiding this comment.
Bad updates can now unpause an existing evict-leader store.
Because this helper now returns true for stores that are already configured, the rollback paths in updateConfig still call resumeLeaderTransferIfExist(id) on 400s (Lines 424, 442, 449, 454, and 464) even though this request never paused that store. A request like {"store_id":1,"batch":0} will now fail validation and also clear the scheduler’s pause state for store 1. Track pausedByRequest separately from exists, and only resume when this request actually called PauseLeaderTransfer.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@pkg/schedule/schedulers/evict_leader.go` around lines 176 - 182, The helper
currently conflates "store already configured" with "we paused it", causing
rollbacks in updateConfig to resume stores that this request never paused;
change the helper used in evict_leader logic (the function that checks
conf.StoreIDWithRanges and calls conf.cluster.PauseLeaderTransfer) to return
separate flags (e.g., exists and pausedByRequest or a small result struct)
instead of a single boolean, set pausedByRequest only when
conf.cluster.PauseLeaderTransfer(id, constant.In) succeeds, and update all
callers (including resumeLeaderTransferIfExist and the rollback paths in
updateConfig) to only call resumeLeaderTransfer when pausedByRequest is true
(leave stores that merely existed untouched).
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
pkg/schedule/schedulers/balance_range.go (1)
350-351:⚠️ Potential issue | 🔴 Critical
shouldFinishedstill panics on nil job or missingStart.This still dereferences
job.Startunconditionally, so the new regression case withjob == nil/job.Start == nilwill panic instead of returning safely.Suggested fix
func (job *balanceRangeSchedulerJob) shouldFinished() bool { + if job == nil || job.Start == nil { + return true + } return time.Since(*job.Start) > job.Timeout }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/schedule/schedulers/balance_range.go` around lines 350 - 351, The shouldFinished method currently dereferences job.Start unconditionally causing a panic when job or job.Start is nil; update balanceRangeSchedulerJob.shouldFinished to first check for job != nil and job.Start != nil and return false (not finished) if either is nil, then perform the time.Since(*job.Start) > job.Timeout comparison; ensure you reference the balanceRangeSchedulerJob receiver, its Start field and Timeout field when adding the nil-guards so the behavior is safe for the regression case.
♻️ Duplicate comments (1)
pkg/schedule/schedulers/balance_range.go (1)
467-475:⚠️ Potential issue | 🟠 MajorReturn immediately after timing out the job.
Once this branch marks the job as
finished,checkJobstill falls through intoprepareand can schedule one more operator in the same tick. Stop here after cleanup.Suggested fix
if job.shouldFinished() { if err := s.conf.finishLocked(index); err != nil { balancePersistFailedCounter.Inc() return false } s.cleanJobStatus(cluster, job) balanceRangeExpiredCounter.Inc() log.Info("balance key range job finished due to timeout", zap.String("alias", job.Alias), zap.Uint64("job-id", job.JobID)) + return false }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/schedule/schedulers/balance_range.go` around lines 467 - 475, The checkJob branch that handles timeout needs to return immediately after marking the job finished to avoid falling through into prepare; update the block inside checkJob where it calls job.shouldFinished(), s.conf.finishLocked(index), s.cleanJobStatus(cluster, job), balanceRangeExpiredCounter.Inc(), and the log.Info call so that it returns false (or otherwise exits) right after cleanup and logging, preventing prepare/prepareRange from scheduling another operator in the same tick.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Outside diff comments:
In `@pkg/schedule/schedulers/balance_range.go`:
- Around line 350-351: The shouldFinished method currently dereferences
job.Start unconditionally causing a panic when job or job.Start is nil; update
balanceRangeSchedulerJob.shouldFinished to first check for job != nil and
job.Start != nil and return false (not finished) if either is nil, then perform
the time.Since(*job.Start) > job.Timeout comparison; ensure you reference the
balanceRangeSchedulerJob receiver, its Start field and Timeout field when adding
the nil-guards so the behavior is safe for the regression case.
---
Duplicate comments:
In `@pkg/schedule/schedulers/balance_range.go`:
- Around line 467-475: The checkJob branch that handles timeout needs to return
immediately after marking the job finished to avoid falling through into
prepare; update the block inside checkJob where it calls job.shouldFinished(),
s.conf.finishLocked(index), s.cleanJobStatus(cluster, job),
balanceRangeExpiredCounter.Inc(), and the log.Info call so that it returns false
(or otherwise exits) right after cleanup and logging, preventing
prepare/prepareRange from scheduling another operator in the same tick.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: c16d46e1-b0bf-4117-a0bd-4a0dfe8d55ed
📒 Files selected for processing (3)
pkg/schedule/schedulers/balance_range.gopkg/schedule/schedulers/balance_range_test.gopkg/schedule/schedulers/evict_leader.go
🚧 Files skipped from review as they are similar to previous changes (1)
- pkg/schedule/schedulers/evict_leader.go
There was a problem hiding this comment.
Actionable comments posted: 1
♻️ Duplicate comments (1)
pkg/schedule/schedulers/evict_leader.go (1)
176-180:⚠️ Potential issue | 🟠 MajorKeep
alreadyConfiguredseparate frompausedByRequest.This helper now returns the same
truevalue for two different states, but the rollback paths at Line 424, Line 442, Line 449, Line 454, and Line 464 still treat it as “this request paused the store”. A rejected update can therefore clear the existing pause on an already-configured store even though this request never calledPauseLeaderTransfer.💡 One way to fix it
+type pauseLeaderTransferResult struct { + alreadyConfigured bool + pausedByRequest bool +} + -func (conf *evictLeaderSchedulerConfig) pauseLeaderTransferIfStoreNotExist(id uint64) (bool, error) { +func (conf *evictLeaderSchedulerConfig) pauseLeaderTransferIfStoreNotExist(id uint64) (pauseLeaderTransferResult, error) { conf.RLock() defer conf.RUnlock() if _, exist := conf.StoreIDWithRanges[id]; exist { - return true, nil + return pauseLeaderTransferResult{alreadyConfigured: true}, nil } if err := conf.cluster.PauseLeaderTransfer(id, constant.In); err != nil { - return false, err + return pauseLeaderTransferResult{}, err } - return true, nil + return pauseLeaderTransferResult{pausedByRequest: true}, nil }- exist, err = handler.config.pauseLeaderTransferIfStoreNotExist(id) + pauseState, err = handler.config.pauseLeaderTransferIfStoreNotExist(id) ... - handler.config.resumeLeaderTransferIfExist(id) + if pauseState.pausedByRequest { + handler.config.resumeLeaderTransferIfExist(id) + } ... - } else if exist { + } else if pauseState.alreadyConfigured { ranges = handler.config.getRanges(id) }Also applies to: 423-465
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/schedule/schedulers/evict_leader.go` around lines 176 - 180, The helper that checks/conflicts and calls conf.cluster.PauseLeaderTransfer conflates "alreadyConfigured" and "pausedByRequest" into a single boolean; change it to return two distinct booleans (e.g., alreadyConfigured, pausedByRequest) so callers can distinguish stores that were paused by this request from stores that were already configured; update the helper (the code that reads conf.StoreIDWithRanges and calls conf.cluster.PauseLeaderTransfer) to set alreadyConfigured=true when the store exists in conf.StoreIDWithRanges and set pausedByRequest=true only when PauseLeaderTransfer is actually invoked successfully; then update all rollback paths that currently assume the single true to only call the resume/unpause logic when pausedByRequest is true (leave alreadyConfigured alone).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pkg/schedule/schedulers/balance_range.go`:
- Around line 121-135: The parsed timeout can be negative which causes
shouldFinished() to immediately return true; after parsing timeoutStr with
time.ParseDuration in the block surrounding timeoutStr and err, add a validation
that the resulting timeout is strictly positive (e.g., timeout > 0) and if not,
return a 400 via handler.rd.JSON with a clear message (e.g., "timeout must be
positive") and do not assign job.Timeout; keep existing handling of empty string
and parse errors intact and reference timeoutStr, time.ParseDuration,
job.Timeout and shouldFinished() when locating the change.
---
Duplicate comments:
In `@pkg/schedule/schedulers/evict_leader.go`:
- Around line 176-180: The helper that checks/conflicts and calls
conf.cluster.PauseLeaderTransfer conflates "alreadyConfigured" and
"pausedByRequest" into a single boolean; change it to return two distinct
booleans (e.g., alreadyConfigured, pausedByRequest) so callers can distinguish
stores that were paused by this request from stores that were already
configured; update the helper (the code that reads conf.StoreIDWithRanges and
calls conf.cluster.PauseLeaderTransfer) to set alreadyConfigured=true when the
store exists in conf.StoreIDWithRanges and set pausedByRequest=true only when
PauseLeaderTransfer is actually invoked successfully; then update all rollback
paths that currently assume the single true to only call the resume/unpause
logic when pausedByRequest is true (leave alreadyConfigured alone).
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 0dfccb36-3fe0-44a2-9af9-d9cd001700a9
📒 Files selected for processing (3)
pkg/schedule/schedulers/balance_range.gopkg/schedule/schedulers/balance_range_test.gopkg/schedule/schedulers/evict_leader.go
🚧 Files skipped from review as they are similar to previous changes (1)
- pkg/schedule/schedulers/balance_range_test.go
| if timeoutVal, exists := input["timeout"]; exists { | ||
| timeoutStr, ok := timeoutVal.(string) | ||
| if !ok { | ||
| handler.rd.JSON(w, http.StatusBadRequest, "timeout must be a string") | ||
| return | ||
| } | ||
| job.Timeout = timeout | ||
| if len(timeoutStr) > 0 { | ||
| timeout, err := time.ParseDuration(timeoutStr) | ||
| if err != nil { | ||
| handler.rd.JSON(w, http.StatusBadRequest, fmt.Sprintf("timeout:%s is invalid", timeoutStr)) | ||
| return | ||
| } | ||
| job.Timeout = timeout | ||
| } | ||
| } |
There was a problem hiding this comment.
Consider validating that timeout is positive.
time.ParseDuration accepts negative durations (e.g., "-5m"). A negative timeout would cause shouldFinished() to return true immediately, finishing the job without performing any work. Add a check to reject non-positive timeouts.
🛡️ Suggested fix
if len(timeoutStr) > 0 {
timeout, err := time.ParseDuration(timeoutStr)
if err != nil {
handler.rd.JSON(w, http.StatusBadRequest, fmt.Sprintf("timeout:%s is invalid", timeoutStr))
return
}
+ if timeout <= 0 {
+ handler.rd.JSON(w, http.StatusBadRequest, "timeout must be positive")
+ return
+ }
job.Timeout = timeout
}🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@pkg/schedule/schedulers/balance_range.go` around lines 121 - 135, The parsed
timeout can be negative which causes shouldFinished() to immediately return
true; after parsing timeoutStr with time.ParseDuration in the block surrounding
timeoutStr and err, add a validation that the resulting timeout is strictly
positive (e.g., timeout > 0) and if not, return a 400 via handler.rd.JSON with a
clear message (e.g., "timeout must be positive") and do not assign job.Timeout;
keep existing handling of empty string and parse errors intact and reference
timeoutStr, time.ParseDuration, job.Timeout and shouldFinished() when locating
the change.
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## master #10312 +/- ##
==========================================
+ Coverage 78.78% 78.82% +0.04%
==========================================
Files 525 527 +2
Lines 70824 70973 +149
==========================================
+ Hits 55796 55944 +148
- Misses 11004 11012 +8
+ Partials 4024 4017 -7
Flags with carried forward coverage won't be shown. Click here to find out more. 🚀 New features to boost your workflow:
|
liyishuai
left a comment
There was a problem hiding this comment.
Tesed on my side. The AI review is worth considering. Otherwise LGTM.
|
@liyishuai: adding LGTM is restricted to approvers and reviewers in OWNERS files. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: liyishuai The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
Signed-off-by: tongjian <1045931706@qq.com>
|
@bufferflies: The following tests failed, say
Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
okJiang
left a comment
There was a problem hiding this comment.
Please link the corresponding issue and explain your changes in the PR description.
Even when using AI to fix bugs, we should still follow the process of submitting a bug issue -> pull request. The process of submitting a bug issue can also be completed by AI, and we should try to follow atomic commits, with one PR fixing one issue.
What problem does this PR solve?
Issue Number: Ref #10326
What is changed and how does it work?
Check List
Tests
Code changes
Side effects
Related changes
pingcap/docs/pingcap/docs-cn:pingcap/tiup:Release note
Summary by CodeRabbit
Bug Fixes
Tests