mcs: scheduling mcs support gctuner (#10212)
Conversation
Skipping CI for Draft Pull Request.
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior in the review settings.
📝 Walkthrough
Add a configurable GC tuner: new gctuner public API (Config, State, InitGCTuner, UpdateIfNeeded), server-level ServerConfig with validation/defaults, persistence and accessor wiring, watcher and cluster integration to initialize/update the tuner, and unit/integration tests plus config watch coverage.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Watcher
    participant PersistConfig
    participant Memory
    participant GCTuner
    participant RaftCluster
    Watcher->>Memory: MemTotal()
    Memory-->>Watcher: totalMem
    Watcher->>PersistConfig: load persisted Server config
    PersistConfig-->>Watcher: server config
    Watcher->>GCTuner: InitGCTuner(totalMem, mappedConfig)
    GCTuner-->>Watcher: State
    RaftCluster->>GCTuner: periodic tick -> UpdateIfNeeded(cfg)
    GCTuner-->>RaftCluster: applied?/updated (bool)
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ✅ 1 passed | ❌ 2 failed (2 warnings)
Codecov Report
❌ Patch coverage is
Additional details and impacted files:

```
@@           Coverage Diff            @@
##           master   #10212    +/-  ##
========================================
+ Coverage   78.59%   78.90%   +0.30%
========================================
  Files         520      527       +7
  Lines       70014    70990     +976
========================================
+ Hits        55028    56014     +986
+ Misses      11008    10968      -40
- Partials     3978     4008      +30
```

Flags with carried forward coverage won't be shown.
Actionable comments posted: 4
🤖 Fix all issues with AI agents
In `@pkg/gctuner/tuner_test.go`:
- Around line 129-153: Tests mutate package globals (EnableGOGCTuner,
memory.ServerMemoryLimit, GlobalMemoryLimitTuner/globalTuner via Tuning) and do
not restore them; update TestInitGCTuner, TestInitGCTunerWithZeroMemoryLimit,
and TestUpdateIfNeeded to capture current values at test start (e.g., prevEnable
:= EnableGOGCTuner.Load(), prevServerLimit := memory.ServerMemoryLimit,
prevGlobal := GlobalMemoryLimitTuner) and register t.Cleanup() to restore those
values (set EnableGOGCTuner back, reset memory.ServerMemoryLimit, and call
Tuning(prevGlobal) or reassign GlobalMemoryLimitTuner appropriately) so global
state is returned after each test.
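The capture-and-restore pattern described above can be sketched with toy stand-ins. The real globals named in the comment (EnableGOGCTuner, memory.ServerMemoryLimit, GlobalMemoryLimitTuner) belong to the PR; everything below is illustrative, and in a real test the deferred restore would be registered via t.Cleanup instead of defer:

```go
package main

import "sync/atomic"

// Toy stand-ins for the package globals the tests mutate.
var enableTuner atomic.Bool
var serverMemoryLimit atomic.Uint64

// withRestoredGlobals captures the globals, runs fn, then restores them,
// mirroring the prev* capture plus t.Cleanup shape suggested above.
func withRestoredGlobals(fn func()) {
	prevEnable := enableTuner.Load()
	prevLimit := serverMemoryLimit.Load()
	defer func() {
		enableTuner.Store(prevEnable)
		serverMemoryLimit.Store(prevLimit)
	}()
	fn()
}
```

The point is that each test leaves the package-level state exactly as it found it, so test order no longer matters.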
In `@pkg/gctuner/tuner.go`:
- Around line 271-283: The update block currently only triggers when
newMemoryLimitBytes or newMemoryLimitGCTriggerBytes change, so changes to
newMemoryLimitGCTriggerRatio can be ignored due to rounding; modify the
conditional that gates the update to also compare newMemoryLimitGCTriggerRatio
against s.MemoryLimitGCTriggerRatio (and/or include an epsilon for float
comparison) so that when the GC trigger ratio changes you still set
s.MemoryLimitGCTriggerRatio, call GlobalMemoryLimitTuner.SetPercentage and
UpdateMemoryLimit; update references in this block to use
newMemoryLimitGCTriggerRatio, s.MemoryLimitGCTriggerRatio,
GlobalMemoryLimitTuner, memory.ServerMemoryLimit.Store and ensure updated is set
when the ratio-only change occurs.
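The epsilon-based float comparison suggested above can be sketched as follows; the function name and the epsilon value are assumptions, not part of the PR:

```go
package main

import "math"

// ratioEpsilon is an assumed tolerance; the PR would pick its own value.
const ratioEpsilon = 1e-9

// ratioChanged reports whether the GC trigger ratio moved beyond epsilon,
// so ratio-only updates are not swallowed by rounding of the derived
// byte counts.
func ratioChanged(oldRatio, newRatio float64) bool {
	return math.Abs(oldRatio-newRatio) > ratioEpsilon
}
```

The update block would then fire when any of the byte values or this ratio check reports a change.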
In `@pkg/mcs/scheduling/server/config/config.go`:
- Around line 149-152: The ServerConfig adjustment is using the wrong metadata
key: replace the call c.Server.Adjust(configMetaData.Child("server")) with
c.Server.Adjust(configMetaData.Child("pd-server")) so the Adjust on ServerConfig
(c.Server) uses the TOML tag name `pd-server` and preserves user-provided values
instead of treating them as undefined; update the metadata key where
c.Server.Adjust is invoked to "pd-server".
In `@pkg/mcs/scheduling/server/config/watcher.go`:
- Around line 92-97: Replace the plain string concatenation error creation with
errors.Wrap to preserve context: after calling memory.MemTotal() in the watcher
initialization, keep the cancel() call but return errors.Wrap(err, "fail to get
total memory") instead of errors.New("fail to get total memory: "+err.Error());
this uses github.com/pingcap/errors and preserves the original error and stack
trace while keeping the same control flow around memory.MemTotal() and cancel().
Signed-off-by: bufferflies <1045931706@qq.com>
okJiang left a comment:
```go
// Tuning sets the threshold of heap which will be respected by the GOGC tuner.
// When tuning is enabled, the env GOGC will not take effect.
// threshold: disable tuning if threshold == 0
func Tuning(threshold uint64) {
	// disable gc tuner if threshold is zero
	if t := globalTuner.Load(); t == nil {
		t1 := newTuner(threshold)
		globalTuner.CompareAndSwap(nil, &t1)
	} else {
		if threshold <= 0 {
			(*t).stop()
			globalTuner.CompareAndSwap(t, nil)
		} else {
			(*t).setThreshold(threshold)
		}
	}
}
```

The Tuning annotation says that threshold == 0 disables tuning, but newTuner(0) still starts the finalizer loop when globalTuner is empty. And if CompareAndSwap(nil, &t1) fails during concurrent initialization, the tuner created by the losing party will not be stopped (it may survive for a long time and participate in tuning when enabled later).
Do other microservice components also need to support gctuner?

We should support it, and we can enable it for the specified case.

Yes, the tuning should fix this bug.
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
pkg/gctuner/tuner.go (1)
67-82: ⚠️ Potential issue | 🟠 Major: Leaked tuner on concurrent CompareAndSwap failure.
When two goroutines call Tuning(threshold) concurrently with threshold > 0 and globalTuner nil, both create a newTuner (which starts a finalizer loop). The loser of CompareAndSwap silently discards its tuner without calling stop(), leaking a goroutine. This was also identified by @bufferflies in the PR discussion.
🔧 Proposed fix: stop the tuner if CAS fails

```diff
 if t := globalTuner.Load(); t == nil {
 	// init gc tuner only when threshold > 0, otherwise do nothing
 	if threshold > 0 {
 		t1 := newTuner(threshold)
-		globalTuner.CompareAndSwap(nil, &t1)
+		if !globalTuner.CompareAndSwap(nil, &t1) {
+			t1.stop()
+			// Another goroutine won; update its threshold instead.
+			if t2 := globalTuner.Load(); t2 != nil {
+				(*t2).setThreshold(threshold)
+			}
+		}
 	}
 }
```
🧹 Nitpick comments (2)
pkg/gctuner/tuner.go (2)
249-259: Duplicated computation between InitGCTuner and UpdateIfNeeded.
Lines 253-259 replicate the same derivation logic as lines 221-227 in InitGCTuner. Consider extracting a helper (e.g., computeState(totalMem uint64, cfg *Config) (...)) to keep the two paths in sync and reduce maintenance burden.
♻️ Sketch:

```go
func computeDerivedState(totalMem uint64, cfg *Config) (memLimitBytes, gcThresholdBytes uint64, gcTriggerRatio float64, gcTriggerBytes uint64) {
	memLimitBytes = uint64(float64(totalMem) * cfg.ServerMemoryLimit)
	gcThresholdBytes = uint64(float64(memLimitBytes) * cfg.GCTunerThreshold)
	if memLimitBytes == 0 {
		gcThresholdBytes = uint64(float64(totalMem) * cfg.GCTunerThreshold)
	}
	gcTriggerRatio = cfg.ServerMemoryLimitGCTrigger
	gcTriggerBytes = uint64(float64(memLimitBytes) * gcTriggerRatio)
	return
}
```

Then call it from both InitGCTuner and UpdateIfNeeded.
218-227: MemoryLimitGCTriggerBytes is never used downstream, so the inconsistency is a code-cleanliness issue, not a functional bug.
While the inconsistency is real (GCThresholdBytes has a fallback to totalMem when MemoryLimitBytes == 0 at lines 223-225, but MemoryLimitGCTriggerBytes is computed as 0 * ratio = 0 with no fallback at line 227), the field itself is never read anywhere in the codebase. The actual memory limit tuning uses MemoryLimitGCTriggerRatio (the ratio) and ServerMemoryLimit directly; GlobalMemoryLimitTuner.calcMemoryLimit() has its own fallback logic that returns math.MaxInt64 when the computed limit is 0.
The same pattern exists in UpdateIfNeeded at line 259. Since MemoryLimitGCTriggerBytes is private and unused, consider removing it entirely or adding a fallback for consistency.
/ping @lhy1024 |
```go
if threshold > 0 {
	t1 := newTuner(threshold)
	globalTuner.CompareAndSwap(nil, &t1)
}
```
threshold == 0 is fixed now, but the concurrent first-time initialization case is still unsafe. If two goroutines call Tuning(threshold > 0) while globalTuner is nil, both of them will create newTuner(threshold) before the CAS. Since newTuner() already starts the finalizer loop, the loser of the CAS leaks a running tuner that keeps participating in tuning() on later GCs once EnableGOGCTuner is enabled.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pkg/gctuner/tuner.go`:
- Around line 68-88: Concurrent calls to Tuning can allocate a newTuner whose
finalizer loop starts immediately and gets leaked if the goroutine loses the
CompareAndSwap on globalTuner; modify Tuning so that newTuner does not start its
finalizer until after globalTuner.CompareAndSwap(nil, &t1) returns true, or if
you must create the tuner before the CAS then call (*t1).stop() (or t1.stop())
when CompareAndSwap fails; specifically change the flow around
newTuner(threshold) + globalTuner.CompareAndSwap(nil, &t1) to either defer
starting the finalizer inside newTuner until after CAS success or invoke the
tuner's stop method on the loser path so no running finalizer is left behind,
and ensure code uses the same symbols Tuning, globalTuner, newTuner,
CompareAndSwap, and stop/setThreshold when implementing this fix.
In `@pkg/mcs/scheduling/server/config/config.go`:
- Around line 302-305: PersistConfig.setServerConfig currently stores the
incoming cfg directly, bypassing validation/clamping; call cfg.Adjust(nil) (or
equivalent clamp-only variant) before storing to ensure ServerMemoryLimit and
other fields are validated and defaults applied so downstream users like
GetGCTunerConfig() receive sane values; update PersistConfig.setServerConfig to
run ServerConfig.Adjust() on the provided cfg and then
o.serverConfig.Store(cfg).
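The adjust-before-store shape can be sketched with toy types; the field names and clamp bounds below are assumptions for illustration, not the PR's actual ServerConfig validation:

```go
package main

// serverConfig stands in for the PR's ServerConfig; the bound is assumed.
type serverConfig struct {
	ServerMemoryLimit float64 // fraction of total memory; 0 disables the limit
}

// adjust clamps out-of-range values, standing in for ServerConfig.Adjust.
func (c *serverConfig) adjust() {
	if c.ServerMemoryLimit < 0 || c.ServerMemoryLimit > 1 {
		c.ServerMemoryLimit = 0
	}
}

type persistConfig struct {
	stored *serverConfig
}

// setServerConfig runs the clamp before storing, so downstream readers
// (e.g. a GetGCTunerConfig accessor) always observe sane values.
func (p *persistConfig) setServerConfig(cfg *serverConfig) {
	cfg.adjust()
	p.stored = cfg
}
```

Validating at the single write point keeps every reader free of defensive re-checks.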
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: c98acc6e-702e-47e1-a616-f5cf766226be
📒 Files selected for processing (3)
- pkg/gctuner/tuner.go
- pkg/mcs/scheduling/server/config/config.go
- pkg/mcs/scheduling/server/server.go
Signed-off-by: tongjian <1045931706@qq.com>
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pkg/schedule/schedulers/balance_range.go`:
- Around line 98-102: The validation is checking the uninitialized field
job.Engine instead of the parsed engine value, causing valid requests to be
rejected; change the conditional to compare the local variable engine against
core.EngineTiFlash and core.EngineTiKV (validate engine first), return the
bad-request via handler.rd.JSON if engine is invalid, and only then assign
job.Engine = engine so job.Engine is set after successful validation;
references: job.Engine, engine, core.EngineTiFlash, core.EngineTiKV,
handler.rd.JSON.
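The validate-first, assign-after flow reads roughly like this sketch. The engine constants mirror core.EngineTiKV/core.EngineTiFlash; the job type and the error return (in place of handler.rd.JSON) are simplified stand-ins:

```go
package main

import "fmt"

// Engine names mirroring core.EngineTiKV / core.EngineTiFlash.
const (
	engineTiKV    = "tikv"
	engineTiFlash = "tiflash"
)

type balanceRangeJob struct {
	Engine string
}

// setJobEngine validates the parsed local engine value first and only
// assigns job.Engine after validation succeeds, mirroring the suggested
// fix (the original bug checked the uninitialized job.Engine field).
func setJobEngine(job *balanceRangeJob, engine string) error {
	if engine != engineTiKV && engine != engineTiFlash {
		return fmt.Errorf("invalid engine %q", engine)
	}
	job.Engine = engine
	return nil
}
```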
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: d001e565-3b90-49ad-89ab-da46e7d0304e
📒 Files selected for processing (5)
- pkg/gctuner/tuner.go
- pkg/mcs/scheduling/server/config/config.go
- pkg/mcs/scheduling/server/server.go
- pkg/schedule/schedulers/balance_range.go
- pkg/schedule/schedulers/balance_range_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
- pkg/mcs/scheduling/server/server.go
Signed-off-by: tongjian <1045931706@qq.com>
/ping @lhy1024

/retest
pkg/gctuner/tuner.go (outdated)

```go
if threshold <= 0 {
	(*t).stop()
	globalTuner.CompareAndSwap(t, nil)
	for range 3 {
```
Tuning() now publishes &t1 into globalTuner before calling t1.start(). That means if another goroutine's concurrent Tuning(0) wins the CAS-to-nil path, it immediately calls (*t).stop(), which dereferences t.finalizer and panics.
I reproduced this locally with a small unit test by placing a newTunerWithoutStart() instance into globalTuner.
Also, even if stop() were made nil-safe, the publish-before-start ordering would still be problematic: a concurrent Tuning(0) landing between publish and start would leave behind a detached started tuner again. So the root issue is that the tuner is visible to other goroutines before it is fully initialized.
```go
func TestTuningDisableHandlesPublishedButNotStartedTuner(t *testing.T) {
	re := require.New(t)
	old := globalTuner.Load()
	t.Cleanup(func() {
		globalTuner.Store(old)
	})
	tn := newTunerWithoutStart(1)
	globalTuner.Store(&tn)
	re.NotPanics(func() {
		Tuning(0)
	})
	re.Nil(globalTuner.Load())
}
```
Yes, I will add a lock to keep it atomic.
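A lock-based Tuning along those lines might look like this minimal sketch. The toyTuner type and all names below are stand-ins for the package's real tuner; in the real code the finalizer loop would be started under the same mutex, after the tuner is published:

```go
package main

import "sync"

// toyTuner is an illustrative stand-in for the package's tuner.
type toyTuner struct {
	threshold uint64
	stopped   bool
}

func (t *toyTuner) stop()                  { t.stopped = true }
func (t *toyTuner) setThreshold(th uint64) { t.threshold = th }

var (
	tuningMu  sync.Mutex
	globalToy *toyTuner
)

// tuning makes create/stop/update atomic under one mutex, so no
// half-initialized tuner is ever visible and CAS losers cannot leak a
// running finalizer loop.
func tuning(threshold uint64) {
	tuningMu.Lock()
	defer tuningMu.Unlock()
	switch {
	case globalToy == nil && threshold > 0:
		// In the real code, the finalizer loop would start here, safely
		// under the lock, once this tuner is the published one.
		globalToy = &toyTuner{threshold: threshold}
	case globalToy != nil && threshold == 0:
		globalToy.stop()
		globalToy = nil
	case globalToy != nil:
		globalToy.setThreshold(threshold)
	}
}
```

The trade-off versus CAS is a short critical section on every call, which is negligible for a function invoked only on config changes.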
Signed-off-by: tongjian <1045931706@qq.com>
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by: lhy1024, okJiang. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files. Approvers can indicate their approval by writing

/retest
What problem does this PR solve?
Issue Number: Ref #10213
What is changed and how does it work?
Check List
Tests
Code changes
Side effects
Related changes
- pingcap/docs / pingcap/docs-cn:
- pingcap/tiup:

Release note
Summary by CodeRabbit
New Features
Behavior
Tests