resourcemanager: add metadata watcher scaffold#10259
resourcemanager: add metadata watcher scaffold#10259okJiang wants to merge 10 commits intotikv:masterfrom
Conversation
Signed-off-by: okjiang <819421878@qq.com>
|
Skipping CI for Draft Pull Request. |
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughAdds a metadata watcher that boots resource-group metadata into Manager; introduces parsing/validation and upsert/delete helpers for resource groups, transactional deletion paths, default-group lifecycle updates, service-limit loading, and tests covering watcher parsing, lifecycle, and cache/state behavior. Changes
Sequence DiagramsequenceDiagram
participant ETCD as etcd
participant WATCHER as MetadataWatcher
participant MGR as Manager
participant KRG as KeyspaceResourceGroupManager
participant STORE as Storage
participant CACHE as InMemoryCache
ETCD->>WATCHER: Watch Event (Put/Delete) on resource_group/...
WATCHER->>WATCHER: parseResourceGroupWatchPath(path)
alt Put -> Controller Config
WATCHER->>MGR: applyControllerConfigFromRaw(raw)
MGR->>STORE: SaveControllerConfig(...)
MGR->>MGR: clone & update controller config
else Put -> Settings
WATCHER->>MGR: getOrCreateKeyspaceResourceGroupManager(keyspaceID)
MGR->>KRG: ensure manager exists
WATCHER->>KRG: upsertResourceGroupFromRaw(name, raw)
KRG->>CACHE: insert/update ResourceGroup
KRG->>KRG: syncBurstabilityWithServiceLimit(group)
else Put -> States
WATCHER->>MGR: find keyspace manager
WATCHER->>KRG: applyResourceGroupStatesFromRaw(name, rawState)
KRG->>CACHE: update RU tokens/consumption
else Put/Delete -> Service Limits
WATCHER->>MGR: applyServiceLimitFromRaw(keyspaceID, raw)
MGR->>MGR: set/clear in-memory service limit
else Delete -> Settings/States
WATCHER->>MGR: find keyspace manager
WATCHER->>KRG: deleteResourceGroupFromCache(name) (unless reserved default)
KRG->>CACHE: remove group and RU tracker
KRG->>STORE: (txn) delete persisted setting & states
end
CACHE-->>MGR: state updated
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related PRs
Suggested labels
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
okJiang
left a comment
There was a problem hiding this comment.
Two blocking issues before this watcher path is safe to enable:
- watcher mode only keeps group settings/states live; controller config and keyspace service-limit metadata under
resource_group/are still one-shot loaded at startup and will drift after PD-side updates. - the watch loop is started on the callback context before
m.cancelexists, soManager.close()and init-failure cleanup cannot stop it cleanly.
Open question: is controller/service-limit watch intentionally deferred to a follow-up PR? If so, the scope should be called out explicitly in the PR description/tests.
There was a problem hiding this comment.
Actionable comments posted: 3
🧹 Nitpick comments (1)
pkg/mcs/resourcemanager/server/keyspace_manager.go (1)
109-132: Validate the key name against the protobuf name before upserting.The watcher path passes the authoritative group name from the etcd key, but this method looks up and inserts by
group.Namefrom the payload instead. If those ever diverge, PUTs update one cache entry while DELETEs later target another, leaving a ghost group in memory.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/mcs/resourcemanager/server/keyspace_manager.go` around lines 109 - 132, upsertResourceGroupFromRaw currently trusts group.Name from the protobuf payload instead of the authoritative key name (name param), so add a validation immediately after proto.Unmarshal: compare group.Name to the provided name param and if they differ log an error (include key name, payload name, keyspace id and rawValue) and return a non-nil error instead of inserting; this prevents inserting a cache entry under a different name than the etcd key and avoids ghost groups—use the existing function name upsertResourceGroupFromRaw, the local variable group and the name parameter to locate where to add the check.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pkg/mcs/resourcemanager/server/keyspace_manager.go`:
- Around line 135-139: Legacy API path in deleteResourceGroup currently removes
only groups but not groupRUTrackers, letting recreated groups inherit stale RU;
update deleteResourceGroup to call the existing helper
deleteResourceGroupFromCache (which atomically deletes both groups and
groupRUTrackers under the lock) instead of directly deleting krgm.groups, and
ensure no duplicated deletions remain; reference deleteResourceGroup,
deleteResourceGroupFromCache, groups, groupRUTrackers and cleanupStaleRUTrackers
when making the change.
In `@pkg/mcs/resourcemanager/server/manager.go`:
- Around line 92-95: NewManager currently checks for the metadataWatcherProvider
interface to set enableMetadataWatcher, but the concrete *Server type does not
implement EnableResourceGroupMetadataWatcher(), so the flag stays false and
watcher bootstrap in Init() never runs; implement the method on the concrete
Server (add a receiver method Server.EnableResourceGroupMetadataWatcher() bool
that returns the appropriate config/flag), or wire the decision into NewManager
by reading the Server's existing config field and passing that boolean
explicitly, ensuring metadataWatcherProvider
(EnableResourceGroupMetadataWatcher) is satisfied so enableMetadataWatcher is
true when intended (update references in NewManager, Server, and Init to use the
new method or passed boolean).
- Around line 234-237: The watcher is being started with the parent ctx before
m.cancel is set, which can leak the watch goroutine on failure or Close; create
and assign a cancelable context on the Manager first (e.g. call
context.WithCancel on the incoming ctx, set m.ctx and m.cancel) before calling
initializeMetadataWatcher(ctx), ensure initializeMetadataWatcher is passed the
Manager's cancelable context (or uses m.ctx) so the started goroutine can be
canceled by m.cancel and paired with m.wg/Manager.close().
---
Nitpick comments:
In `@pkg/mcs/resourcemanager/server/keyspace_manager.go`:
- Around line 109-132: upsertResourceGroupFromRaw currently trusts group.Name
from the protobuf payload instead of the authoritative key name (name param), so
add a validation immediately after proto.Unmarshal: compare group.Name to the
provided name param and if they differ log an error (include key name, payload
name, keyspace id and rawValue) and return a non-nil error instead of inserting;
this prevents inserting a cache entry under a different name than the etcd key
and avoids ghost groups—use the existing function name
upsertResourceGroupFromRaw, the local variable group and the name parameter to
locate where to add the check.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 1b217c21-9e8e-4096-b57e-c34e514b58f1
📒 Files selected for processing (5)
pkg/mcs/resourcemanager/server/keyspace_manager.gopkg/mcs/resourcemanager/server/keyspace_manager_test.gopkg/mcs/resourcemanager/server/manager.gopkg/mcs/resourcemanager/server/metadata_watcher.gopkg/mcs/resourcemanager/server/metadata_watcher_test.go
Keep controller config and keyspace service limits in sync under watcher mode, create the manager-owned cancelable context before starting the watcher, reuse cache cleanup for legacy deletes, and reject mismatched watcher payload names. Verified with: go test ./pkg/mcs/resourcemanager/server -count=1 Signed-off-by: okjiang <819421878@qq.com>
There was a problem hiding this comment.
🧹 Nitpick comments (2)
pkg/mcs/resourcemanager/server/metadata_watcher.go (2)
215-242: Consider adding debug logging for skipped delete events.When
handleMetadataWatchDeleteskips processing due to missing keyspace manager (lines 223-224, 237-238), there's no logging to aid debugging. While this may be intentional to avoid log spam, adding debug-level logging could help troubleshoot issues in production.📝 Optional: Add debug logging for skipped deletes
case resourceGroupWatchEntryServiceLimit: krgm := m.getKeyspaceResourceGroupManager(target.keyspaceID) if krgm == nil { + log.Debug("skip deleting service limit without corresponding manager", + zap.Uint32("keyspace-id", target.keyspaceID)) return nil }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/mcs/resourcemanager/server/metadata_watcher.go` around lines 215 - 242, In handleMetadataWatchDelete, add debug-level logs when the keyspace resource group manager is missing so skipped deletes are visible: in the first nil-check after parseResourceGroupWatchPath (inside the resourceGroupWatchEntryServiceLimit case where getKeyspaceResourceGroupManager returns nil) and in the second nil-check before deleteResourceGroupFromCache, call the logger (e.g., m.logger.Debug or processLogger.Debug) to record the key, target.keyspaceID, target.groupName and the entryType so operators can trace skipped events without raising log noise; keep the log at debug level and include concise context like "skipping delete: no keyspace manager" plus the identifiers.
164-194: Potential redundancy:loadServiceLimits()called after watcher initial load.The watcher's
WaitLoad()processes all existing keys including service limits viaapplyServiceLimitFromRaw. ThenloadServiceLimits()is called again at line 193, which reloads the same data. While both paths usesetServiceLimitFromStorage(idempotent), this is redundant work during initialization.Consider removing line 193 if service limits are already processed by the watcher's initial load, or document why the explicit reload is necessary (e.g., ordering guarantees).
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/mcs/resourcemanager/server/metadata_watcher.go` around lines 164 - 194, The initializeMetadataWatcher function redundantly calls loadServiceLimits() after the watcher.WaitLoad() because the watcher's initial load already invokes applyServiceLimitFromRaw which calls setServiceLimitFromStorage; remove the explicit loadServiceLimits() call at the end of initializeMetadataWatcher (or if you need ordering guarantees, replace it with a short comment explaining why the explicit reload is required) so that applyServiceLimitFromRaw via the newMetadataLoopWatcher/WaitLoad path is the sole initialization path for service limits.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@pkg/mcs/resourcemanager/server/metadata_watcher.go`:
- Around line 215-242: In handleMetadataWatchDelete, add debug-level logs when
the keyspace resource group manager is missing so skipped deletes are visible:
in the first nil-check after parseResourceGroupWatchPath (inside the
resourceGroupWatchEntryServiceLimit case where getKeyspaceResourceGroupManager
returns nil) and in the second nil-check before deleteResourceGroupFromCache,
call the logger (e.g., m.logger.Debug or processLogger.Debug) to record the key,
target.keyspaceID, target.groupName and the entryType so operators can trace
skipped events without raising log noise; keep the log at debug level and
include concise context like "skipping delete: no keyspace manager" plus the
identifiers.
- Around line 164-194: The initializeMetadataWatcher function redundantly calls
loadServiceLimits() after the watcher.WaitLoad() because the watcher's initial
load already invokes applyServiceLimitFromRaw which calls
setServiceLimitFromStorage; remove the explicit loadServiceLimits() call at the
end of initializeMetadataWatcher (or if you need ordering guarantees, replace it
with a short comment explaining why the explicit reload is required) so that
applyServiceLimitFromRaw via the newMetadataLoopWatcher/WaitLoad path is the
sole initialization path for service limits.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: f04ad21f-8d12-4429-85c8-fbee3131ee0c
📒 Files selected for processing (6)
pkg/mcs/resourcemanager/server/keyspace_manager.gopkg/mcs/resourcemanager/server/keyspace_manager_test.gopkg/mcs/resourcemanager/server/manager.gopkg/mcs/resourcemanager/server/manager_test.gopkg/mcs/resourcemanager/server/metadata_watcher.gopkg/mcs/resourcemanager/server/metadata_watcher_test.go
🚧 Files skipped from review as they are similar to previous changes (2)
- pkg/mcs/resourcemanager/server/keyspace_manager_test.go
- pkg/mcs/resourcemanager/server/keyspace_manager.go
Reorder imports for gci and rename unused test watcher-factory parameters to satisfy revive. Verified with: make static Signed-off-by: okjiang <819421878@qq.com>
|
/retest |
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## master #10259 +/- ##
==========================================
+ Coverage 78.80% 78.87% +0.07%
==========================================
Files 523 528 +5
Lines 70529 71191 +662
==========================================
+ Hits 55580 56153 +573
- Misses 10955 11016 +61
- Partials 3994 4022 +28
Flags with carried forward coverage won't be shown. Click here to find out more. 🚀 New features to boost your workflow:
|
Signed-off-by: okjiang <819421878@qq.com>
Signed-off-by: okjiang <819421878@qq.com>
Signed-off-by: okjiang <819421878@qq.com>
There was a problem hiding this comment.
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pkg/mcs/resourcemanager/server/keyspace_manager.go`:
- Around line 259-265: Currently DeleteResourceGroupStates(krgm.keyspaceID,
name) is called before DeleteResourceGroupSetting(krgm.keyspaceID, name) which
can leave state removed while settings remain; change this to perform both
deletes atomically using a single etcd transaction (combine
DeleteResourceGroupSetting and DeleteResourceGroupStates into one txn) or, if a
txn is not feasible, invert the order to call DeleteResourceGroupSetting first
and then call DeleteResourceGroupStates as a best-effort cleanup, and keep
deleteResourceGroupFromCache(name) only after the transactional/delete-success
path to ensure cache stays consistent with storage.
In `@pkg/mcs/resourcemanager/server/manager.go`:
- Around line 241-250: The full-load branch that calls
loadKeyspaceResourceGroups() can create ghost groups because
addResourceGroupFromRaw() inserts by group.Name without validating the etcd
key/payload name guard; ensure the legacy bootstrap path performs the same
validation as watcher mode by either routing the full-load path to use
upsertResourceGroupFromRaw() instead of addResourceGroupFromRaw(), or add the
same key/payload name validation into addResourceGroupFromRaw() (the functions
to touch are loadKeyspaceResourceGroups, addResourceGroupFromRaw, and
upsertResourceGroupFromRaw); update loadKeyspaceResourceGroups() to call
upsertResourceGroupFromRaw() or add the name-check logic to
addResourceGroupFromRaw() so mismatched records are rejected during legacy
bootstrap.
- Around line 314-324: applyControllerConfigFromRaw currently swaps
m.controllerConfig under m.Lock but other goroutines (backgroundMetricsFlush,
UpdateControllerConfigItem) access or mutate the same object directly causing
races; change to publish a snapshot atomically: deep-copy or construct a new
ControllerConfig, assign it to m.controllerConfig only while holding m.Lock (or
use an atomic.Value field) and ensure readers (backgroundMetricsFlush) and
mutators (UpdateControllerConfigItem) always access controller config via the
same locked accessor method (e.g.,
GetControllerConfigLocked/GetControllerConfig) or via atomic load so they never
mutate the shared instance in place; also update UpdateControllerConfigItem to
operate on a copied config and replace the published config under the lock.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 06b1c729-772b-4246-85d9-9f4480870f2d
📒 Files selected for processing (7)
pkg/mcs/resourcemanager/server/keyspace_manager.gopkg/mcs/resourcemanager/server/keyspace_manager_test.gopkg/mcs/resourcemanager/server/manager.gopkg/mcs/resourcemanager/server/manager_test.gopkg/mcs/resourcemanager/server/metadata_watcher.gopkg/mcs/resourcemanager/server/metadata_watcher_test.gopkg/mcs/resourcemanager/server/server.go
🚧 Files skipped from review as they are similar to previous changes (1)
- pkg/mcs/resourcemanager/server/keyspace_manager_test.go
|
Also addressed the CodeRabbit review-body notes from the March 6 review:
Verification:
|
Address the remaining metadata watcher review comments by reapplying existing service limits when groups enter the cache, restoring the default group with fully initialized runtime fields, and extending the watcher regression coverage. The package tests and server lint checks pass with these changes. Signed-off-by: okjiang <819421878@qq.com>
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pkg/mcs/resourcemanager/server/keyspace_manager.go`:
- Around line 120-130: addResourceGroupFromRaw (and the new-group path in
upsertResourceGroupFromRaw) currently skips the validations performed by
addResourceGroup, allowing invalid groups (e.g., invalid name or Priority >
maxPriority) into krgm.groups and causing downstream panics; after
parseResourceGroupFromRaw returns the proto group, run the same validation logic
used by addResourceGroup (validate group.Name format and that group.Priority is
within allowed bounds, or reuse addResourceGroup's validation helper if
available) and return an error if validation fails before calling
FromProtoResourceGroup() and inserting into krgm.groups; alternatively, call
addResourceGroup with the validated proto to centralize checks and avoid
duplicating logic, ensuring getPriorityQueues() cannot encounter an out-of-range
priority.
In `@pkg/mcs/resourcemanager/server/manager.go`:
- Around line 417-427: The code currently updates m.controllerConfig and returns
nil even when m.storage.SaveControllerConfig(controllerConfig) fails; change the
control flow so that you do not assign m.controllerConfig unless
SaveControllerConfig returns nil, and if SaveControllerConfig returns an error,
return that error up the stack (use errors.Is/As where appropriate for sentinel
checks). Concretely: within the updated branch call
m.storage.SaveControllerConfig(controllerConfig) first, if it returns an error
log it and return the error (making sure to release the lock before returning),
and only on success set m.controllerConfig and continue to log the update; keep
references to SaveControllerConfig and m.controllerConfig to locate the changes.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: b80cdb91-0be4-4a9a-9e60-ab025a9b79c7
📒 Files selected for processing (6)
pkg/mcs/resourcemanager/server/keyspace_manager.gopkg/mcs/resourcemanager/server/keyspace_manager_test.gopkg/mcs/resourcemanager/server/manager.gopkg/mcs/resourcemanager/server/manager_test.gopkg/mcs/resourcemanager/server/metadata_watcher.gopkg/mcs/resourcemanager/server/metadata_watcher_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
- pkg/mcs/resourcemanager/server/manager_test.go
Reject invalid raw resource-group payloads before inserting them into\nthe in-memory cache and keep controller config unchanged when\npersistence fails.\n\nAdd regression tests for both paths. Signed-off-by: okjiang <819421878@qq.com>
Signed-off-by: okjiang <819421878@qq.com>
Signed-off-by: okjiang <819421878@qq.com>
|
@okJiang: The following tests failed, say
Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
What problem does this PR solve?
The write-in-PD plan needs ResourceManager to consume resource-group metadata from etcd instead of owning metadata writes directly. This PR adds the metadata watcher bootstrap path so RM can stay in sync with PD-driven updates for resource-group settings, runtime states, controller config, and keyspace service limits.
Issue Number: ref #9737
What is changed and how does it work?
This PR adds the ResourceManager metadata watcher scaffold in
pkg/mcs/resourcemanager/serverand wires it intoManager.Init()behindEnableResourceGroupMetadataWatcher(). The watcher now followsresource_group/changes for controller config, resource-group settings, runtime states, and keyspace service limits, while the manager keeps the legacy full-load path as the fallback bootstrap mode.The rest of the branch hardens that watcher path so it is safe to enable: it fixes watcher lifecycle cleanup, publishes controller-config snapshots safely, keeps key validation and delete behavior consistent across watcher and legacy bootstrap paths, and makes service-limit replay and default-group restoration behave correctly in memory.
Validation:
make gotest GOTEST_ARGS='./pkg/mcs/resourcemanager/server -count=1'./.tools/bin/golangci-lint run --verbose ./pkg/mcs/resourcemanager/server/... --allow-parallel-runnersmake staticCheck List
Tests
Side effects
Release note
Summary by CodeRabbit
New Features
Bug Fixes
Tests