Skip to content

resourcemanager: add metadata watcher scaffold#10259

Open
okJiang wants to merge 10 commits intotikv:masterfrom
okJiang:codex/rm-pr4-metadata-watcher-scaffold
Open

resourcemanager: add metadata watcher scaffold#10259
okJiang wants to merge 10 commits intotikv:masterfrom
okJiang:codex/rm-pr4-metadata-watcher-scaffold

Conversation

@okJiang
Copy link
Member

@okJiang okJiang commented Feb 27, 2026

What problem does this PR solve?

The write-in-PD plan needs ResourceManager to consume resource-group metadata from etcd instead of owning metadata writes directly. This PR adds the metadata watcher bootstrap path so RM can stay in sync with PD-driven updates for resource-group settings, runtime states, controller config, and keyspace service limits.

Issue Number: ref #9737

What is changed and how does it work?

This PR adds the ResourceManager metadata watcher scaffold in pkg/mcs/resourcemanager/server and wires it into Manager.Init() behind EnableResourceGroupMetadataWatcher(). The watcher now follows resource_group/ changes for controller config, resource-group settings, runtime states, and keyspace service limits, while the manager keeps the legacy full-load path as the fallback bootstrap mode.

The rest of the branch hardens that watcher path so it is safe to enable: it fixes watcher lifecycle cleanup, publishes controller-config snapshots safely, keeps key validation and delete behavior consistent across watcher and legacy bootstrap paths, and makes service-limit replay and default-group restoration behave correctly in memory.

Validation:

  • make gotest GOTEST_ARGS='./pkg/mcs/resourcemanager/server -count=1'
  • ./.tools/bin/golangci-lint run --verbose ./pkg/mcs/resourcemanager/server/... --allow-parallel-runners
  • make static
Add the ResourceManager metadata watcher bootstrap path and watch
resource-group settings, states, controller config, and keyspace
service limits from etcd. Create the manager-owned cancelable context
before watcher startup, publish controller-config snapshots safely,
validate etcd key and payload-name consistency, and keep delete and
default-restore paths consistent with the in-memory cache and
service-limit state.

Check List

Tests

  • Unit test

Side effects

  • Increased code complexity

Release note

Add ResourceManager metadata watcher bootstrap so RM can follow
resource-group settings, states, controller config, and keyspace
service limits from etcd with safer watcher lifecycle and cache
consistency handling.

Summary by CodeRabbit

  • New Features

    • Metadata watcher now synchronizes resource group configurations and service limits from centralized storage
    • Default resource groups can be restored and reinitialized
  • Bug Fixes

    • Resource group deletion now uses transactional operations for improved consistency
  • Tests

    • Added comprehensive test coverage for metadata watcher and resource group operations

Signed-off-by: okjiang <819421878@qq.com>
@ti-chi-bot
Copy link
Contributor

ti-chi-bot bot commented Feb 27, 2026

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@ti-chi-bot ti-chi-bot bot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. release-note-none Denotes a PR that doesn't merit a release note. dco-signoff: yes Indicates the PR's author has signed the dco. labels Feb 27, 2026
@ti-chi-bot
Copy link
Contributor

ti-chi-bot bot commented Feb 27, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign qiuyesuifeng for approval. For more information see the Code Review Process.
Please ensure that each of them provides their approval before proceeding.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ti-chi-bot ti-chi-bot bot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label Feb 27, 2026
@coderabbitai
Copy link

coderabbitai bot commented Feb 27, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

Adds a metadata watcher that boots resource-group metadata into Manager; introduces parsing/validation and upsert/delete helpers for resource groups, transactional deletion paths, default-group lifecycle updates, service-limit loading, and tests covering watcher parsing, lifecycle, and cache/state behavior.

Changes

Cohort / File(s) Summary
Keyspace Resource Group Manager
pkg/mcs/resourcemanager/server/keyspace_manager.go, pkg/mcs/resourcemanager/server/keyspace_manager_test.go
Adds parsing/validation (parseResourceGroupFromRaw, validateResourceGroupProto), upsertResourceGroupFromRaw, deleteResourceGroupFromCache, default-group creation/restoration (newDefaultResourceGroup, restoreDefaultResourceGroupFromReserved), burstability sync, transactional-delete handling, and tests exercising upsert/delete and RU tracker behavior.
Manager Initialization & Config
pkg/mcs/resourcemanager/server/manager.go
Detects EnableResourceGroupMetadataWatcher() on server, adds enableMetadataWatcher flag, conditional Init path (metadata watcher vs legacy load), cancellable context handling, loadServiceLimits, and raw-apply helpers for watcher events.
Metadata Watcher Implementation
pkg/mcs/resourcemanager/server/metadata_watcher.go, pkg/mcs/resourcemanager/server/metadata_watcher_test.go
New watcher: path parsing/routing for controller config, settings, states, service limits; watch event handlers calling Manager/KRG apply methods; startup coordination (start/wait); tests for path parsing, put/delete flows, service-limit interactions, and default-group restoration.
Manager Tests / Watcher Lifecycle Mocks
pkg/mcs/resourcemanager/server/manager_test.go
Adds test scaffolding, fake metadata watcher, storage wrappers, and tests for watcher enablement, Init cancellation, Close behavior, controller-config snapshots, and mismatched-payload rejection.
Server Interface
pkg/mcs/resourcemanager/server/server.go
Adds EnableResourceGroupMetadataWatcher() bool on Server to advertise watcher bootstrap support.

Sequence Diagram

sequenceDiagram
    participant ETCD as etcd
    participant WATCHER as MetadataWatcher
    participant MGR as Manager
    participant KRG as KeyspaceResourceGroupManager
    participant STORE as Storage
    participant CACHE as InMemoryCache

    ETCD->>WATCHER: Watch Event (Put/Delete) on resource_group/...
    WATCHER->>WATCHER: parseResourceGroupWatchPath(path)
    alt Put -> Controller Config
        WATCHER->>MGR: applyControllerConfigFromRaw(raw)
        MGR->>STORE: SaveControllerConfig(...)
        MGR->>MGR: clone & update controller config
    else Put -> Settings
        WATCHER->>MGR: getOrCreateKeyspaceResourceGroupManager(keyspaceID)
        MGR->>KRG: ensure manager exists
        WATCHER->>KRG: upsertResourceGroupFromRaw(name, raw)
        KRG->>CACHE: insert/update ResourceGroup
        KRG->>KRG: syncBurstabilityWithServiceLimit(group)
    else Put -> States
        WATCHER->>MGR: find keyspace manager
        WATCHER->>KRG: applyResourceGroupStatesFromRaw(name, rawState)
        KRG->>CACHE: update RU tokens/consumption
    else Put/Delete -> Service Limits
        WATCHER->>MGR: applyServiceLimitFromRaw(keyspaceID, raw)
        MGR->>MGR: set/clear in-memory service limit
    else Delete -> Settings/States
        WATCHER->>MGR: find keyspace manager
        WATCHER->>KRG: deleteResourceGroupFromCache(name) (unless reserved default)
        KRG->>CACHE: remove group and RU tracker
        KRG->>STORE: (txn) delete persisted setting & states
    end
    CACHE-->>MGR: state updated
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

Suggested labels

lgtm, approved

Suggested reviewers

  • rleungx
  • bufferflies
  • lhy1024

Poem

🐰 I sniff the keys where metadata streams,
I hop to parse the watcher’s dreams,
Upsert and delete, tokens kept right,
Defaults restored through day and night,
A tiny rabbit guards your resource beams.

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 2.94% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Title check ✅ Passed The title 'resourcemanager: add metadata watcher scaffold' clearly identifies the main change: adding a metadata watcher component to the ResourceManager module.
Description check ✅ Passed The PR description provides comprehensive details including problem statement, technical implementation, validation steps, and release notes. It covers the problem domain, design approach, and testing methodology thoroughly.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@okJiang okJiang marked this pull request as ready for review March 6, 2026 03:46
@ti-chi-bot ti-chi-bot bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 6, 2026
Copy link
Member Author

@okJiang okJiang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two blocking issues before this watcher path is safe to enable:

  1. watcher mode only keeps group settings/states live; controller config and keyspace service-limit metadata under resource_group/ are still one-shot loaded at startup and will drift after PD-side updates.
  2. the watch loop is started on the callback context before m.cancel exists, so Manager.close() and init-failure cleanup cannot stop it cleanly.

Open question: is controller/service-limit watch intentionally deferred to a follow-up PR? If so, the scope should be called out explicitly in the PR description/tests.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🧹 Nitpick comments (1)
pkg/mcs/resourcemanager/server/keyspace_manager.go (1)

109-132: Validate the key name against the protobuf name before upserting.

The watcher path passes the authoritative group name from the etcd key, but this method looks up and inserts by group.Name from the payload instead. If those ever diverge, PUTs update one cache entry while DELETEs later target another, leaving a ghost group in memory.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/mcs/resourcemanager/server/keyspace_manager.go` around lines 109 - 132,
upsertResourceGroupFromRaw currently trusts group.Name from the protobuf payload
instead of the authoritative key name (name param), so add a validation
immediately after proto.Unmarshal: compare group.Name to the provided name param
and if they differ log an error (include key name, payload name, keyspace id and
rawValue) and return a non-nil error instead of inserting; this prevents
inserting a cache entry under a different name than the etcd key and avoids
ghost groups—use the existing function name upsertResourceGroupFromRaw, the
local variable group and the name parameter to locate where to add the check.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pkg/mcs/resourcemanager/server/keyspace_manager.go`:
- Around line 135-139: Legacy API path in deleteResourceGroup currently removes
only groups but not groupRUTrackers, letting recreated groups inherit stale RU;
update deleteResourceGroup to call the existing helper
deleteResourceGroupFromCache (which atomically deletes both groups and
groupRUTrackers under the lock) instead of directly deleting krgm.groups, and
ensure no duplicated deletions remain; reference deleteResourceGroup,
deleteResourceGroupFromCache, groups, groupRUTrackers and cleanupStaleRUTrackers
when making the change.

In `@pkg/mcs/resourcemanager/server/manager.go`:
- Around line 92-95: NewManager currently checks for the metadataWatcherProvider
interface to set enableMetadataWatcher, but the concrete *Server type does not
implement EnableResourceGroupMetadataWatcher(), so the flag stays false and
watcher bootstrap in Init() never runs; implement the method on the concrete
Server (add a receiver method Server.EnableResourceGroupMetadataWatcher() bool
that returns the appropriate config/flag), or wire the decision into NewManager
by reading the Server's existing config field and passing that boolean
explicitly, ensuring metadataWatcherProvider
(EnableResourceGroupMetadataWatcher) is satisfied so enableMetadataWatcher is
true when intended (update references in NewManager, Server, and Init to use the
new method or passed boolean).
- Around line 234-237: The watcher is being started with the parent ctx before
m.cancel is set, which can leak the watch goroutine on failure or Close; create
and assign a cancelable context on the Manager first (e.g. call
context.WithCancel on the incoming ctx, set m.ctx and m.cancel) before calling
initializeMetadataWatcher(ctx), ensure initializeMetadataWatcher is passed the
Manager's cancelable context (or uses m.ctx) so the started goroutine can be
canceled by m.cancel and paired with m.wg/Manager.close().

---

Nitpick comments:
In `@pkg/mcs/resourcemanager/server/keyspace_manager.go`:
- Around line 109-132: upsertResourceGroupFromRaw currently trusts group.Name
from the protobuf payload instead of the authoritative key name (name param), so
add a validation immediately after proto.Unmarshal: compare group.Name to the
provided name param and if they differ log an error (include key name, payload
name, keyspace id and rawValue) and return a non-nil error instead of inserting;
this prevents inserting a cache entry under a different name than the etcd key
and avoids ghost groups—use the existing function name
upsertResourceGroupFromRaw, the local variable group and the name parameter to
locate where to add the check.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 1b217c21-9e8e-4096-b57e-c34e514b58f1

📥 Commits

Reviewing files that changed from the base of the PR and between 5885cec and d0e1aef.

📒 Files selected for processing (5)
  • pkg/mcs/resourcemanager/server/keyspace_manager.go
  • pkg/mcs/resourcemanager/server/keyspace_manager_test.go
  • pkg/mcs/resourcemanager/server/manager.go
  • pkg/mcs/resourcemanager/server/metadata_watcher.go
  • pkg/mcs/resourcemanager/server/metadata_watcher_test.go

Keep controller config and keyspace service limits in sync under watcher
mode, create the manager-owned cancelable context before starting the
watcher, reuse cache cleanup for legacy deletes, and reject mismatched
watcher payload names.

Verified with:
go test ./pkg/mcs/resourcemanager/server -count=1

Signed-off-by: okjiang <819421878@qq.com>
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (2)
pkg/mcs/resourcemanager/server/metadata_watcher.go (2)

215-242: Consider adding debug logging for skipped delete events.

When handleMetadataWatchDelete skips processing due to missing keyspace manager (lines 223-224, 237-238), there's no logging to aid debugging. While this may be intentional to avoid log spam, adding debug-level logging could help troubleshoot issues in production.

📝 Optional: Add debug logging for skipped deletes
 	case resourceGroupWatchEntryServiceLimit:
 		krgm := m.getKeyspaceResourceGroupManager(target.keyspaceID)
 		if krgm == nil {
+			log.Debug("skip deleting service limit without corresponding manager",
+				zap.Uint32("keyspace-id", target.keyspaceID))
 			return nil
 		}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/mcs/resourcemanager/server/metadata_watcher.go` around lines 215 - 242,
In handleMetadataWatchDelete, add debug-level logs when the keyspace resource
group manager is missing so skipped deletes are visible: in the first nil-check
after parseResourceGroupWatchPath (inside the
resourceGroupWatchEntryServiceLimit case where getKeyspaceResourceGroupManager
returns nil) and in the second nil-check before deleteResourceGroupFromCache,
call the logger (e.g., m.logger.Debug or processLogger.Debug) to record the key,
target.keyspaceID, target.groupName and the entryType so operators can trace
skipped events without raising log noise; keep the log at debug level and
include concise context like "skipping delete: no keyspace manager" plus the
identifiers.

164-194: Potential redundancy: loadServiceLimits() called after watcher initial load.

The watcher's WaitLoad() processes all existing keys including service limits via applyServiceLimitFromRaw. Then loadServiceLimits() is called again at line 193, which reloads the same data. While both paths use setServiceLimitFromStorage (idempotent), this is redundant work during initialization.

Consider removing line 193 if service limits are already processed by the watcher's initial load, or document why the explicit reload is necessary (e.g., ordering guarantees).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/mcs/resourcemanager/server/metadata_watcher.go` around lines 164 - 194,
The initializeMetadataWatcher function redundantly calls loadServiceLimits()
after the watcher.WaitLoad() because the watcher's initial load already invokes
applyServiceLimitFromRaw which calls setServiceLimitFromStorage; remove the
explicit loadServiceLimits() call at the end of initializeMetadataWatcher (or if
you need ordering guarantees, replace it with a short comment explaining why the
explicit reload is required) so that applyServiceLimitFromRaw via the
newMetadataLoopWatcher/WaitLoad path is the sole initialization path for service
limits.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@pkg/mcs/resourcemanager/server/metadata_watcher.go`:
- Around line 215-242: In handleMetadataWatchDelete, add debug-level logs when
the keyspace resource group manager is missing so skipped deletes are visible:
in the first nil-check after parseResourceGroupWatchPath (inside the
resourceGroupWatchEntryServiceLimit case where getKeyspaceResourceGroupManager
returns nil) and in the second nil-check before deleteResourceGroupFromCache,
call the logger (e.g., m.logger.Debug or processLogger.Debug) to record the key,
target.keyspaceID, target.groupName and the entryType so operators can trace
skipped events without raising log noise; keep the log at debug level and
include concise context like "skipping delete: no keyspace manager" plus the
identifiers.
- Around line 164-194: The initializeMetadataWatcher function redundantly calls
loadServiceLimits() after the watcher.WaitLoad() because the watcher's initial
load already invokes applyServiceLimitFromRaw which calls
setServiceLimitFromStorage; remove the explicit loadServiceLimits() call at the
end of initializeMetadataWatcher (or if you need ordering guarantees, replace it
with a short comment explaining why the explicit reload is required) so that
applyServiceLimitFromRaw via the newMetadataLoopWatcher/WaitLoad path is the
sole initialization path for service limits.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: f04ad21f-8d12-4429-85c8-fbee3131ee0c

📥 Commits

Reviewing files that changed from the base of the PR and between d0e1aef and c766971.

📒 Files selected for processing (6)
  • pkg/mcs/resourcemanager/server/keyspace_manager.go
  • pkg/mcs/resourcemanager/server/keyspace_manager_test.go
  • pkg/mcs/resourcemanager/server/manager.go
  • pkg/mcs/resourcemanager/server/manager_test.go
  • pkg/mcs/resourcemanager/server/metadata_watcher.go
  • pkg/mcs/resourcemanager/server/metadata_watcher_test.go
🚧 Files skipped from review as they are similar to previous changes (2)
  • pkg/mcs/resourcemanager/server/keyspace_manager_test.go
  • pkg/mcs/resourcemanager/server/keyspace_manager.go

Reorder imports for gci and rename unused test watcher-factory
parameters to satisfy revive.

Verified with:
make static

Signed-off-by: okjiang <819421878@qq.com>
@okJiang
Copy link
Member Author

okJiang commented Mar 6, 2026

/retest

@codecov
Copy link

codecov bot commented Mar 6, 2026

Codecov Report

❌ Patch coverage is 77.96053% with 67 lines in your changes missing coverage. Please review.
✅ Project coverage is 78.87%. Comparing base (5885cec) to head (6909d07).
⚠️ Report is 20 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master   #10259      +/-   ##
==========================================
+ Coverage   78.80%   78.87%   +0.07%     
==========================================
  Files         523      528       +5     
  Lines       70529    71191     +662     
==========================================
+ Hits        55580    56153     +573     
- Misses      10955    11016      +61     
- Partials     3994     4022      +28     
Flag Coverage Δ
unittests 78.87% <77.96%> (+0.07%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Signed-off-by: okjiang <819421878@qq.com>
okJiang added 2 commits March 9, 2026 23:44
Signed-off-by: okjiang <819421878@qq.com>
Signed-off-by: okjiang <819421878@qq.com>
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pkg/mcs/resourcemanager/server/keyspace_manager.go`:
- Around line 259-265: Currently DeleteResourceGroupStates(krgm.keyspaceID,
name) is called before DeleteResourceGroupSetting(krgm.keyspaceID, name) which
can leave state removed while settings remain; change this to perform both
deletes atomically using a single etcd transaction (combine
DeleteResourceGroupSetting and DeleteResourceGroupStates into one txn) or, if a
txn is not feasible, invert the order to call DeleteResourceGroupSetting first
and then call DeleteResourceGroupStates as a best-effort cleanup, and keep
deleteResourceGroupFromCache(name) only after the transactional/delete-success
path to ensure cache stays consistent with storage.

In `@pkg/mcs/resourcemanager/server/manager.go`:
- Around line 241-250: The full-load branch that calls
loadKeyspaceResourceGroups() can create ghost groups because
addResourceGroupFromRaw() inserts by group.Name without validating the etcd
key/payload name guard; ensure the legacy bootstrap path performs the same
validation as watcher mode by either routing the full-load path to use
upsertResourceGroupFromRaw() instead of addResourceGroupFromRaw(), or add the
same key/payload name validation into addResourceGroupFromRaw() (the functions
to touch are loadKeyspaceResourceGroups, addResourceGroupFromRaw, and
upsertResourceGroupFromRaw); update loadKeyspaceResourceGroups() to call
upsertResourceGroupFromRaw() or add the name-check logic to
addResourceGroupFromRaw() so mismatched records are rejected during legacy
bootstrap.
- Around line 314-324: applyControllerConfigFromRaw currently swaps
m.controllerConfig under m.Lock but other goroutines (backgroundMetricsFlush,
UpdateControllerConfigItem) access or mutate the same object directly causing
races; change to publish a snapshot atomically: deep-copy or construct a new
ControllerConfig, assign it to m.controllerConfig only while holding m.Lock (or
use an atomic.Value field) and ensure readers (backgroundMetricsFlush) and
mutators (UpdateControllerConfigItem) always access controller config via the
same locked accessor method (e.g.,
GetControllerConfigLocked/GetControllerConfig) or via atomic load so they never
mutate the shared instance in place; also update UpdateControllerConfigItem to
operate on a copied config and replace the published config under the lock.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 06b1c729-772b-4246-85d9-9f4480870f2d

📥 Commits

Reviewing files that changed from the base of the PR and between c766971 and 4092c2e.

📒 Files selected for processing (7)
  • pkg/mcs/resourcemanager/server/keyspace_manager.go
  • pkg/mcs/resourcemanager/server/keyspace_manager_test.go
  • pkg/mcs/resourcemanager/server/manager.go
  • pkg/mcs/resourcemanager/server/manager_test.go
  • pkg/mcs/resourcemanager/server/metadata_watcher.go
  • pkg/mcs/resourcemanager/server/metadata_watcher_test.go
  • pkg/mcs/resourcemanager/server/server.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • pkg/mcs/resourcemanager/server/keyspace_manager_test.go

@okJiang
Copy link
Member Author

okJiang commented Mar 10, 2026

Also addressed the CodeRabbit review-body notes from the March 6 review:

  • handleMetadataWatchDelete() now emits debug logs when delete events are skipped because the keyspace manager is missing.
  • initializeMetadataWatcher() no longer replays loadServiceLimits() after WaitLoad().
  • Added TestInitializeMetadataWatcherDoesNotReloadServiceLimits.

Verification:

  • make gotest GOTEST_ARGS="./pkg/mcs/resourcemanager/server -count=1"
  • ./.tools/bin/golangci-lint run --verbose ./pkg/mcs/resourcemanager/server/... --allow-parallel-runners

Address the remaining metadata watcher review comments by
reapplying existing service limits when groups enter the cache,
restoring the default group with fully initialized runtime fields,
and extending the watcher regression coverage.

The package tests and server lint checks pass with these changes.

Signed-off-by: okjiang <819421878@qq.com>
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pkg/mcs/resourcemanager/server/keyspace_manager.go`:
- Around line 120-130: addResourceGroupFromRaw (and the new-group path in
upsertResourceGroupFromRaw) currently skips the validations performed by
addResourceGroup, allowing invalid groups (e.g., invalid name or Priority >
maxPriority) into krgm.groups and causing downstream panics; after
parseResourceGroupFromRaw returns the proto group, run the same validation logic
used by addResourceGroup (validate group.Name format and that group.Priority is
within allowed bounds, or reuse addResourceGroup's validation helper if
available) and return an error if validation fails before calling
FromProtoResourceGroup() and inserting into krgm.groups; alternatively, call
addResourceGroup with the validated proto to centralize checks and avoid
duplicating logic, ensuring getPriorityQueues() cannot encounter an out-of-range
priority.

In `@pkg/mcs/resourcemanager/server/manager.go`:
- Around line 417-427: The code currently updates m.controllerConfig and returns
nil even when m.storage.SaveControllerConfig(controllerConfig) fails; change the
control flow so that you do not assign m.controllerConfig unless
SaveControllerConfig returns nil, and if SaveControllerConfig returns an error,
return that error up the stack (use errors.Is/As where appropriate for sentinel
checks). Concretely: within the updated branch call
m.storage.SaveControllerConfig(controllerConfig) first, if it returns an error
log it and return the error (making sure to release the lock before returning),
and only on success set m.controllerConfig and continue to log the update; keep
references to SaveControllerConfig and m.controllerConfig to locate the changes.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: b80cdb91-0be4-4a9a-9e60-ab025a9b79c7

📥 Commits

Reviewing files that changed from the base of the PR and between 4092c2e and e7a2d93.

📒 Files selected for processing (6)
  • pkg/mcs/resourcemanager/server/keyspace_manager.go
  • pkg/mcs/resourcemanager/server/keyspace_manager_test.go
  • pkg/mcs/resourcemanager/server/manager.go
  • pkg/mcs/resourcemanager/server/manager_test.go
  • pkg/mcs/resourcemanager/server/metadata_watcher.go
  • pkg/mcs/resourcemanager/server/metadata_watcher_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • pkg/mcs/resourcemanager/server/manager_test.go

@ti-chi-bot ti-chi-bot bot added release-note Denotes a PR that will be considered when it comes time to generate release notes. and removed release-note-none Denotes a PR that doesn't merit a release note. labels Mar 10, 2026
okJiang added 3 commits March 10, 2026 15:13
Reject invalid raw resource-group payloads before inserting them into\nthe in-memory cache and keep controller config unchanged when\npersistence fails.\n\nAdd regression tests for both paths.

Signed-off-by: okjiang <819421878@qq.com>
Signed-off-by: okjiang <819421878@qq.com>
Signed-off-by: okjiang <819421878@qq.com>
@ti-chi-bot
Copy link
Contributor

ti-chi-bot bot commented Mar 10, 2026

@okJiang: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
pull-integration-realcluster-test e5a7c7f link true /test pull-integration-realcluster-test
pull-unit-test-next-gen-3 6909d07 link true /test pull-unit-test-next-gen-3
pull-unit-test-next-gen-2 6909d07 link true /test pull-unit-test-next-gen-2

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

dco-signoff: yes Indicates the PR's author has signed the dco. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant