Skip to content

resource_group: set FillRate in non-trickle token bucket responses#10322

Open
JmPotato wants to merge 1 commit intotikv:masterfrom
JmPotato:fix-10251
Open

resource_group: set FillRate in non-trickle token bucket responses#10322
JmPotato wants to merge 1 commit intotikv:masterfrom
JmPotato:fix-10251

Conversation

@JmPotato
Copy link
Member

@JmPotato JmPotato commented Mar 10, 2026

What problem does this PR solve?

Issue Number: Close #10251

When the resource group FillRate is much larger than actual consumption (e.g., FillRate=1,000,000 vs consumption≈200 RU/s), requests experience ~1s latency spikes.

Root cause: The server's request() and assignSlotTokens() never set FillRate in non-trickle responses. The client's modifyTokenCounter uses cfg.newFillRate = float64(bucket.GetSettings().FillRate), which gets 0. After the granted tokens are depleted, durationFromTokens() returns InfDuration, causing all subsequent OnRequestWait calls to fail with ErrClientResourceGroupThrottled or context.DeadlineExceeded.

What is changed and how does it work?

Set FillRate on all three non-trickle return paths in the server so the client limiter can refill tokens locally:

  1. No-slot path in request() — when the slot doesn't exist (initial periodic report with requiredToken=0), return FillRate = gtb.getFillRate() (group-level fill rate).
  2. requiredToken<=0 path in assignSlotTokens() — when the slot exists but no tokens are requested, return FillRate = ts.fillRate (per-slot fill rate).
  3. Full-satisfy path in assignSlotTokens() — when the slot has enough capacity to fully satisfy the request, return FillRate = ts.fillRate (per-slot fill rate).

Trickle paths are intentionally left unchanged to avoid double-counting with the client's own trickle-based fill rate calculation (cfg.newFillRate = FillRate + granted/trickleTime).

Set FillRate on all three non-trickle return paths in the server:
1. No-slot path in request() — initial periodic report with no slot yet
2. requiredToken<=0 path in assignSlotTokens() — zero-demand requests
3. Full-satisfy path in assignSlotTokens() — capacity covers demand

Trickle paths are intentionally left unchanged to avoid double-counting
with the client's own trickle-based fill rate calculation.

Check List

Tests

  • Unit test
  • Integration test

Code changes

  • Has no configuration change
  • Has no HTTP API change
  • Has no persistent data change

Side effects

  • No performance regression
  • No increased code complexity
  • No breaking backward compatibility

Release note

Fix the issue that the resource group token bucket server does not set FillRate in non-trickle responses, which causes the client limiter to have fillRate=0 and results in ~1s latency spikes when the resource group FillRate is much larger than actual consumption.

Summary by CodeRabbit

Bug Fixes

  • Improved token bucket rate limiting to consistently propagate fill rates across all token grant scenarios, preventing double-counting issues and ensuring accurate token refilling behavior.

When the server grants tokens in non-trickle mode (capacity sufficient),
the response previously carried FillRate=0. This caused the client
limiter to have fillRate=0, making durationFromTokens() return
InfDuration after the granted tokens were depleted, resulting in ~1s
latency spikes.

Set FillRate on all three non-trickle return paths in the server:
1. No-slot path in request() — initial periodic report with no slot yet
2. requiredToken<=0 path in assignSlotTokens() — zero-demand requests
3. Full-satisfy path in assignSlotTokens() — capacity covers demand

Trickle paths are intentionally left unchanged to avoid double-counting
with the client's own trickle-based fill rate calculation.

Close tikv#10251

Signed-off-by: JmPotato <github@ipotato.me>
@ti-chi-bot ti-chi-bot bot added release-note Denotes a PR that will be considered when it comes time to generate release notes. dco-signoff: yes Indicates the PR's author has signed the dco. labels Mar 10, 2026
@ti-chi-bot
Copy link
Contributor

ti-chi-bot bot commented Mar 10, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign andremouche for approval. For more information see the Code Review Process.
Please ensure that each of them provides their approval before proceeding.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ti-chi-bot ti-chi-bot bot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Mar 10, 2026
@coderabbitai
Copy link

coderabbitai bot commented Mar 10, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 059f52cd-eb20-4479-a16d-306f36442df8

📥 Commits

Reviewing files that changed from the base of the PR and between 58252c5 and c80979d.

📒 Files selected for processing (3)
  • pkg/mcs/resourcemanager/server/token_buckets.go
  • pkg/mcs/resourcemanager/server/token_buckets_test.go
  • tests/integrations/mcs/resourcemanager/resource_manager_test.go

📝 Walkthrough

Walkthrough

The pull request ensures FillRate is consistently propagated in token bucket settings during slot creation and token assignment across multiple code paths. When a new slot is created or tokens are assigned, the resulting TokenBucket.Settings now includes the appropriate FillRate value, with special handling for trickle versus non-trickle response modes.

Changes

Cohort / File(s) Summary
Token Bucket FillRate Propagation
pkg/mcs/resourcemanager/server/token_buckets.go
Added FillRate field assignment to TokenBucket.Settings in three scenarios: when creating new slots in GroupTokenBucket.request(), and when assigning tokens in tokenSlot.assignSlotTokens() for both zero-requirement and direct-grant cases.
FillRate Propagation Tests
pkg/mcs/resourcemanager/server/token_buckets_test.go
Added assertions to verify FillRate propagation behavior, ensuring non-trickle responses include the slot's FillRate and trickle responses zero it out to prevent double-counting.
Integration Test Coverage
tests/integrations/mcs/resourcemanager/resource_manager_test.go
Added TestNonTrickleFillRate integration test with two scenarios testing FillRate propagation in non-trickle token grant paths for resource groups.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

Suggested labels

size/L, release-note-none, lgtm, approved

Suggested reviewers

  • rleungx
  • disksing

Poem

🐰 A fill rate lost in the token stream,
Now flowing through each slot with gleam,
No more the empty bucket's cry,
When trickles pause and rates run dry! 🪣✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly and specifically describes the main change: setting FillRate in non-trickle token bucket responses for resource groups.
Description check ✅ Passed The PR description thoroughly addresses the template requirements: clearly states the linked issue, explains the problem and solution with detailed technical context, includes commit message, and documents test coverage, code changes, and release notes.
Linked Issues check ✅ Passed The PR successfully addresses issue #10251 by setting FillRate on all three non-trickle response paths (no-slot path, requiredToken<=0 path, and full-satisfy path) to enable client-side local token refilling and prevent the zero-fillrate starved state.
Out of Scope Changes check ✅ Passed All changes are tightly scoped to the issue: FillRate propagation in non-trickle paths in token_buckets.go, corresponding test additions in token_buckets_test.go and resource_manager_test.go, with no unrelated modifications.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@ti-chi-bot
Copy link
Contributor

ti-chi-bot bot commented Mar 10, 2026

@JmPotato: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
pull-unit-test-next-gen-1 c80979d link true /test pull-unit-test-next-gen-1
pull-unit-test-next-gen-2 c80979d link true /test pull-unit-test-next-gen-2
pull-unit-test-next-gen-3 c80979d link true /test pull-unit-test-next-gen-3

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

dco-signoff: yes Indicates the PR's author has signed the dco. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Resource Group: 1s latency spike when FillRate far exceeds actual RU consumption

1 participant