
fix(discord): prevent rate limit retry storm #1586

Merged
simonkberg merged 2 commits into main from fix/discord-rate-limit-retry-storm on Feb 20, 2026

Conversation

@simonkberg
Owner

Summary

  • Add shared per-endpoint rate limit gate (rateLimitUntil Map) so concurrent requests wait on the same timestamp instead of independently retrying and consuming each other's rate limit budget
  • Cap retries at 5 (MAX_RETRIES) to prevent 30-second retry storms when multiple requests sustain the rate limit
  • Include retries count in log messages for observability
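The shared gate described above can be sketched as follows. The identifiers rateLimitUntil and MAX_RETRIES come from the PR; the helper functions and their signatures are hypothetical, added here only to illustrate the mechanism.

```typescript
// Illustrative sketch of the shared per-endpoint rate limit gate.
// rateLimitUntil and MAX_RETRIES are names from the PR; the helper
// functions below are hypothetical, not the actual implementation.
const MAX_RETRIES = 5

// endpoint -> epoch-ms timestamp until which requests to it should wait
const rateLimitUntil = new Map<string, number>()

// How long a caller must still wait before hitting this endpoint.
function msUntilGateOpens(endpoint: string, now: number): number {
  const waitUntil = rateLimitUntil.get(endpoint)
  return waitUntil === undefined ? 0 : Math.max(0, waitUntil - now)
}

// Record a 429: the later-expiring window wins, so a concurrent 429
// with a shorter retry_after cannot shorten an already-open gate.
function recordRateLimit(endpoint: string, retryAfterMs: number, now: number): void {
  const existing = rateLimitUntil.get(endpoint) ?? 0
  rateLimitUntil.set(endpoint, Math.max(existing, now + retryAfterMs))
}
```

Because the Map is module-level, every concurrent caller in the same process consults the same gate instead of retrying independently.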

Test plan

  • All 397 existing tests pass
  • New test: max retry count exceeded (always 429) → throws after 5 retries
  • New test: shared rate limit gate prevents redundant fetches from concurrent requests
  • New test: retry count appears in log messages
  • pnpm lint passes (TypeScript, ESLint, Prettier)

🤖 Generated with Claude Code

@railway-app railway-app bot temporarily deployed to simon.dev / simon.dev-pr-1586 February 10, 2026 19:29 Destroyed
@railway-app

railway-app bot commented Feb 10, 2026

🚅 Deployed to the simon.dev-pr-1586 environment in simon.dev

Service     Status                  Web   Updated (UTC)
simon.dev   ✅ Success (View Logs)   Web   Feb 20, 2026 at 9:08 pm

@claude
Contributor

claude bot commented Feb 10, 2026

Claude finished @simonkberg's task in 3m 36s — View job


Code Review

  • Gather context and read changed files
  • Review api.ts changes
  • Review api.test.ts changes
  • Post inline comments
  • Post comprehensive review feedback

Summary

The implementation is correct and solves the retry storm problem cleanly. The shared rateLimitUntil Map is the right primitive — module-level state in Next.js server context means all concurrent call() invocations within the same process share the gate, exactly as intended.


Code Quality

api.ts

The logic flow is sound:

  1. Gate-before-fetch — checking for an active gate before issuing the request is the right order. A request that arrives while another request's retry_after window is still open will wait out the remainder rather than firing and burning another 429.
  2. Math.max on gate updates (line 111–112) — elegantly handles the case where two concurrent requests both receive 429s and try to update the gate. The later-expiring window wins, preventing the first request's gate from being shortened by a second one with a shorter retry_after.
  3. Dual exit condition (retries > MAX_RETRIES || elapsedMs + retryAfterMs > RATE_LIMIT_TIMEOUT_MS) — the two axes of protection are complementary: max retries caps the count regardless of timing; elapsed-plus-retry caps wall time regardless of count. Together they bound worst-case behaviour well.
  4. Gate not updated on throw — when the function throws (max retries or timeout exceeded), the gate from the previous successful retry iteration remains active. Concurrent requests will still see it and wait. Good.
  5. Mixed clocks (performance.now for elapsedMs, Date.now for gate timestamps) — not a bug. Both yield millisecond durations when used as differences; adding them in the timeout comparisons is arithmetically valid.
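The dual exit condition in point 3 can be isolated into a small predicate. The condition itself is quoted from the PR; the constant values and the shouldGiveUp function are illustrative assumptions, not the PR's code.

```typescript
// Sketch of the dual exit condition from the review above. The boolean
// expression matches the one quoted in point 3; shouldGiveUp and the
// constant values here are assumptions for illustration.
const MAX_RETRIES = 5
const RATE_LIMIT_TIMEOUT_MS = 30_000

function shouldGiveUp(retries: number, elapsedMs: number, retryAfterMs: number): boolean {
  // Count axis: cap total attempts regardless of timing.
  // Time axis: cap wall-clock time regardless of attempt count.
  return retries > MAX_RETRIES || elapsedMs + retryAfterMs > RATE_LIMIT_TIMEOUT_MS
}
```

Either axis alone would leave a gap: many fast retries, or one very long retry_after, would slip through. Together they bound both the count and the wall time.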

One inline note left on line 99 about retries > MAX_RETRIES allowing MAX_RETRIES+1 total calls — conventional, but worth confirming it's the intended count.

Exported test helpers

_resetRateLimitState and _setRateLimitGate are a pragmatic pattern. The @internal JSDoc and _ prefix clearly signal their purpose. The alternative (injecting a clock/state factory) would be overengineering for this scale.
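The helper names and the @internal/_ convention are from the PR; the shapes below are a guess at what such helpers could look like against the module-level Map, for illustration only.

```typescript
// Hypothetical shape of the exported test helpers. The names
// _resetRateLimitState and _setRateLimitGate come from the PR;
// these bodies and signatures are assumptions.
const rateLimitUntil = new Map<string, number>()

/** @internal test-only: wipe all gates between tests */
export function _resetRateLimitState(): void {
  rateLimitUntil.clear()
}

/** @internal test-only: force a gate open until the given epoch-ms timestamp */
export function _setRateLimitGate(endpoint: string, waitUntil: number): void {
  rateLimitUntil.set(endpoint, waitUntil)
}
```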


Security

No issues. The endpoint string is used only as a Map key and in log fields — never executed or interpolated into any sensitive context. The Authorization header is correctly sourced from validated env vars.


Performance

  • The rateLimitUntil Map grows only as large as the number of distinct endpoints hit. Discord API endpoints are a finite, fixed set, so there's no unbounded growth risk.
  • Expired entries (gate value ≤ Date.now()) are effectively no-ops and self-clean on the next successful response for that endpoint. No manual TTL cleanup needed.
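The self-cleaning behaviour in the second bullet can be shown in a few lines: an expired entry behaves as if absent, and success deletes it outright. The function names here are hypothetical; only the Map and the delete-on-success behaviour are from the PR.

```typescript
// Sketch of why no TTL sweep is needed. activeGate and onSuccess are
// hypothetical names; the Map and delete-on-success are from the PR.
const rateLimitUntil = new Map<string, number>()

// An entry whose timestamp is <= now is treated as absent (a no-op).
function activeGate(endpoint: string, now: number): number | undefined {
  const until = rateLimitUntil.get(endpoint)
  return until !== undefined && until > now ? until : undefined
}

// A successful response removes the entry entirely.
function onSuccess(endpoint: string): void {
  rateLimitUntil.delete(endpoint)
}
```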

Testing

Coverage is thorough and well-structured:

Covered scenarios:

  • Retry after 429, eventual success
  • retry_after → timeout exceeded
  • Missing retry_after → 1s fallback
  • Elapsed time accumulates across retries
  • Max retries cap (6 total calls)
  • Shared gate — concurrent requests
  • Pre-set gate → timeout via _setRateLimitGate
  • retries count appears in every log message

The "shared state across concurrent requests" test (line 891) is the core regression test for this fix. The comment explaining the "without shared state: 4 fetches / with shared state: 3 fetches" distinction makes the assertion immediately understandable.

One inline observation left on line 124 about the gate-deletion-on-success edge case with concurrent sleepers — it's harmless, documented for future reference.


Overall

Solid fix. No blocking issues. The two inline comments are observations/minor notes rather than requests for changes.

@codecov

codecov bot commented Feb 10, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 97.81%. Comparing base (6a712a8) to head (c746445).
⚠️ Report is 1 commit behind head on main.
✅ All tests successful. No failed tests found.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1586      +/-   ##
==========================================
+ Coverage   97.76%   97.81%   +0.04%     
==========================================
  Files          49       49              
  Lines         896      916      +20     
  Branches      156      161       +5     
==========================================
+ Hits          876      896      +20     
  Misses          8        8              
  Partials       12       12              

☔ View full report in Codecov by Sentry.
@simonkberg
Owner Author

@claude Evaluated both suggestions:

1. Clear gate on non-429 errors — Not implementing. The gate represents Discord's stated rate limit window. A subsequent 500 doesn't mean the rate limit has ended — it's a separate error. Clearing the gate prematurely would let other requests hit the rate limit again. The gate expires naturally when waitUntil - Date.now() <= 0, so it never "sticks" beyond the original retry_after duration.

2. Extended inline comment — The existing comment (// Per-endpoint rate limit tracking: endpoint → timestamp until which requests should wait) is sufficient. The code is straightforward.

@claude
Contributor

claude bot commented Feb 10, 2026

Claude finished @simonkberg's task in 19s — View job


Response to Feedback

Thank you for the thoughtful evaluation. Your reasoning on both points makes sense:

1. Gate cleanup on non-429 errors: You're correct that the gate represents Discord's declared rate limit window, and a 500 error doesn't invalidate that window. The natural expiration via waitUntil - Date.now() <= 0 handles this properly without premature clearing.

2. Inline comment: The existing comment adequately documents the purpose. The implementation is clear enough without additional explanation.

Both decisions are sound. The PR is well-designed and ready to merge as-is. ✅


@railway-app railway-app bot temporarily deployed to simon.dev / simon.dev-pr-1586 February 10, 2026 19:50 Destroyed
simonkberg and others added 2 commits February 20, 2026 22:06
Concurrent requests to the same Discord endpoint independently retried
on 429 responses, each consuming the rate limit budget and preventing
others from succeeding. This created a self-sustaining retry loop
lasting ~30 seconds per cycle.

Add a shared per-endpoint rate limit gate so all concurrent callers
wait on the same timestamp, and cap retries at 5 to prevent prolonged
retry storms.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add _setRateLimitGate test helper to directly set a rate limit gate,
since the gate timeout path is unreachable through natural 429 responses
alone (the gate is bounded by RATE_LIMIT_TIMEOUT_MS).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@simonkberg simonkberg force-pushed the fix/discord-rate-limit-retry-storm branch from 4ffb6d1 to c746445 Compare February 20, 2026 21:06
@railway-app railway-app bot temporarily deployed to simon.dev / simon.dev-pr-1586 February 20, 2026 21:06 Destroyed

-  if (elapsedMs + retryAfterMs > RATE_LIMIT_TIMEOUT_MS) {
+  if (
+    retries > MAX_RETRIES ||
+    elapsedMs + retryAfterMs > RATE_LIMIT_TIMEOUT_MS
+  ) {
Contributor

Minor naming note: MAX_RETRIES = 5 with retries > MAX_RETRIES allows 6 total API calls (1 initial + 5 retries). Using >= instead would give you exactly MAX_RETRIES calls, or renaming the constant to MAX_ATTEMPTS = 6 would make the intent self-documenting. The current behaviour is conventional for retry logic, and the test comment at line 882 confirms it's intentional — just worth a quick sanity check that 6 total calls is what you want.

}
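The attempt arithmetic in this note can be checked mechanically. The loop below is a standalone illustration of the always-429 scenario, not the PR's retry loop; only MAX_RETRIES and the `retries > MAX_RETRIES` exit match the code under review.

```typescript
// Illustrates the off-by-one note above: with MAX_RETRIES = 5 and the
// `retries > MAX_RETRIES` exit, an always-429 endpoint is called 6 times
// (1 initial + 5 retries). This loop is a counting sketch, not the PR's code.
const MAX_RETRIES = 5

function countTotalAttempts(): number {
  let retries = 0
  let calls = 0
  for (;;) {
    calls++ // the fetch that returns 429 every time in this scenario
    retries++
    if (retries > MAX_RETRIES) return calls
  }
}
```

Switching to `>=` would cap it at 5 total calls; the PR keeps the conventional 6 and documents it in the tests.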

// Clear rate limit gate on success
rateLimitUntil.delete(endpoint);
Contributor

Deleting the gate on any successful response is correct for the common case, but there's a subtle race: if Request A and Request B both slept on the gate, A succeeds first and deletes the gate, then B (which has already computed and is sleeping through its waitMs timer) wakes and succeeds too — that's fine because B's timer was already scheduled and the deletion had no effect on it.

The only interesting case is a hypothetical Request C that arrives after A's deletion but before B's timer fires: C sees no gate and fires immediately. Since A just got a 200, the rate limit window has passed, so C getting a 200 is correct behaviour. No issue, just documenting the reasoning in case this edge case comes up in a future review.

@simonkberg simonkberg merged commit e6712f5 into main Feb 20, 2026
10 checks passed
@simonkberg simonkberg deleted the fix/discord-rate-limit-retry-storm branch February 20, 2026 21:11