
fix(discord): prevent rate limit retry storm #1586

Merged
simonkberg merged 2 commits into main from fix/discord-rate-limit-retry-storm on Feb 20, 2026

Conversation

@simonkberg
Owner

Summary

  • Add shared per-endpoint rate limit gate (rateLimitUntil Map) so concurrent requests wait on the same timestamp instead of independently retrying and consuming each other's rate limit budget
  • Cap retries at 5 (MAX_RETRIES) to prevent 30-second retry storms when multiple requests sustain the rate limit
  • Include retries count in log messages for observability
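The shared gate described above can be sketched as follows. The identifiers rateLimitUntil and MAX_RETRIES come from the PR; the helper functions and their signatures are hypothetical, added here only to illustrate the mechanism.

```typescript
// Illustrative sketch of the shared per-endpoint rate limit gate.
// rateLimitUntil and MAX_RETRIES are names from the PR; the helper
// functions below are hypothetical, not the actual implementation.
const MAX_RETRIES = 5

// endpoint -> epoch-ms timestamp until which requests to it should wait
const rateLimitUntil = new Map<string, number>()

// How long a caller must still wait before hitting this endpoint.
function msUntilGateOpens(endpoint: string, now: number): number {
  const waitUntil = rateLimitUntil.get(endpoint)
  return waitUntil === undefined ? 0 : Math.max(0, waitUntil - now)
}

// Record a 429: the later-expiring window wins, so a concurrent 429
// with a shorter retry_after cannot shorten an already-open gate.
function recordRateLimit(endpoint: string, retryAfterMs: number, now: number): void {
  const existing = rateLimitUntil.get(endpoint) ?? 0
  rateLimitUntil.set(endpoint, Math.max(existing, now + retryAfterMs))
}
```

Because the Map is module-level, every concurrent caller in the same process consults the same gate instead of retrying independently.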

Test plan

  • All 397 existing tests pass
  • New test: max retry count exceeded (always 429) → throws after 5 retries
  • New test: shared rate limit gate prevents redundant fetches from concurrent requests
  • New test: retry count appears in log messages
  • pnpm lint passes (TypeScript, ESLint, Prettier)

🤖 Generated with Claude Code

@railway-app railway-app bot temporarily deployed to simon.dev / simon.dev-pr-1586 February 10, 2026 19:29 Destroyed
@railway-app

railway-app bot commented Feb 10, 2026

🚅 Deployed to the simon.dev-pr-1586 environment in simon.dev

Service     Status                  Web   Updated (UTC)
simon.dev   ✅ Success (View Logs)   Web   Feb 20, 2026 at 9:08 pm

@claude
Contributor

claude bot commented Feb 10, 2026

Claude finished @simonkberg's task in 3m 36s — View job


Code Review

  • Gather context and read changed files
  • Review api.ts changes
  • Review api.test.ts changes
  • Post inline comments
  • Post comprehensive review feedback

Summary

The implementation is correct and solves the retry storm problem cleanly. The shared rateLimitUntil Map is the right primitive — module-level state in Next.js server context means all concurrent call() invocations within the same process share the gate, exactly as intended.


Code Quality

api.ts

The logic flow is sound:

  1. Gate-before-fetch — checking for an active gate before issuing the request is the right order. A request that arrives while another request's retry_after window is still open will wait out the remainder rather than firing and burning another 429.
  2. Math.max on gate updates (line 111–112) — elegantly handles the case where two concurrent requests both receive 429s and try to update the gate. The later-expiring window wins, preventing the first request's gate from being shortened by a second one with a shorter retry_after.
  3. Dual exit condition (retries > MAX_RETRIES || elapsedMs + retryAfterMs > RATE_LIMIT_TIMEOUT_MS) — the two axes of protection are complementary: max retries caps the count regardless of timing; elapsed-plus-retry caps wall time regardless of count. Together they bound worst-case behaviour well.
  4. Gate not updated on throw — when the function throws (max retries or timeout exceeded), the gate from the previous successful retry iteration remains active. Concurrent requests will still see it and wait. Good.
  5. Mixed clocks (performance.now for elapsedMs, Date.now for gate timestamps) — not a bug. Both yield millisecond durations when used as differences; adding them in the timeout comparisons is arithmetically valid.
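The dual exit condition in point 3 can be isolated into a small predicate. The condition itself is quoted from the PR; the constant values and the shouldGiveUp function are illustrative assumptions, not the PR's code.

```typescript
// Sketch of the dual exit condition from the review above. The boolean
// expression matches the one quoted in point 3; shouldGiveUp and the
// constant values here are assumptions for illustration.
const MAX_RETRIES = 5
const RATE_LIMIT_TIMEOUT_MS = 30_000

function shouldGiveUp(retries: number, elapsedMs: number, retryAfterMs: number): boolean {
  // Count axis: cap total attempts regardless of timing.
  // Time axis: cap wall-clock time regardless of attempt count.
  return retries > MAX_RETRIES || elapsedMs + retryAfterMs > RATE_LIMIT_TIMEOUT_MS
}
```

Either axis alone would leave a gap: many fast retries, or one very long retry_after, would slip through. Together they bound both the count and the wall time.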

One inline note left on line 99 about retries > MAX_RETRIES allowing MAX_RETRIES+1 total calls — conventional, but worth confirming it's the intended count.

Exported test helpers

_resetRateLimitState and _setRateLimitGate are a pragmatic pattern. The @internal JSDoc and _ prefix clearly signal their purpose. The alternative (injecting a clock/state factory) would be overengineering for this scale.
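The helper names and the @internal/_ convention are from the PR; the shapes below are a guess at what such helpers could look like against the module-level Map, for illustration only.

```typescript
// Hypothetical shape of the exported test helpers. The names
// _resetRateLimitState and _setRateLimitGate come from the PR;
// these bodies and signatures are assumptions.
const rateLimitUntil = new Map<string, number>()

/** @internal test-only: wipe all gates between tests */
export function _resetRateLimitState(): void {
  rateLimitUntil.clear()
}

/** @internal test-only: force a gate open until the given epoch-ms timestamp */
export function _setRateLimitGate(endpoint: string, waitUntil: number): void {
  rateLimitUntil.set(endpoint, waitUntil)
}
```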


Security

No issues. The endpoint string is used only as a Map key and in log fields — never executed or interpolated into any sensitive context. The Authorization header is correctly sourced from validated env vars.


Performance

  • The rateLimitUntil Map grows only as large as the number of distinct endpoints hit. Discord API endpoints are a finite, fixed set, so there's no unbounded growth risk.
  • Expired entries (gate value ≤ Date.now()) are effectively no-ops and self-clean on the next successful response for that endpoint. No manual TTL cleanup needed.
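The self-cleaning behaviour in the second bullet can be shown in a few lines: an expired entry behaves as if absent, and success deletes it outright. The function names here are hypothetical; only the Map and the delete-on-success behaviour are from the PR.

```typescript
// Sketch of why no TTL sweep is needed. activeGate and onSuccess are
// hypothetical names; the Map and delete-on-success are from the PR.
const rateLimitUntil = new Map<string, number>()

// An entry whose timestamp is <= now is treated as absent (a no-op).
function activeGate(endpoint: string, now: number): number | undefined {
  const until = rateLimitUntil.get(endpoint)
  return until !== undefined && until > now ? until : undefined
}

// A successful response removes the entry entirely.
function onSuccess(endpoint: string): void {
  rateLimitUntil.delete(endpoint)
}
```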

Testing

Coverage is thorough and well-structured:

Covered scenarios:

  • Retry after 429, eventual success
  • retry_after → timeout exceeded
  • Missing retry_after → 1s fallback
  • Elapsed time accumulates across retries
  • Max retries cap (6 total calls)
  • Shared gate — concurrent requests
  • Pre-set gate → timeout via _setRateLimitGate
  • retries count appears in every log message

The "shared state across concurrent requests" test (line 891) is the core regression test for this fix. The comment explaining the "without shared state: 4 fetches / with shared state: 3 fetches" distinction makes the assertion immediately understandable.

One inline observation left on line 124 about the gate-deletion-on-success edge case with concurrent sleepers — it's harmless, documented for future reference.


Overall

Solid fix. No blocking issues. The two inline comments are observations/minor notes rather than requests for changes.

@codecov

codecov bot commented Feb 10, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 97.81%. Comparing base (6a712a8) to head (c746445).
⚠️ Report is 1 commit behind head on main.
✅ All tests successful. No failed tests found.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1586      +/-   ##
==========================================
+ Coverage   97.76%   97.81%   +0.04%     
==========================================
  Files          49       49              
  Lines         896      916      +20     
  Branches      156      161       +5     
==========================================
+ Hits          876      896      +20     
  Misses          8        8              
  Partials       12       12              

☔ View full report in Codecov by Sentry.
@simonkberg
Owner Author

@claude Evaluated both suggestions:

1. Clear gate on non-429 errors — Not implementing. The gate represents Discord's stated rate limit window. A subsequent 500 doesn't mean the rate limit has ended — it's a separate error. Clearing the gate prematurely would let other requests hit the rate limit again. The gate expires naturally when waitUntil - Date.now() <= 0, so it never "sticks" beyond the original retry_after duration.

2. Extended inline comment — The existing comment (// Per-endpoint rate limit tracking: endpoint → timestamp until which requests should wait) is sufficient. The code is straightforward.

@claude
Contributor

claude bot commented Feb 10, 2026

Claude finished @simonkberg's task in 19s — View job


Response to Feedback

Thank you for the thoughtful evaluation. Your reasoning on both points makes sense:

1. Gate cleanup on non-429 errors: You're correct that the gate represents Discord's declared rate limit window, and a 500 error doesn't invalidate that window. The natural expiration via waitUntil - Date.now() <= 0 handles this properly without premature clearing.

2. Inline comment: The existing comment adequately documents the purpose. The implementation is clear enough without additional explanation.

Both decisions are sound. The PR is well-designed and ready to merge as-is. ✅


@railway-app railway-app bot temporarily deployed to simon.dev / simon.dev-pr-1586 February 10, 2026 19:50 Destroyed
simonkberg and others added 2 commits February 20, 2026 22:06
Concurrent requests to the same Discord endpoint independently retried
on 429 responses, each consuming the rate limit budget and preventing
others from succeeding. This created a self-sustaining retry loop
lasting ~30 seconds per cycle.

Add a shared per-endpoint rate limit gate so all concurrent callers
wait on the same timestamp, and cap retries at 5 to prevent prolonged
retry storms.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add _setRateLimitGate test helper to directly set a rate limit gate,
since the gate timeout path is unreachable through natural 429 responses
alone (the gate is bounded by RATE_LIMIT_TIMEOUT_MS).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@simonkberg simonkberg force-pushed the fix/discord-rate-limit-retry-storm branch from 4ffb6d1 to c746445 Compare February 20, 2026 21:06
@railway-app railway-app bot temporarily deployed to simon.dev / simon.dev-pr-1586 February 20, 2026 21:06 Destroyed

-  if (elapsedMs + retryAfterMs > RATE_LIMIT_TIMEOUT_MS) {
+  if (
+    retries > MAX_RETRIES ||
+    elapsedMs + retryAfterMs > RATE_LIMIT_TIMEOUT_MS
+  ) {
Contributor

Minor naming note: MAX_RETRIES = 5 with retries > MAX_RETRIES allows 6 total API calls (1 initial + 5 retries). Using >= instead would give you exactly MAX_RETRIES calls, or renaming the constant to MAX_ATTEMPTS = 6 would make the intent self-documenting. The current behaviour is conventional for retry logic, and the test comment at line 882 confirms it's intentional — just worth a quick sanity check that 6 total calls is what you want.

}
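The attempt arithmetic in this note can be checked mechanically. The loop below is a standalone illustration of the always-429 scenario, not the PR's retry loop; only MAX_RETRIES and the `retries > MAX_RETRIES` exit match the code under review.

```typescript
// Illustrates the off-by-one note above: with MAX_RETRIES = 5 and the
// `retries > MAX_RETRIES` exit, an always-429 endpoint is called 6 times
// (1 initial + 5 retries). This loop is a counting sketch, not the PR's code.
const MAX_RETRIES = 5

function countTotalAttempts(): number {
  let retries = 0
  let calls = 0
  for (;;) {
    calls++ // the fetch that returns 429 every time in this scenario
    retries++
    if (retries > MAX_RETRIES) return calls
  }
}
```

Switching to `>=` would cap it at 5 total calls; the PR keeps the conventional 6 and documents it in the tests.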

// Clear rate limit gate on success
rateLimitUntil.delete(endpoint);
Contributor

Deleting the gate on any successful response is correct for the common case, but there's a subtle race: if Request A and Request B both slept on the gate, A succeeds first and deletes the gate, then B (which has already computed and is sleeping through its waitMs timer) wakes and succeeds too — that's fine because B's timer was already scheduled and the deletion had no effect on it.

The only interesting case is a hypothetical Request C that arrives after A's deletion but before B's timer fires: C sees no gate and fires immediately. Since A just got a 200, the rate limit window has passed, so C getting a 200 is correct behaviour. No issue, just documenting the reasoning in case this edge case comes up in a future review.

@simonkberg simonkberg merged commit e6712f5 into main Feb 20, 2026
10 checks passed
@simonkberg simonkberg deleted the fix/discord-rate-limit-retry-storm branch February 20, 2026 21:11