fix(discord): prevent rate limit retry storm#1586
Conversation
|
🚅 Deployed to the simon.dev-pr-1586 environment in simon.dev
|
|
Claude finished @simonkberg's task in 3m 36s —— View job Code Review
SummaryThe implementation is correct and solves the retry storm problem cleanly. The shared Code Quality
The logic flow is sound:
One inline note left on line 99 about Exported test helpers
SecurityNo issues. The Performance
TestingCoverage is thorough and well-structured:
The "shared state across concurrent requests" test (line 891) is the core regression test for this fix. The comment explaining the "without shared state: 4 fetches / with shared state: 3 fetches" distinction makes the assertion immediately understandable. One inline observation left on line 124 about the gate-deletion-on-success edge case with concurrent sleepers — it's harmless, documented for future reference. OverallSolid fix. No blocking issues. The two inline comments are observations/minor notes rather than requests for changes. |
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #1586 +/- ##
==========================================
+ Coverage 97.76% 97.81% +0.04%
==========================================
Files 49 49
Lines 896 916 +20
Branches 156 161 +5
==========================================
+ Hits 876 896 +20
Misses 8 8
Partials 12 12 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
|
@claude Evaluated both suggestions: 1. Clear gate on non-429 errors — Not implementing. The gate represents Discord's stated rate limit window. A subsequent 500 doesn't mean the rate limit has ended — it's a separate error. Clearing the gate prematurely would let other requests hit the rate limit again. The gate expires naturally when 2. Extended inline comment — The existing comment ( |
|
Claude finished @simonkberg's task in 19s —— View job Response to FeedbackThank you for the thoughtful evaluation. Your reasoning on both points makes sense: 1. Gate cleanup on non-429 errors: You're correct that the gate represents Discord's declared rate limit window, and a 500 error doesn't invalidate that window. The natural expiration via 2. Inline comment: The existing comment adequately documents the purpose. The implementation is clear enough without additional explanation. Both decisions are sound. The PR is well-designed and ready to merge as-is. ✅ |
Concurrent requests to the same Discord endpoint independently retried on 429 responses, each consuming the rate limit budget and preventing others from succeeding. This created a self-sustaining retry loop lasting ~30 seconds per cycle. Add a shared per-endpoint rate limit gate so all concurrent callers wait on the same timestamp, and cap retries at 5 to prevent prolonged retry storms. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add _setRateLimitGate test helper to directly set a rate limit gate, since the gate timeout path is unreachable through natural 429 responses alone (the gate is bounded by RATE_LIMIT_TIMEOUT_MS). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
4ffb6d1 to
c746445
Compare
|
|
||
| if (elapsedMs + retryAfterMs > RATE_LIMIT_TIMEOUT_MS) { | ||
| if ( | ||
| retries > MAX_RETRIES || |
There was a problem hiding this comment.
Minor naming note: MAX_RETRIES = 5 with retries > MAX_RETRIES allows 6 total API calls (1 initial + 5 retries). Using >= instead would give you exactly MAX_RETRIES calls, or renaming the constant to MAX_ATTEMPTS = 6 would make the intent self-documenting. The current behaviour is conventional for retry logic, and the test comment at line 882 confirms it's intentional — just worth a quick sanity check that 6 total calls is what you want.
| } | ||
|
|
||
| // Clear rate limit gate on success | ||
| rateLimitUntil.delete(endpoint); |
There was a problem hiding this comment.
Deleting the gate on any successful response is correct for the common case, but there's a subtle race: if Request A and Request B both slept on the gate, A succeeds first and deletes the gate, then B (which has already computed and is sleeping through its waitMs timer) wakes and succeeds too — that's fine because B's timer was already scheduled and the deletion had no effect on it.
The only interesting case is a hypothetical Request C that arrives after A's deletion but before B's timer fires: C sees no gate and fires immediately. Since A just got a 200, the rate limit window has passed, so C getting a 200 is correct behaviour. No issue, just documenting the reasoning in case this edge case comes up in a future review.
Summary
rateLimitUntilMap) so concurrent requests wait on the same timestamp instead of independently retrying and consuming each other's rate limit budgetMAX_RETRIES) to prevent 30-second retry storms when multiple requests sustain the rate limitretriescount in log messages for observabilityTest plan
pnpm lintpasses (TypeScript, ESLint, Prettier)🤖 Generated with Claude Code