PYTHON-5504 Prototype exponential backoff in with_transaction #2492
ShaneHarvey merged 2 commits into mongodb:backpressure
Conversation
Can you add an async version of the benchmark? Having both APIs be tested before merging this into the backpressure branch would be ideal.
Done. Added the script to the Jira ticket. The async version still shows a significant reduction in the number of wasted retries and lower p50 to p90 latency, but little or no benefit for p99 and p100 latency. Before: After:
Async sees significantly less improvement with the backoff, but I'd say that's expected. Asyncio's cooperative multitasking structure already prevents a given operation from retrying before the other concurrent async tasks have had a chance to run (assuming the async/await code is written correctly).
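To illustrate the scheduling point above, here is a minimal sketch that does not use PyMongo at all. The `worker` and `interleaving` names are hypothetical, and `asyncio.sleep(0)` only stands in for the awaits a real retry would go through (network I/O and so on).

```python
import asyncio


async def worker(name, attempts, log):
    """Record each simulated attempt, then yield to the event loop."""
    for attempt in range(attempts):
        log.append((name, attempt))
        # Any await that suspends lets the other ready tasks run
        # before this task gets to make its next attempt.
        await asyncio.sleep(0)


async def main():
    log = []
    await asyncio.gather(worker("a", 2, log), worker("b", 2, log))
    return log


def interleaving():
    """Run both workers and return the order in which attempts happened."""
    return asyncio.run(main())
```

With the default event loop, `interleaving()` returns the attempts in alternating order (`a`, `b`, `a`, `b`), so task `a` never makes its second attempt before task `b` has had a turn.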
PYTHON-5505 Prototype system overload retry loop for all operations (#2497)

All commands that fail with the "Retryable" error label will be retried up to 3 times. When the error includes the "SystemOverloaded" error label we apply exponential backoff with jitter before attempting a retry.

PYTHON-5506 Prototype adaptive token bucket retry (#2501)

Add adaptive token bucket based retry policy. Successfully completed commands deposit 0.1 token. Failed retry attempts consume 1 token. A retry is only permitted if there is an available token. Token bucket starts full with the maximum 1000 tokens.

PYTHON-5505 Use proper RetryableError and SystemOverloadedError labels
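A rough sketch of how the retry loop and the token bucket described in these commit messages could fit together. This is not the code from those commits: `TokenBucket`, `CommandError`, and `run_with_retries` are hypothetical names, the 50ms/1000ms backoff values are borrowed from the with_transaction prototype as an assumption, and the sketch spends the token when a retry is attempted, which is one possible reading of "failed retry attempts consume 1 token".

```python
import random
import threading
import time


class CommandError(Exception):
    """Hypothetical stand-in for a server error carrying error labels."""

    def __init__(self, labels):
        super().__init__(f"command failed with labels {sorted(labels)}")
        self.labels = frozenset(labels)


class TokenBucket:
    """Hypothetical retry budget: starts full, successes refill it slowly."""

    def __init__(self, capacity=1000.0, deposit=0.1):
        self.capacity = capacity
        self.deposit = deposit
        self.tokens = capacity  # bucket starts full
        self._lock = threading.Lock()

    def on_success(self):
        # A successfully completed command deposits a fraction of a token.
        with self._lock:
            self.tokens = min(self.capacity, self.tokens + self.deposit)

    def try_consume(self):
        # A retry is only permitted if a whole token is available.
        with self._lock:
            if self.tokens >= 1.0:
                self.tokens -= 1.0
                return True
            return False


def run_with_retries(command, bucket, max_retries=3,
                     base=0.05, cap=1.0, sleep=time.sleep):
    """Call command(), retrying labeled failures within the token budget."""
    retries = 0
    while True:
        try:
            result = command()
        except CommandError as exc:
            if ("RetryableError" not in exc.labels
                    or retries >= max_retries
                    or not bucket.try_consume()):
                raise
            retries += 1
            if "SystemOverloadedError" in exc.labels:
                # Exponential backoff with full jitter (assumed doubling).
                sleep(random.uniform(0, min(cap, base * 2 ** (retries - 1))))
        else:
            bucket.on_success()
            return result
```

The `sleep` parameter exists only so the loop can be exercised without actually waiting, for example by passing `sleep=delays.append` and inspecting the recorded delays.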
PYTHON-5504 Prototype exponential backoff in with_transaction.
The repro script in Jira, which runs 200 concurrent transactions in 200 threads all updating the same document, shows a significant reduction in wasted retry attempts and latency (from p50 to p100). Before this change:
After (with 50ms initial backoff, 1000ms max backoff, and full jitter, backoff starting on the second retry attempt):
Backoff starting on the first retry attempt appears to work even better:
Note that I'm using free-threaded mode to make this repro more similar to the behavior of other languages and other deployment types (e.g., many single-threaded clients running on different machines).
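For reference, the backoff schedule described above (50ms initial, 1000ms max, full jitter) can be sketched roughly as follows. This is an illustration and not the code in this PR: `backoff_delay` and its parameters are hypothetical names, and doubling the ceiling on each retry is an assumption, since the description only says "exponential".

```python
import random


def backoff_delay(retry, initial=0.05, maximum=1.0,
                  first_backoff_retry=2, rng=random.random):
    """Return seconds to sleep before the given retry attempt (1-based).

    Retries before first_backoff_retry are not delayed. After that the
    ceiling starts at `initial`, doubles each retry (assumed), and is
    capped at `maximum`. Full jitter picks a uniform value below it.
    """
    if retry < first_backoff_retry:
        return 0.0
    # Cap the exponent so very large retry counts cannot overflow a float.
    exponent = min(retry - first_backoff_retry, 30)
    ceiling = min(maximum, initial * 2 ** exponent)
    return rng() * ceiling
```

Passing `first_backoff_retry=1` corresponds to the variant where backoff starts on the first retry attempt. The `rng` parameter is there so the jitter can be pinned in a test, e.g. `backoff_delay(3, rng=lambda: 1.0)` gives the ceiling for the third retry.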