PYTHON-5504 Prototype exponential backoff in with_transaction by ShaneHarvey · Pull Request #2492 · mongodb/mongo-python-driver

ShaneHarvey · 2025-08-19T15:43:51Z

PYTHON-5504 Prototype exponential backoff in with_transaction.

Using the repro script in Jira, which runs 200 concurrent transactions in 200 threads all updating the same document, this change shows a significant reduction in wasted retry attempts and in latency (at every percentile from p50 to p100). Before this change:

$ python3.13t repro-with_transaction-write-conflict-storm.py
Completed 200 transactions in 200 threads in 4.8626720905303955 seconds
Total retry attempts: 8132
avg latency: 3.04s p50: 3.36s p90: 4.59s p99: 4.83s p100: 4.84s

After (with 50ms initial backoff, 1000ms max backoff, and full jitter, backoff starting on the second retry attempt):

$ python3.13t repro-with_transaction-write-conflict-storm.py
Completed 200 transactions in 200 threads in 4.251200914382935 seconds
Total retry attempts: 1089
avg latency: 1.53s p50: 1.45s p90: 2.77s p99: 3.69s p100: 4.24s

Backoff starting on the first retry attempt appears to work even better:

$ python3.13t repro-with_transaction-write-conflict-storm.py
Completed 200 transactions in 200 threads in 3.272695779800415 seconds
Total retry attempts: 886
avg latency: 1.33s p50: 1.21s p90: 2.50s p99: 3.22s p100: 3.25s
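For readers unfamiliar with full jitter, here is a minimal sketch of how the delay before each retry could be computed with the parameters quoted above (50ms initial backoff, 1000ms max backoff). The helper name `backoff_delay`, the `first_backoff_attempt` parameter, and the doubling growth factor are assumptions made for illustration; this is not the code from the PR.

```python
import random

INITIAL_BACKOFF = 0.050  # 50ms initial backoff (value quoted in the PR description)
MAX_BACKOFF = 1.0        # 1000ms max backoff (value quoted in the PR description)


def backoff_delay(retry_attempt, first_backoff_attempt=1, rng=random):
    """Return the sleep time in seconds before the given retry (1-based).

    Hypothetical helper: the name, signature, and the doubling factor are
    assumptions for illustration, not the driver's implementation.
    """
    if retry_attempt < first_backoff_attempt:
        # Retries before the configured starting attempt happen immediately.
        return 0.0
    exponent = retry_attempt - first_backoff_attempt
    # The upper bound grows exponentially and is capped at MAX_BACKOFF.
    cap = min(MAX_BACKOFF, INITIAL_BACKOFF * 2 ** exponent)
    # Full jitter: pick a uniformly random delay between 0 and the bound.
    return rng.uniform(0.0, cap)


if __name__ == "__main__":
    for attempt in range(1, 8):
        print(attempt, round(backoff_delay(attempt), 3))
```

Passing `first_backoff_attempt=2` models the variant where backoff starts on the second retry, and the default of 1 models the variant where it starts on the first retry.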

Note: I'm using free-threaded mode (python3.13t) to make this repro more similar to the behavior of other languages and other deployment types (e.g. many single-threaded clients running on different machines).

NoahStapp · 2025-08-19T18:16:02Z

Can you add an async version of the benchmark? Having both APIs be tested before merging this into the backpressure branch would be ideal.

ShaneHarvey · 2025-08-19T18:30:28Z

Done. Added the script to the Jira ticket. The async version still shows a significant reduction in the number of wasted retries and lower p50 to p90 latency, but little or no benefit for p99 and p100 latency.

Before:

$ python3.13 repro-storm-async.py
Completed 200 concurrent async transactions in 3.467694044113159 seconds
Total retry attempts: 5950
avg latency: 1.87s p50: 2.04s p90: 3.22s p99: 3.45s p100: 3.46s

After:

$ python3.13 repro-storm-async.py
Completed 200 concurrent async transactions in 3.5634748935699463 seconds
Total retry attempts: 887
avg latency: 1.48s p50: 1.41s p90: 2.81s p99: 3.49s p100: 3.56s

NoahStapp · 2025-08-19T18:40:18Z

Async sees significantly less improvement with the backoff, but I'd say that's expected. Asyncio's cooperative multitasking structure already prevents a given operation from retrying before the other concurrent async tasks have had a chance to run (assuming the async/await code is written correctly).
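To illustrate the cooperative scheduling point, here is a small self-contained simulation. It does not talk to MongoDB: `FakeDocument`, `WriteConflict`, and `run_with_backoff` are hypothetical stand-ins, and the timing values are arbitrary. A task that hits a conflict can only retry after it awaits, which gives the other tasks a chance to run first.

```python
import asyncio
import random


class WriteConflict(Exception):
    """Stand-in for a transient transaction error (hypothetical)."""


class FakeDocument:
    """Simulated document that only one transaction may modify at a time."""

    def __init__(self):
        self.locked = False
        self.value = 0

    async def update_in_transaction(self):
        if self.locked:
            raise WriteConflict()
        self.locked = True
        try:
            await asyncio.sleep(0.001)  # simulated server round trip
            self.value += 1
        finally:
            self.locked = False


async def run_with_backoff(doc, stats, initial=0.005, maximum=0.1):
    """Retry the simulated transaction with exponential backoff and full jitter."""
    attempt = 0
    while True:
        try:
            await doc.update_in_transaction()
            return
        except WriteConflict:
            attempt += 1
            stats["retries"] += 1
            cap = min(maximum, initial * 2 ** (attempt - 1))
            # Awaiting here yields to the event loop, so other tasks run
            # before this task gets to retry.
            await asyncio.sleep(random.uniform(0, cap))


async def main(tasks=20):
    doc, stats = FakeDocument(), {"retries": 0}
    await asyncio.gather(*(run_with_backoff(doc, stats) for _ in range(tasks)))
    return doc.value, stats["retries"]


value, retries = asyncio.run(main())
print(f"final value: {value}, total retries: {retries}")
```

The retry count varies from run to run because of the jitter; the final value always equals the number of tasks.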

PYTHON-5505 Prototype system overload retry loop for all operations (#2497)

All commands that fail with the "Retryable" error label will be retried up to 3 times. When the error includes the "SystemOverloaded" error label we apply exponential backoff with jitter before attempting a retry.

PYTHON-5506 Prototype adaptive token bucket retry (#2501)

Add adaptive token bucket based retry policy. Successfully completed commands deposit 0.1 token. Failed retry attempts consume 1 token. A retry is only permitted if there is an available token. Token bucket starts full with the maximum 1000 tokens.

PYTHON-5505 Use proper RetryableError and SystemOverloadedError labels
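The token bucket policy described in the PYTHON-5506 commit message can be sketched roughly as follows. The class and method names are hypothetical, and the sketch assumes the token is taken at the moment a retry is permitted; only the numbers (0.1 token per success, 1 token per retry, 1000 token maximum, bucket starts full) come from the commit message.

```python
import threading


class RetryTokenBucket:
    """Hypothetical sketch of an adaptive retry token bucket.

    Not the driver's implementation: names and structure are assumptions.
    """

    def __init__(self, capacity=1000.0, deposit=0.1, retry_cost=1.0):
        self.capacity = capacity
        self.deposit = deposit
        self.retry_cost = retry_cost
        self.tokens = capacity  # the bucket starts full
        self._lock = threading.Lock()

    def record_success(self):
        """Deposit a fraction of a token after a successful command."""
        with self._lock:
            self.tokens = min(self.capacity, self.tokens + self.deposit)

    def try_consume_retry(self):
        """Return True and take one token if a retry is permitted."""
        with self._lock:
            if self.tokens >= self.retry_cost:
                self.tokens -= self.retry_cost
                return True
            return False


if __name__ == "__main__":
    bucket = RetryTokenBucket(capacity=2.0)
    print([bucket.try_consume_retry() for _ in range(3)])
```

With these numbers, roughly ten successful commands are needed to pay for one retry, so retries are throttled once failures become frequent relative to successes.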

DRIVERS-1934 POC exponential backoff in withTransaction

cec7bc2

ShaneHarvey changed the base branch from master to backpressure August 19, 2025 15:44

PYTHON-5504 Start backoff on first retry not second

86172fc

ShaneHarvey marked this pull request as ready for review August 19, 2025 17:17

ShaneHarvey requested a review from a team as a code owner August 19, 2025 17:17

ShaneHarvey requested a review from NoahStapp August 19, 2025 17:17

ShaneHarvey changed the title ~~DRIVERS-1934 POC exponential backoff in withTransaction~~ PYTHON-5504 Prototype exponential backoff in with_transaction Aug 19, 2025

NoahStapp approved these changes Aug 19, 2025

ShaneHarvey merged commit cf7a1aa into mongodb:backpressure Aug 19, 2025
31 checks passed

ShaneHarvey merged 2 commits into mongodb:backpressure from ShaneHarvey:PYTHON-5504
