feat(NODE-7142): Exponential backoff and jitter in retry loops#4871

Open
baileympearson wants to merge 11 commits into main from NODE-7142-backpressure-backoff

Conversation


@baileympearson baileympearson commented Feb 10, 2026

Description

Summary of Changes

Implements the Client Backpressure specification, which adds automatic retry with exponential backoff for commands that fail with SystemOverloadedError + RetryableError labels.

Note:
this PR also implements NODE-7366: clarified error propagation when multiple NoWritesPerformed errors are encountered during retries. This is a prerequisite for backpressure since the multi-retry loop can now encounter more than two errors.

Notes for Reviewers

The retry loop was restructured to match the spec pseudocode. The canRetry() function centralizes all retry eligibility logic:

  • SystemOverloadedError + RetryableError → always retryable (including runCommand)
  • runCommand without overload labels → never retryable
  • Standard retryable read/write errors → existing behavior preserved
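
The three rules above can be sketched as a standalone predicate. This is an illustrative approximation of the eligibility logic, not the driver's actual `canRetry` implementation; the `RetryContext` shape and field names are assumptions introduced for the example.

```typescript
// Hypothetical flattened view of the retry-eligibility rules.
interface RetryContext {
  isRunCommand: boolean;
  hasSystemOverloadedLabel: boolean;
  hasRetryableLabel: boolean;
  isRetryableReadOrWriteError: boolean; // existing retryable read/write classification
}

function canRetrySketch(ctx: RetryContext): boolean {
  // SystemOverloadedError + RetryableError: always retryable, even for runCommand,
  // regardless of retryReads/retryWrites settings.
  if (ctx.hasSystemOverloadedLabel && ctx.hasRetryableLabel) return true;
  // runCommand without the overload labels is never retried.
  if (ctx.isRunCommand) return false;
  // Otherwise defer to the pre-existing retryable read/write behavior.
  return ctx.isRetryableReadOrWriteError;
}
```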

The maxAttempts/attemptsMade mechanism on AbstractOperation exists specifically for commitTransaction, which splits its retry budget across two executeOperation calls (first with original write concern, second with majority).

What is the motivation for this change?

NODE-7142
NODE-7366

Release Highlight

Client Backpressure

The driver now implements the Client Backpressure specification. When a MongoDB server responds with a SystemOverloadedError, the driver will automatically retry the operation with exponential backoff and jitter, up to 5 retries. A per-client token bucket rate-limits retry attempts during prolonged overload to protect server goodput.
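
For intuition, exponential backoff with full jitter can be sketched as below. The constants (`BASE_DELAY_MS`, `MAX_DELAY_MS`) are illustrative assumptions, not values taken from the specification or this PR.

```typescript
// Minimal sketch of exponential backoff with full jitter.
// Constants are placeholders; the spec defines the real values.
const BASE_DELAY_MS = 100;
const MAX_DELAY_MS = 10_000;

function backoffDelayMs(attempt: number): number {
  // Double the base delay each attempt, capped at MAX_DELAY_MS...
  const exponential = Math.min(MAX_DELAY_MS, BASE_DELAY_MS * 2 ** attempt);
  // ...then draw uniformly from [0, exponential) ("full jitter"),
  // which decorrelates retries from many overloaded clients.
  return Math.random() * exponential;
}
```

Jitter matters here because without it, every client that saw the same overload error would retry at the same instant, re-creating the spike the backoff is meant to smooth out.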

Double check the following

  • Lint is passing (npm run check:lint)
  • Self-review completed using the steps outlined here
  • PR title follows the correct format: type(NODE-xxxx)[!]: description
    • Example: feat(NODE-1234)!: rewriting everything in coffeescript
  • Changes are covered by tests
  • New TODOs have a related JIRA ticket

@tadjik1 tadjik1 changed the title WIP - exponential backoff + jitter PR feat(NODE-7142): Exponential backoff and jitter in retry loops Feb 19, 2026
@tadjik1 tadjik1 force-pushed the NODE-7142-backpressure-backoff branch from bcf6a51 to 091e989 Compare February 19, 2026 10:00
Comment on lines +1253 to +1264
if (session.transaction.state === TxnState.STARTING_TRANSACTION) {
  if (document.ok === 1) {
    session.transaction.transition(TxnState.TRANSACTION_IN_PROGRESS);
  } else {
    const error = new MongoServerError(document.toObject());
    const isRetryableError = error.hasErrorLabel(MongoErrorLabel.RetryableError);

    if (!isRetryableError) {
      session.transaction.transition(TxnState.TRANSACTION_IN_PROGRESS);
    }
  }
}

@tadjik1 tadjik1 Feb 19, 2026

The idea here is to defer the transition into TRANSACTION_IN_PROGRESS until the server's response is known:

  • success (ok === 1) - the server accepted the command, so the transaction is in progress.
  • non-retryable error - the server processed something, so the transaction state should advance and subsequent commands don't re-send startTransaction.
  • retryable error - the session stays in STARTING_TRANSACTION so the retry can include startTransaction: true.

Previously, applySession transitioned STARTING_TRANSACTION -> TRANSACTION_IN_PROGRESS before the command was sent to the server (line 1205 removed).

  }
} catch (error) {
  if (options.session != null && !(error instanceof MongoServerError)) {
    updateSessionFromResponse(options.session, MongoDBResponse.empty);

This will consequently transition the transaction into the TRANSACTION_IN_PROGRESS state (as .empty carries no error and no error labels).

Comment on lines +598 to +600
if (!options.timeoutContext || options.timeoutContext.clearServerSelectionTimeout) {
  timeout?.clear();
}

@tadjik1 tadjik1 Feb 19, 2026

This fixes a pre-existing bug on main that was surfaced by the backpressure tests. When selectServer is called without a timeoutContext, it creates a local Timeout.expires(30000). On sharded clusters with pinned sessions, that local timeout was never cleared, so it fired 30 seconds later as an unhandled rejection (Evergreen was red only on sharded topologies, while all tests passed on the others).

Existing CSOT (clearServerSelectionTimeout = false) skips this cleanup because it manages its own timeout.

)
);

function canRetry(operation: AbstractOperation, error: MongoError) {

This function centralizes retry eligibility; previously the logic was scattered across the old retry loop. Key idea: SystemOverloadedError + RetryableError is always retryable, regardless of retryReads/retryWrites settings or operation type (including runCommand).

  readPreference: ReadPreference.primary,
  bypassPinningCheck: true
});
operation.maxAttempts = MAX_RETRIES + 1;

@tadjik1 tadjik1 Feb 19, 2026

The problem is that commitTransaction is split across two executeOperation calls: the first with the original write concern, the second (on a retryable error) with the majority write concern.

The total retry budget is MAX_RETRIES + 1 (6 attempts). After the first executeOperation "spends" some attempts, remainingAttempts is computed from attemptsMade and passed as maxAttempts to the second operation.
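
The budget-splitting arithmetic can be sketched as follows. `MAX_RETRIES`, `maxAttempts`, and `attemptsMade` follow the names used in this PR; the surrounding shape (`OperationAttempts`, `remainingAttempts`) is a hypothetical simplification, not the driver's code.

```typescript
// Sketch: one retry budget shared across two executeOperation calls.
const MAX_RETRIES = 5;

interface OperationAttempts {
  maxAttempts: number;  // budget granted to this operation
  attemptsMade: number; // attempts actually spent
}

function remainingAttempts(op: OperationAttempts): number {
  // Whatever the first commitTransaction did not spend is handed to the
  // second (majority write concern) operation as its maxAttempts.
  return op.maxAttempts - op.attemptsMade;
}

// First call gets the full budget of MAX_RETRIES + 1 = 6 attempts.
const first: OperationAttempts = { maxAttempts: MAX_RETRIES + 1, attemptsMade: 2 };
// Second call inherits only the unspent portion.
const second: OperationAttempts = { maxAttempts: remainingAttempts(first), attemptsMade: 0 };
```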

@tadjik1 tadjik1 marked this pull request as ready for review February 19, 2026 20:08
@tadjik1 tadjik1 requested a review from a team as a code owner February 19, 2026 20:08
@tadjik1 tadjik1 requested a review from nbbeeken February 19, 2026 20:15