-
Notifications
You must be signed in to change notification settings - Fork 247
DRIVERS-3391: Clarify withTransaction CSOT timeout error with backoff+jitter #1890
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Changes from all commits
47c94b1
14870f2
03f12b3
5ef3afd
87b3cf8
160d1fb
8aad52a
634a14d
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.
| Original file line number | Diff line number | Diff line change | ||
|---|---|---|---|---|
|
|
@@ -123,8 +123,8 @@ This method should perform the following sequence of actions: | |||
|
|
||||
| 2. If `transactionAttempt` > 0: | ||||
|
|
||||
| 1. If elapsed time + `backoffMS` > `TIMEOUT_MS`, then raise the previously encountered error. If the elapsed time of | ||||
| `withTransaction` is less than TIMEOUT_MS, calculate the backoffMS to be | ||||
| 1. If elapsed time + `backoffMS` > `TIMEOUT_MS`, then raise the previously encountered error (see Note 1 below). If | ||||
| the elapsed time of `withTransaction` is less than TIMEOUT_MS, calculate the backoffMS to be | ||||
| `jitter * min(BACKOFF_INITIAL * 1.5 ** (transactionAttempt - 1), BACKOFF_MAX)`. sleep for `backoffMS`. | ||||
|
|
||||
| 1. jitter is a random float between \[0, 1) | ||||
|
|
@@ -162,7 +162,8 @@ This method should perform the following sequence of actions: | |||
| committed a transaction, propagate the callback's error to the caller of `withTransaction` and return | ||||
| immediately. | ||||
|
|
||||
| 4. Otherwise, propagate the callback's error to the caller of `withTransaction` and return immediately. | ||||
| 4. Otherwise, propagate the callback's error (see Note 1 below) to the caller of `withTransaction` and return | ||||
| immediately. | ||||
|
|
||||
| 8. If the ClientSession is in the "no transaction", "transaction aborted", or "transaction committed" state, assume the | ||||
| callback intentionally aborted or committed the transaction and return immediately. | ||||
|
|
@@ -178,10 +179,21 @@ This method should perform the following sequence of actions: | |||
|
|
||||
| 2. If the `commitTransaction` error includes a "TransientTransactionError" label, jump back to step two. | ||||
|
|
||||
| 3. Otherwise, propagate the `commitTransaction` error to the caller of `withTransaction` and return immediately. | ||||
| 3. Otherwise, propagate the `commitTransaction` error (see Note 1 below) to the caller of `withTransaction` and | ||||
| return immediately. | ||||
|
|
||||
| 11. The transaction was committed successfully. Return immediately. | ||||
|
|
||||
| ______________________________________________________________________ | ||||
|
|
||||
| **Note 1:** When the `TIMEOUT_MS` (calculated in step [1.3](#sequence-of-actions)) is reached we MUST report a timeout | ||||
| error wrapping the last error that was encountered which triggered the retry behavior. If `timeoutMS` is set, then | ||||
| timeout error is a special type which is defined in CSOT | ||||
| [specification](https://github.com/mongodb/specifications/blob/master/source/client-side-operations-timeout/client-side-operations-timeout.md#errors) | ||||
| , If `timeoutMS` is not set, then propagate it as timeout error if the language allows to expose the underlying error as | ||||
| a cause of a timeout error (see `makeTimeoutError` below in [pseudo-code](#pseudo-code)). If timeout error is thrown | ||||
| then it SHOULD expose error label(s) from the transient error. | ||||
|
|
||||
| ##### Pseudo-code | ||||
|
|
||||
| This method can be expressed by the following pseudo-code: | ||||
|
|
@@ -203,7 +215,7 @@ withTransaction(callback, options) { | |||
| BACKOFF_MAX); | ||||
|
|
||||
| if (Date.now() + backoff - startTime >= timeout) { | ||||
| throw lastError; | ||||
| throw makeTimeoutError(lastError); | ||||
| } | ||||
| sleep(backoff); | ||||
| } | ||||
|
|
@@ -220,9 +232,12 @@ withTransaction(callback, options) { | |||
| this.abortTransaction(); | ||||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Also out of diff, but can you also update the prose description accordingly? |
||||
| } | ||||
|
|
||||
| if (error.hasErrorLabel("TransientTransactionError") && | ||||
| Date.now() - startTime < timeout) { | ||||
| continue retryTransaction; | ||||
| if (error.hasErrorLabel("TransientTransactionError")) { | ||||
| if (Date.now() - startTime < timeout) { | ||||
| continue retryTransaction; | ||||
| } else { | ||||
| throw makeTimeoutError(error) | ||||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. As i understand based on the PR and ticket descriptions the change in this PR is intended to clarify CSOT behaviour in case of a timeout. The 1) CSOT spec forbids changing the legacy path when
client-side-operations-timeout.md (lines 54–59):
2) Convenient transactions prose specifies propagating the previously encountered error without wrapping it in a timeout exception
Step 2.1:
This describes throwing the same error that triggered the retry loop (the previously encountered Given the above, i think we should:
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Good points 👍
The key question: Timeout is the only exit for transient errors: In the retry loop, transient errors keep getting retried until timeout is reached. The timeout becomes the actual failure reason from the user's perspective. (The transient error is a contributing factor but not the ultimate cause of failure.)
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Wrapping the last transient error in a timeout exception is much clearer. Otherwise the user will get an error that is not actually the cause of their The legacy workflow potentially broken by this clarification would be a user explicitly catching all of the transient exceptions for custom handling. I agree with Nabil that such a use case seems at odds with the
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. To clarify: I’m not arguing against wrapping transient errors in a timeout exception in general. Conceptually, a timeout error is a cleaner terminal outcome for both CSOT and the legacy (pre-CSOT) path. The concern is scope/spec alignment. The ticket/PR is framed as "Clarify withTransaction CSOT timeout", but the pseudocode change also alters the legacy (no timeoutMS) behavior. Is the legacy behavior change intended as part of this CSOT ticket, or is it a separate effort? If it’s intentional, it would be good to make that explicit in the PR title/description (and ticket, if possible) so scope is unambiguous. If this is part of CSOT, the CSOT spec currently requires preserving legacy behavior when timeoutMS is not set:
In any case, the prose needs to be updated alongside the pseudocode, and the legacy prose tests need clarification if legacy behavior is changing. Currently they say:
As written, that appears to describe "propagate the previously encountered error" (which was the intent previously), not "throw a timeout". One more clarification: CSOT tests explicitly verify the timeout exception carries the last transient error as the cause:
Should the legacy timeout exception be required to carry the last transient error as the cause as well? The pseudocode calls |
||||
| } | ||||
| } | ||||
|
|
||||
| throw error; | ||||
|
|
@@ -247,15 +262,16 @@ withTransaction(callback, options) { | |||
| * {ok:0, code: 50, codeName: "MaxTimeMSExpired"} | ||||
| * {ok:1, writeConcernError: {code: 50, codeName: "MaxTimeMSExpired"}} | ||||
| */ | ||||
| lastError = error; | ||||
| if (Date.now() - startTime >= timeout) { | ||||
| throw makeTimeoutError(error); | ||||
| } | ||||
| if (!isMaxTimeMSExpiredError(error) && | ||||
| error.hasErrorLabel("UnknownTransactionCommitResult") && | ||||
| Date.now() - startTime < timeout) { | ||||
| error.hasErrorLabel("UnknownTransactionCommitResult")) { | ||||
| continue retryCommit; | ||||
| } | ||||
|
|
||||
| if (error.hasErrorLabel("TransientTransactionError") && | ||||
| Date.now() - startTime < timeout) { | ||||
| lastError = error; | ||||
| if (error.hasErrorLabel("TransientTransactionError")) { | ||||
| continue retryTransaction; | ||||
| } | ||||
|
|
||||
|
|
@@ -266,6 +282,10 @@ withTransaction(callback, options) { | |||
| break; // Transaction was successful | ||||
| } | ||||
| } | ||||
|
|
||||
| function makeTimeoutError(error) { | ||||
| return getCSOTTimeoutIfSet() != null ? createCSOTMongoTimeoutException(error) : createLegacyMongoTimeoutException(error); | ||||
| } | ||||
| ``` | ||||
|
|
||||
| ### ClientSession must provide access to a MongoClient | ||||
|
|
@@ -419,6 +439,9 @@ provides an implementation of a technique already described in the MongoDB 4.0 d | |||
|
|
||||
| ## Changelog | ||||
|
|
||||
| - 2026-02-17: Clarify expected error when timeout is reached | ||||
| [DRIVERS-3391](https://jira.mongodb.org/browse/DRIVERS-3391). | ||||
|
|
||||
| - 2025-11-20: withTransaction applies exponential backoff when retrying. | ||||
|
|
||||
| - 2024-09-06: Migrated from reStructuredText to Markdown. | ||||
|
|
||||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is out of diff: but we also need to raise to clarify the error handling behavior on line 206.