[GATEWAY V2]: Bifurcate connect / connection-acquire timeout between Gateway V1 and Gateway V2 endpoints. #48174
jeet1995 wants to merge 120 commits into Azure:main
* fix few tests part 2 --------- Co-authored-by: Annie Liang <anniemac@Annies-MacBook-Pro.local>
…ning effort configuration (Azure#47772) Co-authored-by: Xiting Zhang <xitzhang@microsoft.com>
* [VoiceLive]Release 1.0.0-beta.4 Updated release date for version 1.0.0-beta.4 and added feature details. * Revise CHANGELOG for clarity and bug fixes Updated changelog to remove breaking changes section and added details about bug fixes.
…Java-5433741 (Azure#46952) * Configurations: 'specification/nginx/Nginx.Management/tspconfig.yaml', API Version: 2025-03-01-preview, SDK Release Type: beta, and CommitSHA: 'aae85aa3e7e4fda95ea2d3abac0ba1d8159db214' in SpecRepo: 'https://github.com/Azure/azure-rest-api-specs' Pipeline run: https://dev.azure.com/azure-sdk/internal/_build/results?buildId=5433741 Refer to https://eng.ms/docs/products/azure-developer-experience/develop/sdk-release/sdk-release-prerequisites to prepare for SDK release. * Configurations: 'specification/nginx/Nginx.Management/tspconfig.yaml', API Version: 2025-03-01-preview, SDK Release Type: beta, and CommitSHA: 'de8103ff8e94ea51c56bb22094ded5d2dfc45a6a' in SpecRepo: 'https://github.com/Azure/azure-rest-api-specs' Pipeline run: https://dev.azure.com/azure-sdk/internal/_build/results?buildId=5857234 Refer to https://eng.ms/docs/products/azure-developer-experience/develop/sdk-release/sdk-release-prerequisites to prepare for SDK release. --------- Co-authored-by: Weidong Xu <weidxu@microsoft.com>
false can't be assigned to int in java. Updating type to boolean
* Deprecating azure-resourcemanager-mixedreality * Typos * use 1.0.1 as version * Update CHANGELOG.md --------- Co-authored-by: Michael Zappe <michaelzappe@microsoft.com> Co-authored-by: Weidong Xu <weidxu@microsoft.com>
* fix few tests part 3 --------- Co-authored-by: Annie Liang <anniemac@Annies-MacBook-Pro.local> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* Initial regeneration using TypeSpec * Working on migrating tests, adding back convenience APIs that are being kept * Complete most of the migration * Additional work * Stable point before tests * Newer TypeSpec SHA * Add back SearchAudience support * Last changes before testing * Rerecord tests and misc fixes along the way * Fix a few recordings and stress tests * Fix a few recordings and linting * Few more fixes * Another round of recording * Rerun TypeSpec codegen * Remove errant import * Cleanup APIs * Regeneration * Clean up linting
* escape non-ascii character for pkValue --------- Co-authored-by: Annie Liang <anniemac@Annies-MacBook-Pro.local> Co-authored-by: Fabian Meiswinkel <fabianm@microsoft.com>
…k connector 4.43.0 (Azure#47968) * Release azure-cosmos 4.78.0, azure-cosmos-encryption 2.27.0, and Spark connector 4.43.0 --------- Co-authored-by: Annie Liang <anniemac@Annies-MacBook-Pro.local> Co-authored-by: Fabian Meiswinkel <fabianm@microsoft.com>
… Gateway V2 endpoints.
…into AzCosmos_H2ConnectAcquireTimeout # Conflicts: # sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/Configs.java
Pull request overview
This PR improves Azure Cosmos DB Gateway mode behavior when “thin client” (Gateway V2, port 10250) is enabled by applying a shorter per-request TCP connect timeout for data-plane requests, while keeping the existing longer timeout for Gateway V1 metadata requests (port 443). It also expands diagnostics to surface HTTP/2 channel identity and request timeout details, and adds targeted tests (including manual network-manipulation suites) to validate the behavior.
Changes:
- Apply a per-request `CONNECT_TIMEOUT_MILLIS` in `ReactorNettyClient` based on whether the request targets the thin client proxy.
- Introduce thin-client-specific timeout policy wiring and new config plumbing (`COSMOS.THINCLIENT_CONNECTION_TIMEOUT_IN_SECONDS`) plus diagnostics output updates.
- Add/extend unit and fault-injection/manual tests and accompanying docs to validate connect-timeout bifurcation and HTTP/2 connection lifecycle behavior.
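The per-request selection in the first change can be pictured roughly as below. This is only a sketch under stated assumptions: the class `ConnectTimeoutSketch`, the method signature, and the constant names are hypothetical and are not the SDK's actual code; the 45s and 5s values are the ones quoted in the PR description.

```java
import java.time.Duration;

// Hypothetical sketch: class, method, and constant names are illustrative only.
final class ConnectTimeoutSketch {
    // Values quoted in the PR description: 45s for Gateway V1, 5s default for Gateway V2.
    static final Duration GATEWAY_V1_CONNECT_TIMEOUT = Duration.ofSeconds(45);
    static final Duration GATEWAY_V2_CONNECT_TIMEOUT = Duration.ofSeconds(5);

    // Picks the connect timeout (in milliseconds) for a single request.
    static int resolveConnectTimeoutMs(boolean isThinClientRequest) {
        Duration timeout = isThinClientRequest
            ? GATEWAY_V2_CONNECT_TIMEOUT   // data-plane request to the thin client proxy
            : GATEWAY_V1_CONNECT_TIMEOUT;  // metadata request to Gateway V1
        return Math.toIntExact(timeout.toMillis());
    }

    public static void main(String[] args) {
        System.out.println("metadata request: " + resolveConnectTimeoutMs(false) + " ms");
        System.out.println("thin client request: " + resolveConnectTimeoutMs(true) + " ms");
    }
}
```

Keeping the decision in one small pure function makes it easy to unit test separately from the transport layer.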
Reviewed changes
Copilot reviewed 24 out of 24 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/http/ResponseTimeoutAndDelays.java | Adds Duration-based delay representation alongside seconds. |
| sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/http/ReactorNettyRequestRecord.java | Captures HTTP/2/channel identifiers for richer diagnostics. |
| sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/http/ReactorNettyClient.java | Implements per-request connect timeout selection; captures channel IDs via connection observer. |
| sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/http/HttpTimeoutPolicyForGatewayV2.java | Adds Gateway V2 timeout policy class for thin client document operations. |
| sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/http/HttpTimeoutPolicy.java | Routes eligible thin-client document operations to Gateway V2 timeout policies. |
| sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/http/HttpRequest.java | Adds isThinClientRequest flag + fluent setter for transport customization. |
| sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/http/HttpClientConfig.java | Emits gwV2Cto in diagnostics when thin client is enabled. |
| sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/WebExceptionRetryPolicy.java | Switches retry backoff handling to use Duration from timeout policy. |
| sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/ThinClientStoreModel.java | Marks thin-client requests and hardens ByteBuf handling for released buffers. |
| sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/RxGatewayStoreModel.java | Ensures request record is available on success/error; sets request URI on cancellation for diagnostics. |
| sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/RxDocumentServiceRequest.java | Adds useThinClientMode flag and ensures it is preserved during cloning. |
| sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/RxDocumentClientImpl.java | Sets useThinClientMode when routing to thin client store model. |
| sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/DocumentServiceRequestContext.java | Stores per-attempt ReactorNettyRequestRecord for diagnostics enrichment. |
| sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/Configs.java | Introduces thin client connect-timeout config (sysprop/env) with default value. |
| sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/ClientSideRequestStatistics.java | Adds HTTP response timeout, channel IDs, HTTP/2 flag, and e2e policy config to gateway stats serialization. |
| sdk/cosmos/azure-cosmos/CHANGELOG.md | Documents the new Gateway V2 connect-timeout behavior and timeout policies. |
| sdk/cosmos/azure-cosmos-tests/src/test/java/com/azure/cosmos/rx/TestSuiteBase.java | Minor import/format cleanup. |
| sdk/cosmos/azure-cosmos-tests/src/test/java/com/azure/cosmos/implementation/WebExceptionRetryPolicyTest.java | Extends test coverage for thin-client timeout policies and write behavior. |
| sdk/cosmos/azure-cosmos-tests/src/test/java/com/azure/cosmos/implementation/ConfigsTests.java | Adds unit tests for thin client timeout config parsing and request flag defaulting. |
| sdk/cosmos/azure-cosmos-tests/src/test/java/com/azure/cosmos/faultinjection/Http2ConnectionLifecycleTests.java | Adds manual tc netem tests to validate HTTP/2 parent connection survival across real delays/timeouts. |
| sdk/cosmos/azure-cosmos-tests/src/test/java/com/azure/cosmos/faultinjection/Http2ConnectTimeoutBifurcationTests.java | Adds manual iptables/tc tests to validate connect-timeout bifurcation by port/path. |
| sdk/cosmos/azure-cosmos-tests/src/test/java/com/azure/cosmos/faultinjection/FaultInjectionServerErrorRuleOnGatewayV2Tests.java | Updates thin-client FI tests to account for new Gateway V2 timeout behavior. |
| sdk/cosmos/azure-cosmos-tests/NETWORK_DELAY_TESTING_README.md | Documents how to run the new manual network-delay lifecycle tests. |
| sdk/cosmos/azure-cosmos-tests/CONNECT_TIMEOUT_TESTING_README.md | Documents how to run the new manual connect-timeout bifurcation tests. |
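The `Configs.java` row above describes a connect-timeout setting read from a system property or environment variable, with a default. A minimal sketch of that kind of lookup follows. The class name, the precedence (system property first, then environment variable), and the fallback on invalid values are assumptions made for illustration; only the two setting names and the default of 5 seconds come from the PR.

```java
// Hypothetical sketch: precedence and invalid-value handling are assumptions,
// not confirmed behavior of Configs.java.
final class ThinClientTimeoutConfigSketch {
    static final String SYSTEM_PROPERTY = "COSMOS.THINCLIENT_CONNECTION_TIMEOUT_IN_SECONDS";
    static final String ENV_VARIABLE = "COSMOS_THINCLIENT_CONNECTION_TIMEOUT_IN_SECONDS";
    static final int DEFAULT_SECONDS = 5;

    // Returns the first candidate that parses to a positive integer, else the default.
    static int resolve(String systemPropertyValue, String envValue) {
        for (String candidate : new String[] {systemPropertyValue, envValue}) {
            if (candidate == null || candidate.trim().isEmpty()) {
                continue; // not set, try the next source
            }
            try {
                int seconds = Integer.parseInt(candidate.trim());
                if (seconds > 0) {
                    return seconds;
                }
            } catch (NumberFormatException ignored) {
                // unparseable value: fall through to the next source
            }
        }
        return DEFAULT_SECONDS;
    }

    // Reads the real process-level settings.
    static int resolveFromEnvironment() {
        return resolve(System.getProperty(SYSTEM_PROPERTY), System.getenv(ENV_VARIABLE));
    }

    public static void main(String[] args) {
        System.out.println("effective timeout (s): " + resolveFromEnvironment());
    }
}
```

Passing the raw strings into `resolve` keeps the parsing logic testable without mutating process-wide state.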
Comments suppressed due to low confidence (1)
sdk/cosmos/azure-cosmos-tests/src/test/java/com/azure/cosmos/faultinjection/Http2ConnectTimeoutBifurcationTests.java:300
- Method name says “1sFiresOnDroppedSyn”, but the test description and assertions are written for a 5s default connect timeout. Rename the test to match the actual expected behavior (or adjust the timeout setup) so the name stays accurate over time.
@Test(groups = {TEST_GROUP}, timeOut = TEST_TIMEOUT)
public void connectTimeout_GwV2_DataPlane_1sFiresOnDroppedSyn() throws Exception {
// Close and recreate client to ensure no pooled connections exist —
// we need to force a NEW TCP connection which will hit the iptables DROP.
// payload is a slice/derived view; super() owns payload, we still own the container
// this includes scenarios where payloadBuf == EMPTY_BUFFER
if (payloadBuf == Unpooled.EMPTY_BUFFER && content.refCnt() > 0) {
    ReferenceCountUtil.safeRelease(content);
Potentially a common method:
safeRelease(content.refCnt):
if content.refCnt > 0 -> ReferenceCountUtil.safeRelease(content);
Hmm - I would prefer keeping the refCnt check explicit instead of shuffling it into some safeRelease or safeSilentRelease - IMO it makes the code flow more readable.
…into AzCosmos_H2ConnectAcquireTimeout # Conflicts: # sdk/cosmos/azure-cosmos/CHANGELOG.md # sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/ThinClientStoreModel.java
/azp run java - cosmos - tests
Azure Pipelines successfully started running 1 pipeline(s).
Problem
When thin client (Gateway V2) is enabled, metadata requests (port 443, GW V1) and data-plane requests (port 10250, GW V2 HTTP/2) share the same `CONNECT_TIMEOUT_MILLIS` of 45s. If the thin client proxy on port 10250 is unreachable, the SDK waits 45s per connect attempt before failing, which is far too long for a data-plane path that should fail fast and trigger regional failover.

Solution
Bifurcate `CONNECT_TIMEOUT_MILLIS` at the Reactor Netty level based on request type:
- Metadata requests (GW V1, port 443): 45s, unchanged.
- Data-plane requests (GW V2, port 10250): 5s by default.

The timeout is applied per-request via Reactor Netty's immutable `HttpClient.option()`, which returns a new config snapshot without mutating the shared client.

Diagnostic proof: `connectTimeout_Bifurcation_DelayBased`
Both ports receive the same 7s SYN-only delay via Linux `tc netem` + `iptables mangle`. The only difference is `CONNECT_TIMEOUT_MILLIS`.

Full CosmosDiagnostics (data plane failure):
{"userAgent":"azsdk-java-cosmos/4.79.0-beta.1 Linux/6.6.87.2-microsoft-standard-WSL2 JRE/21.0.10|F14","activityId":"4b4dbffd-e1ad-43bd-b4c7-9a8f41de6ada","requestLatencyInMs":29998,"requestStartTimeUTC":"2026-03-03T23:24:30.353063100Z","requestEndTimeUTC":"2026-03-03T23:25:00.352036291Z","responseStatisticsList":[],"supplementalResponseStatisticsList":[],"addressResolutionStatistics":{},"regionsContacted":["central us","east us 2"],"retryContext":{"statusAndSubStatusCodes":[[503,10001]],"retryCount":1,"retryLatency":7320},"metadataDiagnosticsContext":{"metadataDiagnosticList":[{"metaDataName":"CONTAINER_LOOK_UP","startTimeUTC":"2026-03-03T23:24:30.353779798Z","endTimeUTC":"2026-03-03T23:24:37.866011985Z","durationinMS":7512,"activityId":"6faa8b6d-653d-45ee-aae8-28f67b9e01f7","collectionRid":"UYRLAJa2jxQ="},{"metaDataName":"PARTITION_KEY_RANGE_LOOK_UP","startTimeUTC":"2026-03-03T23:24:37.866118644Z","endTimeUTC":"2026-03-03T23:24:37.992319706Z","durationinMS":126}],"empty":false},"serializationDiagnosticsContext":{"serializationDiagnosticsList":null},"gatewayStatisticsList":[{"sessionToken":null,"operationType":"Read","resourceType":"Document","statusCode":503,"subStatusCode":10001,"requestCharge":0.0,"requestTimeline":[{"eventName":"connectionAcquired","startTimeUTC":"2026-03-03T23:24:37.991135806Z","durationInMilliSecs":5028.05566}],"partitionKeyRangeId":null,"responsePayloadSizeInBytes":0,"httpNetworkResponseTimeout":"PT6S","exceptionMessage":"connection timed out after 5000 ms: thin-client-multi-writer-eastus2.documents.azure.com/40.84.77.67:10250","exceptionResponseHeaders":"{x-ms-substatus=10001}","endpoint":"https://x-eastus2.documents.azure.com:10250/...","e2ePolicyCfg":"{e2eto=PT30S, as=}"},{"sessionToken":null,"operationType":"Read","resourceType":"Document","statusCode":503,"subStatusCode":10001,"requestCharge":0.0,"requestTimeline":[{"eventName":"connectionAcquired","startTimeUTC":"2026-03-03T23:24:43.019820747Z","durationInMilliSecs":5004.424708}],"partitionKeyRangeId":null,"responsePayloadSizeInBytes":0,"httpNetworkResponseTimeout":"PT6S","exceptionMessage":"connection timed out after 5000 ms: thin-client-multi-writer-eastus2.documents.azure.com/40.84.77.67:10250","exceptionResponseHeaders":"{x-ms-substatus=10001}","endpoint":"https://x-eastus2.documents.azure.com:10250/...","e2ePolicyCfg":"{e2eto=PT30S, as=}"},{"sessionToken":null,"operationType":"Read","resourceType":"Document","statusCode":503,"subStatusCode":10001,"requestCharge":0.0,"requestTimeline":[{"eventName":"connectionAcquired","startTimeUTC":"2026-03-03T23:24:48.024689171Z","durationInMilliSecs":5005.594693}],"partitionKeyRangeId":null,"responsePayloadSizeInBytes":0,"httpNetworkResponseTimeout":"PT10S","exceptionMessage":"connection timed out after 5000 ms: thin-client-multi-writer-eastus2.documents.azure.com/40.84.77.67:10250","exceptionResponseHeaders":"{x-ms-substatus=10001}","endpoint":"https://x-eastus2.documents.azure.com:10250/...","e2ePolicyCfg":"{e2eto=PT30S, as=}"},{"sessionToken":null,"operationType":"Read","resourceType":"Document","statusCode":503,"subStatusCode":10001,"requestCharge":0.0,"requestTimeline":[{"eventName":"connectionAcquired","startTimeUTC":"2026-03-03T23:24:54.043417607Z","durationInMilliSecs":5228.884334}],"partitionKeyRangeId":null,"responsePayloadSizeInBytes":0,"httpNetworkResponseTimeout":"PT6S","exceptionMessage":"connection timed out after 5000 ms: thin-client-multi-writer-centralus.documents.azure.com/20.15.133.49:10250","exceptionResponseHeaders":"{x-ms-substatus=10001}","endpoint":"https://x-centralus.documents.azure.com:10250/...","e2ePolicyCfg":"{e2eto=PT30S, as=}"},{"sessionToken":null,"operationType":"Read","resourceType":"Document","statusCode":408,"subStatusCode":20008,"requestCharge":0.0,"requestTimeline":[{"eventName":"connectionAcquired","startTimeUTC":"2026-03-03T23:24:59.273240394Z","durationInMilliSecs":1078.506504}],"partitionKeyRangeId":null,"responsePayloadSizeInBytes":0,"httpNetworkResponseTimeout":"PT6S","exceptionMessage":"Request cancelled by client after reaching timeout specified in end-end timeout policy","exceptionResponseHeaders":"{x-ms-substatus=20008}","endpoint":"","e2ePolicyCfg":"{e2eto=PT30S, as=}"}],"samplingRateSnapshot":1.0,"bloomFilterInsertionCountSnapshot":0,"systemInformation":{"usedMemory":"50496 KB","availableMemory":"2046656 KB","availableProcessors":8},"clientCfgs":{"id":9,"machineId":"uuid:2560edc9-6d4d-4e52-87b3-979d8e765c3f","connectionMode":"GATEWAY","numberOfClients":1,"connCfg":{"rntbd":null,"gw":"(cps:1000, nrto:PT1M, icto:PT1M, cto:PT45S, gwV2Cto:PT5S, p:false, http2:(enabled:true, maxc:10, minc:1, maxs:30))","other":"(ed: true, cs: false, rv: true)"}}}

Reading the diagnostic:
- `CONTAINER_LOOK_UP: 7512ms`: metadata on port 443 took 7.5s (absorbed the 7s SYN delay, succeeded)
- `gatewayStatisticsList[0]`: `connectionAcquired: 5028ms`, `"connection timed out after 5000 ms: ...eastus2...:10250"`; the 5s timeout fired
- `gatewayStatisticsList[1]`: `connectionAcquired: 5004ms`, same `connection timed out after 5000 ms`; second attempt, same 5s
- `gatewayStatisticsList[2]`: `connectionAcquired: 5005ms`, same; third attempt on eastus2
- `gatewayStatisticsList[3]`: `connectionAcquired: 5228ms`, `"...centralus...:10250"`; failover to Central US, still 5s timeout
- `gatewayStatisticsList[4]`: `408/20008`; e2e timeout (30s budget) cancelled the fifth attempt
- `connCfg.gw`: `cto:PT45S, gwV2Cto:PT5S`; both timeouts visible

The bifurcation proof:
- `cto:PT45S`: `buildAsyncClient()` succeeded
- `gwV2Cto:PT5S`: `connection timed out after 5000 ms`

Same network condition. Same delay. Different timeouts. Different outcomes.
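When checking a diagnostics dump programmatically, the two timeouts can be pulled out of the `connCfg.gw` string. The following is a rough sketch, not part of the SDK: the class `GatewayConnCfgSketch` and its `extract` method are hypothetical helpers, and the sample string is the `gw` value from the diagnostics above.

```java
import java.time.Duration;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for reading "key:ISO-8601-duration" pairs out of the gw string.
final class GatewayConnCfgSketch {
    // Returns the duration stored under the given key, or null when the key is absent.
    // The lookbehind stops "cto" from matching inside "icto".
    static Duration extract(String gw, String key) {
        Pattern pattern = Pattern.compile("(?<![A-Za-z0-9])" + Pattern.quote(key) + ":(PT[0-9A-Z.]+)");
        Matcher matcher = pattern.matcher(gw);
        return matcher.find() ? Duration.parse(matcher.group(1)) : null;
    }

    public static void main(String[] args) {
        String gw = "(cps:1000, nrto:PT1M, icto:PT1M, cto:PT45S, gwV2Cto:PT5S, p:false, "
            + "http2:(enabled:true, maxc:10, minc:1, maxs:30))";
        System.out.println("Gateway V1 connect timeout: " + extract(gw, "cto"));
        System.out.println("Gateway V2 connect timeout: " + extract(gw, "gwV2Cto"));
    }
}
```

A `null` result for `gwV2Cto` would be consistent with a diagnostics string produced with thin client disabled, since the field is emitted only when thin client is enabled.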
CosmosDiagnostics: what changes for customers
The `clientCfgs.connCfg.gw` diagnostic string gains the new `gwV2Cto` field:
Before: only `cto` is present (for example `cto:PT45S`).
After: `gwV2Cto` appears alongside it (`cto:PT45S, gwV2Cto:PT5S`).
When a data-plane connect times out, each `gatewayStatisticsList` entry shows:
{ "statusCode": 503, "subStatusCode": 10001, "requestTimeline": [{"eventName": "connectionAcquired", "durationInMilliSecs": 5028.05566}], "exceptionMessage": "connection timed out after 5000 ms: thin-client-multi-writer-eastus2.documents.azure.com/40.84.77.67:10250", "exceptionResponseHeaders": "{x-ms-substatus=10001}", "endpoint": "https://thin-client-multi-writer-eastus2.documents.azure.com:10250/..." }

Production code changes
| File | Change |
|---|---|
| Configs.java | `COSMOS.THINCLIENT_CONNECTION_TIMEOUT_IN_SECONDS` (env: `COSMOS_THINCLIENT_CONNECTION_TIMEOUT_IN_SECONDS`), default 5s. New method `getThinClientConnectionTimeoutInSeconds()`. |
| HttpRequest.java | `isThinClientRequest` flag + fluent `withThinClientRequest(boolean)` setter. |
| ThinClientStoreModel.java | `.withThinClientRequest(true)` on thin client path requests. |
| ReactorNettyClient.java | `resolveConnectTimeoutMs(HttpRequest)`, applied per-request via `.option(ChannelOption.CONNECT_TIMEOUT_MILLIS, connectTimeoutMs)`. |
| HttpClientConfig.java | `toDiagnosticsString()` emits `gwV2Cto` alongside `cto`. |

Testing
- `connectTimeout_GwV2_DataPlane_1sFiresOnDroppedSyn`: `iptables -j DROP` SYN on :10250
- `connectTimeout_GwV1_Metadata_UnaffectedByGwV2Drop`: `iptables -j DROP` SYN on :10250 only
- `connectTimeout_GwV2_PreciseTiming`: `iptables -j DROP` SYN, 12s e2e
- `connectTimeout_Bifurcation_DelayBased_...`: `tc netem` SYN-only 7s delay on both ports
- `ConfigsTests`

Test infra: Added `manual-thinclient-network-delay` to `@BeforeSuite`/`@AfterSuite` in TestSuiteBase.java.

Configuration
| System property | Environment variable | Default |
|---|---|---|
| `COSMOS.THINCLIENT_CONNECTION_TIMEOUT_IN_SECONDS` | `COSMOS_THINCLIENT_CONNECTION_TIMEOUT_IN_SECONDS` | 5 |

All SDK Contribution checklist:
General Guidelines and Best Practices
Testing Guidelines
closes #48092