Kushagrathapar/optimize cosmos ci shared build #48210

Draft
kushagraThapar wants to merge 7 commits into Azure:main from kushagraThapar:kushagrathapar/optimize-cosmos-ci-shared-build

Conversation

@kushagraThapar
Member

Optimize Cosmos DB CI Pipeline — Emulator Tests

Summary

Reduce CI agent time and test execution time for the Cosmos DB emulator test pipeline, based on analysis of build #5953527 (82 min wall time, 23.4 agent hours, 30 jobs).

Changes

1. PR-conditional emulator matrix (16 → 11 jobs)

For PR builds, use a reduced matrix that drops redundant JDK variant jobs for Spark and Kafka connectors. Full matrix continues to run on main branch merges.

Dropped for PRs (5 jobs):

| Dropped job | Kept variant |
| --- | --- |
| Spark 3.3 Java 11 | Java 8 |
| Spark 3.4 Java 8 | Java 11 |
| Spark 3.5/Scala 2.12 Java 8 | Java 17 |
| Spark 4.0/Scala 2.13 Java 17 | Java 21 |
| Kafka Java 11 | Java 17 |

Savings: ~5 agent hours per PR

2. Increase Maven build parallelization (1 → 2)

All three stages (Build, TestEmulator, TestVNextEmulator) now use BuildParallelization: 2 instead of 1.

Savings: ~5 min per job × 16 jobs = ~80 min agent time

3. Skip maven-shade-plugin for non-Spark/non-Kafka emulator jobs

Core emulator, long emulator, and encryption jobs don't need Spark/Kafka uber JARs, yet the build step in each emulator job spends ~14 min (88% of its 17 min total) creating 5 Spark uber JARs that non-Spark tests never use. Added BuildOptions: "-Dshade.skip=true" to these 5 matrix entries.

Savings: ~7-8 min per non-Spark job × 5 jobs = ~35 min agent time
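
As a sketch, a non-Spark entry in cosmos-emulator-matrix.json would carry the extra flag roughly like this (the entry name and surrounding schema are illustrative; only the BuildOptions key and flag come from this PR):

```json
{
  "CoreEmulator": {
    "BuildOptions": "-Dshade.skip=true"
  }
}
```

Note that a later commit in this PR adds -Dmaven.antrun.skip=true alongside the shade flag, because the antrun 03-repack phase expects shade output to exist.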

4. Configurable endpoint failover retry constants

NetworkFailureTest#createCollectionWithUnreachableHost takes 121s because it waits for ClientRetryPolicy to exhaust 120 retries × 1s interval.

  • Added COSMOS.CLIENT_ENDPOINT_FAILOVER_MAX_RETRY_COUNT and COSMOS.CLIENT_ENDPOINT_FAILOVER_RETRY_INTERVAL_IN_MS system properties to Configs.java (defaults unchanged: 120 retries, 1000ms)
  • ClientRetryPolicy reads from Configs at each usage point, allowing runtime override
  • NetworkFailureTest overrides to 5 retries × 100ms, restores defaults after test

Savings: 121s → 0.5s per test run
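
The override/restore mechanism can be sketched as below. The two property names come from the PR; the class and getter methods are illustrative, not the actual Configs.java API:

```java
// Sketch of the pattern described above: retry constants are read from
// system properties at each usage point (not cached in a static final),
// so a test can override them at runtime and restore the defaults after.
public final class FailoverConfigSketch {
    static final String MAX_RETRY_PROP =
        "COSMOS.CLIENT_ENDPOINT_FAILOVER_MAX_RETRY_COUNT";
    static final String RETRY_INTERVAL_PROP =
        "COSMOS.CLIENT_ENDPOINT_FAILOVER_RETRY_INTERVAL_IN_MS";

    // Defaults match production: 120 retries, 1000 ms interval.
    static int maxRetryCount() {
        return Integer.parseInt(System.getProperty(MAX_RETRY_PROP, "120"));
    }

    static int retryIntervalInMs() {
        return Integer.parseInt(System.getProperty(RETRY_INTERVAL_PROP, "1000"));
    }

    public static void main(String[] args) {
        // Defaults in effect when no property is set.
        System.out.println(maxRetryCount() + " x " + retryIntervalInMs() + "ms");

        // Test-style override: 5 retries x 100 ms, restored in finally
        // so other tests in the same JVM are unaffected.
        System.setProperty(MAX_RETRY_PROP, "5");
        System.setProperty(RETRY_INTERVAL_PROP, "100");
        try {
            System.out.println(maxRetryCount() + " x " + retryIntervalInMs() + "ms");
        } finally {
            System.clearProperty(MAX_RETRY_PROP);
            System.clearProperty(RETRY_INTERVAL_PROP);
        }
        System.out.println(maxRetryCount());
    }
}
```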

5. Poll instead of fixed sleep in ChangeFeedProcessor tests

IncrementalChangeFeedProcessorTest.validateChangeFeedProcessing previously did Thread.sleep(sleepTime) for the full duration (10-50s) even if documents arrived in 1-2s. Replaced with a polling loop that checks every 100ms and returns as soon as all documents are received.

Savings: 5-40s per CFP test invocation
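
A minimal sketch of the sleep-to-poll change, assuming a thread-safe set of received documents (the helper name and simulation here are hypothetical; the real test tracks receivedDocuments inside the change feed processor handler):

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class PollSketch {
    // Before: Thread.sleep(maxWaitMs) unconditionally slept the full duration.
    // After: check every 100 ms and return as soon as all documents arrived,
    // keeping maxWaitMs as the ceiling.
    static boolean waitForDocuments(Set<String> receivedDocuments,
                                    int expectedCount,
                                    long maxWaitMs) throws InterruptedException {
        long deadline = System.currentTimeMillis() + maxWaitMs;
        while (System.currentTimeMillis() < deadline) {
            if (receivedDocuments.size() >= expectedCount) {
                return true;
            }
            Thread.sleep(100);
        }
        return receivedDocuments.size() >= expectedCount;
    }

    public static void main(String[] args) throws Exception {
        Set<String> docs = ConcurrentHashMap.newKeySet();
        // Simulate documents arriving ~300 ms after the wait begins.
        new Thread(() -> {
            try { Thread.sleep(300); } catch (InterruptedException ignored) {}
            docs.add("doc1");
            docs.add("doc2");
        }).start();

        long start = System.currentTimeMillis();
        boolean ok = waitForDocuments(docs, 2, 10_000);
        long elapsed = System.currentTimeMillis() - start;
        // Returns well under the 10 s ceiling once both documents are seen.
        System.out.println(ok && elapsed < 5_000);
    }
}
```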

Impact Summary

| Metric | Before | After | Improvement |
| --- | --- | --- | --- |
| PR emulator jobs | 16 | 11 | -31% |
| Agent time (PR) | ~16.4 hrs | ~10 hrs | -39% |
| Long emulator job | ~78 min | ~62 min | -20% |
| NetworkFailureTest | 121s | 0.5s | -99.6% |

Files Changed

Pipeline config

  • eng/pipelines/templates/stages/cosmos-sdk-client.yml — PR-conditional matrix, build parallelization, shade skip
  • eng/pipelines/templates/stages/cosmos-emulator-matrix.json — Added BuildOptions per job
  • eng/pipelines/templates/stages/cosmos-emulator-matrix-pr.json — New reduced matrix for PRs

Production code (no behavior change)

  • sdk/cosmos/azure-cosmos/.../Configs.java — New configurable retry properties
  • sdk/cosmos/azure-cosmos/.../ClientRetryPolicy.java — Read retry constants from Configs

Test code

  • sdk/cosmos/azure-cosmos-tests/.../NetworkFailureTest.java — Override retry for fast execution
  • sdk/cosmos/azure-cosmos-tests/.../IncrementalChangeFeedProcessorTest.java — Poll instead of sleep

Testing

These changes will be validated by the CI pipeline itself. The production code changes (Configs/ClientRetryPolicy) maintain identical defaults — no behavioral change unless system properties are explicitly set.

kushagraThapar and others added 3 commits March 3, 2026 09:07
1. PR-conditional emulator matrix (16 → 11 jobs):
   Drops redundant JDK variants for Spark/Kafka in PR builds.
   Full matrix on main merges.

   Dropped for PRs (5 jobs, ~5 agent hours saved):
   - Spark 3.3 Java 11 (keeping Java 8)
   - Spark 3.4 Java 8 (keeping Java 11)
   - Spark 3.5/Scala 2.12 Java 8 (keeping Java 17)
   - Spark 4.0/Scala 2.13 Java 17 (keeping Java 21)
   - Kafka Java 11 (keeping Java 17)

2. Increase BuildParallelization from 1 to 2 in all stages
   (Build, TestEmulator, TestVNextEmulator).

3. Skip maven-shade-plugin for non-Spark/non-Kafka emulator jobs:
   Core emulator, long emulator, and encryption jobs don't need
   Spark/Kafka uber JARs. Adding -Dshade.skip=true saves ~90s of
   shade plugin execution per Spark module × 5 modules = ~7-8 min
   per non-Spark job (5 jobs × 7 min = ~35 min agent time saved).

4. Remove outdated comment about emulator download time.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…properties

NetworkFailureTest#createCollectionWithUnreachableHost takes 121s because
it waits for ClientRetryPolicy to exhaust 120 retries × 1s interval.

Changes:
- Add COSMOS.CLIENT_ENDPOINT_FAILOVER_MAX_RETRY_COUNT and
  COSMOS.CLIENT_ENDPOINT_FAILOVER_RETRY_INTERVAL_IN_MS to Configs.java
  (defaults: 120 retries, 1000ms — no behavior change in production)
- ClientRetryPolicy reads from Configs at each usage point (not cached
  in final static), allowing runtime override via system properties
- NetworkFailureTest sets 5 retries × 100ms at test start, restores
  defaults in finally block → test completes in ~0.5s instead of 121s
- Other tests in the same JVM are unaffected (properties restored)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Replace Thread.sleep(sleepTime) in validateChangeFeedProcessing with a
polling loop that returns as soon as all documents are received. The
previous implementation always slept the full duration (10-50s) even
if documents arrived in 1-2s.

The polling loop checks every 100ms if receivedDocuments.size() has
reached the expected count, with sleepTime as the maximum timeout.

Estimated savings: 5-40s per test invocation depending on how quickly
documents are processed by the change feed processor.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@github-actions bot added the Cosmos label Mar 3, 2026
kushagraThapar and others added 4 commits March 3, 2026 10:42
The antrun 03-repack phase expects shade output (native .jnilib/.so
files in target/tmp/). When -Dshade.skip=true, the shade output doesn't
exist and antrun fails with 'Could not find file'. Add
-Dmaven.antrun.skip=true alongside -Dshade.skip=true.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The test step runs 'clean verify' which recompiles everything from
scratch, including Spark shade. Our BuildOptions only affected the
build step. Add -Dshade.skip=true -Dmaven.antrun.skip=true to
AdditionalArgs for non-Spark jobs so it flows into TestOptions too.

Keep BuildOptions for the build step as well (both steps need it).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add BuildOptions parameter through ci.yml → ci.tests.yml → build-and-test.yml
pipeline chain. Defaults to empty string (no behavior change for other SDKs).

Cosmos Build stage sets BuildOptions to '-Dshade.skip=true -Dmaven.antrun.skip=true'
to skip Spark/Kafka uber JAR creation during unit test matrix jobs, saving ~14 min
per job. The release artifact deploy step is unaffected.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Each Spark emulator job previously compiled ALL 14 modules including
other Spark versions it doesn't test, wasting ~11 min per job on
unnecessary shade+compile.

Changes:
- generate-project-list.ps1: Check for ProjectListOverride env var
  at the top. If set, use it directly and skip normal computation.
  Defaults to empty (no behavior change for other SDKs).
- Emulator matrix JSONs: Add ProjectListOverride for each Spark and
  Kafka job with only the modules they need (core + their specific
  Spark/Kafka module).

Example: Spark 3.5/2.13 job previously built 14 modules (41 min test
step). Now builds only 6 modules, saving ~11 min per Spark job.

Estimated savings: ~11 min × 9 Spark jobs + ~5 min × 2 Kafka jobs
= ~109 min agent time per full CI run.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>