Kushagrathapar/optimize cosmos ci shared build #48210

Draft
kushagraThapar wants to merge 7 commits into Azure:main from kushagraThapar:kushagrathapar/optimize-cosmos-ci-shared-build

Conversation

@kushagraThapar
Member

Optimize Cosmos DB CI Pipeline — Emulator Tests

Summary

Reduce CI agent time and test execution time for the Cosmos DB emulator test pipeline, based on analysis of build #5953527 (82 min wall time, 23.4 agent hours, 30 jobs).

Changes

1. PR-conditional emulator matrix (16 → 11 jobs)

For PR builds, use a reduced matrix that drops redundant JDK variant jobs for Spark and Kafka connectors. Full matrix continues to run on main branch merges.

Dropped for PRs (5 jobs):

| Dropped job | Kept variant |
| --- | --- |
| Spark 3.3 Java 11 | Java 8 |
| Spark 3.4 Java 8 | Java 11 |
| Spark 3.5/Scala 2.12 Java 8 | Java 17 |
| Spark 4.0/Scala 2.13 Java 17 | Java 21 |
| Kafka Java 11 | Java 17 |

Savings: ~5 agent hours per PR

2. Increase Maven build parallelization (1 → 2)

All three stages (Build, TestEmulator, TestVNextEmulator) now use BuildParallelization: 2 instead of 1.

Savings: ~5 min per job × 16 jobs = ~80 min agent time

3. Skip maven-shade-plugin for non-Spark/non-Kafka emulator jobs

Core emulator, long emulator, and encryption jobs don't need Spark/Kafka uber JARs, yet the build step in each emulator job spends ~14 min (88% of its 17 min total) creating 5 Spark uber JARs that non-Spark tests never use. Added BuildOptions: "-Dshade.skip=true" to these 5 matrix entries.

Savings: ~7-8 min per non-Spark job × 5 jobs = ~35 min agent time
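
As a sketch, a non-Spark entry in cosmos-emulator-matrix.json would carry the extra flag roughly like this (the entry name and surrounding schema are illustrative; only the BuildOptions key and flag come from this PR):

```json
{
  "CoreEmulator": {
    "BuildOptions": "-Dshade.skip=true"
  }
}
```

Note that a later commit in this PR adds -Dmaven.antrun.skip=true alongside the shade flag, because the antrun 03-repack phase expects shade output to exist.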

4. Configurable endpoint failover retry constants

NetworkFailureTest#createCollectionWithUnreachableHost takes 121s because it waits for ClientRetryPolicy to exhaust 120 retries × 1s interval.

  • Added COSMOS.CLIENT_ENDPOINT_FAILOVER_MAX_RETRY_COUNT and COSMOS.CLIENT_ENDPOINT_FAILOVER_RETRY_INTERVAL_IN_MS system properties to Configs.java (defaults unchanged: 120 retries, 1000ms)
  • ClientRetryPolicy reads from Configs at each usage point, allowing runtime override
  • NetworkFailureTest overrides to 5 retries × 100ms, restores defaults after test

Savings: 121s → 0.5s per test run
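
The override/restore mechanism can be sketched as below. The two property names come from the PR; the class and getter methods are illustrative, not the actual Configs.java API:

```java
// Sketch of the pattern described above: retry constants are read from
// system properties at each usage point (not cached in a static final),
// so a test can override them at runtime and restore the defaults after.
public final class FailoverConfigSketch {
    static final String MAX_RETRY_PROP =
        "COSMOS.CLIENT_ENDPOINT_FAILOVER_MAX_RETRY_COUNT";
    static final String RETRY_INTERVAL_PROP =
        "COSMOS.CLIENT_ENDPOINT_FAILOVER_RETRY_INTERVAL_IN_MS";

    // Defaults match production: 120 retries, 1000 ms interval.
    static int maxRetryCount() {
        return Integer.parseInt(System.getProperty(MAX_RETRY_PROP, "120"));
    }

    static int retryIntervalInMs() {
        return Integer.parseInt(System.getProperty(RETRY_INTERVAL_PROP, "1000"));
    }

    public static void main(String[] args) {
        // Defaults in effect when no property is set.
        System.out.println(maxRetryCount() + " x " + retryIntervalInMs() + "ms");

        // Test-style override: 5 retries x 100 ms, restored in finally
        // so other tests in the same JVM are unaffected.
        System.setProperty(MAX_RETRY_PROP, "5");
        System.setProperty(RETRY_INTERVAL_PROP, "100");
        try {
            System.out.println(maxRetryCount() + " x " + retryIntervalInMs() + "ms");
        } finally {
            System.clearProperty(MAX_RETRY_PROP);
            System.clearProperty(RETRY_INTERVAL_PROP);
        }
        System.out.println(maxRetryCount());
    }
}
```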

5. Poll instead of fixed sleep in ChangeFeedProcessor tests

IncrementalChangeFeedProcessorTest.validateChangeFeedProcessing previously did Thread.sleep(sleepTime) for the full duration (10-50s) even if documents arrived in 1-2s. Replaced with a polling loop that checks every 100ms and returns as soon as all documents are received.

Savings: 5-40s per CFP test invocation
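
A minimal sketch of the sleep-to-poll change, assuming a thread-safe set of received documents (the helper name and simulation here are hypothetical; the real test tracks receivedDocuments inside the change feed processor handler):

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class PollSketch {
    // Before: Thread.sleep(maxWaitMs) unconditionally slept the full duration.
    // After: check every 100 ms and return as soon as all documents arrived,
    // keeping maxWaitMs as the ceiling.
    static boolean waitForDocuments(Set<String> receivedDocuments,
                                    int expectedCount,
                                    long maxWaitMs) throws InterruptedException {
        long deadline = System.currentTimeMillis() + maxWaitMs;
        while (System.currentTimeMillis() < deadline) {
            if (receivedDocuments.size() >= expectedCount) {
                return true;
            }
            Thread.sleep(100);
        }
        return receivedDocuments.size() >= expectedCount;
    }

    public static void main(String[] args) throws Exception {
        Set<String> docs = ConcurrentHashMap.newKeySet();
        // Simulate documents arriving ~300 ms after the wait begins.
        new Thread(() -> {
            try { Thread.sleep(300); } catch (InterruptedException ignored) {}
            docs.add("doc1");
            docs.add("doc2");
        }).start();

        long start = System.currentTimeMillis();
        boolean ok = waitForDocuments(docs, 2, 10_000);
        long elapsed = System.currentTimeMillis() - start;
        // Returns well under the 10 s ceiling once both documents are seen.
        System.out.println(ok && elapsed < 5_000);
    }
}
```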

Impact Summary

| Metric | Before | After | Improvement |
| --- | --- | --- | --- |
| PR emulator jobs | 16 | 11 | -31% |
| Agent time (PR) | ~16.4 hrs | ~10 hrs | -39% |
| Long emulator job | ~78 min | ~62 min | -20% |
| NetworkFailureTest | 121s | 0.5s | -99.6% |

Files Changed

Pipeline config

  • eng/pipelines/templates/stages/cosmos-sdk-client.yml — PR-conditional matrix, build parallelization, shade skip
  • eng/pipelines/templates/stages/cosmos-emulator-matrix.json — Added BuildOptions per job
  • eng/pipelines/templates/stages/cosmos-emulator-matrix-pr.json — New reduced matrix for PRs

Production code (no behavior change)

  • sdk/cosmos/azure-cosmos/.../Configs.java — New configurable retry properties
  • sdk/cosmos/azure-cosmos/.../ClientRetryPolicy.java — Read retry constants from Configs

Test code

  • sdk/cosmos/azure-cosmos-tests/.../NetworkFailureTest.java — Override retry for fast execution
  • sdk/cosmos/azure-cosmos-tests/.../IncrementalChangeFeedProcessorTest.java — Poll instead of sleep

Testing

These changes will be validated by the CI pipeline itself. The production code changes (Configs/ClientRetryPolicy) maintain identical defaults — no behavioral change unless system properties are explicitly set.

kushagraThapar and others added 3 commits March 3, 2026 09:07
1. PR-conditional emulator matrix (16 → 11 jobs):
   Drops redundant JDK variants for Spark/Kafka in PR builds.
   Full matrix on main merges.

   Dropped for PRs (5 jobs, ~5 agent hours saved):
   - Spark 3.3 Java 11 (keeping Java 8)
   - Spark 3.4 Java 8 (keeping Java 11)
   - Spark 3.5/Scala 2.12 Java 8 (keeping Java 17)
   - Spark 4.0/Scala 2.13 Java 17 (keeping Java 21)
   - Kafka Java 11 (keeping Java 17)

2. Increase BuildParallelization from 1 to 2 in all stages
   (Build, TestEmulator, TestVNextEmulator).

3. Skip maven-shade-plugin for non-Spark/non-Kafka emulator jobs:
   Core emulator, long emulator, and encryption jobs don't need
   Spark/Kafka uber JARs. Adding -Dshade.skip=true saves ~90s of
   shade plugin execution per Spark module × 5 modules = ~7-8 min
   per non-Spark job (5 jobs × 7 min = ~35 min agent time saved).

4. Remove outdated comment about emulator download time.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…properties

NetworkFailureTest#createCollectionWithUnreachableHost takes 121s because
it waits for ClientRetryPolicy to exhaust 120 retries × 1s interval.

Changes:
- Add COSMOS.CLIENT_ENDPOINT_FAILOVER_MAX_RETRY_COUNT and
  COSMOS.CLIENT_ENDPOINT_FAILOVER_RETRY_INTERVAL_IN_MS to Configs.java
  (defaults: 120 retries, 1000ms — no behavior change in production)
- ClientRetryPolicy reads from Configs at each usage point (not cached
  in final static), allowing runtime override via system properties
- NetworkFailureTest sets 5 retries × 100ms at test start, restores
  defaults in finally block → test completes in ~0.5s instead of 121s
- Other tests in the same JVM are unaffected (properties restored)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Replace Thread.sleep(sleepTime) in validateChangeFeedProcessing with a
polling loop that returns as soon as all documents are received. The
previous implementation always slept the full duration (10-50s) even
if documents arrived in 1-2s.

The polling loop checks every 100ms if receivedDocuments.size() has
reached the expected count, with sleepTime as the maximum timeout.

Estimated savings: 5-40s per test invocation depending on how quickly
documents are processed by the change feed processor.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@github-actions bot added the Cosmos label Mar 3, 2026
kushagraThapar and others added 4 commits March 3, 2026 10:42
The antrun 03-repack phase expects shade output (native .jnilib/.so
files in target/tmp/). When -Dshade.skip=true, the shade output doesn't
exist and antrun fails with 'Could not find file'. Add
-Dmaven.antrun.skip=true alongside -Dshade.skip=true.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The test step runs 'clean verify' which recompiles everything from
scratch, including Spark shade. Our BuildOptions only affected the
build step. Add -Dshade.skip=true -Dmaven.antrun.skip=true to
AdditionalArgs for non-Spark jobs so it flows into TestOptions too.

Keep BuildOptions for the build step as well (both steps need it).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add BuildOptions parameter through ci.yml → ci.tests.yml → build-and-test.yml
pipeline chain. Defaults to empty string (no behavior change for other SDKs).

Cosmos Build stage sets BuildOptions to '-Dshade.skip=true -Dmaven.antrun.skip=true'
to skip Spark/Kafka uber JAR creation during unit test matrix jobs, saving ~14 min
per job. The release artifact deploy step is unaffected.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Each Spark emulator job previously compiled ALL 14 modules including
other Spark versions it doesn't test, wasting ~11 min per job on
unnecessary shade+compile.

Changes:
- generate-project-list.ps1: Check for ProjectListOverride env var
  at the top. If set, use it directly and skip normal computation.
  Defaults to empty (no behavior change for other SDKs).
- Emulator matrix JSONs: Add ProjectListOverride for each Spark and
  Kafka job with only the modules they need (core + their specific
  Spark/Kafka module).

Example: Spark 3.5/2.13 job previously built 14 modules (41 min test
step). Now builds only 6 modules, saving ~11 min per Spark job.

Estimated savings: ~11 min × 9 Spark jobs + ~5 min × 2 Kafka jobs
= ~109 min agent time per full CI run.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>