Skip to content

[Azure-cosmos-benchmark]Refactor azure-cosmos-benchmark: move all configs from CLI to workload config JSON#48250

Open
xinlian12 wants to merge 5 commits intoAzure:mainfrom
xinlian12:reduceCLIArgs
Open

[Azure-cosmos-benchmark]Refactor azure-cosmos-benchmark: move all configs from CLI to workload config JSON#48250
xinlian12 wants to merge 5 commits intoAzure:mainfrom
xinlian12:reduceCLIArgs

Conversation

@xinlian12
Copy link
Member

@xinlian12 xinlian12 commented Mar 4, 2026

Summary

Refactors the azure-cosmos-benchmark configuration system to eliminate CLI parameter sprawl, centralize all workload configuration in a single JSON file (-workloadConfig), and unify all benchmark types under a single BenchmarkOrchestrator.

Before: ~60 CLI parameters in Configuration.java, 5 separate code paths in Main.java.
After: 6 CLI params remain, single orchestrator entry point for all benchmark types.


Architecture: Before vs After

Before

Main.main()
  ├─ if (isSync)         → syncBenchmark()       [own MetricRegistry, reporter, result uploader]
  ├─ if (CtlWorkload)    → asyncCtlWorkload()    [own MetricRegistry, reporter]
  ├─ if (LinkedInCtl)    → linkedInCtlWorkload()  [own MetricRegistry, reporter]
  ├─ if (encryption)     → asyncEncryptionBenchmark() [own MetricRegistry, reporter]
  └─ else                → BenchmarkOrchestrator  [shared MetricRegistry, reporter]

After

Main.main()
  └─ BenchmarkOrchestrator.run(benchConfig)
       ├─ Sets up shared infrastructure (MetricRegistry, reporters, system properties)
       ├─ Factory dispatches based on operationType + flags (isSync, isEncryptionEnabled):
       │   ├─ Sync:       SyncReadBenchmark / SyncWriteBenchmark
       │   ├─ CTL:        AsyncCtlWorkload
       │   ├─ LinkedIn:   LICtlWorkload
       │   ├─ Encryption: AsyncEncryptionRead/Write/Query benchmarks
       │   └─ Async:      AsyncRead/Write/Query/Mixed/ReadMany benchmarks
       └─ Lifecycle loop: create → run → shutdown → settle × N cycles

All benchmark types now implement a common Benchmark interface (run() + shutdown()) and accept an injected MetricRegistry from the orchestrator.


Workload Config JSON Structure

{
  "tenantDefaults": {
    "connectionMode": "DIRECT",
    "consistencyLevel": "SESSION",
    "concurrency": "50",
    "numberOfOperations": "100000",
    "operation": "ReadThroughput",
    "numberOfPreCreatedDocuments": "1000",
    "isPartitionLevelCircuitBreakerEnabled": "true",
    "isPerPartitionAutomaticFailoverRequired": "true",
    "minConnectionPoolSizePerEndpoint": "0"
  },
  "metrics": {
    "enableJvmStats": true,
    "enableNettyHttpMetrics": false,
    "printingInterval": 10,
    "reportingDirectory": "/tmp/benchmark-output",
    "resultUpload": {
      "serviceEndpoint": "https://results-account.documents.azure.com:443/",
      "masterKey": "",
      "database": "benchresults",
      "container": "runs"
    },
    "runMetadata": {
      "testVariationName": "baseline-direct-mode",
      "branchName": "main",
      "commitId": "abc123def"
    }
  },
  "tenants": [
    {
      "id": "tenant-0",
      "serviceEndpoint": "https://cosmos-account-0.documents.azure.com:443/",
      "masterKey": "",
      "databaseId": "benchdb",
      "containerId": "benchcol",
      "operation": "ReadThroughput",
      "concurrency": 100
    }
  ]
}

How defaults work:

  • Each tenant inherits from tenantDefaults (e.g., both tenants get connectionMode: DIRECT)
  • Per-tenant values override defaults (e.g., tenant-0 gets concurrency: 100)
  • metrics section is global (not per-tenant)
  • JVM system properties (isPartitionLevelCircuitBreakerEnabled, etc.) live in tenantDefaults but apply globally

All Configurations — Before vs After

Account & Connection

Config Name Before After
serviceEndpoint CLI -serviceEndpoint tenants[] / tenantDefaults
masterKey CLI -masterKey tenants[] / tenantDefaults
databaseId CLI -databaseId tenants[] / tenantDefaults
containerId CLI -collectionId tenants[] / tenantDefaults
connectionMode CLI -connectionMode tenants[] / tenantDefaults
consistencyLevel CLI -consistencyLevel tenants[] / tenantDefaults
maxConnectionPoolSize CLI -maxConnectionPoolSize tenants[] / tenantDefaults
preferredRegionsList CLI -preferredRegionsList tenants[] / tenantDefaults
applicationName CLI -applicationName tenants[] / tenantDefaults
manageDatabase CLI -manageDatabase tenants[] / tenantDefaults

Workload Parameters

Config Name Before After
operation CLI -operation tenants[] / tenantDefaults
concurrency CLI -concurrency tenants[] / tenantDefaults
numberOfOperations CLI -numberOfOperations tenants[] / tenantDefaults
useSync CLI -useSync tenants[] / tenantDefaults
encryptionEnabled CLI -encryptionEnabled tenants[] / tenantDefaults

JVM System Properties

Config Name Before After
isPartitionLevelCircuitBreakerEnabled CLI tenantDefaults only
isPerPartitionAutomaticFailoverRequired CLI tenantDefaults only
minConnectionPoolSizePerEndpoint CLI tenantDefaults only

Metrics & Reporting

Config Name Before After
enableJvmStats CLI -enableJvmStats metrics
printingInterval CLI -printingInterval metrics
reportingDirectory CLI -reportingDirectory metrics
Result upload fields CLI flags metrics.resultUpload
Run metadata fields CLI flags metrics.runMetadata

Remaining CLI Parameters (unchanged)

Config Name Description
-workloadConfig Path to workload config JSON (required)
-cycles Number of create/destroy cycles
-settleTimeMs Wait between cycles
-gcBetweenCycles Force GC during settle
-suppressCleanup Skip DB/container cleanup
-help Show help

Key Changes

  1. Main.java: 190 → 62 lines. Single code path through BenchmarkOrchestrator
  2. Benchmark.java (new): Common interface (run() + shutdown()) implemented by all benchmark types
  3. BenchmarkOrchestrator.java: Expanded factory dispatches sync, CTL, encryption, LinkedIn benchmarks based on operationType + flags
  4. SyncBenchmark, AsyncCtlWorkload, AsyncEncryptionBenchmark, LICtlWorkload: Accept injected MetricRegistry, removed self-managed reporters/JVM stats/result uploaders
  5. BenchmarkConfig.java: Separated config parsing into focused methods (loadJvmSystemProperties, loadMetricsConfig, loadResultUploadConfig, loadRunMetadata)
  6. Configuration.java: 1057 → 131 lines. Only CLI lifecycle params remain
  7. TenantWorkloadConfig.java: All workload params with tenantDefaults merge
  8. Renamed: globalDefaultstenantDefaults, -tenantsFile-workloadConfig
  9. Removed: Graphite reporting, redundant TenantWorkloadConfigFromConfigurationTest.java

Verification

mvn compile -pl sdk/cosmos/azure-cosmos-benchmark -DskipTests passes cleanly

Annie Liang and others added 2 commits March 4, 2026 12:51
…d config JSON

- Remove ~50 CLI parameters from Configuration.java; all workload/connection
  params now come from the workload config JSON file (TenantWorkloadConfig)
- Configuration.java reduced from 1057 to 131 lines with only 6 CLI params:
  workloadConfig, cycles, settleTimeMs, gcBetweenCycles, suppressCleanup, help
- Move metrics/reporting/result-upload configs into a nested 'metrics' section
  in the workload config JSON with 'resultUpload' and 'runMetadata' sub-objects
- Remove all Graphite reporting support (use Application Insights instead)
- Remove TenantWorkloadConfig.fromConfiguration() factory method
- Make workload config file mandatory (-workloadConfig flag)
- Rename -tenantsFile CLI flag to -workloadConfig
- Update all consumer classes (29 files) to use TenantWorkloadConfig for
  workload params and BenchmarkConfig for global/reporting params

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@github-actions github-actions bot added the Cosmos label Mar 4, 2026
Annie Liang and others added 2 commits March 4, 2026 14:50
Route sync, CTL, encryption, and LinkedIn benchmarks through the
orchestrator instead of having separate code paths in Main.java.

- Add Benchmark interface (run + shutdown) implemented by all benchmark types
- Refactor SyncBenchmark, AsyncCtlWorkload, AsyncEncryptionBenchmark, and
  LICtlWorkload to accept injected MetricRegistry (like AsyncBenchmark)
- Remove self-managed reporter, result uploader, and JVM stats from each
  benchmark — orchestrator handles all infrastructure concerns
- Expand orchestrator factory to dispatch based on operationType + flags
  (isSync, isEncryptionEnabled)
- Simplify Main.java from 5 code paths to a single orchestrator call

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The section provides default values for tenant-level configuration,
so tenantDefaults better reflects its purpose.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Split loadGlobalSystemPropertiesFromWorkloadConfig into 4 focused methods:
  loadJvmSystemProperties, loadMetricsConfig, loadResultUploadConfig, loadRunMetadata
- Add clarifying comment in Main.java explaining Configuration -> BenchmarkConfig
  -> BenchmarkOrchestrator flow

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@xinlian12 xinlian12 marked this pull request as ready for review March 4, 2026 23:22
@xinlian12 xinlian12 requested review from a team and kirankumarkolli as code owners March 4, 2026 23:22
Copilot AI review requested due to automatic review settings March 4, 2026 23:22
@xinlian12 xinlian12 requested a review from a team as a code owner March 4, 2026 23:22
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR refactors the azure-cosmos-benchmark tool to eliminate ~60 CLI parameters and centralize all workload configuration in a single JSON file (-workloadConfig). The result is a unified BenchmarkOrchestrator entry point for all benchmark types (sync, async, CTL, encryption, LinkedIn), with a new Benchmark interface and per-benchmark MetricRegistry injection from the orchestrator.

Changes:

  • Config system: Configuration.java reduced from ~1057 to 131 lines; all workload params moved to TenantWorkloadConfig and loaded from JSON; BenchmarkConfig now only loads from JSON (no more CLI single-tenant path).
  • Benchmark unification: New Benchmark interface (run() + shutdown()) implemented by all benchmark types; BenchmarkOrchestrator now dispatches sync, CTL, encryption, and LinkedIn benchmarks in addition to async ones.
  • Test updates: Tests migrated from building CLI strings + JCommander parsing to direct TenantWorkloadConfig construction; TenantWorkloadConfigFromConfigurationTest deleted.

Reviewed changes

Copilot reviewed 32 out of 32 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
Configuration.java Reduced to 6 CLI lifecycle params only
TenantWorkloadConfig.java All workload params added; fromConfiguration() factory removed; parseWorkloadConfig renamed from parseTenantsFile
BenchmarkConfig.java Now always requires a JSON config file; split loading into focused methods
BenchmarkOrchestrator.java Factory now dispatches all benchmark types via Benchmark interface; buildCosmosMicrometerRegistry inlined
Benchmark.java New interface with run() and shutdown()
Main.java Simplified to single code path through BenchmarkOrchestrator
SyncBenchmark.java, SyncReadBenchmark.java, SyncWriteBenchmark.java Accept injected MetricRegistry; implement Benchmark
AsyncBenchmark.java Implements Benchmark; run()/shutdown() made public
AsyncEncryptionBenchmark.java Accept injected MetricRegistry; implements Benchmark
AsyncCtlWorkload.java Accept injected MetricRegistry; implements Benchmark
LICtlWorkload.java Accept injected MetricRegistry; implements Benchmark; setup() merged into run()
ScheduledReporterFactory.java Accepts BenchmarkConfig instead of Configuration
WorkflowTest.java Tests migrated to direct TenantWorkloadConfig construction
ReadMyWritesConsistencyTest.java Test migrated to direct TenantWorkloadConfig construction
TenantWorkloadConfigFromConfigurationTest.java Deleted (no longer needed)
LinkedIn classes Configuration replaced with TenantWorkloadConfig
Comments suppressed due to low confidence (3)

sdk/cosmos/azure-cosmos-benchmark/src/main/java/com/azure/cosmos/benchmark/encryption/AsyncEncryptionBenchmark.java:219

  • In AsyncEncryptionBenchmark.shutdown(), the suppressCleanup flag is not respected. Unlike AsyncBenchmark, SyncBenchmark, and AsyncCtlWorkload which all check workloadConfig.isSuppressCleanup() before deleting database/container, AsyncEncryptionBenchmark.shutdown() unconditionally deletes them when databaseCreated or collectionCreated is true. This means the encryption benchmark will always clean up even when cycles > 1 and suppressCleanup=true is set.
    public void shutdown() {
        if (this.databaseCreated) {
            cosmosAsyncDatabase.delete().block();
            logger.info("Deleted temporary database {} created for this test", this.workloadConfig.getDatabaseId());
        } else if (this.collectionCreated) {
            cosmosAsyncContainer.delete().block();
            logger.info("Deleted temporary collection {} created for this test", this.workloadConfig.getContainerId());
        }

        cosmosClient.close();
    }

sdk/cosmos/azure-cosmos-benchmark/src/main/java/com/azure/cosmos/benchmark/ScheduledReporterFactory.java:36

  • The ScheduledReporterFactory.create() method signature changed from accepting Configuration to accepting BenchmarkConfig, but the method is no longer called anywhere (reporter creation is now inlined in BenchmarkOrchestrator.run()). The ScheduledReporterFactory class appears to be dead code after this refactor. Verify whether it is still used, and if not, consider removing it to avoid confusion.
    public static ScheduledReporter create(final BenchmarkConfig benchConfig,
        final MetricRegistry metricsRegistry) {
        if (benchConfig.getReportingDirectory() != null) {
            return CsvReporter.forRegistry(metricsRegistry)
                .convertDurationsTo(TimeUnit.MILLISECONDS)
                .convertRatesTo(TimeUnit.SECONDS)
                .build(new File(benchConfig.getReportingDirectory()));
        } else {
            return ConsoleReporter.forRegistry(metricsRegistry)
                .convertDurationsTo(TimeUnit.MILLISECONDS)
                .convertRatesTo(TimeUnit.SECONDS)
                .build();
        }

sdk/cosmos/azure-cosmos-benchmark/src/test/java/com/azure/cosmos/benchmark/WorkflowTest.java:99

  • The readMyWritesCLI and writeLatencyCLI tests (lines 33–46 and 86–99) call Main.main(StringUtils.split(cmd)) with the old CLI parameters (-serviceEndpoint, -masterKey, -databaseId, -collectionId, -operation, etc.). After this refactor, Main.main() no longer accepts those parameters — it requires -workloadConfig pointing to a JSON file. These tests will fail at runtime with a ParameterException from JCommander (unknown parameters) followed by a NullPointerException in BenchmarkConfig.fromConfiguration() because workloadConfig will be null. These tests need to be either updated to write a temporary JSON config file and pass -workloadConfig to them, or removed if they are superseded by the readMyWrites/writeLatency test variants below them.

You can also share your feedback on Copilot code review. Take the survey.

Comment on lines +33 to 63
validateConfiguration(firstTenant);

new BenchmarkOrchestrator().run(benchConfig);
} catch (ParameterException e) {
System.err.println("INVALID Usage: " + e.getMessage());
System.err.println("Try '-help' for more information.");
throw e;
}
}

private static void validateConfiguration(Configuration cfg) {
switch (cfg.getOperationType()) {
private static void validateConfiguration(TenantWorkloadConfig workloadCfg) {
switch (workloadCfg.getOperationType()) {
case WriteLatency:
case WriteThroughput:
break;
default:
if (!cfg.isContentResponseOnWriteEnabled()) {
if (!workloadCfg.isContentResponseOnWriteEnabled()) {
throw new IllegalArgumentException("contentResponseOnWriteEnabled parameter can only be set to false " +
"for write latency and write throughput operations");
}
}

switch (cfg.getOperationType()) {
switch (workloadCfg.getOperationType()) {
case ReadLatency:
case ReadThroughput:
break;
default:
if (cfg.getSparsityWaitTime() != null) {
throw new IllegalArgumentException("sparsityWaitTime is not supported for " + cfg.getOperationType());
if (workloadCfg.getSparsityWaitTime() != null) {
throw new IllegalArgumentException("sparsityWaitTime is not supported for " + workloadCfg.getOperationType());
}
}
Copy link

Copilot AI Mar 4, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The validateConfiguration method in Main.java only validates firstTenant (the first element of the tenant workloads list), but the workload config JSON can define multiple tenants. Configuration constraints such as contentResponseOnWriteEnabled and sparsityWaitTime should be validated for all tenants, not just the first one. Invalid configurations in tenants 2..N will only be discovered at runtime when the benchmark starts.

Copilot uses AI. Check for mistakes.
Comment on lines +31 to +33
TenantWorkloadConfig firstTenant = benchConfig.getTenantWorkloads().get(0);

if (cfg.isSync()) {
syncBenchmark(cfg);
} else {
if (cfg.getOperationType().equals(CtlWorkload)) {
asyncCtlWorkload(cfg);
} else if (cfg.getOperationType().equals(LinkedInCtlWorkload)) {
linkedInCtlWorkload(cfg);
} else if (cfg.isEncryptionEnabled()) {
asyncEncryptionBenchmark(cfg);
} else {
asyncBenchmark(cfg);
}
}
validateConfiguration(firstTenant);
Copy link

Copilot AI Mar 4, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The parseWorkloadConfig method returns an empty list silently when the tenants array is absent or empty in the JSON. Then Main.java at line 31 calls benchConfig.getTenantWorkloads().get(0) which will throw an IndexOutOfBoundsException with no useful error message. The orchestrator does check if (config.getTenantWorkloads().isEmpty()) at line 61 of BenchmarkOrchestrator, but Main.java crashes before that check is reached. Either parseWorkloadConfig should throw an IllegalArgumentException when no tenants are defined, or Main.java should guard against the empty list before calling get(0).

Copilot uses AI. Check for mistakes.
}

String roleInstance = System.getenv("APPLICATIONINSIGHTS_ROLE_INSTANCE");
if (roleName != null) {
Copy link

Copilot AI Mar 4, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In BenchmarkOrchestrator.buildCosmosMicrometerRegistry(), at line 454, there is a copy-paste bug: the condition if (roleName != null) is used to add the cloud_RoleInstance tag, but it should be if (roleInstance != null). As written, if APPLICATIONINSIGHTS_ROLE_NAME env var is not set (roleName == null), the cloud_RoleInstance tag is never added even when roleInstance has a value. Conversely, if roleName is set but roleInstance is null, then Tag.of("cloud_RoleInstance", null) may cause a NullPointerException. This same bug existed in the old Configuration.java code that was inlined here.

Suggested change
if (roleName != null) {
if (roleInstance != null) {

Copilot uses AI. Check for mistakes.
Comment on lines +317 to +320
public Configuration.Environment getEnvironment() {
if (environment == null) return Configuration.Environment.Daily;
return Configuration.Environment.valueOf(environment);
}
Copy link

Copilot AI Mar 4, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In TenantWorkloadConfig.getEnvironment() (line 317-320), the method returns a Configuration.Environment type. This creates a coupling between TenantWorkloadConfig and the now-reduced Configuration class (which is meant to hold only CLI lifecycle params). The Environment enum should either be moved to TenantWorkloadConfig or to a shared location, rather than being accessed from Configuration.

Suggested change
public Configuration.Environment getEnvironment() {
if (environment == null) return Configuration.Environment.Daily;
return Configuration.Environment.valueOf(environment);
}
/**
* Benchmark execution environment (for example, Daily).
* <p>
* This enum is defined here to avoid coupling {@link TenantWorkloadConfig}
* to the CLI {@code Configuration} class.
* </p>
*/
public enum Environment {
Daily
}
public Environment getEnvironment() {
if (environment == null) {
return Environment.Daily;
}
return Environment.valueOf(environment);
}

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants