[Azure-cosmos-benchmark]Refactor azure-cosmos-benchmark: move all configs from CLI to workload config JSON#48250
[Azure-cosmos-benchmark]Refactor azure-cosmos-benchmark: move all configs from CLI to workload config JSON#48250xinlian12 wants to merge 5 commits intoAzure:mainfrom
Conversation
…d config JSON - Remove ~50 CLI parameters from Configuration.java; all workload/connection params now come from the workload config JSON file (TenantWorkloadConfig) - Configuration.java reduced from 1057 to 131 lines with only 6 CLI params: workloadConfig, cycles, settleTimeMs, gcBetweenCycles, suppressCleanup, help - Move metrics/reporting/result-upload configs into a nested 'metrics' section in the workload config JSON with 'resultUpload' and 'runMetadata' sub-objects - Remove all Graphite reporting support (use Application Insights instead) - Remove TenantWorkloadConfig.fromConfiguration() factory method - Make workload config file mandatory (-workloadConfig flag) - Rename -tenantsFile CLI flag to -workloadConfig - Update all consumer classes (29 files) to use TenantWorkloadConfig for workload params and BenchmarkConfig for global/reporting params Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Route sync, CTL, encryption, and LinkedIn benchmarks through the orchestrator instead of having separate code paths in Main.java. - Add Benchmark interface (run + shutdown) implemented by all benchmark types - Refactor SyncBenchmark, AsyncCtlWorkload, AsyncEncryptionBenchmark, and LICtlWorkload to accept injected MetricRegistry (like AsyncBenchmark) - Remove self-managed reporter, result uploader, and JVM stats from each benchmark — orchestrator handles all infrastructure concerns - Expand orchestrator factory to dispatch based on operationType + flags (isSync, isEncryptionEnabled) - Simplify Main.java from 5 code paths to a single orchestrator call Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The section provides default values for tenant-level configuration, so tenantDefaults better reflects its purpose. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
sdk/cosmos/azure-cosmos-benchmark/src/main/java/com/azure/cosmos/benchmark/BenchmarkConfig.java
Outdated
Show resolved
Hide resolved
- Split loadGlobalSystemPropertiesFromWorkloadConfig into 4 focused methods: loadJvmSystemProperties, loadMetricsConfig, loadResultUploadConfig, loadRunMetadata - Add clarifying comment in Main.java explaining Configuration -> BenchmarkConfig -> BenchmarkOrchestrator flow Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
There was a problem hiding this comment.
Pull request overview
This PR refactors the azure-cosmos-benchmark tool to eliminate ~60 CLI parameters and centralize all workload configuration in a single JSON file (-workloadConfig). The result is a unified BenchmarkOrchestrator entry point for all benchmark types (sync, async, CTL, encryption, LinkedIn), with a new Benchmark interface and per-benchmark MetricRegistry injection from the orchestrator.
Changes:
- Config system:
Configuration.javareduced from ~1057 to 131 lines; all workload params moved toTenantWorkloadConfigand loaded from JSON;BenchmarkConfignow only loads from JSON (no more CLI single-tenant path). - Benchmark unification: New
Benchmarkinterface (run()+shutdown()) implemented by all benchmark types;BenchmarkOrchestratornow dispatches sync, CTL, encryption, and LinkedIn benchmarks in addition to async ones. - Test updates: Tests migrated from building CLI strings +
JCommanderparsing to directTenantWorkloadConfigconstruction;TenantWorkloadConfigFromConfigurationTestdeleted.
Reviewed changes
Copilot reviewed 32 out of 32 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
Configuration.java |
Reduced to 6 CLI lifecycle params only |
TenantWorkloadConfig.java |
All workload params added; fromConfiguration() factory removed; parseWorkloadConfig renamed from parseTenantsFile |
BenchmarkConfig.java |
Now always requires a JSON config file; split loading into focused methods |
BenchmarkOrchestrator.java |
Factory now dispatches all benchmark types via Benchmark interface; buildCosmosMicrometerRegistry inlined |
Benchmark.java |
New interface with run() and shutdown() |
Main.java |
Simplified to single code path through BenchmarkOrchestrator |
SyncBenchmark.java, SyncReadBenchmark.java, SyncWriteBenchmark.java |
Accept injected MetricRegistry; implement Benchmark |
AsyncBenchmark.java |
Implements Benchmark; run()/shutdown() made public |
AsyncEncryptionBenchmark.java |
Accept injected MetricRegistry; implements Benchmark |
AsyncCtlWorkload.java |
Accept injected MetricRegistry; implements Benchmark |
LICtlWorkload.java |
Accept injected MetricRegistry; implements Benchmark; setup() merged into run() |
ScheduledReporterFactory.java |
Accepts BenchmarkConfig instead of Configuration |
WorkflowTest.java |
Tests migrated to direct TenantWorkloadConfig construction |
ReadMyWritesConsistencyTest.java |
Test migrated to direct TenantWorkloadConfig construction |
TenantWorkloadConfigFromConfigurationTest.java |
Deleted (no longer needed) |
| LinkedIn classes | Configuration replaced with TenantWorkloadConfig |
Comments suppressed due to low confidence (3)
sdk/cosmos/azure-cosmos-benchmark/src/main/java/com/azure/cosmos/benchmark/encryption/AsyncEncryptionBenchmark.java:219
- In
AsyncEncryptionBenchmark.shutdown(), thesuppressCleanupflag is not respected. UnlikeAsyncBenchmark,SyncBenchmark, andAsyncCtlWorkloadwhich all checkworkloadConfig.isSuppressCleanup()before deleting database/container,AsyncEncryptionBenchmark.shutdown()unconditionally deletes them whendatabaseCreatedorcollectionCreatedis true. This means the encryption benchmark will always clean up even when cycles > 1 andsuppressCleanup=trueis set.
public void shutdown() {
if (this.databaseCreated) {
cosmosAsyncDatabase.delete().block();
logger.info("Deleted temporary database {} created for this test", this.workloadConfig.getDatabaseId());
} else if (this.collectionCreated) {
cosmosAsyncContainer.delete().block();
logger.info("Deleted temporary collection {} created for this test", this.workloadConfig.getContainerId());
}
cosmosClient.close();
}
sdk/cosmos/azure-cosmos-benchmark/src/main/java/com/azure/cosmos/benchmark/ScheduledReporterFactory.java:36
- The
ScheduledReporterFactory.create()method signature changed from acceptingConfigurationto acceptingBenchmarkConfig, but the method is no longer called anywhere (reporter creation is now inlined inBenchmarkOrchestrator.run()). TheScheduledReporterFactoryclass appears to be dead code after this refactor. Verify whether it is still used, and if not, consider removing it to avoid confusion.
public static ScheduledReporter create(final BenchmarkConfig benchConfig,
final MetricRegistry metricsRegistry) {
if (benchConfig.getReportingDirectory() != null) {
return CsvReporter.forRegistry(metricsRegistry)
.convertDurationsTo(TimeUnit.MILLISECONDS)
.convertRatesTo(TimeUnit.SECONDS)
.build(new File(benchConfig.getReportingDirectory()));
} else {
return ConsoleReporter.forRegistry(metricsRegistry)
.convertDurationsTo(TimeUnit.MILLISECONDS)
.convertRatesTo(TimeUnit.SECONDS)
.build();
}
sdk/cosmos/azure-cosmos-benchmark/src/test/java/com/azure/cosmos/benchmark/WorkflowTest.java:99
- The
readMyWritesCLIandwriteLatencyCLItests (lines 33–46 and 86–99) callMain.main(StringUtils.split(cmd))with the old CLI parameters (-serviceEndpoint,-masterKey,-databaseId,-collectionId,-operation, etc.). After this refactor,Main.main()no longer accepts those parameters — it requires-workloadConfigpointing to a JSON file. These tests will fail at runtime with aParameterExceptionfrom JCommander (unknown parameters) followed by aNullPointerExceptioninBenchmarkConfig.fromConfiguration()becauseworkloadConfigwill be null. These tests need to be either updated to write a temporary JSON config file and pass-workloadConfigto them, or removed if they are superseded by thereadMyWrites/writeLatencytest variants below them.
You can also share your feedback on Copilot code review. Take the survey.
| validateConfiguration(firstTenant); | ||
|
|
||
| new BenchmarkOrchestrator().run(benchConfig); | ||
| } catch (ParameterException e) { | ||
| System.err.println("INVALID Usage: " + e.getMessage()); | ||
| System.err.println("Try '-help' for more information."); | ||
| throw e; | ||
| } | ||
| } | ||
|
|
||
| private static void validateConfiguration(Configuration cfg) { | ||
| switch (cfg.getOperationType()) { | ||
| private static void validateConfiguration(TenantWorkloadConfig workloadCfg) { | ||
| switch (workloadCfg.getOperationType()) { | ||
| case WriteLatency: | ||
| case WriteThroughput: | ||
| break; | ||
| default: | ||
| if (!cfg.isContentResponseOnWriteEnabled()) { | ||
| if (!workloadCfg.isContentResponseOnWriteEnabled()) { | ||
| throw new IllegalArgumentException("contentResponseOnWriteEnabled parameter can only be set to false " + | ||
| "for write latency and write throughput operations"); | ||
| } | ||
| } | ||
|
|
||
| switch (cfg.getOperationType()) { | ||
| switch (workloadCfg.getOperationType()) { | ||
| case ReadLatency: | ||
| case ReadThroughput: | ||
| break; | ||
| default: | ||
| if (cfg.getSparsityWaitTime() != null) { | ||
| throw new IllegalArgumentException("sparsityWaitTime is not supported for " + cfg.getOperationType()); | ||
| if (workloadCfg.getSparsityWaitTime() != null) { | ||
| throw new IllegalArgumentException("sparsityWaitTime is not supported for " + workloadCfg.getOperationType()); | ||
| } | ||
| } |
There was a problem hiding this comment.
The validateConfiguration method in Main.java only validates firstTenant (the first element of the tenant workloads list), but the workload config JSON can define multiple tenants. Configuration constraints such as contentResponseOnWriteEnabled and sparsityWaitTime should be validated for all tenants, not just the first one. Invalid configurations in tenants 2..N will only be discovered at runtime when the benchmark starts.
| TenantWorkloadConfig firstTenant = benchConfig.getTenantWorkloads().get(0); | ||
|
|
||
| if (cfg.isSync()) { | ||
| syncBenchmark(cfg); | ||
| } else { | ||
| if (cfg.getOperationType().equals(CtlWorkload)) { | ||
| asyncCtlWorkload(cfg); | ||
| } else if (cfg.getOperationType().equals(LinkedInCtlWorkload)) { | ||
| linkedInCtlWorkload(cfg); | ||
| } else if (cfg.isEncryptionEnabled()) { | ||
| asyncEncryptionBenchmark(cfg); | ||
| } else { | ||
| asyncBenchmark(cfg); | ||
| } | ||
| } | ||
| validateConfiguration(firstTenant); |
There was a problem hiding this comment.
The parseWorkloadConfig method returns an empty list silently when the tenants array is absent or empty in the JSON. Then Main.java at line 31 calls benchConfig.getTenantWorkloads().get(0) which will throw an IndexOutOfBoundsException with no useful error message. The orchestrator does check if (config.getTenantWorkloads().isEmpty()) at line 61 of BenchmarkOrchestrator, but Main.java crashes before that check is reached. Either parseWorkloadConfig should throw an IllegalArgumentException when no tenants are defined, or Main.java should guard against the empty list before calling get(0).
| } | ||
|
|
||
| String roleInstance = System.getenv("APPLICATIONINSIGHTS_ROLE_INSTANCE"); | ||
| if (roleName != null) { |
There was a problem hiding this comment.
In BenchmarkOrchestrator.buildCosmosMicrometerRegistry(), at line 454, there is a copy-paste bug: the condition if (roleName != null) is used to add the cloud_RoleInstance tag, but it should be if (roleInstance != null). As written, if APPLICATIONINSIGHTS_ROLE_NAME env var is not set (roleName == null), the cloud_RoleInstance tag is never added even when roleInstance has a value. Conversely, if roleName is set but roleInstance is null, then Tag.of("cloud_RoleInstance", null) may cause a NullPointerException. This same bug existed in the old Configuration.java code that was inlined here.
| if (roleName != null) { | |
| if (roleInstance != null) { |
| public Configuration.Environment getEnvironment() { | ||
| if (environment == null) return Configuration.Environment.Daily; | ||
| return Configuration.Environment.valueOf(environment); | ||
| } |
There was a problem hiding this comment.
In TenantWorkloadConfig.getEnvironment() (line 317-320), the method returns a Configuration.Environment type. This creates a coupling between TenantWorkloadConfig and the now-reduced Configuration class (which is meant to hold only CLI lifecycle params). The Environment enum should either be moved to TenantWorkloadConfig or to a shared location, rather than being accessed from Configuration.
| public Configuration.Environment getEnvironment() { | |
| if (environment == null) return Configuration.Environment.Daily; | |
| return Configuration.Environment.valueOf(environment); | |
| } | |
| /** | |
| * Benchmark execution environment (for example, Daily). | |
| * <p> | |
| * This enum is defined here to avoid coupling {@link TenantWorkloadConfig} | |
| * to the CLI {@code Configuration} class. | |
| * </p> | |
| */ | |
| public enum Environment { | |
| Daily | |
| } | |
| public Environment getEnvironment() { | |
| if (environment == null) { | |
| return Environment.Daily; | |
| } | |
| return Environment.valueOf(environment); | |
| } |
Summary
Refactors the azure-cosmos-benchmark configuration system to eliminate CLI parameter sprawl, centralize all workload configuration in a single JSON file (
-workloadConfig), and unify all benchmark types under a singleBenchmarkOrchestrator.Before: ~60 CLI parameters in
Configuration.java, 5 separate code paths inMain.java.After: 6 CLI params remain, single orchestrator entry point for all benchmark types.
Architecture: Before vs After
Before
After
All benchmark types now implement a common
Benchmarkinterface (run()+shutdown()) and accept an injectedMetricRegistryfrom the orchestrator.Workload Config JSON Structure
{ "tenantDefaults": { "connectionMode": "DIRECT", "consistencyLevel": "SESSION", "concurrency": "50", "numberOfOperations": "100000", "operation": "ReadThroughput", "numberOfPreCreatedDocuments": "1000", "isPartitionLevelCircuitBreakerEnabled": "true", "isPerPartitionAutomaticFailoverRequired": "true", "minConnectionPoolSizePerEndpoint": "0" }, "metrics": { "enableJvmStats": true, "enableNettyHttpMetrics": false, "printingInterval": 10, "reportingDirectory": "/tmp/benchmark-output", "resultUpload": { "serviceEndpoint": "https://results-account.documents.azure.com:443/", "masterKey": "", "database": "benchresults", "container": "runs" }, "runMetadata": { "testVariationName": "baseline-direct-mode", "branchName": "main", "commitId": "abc123def" } }, "tenants": [ { "id": "tenant-0", "serviceEndpoint": "https://cosmos-account-0.documents.azure.com:443/", "masterKey": "", "databaseId": "benchdb", "containerId": "benchcol", "operation": "ReadThroughput", "concurrency": 100 } ] }How defaults work:
tenantDefaults(e.g., both tenants getconnectionMode: DIRECT)concurrency: 100)metricssection is global (not per-tenant)isPartitionLevelCircuitBreakerEnabled, etc.) live intenantDefaultsbut apply globallyAll Configurations — Before vs After
Account & Connection
serviceEndpoint-serviceEndpointtenants[]/tenantDefaultsmasterKey-masterKeytenants[]/tenantDefaultsdatabaseId-databaseIdtenants[]/tenantDefaultscontainerId-collectionIdtenants[]/tenantDefaultsconnectionMode-connectionModetenants[]/tenantDefaultsconsistencyLevel-consistencyLeveltenants[]/tenantDefaultsmaxConnectionPoolSize-maxConnectionPoolSizetenants[]/tenantDefaultspreferredRegionsList-preferredRegionsListtenants[]/tenantDefaultsapplicationName-applicationNametenants[]/tenantDefaultsmanageDatabase-manageDatabasetenants[]/tenantDefaultsWorkload Parameters
operation-operationtenants[]/tenantDefaultsconcurrency-concurrencytenants[]/tenantDefaultsnumberOfOperations-numberOfOperationstenants[]/tenantDefaultsuseSync-useSynctenants[]/tenantDefaultsencryptionEnabled-encryptionEnabledtenants[]/tenantDefaultsJVM System Properties
isPartitionLevelCircuitBreakerEnabledtenantDefaultsonlyisPerPartitionAutomaticFailoverRequiredtenantDefaultsonlyminConnectionPoolSizePerEndpointtenantDefaultsonlyMetrics & Reporting
enableJvmStats-enableJvmStatsmetricsprintingInterval-printingIntervalmetricsreportingDirectory-reportingDirectorymetricsmetrics.resultUploadmetrics.runMetadataRemaining CLI Parameters (unchanged)
-workloadConfig-cycles-settleTimeMs-gcBetweenCycles-suppressCleanup-helpKey Changes
BenchmarkOrchestratorrun()+shutdown()) implemented by all benchmark typesoperationType+ flagsMetricRegistry, removed self-managed reporters/JVM stats/result uploadersloadJvmSystemProperties,loadMetricsConfig,loadResultUploadConfig,loadRunMetadata)tenantDefaultsmergeglobalDefaults→tenantDefaults,-tenantsFile→-workloadConfigTenantWorkloadConfigFromConfigurationTest.javaVerification
mvn compile -pl sdk/cosmos/azure-cosmos-benchmark -DskipTestspasses cleanly