Releases: google/fleetbench
v2.0.7
Handle missing performance counters in parallel_bench. PiperOrigin-RevId: 861168827 Change-Id: I1f1c4e4405994281af7ac7009a2d9f156cff80c4
Fleetbench v2.1
Following the major architectural overhaul in v2.0, v2.1 improves data fidelity, observability, and framework stability across the benchmarks.
Key Changes
- High-Fidelity Benchmark Data: We re-implemented the field sampling logic. This ensures the distribution of message types, enums, and nesting levels more accurately matches the statistical profile of our production traffic. We also improved message type generation to better simulate cache pressure.
- Framework Stability & Parallel Execution: The Multiprocessing Framework's parallel controller now gracefully handles failed benchmark worker runs, preventing individual workload crashes from stalling the entire suite. It also now supports passing additional flags directly to the underlying benchmark targets for more granular configuration of individual workloads.
- Observability & Result Processing: Cache size overrides are now included in the output context. This helps verify system topology when running on emulators or non-standard hardware.
- Bug Fixes and Dependency Updates: We fixed several issues in the Multiprocessing Framework and updated third-party dependencies and CI environment configurations.
We hope these improvements help you get more accurate and reliable performance data. If you have any questions or feedback, please feel free to contact us. Happy benchmarking!
v2.0.6
Upgrade GitHub Actions for Node 24 compatibility See https://github.com/google/fleetbench/pull/32 PiperOrigin-RevId: 845386328 Change-Id: I62106e7c22ef08cd858dabde23ca2d720bfc9c3c
v2.0.5
Pass additional flags to the benchmark target. PiperOrigin-RevId: 834415354 Change-Id: Id5ddd4bfffaa08c72765a9d9badb35f0cdb30fe0
v2.0.4
Replace NaN in std columns with string "NaN" for JSON compatibility. PiperOrigin-RevId: 822745633 Change-Id: I4337e86104f4d3b1325a623146c1497aafd476f7
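The v2.0.4 fix can be illustrated with a minimal sketch (the row layout and the benchmark name are invented for illustration):

```python
import json
import math

# A benchmark result row whose std column is NaN (e.g., only one
# repetition ran, so no standard deviation exists). Strict JSON has no
# NaN literal, so replace the float with the string "NaN" before dumping.
row = {"name": "BM_Example", "mean": 12.5, "std": float("nan")}
cleaned = {k: "NaN" if isinstance(v, float) and math.isnan(v) else v
           for k, v in row.items()}
print(json.dumps(cleaned, allow_nan=False))
```

With the replacement in place, `json.dumps(..., allow_nan=False)` no longer raises, and any strict JSON parser can read the output.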
v2.0.3
Update GitHub workflow for releases. PiperOrigin-RevId: 810176172 Change-Id: I8d12b22776598cef5082bb9f0c4b8dcc3b11d18c
v2.0.2
Internal build changes. PiperOrigin-RevId: 789375628 Change-Id: I195c32da1afe9ad343fbf404be1014141045d719
v2.0.1
Update readme and version. PiperOrigin-RevId: 776609725 Change-Id: Idfed7de6be033c4e7ca09144ff598541c8977b3e
v2.0.0: Unlocking Deeper Performance Insights with Multi-Core Simulation and Enhanced Workloads
We are thrilled to announce the release of Fleetbench v2.0, a major milestone that significantly enhances our benchmarking suite's capability to accurately characterize system performance under realistic, concurrent workloads. This release introduces the powerful Multiprocessing Framework, alongside critical New Benchmarks (gRPC and SIMD), and substantial Improvements and Bug Fixes across the suite.
This version represents a substantial step forward in capturing system performance from diverse angles, enabling developers and performance engineers to gain granular insights into how important libraries behave in complex, multi-core environments.
New Features & Capabilities
Broadened Hardware & Environment Support
- Runnability on Emulation and Real Hardware: Fleetbench is now rigorously tested and validated for consistent performance measurement across both emulated environments and physical hardware. This ensures that development and testing workflows utilizing platforms like QEMU can accurately predict real-world performance characteristics, enabling a more seamless transition from concept to development to deployment.
Multiprocessing Framework (/fleetbench/parallel/)
The new Fleetbench Multiprocessing framework is designed for precise CPU load simulation, moving beyond simplistic single-threaded measurements to analyze system behavior under controlled, concurrent loads.
- Core Architecture: At its heart, `parallel_bench.py` orchestrates parallel benchmark execution. A central controller dynamically schedules Fleetbench binaries across a configurable pool of worker threads, distributed over multiple CPU cores.
- Adaptive Load Simulation: Load maintenance is achieved through an adaptive scheduling approach. The controller continuously monitors real-time CPU utilization and dynamically adjusts its scheduling strategy to sustain the target CPU utilization.
- Granular Control: We've introduced extensive customization options, including:
  - Workload Distribution Strategies: Users can define workload composition with strategies like `WORKLOAD_WEIGHTED` (based on aggregate workload runtime) or `DCTAX_WEIGHTED` (user-defined proportional weights via weights.csv), allowing for fine-tuned synthetic load generation.
  - Hyperthreading Control (x86_64): Advanced SMT state manipulation via `--hyperthreading_mode` enables detailed analysis of core contention and cache behavior.
  - Flexible Execution Parameters: Flags such as `--duration`, `--num_cpus`, and `--workload_filter` provide precise control over the benchmark environment.
- Google Benchmark Integration: The framework seamlessly integrates with the underlying Google Benchmark library, supporting familiar flags like `--benchmark_repetitions`, `--benchmark_filter`, and `--benchmark_perf_counters` for detailed metric collection.
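The adaptive scheduling idea can be sketched with a toy feedback loop. All names, thresholds, and the per-worker utilization model below are illustrative assumptions, not Fleetbench's actual implementation in `parallel_bench.py`:

```python
# Illustrative-only sketch of an adaptive scheduling loop: keep measured
# CPU utilization near a target by growing or shrinking the set of
# active benchmark workers.
TARGET_UTIL = 0.95  # fraction of CPU time the workers should consume
MAX_WORKERS = 8

def adjust_workers(active_workers, measured_util):
    """Return the new worker count given the last utilization sample."""
    if measured_util < TARGET_UTIL and active_workers < MAX_WORKERS:
        return active_workers + 1  # under target: schedule another benchmark
    if measured_util > TARGET_UTIL and active_workers > 1:
        return active_workers - 1  # over target: let one worker drain
    return active_workers

# Toy model: pretend each worker contributes ~14% utilization.
workers = 1
for _ in range(20):
    measured = min(1.0, workers * 0.14)
    workers = adjust_workers(workers, measured)
print(workers)
```

The real controller measures utilization from the OS rather than from a fixed per-worker model, but the control structure (sample, compare to target, adjust) is the essence of the approach described above.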
Usage
First build two targets, one for the Fleetbench binary and the other for the multiprocessing framework:
bazel build --config=clang --config=opt --config=haswell fleetbench:fleetbench
bazel build --config=clang --config=opt --config=haswell fleetbench/parallel:parallel_bench
Then run it with:
bazel-bin/fleetbench/parallel/parallel_bench --benchmark_target=bazel-bin/fleetbench/fleetbench
For more usage options, please check the README.md or list all flags via `bazel-bin/fleetbench/parallel/parallel_bench --help`.
New Benchmarks
We've expanded our suite with two crucial, real-world representative benchmarks:
SIMD Benchmark
- Purpose: Accurately measures the performance of Single Instruction, Multiple Data (SIMD) operations.
- Workload: Based on the SIMD-heavy computational patterns from ScaNN LUT16, reflecting operations common in database query processing, cryptography, and approximate nearest neighbor search.
- Mechanism: It calculates distance scores by indexing into query-specific Look-Up Tables (LUTs) using database item codes and accumulating the retrieved values, leveraging parallel data loading, table lookups, and accumulation to harness SIMD power. The benchmark focuses entirely on the performance of the SIMD-heavy lookup-and-accumulate loop.
- Relevance: SIMD instructions are fundamental to high performance in modern computing, accounting for a large portion of CPU instructions in our fleet and growing rapidly.
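A scalar Python sketch of the lookup-and-accumulate loop described above may help make the mechanism concrete. The real benchmark vectorizes this with SIMD instructions; all sizes and names here are illustrative:

```python
import random

random.seed(0)

NUM_SUBSPACES = 8  # each database item stores one 4-bit code per subspace
NUM_ITEMS = 4

# Query-specific LUTs: one 16-entry table of partial distances per subspace.
luts = [[random.randint(0, 255) for _ in range(16)]
        for _ in range(NUM_SUBSPACES)]

# Database items: one 4-bit code (0..15) per subspace.
codes = [[random.randint(0, 15) for _ in range(NUM_SUBSPACES)]
         for _ in range(NUM_ITEMS)]

def distance_scores(luts, codes):
    """For each item, accumulate the LUT entries selected by its codes."""
    scores = []
    for item in codes:
        acc = 0
        for subspace, code in enumerate(item):
            acc += luts[subspace][code]  # table lookup + accumulate
        scores.append(acc)
    return scores

print(distance_scores(luts, codes))
```

In the SIMD version, many item codes are looked up and accumulated in parallel per instruction, which is exactly the hot loop the benchmark isolates.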
gRPC Benchmark
- Purpose: Provides a realistic assessment of kernel and scheduling performance for remote procedure calls (RPCs).
- Workload: Utilizes synthesized representative protos reflecting common request/response patterns derived from real-world fleet traffic, similar to our existing Proto Benchmark.
- Mechanism: Built upon the open-source gRPC framework, this benchmark employs a streamlined, asynchronous callback client/server architecture operating on a local host to minimize network interference.
- Relevance: This benchmark addresses the need to accurately evaluate hyperscale SoC performance under realistic and complex traffic patterns and server loads.
Benchmark Updates & Enhancements
Overall Suite Improvements
- Updated Fleet Data: All v1.0 benchmarks now use more recent fleet data for continued representativeness.
- Explicit Iteration Counts: Benchmarks now have explicit iteration counts, ensuring more consistent and reproducible results.
- Enhanced Stability with Warmup Phases: A warmup phase has been added to benchmarks to reduce initial variance, leading to more consistent performance measurements.
- Accurate L3 Cache Size Detection on AMD Platforms: Fleetbench now correctly aggregates L3 cache size across all CCXs per socket, providing more accurate `cold` benchmark constructions.
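The CCX-aware aggregation can be sketched as follows. The input format and function names are hypothetical; the real implementation queries the hardware topology from the OS:

```python
def socket_l3_bytes(per_cpu_l3):
    """Aggregate L3 size across CCXs on one socket.

    `per_cpu_l3` holds one (cache_id, size_bytes) pair per logical CPU.
    On AMD parts, CPUs in the same CCX share a single L3 instance and
    report the same cache id, so each instance is counted once rather
    than once per CPU.
    """
    seen = {}
    for cache_id, size in per_cpu_l3:
        seen[cache_id] = size
    return sum(seen.values())

# Hypothetical socket: 16 CPUs, 4 CCXs, 32 MiB of L3 per CCX.
per_cpu = [(cpu // 4, 32 << 20) for cpu in range(16)]
print(socket_l3_bytes(per_cpu) >> 20)  # total MiB across the socket
```

Summing only the distinct instances is what yields the full per-socket L3 size used to build `cold` working sets, instead of the size of a single CCX's slice.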
Dedicated Benchmark Refinements
Proto Benchmark
- Improved Representativeness: Re-implemented logic for field sample messages (now weight-based), better cold message generation, improved enum fields, and smarter message type generation with reused types.
- Data Synthesis: Better distinction between synthesized data for varint and fixed-width integers.
- Memory Optimization: Optimized memory usage for improved emulator compatibility.
Swissmap Benchmarks
- Improved Capacity Sizing: More accurate Swissmap capacity sizing, now incorporating fleet size-capacity parameters.
- New InsertMiss Benchmarks: Introduced `InsertMiss_Hot` and `InsertMiss_Cold` for measuring insertion performance of non-present elements.
- Optimized Destructor Benchmarks: Adjusted batch sizes in the `IntDestructor` and `StrDestructor` benchmarks to reduce overhead from helper functions, yielding more accurate measurements.
- Improved Hash Function: Updated to use a low-cost hash function that provides better entropy with random 32-bit integer keys.
LIBC Benchmarks
- Realistic Branching Behavior: Incorporated a more fleet-representative branching pattern for realistic branch prediction.
- Improved `memcmp` & `bcmp` Benchmarks: Now use the same source and destination buffer to correctly account for buffer overlaps.
- `memmove` and Compare Benchmark Fixes: Corrected buffer size calculation for non-overlapping destination addresses, preventing potential infinite loops.
- Integer Overflow Protection: Added checks for the maximum supported L3 cache size to enhance robustness.
Bug Fixes
We also fixed a series of bugs across the suite to improve stability, accuracy, and reliability.
Get Started
We encourage everyone to try Fleetbench v2.0 for your performance analysis and let us know what you think!
Special Thanks to Our Contributors!
This release is a testament to the power of collaborative development. We extend our deepest gratitude to everyone who contributed to Fleetbench! Your insightful feedback, diligent bug reports, and valuable code contributions have been instrumental in making this release a reality and significantly advancing the capabilities of our benchmarking suite. A big thank you to everyone!
v1.0.15
Update swissmap benchmarks to use a low cost hash function that has e…