Rework span record benchmark and publish results #8031
jack-berg merged 3 commits into open-telemetry:main
/**
 * The number of record operations per benchmark invocation. By using a constant across benchmarks
 * of different signals, it's easier to compare benchmark results across signals.
If span, metric, and log benchmarks all record the same number of operations per benchmark invocation, we can see the relative cost of spans vs. logs vs. metrics. Even though it will never be a perfect apples-to-apples comparison, it's still useful to know the order-of-magnitude cost of the different signals.
Make sense?
/**
 * Notes on interpreting the data:
 * This benchmark measures the performance of recording metrics and includes the following
 * dimensions:
One of my initial concerns with public benchmarks was that they need to be contextualized.
To address this, I'd like to:
- Put some effort into making the javadoc for our public benchmarks up to date and useful
- Update the benchmark static webpage to link to the relevant javadoc for each benchmark
 * BatchSpanProcessor} paired with a noop {@link SpanExporter}. In order to avoid quickly outpacing
 * the batch processor queue and dropping spans, the processor is configured with a queue size of
 * {@link SpanRecordBenchmark#RECORDS_PER_INVOCATION} * {@link SpanRecordBenchmark#MAX_THREADS} and
 * is flushed after each invocation.
This is a key aspect of a useful span record benchmark (and log record benchmark) IMO. We need to isolate the benchmark from the export path, which is noisy due to the network dependency, while still being realistic. My definition of realistic is a batch span processor plus a harness that makes sure spans aren't just being dropped on the floor because the queue is full.
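To make the queue-sizing argument concrete, here is a minimal sketch in plain Java. It is not the SDK's `BatchSpanProcessor`; it uses an `ArrayBlockingQueue` as a simplified stand-in, and the constant values are hypothetical (the real ones live in `SpanRecordBenchmark`). It models the worst case, where nothing is exported during an invocation, and counts how many records a bounded queue would reject.

```java
import java.util.concurrent.ArrayBlockingQueue;

// Simplified stand-in for a batching processor queue; NOT the SDK's BatchSpanProcessor.
final class QueueSizingSketch {
  // Hypothetical values chosen for illustration only.
  static final int RECORDS_PER_INVOCATION = 1_000;
  static final int MAX_THREADS = 4;

  /**
   * Offers RECORDS_PER_INVOCATION records per thread to a bounded queue without draining it,
   * and returns how many records were rejected because the queue was full.
   */
  static int droppedInOneInvocation(int queueCapacity) {
    ArrayBlockingQueue<Integer> queue = new ArrayBlockingQueue<>(queueCapacity);
    int dropped = 0;
    for (int thread = 0; thread < MAX_THREADS; thread++) {
      for (int i = 0; i < RECORDS_PER_INVOCATION; i++) {
        // offer() returns false instead of blocking when the queue is full.
        if (!queue.offer(i)) {
          dropped++;
        }
      }
    }
    // Stands in for the flush performed after each invocation.
    queue.clear();
    return dropped;
  }

  public static void main(String[] args) {
    int sized = droppedInOneInvocation(RECORDS_PER_INVOCATION * MAX_THREADS);
    int undersized = droppedInOneInvocation(RECORDS_PER_INVOCATION);
    System.out.println("dropped with full-size queue: " + sized);
    System.out.println("dropped with undersized queue: " + undersized);
  }
}
```

With a capacity of `RECORDS_PER_INVOCATION * MAX_THREADS` and a flush between invocations, the queue can absorb every record from every thread even if nothing is exported in the meantime, so the benchmark measures recording rather than dropping.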
}
}

public enum SpanSize {
Check this out: if we had individual parameters for the number of attributes, events, and links, we would end up with a combinatorial explosion and a lot of noise. What we really want to characterize is the performance of different sizes of spans, where a size is a composite of several dimensions.
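As a rough illustration of the composite-size idea, the sketch below defines an enum in which each t-shirt size fixes all dimensions at once. The counts and field names are hypothetical and are not taken from the PR's actual `SpanSize`; the helper only shows how quickly independent parameters multiply.

```java
// Sketch of a composite "size" dimension. All counts and field names are hypothetical.
final class SpanSizeSketch {
  enum SpanSize {
    SMALL(0, 0, 0),
    MEDIUM(10, 1, 1),
    LARGE(100, 10, 10);

    final int numAttributes;
    final int numEvents;
    final int numLinks;

    SpanSize(int numAttributes, int numEvents, int numLinks) {
      this.numAttributes = numAttributes;
      this.numEvents = numEvents;
      this.numLinks = numLinks;
    }
  }

  /** Number of benchmark cases if every dimension were its own independent parameter. */
  static int independentCases(int valuesPerDimension, int dimensions) {
    int cases = 1;
    for (int i = 0; i < dimensions; i++) {
      cases *= valuesPerDimension;
    }
    return cases;
  }

  public static void main(String[] args) {
    // Three dimensions with three values each, varied independently.
    System.out.println("independent parameters: " + independentCases(3, 3) + " cases");
    // One composite parameter covering the same three dimensions.
    System.out.println("composite size: " + SpanSize.values().length + " cases");
  }
}
```

Three values for each of three independent dimensions give 27 cases per thread count, while the composite enum keeps it at 3.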
Codecov Report
✅ All modified and coverable lines are covered by tests.
@@ Coverage Diff @@
## main #8031 +/- ##
============================================
+ Coverage 90.16% 90.21% +0.04%
- Complexity 7484 7606 +122
============================================
Files 836 841 +5
Lines 22562 22923 +361
Branches 2237 2291 +54
============================================
+ Hits 20344 20680 +336
- Misses 1515 1526 +11
- Partials 703 717 +14
  span.setAttribute(
      benchmarkState.attributeKeys.get(j), benchmarkState.attributeValues.get(j));
}
for (int j = 0; j < benchmarkState.exceptions.size(); j++) {
  span.recordException(benchmarkState.exceptions.get(j));
}
for (int j = 0; j < benchmarkState.linkContexts.size(); j++) {
  span.addLink(benchmarkState.linkContexts.get(j));
}
might be interesting to also benchmark adding attributes and links to the span builder
Do you expect perf differences between adding to the SpanBuilder vs. Span? I can imagine yes if spans are sampled out based on data recorded to SpanBuilder.
Do you have any thoughts on how I might include a dimension for that while avoiding an explosion of test cases and keeping things easy to understand? I like the current framing of a "span size" dimension with t-shirt size options (small, medium, large).
> Do you expect perf differences between adding to the SpanBuilder vs. Span?
i'm not sure, maybe run locally to see, and if they seem similar then no need to add it
Switched the benchmark from recording attributes / links on Span to recording on SpanBuilder:
SpanBuilder spanBuilder = benchmarkState.tracer.spanBuilder("test span name");
for (int j = 0; j < benchmarkState.attributeKeys.size(); j++) {
  spanBuilder.setAttribute(
      benchmarkState.attributeKeys.get(j), benchmarkState.attributeValues.get(j));
}
for (int j = 0; j < benchmarkState.linkContexts.size(); j++) {
  spanBuilder.addLink(benchmarkState.linkContexts.get(j));
}
Span span = spanBuilder.startSpan();
for (int j = 0; j < benchmarkState.exceptions.size(); j++) {
  span.recordException(benchmarkState.exceptions.get(j));
}
span.end();
Change in benchmark:
| Benchmark | Span Size | Baseline (ops/s) | With Span Builder Record (ops/s) | Difference | Change % |
|---|---|---|---|---|---|
| threads1 | SMALL | 8,035,454 | 8,154,087 | +118,633 | +1.48% |
| threads1 | MEDIUM | 1,296,857 | 1,282,624 | -14,233 | -1.10% |
| threads1 | LARGE | 108,910 | 108,768 | -143 | -0.13% |
| threads4 | SMALL | 9,581,626 | 10,853,479 | +1,271,853 | +13.27% |
| threads4 | MEDIUM | 3,401,908 | 2,708,558 | -693,350 | -20.38% |
| threads4 | LARGE | 440,806 | 355,208 | -85,598 | -19.41% |
To me, the difference between recording on Span vs. SpanBuilder looks like inter-run variance.
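For readers who want to check the "Change %" column, here is a small sketch of the arithmetic, assuming the column is the relative difference against the baseline. The class and method names are made up for illustration; the input figures come from the table above.

```java
import java.util.Locale;

// Sketch of the arithmetic behind the "Change %" column; names are hypothetical.
final class ChangeSketch {
  /** Relative change of candidate versus baseline, in percent. */
  static double percentChange(double baseline, double candidate) {
    return (candidate - baseline) / baseline * 100.0;
  }

  public static void main(String[] args) {
    // threads1 / SMALL row from the table above.
    System.out.println(
        String.format(Locale.ROOT, "%+.2f%%", percentChange(8_035_454, 8_154_087)));
    // threads4 / MEDIUM row from the table above.
    System.out.println(
        String.format(Locale.ROOT, "%+.2f%%", percentChange(3_401_908, 2_708_558)));
  }
}
```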
@Measurement(iterations = 5, time = 1)
public void record_4Threads(ThreadState threadState) {
  record(threadState);
@OperationsPerInvocation(RECORDS_PER_INVOCATION)
Adding this annotation will make the historical benchmark results incomparable with new ones, because scores will be reported per record rather than per invocation. Options:
- Wipe the history after merging this
- Manually adjust the historic results to align with the new config
good point, i'm good with any approach, including not adding the annotation
I do like the annotation. Without it, the output figures seem insanely low at first glance, until you read the fine print and understand that each operation is actually many records.
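The effect of the annotation on throughput scores can be sketched with plain arithmetic. JMH's `@OperationsPerInvocation(n)` tells the harness that one benchmark method call performs `n` operations, so scores are adjusted accordingly. The sketch below mimics only that adjustment and does not use JMH; the constant value and the sample rate are hypothetical.

```java
// Mimics the score adjustment only; this is not JMH and does not run a benchmark.
final class OpsPerInvocationSketch {
  // Hypothetical value; the real constant is SpanRecordBenchmark#RECORDS_PER_INVOCATION.
  static final int RECORDS_PER_INVOCATION = 1_000;

  /** Converts a per-invocation throughput into a per-record throughput. */
  static double perRecordThroughput(double invocationsPerSecond, int recordsPerInvocation) {
    return invocationsPerSecond * recordsPerInvocation;
  }

  public static void main(String[] args) {
    double invocationsPerSecond = 8_000.0; // hypothetical score without the annotation
    System.out.println("without annotation: " + invocationsPerSecond + " ops/s");
    System.out.println(
        "with annotation: "
            + perRecordThroughput(invocationsPerSecond, RECORDS_PER_INVOCATION)
            + " ops/s");
  }
}
```

This is also why history recorded before the change is off by a constant factor: the same run is reported at a different scale once the annotation is present.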
Follow-up to #8000