Re-enable and extend chunked pack benchmarks #781
nirandaperera wants to merge 23 commits into rapidsai:main from nirandaperera:reenable-bench-pack
Conversation
cpp/benchmarks/bench_pack.cpp
Outdated
```diff
 );
 RAPIDSMPF_CUDA_TRY(cudaMemcpyAsync(
-    static_cast<std::uint8_t*>(destination.data()) + offset,
+    reinterpret_cast<std::uint8_t*>(destination.data()) + offset,
```
Isn't `destination.data()` a `std::byte*` already?
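For context, a minimal sketch of why the cast choice matters, assuming `destination.data()` indeed returns `std::byte*` as the question implies (names here are illustrative, not the benchmark's code):

```cpp
#include <cstddef>
#include <cstdint>

void offset_into(std::byte* data, std::size_t offset) {
    // Ill-formed: static_cast cannot convert between unrelated object
    // pointer types such as std::byte* and std::uint8_t*.
    // auto* p = static_cast<std::uint8_t*>(data) + offset;

    // OK: reinterpret_cast, as in the updated line of the diff.
    auto* q = reinterpret_cast<std::uint8_t*>(data) + offset;

    // Also OK: pointer arithmetic works on std::byte* directly,
    // which is what the question hints at; no cast needed.
    std::byte* r = data + offset;
    (void)q;
    (void)r;
}
```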
I think I found the issue for the poor performance.
```cpp
// Warm up
auto warm_up = cudf::pack(table.view(), stream, pack_mr);

rapidsmpf::HostBuffer dest(warm_up.gpu_data->size(), stream, dest_mr);
```
You probably want to ensure `dest` is paged in by also memcpying into it from the packed `warm_up` buffer.
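A hedged sketch of what that warm-up copy could look like, reusing `dest`, `warm_up`, `stream`, and `RAPIDSMPF_CUDA_TRY` from the diff above (not verified against the final benchmark code):

```cpp
// Touch every page of `dest` once before the timed loop, so the timed
// iterations don't pay first-touch / page-fault costs on the host buffer.
RAPIDSMPF_CUDA_TRY(cudaMemcpyAsync(
    dest.data(),
    warm_up.gpu_data->data(),
    warm_up.gpu_data->size(),
    cudaMemcpyDefault,
    stream.value()
));
stream.synchronize();
```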
cpp/benchmarks/bench_pack.cpp
Outdated
```cpp
rmm::mr::pool_memory_resource<rmm::mr::cuda_async_memory_resource> pool_mr{
    cuda_mr, rmm::percent_of_free_device_memory(40)
};
```
Throughout, this is a mad memory resource, and not one we ever use.
Just use the async memory resource.
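A minimal sketch of the suggested setup, assuming the benchmark only needs a plain device resource (`table_size_bytes` and `stream` are carried over from the surrounding benchmark code):

```cpp
#include <rmm/device_buffer.hpp>
#include <rmm/mr/device/cuda_async_memory_resource.hpp>

// Use the CUDA async (stream-ordered, driver-pooled) resource directly,
// instead of layering an RMM pool on top of it.
rmm::mr::cuda_async_memory_resource mr{};
rmm::device_buffer buf{table_size_bytes, stream, &mr};
```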
@wence- Hmm? Aren't we using pool as the default mr in bench shuffle?
https://github.com/rapidsai/rapidsmpf/blob/main/cpp/benchmarks/bench_shuffle.cpp#L269
```cpp
// Bounce buffer size: max(1MB, table_size / 10)
auto const bounce_buffer_size = std::max(MB, table_size_bytes / 10);
```
I think this is a bad model for the bounce buffer size. I don't think we want to scale it with the table size, but rather have a fixed-size bounce buffer. That way, if we're using a putative fixed-size pinned host resource, each chunk neatly fits into a block from that host resource.
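For illustration, a sketch of the fixed-size bounce buffer loop using the `cudf::chunked_pack` interface (`has_next`/`next`); construction of the packer is omitted since the `create` overloads differ across cudf versions, and `dest`, `RAPIDSMPF_CUDA_TRY`, and `rapidsmpf::HostBuffer` are assumed from the snippets above:

```cpp
#include <cudf/contiguous_split.hpp>
#include <cudf/utilities/span.hpp>
#include <rmm/cuda_stream_view.hpp>
#include <rmm/device_buffer.hpp>

#include <cstddef>
#include <cstdint>

// Drain `packer` through a fixed-size device staging buffer `bounce`,
// copying each filled chunk to the host destination at a running offset.
std::size_t copy_with_bounce_buffer(
    cudf::chunked_pack& packer,
    rapidsmpf::HostBuffer& dest,
    rmm::device_buffer& bounce,
    rmm::cuda_stream_view stream
) {
    std::size_t offset = 0;
    while (packer.has_next()) {
        // next() fills at most one bounce buffer worth of packed bytes.
        std::size_t const copied = packer.next(cudf::device_span<std::uint8_t>{
            static_cast<std::uint8_t*>(bounce.data()), bounce.size()
        });
        RAPIDSMPF_CUDA_TRY(cudaMemcpyAsync(
            reinterpret_cast<std::uint8_t*>(dest.data()) + offset,
            bounce.data(),
            copied,
            cudaMemcpyDefault,
            stream.value()
        ));
        offset += copied;
    }
    return offset;
}
```

With a fixed bounce size, every chunk is at most one block of a fixed-size pinned host resource, which is the property the comment above is after.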
Okay, let me run with a fixed-size bounce buffer. The reason I didn't go with that previously is that the fixed-size buffer resource proposed in cucascade is 1MB, and I felt there would be too many calls to cudaMemcpyAsync. But I should have run a benchmark rather than assuming things.
I think this is still unresolved, or did the code change here? It still looks like a variable bounce buffer size based on the input size, where Lawrence suggests using a fixed buffer size, which I agree seems more realistic and easier to manage.
I intentionally left this out, TBH. The main reasoning was: on both H100 and NVL4, when packing a 1GB device table to device/pinned host, bandwidth increases as we increase the size of the bounce buffer (this corresponds to the number of cudaMemcpyAsync calls to the output buffer offsets). So, if we use a fixed buffer size (say 64/128MB), larger tables would have more async copy calls than smaller ones. I felt this was problematic.
That's why I added a separate set of benchmarks to check the effect of the bounce buffer size for a 1GB table.
Ok, since this is a benchmark only, I think it's not a big deal. However, this approach makes it harder to reason about the results, so to me it feels like an unnecessary benchmark, and we could remove it in favor of those where the bounce buffer size is explicitly defined via the benchmark arguments. Selecting bounce buffer sizes based on the table size doesn't seem like a reasonable or useful approach: since you have in #781 (comment) determined that there's a size at which bandwidth flattens, we should probably choose that instead of one based on the table's size. In practice this would also make little sense, because if you have a small table you'd still split it into 10 parts (if you follow exactly the same approach as here), which will deliver substantially lower bandwidth than the pre-determined good default.
Latest results. My revised conclusions (based on destination reservation), for variable-sized destination buffers:

The impact of the bounce buffer size in chunked_pack is shown here. On H100, if the destination buffer is:

So, my take here is: we cannot rely on chunked_pack to directly pack into small (1MB) preallocated fixed-size pinned pools.
```cpp
// Bounce buffer size: max(1MB, table_size / 10)
auto const bounce_buffer_size = std::max(MB, table_size_bytes / 10);
```
We probably want a similar model for bounce buffers as above. Once the model is decided, maybe make it a function to compute the result, or use a constant, so the same is used everywhere.
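Something along those lines could be as small as this (illustrative only; the `max(1MB, size / 10)` model is just what the current diff uses, not a settled choice):

```cpp
#include <algorithm>
#include <cstddef>

constexpr std::size_t MB = std::size_t{1} << 20;

// One shared place to compute the bounce buffer size, so every
// benchmark uses the same model (or, later, the same fixed constant).
[[nodiscard]] constexpr std::size_t bounce_buffer_size_for(std::size_t table_size_bytes) {
    return std::max(MB, table_size_bytes / 10);
}
```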
```cpp
// Bounce buffer size: max(1MB, table_size / 10)
auto const bounce_buffer_size = std::max(MB, table_size_bytes / 10);

state.counters["bounce_buffer_mb"] =
    static_cast<double>(bounce_buffer_size) / static_cast<double>(MB);
```
This function is named `run_chunked_pack_without_bounce_buffer`, but there's a bounce buffer size here. Why?
This is because `--benchmark_format=csv` outputs all fields. Without this, it only prints the fields that are common to all benchmarks. 😇
Ok, but why do you even have a `bounce_buffer_size` and use it to compute `total_size`, etc.? I would expect `bounce_buffer_size` to not even exist in this benchmark.
cpp/benchmarks/bench_pack.cpp
Outdated
```cpp
 * @param b The benchmark to configure with arguments.
 */
void PackArguments(benchmark::internal::Benchmark* b) {
    // Test different table sizes in MB (minimum 1MB as requested)
```
As requested by whom/where? I was also going to ask if there's use for smaller sizes too, to show cases where performance is bad and should be avoided (if that's the case).
Ah! AI slop, TBH. "I" requested 1MB. LOL! Sorry about this.
Based on the latest results on H100 and NVL4, the following were my conclusions:
Removed minimum size comment for clarity.
Co-authored-by: Peter Andreas Entschev <peter@entschev.com>


Since NVIDIA/cccl#7006 is merged, we should be able to re-enable the chunked pack benchmarks that use pinned memory.
This PR also extends the pack benchmarks to copy the packed data to a destination HostBuffer (host/pinned). This gives a more appropriate picture for spilling.
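In outline, the extended measurement looks roughly like this (a sketch assembled from the diff snippets in this PR, not the exact benchmark code; `pack_mr`, `dest_mr`, and `stream` come from the benchmark setup):

```cpp
// Pack the table on device, ...
auto packed = cudf::pack(table.view(), stream, pack_mr);

// ... then copy the packed bytes into a host/pinned HostBuffer,
// modelling the spill path this PR wants to measure.
rapidsmpf::HostBuffer dest(packed.gpu_data->size(), stream, dest_mr);
RAPIDSMPF_CUDA_TRY(cudaMemcpyAsync(
    dest.data(),
    packed.gpu_data->data(),
    packed.gpu_data->size(),
    cudaMemcpyDefault,
    stream.value()
));
stream.synchronize();
```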
Latest results
https://docs.google.com/spreadsheets/d/1yXiB3aFZO8GUD4dAVnwh7o9zjzXKwaSvGYGZphf9jgQ/edit?usp=sharing
Previous results
Workstation RTX A6000 driver 580.105.08 CUDA 13.0

PDX H100 driver 535.216.03 CUDA 13.1 (using cuda-compat)

Looking at these results, if we consider the spilling scenario where we pack and copy to host/pinned host memory, for a 1GB table, …