
Re-enable and extend chunked pack benchmarks #781

Open
nirandaperera wants to merge 23 commits into rapidsai:main from nirandaperera:reenable-bench-pack

Conversation

@nirandaperera (Contributor) commented Jan 12, 2026

Since NVIDIA/cccl#7006 is merged, we should be able to re-enable the chunked pack benchmarks that use pinned memory.

This PR also extends the pack benchmarks to copy the packed data to a destination HostBuffer (host or pinned). This gives a more representative picture of spilling.

Latest results

https://docs.google.com/spreadsheets/d/1yXiB3aFZO8GUD4dAVnwh7o9zjzXKwaSvGYGZphf9jgQ/edit?usp=sharing

Previous results

Workstation RTX A6000 driver 580.105.08 CUDA 13.0
[benchmark results image]

PDX H100 driver 535.216.03 CUDA 13.1 (using cuda-compat)
[benchmark results image]

Looking at these results, if we consider the spilling scenario where we pack and copy to host or pinned host memory, for a 1GB table:

| A6000 benchmark | | H100 benchmark | |
|---|---:|---|---:|
| BM_Pack_device_copy_to_pinned_host | 22,760.87 | BM_Pack_device_copy_to_pinned_host | 45,062.92 |
| BM_ChunkedPack_device_copy_to_pinned_host | 22,483.63 | BM_ChunkedPack_device_copy_to_pinned_host | 43,336.96 |
| BM_ChunkedPack_device_copy_to_host | 21,823.79 | BM_ChunkedPack_pinned_copy_to_pinned_host | 22,014.05 |
| BM_Pack_device_copy_to_host | 21,011.26 | BM_ChunkedPack_device_copy_to_host | 20,057.96 |
| BM_ChunkedPack_pinned_copy_to_pinned_host | 11,564.92 | BM_ChunkedPack_pinned_copy_to_host | 14,565.09 |
| BM_ChunkedPack_pinned_copy_to_host | 11,346.53 | BM_Pack_device_copy_to_host | 14,189.53 |
| BM_Pack_pinned_copy_to_pinned_host | 9,027.17 | BM_Pack_pinned_copy_to_pinned_host | 7,902.45 |
| BM_Pack_pinned_copy_to_host | 8,462.00 | BM_Pack_pinned_copy_to_host | 869.39 |

Signed-off-by: niranda perera <niranda.perera@gmail.com>
@nirandaperera requested a review from a team as a code owner January 12, 2026 21:54
@nirandaperera added labels: improvement (Improves an existing functionality), non-breaking (Introduces a non-breaking change) on Jan 12, 2026
@nirandaperera changed the title from "Re-enable chunked pack benchmarks" to "Re-enable and extend chunked pack benchmarks" on Jan 13, 2026
Signed-off-by: niranda perera <niranda.perera@gmail.com>
@nirandaperera (Contributor, Author) commented Jan 13, 2026

The latest results on my workstation. I need to verify the results on an H100 machine as well.

[benchmark results image]

);
RAPIDSMPF_CUDA_TRY(cudaMemcpyAsync(
static_cast<std::uint8_t*>(destination.data()) + offset,
reinterpret_cast<std::uint8_t*>(destination.data()) + offset,
@madsbk (Member) commented Jan 13, 2026:
isn't destination.data() a std::byte* already?

@nirandaperera (Contributor, Author):

Following are the results from PDX H100 nodes (with cuda-compat 13.1)

[benchmark results image]

Signed-off-by: niranda perera <niranda.perera@gmail.com>
@nirandaperera (Contributor, Author) commented Jan 14, 2026

I think I found the cause of the poor cudf::pack performance with a pinned mr. It turns out that the pinned memory pool is VERY slow to start up, so the first benchmark iteration takes ~1s. Since I had not set any minimum limits, the benchmark stops there and only reports the first iteration's results 😢 When I set min_time=4s warm_up_time=1s, this discrepancy (falling off a cliff) goes away.

Updated results

@wence- (Contributor) left a comment:
A few comments

// Warm up
auto warm_up = cudf::pack(table.view(), stream, pack_mr);

rapidsmpf::HostBuffer dest(warm_up.gpu_data->size(), stream, dest_mr);
A Contributor commented:

You probably want to ensure the dest is paged in by also memcopying into it from the packed warm_up buffer.

Comment on lines +191 to +193
rmm::mr::pool_memory_resource<rmm::mr::cuda_async_memory_resource> pool_mr{
cuda_mr, rmm::percent_of_free_device_memory(40)
};
@wence- (Contributor):
Throughout, this is a mad memory resource, and not one we ever use.

Just use the async memory resource.

@nirandaperera (Contributor, Author):
@wence- Hmm? Aren't we using pool as the default mr in bench shuffle?
https://github.com/rapidsai/rapidsmpf/blob/main/cpp/benchmarks/bench_shuffle.cpp#L269

Comment on lines +366 to +368
// Bounce buffer size: max(1MB, table_size / 10)
auto const bounce_buffer_size = std::max(MB, table_size_bytes / 10);

@wence- (Contributor):
I think this is a bad model for the bounce buffer size. I don't think we want to scale it with the table size, but rather have a fixed-size bounce buffer. That way, if we're using a putative fixed-size pinned host resource, each chunk neatly fits into a block from that host resource.

@nirandaperera (Contributor, Author):
Okay, let me run with a fixed-size bounce buffer. The reason I didn't go ahead with that previously was that the fixed-size buffer resource proposed in cucascade is 1MB, and I felt there would be too many calls to cudaMemcpyAsync. But I should have run a benchmark rather than assuming things.

A Member commented:
I think this is still unresolved, or did the code change here? It still looks like a variable bounce buffer size based on the input size, whereas Lawrence suggests a fixed buffer size, which I agree seems more realistic and easier to manage.

@nirandaperera (Contributor, Author):
I intentionally left this out, TBH. The main reasoning was: on both H100 & NVL4, when packing a 1GB device table to device or pinned host, bandwidth increases as we increase the size of the bounce buffer (which determines the number of cudaMemcpyAsync calls to the output buffer offsets). So, if we used a fixed buffer size (say 64/128MB), larger tables would incur more async copy calls than smaller ones, which I felt was problematic.
That's why I added a separate set of benchmarks to check the effect of the bounce buffer size for a 1GB table.

A Member commented:
Ok, since this is benchmark-only, I think it's not a big deal. However, this approach makes it harder to reason about the results, so to me it feels like an unnecessary benchmark; we could remove it in favor of those where the bounce buffer size is explicitly defined via the benchmark arguments. Selecting bounce buffer sizes based on the table size doesn't seem like a reasonable or useful approach: since you determined in #781 (comment) that there's a size at which bandwidth flattens, we should probably choose that instead of one derived from the table's size. In practice this would also make little sense, because with a small table you'd still split it into 10 parts (following exactly the approach used here), which delivers substantially lower bandwidth than a pre-determined good default.

Signed-off-by: niranda perera <niranda.perera@gmail.com>
…nch-pack

Signed-off-by: niranda perera <niranda.perera@gmail.com>
@nirandaperera (Contributor, Author):

@wence- @madsbk I think I have added all the combinations (almost) now. Can you take another look?

@nirandaperera (Contributor, Author):

Latest results.

My revised conclusions (based on destination reservation) are:

For variable-sized destination buffers,

  1. device
  • Use pack directly. It is significantly faster than chunked_pack. Since the output reservation is provided, we can assume that at least O(table size) of device memory is available.
  • chunked_pack to destination buffer offsets reaches pack performance for larger tables. So maybe we can consider it for larger tables?
  2. pinned host
  • pack using a pinned mr is faster for smaller tables (<100MB). But chunked_pack to destination pinned buffer offsets is faster and more stable for larger tables (this may be because the pack destination buffer allocation time is included in the timings).
  • pack to device then copying to a pinned buffer, and chunked_pack with a device bounce buffer, have comparable performance, but no better.
  3. host
  • pack to device then copying, and chunked_pack with a device bounce buffer, have very similar performance. Similarly, pack to pinned then copying, and chunked_pack with a pinned bounce buffer, have very similar performance, but slower than device bounce buffers.
  • So, we can always use chunked_pack, and pick the bounce buffer based on availability.

The impact of the bounce buffer size in chunked_pack is shown here.

On H100, if the destination buffer is,

  • pinned host - performance increases and saturates around 21GB/s for >8MB bounce buffers for a 1GB table
  • device - performance increases and saturates around 600GB/s for ~512MB bounce buffers for a 1GB table (so, essentially this reaches pack)

In both cases, smaller bounce buffer sizes yield subpar performance.

So, my take here is that we cannot rely on chunked_pack to pack directly into small (1MB) preallocated fixed-size pinned pools.
We could,

  • Increase the pinned buffer sizes to ~4MB (this would be inefficient for smaller buffers)
  • Pack to pinned/device memory (if available), and then async batch-copy to the smaller buffers.

Comment on lines +380 to +381
// Bounce buffer size: max(1MB, table_size / 10)
auto const bounce_buffer_size = std::max(MB, table_size_bytes / 10);
A Member commented:
We probably want a similar model for bounce buffers as above. Once the model is decided, maybe make it a function to compute the result, or use a constant, so the same value is used everywhere.

Comment on lines +407 to +408
// Bounce buffer size: max(1MB, table_size / 10)
auto const bounce_buffer_size = std::max(MB, table_size_bytes / 10);
A Member commented:
Ditto.

Comment on lines +492 to +493
state.counters["bounce_buffer_mb"] =
static_cast<double>(bounce_buffer_size) / static_cast<double>(MB);
A Member commented:
This function is named run_chunked_pack_without_bounce_buffer, but here's a bounce buffer size, why?

@nirandaperera (Contributor, Author):
This is because --benchmark_format=csv outputs all fields. Without this, it only prints the fields that are common to all benchmarks. 😇

A Member commented:
Ok, but why do you even have a bounce_buffer_size here, and why use it to compute total_size etc.? I would expect bounce_buffer_size not to exist in this benchmark at all.

* @param b The benchmark to configure with arguments.
*/
void PackArguments(benchmark::internal::Benchmark* b) {
// Test different table sizes in MB (minimum 1MB as requested)
A Member commented:
As requested by whom/where? I was going to ask if there's use for smaller sizes too to show cases where performance is bad and should be avoided (if that's the case).

@nirandaperera (Contributor, Author):
Ah! AI slop TBH. "I" requested 1MB. LOL! Sorry about this.

@nirandaperera (Contributor, Author):

Based on the latest results on H100 and NVL4, the following were my conclusions.

chunked_packing bounce buffer size

On both H100 & NVL4, when packing a 1GB device table to device or pinned host, bandwidth increases as we increase the size of the bounce buffer (which determines the number of cudaMemcpyAsync calls to the output buffer offsets).

  • For pinned host, the packing bandwidth flattens after 8MB on H100 and 32MB on NVL4.

So, when using chunked_pack, we should try to minimize the number of async copies to the output buffer.

PS: this is the main reason why I picked 10% of the table size as the bounce buffer for the next set of experiments.

Best strategy to pack a table to a destination memory type

To device

  • pack is strictly better than chunked_pack. Also, when packing, we make a device reservation before calling the operation, so chunked_pack is irrelevant.
  • Therefore, use pack with device mr

To pinned host

  • pack, and chunked_pack to an offset output buffer, yield similar performance. chunked_pack with a device bounce buffer is slightly worse.
  • Therefore, use pack with pinned mr (as a device mr)

To host

  • Based on the results (best first):
    • pack to device and copy to host; chunked_pack with a device bounce buffer (1.5-1.7x faster than pinned)
    • pack to pinned and copy to host; chunked_pack with a pinned bounce buffer
  • Since chunked_pack consumes less device/pinned memory, let's forget about pack.
  • So the strategy would be,
    • First try to reserve device memory for the estimated size of the table (this reduces the number of subsequent memcpys to the host buffer)
    • If the actual device memory available (considering overbooking) is > 1MB, allocate that and use it for chunked_pack
    • Else, try pinned memory the same way.
    • If we can't truly reserve at least 1MB of pinned memory, throw.
    • Since chunked_pack requires at least a 1MB bounce buffer, for small tables, use pack to device and copy.


Labels

improvement Improves an existing functionality non-breaking Introduces a non-breaking change


4 participants