
Re-enable and extend chunked pack benchmarks #781

Open
nirandaperera wants to merge 23 commits into rapidsai:main from nirandaperera:reenable-bench-pack

Conversation

@nirandaperera (Contributor) commented Jan 12, 2026

Since NVIDIA/cccl#7006 is merged, we should be able to re-enable the chunked pack benchmarks that use pinned memory.

This PR also extends the pack benchmarks to copy the packed data to a destination HostBuffer (host or pinned). This gives a more representative picture of spilling.

Latest results

https://docs.google.com/spreadsheets/d/1yXiB3aFZO8GUD4dAVnwh7o9zjzXKwaSvGYGZphf9jgQ/edit?usp=sharing

Previous results

Workstation RTX A6000 driver 580.105.08 CUDA 13.0
[benchmark results image]

PDX H100 driver 535.216.03 CUDA 13.1 (using cuda-compat)
[benchmark results image]

Looking at these results, if we consider the spilling scenario where we pack and copy to host or pinned host memory, for a 1GB table:

| A6000 benchmark | | H100 benchmark | |
|---|---:|---|---:|
| BM_Pack_device_copy_to_pinned_host | 22,760.87 | BM_Pack_device_copy_to_pinned_host | 45,062.92 |
| BM_ChunkedPack_device_copy_to_pinned_host | 22,483.63 | BM_ChunkedPack_device_copy_to_pinned_host | 43,336.96 |
| BM_ChunkedPack_device_copy_to_host | 21,823.79 | BM_ChunkedPack_pinned_copy_to_pinned_host | 22,014.05 |
| BM_Pack_device_copy_to_host | 21,011.26 | BM_ChunkedPack_device_copy_to_host | 20,057.96 |
| BM_ChunkedPack_pinned_copy_to_pinned_host | 11,564.92 | BM_ChunkedPack_pinned_copy_to_host | 14,565.09 |
| BM_ChunkedPack_pinned_copy_to_host | 11,346.53 | BM_Pack_device_copy_to_host | 14,189.53 |
| BM_Pack_pinned_copy_to_pinned_host | 9,027.17 | BM_Pack_pinned_copy_to_pinned_host | 7,902.45 |
| BM_Pack_pinned_copy_to_host | 8,462.00 | BM_Pack_pinned_copy_to_host | 869.39 |

Signed-off-by: niranda perera <niranda.perera@gmail.com>
@nirandaperera requested a review from a team as a code owner January 12, 2026 21:54
@nirandaperera added labels: improvement (Improves an existing functionality), non-breaking (Introduces a non-breaking change) on Jan 12, 2026
@nirandaperera changed the title from "Re-enable chunked pack benchmarks" to "Re-enable and extend chunked pack benchmarks" on Jan 13, 2026
Signed-off-by: niranda perera <niranda.perera@gmail.com>
@nirandaperera (Contributor, Author) commented Jan 13, 2026

The latest results on my workstation. I need to verify the results on an H100 machine as well.

[benchmark results image]

);
RAPIDSMPF_CUDA_TRY(cudaMemcpyAsync(
static_cast<std::uint8_t*>(destination.data()) + offset,
reinterpret_cast<std::uint8_t*>(destination.data()) + offset,
@madsbk (Member) commented Jan 13, 2026:
isn't destination.data() a std::byte* already?

@nirandaperera (Contributor, Author):

Following are the results from PDX H100 nodes (with cuda-compat 13.1)

[benchmark results image]

Signed-off-by: niranda perera <niranda.perera@gmail.com>
@nirandaperera (Contributor, Author) commented Jan 14, 2026

I think I found the cause of the poor cudf::pack performance with a pinned mr. It turns out that the pinned memory pool is VERY slow to start up, so the first benchmark iteration takes ~1s. Since I had not set any minimum limits, the benchmark stops there and only reports the first iteration's results 😢 When I set min_time=4s warm_up_time=1s, this discrepancy (falling off a cliff) goes away.

Updated results

@wence- (Contributor) left a comment:
A few comments

// Warm up
auto warm_up = cudf::pack(table.view(), stream, pack_mr);

rapidsmpf::HostBuffer dest(warm_up.gpu_data->size(), stream, dest_mr);
A Contributor commented:

You probably want to ensure the dest is paged in by also memcopying into it from the packed warm_up buffer.

Comment on lines +191 to +193
rmm::mr::pool_memory_resource<rmm::mr::cuda_async_memory_resource> pool_mr{
cuda_mr, rmm::percent_of_free_device_memory(40)
};
@wence- (Contributor):
Throughout, this is a mad memory resource, and not one we ever use.

Just use the async memory resource.

@nirandaperera (Contributor, Author):
@wence- Hmm? Aren't we using pool as the default mr in bench shuffle?
https://github.com/rapidsai/rapidsmpf/blob/main/cpp/benchmarks/bench_shuffle.cpp#L269

Comment on lines +366 to +368
// Bounce buffer size: max(1MB, table_size / 10)
auto const bounce_buffer_size = std::max(MB, table_size_bytes / 10);

@wence- (Contributor):
I think this is a bad model for the bounce buffer size. I don't think we want to scale it with the table size, but rather have a fixed-size bounce buffer. That way, if we're using a putative fixed-size pinned host resource, each chunk neatly fits into a block from that host resource.

@nirandaperera (Contributor, Author):
Okay, let me run with a fixed-size bounce buffer. The reason I didn't go ahead with that previously was that the fixed-size buffer resource proposed in cucascade is 1MB, and I felt there would be too many calls to cudaMemcpyAsync. But I should have run a benchmark rather than assuming things.

A Member commented:
I think this is still unresolved, or did the code change here? It still looks like a variable bounce buffer size based on the input size, whereas Lawrence suggests a fixed buffer size, which I agree seems more realistic and easier to manage.

@nirandaperera (Contributor, Author):
I intentionally left this out, TBH. The main reasoning was: on both H100 & NVL4, when packing a 1GB device table to device or pinned host, bandwidth increases as we increase the size of the bounce buffer (which determines the number of cudaMemcpyAsync calls to the output buffer offsets). So, if we used a fixed buffer size (say 64/128MB), larger tables would incur more async copy calls than smaller ones, which I felt was problematic.
That's why I added a separate set of benchmarks to check the effect of the bounce buffer size for a 1GB table.

A Member commented:
Ok, since this is benchmark-only, I think it's not a big deal. However, this approach makes it harder to reason about the results, so to me it feels like an unnecessary benchmark; we could remove it in favor of those where the bounce buffer size is explicitly defined via the benchmark arguments. Selecting bounce buffer sizes based on the table size doesn't seem like a reasonable or useful approach: since you determined in #781 (comment) that there's a size at which bandwidth flattens, we should probably choose that instead of one derived from the table's size. In practice this would also make little sense, because with a small table you'd still split it into 10 parts (following exactly the approach used here), which delivers substantially lower bandwidth than a pre-determined good default.

Signed-off-by: niranda perera <niranda.perera@gmail.com>
…nch-pack

Signed-off-by: niranda perera <niranda.perera@gmail.com>
@nirandaperera (Contributor, Author):

@wence- @madsbk I think I have added all the combinations (almost) now. Can you take another look?

@nirandaperera (Contributor, Author):

Latest results.

My revised conclusions (based on destination reservation) are:

For variable-sized destination buffers,

  1. device
  • Use pack directly. It is significantly faster than chunked_pack. Since the output reservation is provided, we can assume that at least O(table size) of device memory is available.
  • chunked_pack to destination buffer offsets reaches pack performance for larger tables. So maybe we can consider it for larger tables?
  2. pinned host
  • pack using a pinned mr is faster for smaller tables (<100MB). But chunked_pack to destination pinned buffer offsets is faster and more stable for larger tables (this may be because the pack destination buffer allocation time is included in the timings).
  • pack to device then copying to a pinned buffer, and chunked_pack with a device bounce buffer, have comparable performance, but no better.
  3. host
  • pack to device then copying, and chunked_pack with a device bounce buffer, have very similar performance. Similarly, pack to pinned then copying, and chunked_pack with a pinned bounce buffer, have very similar performance, but slower than device bounce buffers.
  • So, we can always use chunked_pack, and pick the bounce buffer based on availability.

The impact of the bounce buffer size in chunked_pack is shown here.

On H100, if the destination buffer is,

  • pinned host - performance increases and saturates around 21GB/s for >8MB bounce buffers for a 1GB table
  • device - performance increases and saturates around 600GB/s for ~512MB bounce buffers for a 1GB table (so, essentially this reaches pack)

In both cases, smaller bounce buffer sizes yield subpar performance.

So, my take here is that we cannot rely on chunked_pack to pack directly into small (1MB) preallocated fixed-size pinned pools.
We could,

  • Increase the pinned buffer sizes to ~4MB (this would be inefficient for smaller buffers)
  • Pack to pinned/device memory (if available), and then async batch-copy to the smaller buffers.

Comment on lines +380 to +381
// Bounce buffer size: max(1MB, table_size / 10)
auto const bounce_buffer_size = std::max(MB, table_size_bytes / 10);
A Member commented:
We probably want a similar model for bounce buffers as above. Once the model is decided, maybe make it a function to compute the result, or use a constant, so the same value is used everywhere.

Comment on lines +407 to +408
// Bounce buffer size: max(1MB, table_size / 10)
auto const bounce_buffer_size = std::max(MB, table_size_bytes / 10);
A Member commented:
Ditto.

Comment on lines +492 to +493
state.counters["bounce_buffer_mb"] =
static_cast<double>(bounce_buffer_size) / static_cast<double>(MB);
A Member commented:
This function is named run_chunked_pack_without_bounce_buffer, but here's a bounce buffer size, why?

@nirandaperera (Contributor, Author):
This is because --benchmark_format=csv outputs all fields. Without this, it only prints the fields that are common to all benchmarks. 😇

A Member commented:
Ok, but why do you even have a bounce_buffer_size here, and why use it to compute total_size etc.? I would expect bounce_buffer_size not to exist in this benchmark at all.

* @param b The benchmark to configure with arguments.
*/
void PackArguments(benchmark::internal::Benchmark* b) {
// Test different table sizes in MB (minimum 1MB as requested)
A Member commented:
As requested by whom/where? I was going to ask if there's use for smaller sizes too to show cases where performance is bad and should be avoided (if that's the case).

@nirandaperera (Contributor, Author):
Ah! AI slop TBH. "I" requested 1MB. LOL! Sorry about this.

@nirandaperera (Contributor, Author):

Based on the latest results on H100 and NVL4, the following were my conclusions.

chunked_packing bounce buffer size

On both H100 & NVL4, when packing a 1GB device table to device or pinned host, bandwidth increases as we increase the size of the bounce buffer (which determines the number of cudaMemcpyAsync calls to the output buffer offsets).

  • For pinned host, the packing bandwidth flattens after 8MB on H100 and 32MB on NVL4.

So, when using chunked_pack, we should try to minimize the number of async copies to the output buffer.

PS: this is the main reason why I picked 10% of the table size as the bounce buffer for the next set of experiments.

Best strategy to pack a table to a destination memory type

To device

  • pack is strictly better than chunked_pack. Also, when packing, we make a device reservation before calling the operation, so chunked_pack is irrelevant.
  • Therefore, use pack with device mr

To pinned host

  • pack, and chunked_pack to an offset output buffer, yield similar performance. chunked_pack with a device bounce buffer is slightly worse.
  • Therefore, use pack with pinned mr (as a device mr)

To host

  • Based on the results (best first):
    • pack to device and copy to host; chunked_pack with a device bounce buffer (1.5-1.7x faster than pinned)
    • pack to pinned and copy to host; chunked_pack with a pinned bounce buffer
  • Since chunked_pack consumes less device/pinned memory, let's forget about pack.
  • So the strategy would be,
    • First try to reserve device memory for the estimated size of the table (this reduces the number of subsequent memcpys to the host buffer)
    • If the actual device memory available (considering overbooking) is > 1MB, allocate that and use it for chunked_pack
    • Else, try pinned memory the same way.
    • If we can't truly reserve at least 1MB of pinned memory, throw.
    • Since chunked_pack requires at least a 1MB bounce buffer, for small tables, use pack to device and copy.


Labels

improvement Improves an existing functionality non-breaking Introduces a non-breaking change


4 participants