diff --git a/README.md b/README.md index 0e38ddb..b3eed3c 100644 --- a/README.md +++ b/README.md @@ -3,12 +3,178 @@ CUDA Stream Compaction **University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 2** -* (TODO) YOUR NAME HERE - * (TODO) [LinkedIn](), [personal website](), [twitter](), etc. -* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab) +* CARLOS LOPEZ GARCES + * [LinkedIn](https://www.linkedin.com/in/clopezgarces/) + * [Personal website](https://carlos-lopez-garces.github.io/) +* Tested on: Windows 11, 13th Gen Intel(R) Core(TM) i9-13900HX @ 2.20 GHz, RAM 32GB, NVIDIA GeForce RTX 4060, personal laptop. + +### Description -### (TODO: Your README) +This project implements several versions of the ***exclusive parallel prefix sum*** algorithm (aka parallel scan): -Include analysis, etc. (Remember, this is public, so don't put -anything here that you don't want to share with the world.) +- `StreamCompaction::Naive::scan`, which first computes a per-block exclusive scan on input data, storing partial results in shared memory, then uses a second kernel to compute the exclusive scan of the block sums. Finally, it adds the block sums to the partial results from the previous step to produce the final scan result across all blocks. +- `StreamCompaction::Efficient::scan`, which performs a prefix sum computation using a balanced binary tree approach. It first runs an up-sweep phase to compute partial sums across the blocks of input data, followed by a down-sweep phase to propagate the exclusive scan results. For larger arrays, it combines the results from multiple blocks, performing a final addition of block-level results to complete the scan. + +- `StreamCompaction::Thrust::scan`, which simply calls `thrust::exclusive_scan` on the input. + +Using the work-efficient scan version, this project also implements ***stream compaction*** for efficiently removing 0s from an array of integers. 
The implementation is `StreamCompaction::Efficient::compact`. + +To test them for correctness, their outputs have been compared to CPU versions of these algorithms, which are more straightforward and thus more likely to be correct. Both CPU variants are invoked through `StreamCompaction::CPU::scan`. If `simulateGPUScan` is true, the algorithm mimics the GPU parallel algorithm of `StreamCompaction::Naive::scan` to some extent; the differences lie in how the GPU version deals with arbitrary-length inputs that span more than one block. + +### Extra Credit 2: GPU Scan Using Shared Memory + +`Naive::scan` performs the scan at the block level in shared memory (not in global memory). The shared memory chunk is twice the block size, so that we can alternate reads and writes using double-buffering (the first half of the shared memory chunk is one buffer and the second half is the other). `Efficient::scan` also performs the scan in shared memory (double-buffering is not needed in this case because a thread doesn't overwrite elements that other threads might need concurrently). + +I didn't deal with shared memory bank conflicts. + +### Improving the efficient scan + +When I originally wrote the efficient scan with 0-padding to handle non-power-of-two input sizes, I was aware that there would be entire blocks of 0-padding elements. My original implementation performed all the steps of the up-sweep and down-sweep for these 0-padding elements; since these elements are all 0s, operating on them as if they were genuine input elements did not influence the result. Still, the blocks of 0-padding elements and the threads assigned to them are scheduled and thus take up compute time and resources. In my final version, I handle these threads more carefully so that they don't perform unnecessary global memory writes.
+ +To determine the efficiency gains of the improved version, I ran Nsight Compute on a single call of the original version and a single call of the improved version for a large input size (2^16). In the report, `Duration` reports the `gpu__time_duration_measured_user` metric in microseconds and represents the total time spent in this invocation of the kernel across the GPU. The improved version decreased the duration significantly for the first `exclusiveScanKernel` invocation (although it increased it slightly for the `addBlockIncrementsKernel` kernel). + +![](img/efficient_improvement.png) + +### Determining block size for naive scan + +To determine the best block size for the naive implementation of the scan, I ran Nsight Compute on a single invocation of the scan for each block size in {64, 128, 256, 512, 1024} and input size in {128, 256, 500, 512, 1024, 2^16} (I included 500 to see if a non-power-of-two input size made any difference). From the reports, among all of the available metrics, I chose the `Duration` metric for the first invocation of `exclusiveScanKernel` (which is where the bulk of the work takes place; it's also the most complex kernel). `Duration` reports the `gpu__time_duration_measured_user` metric in microseconds and represents the total time spent in this invocation of the kernel across the GPU. + +This table shows input sizes from left to right and block sizes from top to bottom. The case where the block size equals the input size (so that only one block is used) is marked in blue. + +![](img/naive_block_size_choice_table.png) + +This chart shows multiple colored bars per input size on the horizontal axis; each color represents a different block size. The vertical axis shows kernel duration.
+ +![](img/naive_block_size_choice_chart.png) + +There is no clear winner, but we can choose a block size by elimination: a block size of 1024 takes noticeably longer to process an input of equal size (1024), and it is not a good sign when a kernel that can process the entire input with a single block performs badly; a block size of 512 takes significantly longer to process the largest input (2^16) than smaller block sizes do; between block sizes 64 and 128, **I choose 128** because it yields a slightly shorter duration than 64 across all input sizes (compare the blue and red bars across input sizes). + +### Determining block size for efficient scan + +Following a similar procedure, I arrived at a block size of 256 for the work-efficient implementation. + +### Performance analysis for scan + +The following charts compare the execution times (in milliseconds) of all the scan implementations across a range of increasing power-of-2 input sizes. + +![](img/perf_bars.png) + +We observe that the GPU Efficient implementation almost always beats the execution time of the GPU Naive implementation; interestingly, though, for input size 1024, GPU Naive shows a lower execution time. + +The CPU implementation shows shorter execution times for small input sizes but eventually takes off (at about input size 1024) and increases rapidly as the input size grows. On the other hand, the execution times of the GPU implementations fluctuate somewhat, increasing and decreasing as the input size grows. + +One may speculate as to why the CPU version beats the GPU versions for small input sizes: the overhead of launching the different kernels, combined with the many more instructions computed per iteration in the first scan kernel, may offset the gains of parallelization. + +The same information is shown in the following charts in a different format, where the lower one is a close-up of the upper one for small input sizes.
In them, we can appreciate how the GPU Efficient implementation beats the GPU Naive one consistently, although the difference in execution time between them appears to remain constant (not growing, as one might hope). + +![](img/perf_lines.png) + +It's interesting to see how execution time changes when the input size delta between a non-power-of-2 input and a power-of-2 input is small. Observe, for example, that there's a 0.2 ms difference between input size 32765 (NPOT) and 32768 (power of 2) for GPU Naive. + +![](img/perf_bars_npot.png) + +### Performance bottlenecks for scan + +Digging a little deeper, the **efficient scan implementation** doesn't seem to be memory bound. As reported by Nsight Compute for an input size of 32768, the memory hardware units are not fully utilized by the first `exclusiveScanKernel` invocation, and memory bandwidth is only partially consumed: `Mem Busy` is at 19% and `Max Bandwidth` is at 34%. The maximum throughput for memory instructions is at 34% too, so this version of the scan is not memory bound. + +![](img/efficient_memory.png) + +The achieved warp occupancy is at 74%, though, so latency is not being hidden very well in this efficient scan implementation. + +![](img/efficient_occupancy.png) + +Furthermore, a compute throughput of 43% is very poor: the compute units could be utilized more to improve performance. Based on warp occupancy and compute throughput, I would say this **efficient scan implementation is compute-bound**. + +![](img/efficient_compute_throughput.png) + +Analyzing the same set of metrics for the **naive implementation**, Nsight Compute reports that it is even less memory-bound than the efficient version: `Mem Busy` is only at 11% and `Max Bandwidth` at 18%, which indicates that neither the memory units nor the bandwidth between them is fully utilized. + +![](img/naive_memory.png) + +Warp occupancy is at 69%, which is lower than in the efficient implementation case (74%).
I speculate that this is because the `exclusiveScanKernel` in the naive case uses twice as much shared memory as in the efficient case (it does double-buffering and the efficient case doesn't). With higher shared memory requirements, fewer warps can be scheduled, which leads to lower occupancy. + +![](img/naive_occuppancy.png) + +Due to its low occupancy and the apparent underutilization of memory units and bandwidth, I would say that the **naive implementation is compute-bound**. Its lower occupancy compared to the efficient implementation might be part of the reason why the efficient implementation has a lower execution time. + +### Brief Thrust analysis + +The Thrust-based implementation launches 3 different kernels. The second one appears to prepare the data for the third kernel, which, based on its duration, appears to be the one that actually performs the scan. + +![](img/thrust_kernels.png) + +This version achieves an occupancy of 81%, suggesting that it carefully balances the memory requirements of each block with the number and size of the blocks that it divides the launch into. + +### Test program output + +All original tests pass when GPU scan and compact outputs are compared to reference outputs obtained using the CPU versions of the code. + +``` +**************** +** SCAN TESTS ** +**************** + [ 6 23 12 1 47 5 17 25 14 22 5 22 12 ... 41 0 ] +==== cpu scan, power-of-two ==== + elapsed time: 0.0021ms (std::chrono Measured) + [ 0 6 29 41 42 89 94 111 136 150 172 177 199 ... 5758 5799 ] +==== cpu scan, non-power-of-two ==== + elapsed time: 0.0022ms (std::chrono Measured) + [ 0 6 29 41 42 89 94 111 136 150 172 177 199 ...
5688 5694 ] + passed +==== naive scan, power-of-two ==== + elapsed time: 1.2056ms (CUDA Measured) + passed +==== naive scan, non-power-of-two ==== + elapsed time: 0.431104ms (CUDA Measured) + passed +==== work-efficient scan, power-of-two ==== + elapsed time: 0.643136ms (CUDA Measured) + passed +==== work-efficient scan, non-power-of-two ==== + elapsed time: 0.389312ms (CUDA Measured) + passed +==== thrust scan, power-of-two ==== + elapsed time: 1.09962ms (CUDA Measured) + passed +==== thrust scan, non-power-of-two ==== + elapsed time: 0.311904ms (CUDA Measured) + passed + +***************************** +** STREAM COMPACTION TESTS ** +***************************** + [ 2 3 0 1 1 3 1 1 2 0 3 0 2 ... 3 0 ] +==== cpu compact without scan, power-of-two ==== + elapsed time: 0.0007ms (std::chrono Measured) + [ 2 3 1 1 3 1 1 2 3 2 1 1 2 ... 2 3 ] + passed +==== cpu compact without scan, non-power-of-two ==== + elapsed time: 0.0007ms (std::chrono Measured) + [ 2 3 1 1 3 1 1 2 3 2 1 1 2 ... 3 2 ] + passed +==== cpu compact with scan ==== + elapsed time: 0.0018ms (std::chrono Measured) + [ 2 3 1 1 3 1 1 2 3 2 1 1 2 ... 2 3 ] + passed +==== work-efficient compact, power-of-two ==== + elapsed time: 0.714784ms (CUDA Measured) + passed +==== work-efficient compact, non-power-of-two ==== + elapsed time: 0.467104ms (CUDA Measured) + passed +``` + +### Extra tests + +I modified `main.cpp` to run the original set of tests for a range of input sizes (power of 2 and non-power-of-2): from `MIN_SIZE = 4` (so that NPOT>=1) to `MAX_SIZE = 1 << 16`. + +All tests pass in that range of sizes. + +``` +for (int SIZE = MIN_SIZE; SIZE <= MAX_SIZE; SIZE <<= 1) { ... scanTests(SIZE, NPOT, a, b, c, d); } + +for (int SIZE = MIN_SIZE; SIZE <= MAX_SIZE; SIZE <<= 1) { ... compactionTests(SIZE, NPOT, a, b, c, d); } +``` + +Unfortunately, for larger sizes, I get errors like `failed to mempcy odataDevice to odata: invalid configuration argument` from the naive implementation. 
I couldn't determine the cause. \ No newline at end of file diff --git a/img/efficient_compute_throughput.png b/img/efficient_compute_throughput.png new file mode 100644 index 0000000..3ee3e24 Binary files /dev/null and b/img/efficient_compute_throughput.png differ diff --git a/img/efficient_improvement.png b/img/efficient_improvement.png new file mode 100644 index 0000000..8930b79 Binary files /dev/null and b/img/efficient_improvement.png differ diff --git a/img/efficient_memory.png b/img/efficient_memory.png new file mode 100644 index 0000000..b369438 Binary files /dev/null and b/img/efficient_memory.png differ diff --git a/img/efficient_occupancy.png b/img/efficient_occupancy.png new file mode 100644 index 0000000..52a9469 Binary files /dev/null and b/img/efficient_occupancy.png differ diff --git a/img/naive_block_size_choice_chart.png b/img/naive_block_size_choice_chart.png new file mode 100644 index 0000000..d0a1db9 Binary files /dev/null and b/img/naive_block_size_choice_chart.png differ diff --git a/img/naive_block_size_choice_table.png b/img/naive_block_size_choice_table.png new file mode 100644 index 0000000..edc22ed Binary files /dev/null and b/img/naive_block_size_choice_table.png differ diff --git a/img/naive_compute_throughput.png b/img/naive_compute_throughput.png new file mode 100644 index 0000000..e9d8b45 Binary files /dev/null and b/img/naive_compute_throughput.png differ diff --git a/img/naive_memory.png b/img/naive_memory.png new file mode 100644 index 0000000..4720123 Binary files /dev/null and b/img/naive_memory.png differ diff --git a/img/naive_occuppancy.png b/img/naive_occuppancy.png new file mode 100644 index 0000000..b0e8780 Binary files /dev/null and b/img/naive_occuppancy.png differ diff --git a/img/perf_bars.png b/img/perf_bars.png new file mode 100644 index 0000000..c93de3c Binary files /dev/null and b/img/perf_bars.png differ diff --git a/img/perf_bars_npot.png b/img/perf_bars_npot.png new file mode 100644 index 0000000..0323886 
Binary files /dev/null and b/img/perf_bars_npot.png differ diff --git a/img/perf_lines.png b/img/perf_lines.png new file mode 100644 index 0000000..50c0426 Binary files /dev/null and b/img/perf_lines.png differ diff --git a/img/perf_lines_inset.png b/img/perf_lines_inset.png new file mode 100644 index 0000000..87cd91d Binary files /dev/null and b/img/perf_lines_inset.png differ diff --git a/img/thrust_kernels.png b/img/thrust_kernels.png new file mode 100644 index 0000000..c1de823 Binary files /dev/null and b/img/thrust_kernels.png differ diff --git a/src/main.cpp b/src/main.cpp index 896ac2b..99f456c 100644 --- a/src/main.cpp +++ b/src/main.cpp @@ -13,21 +13,14 @@ #include #include "testing_helpers.hpp" -const int SIZE = 1 << 8; // feel free to change the size of array -const int NPOT = SIZE - 3; // Non-Power-Of-Two -int *a = new int[SIZE]; -int *b = new int[SIZE]; -int *c = new int[SIZE]; +// Failures occur for larger sizes. +#define MAX_SIZE 1 << 16 +// Must be 4 so that NPOT >= 1. +#define MIN_SIZE 4 +#define MAX_ELEMENT_VALUE 50 -int main(int argc, char* argv[]) { - // Scan tests - - printf("\n"); - printf("****************\n"); - printf("** SCAN TESTS **\n"); - printf("****************\n"); - - genArray(SIZE - 1, a, 50); // Leave a 0 at the end to test that edge case +void scanTests(const int SIZE, const int NPOT, int *a, int *b, int *c, int *d) { + genArray(SIZE - 1, a, MAX_ELEMENT_VALUE); // Leave a 0 at the end to test that edge case a[SIZE - 1] = 0; printArray(SIZE, a, true); @@ -35,74 +28,74 @@ int main(int argc, char* argv[]) { // We use b for further comparison. Make sure your StreamCompaction::CPU::scan is correct. // At first all cases passed because b && c are all zeroes. 
zeroArray(SIZE, b); - printDesc("cpu scan, power-of-two"); - StreamCompaction::CPU::scan(SIZE, b, a); + printDesc("cpu scan, power-of-two, serial"); + StreamCompaction::CPU::scan(SIZE, b, a, /*simulateGPUScan=*/false); printElapsedTime(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation(), "(std::chrono Measured)"); printArray(SIZE, b, true); zeroArray(SIZE, c); - printDesc("cpu scan, non-power-of-two"); - StreamCompaction::CPU::scan(NPOT, c, a); + printDesc("cpu scan, non-power-of-two, serial"); + StreamCompaction::CPU::scan(NPOT, c, a, /*simulateGPUScan=*/false); printElapsedTime(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation(), "(std::chrono Measured)"); - printArray(NPOT, b, true); - printCmpResult(NPOT, b, c); + if (!printCmpResult(NPOT, b, c)) printArray(NPOT, c, true); + + zeroArray(SIZE, d); + printDesc("cpu scan, power-of-two, simulated GPU scan"); + StreamCompaction::CPU::scan(SIZE, d, a, /*simulateGPUScan=*/true); + printElapsedTime(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation(), "(std::chrono Measured)"); + if (!printCmpResult(SIZE, b, d)) printArray(SIZE, d, true); + + zeroArray(SIZE, d); + printDesc("cpu scan, non-power-of-two, simulated GPU scan"); + StreamCompaction::CPU::scan(NPOT, d, a, /*simulateGPUScan=*/true); + printElapsedTime(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation(), "(std::chrono Measured)"); + if (!printCmpResult(NPOT, c, d)) printArray(NPOT, d, true); zeroArray(SIZE, c); printDesc("naive scan, power-of-two"); StreamCompaction::Naive::scan(SIZE, c, a); printElapsedTime(StreamCompaction::Naive::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); - //printArray(SIZE, c, true); - printCmpResult(SIZE, b, c); + if (!printCmpResult(SIZE, b, c)) printArray(SIZE, c, true); - /* For bug-finding only: Array of 1s to help find bugs in stream compaction or scan - onesArray(SIZE, c); - printDesc("1s array for finding bugs"); - 
StreamCompaction::Naive::scan(SIZE, c, a); - printArray(SIZE, c, true); */ + // For bug-finding only: Array of 1s to help find bugs in stream compaction or scan + // onesArray(SIZE, c); + // printDesc("1s array for finding bugs"); + // StreamCompaction::Naive::scan(SIZE, c, a); + // printArray(SIZE, c, true); zeroArray(SIZE, c); printDesc("naive scan, non-power-of-two"); StreamCompaction::Naive::scan(NPOT, c, a); printElapsedTime(StreamCompaction::Naive::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); - //printArray(SIZE, c, true); - printCmpResult(NPOT, b, c); + if (!printCmpResult(NPOT, b, c)) printArray(SIZE, c, true); zeroArray(SIZE, c); printDesc("work-efficient scan, power-of-two"); StreamCompaction::Efficient::scan(SIZE, c, a); printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); - //printArray(SIZE, c, true); - printCmpResult(SIZE, b, c); + if (!printCmpResult(SIZE, b, c)) printArray(SIZE, c, true); zeroArray(SIZE, c); printDesc("work-efficient scan, non-power-of-two"); StreamCompaction::Efficient::scan(NPOT, c, a); printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); - //printArray(NPOT, c, true); - printCmpResult(NPOT, b, c); + if (!printCmpResult(NPOT, b, c)) printArray(NPOT, c, true); zeroArray(SIZE, c); printDesc("thrust scan, power-of-two"); StreamCompaction::Thrust::scan(SIZE, c, a); printElapsedTime(StreamCompaction::Thrust::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); - //printArray(SIZE, c, true); - printCmpResult(SIZE, b, c); + if (!printCmpResult(SIZE, b, c)) printArray(SIZE, c, true); zeroArray(SIZE, c); printDesc("thrust scan, non-power-of-two"); StreamCompaction::Thrust::scan(NPOT, c, a); printElapsedTime(StreamCompaction::Thrust::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); - //printArray(NPOT, c, true); - printCmpResult(NPOT, b, c); - - printf("\n"); 
- printf("*****************************\n"); - printf("** STREAM COMPACTION TESTS **\n"); - printf("*****************************\n"); - - // Compaction tests + if (!printCmpResult(NPOT, b, c)) printArray(NPOT, c, true); +} - genArray(SIZE - 1, a, 4); // Leave a 0 at the end to test that edge case +void compactionTests(const int SIZE, const int NPOT, int *a, int *b, int *c, int *d) { + genArray(SIZE - 1, a, MAX_ELEMENT_VALUE); // Leave a 0 at the end to test that edge case a[SIZE - 1] = 0; printArray(SIZE, a, true); @@ -115,40 +108,183 @@ int main(int argc, char* argv[]) { count = StreamCompaction::CPU::compactWithoutScan(SIZE, b, a); printElapsedTime(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation(), "(std::chrono Measured)"); expectedCount = count; - printArray(count, b, true); - printCmpLenResult(count, expectedCount, b, b); + if (!printCmpLenResult(count, expectedCount, b, b)) printArray(count, b, true); zeroArray(SIZE, c); printDesc("cpu compact without scan, non-power-of-two"); count = StreamCompaction::CPU::compactWithoutScan(NPOT, c, a); printElapsedTime(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation(), "(std::chrono Measured)"); expectedNPOT = count; - printArray(count, c, true); - printCmpLenResult(count, expectedNPOT, b, c); + if (!printCmpLenResult(count, expectedNPOT, b, c)) printArray(count, c, true); zeroArray(SIZE, c); printDesc("cpu compact with scan"); count = StreamCompaction::CPU::compactWithScan(SIZE, c, a); printElapsedTime(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation(), "(std::chrono Measured)"); - printArray(count, c, true); - printCmpLenResult(count, expectedCount, b, c); + if (!printCmpLenResult(count, expectedCount, b, c)) printArray(count, c, true); zeroArray(SIZE, c); printDesc("work-efficient compact, power-of-two"); count = StreamCompaction::Efficient::compact(SIZE, c, a); printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), 
"(CUDA Measured)"); - //printArray(count, c, true); - printCmpLenResult(count, expectedCount, b, c); + if (!printCmpLenResult(count, expectedCount, b, c)) printArray(count, c, true); zeroArray(SIZE, c); printDesc("work-efficient compact, non-power-of-two"); count = StreamCompaction::Efficient::compact(NPOT, c, a); printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); - //printArray(count, c, true); - printCmpLenResult(count, expectedNPOT, b, c); + if (!printCmpLenResult(count, expectedNPOT, b, c)) printArray(count, c, true); +} + +typedef enum { + NAIVE, + EFFICIENT, + THRUST +} ScanType; +void profileScan(const int SIZE, ScanType scanType, bool testCorrectness = false) { + int *input = new int[SIZE]; + int *output = new int[SIZE]; + int *reference; + genArray(SIZE - 1, input, MAX_ELEMENT_VALUE); + input[SIZE - 1] = 0; + + if (testCorrectness) { + reference = new int[SIZE]; + StreamCompaction::CPU::scan(SIZE, reference, input, /*simulateGPUScan=*/false); + } + + switch (scanType) { + case NAIVE: + StreamCompaction::Naive::scan(SIZE, output, input); + break; + case EFFICIENT: + StreamCompaction::Efficient::scan(SIZE, output, input); + break; + case THRUST: + StreamCompaction::Thrust::scan(SIZE, output, input); + break; + } + + if (testCorrectness) { + printCmpResult(SIZE, reference, output); + // Allocated with new[], so release with delete[] (not free). + delete[] reference; + } + + delete[] output; + delete[] input; +} + +void scanPerformance(const int SIZE) { + int *input = new int[SIZE]; + int *output = new int[SIZE]; + int *reference = new int[SIZE]; + genArray(SIZE - 1, input, MAX_ELEMENT_VALUE); + input[SIZE - 1] = 0; + + zeroArray(SIZE, output); + StreamCompaction::CPU::scan(SIZE, reference, input, /*simulateGPUScan=*/false); + printElapsedTime(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation(), "CPU::scan"); + + zeroArray(SIZE, output); + StreamCompaction::CPU::scan(SIZE, reference, input, /*simulateGPUScan=*/true); + 
printElapsedTime(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation(), "CPU::scan, simulated GPU scan"); + + zeroArray(SIZE, output); + StreamCompaction::Naive::scan(SIZE, output, input); + printElapsedTime(StreamCompaction::Naive::timer().getGpuElapsedTimeForPreviousOperation(), "Naive::scan"); + + zeroArray(SIZE, output); + StreamCompaction::Efficient::scan(SIZE, output, input); + printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "Efficient::scan"); + + zeroArray(SIZE, output); + StreamCompaction::Thrust::scan(SIZE, output, input); + printElapsedTime(StreamCompaction::Thrust::timer().getGpuElapsedTimeForPreviousOperation(), "Thrust::scan"); + + // Allocated with new[], so release with delete[] (not free). + delete[] reference; + delete[] output; + delete[] input; +} + +int main(int argc, char* argv[]) { + printf("\n"); + printf("****************\n"); + printf("** SCAN TESTS **\n"); + printf("****************\n"); + + for (int SIZE = MIN_SIZE; SIZE <= MAX_SIZE; SIZE <<= 1) { + // Non-power-of-two. + const int NPOT = SIZE - 3; + + printf("\n"); + printf("***********************************************************************\n"); + printf("**SCAN, SIZE = 2^%d = %d NPOT = %d **\n", ilog2(SIZE), SIZE, NPOT); + printf("***********************************************************************\n"); + + int *a = new int[SIZE]; + int *b = new int[SIZE]; + int *c = new int[SIZE]; + int *d = new int[SIZE]; + + scanTests(SIZE, NPOT, a, b, c, d); + + delete[] d; + delete[] c; + delete[] b; + delete[] a; + } + + printf("\n"); + printf("*****************************\n"); + printf("** STREAM COMPACTION TESTS **\n"); + printf("*****************************\n"); + + for (int SIZE = MIN_SIZE; SIZE <= MAX_SIZE; SIZE <<= 1) { + // Non-power-of-two.
+ const int NPOT = SIZE - 3; + + printf("\n"); + printf("***********************************************************************\n"); + printf("**COMPACT, SIZE = 2^%d = %d NPOT = %d **\n", ilog2(SIZE), SIZE, NPOT); + printf("***********************************************************************\n"); + + int *a = new int[SIZE]; + int *b = new int[SIZE]; + int *c = new int[SIZE]; + int *d = new int[SIZE]; + + compactionTests(SIZE, NPOT, a, b, c, d); + + delete[] d; + delete[] c; + delete[] b; + delete[] a; + } + + printf("\n"); + printf("****************\n"); + printf("** PERF TESTS **\n"); + printf("****************\n"); + + for (int SIZE = MIN_SIZE; SIZE <= MAX_SIZE; SIZE <<= 1) { + const int NPOT = SIZE - 3; + + printf("\n"); + printf("***********************************************************************\n"); + printf("**SIZE = 2^%d = %d**\n", ilog2(SIZE), SIZE); + printf("***********************************************************************\n"); + scanPerformance(SIZE); + + printf("\n"); + printf("***********************************************************************\n"); + printf("**NPOT = 2^%d-3 = %d **\n", ilog2(SIZE), NPOT); + printf("***********************************************************************\n"); + scanPerformance(NPOT); + } + + int SIZE = 1 << 16; + profileScan(SIZE, EFFICIENT, /*testCorrectness=*/true); - system("pause"); // stop Win32 console from closing on exit - delete[] a; - delete[] b; - delete[] c; + // Stop Win32 console from closing on exit. + system("pause"); } diff --git a/src/testing_helpers.hpp b/src/testing_helpers.hpp index 025e94a..34e919c 100644 --- a/src/testing_helpers.hpp +++ b/src/testing_helpers.hpp @@ -22,19 +22,22 @@ void printDesc(const char *desc) { } template -void printCmpResult(int n, T *a, T *b) { - printf(" %s \n", - cmpArrays(n, a, b) ? "FAIL VALUE" : "passed"); +bool printCmpResult(int n, T *a, T *b) { + int passed = 0 == cmpArrays(n, a, b); + printf(" %s \n", !passed ? 
"FAIL VALUE" : "passed"); + return passed; } template -void printCmpLenResult(int n, int expN, T *a, T *b) { +bool printCmpLenResult(int n, int expN, T *a, T *b) { + bool passed = n == expN && 0 == cmpArrays(n, a, b); if (n != expN) { printf(" expected %d elements, got %d\n", expN, n); } printf(" %s \n", (n == -1 || n != expN) ? "FAIL COUNT" : cmpArrays(n, a, b) ? "FAIL VALUE" : "passed"); + return passed; } void zeroArray(int n, int *a) { diff --git a/stream_compaction/common.cu b/stream_compaction/common.cu index 2ed6d63..6dbf252 100644 --- a/stream_compaction/common.cu +++ b/stream_compaction/common.cu @@ -23,16 +23,27 @@ namespace StreamCompaction { * which map to 0 will be removed, and elements which map to 1 will be kept. */ __global__ void kernMapToBoolean(int n, int *bools, const int *idata) { - // TODO + int globalThreadIdx = blockIdx.x*blockDim.x + threadIdx.x; + if (globalThreadIdx >= n) { + return; + } + bools[globalThreadIdx] = idata[globalThreadIdx] != 0 ? 1 : 0; } /** * Performs scatter on an array. That is, for each element in idata, * if bools[idx] == 1, it copies idata[idx] to odata[indices[idx]]. */ - __global__ void kernScatter(int n, int *odata, - const int *idata, const int *bools, const int *indices) { - // TODO + __global__ void kernScatter( + int n, int *odata, const int *idata, const int *bools, const int *indices + ) { + int globalThreadIdx = blockIdx.x*blockDim.x + threadIdx.x; + if (globalThreadIdx >= n) { + return; + } + if (bools[globalThreadIdx] == 1) { + odata[indices[globalThreadIdx]] = idata[globalThreadIdx]; + } } } diff --git a/stream_compaction/cpu.cu b/stream_compaction/cpu.cu index 719fa11..ea5afb4 100644 --- a/stream_compaction/cpu.cu +++ b/stream_compaction/cpu.cu @@ -2,6 +2,7 @@ #include "cpu.h" #include "common.h" +#include namespace StreamCompaction { namespace CPU { @@ -13,16 +14,91 @@ namespace StreamCompaction { } /** - * CPU scan (prefix sum). + * CPU exclusive scan (prefix sum).
* For performance analysis, this is supposed to be a simple for loop. - * (Optional) For better understanding before starting moving to GPU, you can simulate your GPU scan in this function first. + * (Optional) For better understanding before starting moving to GPU, you can + * simulate your GPU scan in this function first. + * + * If simulateGPUScan is true, the algorithm mimics the GPU parallel algorithm + * of StreamCompaction::Naive::scan to some extent; differences lie around how + * the GPU version deals with arbitrary length inputs across more than 1 block. */ - void scan(int n, int *odata, const int *idata) { + void scan(int n, int *odata, const int *idata, bool simulateGPUScan) { timer().startCpuTimer(); - // TODO + + if (simulateGPUScan) { + scanExclusiveSimulateGPU(n, odata, idata); + } else { + scanExclusiveSerial(n, odata, idata); + } + timer().endCpuTimer(); } + inline void scanExclusiveSerial(int n, int *odata, const int *idata) { + // Put addition identity in first element. + odata[0] = 0; + // Serial version. + for (int i = 1; i < n; ++i) { + odata[i] = odata[i-1] + idata[i-1]; + } + } + + // CPU version of parallel algorithm. Mimics the GPU parallel algorithm in + // StreamCompaction::Naive::scan to some extent. The intention is not to be + // efficient, but to help understand the parallel algorithm. + inline void scanExclusiveSimulateGPU(int n, int *odata, const int *idata) { + // For each depth d, iterInput is read from and iterOutput is written to + // and then swapped. + std::vector auxBuffer(n); + int *iterInput = auxBuffer.data(); + int *iterOutput = odata; + + // Put addition identity in first element. + iterInput[0] = 0; + + // Copy input data to iterInput, shifting by 1. This effectively turns + // an inclusive scan into an exclusive scan. + for (int k = 1; k < n; ++k) { + iterInput[k] = idata[k-1]; + } + + for (int d = 1; d <= ilog2ceil(n); ++d) { + // Elements to be added are this much apart. 
+                int delta = 1 << (d-1);
+
+                // At the beginning of iteration d:
+                // - partial sums for k in [0, 2^(d-1) - 1] are already complete;
+                // - every other element k holds x[k - 2^(d-1) + 1] + ... + x[k]
+                //   (x being the shifted input).
+                for (int k = 0; k < n; ++k) {
+                    // Only k > 2^(d-1) adds a new term in this iteration.
+                    if (k > delta) {
+                        // Note that if k = delta, then iterInput[k - delta] = 0, so
+                        // that case is handled by the other branch.
+                        iterOutput[k] = iterInput[k - delta] + iterInput[k];
+                    } else {
+                        iterOutput[k] = iterInput[k];
+                    }
+
+                    // y[k] is now x[max(0, k - 2^d + 1)] + ... + x[k], for d >= 1.
+                }
+
+                // Processing [2^(d-1), n) completely before moving on to the next d is
+                // equivalent to all threads waiting on a barrier in the parallel case.
+
+                std::swap(iterInput, iterOutput);
+            }
+
+            // Depending on the number of iterations, the final result (iterInput) may
+            // already be in odata.
+            if (iterInput != odata) {
+                memcpy(odata, iterInput, n * sizeof(int));
+            }
+        }
+
         /**
          * CPU stream compaction without using the scan function.
         *
@@ -30,9 +106,18 @@ namespace StreamCompaction {
         */
         int compactWithoutScan(int n, int *odata, const int *idata) {
             timer().startCpuTimer();
-            // TODO
+
+            int nonzeroCount = 0;
+
+            for (int i = 0; i < n; ++i) {
+                if (idata[i] != 0) {
+                    odata[nonzeroCount++] = idata[i];
+                }
+            }
+
             timer().endCpuTimer();
-            return -1;
+
+            return nonzeroCount;
         }
 
         /**
@@ -42,9 +127,32 @@ namespace StreamCompaction {
         */
         int compactWithScan(int n, int *odata, const int *idata) {
             timer().startCpuTimer();
-            // TODO
+
+            // Identify non-zero elements by marking them with 1 in the mask.
+            int *nonzeroMask = new int[n];
+            for (int i = 0; i < n; ++i) {
+                nonzeroMask[i] = (idata[i] != 0) ? 1 : 0;
+            }
+
+            // Exclusive scan the mask.
+            int *nonzeroMaskPrefixSum = new int[n];
+            scanExclusiveSerial(n, nonzeroMaskPrefixSum, nonzeroMask);
+
+            // Scatter the non-zero elements.
+            int nonzeroCount = 0;
+            for (int i = 0; i < n; ++i) {
+                if (nonzeroMask[i]) {
+                    odata[nonzeroMaskPrefixSum[i]] = idata[i];
+                    nonzeroCount++;
+                }
+            }
+
+            // Buffers allocated with new[] must be released with delete[].
+            delete[] nonzeroMaskPrefixSum;
+            delete[] nonzeroMask;
+
             timer().endCpuTimer();
-            return -1;
+
+            return nonzeroCount;
         }
     }
 }
diff --git a/stream_compaction/cpu.h b/stream_compaction/cpu.h
index 873c047..234f40a 100644
--- a/stream_compaction/cpu.h
+++ b/stream_compaction/cpu.h
@@ -6,7 +6,16 @@ namespace StreamCompaction {
     namespace CPU {
         StreamCompaction::Common::PerformanceTimer& timer();
 
-        void scan(int n, int *odata, const int *idata);
+        // Specify simulateGPUScan to choose between scanExclusiveSerial and
+        // scanExclusiveSimulateGPU.
+        void scan(int n, int *odata, const int *idata, bool simulateGPUScan = true);
+
+        // Completely serial version of exclusive scan on the CPU.
+        void scanExclusiveSerial(int n, int *odata, const int *idata);
+
+        // CPU version of the parallel algorithm. Mimics the GPU parallel algorithm in
+        // StreamCompaction::Naive::scan to some extent.
+        void scanExclusiveSimulateGPU(int n, int *odata, const int *idata);
 
         int compactWithoutScan(int n, int *odata, const int *idata);
 
diff --git a/stream_compaction/efficient.cu b/stream_compaction/efficient.cu
index 2db346e..d80e466 100644
--- a/stream_compaction/efficient.cu
+++ b/stream_compaction/efficient.cu
@@ -3,6 +3,8 @@
 #include "common.h"
 #include "efficient.h"
 
+#define BLOCK_SIZE 256
+
 namespace StreamCompaction {
     namespace Efficient {
         using StreamCompaction::Common::PerformanceTimer;
@@ -12,12 +14,178 @@ namespace StreamCompaction {
             return timer;
         }
 
+        __device__ inline int ilog2(int x) {
+            int lg = 0;
+            while (x >>= 1) {
+                ++lg;
+            }
+            return lg;
+        }
+
+        __device__ inline int ilog2ceil(int x) {
+            return x == 1 ? 0 : ilog2(x - 1) + 1;
+        }
+
+        __global__ void exclusiveScanKernel(int n, int *odata, int *idata, int *blockSumDevice = nullptr) {
+            // The block-level scan is done in shared memory.
+            extern __shared__ int blockSharedMem[];
+
+            int globalThreadIdx = blockIdx.x * blockDim.x + threadIdx.x;
+            if (globalThreadIdx >= n) {
+                // Threads that don't correspond to actual elements in the input array pad the
+                // buffer with 0s so that the algorithm can proceed as if the input size were
+                // a power of 2.
+                blockSharedMem[threadIdx.x] = 0;
+            } else {
+                blockSharedMem[threadIdx.x] = idata[globalThreadIdx];
+            }
+
+            __syncthreads();
+
+            // Up-sweep begins.
+
+            for (int d = 0; d < ilog2ceil(blockDim.x); ++d) {
+                int delta = 1 << (d + 1);
+                // The condition detects parent nodes in the binary tree of the scan, which
+                // are the only ones that need to be updated at this depth.
+                if ((threadIdx.x + 1) % delta == 0) {
+                    int leftChildValue = blockSharedMem[threadIdx.x - (delta / 2)];
+                    int rightChildValue = blockSharedMem[threadIdx.x];
+                    blockSharedMem[threadIdx.x] = leftChildValue + rightChildValue;
+                }
+
+                __syncthreads();
+            }
+
+            // Down-sweep begins.
+
+            if (threadIdx.x == blockDim.x - 1) {
+                blockSharedMem[threadIdx.x] = 0;
+            }
+
+            __syncthreads();
+
+            for (int d = ilog2ceil(blockDim.x) - 1; d >= 0; --d) {
+                int delta = 1 << (d + 1);
+                // The condition detects right child nodes in the binary tree of the scan.
+                if ((threadIdx.x + 1) % delta == 0) {
+                    int leftChildIdx = threadIdx.x - (delta / 2);
+                    int parentIdx = threadIdx.x;
+                    int rightChildIdx = parentIdx;
+                    int oldLeftChildValue = blockSharedMem[leftChildIdx];
+                    blockSharedMem[leftChildIdx] = blockSharedMem[parentIdx];
+                    blockSharedMem[rightChildIdx] += oldLeftChildValue;
+                }
+                __syncthreads();
+            }
+
+            if (globalThreadIdx < n) {
+                odata[globalThreadIdx] = blockSharedMem[threadIdx.x];
+            }
+
+            // blockSumDevice will store the final prefix sums computed by all blocks so that
+            // all blocks' results can be combined later.
+            if (blockSumDevice != nullptr && threadIdx.x == blockDim.x - 1) {
+                // An exclusive scan doesn't include the last element, so add it back in.
+                // The last thread of a partially filled block has no input element; its
+                // contribution is the 0 used for padding.
+                int lastInput = (globalThreadIdx < n) ? idata[globalThreadIdx] : 0;
+                blockSumDevice[blockIdx.x] = blockSharedMem[threadIdx.x] + lastInput;
+            }
+        }
+
+        // Adds the per-block final prefix sums stored in blockSums to the elements of the
+        // corresponding blocks in odata.
+        __global__ void addBlockIncrementsKernel(int n, int *odata, int *blockSums) {
+            int globalThreadIdx = blockIdx.x * blockDim.x + threadIdx.x;
+            if (globalThreadIdx >= n) {
+                return;
+            }
+
+            odata[globalThreadIdx] += blockSums[blockIdx.x];
+        }
+
+        void scanNoTimer(int n, int *odata, const int *idata) {
+            // Round up the input size to the next power of 2 to handle non-power-of-2 inputs
+            // and inputs smaller than the block size. Arrays will be padded with 0s to fill
+            // the extra space. Since the algorithm arranges its computations in a balanced
+            // binary tree, it needs enough input elements; care must be taken not to needlessly
+            // process 0-padding elements.
+            //
+            // Padding with 0s is done in the kernel.
+            int nNextPow2 = 1 << (::ilog2ceil(n));
+
+            // Kernel configuration for the block-level scan and block sum computation.
+            const int blockSize = BLOCK_SIZE;
+            dim3 blockCount = (nNextPow2 + blockSize - 1) / blockSize;
+            // No double buffering, so just one block-sized shared memory chunk is needed.
+            const int blockSharedMemSize = blockSize * sizeof(int);
+
+            // Kernel configuration for the block sum scan (all blocks' final prefix sums).
+            const int blockSumBlockSize = blockCount.x;
+            dim3 blockSumBlockCount = (blockCount.x + blockSumBlockSize - 1) / blockSumBlockSize;
+            blockSumBlockCount = 1;
+            // No double buffering, so just one block-sized shared memory chunk is needed.
+            const int blockSumSharedMemSize = blockSumBlockSize * sizeof(int);
+
+            int *idataDevice = nullptr;
+            cudaMalloc(&idataDevice, n * sizeof(int));
+            checkCUDAError("failed to malloc idataDevice");
+
+            int *odataDevice = nullptr;
+            cudaMalloc(&odataDevice, n * sizeof(int));
+            checkCUDAError("failed to malloc odataDevice");
+
+            int *iblockSumDevice = nullptr;
+            cudaMalloc(&iblockSumDevice, blockCount.x * sizeof(int));
+            checkCUDAError("failed to malloc iblockSumDevice");
+
+            int *oblockSumDevice = nullptr;
+            cudaMalloc(&oblockSumDevice, blockCount.x * sizeof(int));
+            checkCUDAError("failed to malloc oblockSumDevice");
+
+            // If the input needs to be padded with 0s, that'll be done in the kernel.
+            cudaMemcpy(idataDevice, idata, n * sizeof(int), cudaMemcpyHostToDevice);
+            checkCUDAError("failed to memcpy idata to idataDevice");
+
+            // No need to synchronize CPU-GPU here: the CPU blocks until the transfer is complete.
+            // "The function will return once the pageable buffer has been copied to the staging
+            // memory for DMA transfer to device memory ...", according to
+            // https://docs.nvidia.com/cuda/cuda-runtime-api/api-sync-behavior.html#api-sync-behavior__memcpy-sync.
+
+            // Per-block exclusive scan of the original input. iblockSumDevice will store the
+            // final prefix sums computed by each block.
+            exclusiveScanKernel<<<blockCount, blockSize, blockSharedMemSize>>>(n, odataDevice, idataDevice, iblockSumDevice);
+            cudaDeviceSynchronize();
+
+            // Exclusive scan of the final prefix sums computed by each block. oblockSumDevice
+            // will store the increments to be added to each block's scan results.
+            exclusiveScanKernel<<<blockSumBlockCount, blockSumBlockSize, blockSumSharedMemSize>>>(blockCount.x, oblockSumDevice, iblockSumDevice);
+            cudaDeviceSynchronize();
+
+            // Add the block increments to the original scan results to obtain the final results.
+            addBlockIncrementsKernel<<<blockCount, blockSize>>>(n, odataDevice, oblockSumDevice);
+            cudaDeviceSynchronize();
+
+            cudaMemcpy(odata, odataDevice, n * sizeof(int), cudaMemcpyDeviceToHost);
+            checkCUDAError("failed to memcpy odataDevice to odata");
+
+            cudaFree(oblockSumDevice);
+            checkCUDAError("failed to free oblockSumDevice");
+            cudaFree(iblockSumDevice);
+            checkCUDAError("failed to free iblockSumDevice");
+            cudaFree(odataDevice);
+            checkCUDAError("failed to free odataDevice");
+            cudaFree(idataDevice);
+            checkCUDAError("failed to free idataDevice");
+        }
+
         /**
          * Performs prefix-sum (aka scan) on idata, storing the result into odata.
         */
         void scan(int n, int *odata, const int *idata) {
             timer().startGpuTimer();
-            // TODO
+
+            scanNoTimer(n, odata, idata);
+
             timer().endGpuTimer();
         }
 
@@ -32,9 +200,113 @@
         */
         int compact(int n, int *odata, const int *idata) {
             timer().startGpuTimer();
-            // TODO
+
+            int blockSize = 256;
+            int blockCount = (n + blockSize - 1) / blockSize;
+
+            int *idataDevice = nullptr;
+            cudaMalloc(&idataDevice, n * sizeof(int));
+            checkCUDAError("failed to malloc idataDevice");
+
+            int *nonzeroMaskDevice = nullptr;
+            cudaMalloc(&nonzeroMaskDevice, n * sizeof(int));
+            checkCUDAError("failed to malloc nonzeroMaskDevice");
+
+            int *nonzeroMaskPrefixSumDevice = nullptr;
+            cudaMalloc(&nonzeroMaskPrefixSumDevice, n * sizeof(int));
+            checkCUDAError("failed to malloc nonzeroMaskPrefixSumDevice");
+
+            cudaMemcpy(idataDevice, idata, n * sizeof(int), cudaMemcpyHostToDevice);
+            checkCUDAError("failed to memcpy idata to idataDevice");
+
+            // No need to synchronize CPU-GPU here: the CPU blocks until the transfer is complete.
+            // "The function will return once the pageable buffer has been copied to the staging
+            // memory for DMA transfer to device memory ...", according to
+            // https://docs.nvidia.com/cuda/cuda-runtime-api/api-sync-behavior.html#api-sync-behavior__memcpy-sync.
+
+            // Identify non-zero elements by marking them with 1 in the mask.
+            StreamCompaction::Common::kernMapToBoolean<<<blockCount, blockSize>>>(n, nonzeroMaskDevice, idataDevice);
+            cudaDeviceSynchronize();
+
+            // Exclusive scan the mask.
+            scanNoTimer(n, nonzeroMaskPrefixSumDevice, nonzeroMaskDevice);
+            cudaDeviceSynchronize();
+
+            int lastScanValue = 0;
+            int lastMaskValue = 0;
+            cudaMemcpy(&lastScanValue, &nonzeroMaskPrefixSumDevice[n - 1], sizeof(int), cudaMemcpyDeviceToHost);
+            // An exclusive scan doesn't include the last element, so we need to add it.
+            cudaMemcpy(&lastMaskValue, &nonzeroMaskDevice[n - 1], sizeof(int), cudaMemcpyDeviceToHost);
+            int nonzeroCount = lastScanValue + lastMaskValue;
+
+            if (nonzeroCount > 0) {
+                int *odataDevice = nullptr;
+                cudaMalloc(&odataDevice, nonzeroCount * sizeof(int));
+                checkCUDAError("failed to malloc odataDevice");
+
+                // Scatter the non-zero elements.
+                StreamCompaction::Common::kernScatter<<<blockCount, blockSize>>>(
+                    n, odataDevice, idataDevice, nonzeroMaskDevice, nonzeroMaskPrefixSumDevice
+                );
+                cudaDeviceSynchronize();
+
+                cudaMemcpy(odata, odataDevice, nonzeroCount * sizeof(int), cudaMemcpyDeviceToHost);
+                checkCUDAError("failed to memcpy odataDevice to odata");
+
+                cudaFree(odataDevice);
+                checkCUDAError("failed to free odataDevice");
+            }
+            // odata will be left untouched if there are no nonzero elements.
+
+            cudaFree(nonzeroMaskPrefixSumDevice);
+            checkCUDAError("failed to free nonzeroMaskPrefixSumDevice");
+            cudaFree(nonzeroMaskDevice);
+            checkCUDAError("failed to free nonzeroMaskDevice");
+            cudaFree(idataDevice);
+            checkCUDAError("failed to free idataDevice");
 
             timer().endGpuTimer();
-            return -1;
+            return nonzeroCount;
+        }
+
+        // Parallel reduction corresponding to the up-sweep phase of the scan. Standalone
+        // version so that it can be tested on its own.
+        __global__ void reductionKernel(int n, int *odata, int *idata) {
+            extern __shared__ int blockSharedMem[];
+
+            int globalThreadIdx = blockIdx.x * blockDim.x + threadIdx.x;
+            if (globalThreadIdx >= n) {
+                blockSharedMem[threadIdx.x] = 0;
+            } else {
+                blockSharedMem[threadIdx.x] = idata[globalThreadIdx];
+            }
+
+            __syncthreads();
+
+            for (int d = 0; d < ilog2ceil(n); ++d) {
+                if ((threadIdx.x + 1) % (1 << (d+1)) == 0) {
+                    int leftChildValue = blockSharedMem[threadIdx.x - (1 << d)];
+                    int rightChildValue = blockSharedMem[threadIdx.x];
+                    blockSharedMem[threadIdx.x] = leftChildValue + rightChildValue;
+                }
+
+                __syncthreads();
+            }
+
+            // Guard the write: padding threads have no corresponding output element.
+            if (globalThreadIdx < n) {
+                odata[globalThreadIdx] = blockSharedMem[threadIdx.x];
+            }
+        }
+
+        // CPU version of the reduction kernel (up-sweep), for testing and comparison.
+        void reduce(int n, int *odata, const int *idata) {
+            for (int i = 0; i < n; ++i) {
+                odata[i] = idata[i];
+            }
+
+            for (int d = 0; d < ::ilog2(n); ++d) {
+                for (int k = 0; k < n; k += (1 << (d+1))) {
+                    odata[k + (1 << (d+1)) - 1] += odata[k + (1 << d) - 1];
+                }
+            }
+        }
     }
 }
diff --git a/stream_compaction/efficient.h b/stream_compaction/efficient.h
index 803cb4f..083a218 100644
--- a/stream_compaction/efficient.h
+++ b/stream_compaction/efficient.h
@@ -8,6 +8,8 @@ namespace StreamCompaction {
 
         void scan(int n, int *odata, const int *idata);
 
+        void scanNoTimer(int n, int *odata, const int *idata);
+
         int compact(int n, int *odata, const int *idata);
     }
 }
diff --git a/stream_compaction/naive.cu b/stream_compaction/naive.cu
index 4308876..8b6e134 100644
--- a/stream_compaction/naive.cu
+++ b/stream_compaction/naive.cu
@@ -3,6 +3,8 @@
 #include "common.h"
 #include "naive.h"
 
+#define BLOCK_SIZE (1 << 7)
+
 namespace StreamCompaction {
     namespace Naive {
         using StreamCompaction::Common::PerformanceTimer;
@@ -11,14 +13,172 @@ namespace StreamCompaction {
             static PerformanceTimer timer;
             return timer;
         }
-        // TODO: __global__
+
+        __device__ inline int ilog2(int x) {
+            int lg = 0;
+            while (x >>= 1) {
+                ++lg;
+            }
+            return lg;
+        }
+
+        __device__ inline int ilog2ceil(int x) {
+            return x == 1 ? 0 : ilog2(x - 1) + 1;
+        }
+
+        // Performs exclusive scan on idata *per block*, not on the whole n elements at a time.
+        // Uses shared memory to perform the scan at the block level, alternating 2 buffers for
+        // reads and writes each iteration, before copying the resulting partial sums to odata
+        // at the end.
+        //
+        // If blockSumDevice is not nullptr, the final prefix sum computed by the block is stored
+        // in blockSumDevice[blockIdx.x].
+        __global__ void exclusiveScanKernel(int n, int *odata, int *idata, int *blockSumDevice = nullptr) {
+            // The shared memory buffer should be twice the size of the block, so that we have 2
+            // buffers to alternate reads and writes between.
+            extern __shared__ int blockSharedMem[];
+
+            int globalThreadIdx = blockIdx.x * blockDim.x + threadIdx.x;
+            // Extra threads in the last block must not return early here: every thread has to
+            // reach the __syncthreads() barriers below, so out-of-range threads pad the shared
+            // buffer with 0s instead.
+
+            // For each depth d, iterInput is read from and iterOutput is written to,
+            // and then they are swapped.
+            int *iterInput = blockSharedMem;
+            int *iterOutput = blockSharedMem + blockDim.x;
+
+            if (threadIdx.x == 0) {
+                // Put the addition identity in the first element.
+                iterInput[threadIdx.x] = 0;
+            } else if (globalThreadIdx <= n) {
+                // Copy input data to iterInput, shifting by 1. This effectively turns
+                // an inclusive scan into an exclusive scan.
+                iterInput[threadIdx.x] = idata[globalThreadIdx - 1];
+            } else {
+                // 0-padding for threads past the end of the input.
+                iterInput[threadIdx.x] = 0;
+            }
+
+            __syncthreads();
+
+            int k = threadIdx.x;
+            for (int d = 1; d <= ilog2ceil(n); ++d) {
+                // Elements to be added are this far apart.
+                int delta = 1 << (d-1);
+
+                // At the beginning of iteration d:
+                // - partial sums for k in [0, 2^(d-1) - 1] are already complete;
+                // - every other element k holds x[k - 2^(d-1) + 1] + ... + x[k].
+
+                if (k > delta) {
+                    // Note that if k = delta, then iterInput[k - delta] = 0, so that case
+                    // is handled by the other branch.
+                    iterOutput[k] = iterInput[k - delta] + iterInput[k];
+                } else {
+                    iterOutput[k] = iterInput[k];
+                }
+
+                __syncthreads();
+
+                int *tmp = iterInput;
+                iterInput = iterOutput;
+                iterOutput = tmp;
+            }
+
+            // iterInput now contains the final result of the scan for this block.
+            //
+            // The synchronization barrier at the end of the loop ensures that all threads
+            // have finished writing to iterInput before we copy the results to odata; all
+            // threads execute exactly the same number of iterations, so we don't need to
+            // worry about threads in the same block being at different stages of the scan.
+            if (globalThreadIdx < n) {
+                odata[globalThreadIdx] = iterInput[threadIdx.x];
+            }
+
+            // blockSumDevice will store the final prefix sums computed by all blocks so that
+            // all blocks' results can be combined later.
+            if (blockSumDevice != nullptr && threadIdx.x == blockDim.x - 1) {
+                // An exclusive scan doesn't include the last element, so add it back in. The
+                // last thread of a partially filled block has no input element; its
+                // contribution is the 0 used for padding.
+                int lastInput = (globalThreadIdx < n) ? idata[globalThreadIdx] : 0;
+                blockSumDevice[blockIdx.x] = iterInput[blockDim.x - 1] + lastInput;
+            }
+        }
+
+        // Adds the per-block final prefix sums stored in blockSums to the elements of the
+        // corresponding blocks in odata.
+        __global__ void addBlockIncrementsKernel(int n, int *odata, int *blockSums) {
+            int globalThreadIdx = blockIdx.x * blockDim.x + threadIdx.x;
+            if (globalThreadIdx >= n) {
+                return;
+            }
+
+            odata[globalThreadIdx] += blockSums[blockIdx.x];
+        }
 
         /**
          * Performs prefix-sum (aka scan) on idata, storing the result into odata.
         */
         void scan(int n, int *odata, const int *idata) {
             timer().startGpuTimer();
-            // TODO
+
+            // Kernel configuration for the block-level scan and block sum computation.
+            const int blockSize = BLOCK_SIZE;
+            dim3 blockCount = (n + blockSize - 1) / blockSize;
+            // For double buffering, shared memory must be able to hold 2 block-sized buffers.
+            const int blockSharedMemSize = 2 * blockSize * sizeof(int);
+
+            // Kernel configuration for the block sum scan (all blocks' final prefix sums).
+            const int blockSumBlockSize = blockCount.x;
+            dim3 blockSumBlockCount = (blockCount.x + blockSumBlockSize - 1) / blockSumBlockSize;
+            // For double buffering, shared memory must be able to hold 2 block-sized buffers.
+            const int blockSumSharedMemSize = 2 * blockSumBlockSize * sizeof(int);
+
+            int *idataDevice = nullptr;
+            cudaMalloc(&idataDevice, n * sizeof(int));
+            checkCUDAError("failed to malloc idataDevice");
+
+            int *odataDevice = nullptr;
+            cudaMalloc(&odataDevice, n * sizeof(int));
+            checkCUDAError("failed to malloc odataDevice");
+
+            int *iblockSumDevice = nullptr;
+            cudaMalloc(&iblockSumDevice, blockCount.x * sizeof(int));
+            checkCUDAError("failed to malloc iblockSumDevice");
+
+            int *oblockSumDevice = nullptr;
+            cudaMalloc(&oblockSumDevice, blockCount.x * sizeof(int));
+            checkCUDAError("failed to malloc oblockSumDevice");
+
+            cudaMemcpy(idataDevice, idata, n * sizeof(int), cudaMemcpyHostToDevice);
+            checkCUDAError("failed to memcpy idata to idataDevice");
+
+            // No need to synchronize CPU-GPU here: the CPU blocks until the transfer is complete.
+            // "The function will return once the pageable buffer has been copied to the staging
+            // memory for DMA transfer to device memory ...", according to
+            // https://docs.nvidia.com/cuda/cuda-runtime-api/api-sync-behavior.html#api-sync-behavior__memcpy-sync.
+
+            // Per-block exclusive scan of the original input. iblockSumDevice will store the
+            // final prefix sums computed by each block.
+            exclusiveScanKernel<<<blockCount, blockSize, blockSharedMemSize>>>(n, odataDevice, idataDevice, iblockSumDevice);
+            cudaDeviceSynchronize();
+
+            // Exclusive scan of the final prefix sums computed by each block. oblockSumDevice
+            // will store the increments to be added to each block's scan results.
+            exclusiveScanKernel<<<blockSumBlockCount, blockSumBlockSize, blockSumSharedMemSize>>>(blockCount.x, oblockSumDevice, iblockSumDevice);
+            cudaDeviceSynchronize();
+
+            // Add the block increments to the original scan results to obtain the final results.
+            addBlockIncrementsKernel<<<blockCount, blockSize>>>(n, odataDevice, oblockSumDevice);
+            cudaDeviceSynchronize();
+
+            cudaMemcpy(odata, odataDevice, n * sizeof(int), cudaMemcpyDeviceToHost);
+            checkCUDAError("failed to memcpy odataDevice to odata");
+
+            cudaFree(oblockSumDevice);
+            checkCUDAError("failed to free oblockSumDevice");
+            cudaFree(iblockSumDevice);
+            checkCUDAError("failed to free iblockSumDevice");
+            cudaFree(odataDevice);
+            checkCUDAError("failed to free odataDevice");
+            cudaFree(idataDevice);
+            checkCUDAError("failed to free idataDevice");
 
             timer().endGpuTimer();
         }
     }
 }
diff --git a/stream_compaction/thrust.cu b/stream_compaction/thrust.cu
index 1def45e..cb44f01 100644
--- a/stream_compaction/thrust.cu
+++ b/stream_compaction/thrust.cu
@@ -19,9 +19,12 @@ namespace StreamCompaction {
         */
         void scan(int n, int *odata, const int *idata) {
             timer().startGpuTimer();
-            // TODO use `thrust::exclusive_scan`
-            // example: for device_vectors dv_in and dv_out:
-            // thrust::exclusive_scan(dv_in.begin(), dv_in.end(), dv_out.begin());
+
+            thrust::device_vector<int> idataDevice(idata, idata + n);
+            thrust::device_vector<int> odataDevice(n);
+            thrust::exclusive_scan(idataDevice.begin(), idataDevice.end(), odataDevice.begin());
+            thrust::copy(odataDevice.begin(), odataDevice.end(), odata);
+
             timer().endGpuTimer();
         }
     }