diff --git a/README.md b/README.md index 0e38ddb..03288a6 100644 --- a/README.md +++ b/README.md @@ -3,12 +3,152 @@ CUDA Stream Compaction **University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 2** -* (TODO) YOUR NAME HERE - * (TODO) [LinkedIn](), [personal website](), [twitter](), etc. -* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab) +* Yuhan Liu + * [LinkedIn](https://www.linkedin.com/in/yuhan-liu-), [personal website](https://liuyuhan.me/), [twitter](https://x.com/yuhanl_?lang=en), etc. +* Tested on: Windows 11 Pro, Ultra 7 155H @ 1.40 GHz 32GB, RTX 4060 8192MB (Personal Laptop) -### (TODO: Your README) +## README! -Include analysis, etc. (Remember, this is public, so don't put -anything here that you don't want to share with the world.) +### Project Description +* This project investigates parallel algorithms by implementing various versions of scan (prefix-sum), stream compaction (removing unwanted elements), and sort algorithms. We begin with CPU implementations of all the aforementioned algorithms before developing GPU versions, including naive and work-efficient approaches, for comparison purposes. + +### Performance Analysis + +**Optimizing Block Size** + + + +* To decide on one block size for all parallel algorithm implementations, I assessed the naive and work-efficient runtimes with increasing block sizes. In the previous project, we established a tradeoff concerning block size between increasing GPU utilization and limiting resource availability for each block. Thus, we can choose an optimal block size by finding the dip in runtimes; in this case the timings are close, but we proceed with a block size of 256. 
+ +#### Comparison of GPU Scan Implementations + +| CPU | Naive | Work-Efficient | Thrust | +| :------------------------------: |:------------------------------: |:-----------------------------------------------: |:-----------------------------------------------:| +| For smaller datasets, the overhead of managing parallel execution on a GPU (e.g., kernel launches, thread synchronization) offsets the parallel advantage, making the CPU more efficient. However, as data scales, the runtime of scanning on the CPU increases linearly. In the graph below, where array sizes increase exponentially, the CPU scan runtime grows accordingly. | The naive parallel approach rivals the work-efficient approach for most of the array sizes below. However, this method has a memory latency bottleneck: its excessive use of global memory in the summation causes the runtime to swell as the array size grows. | The work-efficient GPU scan starts with more overhead than the naive implementation for smaller arrays, but as they grow larger, it outperforms both the CPU and naive versions. The work-efficient binary tree structure reduces unnecessary memory operations and optimizes thread usage by performing operations in place. The bottleneck here is thread usage: the binary tree structure leaves threads underutilized as the number of operations at each step of the algorithm shrinks. Because of this, more primitive methods like the naive and CPU scans come close in runtime for smaller array sizes. | Referencing the NVIDIA developer [documentation](https://developer.nvidia.com/gpugems/gpugems3/part-vi-gpu-computing/chapter-39-parallel-prefix-sum-scan-cuda) for exclusive scan, we notice that the Thrust implementation employs techniques to improve performance, including optimizing shared memory and load balancing. Setting up these optimizations, like bank-conflict avoidance and loop unrolling, introduces overhead, which makes more primitive methods more efficient for small array sizes. 
However, even in the graph below, the Thrust scan implementation clearly does not scale as steeply as any of the other scan implementations. | + + + +#### Output of Testing +test array SIZE: 2^18, blockSize: 256 + +``` +**************** +** SCAN TESTS ** +**************** + [ 35 49 36 24 1 48 46 6 49 24 44 47 5 ... 2 0 ] +==== cpu scan, power-of-two ==== + elapsed time: 0.4305ms (std::chrono Measured) + [ 0 35 84 120 144 145 193 239 245 294 318 362 409 ... 6412415 6412417 ] +==== cpu scan, non-power-of-two ==== + elapsed time: 0.5392ms (std::chrono Measured) + [ 0 35 84 120 144 145 193 239 245 294 318 362 409 ... 6412308 6412343 ] + passed +==== naive scan, power-of-two ==== + elapsed time: 0.368768ms (CUDA Measured) + passed +==== naive scan, non-power-of-two ==== + elapsed time: 0.219936ms (CUDA Measured) + passed +==== work-efficient scan, power-of-two ==== + elapsed time: 0.333856ms (CUDA Measured) + passed +==== work-efficient scan, non-power-of-two ==== + elapsed time: 0.366688ms (CUDA Measured) + passed +==== thrust scan, power-of-two ==== + elapsed time: 1.55728ms (CUDA Measured) + passed +==== thrust scan, non-power-of-two ==== + elapsed time: 0.344288ms (CUDA Measured) + passed + +***************************** +** STREAM COMPACTION TESTS ** +***************************** + [ 1 3 0 0 3 2 0 2 1 2 0 1 3 ... 2 1 ] +==== cpu compact without scan, power-of-two ==== + elapsed time: 0.6932ms (std::chrono Measured) + [ 1 3 3 2 2 1 2 1 3 3 2 2 3 ... 2 1 ] + passed +==== cpu compact without scan, non-power-of-two ==== + elapsed time: 0.7058ms (std::chrono Measured) + [ 1 3 3 2 2 1 2 1 3 3 2 2 3 ... 1 1 ] + passed +==== cpu compact with scan ==== + elapsed time: 1.3968ms (std::chrono Measured) + [ 1 3 3 2 2 1 2 1 3 3 2 2 3 ... 
2 1 ] + passed +==== work-efficient compact, power-of-two ==== + elapsed time: 0.445152ms (CUDA Measured) + passed +==== work-efficient compact, non-power-of-two ==== + elapsed time: 0.665728ms (CUDA Measured) + passed + +***************************** +***** RADIX SORT TESTS ****** +***************************** + +Radix Sort, hard-coded lecture example test + + [ 4 7 2 6 3 5 1 0 ] +==== cpu sort (std::sort) ==== + elapsed time: 0.0002ms (std::chrono Measured) + [ 0 1 2 3 4 5 6 7 ] +==== radix sort ==== + elapsed time: 5.5552ms (CUDA Measured) + [ 0 1 2 3 4 5 6 7 ] + passed + +Radix Sort, pow2 consecutive ints + + [ 0 1 2 3 4 5 6 7 8 9 10 11 12 ... 262142 262143 ] +==== cpu sort (std::sort) ==== + elapsed time: 1.4753ms (std::chrono Measured) + [ 0 1 2 3 4 5 6 7 8 9 10 11 12 ... 262142 262143 ] +==== radix sort ==== + elapsed time: 13.777ms (CUDA Measured) + [ 0 1 2 3 4 5 6 7 8 9 10 11 12 ... 262142 262143 ] + passed + +Radix Sort, non-pow2 consecutive ints + + [ 0 1 2 3 4 5 6 7 8 9 10 11 12 ... 262139 262140 ] +==== cpu sort (std::sort) ==== + elapsed time: 1.3184ms (std::chrono Measured) + [ 0 1 2 3 4 5 6 7 8 9 10 11 12 ... 262139 262140 ] +==== radix sort ==== + elapsed time: 13.37ms (CUDA Measured) + [ 0 1 2 3 4 5 6 7 8 9 10 11 12 ... 262139 262140 ] + passed + +Radix Sort, non-pow2 shuffled ints + + [ 4785 499 16736 27824 14951 31998 8696 23806 7549 30974 31344 14697 29955 ... 21785 7497 ] +==== cpu sort (std::sort) ==== + elapsed time: 14.9359ms (std::chrono Measured) + [ 0 0 0 0 0 0 0 0 1 1 1 1 2 ... 32767 32767 ] +==== radix sort ==== + elapsed time: 18.8853ms (CUDA Measured) + [ 0 0 0 0 0 0 0 0 1 1 1 1 2 ... 32767 32767 ] + passed +``` + +### Additional Feature: Radix Sort + +* I implemented parallel radix sort as an additional module in the `stream_compaction` subproject, along with a CPU `std::sort` wrapper for comparison. The implementation can be found in `radix.cu` and `radix.h`. 
+ +* The radix sort function takes as input the size of the array along with pointers to the output and input arrays. Below is a code snippet from a radix test in the main method, showing how it is called: + +``` +StreamCompaction::Radix::sort(SIZE, c, a); // a is the unsorted input array +printArray(SIZE, c, true); // print the output sorted array, which is saved in c +``` + +* I wrote several tests for radix sort, in which I compare both the correctness and runtime of my implementation to the CPU equivalent (std::sort). First, I compared radix sort with CPU sort on the fixed array [4, 7, 2, 6, 3, 5, 1, 0] from lecture. Then, I evaluated radix sort on sequential arrays of power-of-two and non-power-of-two sizes. Lastly, I tested radix sort on a shuffled array. For each test, I compared the results and performance of radix sort with CPU sort, measuring CUDA performance where applicable. + +**Radix Sort Performance Evaluation** +* Radix sort performs worse than std::sort for smaller array sizes due to overhead and initialization costs associated with GPU processing and memory. However, as the array size grows, radix sort's performance improves relative to std::sort. This is consistent with the performance graph: radix sort has a complexity of O(n·k), where k is the number of bits, while a comparison-based CPU sort is O(n·log n). 
+ + diff --git a/img/block_opt.png b/img/block_opt.png new file mode 100644 index 0000000..31dc9d8 Binary files /dev/null and b/img/block_opt.png differ diff --git a/img/radix_perf.png b/img/radix_perf.png new file mode 100644 index 0000000..137876e Binary files /dev/null and b/img/radix_perf.png differ diff --git a/img/scan_perf.png b/img/scan_perf.png new file mode 100644 index 0000000..5e03616 Binary files /dev/null and b/img/scan_perf.png differ diff --git a/src/main.cpp b/src/main.cpp index 896ac2b..1848896 100644 --- a/src/main.cpp +++ b/src/main.cpp @@ -11,16 +11,18 @@ #include #include #include +#include #include "testing_helpers.hpp" +#include -const int SIZE = 1 << 8; // feel free to change the size of array +const int SIZE = 1 << 22; // feel free to change the size of array const int NPOT = SIZE - 3; // Non-Power-Of-Two int *a = new int[SIZE]; int *b = new int[SIZE]; int *c = new int[SIZE]; int main(int argc, char* argv[]) { - // Scan tests + // Scan tests -------------------------------------- printf("\n"); printf("****************\n"); @@ -100,10 +102,10 @@ int main(int argc, char* argv[]) { printf("** STREAM COMPACTION TESTS **\n"); printf("*****************************\n"); - // Compaction tests + // Compaction tests -------------------------------------- - genArray(SIZE - 1, a, 4); // Leave a 0 at the end to test that edge case - a[SIZE - 1] = 0; + genArray(SIZE - 1, a, 4); + a[SIZE - 1] = 1; // Leave a 1 at the end to test that edge case printArray(SIZE, a, true); int count, expectedCount, expectedNPOT; @@ -147,7 +149,123 @@ int main(int argc, char* argv[]) { //printArray(count, c, true); printCmpLenResult(count, expectedNPOT, b, c); - system("pause"); // stop Win32 console from closing on exit + + // radix sort tests -------------------------------------- + + printf("\n"); + printf("*****************************\n"); + printf("***** RADIX SORT TESTS ******\n"); + printf("*****************************\n"); + + // pow2 test + + zeroArray(8, a); + 
std::cout << " " << std::endl; + std::cout << "Radix Sort, hard-coded lecture example test" << std::endl; + std::cout << " " << std::endl; + + // example from slides for debugging + a[0] = 4; + a[1] = 7; + a[2] = 2; + a[3] = 6; + a[4] = 3; + a[5] = 5; + a[6] = 1; + a[7] = 0; + + printArray(8, a, true); + + zeroArray(8, b); + printDesc("cpu sort (std::sort)"); + StreamCompaction::CPU::sort(8, b, a); + printElapsedTime(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation(), "(std::chrono Measured)"); + printArray(8, b, true); + + zeroArray(8, c); + printDesc("radix sort"); + StreamCompaction::Radix::sort(8, c, a); + printElapsedTime(StreamCompaction::Radix::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); + printArray(8, c, true); + printCmpResult(8, b, c); + + // pow2 test + + zeroArray(SIZE, a); + std::cout << " " << std::endl; + std::cout << "Radix Sort, pow2 consecutive ints" << std::endl; + std::cout << " " << std::endl; + + for (int i = 0; i < SIZE; i++) { + a[i] = i; + } + + printArray(SIZE, a, true); + + zeroArray(SIZE, b); + printDesc("cpu sort (std::sort)"); + StreamCompaction::CPU::sort(SIZE, b, a); + printElapsedTime(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation(), "(std::chrono Measured)"); + printArray(SIZE, b, true); + + zeroArray(SIZE, c); + printDesc("radix sort"); + StreamCompaction::Radix::sort(SIZE, c, a); + printElapsedTime(StreamCompaction::Radix::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); + printArray(SIZE, c, true); + printCmpResult(SIZE, b, c); + + // npot test + + zeroArray(SIZE, a); + std::cout << " " << std::endl; + std::cout << "Radix Sort, non-pow2 consecutive ints" << std::endl; + std::cout << " " << std::endl; + + for (int i = 0; i < NPOT; i++) { + a[i] = i; + } + + printArray(NPOT, a, true); + + zeroArray(NPOT, b); + printDesc("cpu sort (std::sort)"); + StreamCompaction::CPU::sort(NPOT, b, a); + 
printElapsedTime(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation(), "(std::chrono Measured)"); + printArray(NPOT, b, true); + + zeroArray(NPOT, c); + printDesc("radix sort"); + StreamCompaction::Radix::sort(NPOT, c, a); + printElapsedTime(StreamCompaction::Radix::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); + printArray(NPOT, c, true); + printCmpResult(NPOT, b, c); + + // npot shuffled + + genArray(NPOT, a, NPOT); + std::cout << " " << std::endl; + std::cout << "Radix Sort, non-pow2 shuffled ints" << std::endl; + std::cout << " " << std::endl; + + printArray(NPOT, a, true); + + zeroArray(NPOT, b); + printDesc("cpu sort (std::sort)"); + StreamCompaction::CPU::sort(NPOT, b, a); + printElapsedTime(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation(), "(std::chrono Measured)"); + printArray(NPOT, b, true); + + zeroArray(NPOT, c); + printDesc("radix sort"); + StreamCompaction::Radix::sort(NPOT, c, a); + printElapsedTime(StreamCompaction::Radix::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); + printArray(NPOT, c, true); + printCmpResult(NPOT, b, c); + + // END TESTS ---------------------------------------- + + //system("pause"); // stop Win32 console from closing on exit delete[] a; delete[] b; delete[] c; diff --git a/src/testing_helpers.hpp b/src/testing_helpers.hpp index 025e94a..8f08d88 100644 --- a/src/testing_helpers.hpp +++ b/src/testing_helpers.hpp @@ -51,7 +51,6 @@ void onesArray(int n, int *a) { void genArray(int n, int *a, int maxval) { srand(time(nullptr)); - for (int i = 0; i < n; i++) { a[i] = rand() % maxval; } diff --git a/stream_compaction/CMakeLists.txt b/stream_compaction/CMakeLists.txt index 19511ca..7378625 100644 --- a/stream_compaction/CMakeLists.txt +++ b/stream_compaction/CMakeLists.txt @@ -4,6 +4,7 @@ set(headers "naive.h" "efficient.h" "thrust.h" + "radix.h" ) set(sources @@ -12,6 +13,7 @@ set(sources "naive.cu" "efficient.cu" "thrust.cu" + "radix.cu" ) 
list(SORT headers) diff --git a/stream_compaction/common.cu b/stream_compaction/common.cu index 2ed6d63..243f712 100644 --- a/stream_compaction/common.cu +++ b/stream_compaction/common.cu @@ -1,4 +1,5 @@ #include "common.h" +#include void checkCUDAErrorFn(const char *msg, const char *file, int line) { cudaError_t err = cudaGetLastError(); @@ -23,7 +24,13 @@ namespace StreamCompaction { * which map to 0 will be removed, and elements which map to 1 will be kept. */ __global__ void kernMapToBoolean(int n, int *bools, const int *idata) { - // TODO + int i = (blockIdx.x * blockDim.x) + threadIdx.x; + + if (i >= n) { + return; + } else { + bools[i] = (idata[i] == 0) ? 0 : 1; + } } /** @@ -32,7 +39,14 @@ namespace StreamCompaction { */ __global__ void kernScatter(int n, int *odata, const int *idata, const int *bools, const int *indices) { - // TODO + + int i = (blockIdx.x * blockDim.x) + threadIdx.x; + + if (i >= n) { + return; + } else if (bools[i] == 1) { + odata[indices[i]] = idata[i]; + } } } diff --git a/stream_compaction/cpu.cu b/stream_compaction/cpu.cu index 719fa11..137e695 100644 --- a/stream_compaction/cpu.cu +++ b/stream_compaction/cpu.cu @@ -12,6 +12,18 @@ namespace StreamCompaction { return timer; } + /** + * Standard sort + * (for parallel radix comparison) + */ + void sort(int n, int* odata, const int* idata) { + memcpy(odata, idata, n * sizeof(int)); + + timer().startCpuTimer(); + std::sort(odata, odata + n); + timer().endCpuTimer(); + } + /** * CPU scan (prefix sum). * For performance analysis, this is supposed to be a simple for loop. @@ -19,20 +31,34 @@ namespace StreamCompaction { */ void scan(int n, int *odata, const int *idata) { timer().startCpuTimer(); - // TODO + + odata[0] = 0; + for (int i = 0; i < n - 1; i++) { + odata[i + 1] = odata[i] + idata[i]; + } + timer().endCpuTimer(); } /** * CPU stream compaction without using the scan function. + * remove 0s from an array of ints * * @returns the number of elements remaining after compaction. 
*/ int compactWithoutScan(int n, int *odata, const int *idata) { + int idx = 0; timer().startCpuTimer(); - // TODO + + for (int i = 0; i < n; i++) { + if (idata[i] != 0) { + odata[idx] = idata[i]; + idx++; + } + } + timer().endCpuTimer(); - return -1; + return idx; } /** @@ -41,10 +67,33 @@ namespace StreamCompaction { * @returns the number of elements remaining after compaction. */ int compactWithScan(int n, int *odata, const int *idata) { + int* tmp = new int[n]; + int* scanned = new int[n]; timer().startCpuTimer(); - // TODO + + // temporary array + for (int i = 0; i < n; i++) { + tmp[i] = (idata[i] == 0) ? 0 : 1; + } + + // copied scan function (to not call timer twice) + scanned[0] = 0; + for (int i = 0; i < n - 1; i++) { + scanned[i + 1] = scanned[i] + tmp[i]; + } + + // scatter + for (int i = 0; i < n; i++) { + if (tmp[i] == 1) { + odata[scanned[i]] = idata[i]; + } + } + timer().endCpuTimer(); - return -1; + int num = scanned[n - 1] + tmp[n - 1]; + delete[] tmp; + delete[] scanned; + return num; } } } diff --git a/stream_compaction/cpu.h b/stream_compaction/cpu.h index 873c047..ae8d3be 100644 --- a/stream_compaction/cpu.h +++ b/stream_compaction/cpu.h @@ -6,6 +6,8 @@ namespace StreamCompaction { namespace CPU { StreamCompaction::Common::PerformanceTimer& timer(); + void sort(int n, int* odata, const int* idata); + void scan(int n, int *odata, const int *idata); int compactWithoutScan(int n, int *odata, const int *idata); diff --git a/stream_compaction/efficient.cu b/stream_compaction/efficient.cu index 2db346e..fd4355c 100644 --- a/stream_compaction/efficient.cu +++ b/stream_compaction/efficient.cu @@ -3,6 +3,10 @@ #include "common.h" #include "efficient.h" +#include + +#define blockSize 256 + namespace StreamCompaction { namespace Efficient { using StreamCompaction::Common::PerformanceTimer; @@ -12,13 +16,83 @@ namespace StreamCompaction { return timer; } + /** + * Kernel function for up-sweep + */ + __global__ void kernUpSweep(int n, int d, int* idata) { + 
+ int k = (blockIdx.x * blockDim.x) + threadIdx.x; + k *= (1 << (d + 1)); + + if (k >= n) { + return; + } else { + // from notes: x[k + 2^(d+1) - 1] += x[k + 2^(d) -1] + idata[k + (1 << (d + 1)) - 1] += idata[k + (1 << d) - 1]; + } + } + + /** + * Kernel function for down-sweep + */ + __global__ void kernDownSweep(int n, int d, int* data) { + + int k = (blockIdx.x * blockDim.x) + threadIdx.x; + k *= (1 << (d + 1)); + + if (k >= n) { + return; + } else { + int t = data[k + (1 << d) - 1]; // save left child + data[k + (1 << d) - 1] = data[k + (1 << (d + 1)) - 1]; // set left child to this node's val + data[k + (1 << (d + 1)) - 1] += t; // set right child to old left val + this node's val + } + } + /** * Performs prefix-sum (aka scan) on idata, storing the result into odata. */ void scan(int n, int *odata, const int *idata) { - timer().startGpuTimer(); - // TODO - timer().endGpuTimer(); + + int* dev_idata; // one buffer for in-place scan + + int nextPowSpace = 1 << ilog2ceil(n); + + cudaMalloc((void**)&dev_idata, nextPowSpace * sizeof(int)); + checkCUDAError("cudaMalloc dev_idata failed!"); + + cudaMemcpy(dev_idata + (nextPowSpace - n), idata, n * sizeof(int), cudaMemcpyHostToDevice); + + timer().startGpuTimer(); // ----------------------------- + + // up-sweep + for (int d = 0; d < ilog2ceil(n); d++) { + + int n_adj = nextPowSpace / (1 << (d + 1)); + dim3 fullBlocksPerGrid((n_adj + blockSize - 1) / blockSize); + + kernUpSweep<<<fullBlocksPerGrid, blockSize>>>(nextPowSpace, d, dev_idata); + checkCUDAError("kernUpSweep failed!"); + cudaDeviceSynchronize(); + } + + cudaMemset(dev_idata + (nextPowSpace - 1), 0, sizeof(int)); + + // down-sweep + for (int d = ilog2ceil(n) - 1; d >= 0; d--) { + + int n_adj = nextPowSpace / (1 << (d + 1)); + dim3 fullBlocksPerGrid((n_adj + blockSize - 1) / blockSize); + + kernDownSweep<<<fullBlocksPerGrid, blockSize>>>(nextPowSpace, d, dev_idata); + checkCUDAError("kernDownSweep failed!"); + cudaDeviceSynchronize(); + } + + timer().endGpuTimer(); // -------------------------------- + + 
cudaMemcpy(odata, dev_idata + (nextPowSpace - n), n * sizeof(int), cudaMemcpyDeviceToHost); + cudaFree(dev_idata); } /** @@ -31,10 +105,80 @@ namespace StreamCompaction { * @returns The number of elements remaining after compaction. */ int compact(int n, int *odata, const int *idata) { - timer().startGpuTimer(); - // TODO - timer().endGpuTimer(); - return -1; + + int* dev_idata; + int* dev_scanned; + int* dev_bools; + int* dev_odata; + + int nextPowSpace = 1 << ilog2ceil(n); + + cudaMalloc((void**)&dev_idata, nextPowSpace * sizeof(int)); + checkCUDAError("cudaMalloc dev_idata failed!"); + cudaMemcpy(dev_idata, idata, n * sizeof(int), cudaMemcpyHostToDevice); + + cudaMalloc((void**)&dev_bools, nextPowSpace * sizeof(int)); + checkCUDAError("cudaMalloc dev_bools failed!"); + + cudaMalloc((void**)&dev_scanned, nextPowSpace * sizeof(int)); + checkCUDAError("cudaMalloc dev_scanned failed!"); + + cudaMalloc((void**)&dev_odata, nextPowSpace * sizeof(int)); + checkCUDAError("cudaMalloc dev_odata failed!"); + + dim3 fullBlocksPerGrid((n + blockSize - 1) / blockSize); + + timer().startGpuTimer(); // -------------- + + // create temporary binary array + StreamCompaction::Common::kernMapToBoolean<<<fullBlocksPerGrid, blockSize>>>(n, dev_bools, dev_idata); + cudaMemcpy(dev_scanned, dev_bools, nextPowSpace * sizeof(int), cudaMemcpyDeviceToDevice); + + // scan (copied and pasted) -------| + + // up-sweep + for (int d = 0; d < ilog2ceil(n); d++) { + + int n_adj = nextPowSpace / (1 << (d + 1)); + dim3 fullBlocksPerGrid_adj((n_adj + blockSize - 1) / blockSize); + + kernUpSweep<<<fullBlocksPerGrid_adj, blockSize>>>(nextPowSpace, d, dev_scanned); + checkCUDAError("kernUpSweep failed!"); + cudaDeviceSynchronize(); + } + + cudaMemset(dev_scanned + (nextPowSpace - 1), 0, sizeof(int)); + + // down-sweep + for (int d = ilog2ceil(n) - 1; d >= 0; d--) { + + int n_adj = nextPowSpace / (1 << (d + 1)); + dim3 fullBlocksPerGrid_adj((n_adj + blockSize - 1) / blockSize); + + kernDownSweep<<<fullBlocksPerGrid_adj, blockSize>>>(nextPowSpace, d, dev_scanned); + 
checkCUDAError("kernDownSweep failed!"); + cudaDeviceSynchronize(); + } + + // end scan stuff ------------------| + + // scatter + StreamCompaction::Common::kernScatter<<<fullBlocksPerGrid, blockSize>>>(n, dev_odata, dev_idata, dev_bools, dev_scanned); + + timer().endGpuTimer(); // ---------------- + + int count = 0; + cudaMemcpy(&count, dev_scanned + (n - 1), sizeof(int), cudaMemcpyDeviceToHost); + count += (idata[n - 1] != 0); + + cudaMemcpy(odata, dev_odata, n * sizeof(int), cudaMemcpyDeviceToHost); + + cudaFree(dev_idata); + cudaFree(dev_bools); + cudaFree(dev_scanned); + cudaFree(dev_odata); + + return count; } } } diff --git a/stream_compaction/naive.cu b/stream_compaction/naive.cu index 4308876..ae8b408 100644 --- a/stream_compaction/naive.cu +++ b/stream_compaction/naive.cu @@ -3,6 +3,10 @@ #include "common.h" #include "naive.h" +#include + +#define blockSize 256 + namespace StreamCompaction { namespace Naive { using StreamCompaction::Common::PerformanceTimer; @@ -11,15 +15,61 @@ namespace StreamCompaction { static PerformanceTimer timer; return timer; } - // TODO: __global__ + + /** + * Kernel function for scan + */ + __global__ void kernScan(int n, int d, int* odata, int* idata) { + + int k = (blockIdx.x * blockDim.x) + threadIdx.x; + int off = 1 << (d - 1); + + if (k >= n) { + return; + } + else if (k >= off) { + odata[k] = idata[k] + idata[k - off]; + } + else { + odata[k] = idata[k]; + } + + } /** * Performs prefix-sum (aka scan) on idata, storing the result into odata. 
*/ void scan(int n, int *odata, const int *idata) { + + int* dev_idata; + int* dev_odata; + + cudaMalloc((void**)&dev_idata, n * sizeof(int)); + checkCUDAError("cudaMalloc dev_idata failed!"); + + cudaMalloc((void**)&dev_odata, n * sizeof(int)); + checkCUDAError("cudaMalloc dev_odata failed!"); + + cudaMemcpy(dev_idata, idata, n * sizeof(int), cudaMemcpyHostToDevice); + + dim3 fullBlocksPerGrid((n + blockSize - 1) / blockSize); + timer().startGpuTimer(); - // TODO + + // kernel invocations + for (int d = 1; d <= ilog2ceil(n); d++) { + kernScan<<<fullBlocksPerGrid, blockSize>>>(n, d, dev_odata, dev_idata); + checkCUDAError("kernScan failed!"); + + std::swap(dev_odata, dev_idata); + } + timer().endGpuTimer(); + + odata[0] = 0; // exclusive scan: shift right and put the identity in the first slot + cudaMemcpy(odata + 1, dev_idata, (n-1) * sizeof(int), cudaMemcpyDeviceToHost); + + cudaFree(dev_idata); + cudaFree(dev_odata); } } } diff --git a/stream_compaction/radix.cu b/stream_compaction/radix.cu new file mode 100644 index 0000000..4964c26 --- /dev/null +++ b/stream_compaction/radix.cu @@ -0,0 +1,192 @@ +#include +#include +#include "common.h" +#include "radix.h" + +#include + +#define blockSize 256 + +namespace StreamCompaction { + namespace Radix { + using StreamCompaction::Common::PerformanceTimer; + PerformanceTimer& timer() + { + static PerformanceTimer timer; + return timer; + } + + // populate dev_b with current bit, but reversed + __global__ void kernIsolateBit(int n, int bit, int* i, int* b) { + + int idx = blockIdx.x * blockDim.x + threadIdx.x; + if (idx >= n) { + return; + } else { + + int bitVal = (i[idx] >> bit) & 1; + b[idx] = bitVal ^ 1; + } + } + + // Kernel function for up-sweep + __global__ void kernUpSweep(int n, int d, int* idata) { + + int k = (blockIdx.x * blockDim.x) + threadIdx.x; + k *= (1 << (d + 1)); + + if (k >= n) { + return; + } + else { + // from notes: x[k + 2^(d+1) - 1] += x[k + 2^(d) -1] + idata[k + (1 << (d + 1)) - 1] += idata[k + (1 << d) - 1]; + } + } + + // Kernel function for down-sweep + __global__ void kernDownSweep(int n, int d, int* data) 
{ + + int k = (blockIdx.x * blockDim.x) + threadIdx.x; + k *= (1 << (d + 1)); + + if (k >= n) { + return; + } + else { + int t = data[k + (1 << d) - 1]; // save left child + data[k + (1 << d) - 1] = data[k + (1 << (d + 1)) - 1]; // set left child to this node's val + data[k + (1 << (d + 1)) - 1] += t; // set right child to old left val + this node's val + } + } + + // Kernel function for getting address for writing true keys + __global__ void kernCalcAddress(int n, int tf, int* f, int* t) { + int idx = blockIdx.x * blockDim.x + threadIdx.x; + if (idx >= n) { + return; + } + else { + t[idx] = idx - f[idx] + tf; + } + } + + // Kernel function for picking each element's destination address (f for false keys, t for true keys) + __global__ void kernScatter(int n, int* d, int* b, int* t, int* f) { + int idx = blockIdx.x * blockDim.x + threadIdx.x; + if (idx >= n) { + return; + } + else { + d[idx] = b[idx] ? f[idx] : t[idx]; + } + } + + // kernel function to reorder original numbers using index d array + __global__ void kernReorder(int n, int* d, int* i, int* out) { + int idx = blockIdx.x * blockDim.x + threadIdx.x; + if (idx >= n) { + return; + } + else { + int newIdx = d[idx]; + out[newIdx] = i[idx]; + } + } + + // radix sort + void sort(int n, int *odata, const int *idata) { + + // labels are from parallel radix sort slides + int* dev_i; + int* dev_b; + int* dev_f; + int* dev_d; + int* dev_o; + + int tf; // total falses + + int nextPowSpace = 1 << ilog2ceil(n); + + cudaMalloc((void**)&dev_i, nextPowSpace * sizeof(int)); + checkCUDAError("cudaMalloc dev_i failed!"); + + cudaMalloc((void**)&dev_b, nextPowSpace * sizeof(int)); + checkCUDAError("cudaMalloc dev_b failed!"); + + cudaMalloc((void**)&dev_f, nextPowSpace * sizeof(int)); + checkCUDAError("cudaMalloc dev_f failed!"); + + cudaMalloc((void**)&dev_d, nextPowSpace * sizeof(int)); + checkCUDAError("cudaMalloc dev_d failed!"); + + cudaMalloc((void**)&dev_o, nextPowSpace * sizeof(int)); + checkCUDAError("cudaMalloc dev_o failed!"); + + cudaMemcpy(dev_i, idata, 
n * sizeof(int), cudaMemcpyHostToDevice); + dim3 fullBlocksPerGrid((n + blockSize - 1) / blockSize); + + timer().startGpuTimer(); // -------------- + + for (int i = 0; i < 32; i++) { + + // isolate current bit to get i array + kernIsolateBit<<<fullBlocksPerGrid, blockSize>>>(nextPowSpace, i, dev_i, dev_b); + + int temp; + cudaMemcpy(&temp, dev_b + (n - 1), sizeof(int), cudaMemcpyDeviceToHost); + cudaMemcpy(dev_f, dev_b, nextPowSpace * sizeof(int), cudaMemcpyDeviceToDevice); + + // get f array by running scan on e + + // up-sweep + for (int d = 0; d < ilog2ceil(n); d++) { + + int n_adj = nextPowSpace / (1 << (d + 1)); + dim3 fullBlocksPerGrid_adj((n_adj + blockSize - 1) / blockSize); + + kernUpSweep<<<fullBlocksPerGrid_adj, blockSize>>>(nextPowSpace, d, dev_f); + checkCUDAError("kernUpSweep failed!"); + } + + cudaMemset(dev_f + (nextPowSpace - 1), 0, sizeof(int)); + + // down-sweep + for (int d = ilog2ceil(n) - 1; d >= 0; d--) { + + int n_adj = nextPowSpace / (1 << (d + 1)); + dim3 fullBlocksPerGrid_adj((n_adj + blockSize - 1) / blockSize); + + kernDownSweep<<<fullBlocksPerGrid_adj, blockSize>>>(nextPowSpace, d, dev_f); + checkCUDAError("kernDownSweep failed!"); + } + + // calculate total falses + cudaMemcpy(&tf, dev_f + (n - 1), sizeof(int), cudaMemcpyDeviceToHost); + tf += temp; + + // t array (formula) + kernCalcAddress<<<fullBlocksPerGrid, blockSize>>>(n, tf, dev_f, dev_d); + + // scatter based on address to get d + kernScatter<<<fullBlocksPerGrid, blockSize>>>(n, dev_d, dev_b, dev_d, dev_f); + + // match indices back + kernReorder<<<fullBlocksPerGrid, blockSize>>>(n, dev_d, dev_i, dev_o); + + cudaMemcpy(dev_i, dev_o, nextPowSpace * sizeof(int), cudaMemcpyDeviceToDevice); + } + + timer().endGpuTimer(); // ---------------- + + cudaMemcpy(odata, dev_o, n * sizeof(int), cudaMemcpyDeviceToHost); + + cudaFree(dev_i); + cudaFree(dev_b); + cudaFree(dev_f); + cudaFree(dev_d); + cudaFree(dev_o); + + } + } +} diff --git a/stream_compaction/radix.h b/stream_compaction/radix.h new file mode 100644 index 0000000..b81733c --- /dev/null +++ b/stream_compaction/radix.h @@ -0,0 +1,11 @@ +#pragma once + +#include "common.h" + +namespace 
StreamCompaction { + namespace Radix { + StreamCompaction::Common::PerformanceTimer& timer(); + + void sort(int n, int* odata, const int* idata); + } +} \ No newline at end of file diff --git a/stream_compaction/thrust.cu b/stream_compaction/thrust.cu index 1def45e..41a2ea0 100644 --- a/stream_compaction/thrust.cu +++ b/stream_compaction/thrust.cu @@ -18,11 +18,18 @@ namespace StreamCompaction { * Performs prefix-sum (aka scan) on idata, storing the result into odata. */ void scan(int n, int *odata, const int *idata) { + + thrust::host_vector<int> host_in(idata, idata + n); + thrust::device_vector<int> dv_in = host_in; + thrust::device_vector<int> dv_out = host_in; + timer().startGpuTimer(); - // TODO use `thrust::exclusive_scan` - // example: for device_vectors dv_in and dv_out: - // thrust::exclusive_scan(dv_in.begin(), dv_in.end(), dv_out.begin()); + + // for device_vectors dv_in and dv_out: + thrust::exclusive_scan(dv_in.begin(), dv_in.end(), dv_out.begin()); + timer().endGpuTimer(); + cudaMemcpy(odata, dv_out.data().get(), n * sizeof(int), cudaMemcpyDeviceToHost); } } }