diff --git a/README.md b/README.md
index 0e38ddb..9021546 100644
--- a/README.md
+++ b/README.md
@@ -3,12 +3,153 @@ CUDA Stream Compaction
 **University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 2**
 
-* (TODO) YOUR NAME HERE
-  * (TODO) [LinkedIn](), [personal website](), [twitter](), etc.
-* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab)
+* Kevin Dong
+  * [LinkedIn](www.linkedin.com/in/xingyu-dong)
+* Tested on: Windows 11, Intel(R) Core(TM) i7-10750H CPU @ 2.60GHz, RTX 2060
 
-### (TODO: Your README)
+This repo implements several versions of the parallel exclusive scan algorithm and uses them to implement stream
+compaction. The work-efficient scan is also used to implement a parallel radix sort.
 
-Include analysis, etc. (Remember, this is public, so don't put
-anything here that you don't want to share with the world.)
+### Features
+#### CPU Scan and Compaction
+The CPU scan implementation is a straightforward $O(n)$ algorithm. The compaction algorithms, with and without
+scan, also run in $O(n)$ time. All of them perform well on small inputs but slow down as the input size grows.
+#### Naive GPU Scan
+The naive parallel scan finishes in $O(\log n)$ steps but still performs $O(n \log n)$ additions, so it does no
+less total work than the sequential algorithm.
+#### Work-Efficient GPU Scan and Compaction
+The work-efficient scan improves upon the naive algorithm by treating the array as a balanced binary tree. It first
+performs a parallel reduction (the up-sweep), followed by a down-sweep that yields the final scan result. This
+algorithm performs only $O(n)$ additions and is much faster than the naive scan. To accommodate input sizes that are
+not powers of 2, we enlarge the input array to the next power of 2 and pad the rest of the array with 0s.
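The up-sweep/down-sweep pair described above can be illustrated with a sequential C++ sketch. This is a CPU illustration of the algorithm's structure, not the repo's CUDA kernels, and `workEfficientScanSketch` is a hypothetical name:

```cpp
#include <cassert>
#include <vector>

// Sequential sketch of the work-efficient (Blelloch) exclusive scan.
// Assumes data.size() is a power of two; the GPU version guarantees this
// by padding the input with 0s up to the next power of two.
void workEfficientScanSketch(std::vector<int>& data) {
    const int n = static_cast<int>(data.size());
    // Up-sweep: build partial sums up a balanced binary tree.
    for (int offset = 2; offset <= n; offset <<= 1) {
        for (int pos = 0; pos < n; pos += offset) {
            data[pos + offset - 1] += data[pos + offset / 2 - 1];
        }
    }
    // Down-sweep: zero the root, then push partial sums back down the tree,
    // turning the reduction into an exclusive prefix sum.
    data[n - 1] = 0;
    for (int offset = n; offset >= 2; offset >>= 1) {
        for (int pos = 0; pos < n; pos += offset) {
            int left = data[pos + offset / 2 - 1];
            data[pos + offset / 2 - 1] = data[pos + offset - 1];
            data[pos + offset - 1] += left;
        }
    }
}
```

On the GPU, each inner loop iteration maps to one thread, so each level of the tree is one kernel launch.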
+#### Work-Efficient Scan Optimization (Part 5 Extra Credit)
+We can further optimize the work-efficient scan by reducing the amount of work performed. Most threads are not
+needed, because only a subset of the nodes has to be updated during each iteration, so we launch only as many
+threads as there are nodes to update.
+#### Thrust Scan
+The Thrust implementation simply calls `thrust::exclusive_scan` from the Thrust API.
+#### Radix Sort (Extra Credit 1)
+The parallel radix sort uses the work-efficient exclusive scan as part of its implementation. We iterate through $k$
+bits, and each pass's exclusive scan takes $O(\log n)$ steps; generating the bit array and scattering each take a
+constant number of parallel steps. The total step complexity is therefore $O(k \cdot \log n)$.
+
+### Performance Analysis
+We compare the performance of the different scan algorithms across input sizes. The graph below is in log scale.
+![Performance Graph](img/Figure_2.png)
+From the graph, the Thrust scan performs best among all the scan algorithms, followed by the work-efficient scan.
+When the input size is small, the CPU implementation is the fastest, but it quickly falls behind as the input size
+increases. The naive scan also becomes slow once the size gets very large.
+
+#### Performance Bottlenecks
+Our timers for all the algorithms exclude memory allocation and the final copy of the result array back to the host.
+The measured time should therefore mainly reflect the computation time plus the memory accesses made during the
+computation.
+
+The CPU implementation is fairly efficient and can hardly be improved further, since it is already a linear-time
+algorithm. The runtime results also show that it is very fast on small inputs, since, unlike its GPU counterparts,
+it incurs little launch and transfer overhead.
+
+The naive algorithm generally performs worse than the work-efficient algorithm: it performs more additions than even
+the CPU implementation, and many of its threads sit idle rather than doing useful work. Since our implementation uses
+global memory, memory access latency could also slow it down; a shared-memory version would likely perform better.
+
+The work-efficient algorithm improves on the naive one by organizing the computation as a balanced binary tree.
+The number of additions across the up-sweep and the down-sweep becomes $O(n)$, making the algorithm more efficient.
+It does require more memory accesses, since more steps are involved and all data lives in global memory, so moving
+to a shared-memory model should greatly improve its performance. In part 5, we implemented an optimization that
+reduces the number of threads launched and replaces the modulo operation with a value comparison, which should make
+the algorithm more efficient.
+
+The Thrust implementation is the most efficient when the input size is very large. It presumably uses shared memory
+and carefully optimizes its use of warps and registers, making it very efficient on large-scale data.
+
+![Thrust Analysis](img/Nsight.png)
+
+From the Nsight Systems timeline, we can see that the Thrust implementation sustains a very high SM warp occupancy,
+indicating that it uses GPU resources efficiently. There are also some DRAM read and write operations, which likely
+correspond to memory accesses during the up-sweep and down-sweep phases. Overall, the Thrust implementation looks
+very efficient and well optimized.
+
+
+### Output
+This output is generated with $2^{20}$ elements and a block size of $256$.
+``` +**************** +** SCAN TESTS ** +**************** + [ 28 28 11 44 31 11 4 25 23 43 12 12 28 ... 48 0 ] +==== cpu scan, power-of-two ==== + elapsed time: 1.5931ms (std::chrono Measured) + [ 0 28 56 67 111 142 153 157 182 205 248 260 272 ... 25666831 25666879 ] + passed +==== cpu scan, non-power-of-two ==== + elapsed time: 1.4801ms (std::chrono Measured) + [ 0 28 56 67 111 142 153 157 182 205 248 260 272 ... 25666796 25666799 ] + passed +==== naive scan, power-of-two ==== + elapsed time: 1.85376ms (CUDA Measured) + [ 0 28 56 67 111 142 153 157 182 205 248 260 272 ... 25666831 25666879 ] + passed +==== naive scan, non-power-of-two ==== + elapsed time: 0.916128ms (CUDA Measured) + [ 0 28 56 67 111 142 153 157 182 205 248 260 272 ... 25666796 25666799 ] + passed +==== work-efficient scan, power-of-two ==== + elapsed time: 1.6073ms (CUDA Measured) + [ 0 28 56 67 111 142 153 157 182 205 248 260 272 ... 25666831 25666879 ] + passed +==== work-efficient scan, non-power-of-two ==== + elapsed time: 1.02006ms (CUDA Measured) + [ 0 28 56 67 111 142 153 157 182 205 248 260 272 ... 25666796 25666799 ] + passed +==== thrust scan, power-of-two ==== + elapsed time: 0.83968ms (CUDA Measured) + [ 0 28 56 67 111 142 153 157 182 205 248 260 272 ... 25666831 25666879 ] + passed +==== thrust scan, non-power-of-two ==== + elapsed time: 0.6672ms (CUDA Measured) + [ 0 28 56 67 111 142 153 157 182 205 248 260 272 ... 25666796 25666799 ] + passed + +***************************** +** STREAM COMPACTION TESTS ** +***************************** + [ 0 2 1 2 3 2 3 2 0 3 2 2 1 ... 2 0 ] +==== cpu compact without scan, power-of-two ==== + elapsed time: 2.2588ms (std::chrono Measured) + [ 2 1 2 3 2 3 2 3 2 2 1 1 2 ... 1 2 ] + passed +==== cpu compact without scan, non-power-of-two ==== + elapsed time: 2.1694ms (std::chrono Measured) + [ 2 1 2 3 2 3 2 3 2 2 1 1 2 ... 3 3 ] + passed +==== cpu compact with scan ==== + elapsed time: 5.4433ms (std::chrono Measured) + [ 2 1 2 3 2 3 2 3 2 2 1 1 2 ... 
1 2 ] + passed +==== work-efficient compact, power-of-two ==== + elapsed time: 1.83184ms (CUDA Measured) + [ 2 1 2 3 2 3 2 3 2 2 1 1 2 ... 1 2 ] + passed +==== work-efficient compact, non-power-of-two ==== + elapsed time: 1.1128ms (CUDA Measured) + [ 2 1 2 3 2 3 2 3 2 2 1 1 2 ... 3 3 ] + passed + +********************** +** RADIX SORT TESTS ** +********************** + [ 32 26 157 158 195 138 167 198 116 95 114 106 149 ... 30 0 ] +==== radix sort ==== + elapsed time: 17.5402ms (CUDA Measured) + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 199 199 ] +==== thrust sort ==== + elapsed time: 1.57117ms (CUDA Measured) + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 199 199 ] + passed +``` \ No newline at end of file diff --git a/img/Figure_1.png b/img/Figure_1.png new file mode 100644 index 0000000..42c4bb8 Binary files /dev/null and b/img/Figure_1.png differ diff --git a/img/Figure_2.png b/img/Figure_2.png new file mode 100644 index 0000000..2686593 Binary files /dev/null and b/img/Figure_2.png differ diff --git a/img/Nsight.png b/img/Nsight.png new file mode 100644 index 0000000..d269e81 Binary files /dev/null and b/img/Nsight.png differ diff --git a/src/main.cpp b/src/main.cpp index 896ac2b..3418c1f 100644 --- a/src/main.cpp +++ b/src/main.cpp @@ -11,14 +11,17 @@ #include #include #include +#include #include "testing_helpers.hpp" -const int SIZE = 1 << 8; // feel free to change the size of array +const int SIZE = 1 << 20; // feel free to change the size of array const int NPOT = SIZE - 3; // Non-Power-Of-Two int *a = new int[SIZE]; int *b = new int[SIZE]; int *c = new int[SIZE]; +int *d = new int[SIZE]; + int main(int argc, char* argv[]) { // Scan tests @@ -39,6 +42,7 @@ int main(int argc, char* argv[]) { StreamCompaction::CPU::scan(SIZE, b, a); printElapsedTime(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation(), "(std::chrono Measured)"); printArray(SIZE, b, true); + printCmpResult(SIZE, b, b); zeroArray(SIZE, c); printDesc("cpu scan, non-power-of-two"); @@ -51,7 +55,7 @@ 
int main(int argc, char* argv[]) { printDesc("naive scan, power-of-two"); StreamCompaction::Naive::scan(SIZE, c, a); printElapsedTime(StreamCompaction::Naive::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); - //printArray(SIZE, c, true); + printArray(SIZE, c, true); printCmpResult(SIZE, b, c); /* For bug-finding only: Array of 1s to help find bugs in stream compaction or scan @@ -64,37 +68,50 @@ int main(int argc, char* argv[]) { printDesc("naive scan, non-power-of-two"); StreamCompaction::Naive::scan(NPOT, c, a); printElapsedTime(StreamCompaction::Naive::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); - //printArray(SIZE, c, true); + printArray(NPOT, c, true); printCmpResult(NPOT, b, c); zeroArray(SIZE, c); printDesc("work-efficient scan, power-of-two"); StreamCompaction::Efficient::scan(SIZE, c, a); printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); - //printArray(SIZE, c, true); + printArray(SIZE, c, true); printCmpResult(SIZE, b, c); zeroArray(SIZE, c); printDesc("work-efficient scan, non-power-of-two"); StreamCompaction::Efficient::scan(NPOT, c, a); printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); - //printArray(NPOT, c, true); + printArray(NPOT, c, true); printCmpResult(NPOT, b, c); zeroArray(SIZE, c); printDesc("thrust scan, power-of-two"); StreamCompaction::Thrust::scan(SIZE, c, a); printElapsedTime(StreamCompaction::Thrust::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); - //printArray(SIZE, c, true); + printArray(SIZE, c, true); printCmpResult(SIZE, b, c); zeroArray(SIZE, c); printDesc("thrust scan, non-power-of-two"); StreamCompaction::Thrust::scan(NPOT, c, a); printElapsedTime(StreamCompaction::Thrust::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); - //printArray(NPOT, c, true); + printArray(NPOT, c, true); printCmpResult(NPOT, b, c); + // 
compare results between thrust array and cpu array +// zeroArray(SIZE, b); +// printDesc("compare thrust array and cpu array - CPU"); +// StreamCompaction::CPU::scan(NPOT, b, a); +// printElapsedTime(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation(), "(std::chrono Measured)"); +// printCmpResult(NPOT, b, b); +// +// zeroArray(SIZE, d); +// printDesc("compare thrust array and cpu array - Thrust"); +// StreamCompaction::Thrust::scan(NPOT, d, a); +// printElapsedTime(StreamCompaction::Thrust::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); +// printCmpResult(NPOT, d, b); + printf("\n"); printf("*****************************\n"); printf("** STREAM COMPACTION TESTS **\n"); @@ -137,18 +154,43 @@ int main(int argc, char* argv[]) { printDesc("work-efficient compact, power-of-two"); count = StreamCompaction::Efficient::compact(SIZE, c, a); printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); - //printArray(count, c, true); + printArray(count, c, true); printCmpLenResult(count, expectedCount, b, c); zeroArray(SIZE, c); printDesc("work-efficient compact, non-power-of-two"); count = StreamCompaction::Efficient::compact(NPOT, c, a); printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); - //printArray(count, c, true); + printArray(count, c, true); printCmpLenResult(count, expectedNPOT, b, c); + printf("\n"); + printf("**********************\n"); + printf("** RADIX SORT TESTS **\n"); + printf("**********************\n"); + + genArray(SIZE - 1, a, 200); // Leave a 0 at the end to test that edge case + a[SIZE - 1] = 0; + printArray(SIZE, a, true); + + int maxBits = StreamCompaction::Radix::getMaxBits(SIZE, a); + + zeroArray(SIZE, b); + printDesc("radix sort"); + StreamCompaction::Radix::sort(SIZE, b, a, maxBits); + printElapsedTime(StreamCompaction::Radix::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA 
Measured)"); + printArray(SIZE, b, true); + + zeroArray(SIZE, c); + printDesc("thrust sort"); + StreamCompaction::Thrust::sort(SIZE, c, a); + printElapsedTime(StreamCompaction::Thrust::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); + printArray(SIZE, c, true); + printCmpResult(SIZE, b, c); + system("pause"); // stop Win32 console from closing on exit delete[] a; delete[] b; delete[] c; + delete[] d; } diff --git a/stream_compaction/CMakeLists.txt b/stream_compaction/CMakeLists.txt index 19511ca..d5529e7 100644 --- a/stream_compaction/CMakeLists.txt +++ b/stream_compaction/CMakeLists.txt @@ -4,6 +4,7 @@ set(headers "naive.h" "efficient.h" "thrust.h" + "radix.h" ) set(sources @@ -12,7 +13,8 @@ set(sources "naive.cu" "efficient.cu" "thrust.cu" - ) + "radix.cu" +) list(SORT headers) list(SORT sources) diff --git a/stream_compaction/common.cu b/stream_compaction/common.cu index 2ed6d63..9b9f14a 100644 --- a/stream_compaction/common.cu +++ b/stream_compaction/common.cu @@ -23,7 +23,12 @@ namespace StreamCompaction { * which map to 0 will be removed, and elements which map to 1 will be kept. */ __global__ void kernMapToBoolean(int n, int *bools, const int *idata) { - // TODO + int index = threadIdx.x + (blockIdx.x * blockDim.x); + if (index >= n) { + return; + } + + bools[index] = idata[index] == 0 ? 
0 : 1; } /** @@ -32,7 +37,14 @@ namespace StreamCompaction { */ __global__ void kernScatter(int n, int *odata, const int *idata, const int *bools, const int *indices) { - // TODO + int index = threadIdx.x + (blockIdx.x * blockDim.x); + if (index >= n) { + return; + } + + if (bools[index] == 1) { + odata[indices[index]] = idata[index]; + } } } diff --git a/stream_compaction/cpu.cu b/stream_compaction/cpu.cu index 719fa11..941bca7 100644 --- a/stream_compaction/cpu.cu +++ b/stream_compaction/cpu.cu @@ -19,7 +19,12 @@ namespace StreamCompaction { */ void scan(int n, int *odata, const int *idata) { timer().startCpuTimer(); - // TODO + if (n > 0) { + odata[0] = 0; + for (int i = 1; i < n; i++) { + odata[i] = odata[i - 1] + idata[i - 1]; + } + } timer().endCpuTimer(); } @@ -30,9 +35,15 @@ namespace StreamCompaction { */ int compactWithoutScan(int n, int *odata, const int *idata) { timer().startCpuTimer(); - // TODO + int pos = 0; + for (int i = 0; i < n; i++) { + if (idata[i] != 0) { + odata[pos] = idata[i]; + pos++; + } + } timer().endCpuTimer(); - return -1; + return pos; } /** @@ -41,10 +52,40 @@ namespace StreamCompaction { * @returns the number of elements remaining after compaction. */ int compactWithScan(int n, int *odata, const int *idata) { + // construct the temporary array + int *temp = new int[n]; + int *scanResult = new int[n]; + timer().startCpuTimer(); - // TODO + // map + for (int i = 0; i < n; i++) { + temp[i] = idata[i] == 0 ? 
0 : 1;
+            }
+
+            // scan
+            if (n > 0) {
+                scanResult[0] = 0;
+                for (int i = 1; i < n; i++) {
+                    scanResult[i] = scanResult[i - 1] + temp[i - 1];
+                }
+            }
+
+            // scatter (count accumulates the number of kept elements)
+            int count = 0;
+            for (int i = 0; i < n; i++) {
+                if (temp[i] == 1) {
+                    odata[scanResult[i]] = idata[i];
+                    count++;
+                }
+            }
             timer().endCpuTimer();
-            return -1;
+
+            // free memory
+            delete[] temp;
+            delete[] scanResult;
+
+            return count;
         }
     }
 }
diff --git a/stream_compaction/efficient.cu b/stream_compaction/efficient.cu
index 2db346e..953f664 100644
--- a/stream_compaction/efficient.cu
+++ b/stream_compaction/efficient.cu
@@ -3,6 +3,8 @@
 #include "common.h"
 #include "efficient.h"
 
+#define blockSize 256
+
 namespace StreamCompaction {
     namespace Efficient {
         using StreamCompaction::Common::PerformanceTimer;
@@ -12,13 +14,65 @@ namespace StreamCompaction {
             return timer;
         }
 
+        __global__ void kernUpSweep(int n, int d, int* data) {
+            int index = threadIdx.x + (blockIdx.x * blockDim.x);
+            int offset = 1 << (d + 1); // 2^(d+1)
+            int pos = index * offset;
+
+            if (pos >= n) {
+                return;
+            }
+
+            data[pos + offset - 1] += data[pos + (offset >> 1) - 1];
+        }
+
+        __global__ void kernDownSweep(int n, int d, int* data) {
+            int index = threadIdx.x + (blockIdx.x * blockDim.x);
+            int offset = 1 << (d + 1); // 2^(d+1)
+            int pos = index * offset;
+
+            if (pos >= n) {
+                return;
+            }
+
+            int t = data[pos + (offset >> 1) - 1];
+            data[pos + (offset >> 1) - 1] = data[pos + offset - 1];
+            data[pos + offset - 1] += t;
+        }
+
         /**
          * Performs prefix-sum (aka scan) on idata, storing the result into odata.
         */
        void scan(int n, int *odata, const int *idata) {
+            int enlargedSize = 1 << ilog2ceil(n); // enlarge the size to the next power of 2
+            int* dev_idata;
+
+            cudaMalloc((void**)&dev_idata, enlargedSize * sizeof(int));
+            checkCUDAError("cudaMalloc dev_idata failed!");
+
+            // copy the input to GPU (size n data) and zero the padded tail,
+            // since cudaMalloc does not initialize memory
+            cudaMemcpy(dev_idata, idata, n * sizeof(int), cudaMemcpyHostToDevice);
+            cudaMemset(dev_idata + n, 0, (enlargedSize - n) * sizeof(int));
+
             timer().startGpuTimer();
-            // TODO
+            // up-sweep
+            for (int d = 0; d <= ilog2ceil(enlargedSize) - 1; d++) {
+                int fullBlocksPerGrid = (enlargedSize / (1 << (d + 1)) + blockSize - 1) / blockSize;
+                kernUpSweep<<<fullBlocksPerGrid, blockSize>>>(enlargedSize, d, dev_idata);
+            }
+
+            // down-sweep
+            cudaMemset(dev_idata + enlargedSize - 1, 0, sizeof(int));
+            for (int d = ilog2ceil(enlargedSize) - 1; d >= 0; d--) {
+                int fullBlocksPerGrid = (enlargedSize / (1 << (d + 1)) + blockSize - 1) / blockSize;
+                kernDownSweep<<<fullBlocksPerGrid, blockSize>>>(enlargedSize, d, dev_idata);
+            }
             timer().endGpuTimer();
+
+            // copy the result to odata (size n data)
+            cudaMemcpy(odata, dev_idata, n * sizeof(int), cudaMemcpyDeviceToHost);
+
+            // free memory
+            cudaFree(dev_idata);
         }
 
         /**
@@ -31,10 +85,66 @@ namespace StreamCompaction {
          * @returns The number of elements remaining after compaction.
         */
        int compact(int n, int *odata, const int *idata) {
+            int* dev_idata;
+            int* dev_odata;
+            int* dev_temp;
+            int* dev_scan;
+            int fullBlocksPerGrid = (n + blockSize - 1) / blockSize;
+
+            int enlargedSize = 1 << ilog2ceil(n); // enlarge the size to the next power of 2
+
+            cudaMalloc((void**)&dev_idata, n * sizeof(int));
+            checkCUDAError("cudaMalloc dev_idata failed!");
+            cudaMalloc((void**)&dev_odata, n * sizeof(int));
+            checkCUDAError("cudaMalloc dev_odata failed!");
+            cudaMalloc((void**)&dev_temp, enlargedSize * sizeof(int));
+            checkCUDAError("cudaMalloc dev_temp failed!");
+            cudaMalloc((void**)&dev_scan, enlargedSize * sizeof(int));
+            checkCUDAError("cudaMalloc dev_scan failed!");
+
+            // copy the input to GPU
+            cudaMemcpy(dev_idata, idata, n * sizeof(int), cudaMemcpyHostToDevice);
+
             timer().startGpuTimer();
-            // TODO
+            // map
+            StreamCompaction::Common::kernMapToBoolean<<<fullBlocksPerGrid, blockSize>>>(n, dev_temp, dev_idata);
+
+            // scan (implemented again); zero the padded tail, since cudaMalloc
+            // does not initialize memory
+            cudaMemcpy(dev_scan, dev_temp, n * sizeof(int), cudaMemcpyDeviceToDevice);
+            cudaMemset(dev_scan + n, 0, (enlargedSize - n) * sizeof(int));
+            // up-sweep
+            for (int d = 0; d <= ilog2ceil(enlargedSize) - 1; d++) {
+                int fullBlocksPerGridEnlarged = (enlargedSize / (1 << (d + 1)) + blockSize - 1) / blockSize;
+                kernUpSweep<<<fullBlocksPerGridEnlarged, blockSize>>>(enlargedSize, d, dev_scan);
+            }
+
+            // down-sweep
+            cudaMemset(dev_scan + enlargedSize - 1, 0, sizeof(int));
+            for (int d = ilog2ceil(enlargedSize) - 1; d >= 0; d--) {
+                int fullBlocksPerGridEnlarged = (enlargedSize / (1 << (d + 1)) + blockSize - 1) / blockSize;
+                kernDownSweep<<<fullBlocksPerGridEnlarged, blockSize>>>(enlargedSize, d, dev_scan);
+            }
+
+            // scatter
+            StreamCompaction::Common::kernScatter<<<fullBlocksPerGrid, blockSize>>>(n, dev_odata, dev_idata, dev_temp, dev_scan);
             timer().endGpuTimer();
-            return -1;
+
+            // calculate count
+            int lastScanValue = 0;
+            int lastTempValue = 0;
+            cudaMemcpy(&lastScanValue, dev_scan + n - 1, sizeof(int), cudaMemcpyDeviceToHost);
+            cudaMemcpy(&lastTempValue, dev_temp + n - 1, sizeof(int), cudaMemcpyDeviceToHost);
+            int count = lastScanValue + lastTempValue;
+
+            // copy the result to odata
+
cudaMemcpy(odata, dev_odata, count * sizeof(int), cudaMemcpyDeviceToHost); + + // free memory + cudaFree(dev_idata); + cudaFree(dev_odata); + cudaFree(dev_temp); + cudaFree(dev_scan); + + return count; } } } diff --git a/stream_compaction/naive.cu b/stream_compaction/naive.cu index 4308876..8cb8787 100644 --- a/stream_compaction/naive.cu +++ b/stream_compaction/naive.cu @@ -3,6 +3,8 @@ #include "common.h" #include "naive.h" +#define blockSize 256 + namespace StreamCompaction { namespace Naive { using StreamCompaction::Common::PerformanceTimer; @@ -11,15 +13,62 @@ namespace StreamCompaction { static PerformanceTimer timer; return timer; } - // TODO: __global__ + + __global__ void kernNaiveScan(int n, int d, int* odata, const int* idata) { + int index = threadIdx.x + (blockIdx.x * blockDim.x); + if (index >= n) { + return; + } + + int offset = 1 << (d - 1); + if (index >= offset) { + odata[index] = idata[index - offset] + idata[index]; + } + else { + odata[index] = idata[index]; + } + } + + void ChangeToExclusive(int n, int* odata) { + for (int i = n - 1; i > 0; i--) { + odata[i] = odata[i - 1]; + } + odata[0] = 0; + } /** * Performs prefix-sum (aka scan) on idata, storing the result into odata. 
         */
        void scan(int n, int *odata, const int *idata) {
+            int* dev_idata;
+            int* dev_odata;
+            int fullBlocksPerGrid = (n + blockSize - 1) / blockSize;
+
+            cudaMalloc((void**)&dev_idata, n * sizeof(int));
+            checkCUDAError("cudaMalloc dev_idata failed!");
+            cudaMalloc((void**)&dev_odata, n * sizeof(int));
+            checkCUDAError("cudaMalloc dev_odata failed!");
+
+            // copy the input to GPU
+            cudaMemcpy(dev_idata, idata, n * sizeof(int), cudaMemcpyHostToDevice);
+
             timer().startGpuTimer();
-            // TODO
+            // outer loop
+            for (int d = 1; d <= ilog2ceil(n); d++) {
+                // parallel process
+                kernNaiveScan<<<fullBlocksPerGrid, blockSize>>>(n, d, dev_odata, dev_idata);
+                std::swap(dev_odata, dev_idata);
+            }
             timer().endGpuTimer();
+
+            cudaMemcpy(odata, dev_idata, n * sizeof(int), cudaMemcpyDeviceToHost);
+
+            // shift right and add 0 to the beginning to acquire the exclusive scan
+            ChangeToExclusive(n, odata);
+
+            // free memory
+            cudaFree(dev_idata);
+            cudaFree(dev_odata);
         }
     }
 }
diff --git a/stream_compaction/radix.cu b/stream_compaction/radix.cu
new file mode 100644
index 0000000..e0a682b
--- /dev/null
+++ b/stream_compaction/radix.cu
@@ -0,0 +1,139 @@
+#include
+#include
+#include
+#include "radix.h"
+
+#include "common.h"
+#include "efficient.h"
+
+#define blockSize 128
+
+namespace StreamCompaction {
+    namespace Radix {
+        using StreamCompaction::Common::PerformanceTimer;
+        PerformanceTimer& timer()
+        {
+            static PerformanceTimer timer;
+            return timer;
+        }
+
+        __global__ void kernMapToCurrBits(int n, int *odata, const int *idata, int bit) {
+            int index = threadIdx.x + (blockIdx.x * blockDim.x);
+            if (index >= n) {
+                return;
+            }
+
+            // get the bit-th bit of the number, but also flip the bit to produce the e array
+            odata[index] = !((idata[index] >> bit) & 1);
+        }
+
+        __global__ void kernComputeT(int n, int *odata, const int *idata, const int totalFalses) {
+            int index = threadIdx.x + (blockIdx.x * blockDim.x);
+            if (index >= n) {
+                return;
+            }
+
+            odata[index] = index - idata[index] + totalFalses;
+        }
+
+        __global__
void kernComputeD(int n, int *odata, const int *be, const int *t, const int *f) {
+            int index = threadIdx.x + (blockIdx.x * blockDim.x);
+            if (index >= n) {
+                return;
+            }
+
+            // be is stored as the e array. To access b, reverse the condition
+            odata[index] = be[index] ? f[index] : t[index];
+        }
+
+        __global__ void scatter(int n, int *odata, const int *idata, const int *d) {
+            int index = threadIdx.x + (blockIdx.x * blockDim.x);
+            if (index >= n) {
+                return;
+            }
+
+            odata[d[index]] = idata[index];
+        }
+
+        int getMaxBits(int n, const int *idata) {
+            // find the maximum number in the array
+            int maxNum = 0;
+            for (int i = 0; i < n; i++) {
+                maxNum = std::max(maxNum, idata[i]);
+            }
+            // count the number of bits in the maximum number
+            int maxBits = 0;
+            while (maxNum > 0) {
+                maxNum >>= 1;
+                maxBits++;
+            }
+            return maxBits;
+        }
+
+        /**
+         * Performs radix sort on idata, storing the result into odata.
+         * @param n the number of elements in idata
+         * @param odata output data
+         * @param idata input data
+         * @param maxBits the maximum number of bits of a given number in the array
+         */
+        void sort(int n, int *odata, const int *idata, int maxBits) {
+            int* dev_idata;
+            int* dev_odata;
+            int* dev_be;
+            int* dev_f;
+            int* dev_t;
+            int* dev_d;
+            int fullBlocksPerGrid = (n + blockSize - 1) / blockSize;
+
+            cudaMalloc((void**)&dev_idata, n * sizeof(int));
+            checkCUDAError("cudaMalloc dev_idata failed!");
+            cudaMalloc((void**)&dev_odata, n * sizeof(int));
+            checkCUDAError("cudaMalloc dev_odata failed!");
+            cudaMalloc((void**)&dev_be, n * sizeof(int));
+            checkCUDAError("cudaMalloc dev_be failed!");
+            cudaMalloc((void**)&dev_f, n * sizeof(int));
+            checkCUDAError("cudaMalloc dev_f failed!");
+            cudaMalloc((void**)&dev_t, n * sizeof(int));
+            checkCUDAError("cudaMalloc dev_t failed!");
+            cudaMalloc((void**)&dev_d, n * sizeof(int));
+            checkCUDAError("cudaMalloc dev_d failed!");
+
+            // copy the input to GPU (size n data)
+            cudaMemcpy(dev_idata, idata, n * sizeof(int),
cudaMemcpyHostToDevice);
+
+            timer().startGpuTimer();
+            for (int i = 0; i < maxBits; i++) {
+                // b & e array (b can be acquired by flipping back)
+                kernMapToCurrBits<<<fullBlocksPerGrid, blockSize>>>(n, dev_be, dev_idata, i);
+                // exclusive scan f array
+                StreamCompaction::Efficient::scan(n, dev_f, dev_be);
+                // calculate totalFalses
+                int lastFVal = 0;
+                int lastEVal = 0;
+                cudaMemcpy(&lastEVal, dev_be + n - 1, sizeof(int), cudaMemcpyDeviceToHost);
+                cudaMemcpy(&lastFVal, dev_f + n - 1, sizeof(int), cudaMemcpyDeviceToHost);
+                int totalFalses = lastFVal + lastEVal;
+                // t array
+                kernComputeT<<<fullBlocksPerGrid, blockSize>>>(n, dev_t, dev_f, totalFalses);
+                // d array
+                kernComputeD<<<fullBlocksPerGrid, blockSize>>>(n, dev_d, dev_be, dev_t, dev_f);
+                // scatter
+                scatter<<<fullBlocksPerGrid, blockSize>>>(n, dev_odata, dev_idata, dev_d);
+                // swap
+                std::swap(dev_idata, dev_odata);
+            }
+            timer().endGpuTimer();
+
+            // copy the result to odata (size n data)
+            cudaMemcpy(odata, dev_idata, n * sizeof(int), cudaMemcpyDeviceToHost);
+
+            cudaFree(dev_idata);
+            cudaFree(dev_odata);
+            cudaFree(dev_be);
+            cudaFree(dev_f);
+            cudaFree(dev_t);
+            cudaFree(dev_d);
+        }
+    }
+}
\ No newline at end of file
diff --git a/stream_compaction/radix.h b/stream_compaction/radix.h
new file mode 100644
index 0000000..3942817
--- /dev/null
+++ b/stream_compaction/radix.h
@@ -0,0 +1,13 @@
+#pragma once
+
+#include "common.h"
+
+namespace StreamCompaction {
+    namespace Radix {
+        StreamCompaction::Common::PerformanceTimer& timer();
+
+        int getMaxBits(int n, const int *idata);
+
+        void sort(int n, int *odata, const int *idata, int maxBits);
+    }
+}
diff --git a/stream_compaction/thrust.cu b/stream_compaction/thrust.cu
index 1def45e..b74bb69 100644
--- a/stream_compaction/thrust.cu
+++ b/stream_compaction/thrust.cu
@@ -3,6 +3,7 @@
 #include
 #include
 #include
+#include
 #include "common.h"
 #include "thrust.h"
@@ -18,11 +19,21 @@ namespace StreamCompaction {
          * Performs prefix-sum (aka scan) on idata, storing the result into odata.
         */
        void scan(int n, int *odata, const int *idata) {
+            thrust::device_vector<int> dv_in(idata, idata + n);
+            thrust::device_vector<int> dv_out(n);
             timer().startGpuTimer();
-            // TODO use `thrust::exclusive_scan`
-            // example: for device_vectors dv_in and dv_out:
-            // thrust::exclusive_scan(dv_in.begin(), dv_in.end(), dv_out.begin());
+            thrust::exclusive_scan(dv_in.begin(), dv_in.end(), dv_out.begin());
             timer().endGpuTimer();
+            thrust::copy(dv_out.begin(), dv_out.end(), odata);
+        }
+
+        void sort(int n, int *odata, const int *idata) {
+            thrust::device_vector<int> dv_in(idata, idata + n);
+            timer().startGpuTimer();
+            thrust::sort(dv_in.begin(), dv_in.end());
+            timer().endGpuTimer();
+            thrust::copy(dv_in.begin(), dv_in.end(), odata);
         }
     }
 }
diff --git a/stream_compaction/thrust.h b/stream_compaction/thrust.h
index fe98206..e12c10f 100644
--- a/stream_compaction/thrust.h
+++ b/stream_compaction/thrust.h
@@ -7,5 +7,7 @@ namespace StreamCompaction {
         StreamCompaction::Common::PerformanceTimer& timer();
 
         void scan(int n, int *odata, const int *idata);
+
+        void sort(int n, int *odata, const int *idata);
     }
 }