diff --git a/README.md b/README.md
index 0e38ddb..b8d2b0b 100644
--- a/README.md
+++ b/README.md
@@ -3,12 +3,133 @@ CUDA Stream Compaction
 
 **University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 2**
 
-* (TODO) YOUR NAME HERE
-  * (TODO) [LinkedIn](), [personal website](), [twitter](), etc.
-* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab)
+* Shreyas Singh
+  * [LinkedIn](https://linkedin.com/in/shreyassinghiitr), [Personal Website](https://github.com/shreyas3156).
+* Tested on: Windows 10, i7-12700 @ 2.1GHz 32GB, T1000 (CETS Lab)
 
-### (TODO: Your README)
+### About
+This project implements the all-prefix-sums operation on an array of data, commonly known as
+_Scan_, followed by _Stream Compaction_, which creates a new array containing only those input elements that satisfy a given criterion, preserving their order in the process. We use _scan_ interchangeably with _exclusive scan_ throughout the project: each element _k_ of the result is the sum
+of all elements before, but not including, _k_ in the input array.
 
-Include analysis, etc. (Remember, this is public, so don't put
-anything here that you don't want to share with the world.)
+We present three implementations of Scan: a trivial sequential CPU scan, and "Naive" and "Work-Efficient" CUDA implementations of a parallel scan algorithm that leverage the GPU's data parallelism, yielding large performance improvements
+for large inputs.
+
+Our Stream Compaction algorithm removes `0`s from an array of `int`s, using the CPU scan and the Work-Efficient parallel scan in CUDA.
+For a detailed treatment of Scan and further reading, see [GPU Gems 3, Ch-39](https://developer.nvidia.com/gpugems/gpugems3/part-vi-gpu-computing/chapter-39-parallel-prefix-sum-scan-cuda).
+
+
+## Performance Analysis
+
+The algorithms were timed with respect to parameters such as `blockSize` and input length using the custom `timer()` API. Only the algorithms themselves were timed, excluding memory allocations and other setup,
+so the comparison reflects compute alone. Time units are milliseconds (ms).
+
+### Optimal Block Size
+The `blockSize` parameter was roughly optimized for both parallel algorithms at an array length of 2^20. I averaged the times for power-of-two (POT) and non-power-of-two (NPOT) inputs.
+The optimal block size was found to be 512 for the Naive parallel scan and 128 for the Work-Efficient scan.
+![](img/blocksizeopt.png)
+
+
+### Scan Implementations vs. Array Size
+We compare the various Scan implementations (Naive, Work-Efficient, and Thrust) against the CPU scan.
+Thrust is a C++ template library for CUDA, modeled on the Standard Template Library (STL), for writing
+high-performance parallel applications.
+
+Both GPU and CPU timing functions are wrapped in a performance timer class: `std::chrono` provides
+high-precision CPU timing, and CUDA events measure GPU performance.
+
+Power-of-two and non-power-of-two inputs were analyzed separately for better
+comparability of the CPU, GPU, and Thrust implementations.
+
+
+![](img/scan_pot.png)
+![](img/perfpot.png)
+
+Analysis for NPOT arrays follows:
+![](img/scan_npot.png)
+![](img/perfnpot.png)
+
+The CPU implementation is faster for shorter inputs, up to an array size of about
+2^13. Beyond that, the CPU scan is slightly slower than the Naive parallel scan but still faster than the Work-Efficient parallel scan
+up to an array size of 2^20. The Thrust scan is the fastest of the GPU implementations throughout,
+and the fastest overall beyond 2^13.
+
+The *Work-Efficient scan has been optimized for thread usage*: it launches only the
+required number of threads, so no idle threads have to wait on others to terminate. This
+is why, for large arrays, it is faster than every implementation except the Thrust scan.
+
+The major *performance bottleneck* for all the parallel algorithms is global-memory I/O. This could be improved
+by utilising shared memory, whose size can be set dynamically in relation to the block size. The overhead is amortized
+at larger array sizes, where the Work-Efficient and Thrust scans run faster than the CPU scan.
+
+The *Thrust* library's `exclusive_scan()` shows remarkably consistent performance as the input size increases up to
+2^17, after which the runtime increases marginally. A possible explanation is optimal memory management
+by Thrust, which avoids global-memory I/O overheads. The slower performance at larger array sizes could be due to
+memory allocation and copy latency between the host and the device.
+
+
+### Stream Compaction Performance
+
+Here we compare the stream compaction algorithms implemented on the CPU (with and without scan) and with the Work-Efficient algorithm.
+![](img/compact_perf.png)
+
+As one would expect from the discussion above, the CPU algorithms take much less time than their GPU
+counterparts for smaller array sizes (< 2^20). This trend flips at array sizes above 2^20, as the optimized
+thread usage of the Work-Efficient algorithm produces improved performance.
+```
+****************
+** SCAN TESTS **
+****************
+    [ 10 32 23 43 28 33 17 0 35 37 44 0 23 ... 33 0 ]
+==== cpu scan, power-of-two ====
+   elapsed time: 1.7358ms    (std::chrono Measured)
+    [ 0 10 42 65 108 136 169 186 186 221 258 302 302 ... 25672547 25672580 ]
+==== cpu scan, non-power-of-two ====
+   elapsed time: 1.7335ms    (std::chrono Measured)
+    [ 0 10 42 65 108 136 169 186 186 221 258 302 302 ... 25672508 25672513 ]
+    passed
+==== naive scan, power-of-two ====
+   elapsed time: 1.57238ms    (CUDA Measured)
+    passed
+==== naive scan, non-power-of-two ====
+   elapsed time: 1.51142ms    (CUDA Measured)
+    passed
+==== work-efficient scan, power-of-two ====
+   elapsed time: 1.42707ms    (CUDA Measured)
+    passed
+==== work-efficient scan, non-power-of-two ====
+   elapsed time: 1.36602ms    (CUDA Measured)
+    passed
+==== thrust scan, power-of-two ====
+   elapsed time: 0.596ms    (CUDA Measured)
+    passed
+==== thrust scan, non-power-of-two ====
+   elapsed time: 0.176128ms    (CUDA Measured)
+    passed
+
+*****************************
+** STREAM COMPACTION TESTS **
+*****************************
+    [ 3 1 1 2 2 0 1 3 1 1 0 0 0 ... 0 0 ]
+==== cpu compact without scan, power-of-two ====
+   elapsed time: 2.146ms    (std::chrono Measured)
+    [ 3 1 1 2 2 1 3 1 1 3 3 2 1 ... 3 3 ]
+    passed
+==== cpu compact without scan, non-power-of-two ====
+   elapsed time: 2.171ms    (std::chrono Measured)
+    [ 3 1 1 2 2 1 3 1 1 3 3 2 1 ... 1 3 ]
+    passed
+==== cpu compact with scan ====
+   elapsed time: 5.0626ms    (std::chrono Measured)
+    [ 3 1 1 2 2 1 3 1 1 3 3 2 1 ... 3 3 ]
+    passed
+==== work-efficient compact, power-of-two ====
+   elapsed time: 2.25693ms    (CUDA Measured)
+    passed
+==== work-efficient compact, non-power-of-two ====
+   elapsed time: 1.92234ms    (CUDA Measured)
+    passed
+Press any key to continue . . .
+```
diff --git a/img/blocksizeopt.png b/img/blocksizeopt.png
new file mode 100644
index 0000000..e93953e
Binary files /dev/null and b/img/blocksizeopt.png differ
diff --git a/img/compact_perf.png b/img/compact_perf.png
new file mode 100644
index 0000000..2795578
Binary files /dev/null and b/img/compact_perf.png differ
diff --git a/img/perfnpot.png b/img/perfnpot.png
new file mode 100644
index 0000000..e99835a
Binary files /dev/null and b/img/perfnpot.png differ
diff --git a/img/perfpot.png b/img/perfpot.png
new file mode 100644
index 0000000..574e4a7
Binary files /dev/null and b/img/perfpot.png differ
diff --git a/img/scan_npot.png b/img/scan_npot.png
new file mode 100644
index 0000000..4ad67b8
Binary files /dev/null and b/img/scan_npot.png differ
diff --git a/img/scan_pot.png b/img/scan_pot.png
new file mode 100644
index 0000000..f725648
Binary files /dev/null and b/img/scan_pot.png differ
diff --git a/src/main.cpp b/src/main.cpp
index 896ac2b..600c29f 100644
--- a/src/main.cpp
+++ b/src/main.cpp
@@ -13,7 +13,7 @@
 #include
 #include "testing_helpers.hpp"
 
-const int SIZE = 1 << 8; // feel free to change the size of array
+const int SIZE = 1 << 20; // feel free to change the size of array
 const int NPOT = SIZE - 3; // Non-Power-Of-Two
 int *a = new int[SIZE];
 int *b = new int[SIZE];
diff --git a/stream_compaction/common.cu b/stream_compaction/common.cu
index 2ed6d63..de8e11a 100644
--- a/stream_compaction/common.cu
+++ b/stream_compaction/common.cu
@@ -1,6 +1,6 @@
 #include "common.h"
 
-void checkCUDAErrorFn(const char *msg, const char *file, int line) {
+void checkCUDAErrorFn(const char* msg, const char* file, int line) {
     cudaError_t err = cudaGetLastError();
     if (cudaSuccess == err) {
         return;
@@ -22,17 +22,29 @@ namespace StreamCompaction {
         * Maps an array to an array of 0s and 1s for stream compaction. Elements
         * which map to 0 will be removed, and elements which map to 1 will be kept.
         */
-        __global__ void kernMapToBoolean(int n, int *bools, const int *idata) {
+        __global__ void kernMapToBoolean(int n, int* bools, const int* idata) {
             // TODO
+            int idx = blockDim.x * blockIdx.x + threadIdx.x;
+            if (idx >= n) {
+                return;
+            }
+            bools[idx] = (idata[idx]) ? 1 : 0;
         }
 
        /**
         * Performs scatter on an array. That is, for each element in idata,
         * if bools[idx] == 1, it copies idata[idx] to odata[indices[idx]].
         */
-        __global__ void kernScatter(int n, int *odata,
-                const int *idata, const int *bools, const int *indices) {
+        __global__ void kernScatter(int n, int* odata,
+                const int* idata, const int* bools, const int* indices) {
             // TODO
+            int idx = blockDim.x * blockIdx.x + threadIdx.x;
+            if (idx >= n) {
+                return;
+            }
+            if (bools[idx]) {
+                odata[indices[idx]] = idata[idx];
+            }
         }
    }
}
diff --git a/stream_compaction/cpu.cu b/stream_compaction/cpu.cu
index 719fa11..36ae628 100644
--- a/stream_compaction/cpu.cu
+++ b/stream_compaction/cpu.cu
@@ -19,7 +19,9 @@ namespace StreamCompaction {
         */
        void scan(int n, int *odata, const int *idata) {
            timer().startCpuTimer();
-            // TODO
+            odata[0] = 0;
+            for (int i = 1; i < n; i++) {
+                odata[i] = odata[i - 1] + idata[i - 1];
+            }
            timer().endCpuTimer();
        }
 
@@ -30,9 +32,15 @@ namespace StreamCompaction {
         */
        int compactWithoutScan(int n, int *odata, const int *idata) {
            timer().startCpuTimer();
-            // TODO
+            int j = 0;
+            for (int i = 0; i < n; i++) {
+                if (idata[i]) {
+                    odata[j] = idata[i];
+                    j++;
+                }
+            }
            timer().endCpuTimer();
-            return -1;
+            return j;
        }
 
@@ -42,9 +50,29 @@ namespace StreamCompaction {
         */
        int compactWithScan(int n, int *odata, const int *idata) {
            timer().startCpuTimer();
-            // TODO
+            // map the input array to an array of 0s and 1s
+            int* temp = new int[n];
+            for (int i = 0; i < n; i++) {
+                temp[i] = idata[i] == 0 ? 0 : 1;
+            }
+
+            // exclusive scan of the temp array
+            int* scanOutput = new int[n];
+            scanOutput[0] = 0;
+            for (int i = 1; i < n; i++) {
+                scanOutput[i] = scanOutput[i - 1] + temp[i - 1];
+            }
+
+            // the exclusive scan excludes the last element, so add it back in
+            int compactLen = scanOutput[n - 1] + temp[n - 1];
+
+            // scatter the kept elements to their scanned indices
+            for (int i = 0; i < n; i++) {
+                if (temp[i]) {
+                    odata[scanOutput[i]] = idata[i];
+                }
+            }
+            delete[] temp;
+            delete[] scanOutput;
            timer().endCpuTimer();
-            return -1;
+            return compactLen;
        }
    }
}
diff --git a/stream_compaction/efficient.cu b/stream_compaction/efficient.cu
index 2db346e..80f5502 100644
--- a/stream_compaction/efficient.cu
+++ b/stream_compaction/efficient.cu
@@ -12,13 +12,84 @@ namespace StreamCompaction {
            return timer;
        }
 
+        __global__ void kernUpSweep(int n, int* A, int offset) {
+            int idx = blockDim.x * blockIdx.x + threadIdx.x;
+
+            // Each thread handles one group of `offset` elements
+            if (idx >= n / offset) {
+                return;
+            }
+            idx *= offset;
+            A[idx + offset - 1] += A[idx + offset / 2 - 1];
+        }
+
+        __global__ void kernDownSweep(int n, int* A, int offset) {
+            int idx = blockDim.x * blockIdx.x + threadIdx.x;
+            if (idx >= n / offset) {
+                return;
+            }
+            idx *= offset;
+
+            int temp = A[idx + offset / 2 - 1];
+            A[idx + offset / 2 - 1] = A[idx + offset - 1];
+            A[idx + offset - 1] += temp;
+        }
+
        /**
         * Performs prefix-sum (aka scan) on idata, storing the result into odata.
         */
-        void scan(int n, int *odata, const int *idata) {
-            timer().startGpuTimer();
-            // TODO
-            timer().endGpuTimer();
+        void scan(int n, int* odata, const int* idata, bool timeFlag) {
+
+            unsigned int blockSize = 128;
+
+            // pad the array up to the next power of two
+            int padding = 1 << ilog2ceil(n);
+
+            int* dev_odata;
+            size_t arraySize = n * sizeof(int);
+            size_t paddedSize = padding * sizeof(int);
+            cudaMalloc((void**)&dev_odata, paddedSize);
+            checkCUDAError("cudaMalloc dev_odata failed!");
+
+            cudaMemcpy(dev_odata, idata, arraySize, cudaMemcpyHostToDevice);
+            checkCUDAError("cudaMemcpy idata to dev_odata failed!");
+
+            cudaMemset(dev_odata + n, 0, paddedSize - arraySize);
+            checkCUDAError("cudaMemset padding of dev_odata failed!");
+
+            int numThreads = padding;
+
+            if (timeFlag)
+                timer().startGpuTimer();
+            for (int i = 0; i < ilog2ceil(n); i++) {
+                int offset = 1 << (i + 1);
+                numThreads /= 2;
+
+                dim3 fullBlocksPerGrid((numThreads + blockSize - 1) / blockSize);
+                kernUpSweep<<<fullBlocksPerGrid, blockSize>>>(padding, dev_odata, offset);
+                cudaDeviceSynchronize();
+                checkCUDAError("kernUpSweep failed!");
+            }
+
+            // assign 0 to the root of the tree for the Down-Sweep
+            cudaMemset(dev_odata + padding - 1, 0, sizeof(int));
+            cudaDeviceSynchronize();
+            checkCUDAError("cudaMemset to dev_odata failed!");
+
+            for (int i = ilog2ceil(n) - 1; i >= 0; i--) {
+                int offset = 1 << (i + 1);
+                numThreads *= 2;
+                dim3 fullBlocksPerGrid((numThreads + blockSize - 1) / blockSize);
+                kernDownSweep<<<fullBlocksPerGrid, blockSize>>>(padding, dev_odata, offset);
+                checkCUDAError("kernDownSweep failed!");
+            }
+            if (timeFlag)
+                timer().endGpuTimer();
+
+            cudaMemcpy(odata, dev_odata, arraySize, cudaMemcpyDeviceToHost);
+            cudaDeviceSynchronize();
+            checkCUDAError("cudaMemcpy dev_odata to odata failed!");
+
+            cudaFree(dev_odata);
        }
 
        /**
@@ -30,11 +101,68 @@ namespace StreamCompaction {
         * @param idata The array of elements to compact.
         * @returns The number of elements remaining after compaction.
         */
-        int compact(int n, int *odata, const int *idata) {
+        int compact(int n, int* odata, const int* idata) {
+
+            // Create device arrays
+            int* dev_idata;
+            int* dev_odata;
+            int* dev_bools;
+            int* dev_indices;
+            int padding = 1 << ilog2ceil(n);
+            size_t arraySize = n * sizeof(int);
+            size_t paddedSize = padding * sizeof(int);
+
+            cudaMalloc((void**)&dev_idata, arraySize);
+            cudaDeviceSynchronize();
+            checkCUDAError("cudaMalloc dev_idata failed!");
+
+            cudaMalloc((void**)&dev_bools, paddedSize);
+            cudaDeviceSynchronize();
+            checkCUDAError("cudaMalloc dev_bools failed!");
+
+            cudaMalloc((void**)&dev_indices, paddedSize);
+            cudaMalloc((void**)&dev_odata, arraySize);
+            cudaDeviceSynchronize();
+            checkCUDAError("cudaMalloc failed!");
+
+            cudaMemcpy(dev_idata, idata, arraySize, cudaMemcpyHostToDevice);
+            checkCUDAError("cudaMemcpy idata failed!");
+
+            cudaMemset(dev_bools + n, 0, paddedSize - arraySize);
+            checkCUDAError("cudaMemset dev_bools failed!");
+
+            unsigned int blockSize = 128;
+            dim3 fullBlocksPerGrid((n + blockSize - 1) / blockSize);
+
            timer().startGpuTimer();
-            // TODO
+
+            StreamCompaction::Common::kernMapToBoolean<<<fullBlocksPerGrid, blockSize>>>(n, dev_bools, dev_idata);
+
+            scan(n, dev_indices, dev_bools, false);
+            StreamCompaction::Common::kernScatter<<<fullBlocksPerGrid, blockSize>>>(n, dev_odata, dev_idata, dev_bools, dev_indices);
+            cudaDeviceSynchronize();
+            checkCUDAError("kernel calls failed!");
+
            timer().endGpuTimer();
-            return -1;
+
+            cudaMemcpy(odata, dev_odata, arraySize, cudaMemcpyDeviceToHost);
+            checkCUDAError("cudaMemcpy dev_odata to odata failed!");
+
+            // Check whether the last element of idata is kept, via dev_bools.
+            // If it is, its output index is compactLen - 1; otherwise
+            // dev_indices[n - 1] already equals compactLen.
+            int isLastElemValid;
+            cudaMemcpy(&isLastElemValid, dev_bools + n - 1, sizeof(int), cudaMemcpyDeviceToHost);
+
+            int lastElemIdx;
+            cudaMemcpy(&lastElemIdx, dev_indices + n - 1, sizeof(int), cudaMemcpyDeviceToHost);
+
+            int compactLen = (isLastElemValid) ? lastElemIdx + 1 : lastElemIdx;
+
+            cudaFree(dev_bools);
+            cudaFree(dev_indices);
+            cudaFree(dev_idata);
+            cudaFree(dev_odata);
+
+            return compactLen;
        }
    }
}
diff --git a/stream_compaction/efficient.h b/stream_compaction/efficient.h
index 803cb4f..4341042 100644
--- a/stream_compaction/efficient.h
+++ b/stream_compaction/efficient.h
@@ -6,7 +6,7 @@ namespace StreamCompaction {
    namespace Efficient {
        StreamCompaction::Common::PerformanceTimer& timer();
 
-        void scan(int n, int *odata, const int *idata);
+        void scan(int n, int* odata, const int* idata, bool timeFlag = true);
 
        int compact(int n, int *odata, const int *idata);
    }
diff --git a/stream_compaction/naive.cu b/stream_compaction/naive.cu
index 4308876..6163109 100644
--- a/stream_compaction/naive.cu
+++ b/stream_compaction/naive.cu
@@ -1,6 +1,6 @@
 #include <cuda.h>
 #include <cuda_runtime.h>
-#include "common.h"
+#include "common.h"
 #include "naive.h"
 
@@ -11,15 +11,61 @@ namespace StreamCompaction {
            static PerformanceTimer timer;
            return timer;
        }
-        // TODO: __global__
+
+        __global__ void kernNaiveScan(int n, int *B, int *A, int d) {
+            int idx = blockDim.x * blockIdx.x + threadIdx.x;
+
+            if (idx >= n) {
+                return;
+            }
+            int offset = 1 << (d - 1);
+
+            if (idx >= offset) {
+                B[idx] = A[idx - offset] + A[idx];
+            }
+            else {
+                B[idx] = A[idx];
+            }
+        }
 
        /**
         * Performs prefix-sum (aka scan) on idata, storing the result into odata.
         */
        void scan(int n, int *odata, const int *idata) {
+
+            unsigned int blockSize = 128;
+            dim3 fullBlocksPerGrid((n + blockSize - 1) / blockSize);
+
+            // define two device arrays for ping-ponging between levels
+            int* A;
+            int* B;
+
+            size_t arraySize = n * sizeof(int);
+            cudaMalloc((void**)&A, arraySize);
+            checkCUDAError("cudaMalloc A failed!");
+            cudaMalloc((void**)&B, arraySize);
+            checkCUDAError("cudaMalloc B failed!");
+
+            cudaMemcpy(A, idata, arraySize, cudaMemcpyHostToDevice);
+            checkCUDAError("cudaMemcpy idata to A failed!");
+
            timer().startGpuTimer();
-            // TODO
+
+            for (int i = 1; i <= ilog2ceil(n); i++) {
+                kernNaiveScan<<<fullBlocksPerGrid, blockSize>>>(n, B, A, i);
+
+                // swap the two arrays
+                std::swap(B, A);
+            }
+
            timer().endGpuTimer();
+
+            // shift the inclusive result right by one to make it exclusive;
+            // copy only n - 1 elements so we stay within odata's bounds
+            odata[0] = 0;
+            cudaMemcpy(odata + 1, A, (n - 1) * sizeof(int), cudaMemcpyDeviceToHost);
+            checkCUDAError("cudaMemcpy odata failed!");
+
+            cudaFree(A);
+            cudaFree(B);
        }
    }
}
diff --git a/stream_compaction/thrust.cu b/stream_compaction/thrust.cu
index 1def45e..58f5d8c 100644
--- a/stream_compaction/thrust.cu
+++ b/stream_compaction/thrust.cu
@@ -18,11 +18,22 @@ namespace StreamCompaction {
         * Performs prefix-sum (aka scan) on idata, storing the result into odata.
         */
        void scan(int n, int *odata, const int *idata) {
+
+            thrust::host_vector<int> h_idata(idata, idata + n);
+
+            thrust::device_vector<int> dv_in = h_idata;
+            thrust::device_vector<int> dv_out(n);
+
+            cudaDeviceSynchronize();
            timer().startGpuTimer();
            // TODO use `thrust::exclusive_scan`
            // example: for device_vectors dv_in and dv_out:
            // thrust::exclusive_scan(dv_in.begin(), dv_in.end(), dv_out.begin());
+            thrust::exclusive_scan(dv_in.begin(), dv_in.end(), dv_out.begin());
            timer().endGpuTimer();
+
+            thrust::copy(dv_out.begin(), dv_out.end(), odata);
        }
    }
}