diff --git a/README.md b/README.md
index 0e38ddb..5daaa7a 100644
--- a/README.md
+++ b/README.md
@@ -1,14 +1,89 @@
 CUDA Stream Compaction
 ======================
+**University of Pennsylvania, CIS 5650: GPU Programming and Architecture,
+Project 2 - Stream Compaction**
-**University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 2**
+* Joshua Smith
+  * [LinkedIn](https://www.linkedin.com/in/joshua-smith-32b165158/)
+* Tested on: Ubuntu 20.04, Ryzen 9 3900x @ 4.6GHz, 24GB RAM, RTX 4090 (Personal)
+----
+### README
+**Project Description**
-* (TODO) YOUR NAME HERE
-  * (TODO) [LinkedIn](), [personal website](), [twitter](), etc.
-* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab)
+In this project, we implement efficient scan (prefix sum) and stream compaction (conditional array inclusion).
-### (TODO: Your README)
+We first solve both problems with a simple CPU algorithm. This serves as our reference implementation for correctness and runtime. Additionally, stream compaction scatter indices can be determined by taking the prefix sum of a binary array that marks whether each element of the original array should be kept. Because of this, our prefix sum implementation can be used for both CPU and GPU compaction; we implement CPU compaction both with and without prefix sum. We also implement a 'work-efficient' scan that performs fewer total additions than our naive implementation. Naive and work-efficient scan correspond to algorithms 39.2.1 and 39.2.2 respectively in [GPU Gems 3, Chapter 39](https://developer.nvidia.com/gpugems/gpugems3/part-vi-gpu-computing/chapter-39-parallel-prefix-sum-scan-cuda). Finally, we implement scan using thrust::exclusive_scan to compare against a professionally optimized solution.
-Include analysis, etc. (Remember, this is public, so don't put
-anything here that you don't want to share with the world.)
+**Items Completed**:
+ * CPU Scan
+ * CPU Stream Compaction (with/without scan)
+ * Naive GPU Scan
+ * Work Efficient GPU Scan
+ * Thrust Scan
+ * Performance Analysis
+----
+
+**Performance Analysis**
+
+In order to fairly compare the algorithms considered, we must first tune each algorithm's hyperparameters. Specifically, for naive and work-efficient GPU scan, we must set the block size. Both implementations read heavily from global memory: too small a block size may not hide memory latency effectively, while too large a block size may exhaust registers, leaving fewer active blocks and reducing GPU occupancy. Below, we test the runtime of scan for an array length of 2^22.
+
+![Naive and work-efficient scan runtime vs. block size](<img/Naive and Work Efficient Scan (4.2M length array).png>)
+
+Here we can see that a block size of 512 performs well for both naive and work-efficient scan. This block size is used for all further benchmarking.
+
+**Comparing Scan Implementations**
+* CPU scan
+  * This method performs the prefix sum sequentially by accumulating a running sum and writing it to a separate output array while iterating through the list. It is simple to implement and can be surprisingly fast for smaller arrays, since it avoids the overhead of specialized parallel algorithms and CUDA kernel invocations.
+
+* Naive GPU scan
+  * This method parallelizes the prefix sum by computing sums of separate subsets of the array in parallel, so that multiple threads can perform additions simultaneously. It requires only log2(n) iterations to compute the full prefix sum. However, it performs many more total additions than the simple CPU implementation, so while it is parallelized, it does significantly more total work. Parallel depth: log2(n). Illustrated below:
+
+  ![alt text](img/naiveIMG.png)
+
+* Work-Efficient GPU scan
+  * This method improves upon the naive GPU method by doing less total work; the number of total additions is similar to that of the sequential implementation.
It performs an up-sweep and a down-sweep, each taking log2(n) iterations. Parallel depth: 2 * log2(n).
+
+  UpSweep:
+
+  ![alt text](img/upsweep.png)
+
+  DownSweep:
+
+  ![alt text](img/downsweep.png)
+
+* Thrust Scan
+  * This method is simply a wrapper around thrust::exclusive_scan. First, the CPU arrays are moved onto the GPU as device vectors. The GPU timer is then used to time only the scan call on the GPU. Finally, the output scan array is moved back to the CPU. Thrust likely includes advanced optimizations that improve this method's runtime, but the setup and transfer overhead may make the approach slower for small array sizes.
+
+**Runtime Analysis**
+
+![CPU, Naive GPU, Work Eff GPU and Thrust runtime vs. array size](<img/CPU, Naive GPU, Work Eff GPU and Thrust Runtime vs Arr Size.png>)
+
+Runtimes below are in milliseconds.
+
+| Array Size (log2) | CPU | Naive GPU | Work Eff GPU | Thrust |
+| --- | --- | --- | --- | --- |
+| 8 | 0.00042 | 0.155 | 0.178 | 0.064 |
+| 12 | 0.00312 | 0.184 | 0.228 | 0.058 |
+| 16 | 0.0309 | 0.200 | 0.282 | 0.187 |
+| 18 | 0.127 | 0.190 | 0.277 | 0.195 |
+| 20 | 0.677 | 0.368 | 0.396 | 0.198 |
+| 22 | 2.70 | 0.552 | 0.823 | 0.285 |
+| 24 | 12.0 | 4.02 | 2.31 | 0.316 |
+| 26 | 45.4 | 17.6 | 9.85 | 0.646 |
+
+* CPU Scan:
+  * Unsurprisingly, CPU scan scales roughly linearly with input size, as expected for a sequential algorithm. We also see that CPU scan outperforms all GPU methods for array sizes <= 2^18. This is unsurprising, as it has neither GPU invocation overhead nor the extra work done by the modified parallel algorithms. For larger array sizes, however, all GPU methods eventually outperform the CPU method.
+
+* Naive & Work-Efficient GPU Scan:
+  * Naive and work-efficient GPU scan performed relatively similarly when compared against the CPU and Thrust methods. This suggests that the improvement made by work-efficient scan over naive scan is small relative to the optimizations made in Thrust.
+  * Both methods make many global memory reads without using shared memory, which is likely a bottleneck affecting both methods' performance relative to Thrust.
+  * Both methods launch many more threads than are necessary for a given sweep layer. This leaves many threads in a warp idle, waiting for the others to finish their conditional work, which lowers GPU occupancy and leaves resources underutilized. Improving this aspect of the algorithm (the number of threads launched and their indexing) would likely help close the gap between these custom implementations and Thrust.
+  * Given that each layer's delta required a modulo by a power of 2, I tried a faster mod operation using a bitwise AND. This yielded negligible improvement, suggesting that either the kernel is not compute bound or the compiler already performs this optimization.
+  * Both were consistently outperformed by Thrust, which likely applies further optimizations to the scan algorithm.
+
+* Thrust:
+  * Thrust was consistently the most efficient algorithm beyond a 2^18 array size. While there is likely some setup overhead, it becomes insignificant once array sizes large enough to fully utilize the GPU are tested.
+  * Thrust likely uses shared memory and launches neighboring threads with similar terminal depths in the scan tree. This would maximize the number of non-waiting threads in a given warp and also reduce memory read and write times.
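The up-sweep/down-sweep structure analyzed above can be modeled on the CPU (a sketch of the Blelloch scan on a power-of-two array, not the CUDA kernels themselves; each inner loop corresponds to the work of one kernel launch over a layer):

```cpp
#include <cassert>
#include <vector>

// In-place Blelloch exclusive scan on a power-of-two-sized array.
// Each iteration of the outer loops mirrors one "layer" kernel launch.
void blellochScan(std::vector<int>& data) {
    const int n = static_cast<int>(data.size());  // assumed power of two
    // Up-sweep: build partial sums in a balanced binary tree;
    // only indices k with (k + 1) % (2 * delta) == 0 do work.
    for (int delta = 1; delta < n; delta <<= 1)
        for (int k = 2 * delta - 1; k < n; k += 2 * delta)
            data[k] += data[k - delta];
    // Zero the root: the identity element for an exclusive scan.
    data[n - 1] = 0;
    // Down-sweep: push the partial sums back down the tree.
    for (int delta = n / 2; delta >= 1; delta >>= 1)
        for (int k = 2 * delta - 1; k < n; k += 2 * delta) {
            int tmp = data[k - delta];
            data[k - delta] = data[k];
            data[k] += tmp;
        }
}
```

Note how only n/(2*delta) indices are active per layer; this is exactly the thread-utilization issue discussed above, since the GPU version still launches a thread per element and lets most of them exit early.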
+
+Term Output with Arr Size (2^22)
+
+![alt text](img/termScreenshot.png)
diff --git a/img/CPU, Naive GPU, Work Eff GPU and Thrust Runtime vs Arr Size.png b/img/CPU, Naive GPU, Work Eff GPU and Thrust Runtime vs Arr Size.png
new file mode 100644
index 0000000..7de76b6
Binary files /dev/null and b/img/CPU, Naive GPU, Work Eff GPU and Thrust Runtime vs Arr Size.png differ
diff --git a/img/Naive and Work Efficient Scan (4.2M length array).png b/img/Naive and Work Efficient Scan (4.2M length array).png
new file mode 100644
index 0000000..d47d4a8
Binary files /dev/null and b/img/Naive and Work Efficient Scan (4.2M length array).png differ
diff --git a/img/downsweep.png b/img/downsweep.png
new file mode 100644
index 0000000..806839c
Binary files /dev/null and b/img/downsweep.png differ
diff --git a/img/naiveIMG.png b/img/naiveIMG.png
new file mode 100644
index 0000000..7cb8366
Binary files /dev/null and b/img/naiveIMG.png differ
diff --git a/img/scancpu.png b/img/scancpu.png
new file mode 100644
index 0000000..70d33ec
Binary files /dev/null and b/img/scancpu.png differ
diff --git a/img/termScreenshot.png b/img/termScreenshot.png
new file mode 100644
index 0000000..90714ec
Binary files /dev/null and b/img/termScreenshot.png differ
diff --git a/img/upsweep.png b/img/upsweep.png
new file mode 100644
index 0000000..f1549b5
Binary files /dev/null and b/img/upsweep.png differ
diff --git a/src/main.cpp b/src/main.cpp
index 896ac2b..a70a598 100644
--- a/src/main.cpp
+++ b/src/main.cpp
@@ -13,7 +13,7 @@
 #include
 #include "testing_helpers.hpp"
-const int SIZE = 1 << 8; // feel free to change the size of array
+const int SIZE = 1 << 22; // originally 1 << 8; tested up to 1 << 26 -- feel free to change the size of the array
 const int NPOT = SIZE - 3; // Non-Power-Of-Two
 int *a = new int[SIZE];
 int *b = new int[SIZE];
@@ -147,7 +147,7 @@ int main(int argc, char* argv[]) {
     //printArray(count, c, true);
     printCmpLenResult(count, expectedNPOT, b, c);
-    system("pause"); // stop Win32 console from closing on exit
+    //system("pause"); // stop Win32 console from closing on exit
     delete[] a;
     delete[] b;
     delete[] c;
diff --git a/stream_compaction/common.cu b/stream_compaction/common.cu
index 2ed6d63..5011356 100644
--- a/stream_compaction/common.cu
+++ b/stream_compaction/common.cu
@@ -23,7 +23,12 @@ namespace StreamCompaction {
         * which map to 0 will be removed, and elements which map to 1 will be kept.
         */
        __global__ void kernMapToBoolean(int n, int *bools, const int *idata) {
-            // TODO
+            int k = (blockIdx.x * blockDim.x) + threadIdx.x;
+            if (k >= n) {
+                return;
+            }
+
+            bools[k] = idata[k] != 0;
        }

        /**
@@ -32,7 +37,12 @@ namespace StreamCompaction {
         */
        __global__ void kernScatter(int n, int *odata, const int *idata,
                const int *bools, const int *indices) {
-            // TODO
+            int k = (blockIdx.x * blockDim.x) + threadIdx.x;
+            if (k >= n || bools[k] == 0) {
+                return;
+            }
+
+            odata[indices[k]] = idata[k];
        }
    }
diff --git a/stream_compaction/cpu.cu b/stream_compaction/cpu.cu
index 719fa11..a7c1fc9 100644
--- a/stream_compaction/cpu.cu
+++ b/stream_compaction/cpu.cu
@@ -19,10 +19,22 @@ namespace StreamCompaction {
         */
        void scan(int n, int *odata, const int *idata) {
            timer().startCpuTimer();
-            // TODO
+            int curr_sum = 0;
+            for (int i = 0; i < n; ++i) {
+                odata[i] = curr_sum;
+                curr_sum += idata[i];
+            }
            timer().endCpuTimer();
        }

+        void scanNoTimer(int n, int *odata, const int *idata) {
+            int curr_sum = 0;
+            for (int i = 0; i < n; ++i) {
+                odata[i] = curr_sum;
+                curr_sum += idata[i];
+            }
+        }
+
        /**
         * CPU stream compaction without using the scan function.
         *
@@ -30,9 +42,15 @@ namespace StreamCompaction {
         */
        int compactWithoutScan(int n, int *odata, const int *idata) {
            timer().startCpuTimer();
-            // TODO
+            int curr_index = 0;
+            for (int i = 0; i < n; ++i) {
+                if (idata[i] != 0) {
+                    odata[curr_index] = idata[i];
+                    curr_index += 1;
+                }
+            }
            timer().endCpuTimer();
-            return -1;
+            return curr_index;
        }

        /**
@@ -41,10 +59,24 @@ namespace StreamCompaction {
         * @returns the number of elements remaining after compaction.
         */
        int compactWithScan(int n, int *odata, const int *idata) {
+            int* boolCheck = new int[n];
+            int* intPos = new int[n];
+
            timer().startCpuTimer();
-            // TODO
+
+            for (int i = 0; i < n; ++i) {
+                boolCheck[i] = (int)(idata[i] != 0);
+            }
+            scanNoTimer(n, intPos, boolCheck);
+            int numPos = intPos[n - 1] + boolCheck[n - 1]; // total kept elements
+
+            for (int i = 0; i < n; ++i) {
+                if (boolCheck[i]) {
+                    odata[intPos[i]] = idata[i];
+                }
+            }
            timer().endCpuTimer();
+
+            delete[] boolCheck;
+            delete[] intPos;
-            return -1;
+            return numPos;
        }
    }
}
diff --git a/stream_compaction/efficient.cu b/stream_compaction/efficient.cu
index 2db346e..399a487 100644
--- a/stream_compaction/efficient.cu
+++ b/stream_compaction/efficient.cu
@@ -2,6 +2,9 @@
 #include
 #include "common.h"
 #include "efficient.h"
+#include <iostream>
+
+#define blocksize 512

 namespace StreamCompaction {
    namespace Efficient {
@@ -12,13 +15,90 @@
            return timer;
        }

+        // true iff x % mod == 0, where mod is a power of two
+        __device__ inline bool base_2_mod_is_0(int x, int mod) {
+            return !((bool)(x & (mod - 1)));
+        }
+
+        __global__ void kernEfficientScanUp(int n, int delta, int *data) {
+            int k = threadIdx.x + blockIdx.x * blockDim.x;
+            if (k >= n) {
+                return;
+            }
+
+            if (base_2_mod_is_0(k + 1, 2 * delta)) {
+                data[k] += data[k - delta];
+            }
+        }
+
+        __global__ void kernEfficientScanDown(int n, int delta, int *data) {
+            int k = threadIdx.x + blockIdx.x * blockDim.x;
+            if (k >= n) {
+                return;
+            }
+
+            if (base_2_mod_is_0(k + 1, 2 * delta)) {
+                int prev_ind = k - delta;
+                int tmp = data[prev_ind];
+                data[prev_ind] = data[k];
+                data[k] += tmp;
+            }
+        }
+
+        void printArray(int arr[], int size) {
+            for (int i = 0; i < size; i++) {
+                std::cout << arr[i] << " ";
+            }
+            std::cout << std::endl;
+        }
+
        /**
         * Performs prefix-sum (aka scan) on idata, storing the result into odata.
         */
        void scan(int n, int *odata, const int *idata) {
+            int *dev_data;
+            int numBlocks = (n + (blocksize - 1)) / blocksize;
+            int num_layers = ilog2ceil(n);
+            int evenN = 1 << num_layers;
+            cudaMalloc((void**)&dev_data, evenN * sizeof(int));
+            cudaMemcpy(dev_data, idata, n * sizeof(int), cudaMemcpyHostToDevice);
+            if (evenN > n) {
+                // pad up to the next power of two with zeros
+                cudaMemset(dev_data + n, 0, (evenN - n) * sizeof(int));
+            }
+            //std::cout << "Even N " << evenN << std::endl;
            timer().startGpuTimer();
-            // TODO
+            int delta = 1;
+            //cudaMemcpy(odata, dev_data, n * sizeof(int), cudaMemcpyDeviceToHost);
+            //printArray(odata, n);
+            // the last up-sweep layer only computes the total, which is
+            // overwritten with 0 below, so it can be skipped
+            for (int i = 0; i < num_layers - 1; ++i) {
+                kernEfficientScanUp<<<numBlocks, blocksize>>>(evenN, delta, dev_data);
+                delta = delta << 1;
+                cudaDeviceSynchronize();
+                //cudaMemcpy(odata, dev_data, n * sizeof(int), cudaMemcpyDeviceToHost);
+                //printArray(odata, n);
+            }
+            cudaMemset(dev_data + (evenN - 1), 0, sizeof(int));
+            for (int i = 0; i < num_layers; ++i) {
+                kernEfficientScanDown<<<numBlocks, blocksize>>>(evenN, delta, dev_data);
+                delta = delta >> 1;
+                cudaDeviceSynchronize();
+                //cudaMemcpy(odata, dev_data, n * sizeof(int), cudaMemcpyDeviceToHost);
+                //printArray(odata, n);
+            }
+
            timer().endGpuTimer();
+            cudaMemcpy(odata, dev_data, n * sizeof(int), cudaMemcpyDeviceToHost);
+            cudaDeviceSynchronize();
+            cudaError_t err = cudaGetLastError();
+            if (err != cudaSuccess) {
+                std::cout << "CUDA Error: " << cudaGetErrorString(err) << std::endl;
+            }
+            cudaFree(dev_data);
        }

        /**
@@ -31,10 +111,90 @@
         * @returns The number of elements remaining after compaction.
         */
        int compact(int n, int *odata, const int *idata) {
+
+            int *dev_data_bool;
+            int *dev_data_indices;
+            int *dev_idata;
+            int *dev_odata;
+            int numBlocks = (n + (blocksize - 1)) / blocksize;
+            int num_layers = ilog2ceil(n);
+            int evenN = 1 << num_layers;
+            int bool_sum;
+
+            cudaMalloc((void**)&dev_idata, n * sizeof(int));
+            cudaMemcpy(dev_idata, idata, n * sizeof(int), cudaMemcpyHostToDevice);
+
+            cudaMalloc((void**)&dev_odata, n * sizeof(int));
+
+            cudaMalloc((void**)&dev_data_indices, evenN * sizeof(int));
+            if (evenN > n) {
+                cudaMemset(dev_data_indices + n, 0, (evenN - n) * sizeof(int));
+            }
+
+            cudaMalloc((void**)&dev_data_bool, n * sizeof(int));
+
            timer().startGpuTimer();
-            // TODO
+
+            Common::kernMapToBoolean<<<numBlocks, blocksize>>>(n, dev_data_bool, dev_idata);
+            /*std::cout << "Bool" << std::endl;
+            cudaMemcpy(odata, dev_data_bool, n * sizeof(int), cudaMemcpyDeviceToHost);
+            printArray(odata, n);
+            */
+
+            cudaMemcpy(dev_data_indices, dev_data_bool, n * sizeof(int), cudaMemcpyDeviceToDevice);
+
+            int delta = 1;
+            for (int i = 0; i < num_layers; ++i) {
+                kernEfficientScanUp<<<numBlocks, blocksize>>>(evenN, delta, dev_data_indices);
+                delta = delta << 1;
+                cudaDeviceSynchronize();
+            }
+            /*std::cout << "Sum Inds" << std::endl;
+            cudaMemcpy(odata, dev_data_indices, evenN * sizeof(int), cudaMemcpyDeviceToHost);
+            printArray(odata, evenN);
+            */
+            delta = delta >> 1;
+            // the root now holds the total number of kept elements
+            cudaMemcpy(&bool_sum, &dev_data_indices[evenN - 1], sizeof(int), cudaMemcpyDeviceToHost);
+            cudaMemset(dev_data_indices + (evenN - 1), 0, sizeof(int));
+            for (int i = 0; i < num_layers; ++i) {
+                kernEfficientScanDown<<<numBlocks, blocksize>>>(evenN, delta, dev_data_indices);
+                delta = delta >> 1;
+                cudaDeviceSynchronize();
+            }
+
+            /*std::cout << "Final Inds" << std::endl;
+            cudaMemcpy(odata, dev_data_indices, evenN * sizeof(int), cudaMemcpyDeviceToHost);
+            printArray(odata, evenN);
+            */
+
+            Common::kernScatter<<<numBlocks, blocksize>>>(n, dev_odata, dev_idata, dev_data_bool, dev_data_indices);
            timer().endGpuTimer();
-            return -1;
+
+            cudaMemcpy(odata, dev_odata, n * sizeof(int), cudaMemcpyDeviceToHost);
+            /*std::cout << "compacted with total " << bool_sum << std::endl;
+            printArray(odata, bool_sum);
+            */
+            cudaDeviceSynchronize();
+            cudaError_t err = cudaGetLastError();
+            if (err != cudaSuccess) {
+                std::cout << "CUDA Error: " << cudaGetErrorString(err) << std::endl;
+            }
+
+            cudaFree(dev_data_bool);
+            cudaFree(dev_data_indices);
+            cudaFree(dev_idata);
+            cudaFree(dev_odata);
+
+            return bool_sum;
        }
    }
}
diff --git a/stream_compaction/naive.cu b/stream_compaction/naive.cu
index 4308876..3c10fcb 100644
--- a/stream_compaction/naive.cu
+++ b/stream_compaction/naive.cu
@@ -2,6 +2,9 @@
 #include
 #include "common.h"
 #include "naive.h"
+#include <iostream>
+
+#define blocksize 512

 namespace StreamCompaction {
    namespace Naive {
@@ -11,15 +14,70 @@
            static PerformanceTimer timer;
            return timer;
        }
-        // TODO: __global__
+        __global__ void kernNaiveScanStep(int n, int d, int *odata, int *idata) {
+            int k = threadIdx.x + blockIdx.x * blockDim.x;
+            int backward_ind = k - d;
+            if (k >= n) {
+                return;
+            }
+
+            if (backward_ind < 0) {
+                odata[k] = idata[k];
+            } else {
+                odata[k] = idata[k] + idata[backward_ind];
+            }
+        }
+
+        void printArray(int arr[], int size) {
+            for (int i = 0; i < size; i++) {
+                std::cout << arr[i] << " ";
+            }
+            std::cout << std::endl;
+        }

        /**
         * Performs prefix-sum (aka scan) on idata, storing the result into odata.
         */
+        void scan(int n, int *odata, const int *idata) {
+
+            int *dev_idata, *dev_odata;
+            cudaMalloc((void**)&dev_idata, n * sizeof(int));
+            cudaMalloc((void**)&dev_odata, n * sizeof(int));
+            cudaMemcpy(dev_idata, idata, n * sizeof(int), cudaMemcpyHostToDevice);
+
            timer().startGpuTimer();
-            // TODO
+
+            int numBlocks = (n + (blocksize - 1)) / blocksize;
+            int d = 1;
+            for (int i = 0; i < ilog2ceil(n); ++i) {
+                kernNaiveScanStep<<<numBlocks, blocksize>>>(n, d, dev_odata, dev_idata);
+                cudaDeviceSynchronize();
+                /*
+                cudaError_t err = cudaGetLastError();
+                if (err != cudaSuccess) {
+                    std::cout << "CUDA Error: " << cudaGetErrorString(err) << std::endl;
+                }
+                */
+
+                d *= 2;
+
+                // ping-pong the buffers so the next step reads this step's output
+                int *tmp = dev_idata;
+                dev_idata = dev_odata;
+                dev_odata = tmp;
+            }
+
+            timer().endGpuTimer();
+            // shift right by one element to convert the inclusive scan to an exclusive scan
+            odata[0] = 0;
+            cudaMemcpy(odata + 1, dev_idata, (n - 1) * sizeof(int), cudaMemcpyDeviceToHost);
+            cudaFree(dev_idata);
+            cudaFree(dev_odata);
        }
    }
}
diff --git a/stream_compaction/thrust.cu b/stream_compaction/thrust.cu
index 1def45e..55c8230 100644
--- a/stream_compaction/thrust.cu
+++ b/stream_compaction/thrust.cu
@@ -18,11 +18,18 @@
         * Performs prefix-sum (aka scan) on idata, storing the result into odata.
         */
        void scan(int n, int *odata, const int *idata) {
+
+            // Use device vectors instead of device pointers (as suggested in the
+            // comments below), so no explicit Memcpy is needed
+            thrust::device_vector<int> dev_idata(idata, idata + n);
+            thrust::device_vector<int> dev_odata(n);
+            cudaDeviceSynchronize();
            timer().startGpuTimer();
            // TODO use `thrust::exclusive_scan`
            // example: for device_vectors dv_in and dv_out:
            // thrust::exclusive_scan(dv_in.begin(), dv_in.end(), dv_out.begin());
+            thrust::exclusive_scan(dev_idata.begin(), dev_idata.end(), dev_odata.begin());
            timer().endGpuTimer();
+            thrust::copy(dev_odata.begin(), dev_odata.end(), odata);
        }
    }
}
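As a host-side sanity check for the Thrust wrapper above, C++17's `std::exclusive_scan` computes the same exclusive prefix sum sequentially (a sketch; the function name `referenceExclusiveScan` is illustrative, not part of the project):

```cpp
#include <cassert>
#include <numeric>
#include <vector>

// Sequential reference mirroring thrust::exclusive_scan with an
// initial value of 0; useful for validating GPU scan output.
std::vector<int> referenceExclusiveScan(const std::vector<int>& in) {
    std::vector<int> out(in.size());
    std::exclusive_scan(in.begin(), in.end(), out.begin(), 0);
    return out;
}
```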