diff --git a/README.md b/README.md
index b71c458..1e4a136 100644
--- a/README.md
+++ b/README.md
@@ -3,11 +3,139 @@ CUDA Stream Compaction
 **University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 2**
 
-* (TODO) YOUR NAME HERE
-* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab)
+* Rudraksha D. Shah
+* Tested on: Windows 10, i7-7700HQ @ 2.80GHz 16GB, GTX 1050 4096MB (Personal Computer)
 
-### (TODO: Your README)
-Include analysis, etc. (Remember, this is public, so don't put
-anything here that you don't want to share with the world.)
+## Overview
+-------------------
+
+__Scan:__ A scan iterates through an input array and, using a given operator (which could be any binary mathematical operator), produces an output array in which the value at index `i` is the result of applying the operator to every preceding value.
+
+There are two types of scan:
+- Inclusive Scan: The operator is applied to every preceding value and to the value at index `i` as well.
+
+- Exclusive Scan: The operator is applied to every preceding value, excluding the value at index `i`.
+
+I implemented the scan algorithm on both the CPU and the GPU. The scan operator is `+` (addition), so we will be computing inclusive and exclusive prefix sums.
+
+- CPU: The CPU implementation of the scan is straightforward: a simple for loop goes through every element, keeps the accumulated sum of all previous elements in a variable, and writes the running sum into the output array.
+
+    The algorithm performs `n` additions and has a time complexity of `O(n)`, where `n` is the number of elements in the array.
+
+- GPU Naive: The naive GPU implementation is based on the scan algorithm presented by Hillis and Steele [1986](http://www.umiacs.umd.edu/~ramani/cmsc828e_gpusci/ScanTalk.pdf). The figure below shows the iteration steps of the algorithm: at each depth `d`, every element with index `i >= 2^(d-1)` adds to itself the element located `2^(d-1)` positions to its left.
+
+    The drawback of this method is that it performs roughly `n*log(n)` addition operations, compared to the `n` additions of the CPU implementation.
+
+    ![Naive GPU Scan](img/figure-39-2.jpg)
+
+- GPU Work-Efficient: This algorithm matches the work of the sequential CPU algorithm, performing only `O(n)` addition operations. It is based on the scan presented by Blelloch [1990](https://www.mimuw.edu.pl/~ps209291/kgkp/slides/scan.pdf). The method treats the array being scanned as a pseudo balanced binary tree built in place: for an array of size `n` there are `log(n)` levels, and level `l` has `2^l` nodes. Performing one addition per node gives a total of about `n` additions, which makes the complexity of this algorithm `O(n)`. A simplified kernel sketch is shown just after this list.
+
+    There are essentially two steps in this algorithm:
+    - Up-Sweep (parallel reduction): We start from the leaf nodes, i.e. the original array, and travel up the tree computing partial sums at each level. At the end, the root node contains the sum of all the elements of the array.
+
+    ![Up-Sweep](img/UpSweep.PNG)
+
+    - Down-Sweep: We start from the root node and replace its value with `0`. At each level, each node's current value is stored in its left child, and the sum of its value and the former left child's value is placed in its right child. This produces an exclusive scan of the input array.
+
+    ![Down Sweep](img/DownSweep.jpg)
+
+- Thrust: Thrust is a library that produces the scan output for any input array wrapped in a `thrust::device_vector`, using the function `thrust::exclusive_scan(first, last, result)`.
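+
+Below is a minimal, illustrative CUDA sketch of the work-efficient up-sweep and down-sweep described above. It is not the exact code in this repository: the names `kernUpSweep`, `kernDownSweep`, and `workEfficientScan` are made up for this sketch, and the array length `n` is assumed to already be padded to a power of two.
+
+```
+#include <cuda_runtime.h>
+
+// In-place Blelloch scan over a device array (n padded to a power of two).
+__global__ void kernUpSweep(int n, int stride, int *data) {
+    int i = blockIdx.x * blockDim.x + threadIdx.x;
+    // Only the last index of each 2*stride-wide segment is active.
+    if (i >= n || (i % (2 * stride)) != 2 * stride - 1) return;
+    data[i] += data[i - stride];              // partial sum of the left half
+}
+
+__global__ void kernDownSweep(int n, int stride, int *data) {
+    int i = blockIdx.x * blockDim.x + threadIdx.x;
+    if (i >= n || (i % (2 * stride)) != 2 * stride - 1) return;
+    int left = data[i - stride];              // former left child
+    data[i - stride] = data[i];               // left child receives the parent's value
+    data[i] += left;                          // right child receives parent + former left child
+}
+
+void workEfficientScan(int n, int *dev_data, int blockSize) {
+    dim3 blocks((n + blockSize - 1) / blockSize);
+    for (int stride = 1; stride < n; stride *= 2)        // up-sweep (parallel reduction)
+        kernUpSweep<<<blocks, blockSize>>>(n, stride, dev_data);
+    cudaMemset(dev_data + n - 1, 0, sizeof(int));        // replace the root with 0
+    for (int stride = n / 2; stride >= 1; stride /= 2)   // down-sweep
+        kernDownSweep<<<blocks, blockSize>>>(n, stride, dev_data);
+}
+```
+
+The actual kernels in `stream_compaction/efficient.cu` follow the same structure, launching one thread per element and selecting the active indices with a modulo test.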
+
+__Scatter:__ This is the final step of stream compaction. In the current implementation we want to remove the elements that are `0` from the input array. For this we produce a boolean array that stores, for each position, `1` if the value at that position is non-zero and `0` otherwise. We then perform an exclusive scan on this boolean array and use the scanned result to determine the index at which each non-zero element is placed in the output array.
+
+![Scatter](img/Scatter.PNG)
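+
+As an illustration (again, not the exact code in this repository; the kernel names here are made up), the boolean map and the scatter can each be written as a small kernel. For example, for `idata = [1 0 2 0 3]` the boolean array is `[1 0 1 0 1]`, its exclusive scan is `[0 1 1 2 2]`, and scattering produces `odata = [1 2 3]` with `count = scan[n-1] + bools[n-1] = 2 + 1 = 3`.
+
+```
+#include <cuda_runtime.h>
+
+// 1 where the element should be kept, 0 where it should be removed.
+__global__ void kernMapToBool(int n, int *bools, const int *idata) {
+    int i = blockIdx.x * blockDim.x + threadIdx.x;
+    if (i >= n) return;
+    bools[i] = (idata[i] != 0) ? 1 : 0;
+}
+
+// indices[] is the exclusive scan of bools[]; it gives the output slot of every kept element.
+__global__ void kernScatterKept(int n, int *odata, const int *idata,
+                                const int *bools, const int *indices) {
+    int i = blockIdx.x * blockDim.x + threadIdx.x;
+    if (i >= n) return;
+    if (bools[i] == 1) {
+        odata[indices[i]] = idata[i];
+    }
+}
+```
+
+The corresponding kernels in this project live in `stream_compaction/common.cu` (`kernMapToBoolean` and `kernScatter`).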
+
+#### Performance Analysis
+----------------------------
+
+For the performance analysis I used a block size of 128 and logged the execution times of the scan implementations for increasing array sizes.
+
+* Table:
+
+  ![Table](img/Table.PNG)
+
+* Chart:
+
+  ![Performance Chart](img/TableChart.png)
+
+
+- Analysis:
+
+  From the performance data and the chart we can see that, across all the implementations, the CPU implementation is the fastest. This is to be expected: even though the CPU implementation is synchronous and serial, its memory access pattern is highly sequential, which leads to very few cache misses, and the CPU is heavily optimised for such sequential memory operations.
+
+  The GPU naive implementation, on the other hand, performs more computations than the CPU implementation (roughly `n*log(n)` additions versus `n`) and is therefore slower than the CPU despite being massively parallel. Similarly, the GPU work-efficient implementation is also slower than the CPU version, even though it performs about the same number of additions.
+
+  Bottleneck: The primary bottleneck for the GPU implementations is global memory access. The work per thread is relatively light, yet each thread makes at least two reads from global memory to fetch the two values to be added. The second bottleneck, I think, is inefficient use of the GPU threads: at each subsequent level the number of working threads per warp shrinks, and the inactive threads sit idle while still occupying GPU resources. The overhead of these threads also slows down the GPU implementations.
+
+  Unexpected Result: Wait... the work-efficient GPU implementation should be faster than the naive one! In my measurements, however, the work-efficient implementation is only about as fast as the naive implementation, and it even becomes slower as the number of elements in the array increases. This is a confusing result that I am not able to explain. I tried debugging my code and going through it to make sure I was not doing anything wrong, but I could not find anything. I believe this may in fact be the expected result because, as mentioned above, the number of idle threads in the work-efficient implementation will be roughly twice that of the naive implementation, but I would like to know the exact reasoning.
+
+  Thrust Implementation: Thrust performed more or less consistently across the increasing array sizes. Over the largest array sizes (`2^20` - `2^22`) its speed drops, as expected. Overall it was the slowest of all the implementations for every array size, which is rather expected, as the library is likely doing many things that are not necessary for a plain scan. A proper analysis would require a clearer understanding of the underlying library code.
+
+```
+
+****************
+** SCAN TESTS **
+****************
+    [  30  31  46  44  34  30  14  45  47  29  37   0  38 ...  48   0 ]
+==== cpu scan, power-of-two ====
+   elapsed time: 1.67895ms    (std::chrono Measured)
+    [   0  30  61 107 151 185 215 229 274 321 350 387 387 ... 25684633 25684681 ]
+==== cpu scan, non-power-of-two ====
+   elapsed time: 1.74022ms    (std::chrono Measured)
+    [   0  30  61 107 151 185 215 229 274 321 350 387 387 ... 25684565 25684600 ]
+    passed
+==== naive scan, power-of-two ====
+   elapsed time: 2.16678ms    (CUDA Measured)
+    [   0  30  61 107 151 185 215 229 274 321 350 387 387 ... 25684633 25684681 ]
+    passed
+==== naive scan, non-power-of-two ====
+   elapsed time: 2.16525ms    (CUDA Measured)
+    [   0  30  61 107 151 185 215 229 274 321 350 387 387 ...   0   0 ]
+    passed
+==== work-efficient scan, power-of-two ====
+   elapsed time: 3.21434ms    (CUDA Measured)
+    [   0  30  61 107 151 185 215 229 274 321 350 387 387 ... 25684633 25684681 ]
+    passed
+==== work-efficient scan, non-power-of-two ====
+   elapsed time: 3.2415ms    (CUDA Measured)
+    [   0  30  61 107 151 185 215 229 274 321 350 387 387 ... 25684565 25684600 ]
+    passed
+==== thrust scan, power-of-two ====
+   elapsed time: 4.51072ms    (CUDA Measured)
+    [   0  30  61 107 151 185 215 229 274 321 350 387 387 ... 25684633 25684681 ]
+    passed
+==== thrust scan, non-power-of-two ====
+   elapsed time: 0.342016ms    (CUDA Measured)
+    [   0  30  61 107 151 185 215 229 274 321 350 387 387 ... 25684565 25684600 ]
+    passed
+
+*****************************
+** STREAM COMPACTION TESTS **
+*****************************
+    [   1   3   2   3   1   2   0   2   2   3   3   0   3 ...   3   0 ]
+==== cpu compact without scan, power-of-two ====
+   elapsed time: 2.93525ms    (std::chrono Measured)
+    [   1   3   2   3   1   2   2   2   3   3   3   3   1 ...   2   3 ]
+    passed
+==== cpu compact without scan, non-power-of-two ====
+   elapsed time: 2.90024ms    (std::chrono Measured)
+    [   1   3   2   3   1   2   2   2   3   3   3   3   1 ...   3   3 ]
+    passed
+==== cpu compact with scan ====
+   elapsed time: 10.3574ms    (std::chrono Measured)
+    [   1   3   2   3   1   2   2   2   3   3   3   3   1 ...   2   3 ]
+    passed
+==== work-efficient compact, power-of-two ====
+   elapsed time: 3.39251ms    (CUDA Measured)
+    [   1   3   2   3   1   2   2   2   3   3   3   3   1 ...   2   3 ]
+    passed
+==== work-efficient compact, non-power-of-two ====
+   elapsed time: 3.42634ms    (CUDA Measured)
+    [   1   3   2   3   1   2   2   2   3   3   3   3   1 ...
3 3 ] + passed + +``` + diff --git a/img/figure-39-4.jpg b/img/DownSweep.jpg similarity index 100% rename from img/figure-39-4.jpg rename to img/DownSweep.jpg diff --git a/img/Scatter.PNG b/img/Scatter.PNG new file mode 100644 index 0000000..827d32d Binary files /dev/null and b/img/Scatter.PNG differ diff --git a/img/Table.PNG b/img/Table.PNG new file mode 100644 index 0000000..460ec0e Binary files /dev/null and b/img/Table.PNG differ diff --git a/img/TableChart.png b/img/TableChart.png new file mode 100644 index 0000000..9b14f2d Binary files /dev/null and b/img/TableChart.png differ diff --git a/img/UpSweep.PNG b/img/UpSweep.PNG new file mode 100644 index 0000000..8c5871f Binary files /dev/null and b/img/UpSweep.PNG differ diff --git a/src/main.cpp b/src/main.cpp index 7305641..4eb5d99 100644 --- a/src/main.cpp +++ b/src/main.cpp @@ -13,7 +13,7 @@ #include #include "testing_helpers.hpp" -const int SIZE = 1 << 8; // feel free to change the size of array +const int SIZE = 1 << 20; // feel free to change the size of array const int NPOT = SIZE - 3; // Non-Power-Of-Two int a[SIZE], b[SIZE], c[SIZE]; @@ -49,42 +49,42 @@ int main(int argc, char* argv[]) { printDesc("naive scan, power-of-two"); StreamCompaction::Naive::scan(SIZE, c, a); printElapsedTime(StreamCompaction::Naive::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); - //printArray(SIZE, c, true); + printArray(SIZE, c, true); printCmpResult(SIZE, b, c); zeroArray(SIZE, c); printDesc("naive scan, non-power-of-two"); StreamCompaction::Naive::scan(NPOT, c, a); printElapsedTime(StreamCompaction::Naive::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); - //printArray(SIZE, c, true); + printArray(SIZE, c, true); printCmpResult(NPOT, b, c); zeroArray(SIZE, c); printDesc("work-efficient scan, power-of-two"); StreamCompaction::Efficient::scan(SIZE, c, a); printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); - //printArray(SIZE, c, true); + printArray(SIZE, c, true); printCmpResult(SIZE, b, c); zeroArray(SIZE, c); printDesc("work-efficient scan, non-power-of-two"); StreamCompaction::Efficient::scan(NPOT, c, a); printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); - //printArray(NPOT, c, true); + printArray(NPOT, c, true); printCmpResult(NPOT, b, c); zeroArray(SIZE, c); printDesc("thrust scan, power-of-two"); StreamCompaction::Thrust::scan(SIZE, c, a); printElapsedTime(StreamCompaction::Thrust::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); - //printArray(SIZE, c, true); + printArray(SIZE, c, true); printCmpResult(SIZE, b, c); zeroArray(SIZE, c); printDesc("thrust scan, non-power-of-two"); StreamCompaction::Thrust::scan(NPOT, c, a); printElapsedTime(StreamCompaction::Thrust::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); - //printArray(NPOT, c, true); + printArray(NPOT, c, true); printCmpResult(NPOT, b, c); printf("\n"); @@ -129,14 +129,14 @@ int main(int argc, char* argv[]) { printDesc("work-efficient compact, power-of-two"); count = StreamCompaction::Efficient::compact(SIZE, c, a); printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); - //printArray(count, c, true); + printArray(count, c, true); printCmpLenResult(count, expectedCount, b, c); zeroArray(SIZE, c); printDesc("work-efficient compact, non-power-of-two"); count = StreamCompaction::Efficient::compact(NPOT, c, a); 
printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); - //printArray(count, c, true); + printArray(count, c, true); printCmpLenResult(count, expectedNPOT, b, c); system("pause"); // stop Win32 console from closing on exit diff --git a/stream_compaction/common.cu b/stream_compaction/common.cu index 8fc0211..0a0731d 100644 --- a/stream_compaction/common.cu +++ b/stream_compaction/common.cu @@ -22,18 +22,32 @@ namespace StreamCompaction { * Maps an array to an array of 0s and 1s for stream compaction. Elements * which map to 0 will be removed, and elements which map to 1 will be kept. */ - __global__ void kernMapToBoolean(int n, int *bools, const int *idata) { - // TODO - } + + __global__ void kernMapToBoolean(int n, int *bools, const int *idata) { + + // TODO + int index = (blockIdx.x * blockDim.x) + threadIdx.x; + if (index >= n) return; + + bools[index] = (idata[index] != 0) ? 1 : 0; + + } /** * Performs scatter on an array. That is, for each element in idata, * if bools[idx] == 1, it copies idata[idx] to odata[indices[idx]]. */ - __global__ void kernScatter(int n, int *odata, - const int *idata, const int *bools, const int *indices) { + + __global__ void kernScatter(int n, int *odata, const int *idata, const int *bools, const int *indices) { // TODO + int index = (blockIdx.x * blockDim.x) + threadIdx.x; + if (index >= n) return; + + if (bools[index] == 1) { + odata[indices[index]] = idata[index]; + } } } + } diff --git a/stream_compaction/common.h b/stream_compaction/common.h index 55f1b38..4b7142e 100644 --- a/stream_compaction/common.h +++ b/stream_compaction/common.h @@ -1,17 +1,18 @@ #pragma once -#include -#include - -#include -#include -#include -#include +#include +#include + +#include +#include +#include +#include #include -#include +#include #define FILENAME (strrchr(__FILE__, '/') ? strrchr(__FILE__, '/') + 1 : __FILE__) #define checkCUDAError(msg) checkCUDAErrorFn(msg, FILENAME, __LINE__) +#define blockSize 128 /** * Check for CUDA errors; print and exit if there was a problem. 
@@ -37,96 +38,96 @@ namespace StreamCompaction { __global__ void kernScatter(int n, int *odata, const int *idata, const int *bools, const int *indices); - /** - * This class is used for timing the performance - * Uncopyable and unmovable - * - * Adapted from WindyDarian(https://github.com/WindyDarian) - */ - class PerformanceTimer - { - public: - PerformanceTimer() - { - cudaEventCreate(&event_start); - cudaEventCreate(&event_end); - } - - ~PerformanceTimer() - { - cudaEventDestroy(event_start); - cudaEventDestroy(event_end); - } - - void startCpuTimer() - { - if (cpu_timer_started) { throw std::runtime_error("CPU timer already started"); } - cpu_timer_started = true; - - time_start_cpu = std::chrono::high_resolution_clock::now(); - } - - void endCpuTimer() - { - time_end_cpu = std::chrono::high_resolution_clock::now(); - - if (!cpu_timer_started) { throw std::runtime_error("CPU timer not started"); } - - std::chrono::duration duro = time_end_cpu - time_start_cpu; - prev_elapsed_time_cpu_milliseconds = - static_cast(duro.count()); - - cpu_timer_started = false; - } - - void startGpuTimer() - { - if (gpu_timer_started) { throw std::runtime_error("GPU timer already started"); } - gpu_timer_started = true; - - cudaEventRecord(event_start); - } - - void endGpuTimer() - { - cudaEventRecord(event_end); - cudaEventSynchronize(event_end); - - if (!gpu_timer_started) { throw std::runtime_error("GPU timer not started"); } - - cudaEventElapsedTime(&prev_elapsed_time_gpu_milliseconds, event_start, event_end); - gpu_timer_started = false; - } - - float getCpuElapsedTimeForPreviousOperation() //noexcept //(damn I need VS 2015 - { - return prev_elapsed_time_cpu_milliseconds; - } - - float getGpuElapsedTimeForPreviousOperation() //noexcept - { - return prev_elapsed_time_gpu_milliseconds; - } - - // remove copy and move functions - PerformanceTimer(const PerformanceTimer&) = delete; - PerformanceTimer(PerformanceTimer&&) = delete; - PerformanceTimer& operator=(const PerformanceTimer&) = delete; - PerformanceTimer& operator=(PerformanceTimer&&) = delete; - - private: - cudaEvent_t event_start = nullptr; - cudaEvent_t event_end = nullptr; - - using time_point_t = std::chrono::high_resolution_clock::time_point; - time_point_t time_start_cpu; - time_point_t time_end_cpu; - - bool cpu_timer_started = false; - bool gpu_timer_started = false; - - float prev_elapsed_time_cpu_milliseconds = 0.f; - float prev_elapsed_time_gpu_milliseconds = 0.f; + /** + * This class is used for timing the performance + * Uncopyable and unmovable + * + * Adapted from WindyDarian(https://github.com/WindyDarian) + */ + class PerformanceTimer + { + public: + PerformanceTimer() + { + cudaEventCreate(&event_start); + cudaEventCreate(&event_end); + } + + ~PerformanceTimer() + { + cudaEventDestroy(event_start); + cudaEventDestroy(event_end); + } + + void startCpuTimer() + { + if (cpu_timer_started) { throw std::runtime_error("CPU timer already started"); } + cpu_timer_started = true; + + time_start_cpu = std::chrono::high_resolution_clock::now(); + } + + void endCpuTimer() + { + time_end_cpu = std::chrono::high_resolution_clock::now(); + + if (!cpu_timer_started) { throw std::runtime_error("CPU timer not started"); } + + std::chrono::duration duro = time_end_cpu - time_start_cpu; + prev_elapsed_time_cpu_milliseconds = + static_cast(duro.count()); + + cpu_timer_started = false; + } + + void startGpuTimer() + { + if (gpu_timer_started) { throw std::runtime_error("GPU timer already started"); } + gpu_timer_started = true; + + 
cudaEventRecord(event_start); + } + + void endGpuTimer() + { + cudaEventRecord(event_end); + cudaEventSynchronize(event_end); + + if (!gpu_timer_started) { throw std::runtime_error("GPU timer not started"); } + + cudaEventElapsedTime(&prev_elapsed_time_gpu_milliseconds, event_start, event_end); + gpu_timer_started = false; + } + + float getCpuElapsedTimeForPreviousOperation() //noexcept //(damn I need VS 2015 + { + return prev_elapsed_time_cpu_milliseconds; + } + + float getGpuElapsedTimeForPreviousOperation() //noexcept + { + return prev_elapsed_time_gpu_milliseconds; + } + + // remove copy and move functions + PerformanceTimer(const PerformanceTimer&) = delete; + PerformanceTimer(PerformanceTimer&&) = delete; + PerformanceTimer& operator=(const PerformanceTimer&) = delete; + PerformanceTimer& operator=(PerformanceTimer&&) = delete; + + private: + cudaEvent_t event_start = nullptr; + cudaEvent_t event_end = nullptr; + + using time_point_t = std::chrono::high_resolution_clock::time_point; + time_point_t time_start_cpu; + time_point_t time_end_cpu; + + bool cpu_timer_started = false; + bool gpu_timer_started = false; + + float prev_elapsed_time_cpu_milliseconds = 0.f; + float prev_elapsed_time_gpu_milliseconds = 0.f; }; } } diff --git a/stream_compaction/cpu.cu b/stream_compaction/cpu.cu index 05ce667..27f9896 100644 --- a/stream_compaction/cpu.cu +++ b/stream_compaction/cpu.cu @@ -1,15 +1,15 @@ #include #include "cpu.h" -#include "common.h" +#include "common.h" namespace StreamCompaction { namespace CPU { - using StreamCompaction::Common::PerformanceTimer; - PerformanceTimer& timer() - { - static PerformanceTimer timer; - return timer; + using StreamCompaction::Common::PerformanceTimer; + PerformanceTimer& timer() + { + static PerformanceTimer timer; + return timer; } /** @@ -17,34 +17,115 @@ namespace StreamCompaction { * For performance analysis, this is supposed to be a simple for loop. * (Optional) For better understanding before starting moving to GPU, you can simulate your GPU scan in this function first. */ - void scan(int n, int *odata, const int *idata) { - timer().startCpuTimer(); + + void scan(int n, int *odata, const int *idata) { + + timer().startCpuTimer(); // TODO + // Exclusive scan, first element is 0 + scanExclusivePrefixSum(n, odata, idata); timer().endCpuTimer(); - } + + } + + // Scan implementation to avoid the CPU timer error + void scanExclusivePrefixSum(int n, int *odata, const int *idata) { + + odata[0] = 0; + for (int i = 1; i < n; ++i) { + odata[i] = odata[i - 1] + idata[i - 1]; + } + + } /** * CPU stream compaction without using the scan function. * * @returns the number of elements remaining after compaction. */ - int compactWithoutScan(int n, int *odata, const int *idata) { - timer().startCpuTimer(); + + int compactWithoutScan(int n, int *odata, const int *idata) { + + timer().startCpuTimer(); // TODO + int count = 0; + for (int i = 0; i < n; ++i) { + if (idata[i] != 0) { + odata[count] = idata[i]; + count++; + } + } timer().endCpuTimer(); - return -1; - } + return count; + + } /** * CPU stream compaction using scan and scatter, like the parallel version. * * @returns the number of elements remaining after compaction. 
*/ - int compactWithScan(int n, int *odata, const int *idata) { - timer().startCpuTimer(); + + int compactWithScan(int n, int *odata, const int *idata) { + + timer().startCpuTimer(); // TODO - timer().endCpuTimer(); - return -1; - } - } + int count = 0; + int *mappedArray = new int[n]; + int *scannedArray = new int[n]; + + // Map the data + map(n, mappedArray, idata); + + // Scan the mapped data + scanExclusivePrefixSum(n, scannedArray, mappedArray); + + // Scatter + count = scatter(n, mappedArray, scannedArray, idata, odata); + + timer().endCpuTimer(); + return count; + + } + + /** + * CPU stream compaction part: Mapping + * + * @based on the defined rules (here we map non zero values) assigns 1 or 0 for a given element in the idata array. + */ + + void map(int n, int *mapped, const int *idata) { + + for (int i = 0; i < n; ++i) { + if (idata[i] != 0) { + mapped[i] = 1; + } + else { + mapped[i] = 0; + } + } + + } + + /** + * CPU stream compaction part: Scatter + * + * @scatters the input elements to the output vector using the addresses generated by the scan. + * returns the number of elements remaining after the scan. + */ + + int scatter(int n, int *mapped, int *scanned, const int *idata, int *odata) { + + int count = 0; + for (int i = 0; i < n; ++i) { + if (mapped[i] == 1) { + odata[scanned[i]] = idata[i]; + count++; + } + } + return count; + } + + } + } diff --git a/stream_compaction/cpu.h b/stream_compaction/cpu.h index 236ce11..fe14ab2 100644 --- a/stream_compaction/cpu.h +++ b/stream_compaction/cpu.h @@ -8,8 +8,14 @@ namespace StreamCompaction { void scan(int n, int *odata, const int *idata); + void scanExclusivePrefixSum(int n, int *odata, const int *idata); + int compactWithoutScan(int n, int *odata, const int *idata); int compactWithScan(int n, int *odata, const int *idata); + + int scatter(int n, int *mapped, int *scanned, const int *idata, int *odata); + + void map(int n, int *mapped, const int *idata); } } diff --git a/stream_compaction/efficient.cu b/stream_compaction/efficient.cu index 36c5ef2..c633eca 100644 --- a/stream_compaction/efficient.cu +++ b/stream_compaction/efficient.cu @@ -5,21 +5,111 @@ namespace StreamCompaction { namespace Efficient { - using StreamCompaction::Common::PerformanceTimer; - PerformanceTimer& timer() - { - static PerformanceTimer timer; - return timer; + using StreamCompaction::Common::PerformanceTimer; + PerformanceTimer& timer() + { + static PerformanceTimer timer; + return timer; } + + void __global__ kernScanUpSweep (const int N, const int powerv1, const int powerv2, int *dev_oDataArray) { + + int index = (blockIdx.x * blockDim.x) + threadIdx.x; + if (index >= N) return; + + if (index % powerv1 != powerv1 - 1) return; + dev_oDataArray[index] += dev_oDataArray[index - powerv2]; + + // After UpSweep is over the last value in the array is given value 0 which will be useful in the DownSweep step + if (index == N - 1) { + dev_oDataArray[index] = 0; + } + + } + + void __global__ kernScanDownSweep (const int N, const int powerv1, const int powerv2, int *dev_oDataArray) { + + int index = (blockIdx.x * blockDim.x) + threadIdx.x; + if (index >= N) return; + + if (index % powerv1 != powerv1 - 1) return; + int temp = dev_oDataArray[index - powerv2]; + dev_oDataArray[index - powerv2] = dev_oDataArray[index]; + dev_oDataArray[index] += temp; + + } + + void __global__ kernPaddArrayWithZero(const int N, const int paddedArrayLength, int *dev_oDataArray) { + + int index = (blockIdx.x * blockDim.x) + threadIdx.x; + if (index < N || index >= paddedArrayLength) 
return; + + dev_oDataArray[index] = 0; + + } + + void scanExcusivePrefixSum(int N, int dimension, dim3 fullBlocksPerGrid, dim3 threadsPerBlock, int *dev_oDataArray) { + + // Up Sweep Scan + int powerv1, powerv2; + for (int d = 0; d < dimension; ++d) { + powerv1 = 1 << (d + 1); + powerv2 = 1 << d; + kernScanUpSweep << > > (N, powerv1, powerv2, dev_oDataArray); + } + + // Down Sweep Scans + for (int i = dimension - 1; i >= 0; --i) { + powerv1 = 1 << (i + 1); + powerv2 = 1 << i; + kernScanDownSweep << > > (N, powerv1, powerv2, dev_oDataArray); + } + + } /** * Performs prefix-sum (aka scan) on idata, storing the result into odata. */ - void scan(int n, int *odata, const int *idata) { + + void scan(int n, int *odata, const int *idata) { + + // Defining the configuration of the kernel + int dimension = ilog2ceil(n); + int paddedArrayLength = 1 << (dimension); + int paddedArraySize = paddedArrayLength * sizeof(int); + dim3 fullBlocksPerGrid((paddedArrayLength + blockSize - 1) / blockSize); + dim3 threadsPerBlock(blockSize); + int size = n * sizeof(int); + + // Creating array buffers on the device memory + int *dev_oDataArray; + cudaMalloc((void**)&dev_oDataArray, paddedArraySize); + checkCUDAError("cudaMalloc for dev_oDataArray failed!"); + + // Copying array buffers (Host to Device) + cudaMemcpy(dev_oDataArray, idata, size, cudaMemcpyHostToDevice); + checkCUDAError("cudaMemcpy into dev_oDataArray failed!"); + + // For the extra space in the padded array fill it with 0 + kernPaddArrayWithZero <<>> (n, paddedArrayLength, dev_oDataArray); + timer().startGpuTimer(); - // TODO - timer().endGpuTimer(); - } + + // TODO + scanExcusivePrefixSum(paddedArrayLength, dimension, fullBlocksPerGrid, threadsPerBlock, dev_oDataArray); + checkCUDAError("scanExcusivePrefixSum failed!"); + + timer().endGpuTimer(); + + //Copying array buffers (Device to Host) + cudaMemcpy(odata, dev_oDataArray, size, cudaMemcpyDeviceToHost); + checkCUDAError("cudaMemcpy into odata failed!"); + + // Freeing the data buffer stored in device memory + cudaFree(dev_oDataArray); + checkCUDAError("cudaFree on dev_oDataArray failed!"); + + } /** * Performs stream compaction on idata, storing the result into odata. @@ -30,11 +120,82 @@ namespace StreamCompaction { * @param idata The array of elements to compact. * @returns The number of elements remaining after compaction. 
*/ - int compact(int n, int *odata, const int *idata) { - timer().startGpuTimer(); - // TODO - timer().endGpuTimer(); - return -1; - } - } + + int compact(int n, int *odata, const int *idata) { + + // Defining the configuration of the kernel + int dimension = ilog2ceil(n); + int paddedArrayLength = 1 << (dimension); + int paddedArraySize = paddedArrayLength * sizeof(int); + dim3 fullBlocksPerGridPadded((paddedArrayLength + blockSize - 1) / blockSize); + dim3 threadsPerBlock(blockSize); + int size = n * sizeof(int); + int fullBlocksPerGrid((n + blockSize - 1) / blockSize); + + // Creating buffers on the device memory + int *dev_oScanDataArray; + cudaMalloc((void**)&dev_oScanDataArray, paddedArraySize); + checkCUDAError("cudaMalloc for dev_oScanDataArray failed!"); + + int *dev_oData; + cudaMalloc((void**)&dev_oData, size); + checkCUDAError("cudaMalloc for dev_oData failed!"); + + int *dev_iData; + cudaMalloc((void**)&dev_iData, size); + checkCUDAError("cudaMalloc for dev_iData failed!"); + + int *dev_boolIndexArray; + cudaMalloc((void**)&dev_boolIndexArray, size); + checkCUDAError("cudaMalloc for dev_boolIndexArray failed!"); + + // Copying array buffers idata to dev_iData (Host to Device) + cudaMemcpy(dev_iData, idata, size, cudaMemcpyHostToDevice); + checkCUDAError("cudaMemcpy into dev_iData failed!"); + + // Initialize the bool array: For each index fill 1 in dev_boolIndexArray[index] if corrosponding value in dev_iData is non-zero otherwise fill 0 + StreamCompaction::Common::kernMapToBoolean <<>> (n, dev_boolIndexArray, dev_iData); + checkCUDAError("kernMapToBoolean failed!"); + + // Copy buffer dev_boolIndexArray to buffer dev_oScanDataArray (Device to Device) + cudaMemcpy(dev_oScanDataArray, dev_boolIndexArray, size, cudaMemcpyDeviceToDevice); + checkCUDAError("cudaMemcpy into dev_oScanDataArray failed!"); + + // Padd the dev_oScanDataArray with zero's at the end + kernPaddArrayWithZero << > > (n, paddedArrayLength, dev_oScanDataArray); + checkCUDAError("kernPaddArrayWithZero failed!"); + + timer().startGpuTimer(); + + // TODO + scanExcusivePrefixSum(paddedArrayLength, dimension, fullBlocksPerGridPadded, threadsPerBlock, dev_oScanDataArray); + checkCUDAError("scanExcusivePrefixSum failed!"); + StreamCompaction::Common::kernScatter << > > (n, dev_oData, dev_iData, dev_boolIndexArray, dev_oScanDataArray); + checkCUDAError("kernScatter failed!"); + + timer().endGpuTimer(); + + // Copying the data from dev_oData to odata (Device To Host) + cudaMemcpy(odata, dev_oData, size, cudaMemcpyDeviceToHost); + checkCUDAError("cudaMemcpy into odata failed!"); + + // Getting the size of the number of elements that were filled (Device To Host) + int valueAtIndexArrayEnd, valueAtBoolArrayEnd, totalSize; + cudaMemcpy(&valueAtIndexArrayEnd, dev_oScanDataArray + n - 1, sizeof(int), cudaMemcpyDeviceToHost); + cudaMemcpy(&valueAtBoolArrayEnd, dev_boolIndexArray + n - 1, sizeof(int), cudaMemcpyDeviceToHost); + totalSize = valueAtBoolArrayEnd + valueAtIndexArrayEnd; + + // Freeing Cuda Memory + cudaFree(dev_boolIndexArray); + cudaFree(dev_iData); + cudaFree(dev_oData); + cudaFree(dev_oScanDataArray); + checkCUDAError("cudaFree failed!"); + + return totalSize; + + } + + } + } diff --git a/stream_compaction/efficient.h b/stream_compaction/efficient.h index 803cb4f..da397a2 100644 --- a/stream_compaction/efficient.h +++ b/stream_compaction/efficient.h @@ -9,5 +9,7 @@ namespace StreamCompaction { void scan(int n, int *odata, const int *idata); int compact(int n, int *odata, const int *idata); + + void 
scanExcusivePrefixSum(int N, int dimension, dim3 fullBlocksPerGrid, dim3 threadsPerBlock, int *dev_oDataArray); } } diff --git a/stream_compaction/naive.cu b/stream_compaction/naive.cu index 9218f8e..fdbd927 100644 --- a/stream_compaction/naive.cu +++ b/stream_compaction/naive.cu @@ -3,23 +3,87 @@ #include "common.h" #include "naive.h" + + namespace StreamCompaction { namespace Naive { - using StreamCompaction::Common::PerformanceTimer; - PerformanceTimer& timer() - { - static PerformanceTimer timer; - return timer; + using StreamCompaction::Common::PerformanceTimer; + PerformanceTimer& timer() + { + static PerformanceTimer timer; + return timer; } - // TODO: __global__ + + // TODO: __global__ + __global__ void scan(int N, int power, int *dev_oDataArray, int *dev_iDataArray) { + + int index = (blockIdx.x * blockDim.x) + threadIdx.x; + if (index >= N) return; + + dev_oDataArray[index] = (index >= power) ? dev_iDataArray[index - power] + dev_iDataArray[index] : dev_iDataArray[index]; + + } - /** + __global__ void kernExclusiveFromInclusive(int N, int *dev_oDataArray, int *dev_iDataArray) { + + int index = (blockIdx.x * blockDim.x) + threadIdx.x; + if (index >= N) return; + + dev_oDataArray[index] = (index == 0) ? 0 : dev_iDataArray[index - 1]; + + } + + /** * Performs prefix-sum (aka scan) on idata, storing the result into odata. */ - void scan(int n, int *odata, const int *idata) { - timer().startGpuTimer(); - // TODO - timer().endGpuTimer(); - } - } + + void scan(int n, int *odata, const int *idata) { + + // Defining the configuration of the kernel + dim3 fullBlocksPerGrid((n + blockSize - 1) / blockSize); + dim3 threadsPerBlock(blockSize); + + int size = n * sizeof(int); + + // Creating array buffers on the device memory + int *dev_oDataArray, *dev_iDataArray; + cudaMalloc((void**)&dev_oDataArray, size); + checkCUDAError("cudaMalloc dev_oDataArray failed!"); + cudaMalloc((void**)&dev_iDataArray, size); + checkCUDAError("cudaMalloc dev_iDataArray failed!"); + + // Copying array buffers from Host to Device + cudaMemcpy(dev_iDataArray, idata, size, cudaMemcpyHostToDevice); + checkCUDAError("cudaMemcpy dev_iDataArray failed!"); + + timer().startGpuTimer(); + + // TODO + int dimension = ilog2ceil(n); + for (int d = 1; d <= dimension; ++d) { + // Power of 2^(d-1) + int power = 1 << (d - 1); + scan <<>> (n, power, dev_oDataArray, dev_iDataArray); + checkCUDAError("scan kernel failed!"); + + std::swap(dev_oDataArray, dev_iDataArray); + } + + // Convert the output data array from inclusve to exclusive + kernExclusiveFromInclusive << > > (n, dev_oDataArray, dev_iDataArray); + + timer().endGpuTimer(); + + //Copying array buffers from Device to Host + cudaMemcpy(odata, dev_oDataArray, size, cudaMemcpyDeviceToHost); + checkCUDAError("cudaMemcpy odata failed!"); + + // Freeing the device memory + cudaFree(dev_oDataArray); + cudaFree(dev_iDataArray); + + } + + } + } diff --git a/stream_compaction/thrust.cu b/stream_compaction/thrust.cu index 36b732d..54d1c60 100644 --- a/stream_compaction/thrust.cu +++ b/stream_compaction/thrust.cu @@ -8,21 +8,34 @@ namespace StreamCompaction { namespace Thrust { - using StreamCompaction::Common::PerformanceTimer; - PerformanceTimer& timer() - { - static PerformanceTimer timer; - return timer; + using StreamCompaction::Common::PerformanceTimer; + PerformanceTimer& timer() + { + static PerformanceTimer timer; + return timer; } - /** + + /** * Performs prefix-sum (aka scan) on idata, storing the result into odata. 
         */
-        void scan(int n, int *odata, const int *idata) {
-            timer().startGpuTimer();
+
+        void scan(int n, int *odata, const int *idata) {
+
+            int size = n * sizeof(int);
+
+            // Copy the input and output host arrays into thrust device vectors
+            thrust::device_vector<int> iThrustVector(idata, idata + n);
+            thrust::device_vector<int> oThrustVector(odata, odata + n);
+
+            timer().startGpuTimer();
             // TODO use `thrust::exclusive_scan`
             // example: for device_vectors dv_in and dv_out:
-            // thrust::exclusive_scan(dv_in.begin(), dv_in.end(), dv_out.begin());
+            thrust::exclusive_scan(iThrustVector.begin(), iThrustVector.end(), oThrustVector.begin());
             timer().endGpuTimer();
-        }
-    }
+
+            thrust::copy(oThrustVector.begin(), oThrustVector.end(), odata);
+        }
+
+    }
 }