diff --git a/README.md b/README.md
index b71c458..8683c5c 100644
--- a/README.md
+++ b/README.md
@@ -1,13 +1,156 @@
 CUDA Stream Compaction
 ======================
+
+
 **University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 2**
-* (TODO) YOUR NAME HERE
-* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab)
+* Daniel Daley-Mongtomery
+* Tested on: MacBook Pro, OSX 10.12, i7 @ 2.3GHz, 16GB RAM, GT 750M 2048MB (Personal Machine)
+
+## Details
+
+This project is an exploration of a fairly simple concept, stream compaction, and how it relates to more complex parallel algorithms. Stream compaction removes array elements that satisfy some predicate (in our case, "Is this element equal to 0?") and is used to discard irrelevant data before it reaches a more computationally intensive stage of a program. [My path tracer](https://github.com/illDivino/Project3-CUDA-Path-Tracer), for example, uses stream compaction to avoid launching expensive kernels for paths that are only going to contribute 0 anyway.
+
+#### Scan
+
+This project first tests the effectiveness of several scanning methods, where each array element is replaced by the sum of all preceding elements. Scan is used later for stream compaction, but it also provides another opportunity to explore a parallel algorithm. High-level approximations of each algorithm, for *n* elements, are as follows:
+
+###### Basic CPU:
+```
+for i < n - 1 {
+    elements[i+1] = elements[i] + elements[i+1]
+    i++
+}
+```
+This is pretty self-explanatory: it accumulates a running sum in series and scales linearly with the number of elements.
+
+###### Naive GPU:
+```
+for stride = 1; stride < n {
+
+    for all i < n in parallel {
+        if (i >= stride)
+            elements[i] = elements[i-stride] + elements[i]
+    }
+
+    stride *= 2
+}
+```
+
+![](img/naive.png)
+
+Every iteration of the outer loop sums each element with the element *stride* positions behind it. After log2(n) iterations, every element holds its full prefix sum. Unfortunately, while this takes only log(n) kernel launches, each launch performs up to n additions, so the O(n log n) total work makes it less work-efficient than the CPU version. (The real kernel also has to ping-pong between two buffers so that no step reads a value that was already overwritten during that same step.)
+
+###### Work-Efficient GPU:
+```
+// Upsweep
+for stride = 1; stride < n {
+    for i < n / (2*stride) in parallel {
+        left  = (2*i + 1) * stride - 1
+        right = (2*i + 2) * stride - 1
+        elements[right] = elements[right] + elements[left]
+    }
+    stride *= 2
+}
+```
+
+![](img/upsweep.png)
+
+```
+// Downsweep
+elements[n - 1] = 0
+for stride = n/2; stride >= 1 {
+    for i < n / (2*stride) in parallel {
+        left  = (2*i + 1) * stride - 1
+        right = (2*i + 2) * stride - 1
+        temp = elements[left]
+        elements[left] = elements[right]
+        elements[right] += temp
+    }
+    stride /= 2
+}
+```
+
+![](img/downsweep.png)
+
+This method performs roughly n additions on the upsweep, then roughly n additions and n swaps on the downsweep, keeping the total work O(n) like the CPU version while still reaping the parallelism of the GPU. Let's see how they measure up. This project printed the following text upon execution, which I collected and averaged to create the graph below:
+
+```
+****************
+** SCAN TESTS **
+****************
+a[SIZE]:
+    [  44  34   5  30  42  18  12  23  14   1  12  32   1 ...   7   0 ]
+a[NPOT]:
+    [  44  34   5  30  42  18  12  23  14   1  12  32   1 ...  27  45 ]
+==== cpu scan, power-of-two ====
+   elapsed time: 2.31825ms    (std::chrono Measured)
+    [   0  44  78  83 113 155 173 185 208 222 223 235 267 ... 25692019 25692026 ]
+==== cpu scan, non-power-of-two ====
+   elapsed time: 2.28568ms    (std::chrono Measured)
+    [   0  44  78  83 113 155 173 185 208 222 223 235 267 ... 25691912 25691939 ]
+    passed
+==== naive scan, power-of-two ====
+   elapsed time: 16.4329ms    (CUDA Measured)
+    [   0  44  78  83 113 155 173 185 208 222 223 235 267 ... 25692019 25692026 ]
+    passed
+==== naive scan, non-power-of-two ====
+   elapsed time: 16.4599ms    (CUDA Measured)
+    [   0  44  78  83 113 155 173 185 208 222 223 235 267 ... 25691912 25691939 ]
+    passed
+==== work-efficient scan, power-of-two ====
+   elapsed time: 2.88461ms    (CUDA Measured)
+    [   0  44  78  83 113 155 173 185 208 222 223 235 267 ... 25692019 25692026 ]
+    passed
+==== work-efficient scan, non-power-of-two ====
+   elapsed time: 3.04077ms    (CUDA Measured)
+    [   0  44  78  83 113 155 173 185 208 222 223 235 267 ... 25691912 25691939 ]
+    passed
+==== thrust scan, power-of-two ====
+   elapsed time: 2.89341ms    (CUDA Measured)
+    [   0  44  78  83 113 155 173 185 208 222 223 235 267 ... 25692019 25692026 ]
+    passed
+==== thrust scan, non-power-of-two ====
+   elapsed time: 0.731264ms    (CUDA Measured)
+    [   0  44  78  83 113 155 173 185 208 222 223 235 267 ... 25691912 25691939 ]
+    passed
+```
+
+![](img/scantime.png)
+
+Unfortunately, my GPU skills are not quite on par with the developers of [Thrust](https://github.com/thrust/thrust), whose library outperformed both of my implementations and, in the non-power-of-two case, even the plain CPU version. Because the per-element operations are so cheap, my expectation is that global memory access dominates my execution time. Keeping the working set in shared memory would avoid most of those expensive global reads and writes and perhaps help me compete with Thrust and the CPU; a sketch of that idea appears at the end of this write-up.
+
+#### Stream Compaction
+
+Even if I'm not winning the race, I still wanted to carry out the original goal and use my scan to perform stream compaction. The method is simple enough: map the input to a binary array holding 1 (keep) or 0 (remove), then run an exclusive scan on that array. Each element whose flag is 1 can then, in parallel, use its scanned value as its index in the compacted output. For example, the input `[3 0 1 0 2]` maps to flags `[1 0 1 0 1]`, which scan to `[0 1 1 2 2]`, so the kept elements land at indices 0, 1, and 2 (a small host-side walkthrough of this index math follows the results below). Let's see how that worked out:
+
+```
+*****************************
+** STREAM COMPACTION TESTS **
+*****************************
+    [   0   0   1   2   2   0   2   1   0   1   0   0   1 ...   1   0 ]
+==== cpu compact without scan, power-of-two ====
+   elapsed time: 2.66989ms    (std::chrono Measured)
+    [   1   2   2   2   1   1   1   3   2   1   3   2   2 ...   1   1 ]
+    passed
+==== cpu compact without scan, non-power-of-two ====
+   elapsed time: 2.62928ms    (std::chrono Measured)
+    [   1   2   2   2   1   1   1   3   2   1   3   2   2 ...   3   3 ]
+    passed
+==== cpu compact with scan ====
+   elapsed time: 10.7862ms    (std::chrono Measured)
+    [   1   2   2   2   1   1   1   3   2   1   3   2   2 ...   1   1 ]
+    passed
+==== work-efficient compact, power-of-two ====
+   elapsed time: 3.89808ms    (CUDA Measured)
+    [   1   2   2   2   1   1   1   3   2   1   3   2   2 ...   1   1 ]
+    passed
+==== work-efficient compact, non-power-of-two ====
+   elapsed time: 3.88435ms    (CUDA Measured)
+    [   1   2   2   2   1   1   1   3   2   1   3   2   2 ...   3   3 ]
+    passed
-### (TODO: Your README)
+```
-Include analysis, etc. (Remember, this is public, so don't put
-anything here that you don't want to share with the world.)
+
+![](img/compactionTime.png)
+
+Again, the CPU wins. As above, operations this small are especially penalized by kernel launch overhead and global memory traffic.
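+
+As promised above, here is a small host-side walkthrough of the map, scan, and scatter steps used for compaction. It is only an illustrative sketch (the GPU version in `stream_compaction/efficient.cu` performs each step with a kernel), and the array contents are made up for the example.
+
+```
+#include <cstdio>
+
+int main() {
+    const int n = 8;
+    int idata[n] = { 3, 0, 1, 0, 0, 2, 5, 0 };
+    int bools[n], indices[n], odata[n];
+
+    // map: 1 = keep, 0 = remove
+    for (int i = 0; i < n; i++) bools[i] = (idata[i] != 0) ? 1 : 0;
+
+    // exclusive scan of the flags gives each kept element its output slot
+    indices[0] = 0;
+    for (int i = 1; i < n; i++) indices[i] = indices[i - 1] + bools[i - 1];
+
+    // scatter: only flagged elements write, using their scanned index
+    int count = 0;
+    for (int i = 0; i < n; i++) {
+        if (bools[i]) { odata[indices[i]] = idata[i]; count++; }
+    }
+
+    for (int i = 0; i < count; i++) printf("%d ", odata[i]);   // prints: 3 1 2 5
+    printf("\n");
+    return 0;
+}
+```
+
+Because every element can compute its flag, read its scanned index, and write its output independently, each of these three steps maps directly onto a parallel kernel.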
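+
+The scan section also promised a sketch of the shared-memory idea. Below is a minimal sketch, not part of this submission, that assumes the entire input fits in a single block of `blockSize` threads; a full version would scan each block in shared memory, scan the per-block sums, and then add those sums back in.
+
+```
+#define blockSize 128
+
+// Exclusive scan of up to blockSize elements, staged entirely in shared memory:
+// one global read and one global write per element instead of one per step.
+__global__ void kernSharedMemScanSketch(int n, int *odata, const int *idata) {
+    __shared__ int temp[blockSize];
+    int tid = threadIdx.x;
+    int index = blockIdx.x * blockDim.x + tid;
+
+    // single read from global memory
+    temp[tid] = (index < n) ? idata[index] : 0;
+    __syncthreads();
+
+    // Hillis-Steele inclusive scan in shared memory; the val register avoids
+    // read-after-write hazards between strides
+    for (int stride = 1; stride < blockDim.x; stride *= 2) {
+        int val = (tid >= stride) ? temp[tid - stride] : 0;
+        __syncthreads();                // all old values are read before any write
+        temp[tid] += val;
+        __syncthreads();                // all writes land before the next stride
+    }
+
+    // shift right by one to make it exclusive, single write to global memory
+    if (index < n) odata[index] = (tid > 0) ? temp[tid - 1] : 0;
+}
+```
+
+Whether that would be enough to close the gap with Thrust is an open question, but it removes the per-step global reads and writes that the naive and work-efficient kernels above pay for.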
Hopefully, as I get into more complex work on the GPU, I'll be able to find more opportunities to get work done faster than the CPU like I did with the [boids](https://github.com/illDivino/Project1-CUDA-Flocking). diff --git a/img/compactionTime.png b/img/compactionTime.png new file mode 100644 index 0000000..4e9942f Binary files /dev/null and b/img/compactionTime.png differ diff --git a/img/downsweep.png b/img/downsweep.png new file mode 100644 index 0000000..186120b Binary files /dev/null and b/img/downsweep.png differ diff --git a/img/naive.png b/img/naive.png new file mode 100644 index 0000000..e6797f9 Binary files /dev/null and b/img/naive.png differ diff --git a/img/scantime.png b/img/scantime.png new file mode 100644 index 0000000..bd6041b Binary files /dev/null and b/img/scantime.png differ diff --git a/img/upsweep.png b/img/upsweep.png new file mode 100644 index 0000000..8ef0d56 Binary files /dev/null and b/img/upsweep.png differ diff --git a/src/main.cpp b/src/main.cpp index 7305641..c5dc4fd 100644 --- a/src/main.cpp +++ b/src/main.cpp @@ -13,12 +13,11 @@ #include #include "testing_helpers.hpp" -const int SIZE = 1 << 8; // feel free to change the size of array +const long long SIZE = 1 << 20; // feel free to change the size of array const int NPOT = SIZE - 3; // Non-Power-Of-Two int a[SIZE], b[SIZE], c[SIZE]; int main(int argc, char* argv[]) { - // Scan tests printf("\n"); printf("****************\n"); @@ -27,7 +26,10 @@ int main(int argc, char* argv[]) { genArray(SIZE - 1, a, 50); // Leave a 0 at the end to test that edge case a[SIZE - 1] = 0; + printf("a[SIZE]:\n"); printArray(SIZE, a, true); + printf("a[NPOT]:\n"); + printArray(NPOT, a, true); // initialize b using StreamCompaction::CPU::scan you implement // We use b for further comparison. Make sure your StreamCompaction::CPU::scan is correct. 
@@ -49,42 +51,42 @@ int main(int argc, char* argv[]) { printDesc("naive scan, power-of-two"); StreamCompaction::Naive::scan(SIZE, c, a); printElapsedTime(StreamCompaction::Naive::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); - //printArray(SIZE, c, true); + printArray(SIZE, c, true); printCmpResult(SIZE, b, c); zeroArray(SIZE, c); printDesc("naive scan, non-power-of-two"); StreamCompaction::Naive::scan(NPOT, c, a); printElapsedTime(StreamCompaction::Naive::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); - //printArray(SIZE, c, true); + printArray(NPOT, c, true); printCmpResult(NPOT, b, c); zeroArray(SIZE, c); printDesc("work-efficient scan, power-of-two"); StreamCompaction::Efficient::scan(SIZE, c, a); printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); - //printArray(SIZE, c, true); + printArray(SIZE, c, true); printCmpResult(SIZE, b, c); zeroArray(SIZE, c); printDesc("work-efficient scan, non-power-of-two"); StreamCompaction::Efficient::scan(NPOT, c, a); printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); - //printArray(NPOT, c, true); + printArray(NPOT, c, true); printCmpResult(NPOT, b, c); zeroArray(SIZE, c); printDesc("thrust scan, power-of-two"); StreamCompaction::Thrust::scan(SIZE, c, a); printElapsedTime(StreamCompaction::Thrust::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); - //printArray(SIZE, c, true); + printArray(SIZE, c, true); printCmpResult(SIZE, b, c); zeroArray(SIZE, c); printDesc("thrust scan, non-power-of-two"); StreamCompaction::Thrust::scan(NPOT, c, a); printElapsedTime(StreamCompaction::Thrust::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); - //printArray(NPOT, c, true); + printArray(NPOT, c, true); printCmpResult(NPOT, b, c); printf("\n"); @@ -129,14 +131,14 @@ int main(int argc, char* argv[]) { printDesc("work-efficient compact, power-of-two"); count = StreamCompaction::Efficient::compact(SIZE, c, a); printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); - //printArray(count, c, true); + printArray(count, c, true); printCmpLenResult(count, expectedCount, b, c); zeroArray(SIZE, c); printDesc("work-efficient compact, non-power-of-two"); count = StreamCompaction::Efficient::compact(NPOT, c, a); printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); - //printArray(count, c, true); + printArray(count, c, true); printCmpLenResult(count, expectedNPOT, b, c); system("pause"); // stop Win32 console from closing on exit diff --git a/stream_compaction/common.cu b/stream_compaction/common.cu index 8fc0211..25e178e 100644 --- a/stream_compaction/common.cu +++ b/stream_compaction/common.cu @@ -1,20 +1,19 @@ #include "common.h" void checkCUDAErrorFn(const char *msg, const char *file, int line) { - cudaError_t err = cudaGetLastError(); - if (cudaSuccess == err) { - return; - } + cudaError_t err = cudaGetLastError(); + if (cudaSuccess == err) { + return; + } - fprintf(stderr, "CUDA error"); - if (file) { - fprintf(stderr, " (%s:%d)", file, line); - } - fprintf(stderr, ": %s: %s\n", msg, cudaGetErrorString(err)); - exit(EXIT_FAILURE); + fprintf(stderr, "CUDA error"); + if (file) { + fprintf(stderr, " (%s:%d)", file, line); + } + fprintf(stderr, ": %s: %s\n", msg, cudaGetErrorString(err)); + exit(EXIT_FAILURE); } - namespace StreamCompaction { namespace 
Common { @@ -23,7 +22,9 @@ namespace StreamCompaction { * which map to 0 will be removed, and elements which map to 1 will be kept. */ __global__ void kernMapToBoolean(int n, int *bools, const int *idata) { - // TODO + int index = (blockIdx.x * blockDim.x) + threadIdx.x; + if (index < n) + bools[index] = idata[index] != 0 ? 1 : 0; } /** @@ -32,8 +33,16 @@ namespace StreamCompaction { */ __global__ void kernScatter(int n, int *odata, const int *idata, const int *bools, const int *indices) { - // TODO + int index = (blockIdx.x * blockDim.x) + threadIdx.x; + if (index < n && bools[index] != 0) + odata[indices[index]] = idata[index]; + } + __global__ void kernInclusiveToExclusive(int n, int *odata, const int *idata) { + int index = (blockIdx.x * blockDim.x) + threadIdx.x; + if (index < n) odata[index] = index ? idata[index - 1] : 0; + } + } } diff --git a/stream_compaction/common.h b/stream_compaction/common.h index 55f1b38..e17d735 100644 --- a/stream_compaction/common.h +++ b/stream_compaction/common.h @@ -1,15 +1,15 @@ #pragma once -#include -#include - -#include -#include -#include -#include -#include -#include +#include +#include +#include +#include +#include +#include +#include +#include +#define blockSize 128 #define FILENAME (strrchr(__FILE__, '/') ? strrchr(__FILE__, '/') + 1 : __FILE__) #define checkCUDAError(msg) checkCUDAErrorFn(msg, FILENAME, __LINE__) @@ -26,6 +26,19 @@ inline int ilog2(int x) { return lg; } +inline int int_pow(int base, int exp) +{ + int result = 1; + while (exp) + { + if (exp & 1) + result *= base; + exp /= 2; + base *= base; + } + return result; +} + inline int ilog2ceil(int x) { return ilog2(x - 1) + 1; } @@ -34,99 +47,101 @@ namespace StreamCompaction { namespace Common { __global__ void kernMapToBoolean(int n, int *bools, const int *idata); + __global__ void kernInclusiveToExclusive(int n, int *odata, const int *idata); + __global__ void kernScatter(int n, int *odata, const int *idata, const int *bools, const int *indices); - /** - * This class is used for timing the performance - * Uncopyable and unmovable - * - * Adapted from WindyDarian(https://github.com/WindyDarian) - */ - class PerformanceTimer - { - public: - PerformanceTimer() - { - cudaEventCreate(&event_start); - cudaEventCreate(&event_end); - } - - ~PerformanceTimer() - { - cudaEventDestroy(event_start); - cudaEventDestroy(event_end); - } - - void startCpuTimer() - { - if (cpu_timer_started) { throw std::runtime_error("CPU timer already started"); } - cpu_timer_started = true; - - time_start_cpu = std::chrono::high_resolution_clock::now(); - } - - void endCpuTimer() - { - time_end_cpu = std::chrono::high_resolution_clock::now(); - - if (!cpu_timer_started) { throw std::runtime_error("CPU timer not started"); } - - std::chrono::duration duro = time_end_cpu - time_start_cpu; - prev_elapsed_time_cpu_milliseconds = - static_cast(duro.count()); - - cpu_timer_started = false; - } - - void startGpuTimer() - { - if (gpu_timer_started) { throw std::runtime_error("GPU timer already started"); } - gpu_timer_started = true; - - cudaEventRecord(event_start); - } - - void endGpuTimer() - { - cudaEventRecord(event_end); - cudaEventSynchronize(event_end); - - if (!gpu_timer_started) { throw std::runtime_error("GPU timer not started"); } - - cudaEventElapsedTime(&prev_elapsed_time_gpu_milliseconds, event_start, event_end); - gpu_timer_started = false; - } - - float getCpuElapsedTimeForPreviousOperation() //noexcept //(damn I need VS 2015 - { - return prev_elapsed_time_cpu_milliseconds; - } - - float 
getGpuElapsedTimeForPreviousOperation() //noexcept - { - return prev_elapsed_time_gpu_milliseconds; - } - - // remove copy and move functions - PerformanceTimer(const PerformanceTimer&) = delete; - PerformanceTimer(PerformanceTimer&&) = delete; - PerformanceTimer& operator=(const PerformanceTimer&) = delete; - PerformanceTimer& operator=(PerformanceTimer&&) = delete; - - private: - cudaEvent_t event_start = nullptr; - cudaEvent_t event_end = nullptr; - - using time_point_t = std::chrono::high_resolution_clock::time_point; - time_point_t time_start_cpu; - time_point_t time_end_cpu; - - bool cpu_timer_started = false; - bool gpu_timer_started = false; - - float prev_elapsed_time_cpu_milliseconds = 0.f; - float prev_elapsed_time_gpu_milliseconds = 0.f; + /** + * This class is used for timing the performance + * Uncopyable and unmovable + * + * Adapted from WindyDarian(https://github.com/WindyDarian) + */ + class PerformanceTimer + { + public: + PerformanceTimer() + { + cudaEventCreate(&event_start); + cudaEventCreate(&event_end); + } + + ~PerformanceTimer() + { + cudaEventDestroy(event_start); + cudaEventDestroy(event_end); + } + + void startCpuTimer() + { + if (cpu_timer_started) { throw std::runtime_error("CPU timer already started"); } + cpu_timer_started = true; + + time_start_cpu = std::chrono::high_resolution_clock::now(); + } + + void endCpuTimer() + { + time_end_cpu = std::chrono::high_resolution_clock::now(); + + if (!cpu_timer_started) { throw std::runtime_error("CPU timer not started"); } + + std::chrono::duration duro = time_end_cpu - time_start_cpu; + prev_elapsed_time_cpu_milliseconds = + static_cast(duro.count()); + + cpu_timer_started = false; + } + + void startGpuTimer() + { + if (gpu_timer_started) { throw std::runtime_error("GPU timer already started"); } + gpu_timer_started = true; + + cudaEventRecord(event_start); + } + + void endGpuTimer() + { + cudaEventRecord(event_end); + cudaEventSynchronize(event_end); + + if (!gpu_timer_started) { throw std::runtime_error("GPU timer not started"); } + + cudaEventElapsedTime(&prev_elapsed_time_gpu_milliseconds, event_start, event_end); + gpu_timer_started = false; + } + + float getCpuElapsedTimeForPreviousOperation() //noexcept //(damn I need VS 2015 + { + return prev_elapsed_time_cpu_milliseconds; + } + + float getGpuElapsedTimeForPreviousOperation() //noexcept + { + return prev_elapsed_time_gpu_milliseconds; + } + + // remove copy and move functions + PerformanceTimer(const PerformanceTimer&) = delete; + PerformanceTimer(PerformanceTimer&&) = delete; + PerformanceTimer& operator=(const PerformanceTimer&) = delete; + PerformanceTimer& operator=(PerformanceTimer&&) = delete; + + private: + cudaEvent_t event_start = nullptr; + cudaEvent_t event_end = nullptr; + + using time_point_t = std::chrono::high_resolution_clock::time_point; + time_point_t time_start_cpu; + time_point_t time_end_cpu; + + bool cpu_timer_started = false; + bool gpu_timer_started = false; + + float prev_elapsed_time_cpu_milliseconds = 0.f; + float prev_elapsed_time_gpu_milliseconds = 0.f; }; } } diff --git a/stream_compaction/cpu.cu b/stream_compaction/cpu.cu index 05ce667..9e24466 100644 --- a/stream_compaction/cpu.cu +++ b/stream_compaction/cpu.cu @@ -1,15 +1,15 @@ #include #include "cpu.h" -#include "common.h" +#include "common.h" namespace StreamCompaction { namespace CPU { - using StreamCompaction::Common::PerformanceTimer; - PerformanceTimer& timer() - { - static PerformanceTimer timer; - return timer; + using StreamCompaction::Common::PerformanceTimer; + 
PerformanceTimer& timer() + { + static PerformanceTimer timer; + return timer; } /** @@ -17,10 +17,21 @@ namespace StreamCompaction { * For performance analysis, this is supposed to be a simple for loop. * (Optional) For better understanding before starting moving to GPU, you can simulate your GPU scan in this function first. */ - void scan(int n, int *odata, const int *idata) { - timer().startCpuTimer(); - // TODO - timer().endCpuTimer(); + void scan(int n, int *odata, const int *idata, bool time) { + + if (time) timer().startCpuTimer(); + odata[0] = 0; + + //inclusive + for (int i = 1; i < n; i++) { + odata[i] = idata[i-1] + odata[i - 1]; + } + + if (time) timer().endCpuTimer(); + } + + void scan(int n, int *odata, const int *idata) { + scan(n, odata, idata, true); } /** @@ -29,10 +40,15 @@ namespace StreamCompaction { * @returns the number of elements remaining after compaction. */ int compactWithoutScan(int n, int *odata, const int *idata) { - timer().startCpuTimer(); - // TODO - timer().endCpuTimer(); - return -1; + int output = 0; + + timer().startCpuTimer(); + for (int i = 0; i < n; i++) { + if (idata[i] != 0) odata[output++] = idata[i]; + } + timer().endCpuTimer(); + + return output; } /** @@ -41,10 +57,32 @@ namespace StreamCompaction { * @returns the number of elements remaining after compaction. */ int compactWithScan(int n, int *odata, const int *idata) { + int *binaryValues = new int[n]; + int *scanValues = new int[n]; + int output = 0; + timer().startCpuTimer(); - // TODO + + //create binary array + for (int i = 0; i < n; i++) { + binaryValues[i] = (idata[i] == 0) ? 0 : 1; + } + + //exclusive scan + scan(n, scanValues, binaryValues, false); + + //populate odata + for (int i = 0; i < n; i++) { + if (binaryValues[i] == 1) + odata[scanValues[i]] = idata[i]; + } + timer().endCpuTimer(); - return -1; + + output = binaryValues[n-1] == 0 ? scanValues[n-1] : scanValues[n-1] + 1; + delete[] scanValues; + delete[] binaryValues; + return output; } } } diff --git a/stream_compaction/efficient.cu b/stream_compaction/efficient.cu index 36c5ef2..8c62a6c 100644 --- a/stream_compaction/efficient.cu +++ b/stream_compaction/efficient.cu @@ -5,22 +5,118 @@ namespace StreamCompaction { namespace Efficient { - using StreamCompaction::Common::PerformanceTimer; - PerformanceTimer& timer() - { - static PerformanceTimer timer; - return timer; + using StreamCompaction::Common::PerformanceTimer; + PerformanceTimer& timer() + { + static PerformanceTimer timer; + return timer; } + + void printArray(int n, int *a, bool abridged = false) { + printf(" [ "); + for (int i = 0; i < n; i++) { + if (abridged && i + 2 == 5 && n > 10) { + i = n - 5; + printf("... 
"); + } + printf("%3d ", a[i]); + } + printf("]\n"); + } + + __global__ void kernelUpsweep(int *data, int n, int d) { + int stride = 1 << d; + //step to every (2*stride) indicies + int index = ((blockIdx.x * blockDim.x) + threadIdx.x + 1) * 2 * stride; + //sum accross the stride + if (index < n+1) { + data[index-1] += data[index-stride-1]; + } + } + + __global__ void kernelDownsweep(int *data, int n, int d) { + int stride = 1 << d; + //step accross every (2*stride) indicies + int index = ((blockIdx.x * blockDim.x) + threadIdx.x + 1) * 2 * stride; + //switch right to left, add left to right + if (index < n+1) { + int temp = data[index - stride - 1]; + data[index - stride - 1] = data[index - 1]; + data[index - 1] += temp; + } + } + + __global__ void kernelBinary(int* odata, int* idata, int n) { + int index = (blockIdx.x * blockDim.x) + threadIdx.x; + if (index < n) { + odata[index] = idata[index] ? 1 : 0; + } + } + + __global__ void kernelScatter(int *binary, int *scan, int *idata, int *odata, int n) { + int index = (blockIdx.x * blockDim.x) + threadIdx.x; + if (index < n && binary[index] != 0) { + odata[scan[index]] = idata[index]; + } + } /** * Performs prefix-sum (aka scan) on idata, storing the result into odata. */ - void scan(int n, int *odata, const int *idata) { - timer().startGpuTimer(); - // TODO - timer().endGpuTimer(); + + void scan(int n, int *odata, const int *idata, bool isDevice) { + //device array will be as long as next power of 2 + int maxStride = ilog2ceil(n); + int POT_size = 1 << maxStride; + + //allocate array to length POT_size, and copy the first n values of odata + int *dev_odata; + if (isDevice) { + cudaMalloc((void**)&dev_odata, POT_size * sizeof(int)); + cudaMemcpy(dev_odata, idata, n * sizeof(int), cudaMemcpyDeviceToDevice); + } + else { + cudaMalloc((void**)&dev_odata, POT_size * sizeof(int)); + cudaMemcpy(dev_odata, idata, n * sizeof(int), cudaMemcpyHostToDevice); + timer().startGpuTimer(); + } + cudaMemset(&dev_odata[n], 0, (POT_size-n) * sizeof(int)); + + //determine resource needs + dim3 fullBlocksPerGrid(1); + dim3 threadsPerBlock(blockSize); + + //upsweep + for (int i = 0; i < maxStride; i++) { + fullBlocksPerGrid = dim3((POT_size / (1 << i) + blockSize - 1) / blockSize); + kernelUpsweep <<>> (dev_odata, POT_size, i); + } + + //set odata[POT_size-1] to 0 + int just_a_zero = 0; + cudaMemcpy(&dev_odata[POT_size-1], &just_a_zero, sizeof(int), cudaMemcpyHostToDevice); + + //downsweep + for (int i = maxStride; i >= 0; i--) { + fullBlocksPerGrid = dim3((POT_size / (1 << i) + blockSize - 1) / blockSize); + kernelDownsweep <<>> (dev_odata, POT_size, i); + } + + //send data back to cpu memory + if (!isDevice) { + timer().endGpuTimer(); + cudaMemcpy(odata, dev_odata, n * sizeof(int), cudaMemcpyDeviceToHost); + cudaFree(dev_odata); + } + else { + cudaMemcpy(odata, dev_odata, n * sizeof(int), cudaMemcpyDeviceToDevice); + } } + void scan(int n, int *odata, const int *idata) { + scan(n, odata, idata, false); + } + /** * Performs stream compaction on idata, storing the result into odata. * All zeroes are discarded. @@ -31,10 +127,42 @@ namespace StreamCompaction { * @returns The number of elements remaining after compaction. 
*/ int compact(int n, int *odata, const int *idata) { - timer().startGpuTimer(); - // TODO + + int *dev_scan, *dev_binary, *dev_odata, *dev_idata; + int lastBinary, lastScan; + cudaMalloc((void**)&dev_odata, n * sizeof(int)); + cudaMalloc((void**)&dev_idata, n * sizeof(int)); + cudaMalloc((void**)&dev_scan, n * sizeof(int)); + cudaMalloc((void**)&dev_binary, n * sizeof(int)); + cudaMemcpy(dev_idata, idata, n * sizeof(int), cudaMemcpyHostToDevice); + + + dim3 blocksToMakeN((n + blockSize - 1) / blockSize); + + timer().startGpuTimer(); + //create binary array + kernelBinary <<>> (dev_binary, dev_idata, n); + + //scan said array + scan(n, dev_scan, dev_binary, true); + + //scatter + kernelScatter << > > (dev_binary, dev_scan, dev_idata, dev_odata, n); + + //our odata buffer should now reflect the compacted stream + timer().endGpuTimer(); - return -1; + + cudaMemcpy(&lastBinary, dev_binary + n - 1, sizeof(int), cudaMemcpyDeviceToHost); + cudaMemcpy(&lastScan, dev_scan + n - 1, sizeof(int), cudaMemcpyDeviceToHost); + cudaMemcpy(odata, dev_odata, (lastScan + 1) * sizeof(int), cudaMemcpyDeviceToHost); + + cudaFree(dev_binary); + cudaFree(dev_scan); + cudaFree(dev_odata); + cudaFree(dev_idata); + + return lastBinary ? lastScan + 1 : lastScan; } } } diff --git a/stream_compaction/naive.cu b/stream_compaction/naive.cu index 9218f8e..72f826b 100644 --- a/stream_compaction/naive.cu +++ b/stream_compaction/naive.cu @@ -5,21 +5,58 @@ namespace StreamCompaction { namespace Naive { - using StreamCompaction::Common::PerformanceTimer; - PerformanceTimer& timer() - { - static PerformanceTimer timer; - return timer; - } - // TODO: __global__ - - /** - * Performs prefix-sum (aka scan) on idata, storing the result into odata. - */ - void scan(int n, int *odata, const int *idata) { - timer().startGpuTimer(); - // TODO - timer().endGpuTimer(); + #define blockSize 32 + using StreamCompaction::Common::PerformanceTimer; + PerformanceTimer& timer() + { + static PerformanceTimer timer; + return timer; } + + __global__ void kernelScan(int *odata, int *idata, int n, int d) + { + int index = (blockIdx.x * blockDim.x) + threadIdx.x; + int stride = 1 << (d - 1); + if (index < n) { + odata[index] = (index >= stride) ? 
idata[index - stride] + idata[index] : idata[index]; + } + } + + void scan(int n, int *odata, const int *idata) { + + //GPU prep + int *dev_odata, *dev_idata; + cudaMalloc((void**)&dev_odata, n * sizeof(int)); + cudaMalloc((void**)&dev_idata, n * sizeof(int)); + cudaMemcpy(dev_idata, idata, n * sizeof(int), cudaMemcpyHostToDevice); + + dim3 threadsPerBlock(blockSize); + + + timer().startGpuTimer(); + //actual kernel invocation + int maxStride = ilog2ceil(n); + for (int i = 1; i <= maxStride; i++) { + dim3 blocks((n + blockSize - 1) / blockSize); + kernelScan <<>> (dev_odata, dev_idata, n, i); + int *temp = dev_idata; + dev_idata = dev_odata; + dev_odata = temp; + } + + dim3 blocks((n + blockSize - 1) / blockSize); + + //this will end with idata holding the info, convert to exclusive + Common::kernInclusiveToExclusive <<>> (n, dev_odata, dev_idata); + + timer().endGpuTimer(); + + //send data back to cpu memory + cudaMemcpy(odata, dev_odata, n * sizeof(int), cudaMemcpyDeviceToHost); + + //delete GPU arrays + cudaFree(dev_odata); + cudaFree(dev_idata); + } } } diff --git a/stream_compaction/thrust.cu b/stream_compaction/thrust.cu index 36b732d..29d5b5e 100644 --- a/stream_compaction/thrust.cu +++ b/stream_compaction/thrust.cu @@ -8,21 +8,24 @@ namespace StreamCompaction { namespace Thrust { - using StreamCompaction::Common::PerformanceTimer; - PerformanceTimer& timer() - { - static PerformanceTimer timer; - return timer; + using StreamCompaction::Common::PerformanceTimer; + PerformanceTimer& timer() + { + static PerformanceTimer timer; + return timer; } /** * Performs prefix-sum (aka scan) on idata, storing the result into odata. */ void scan(int n, int *odata, const int *idata) { + thrust::device_vector thrust_dev_odata = thrust::device_vector(odata, odata + n); + thrust::device_vector thrust_dev_idata = thrust::device_vector(idata, idata + n); + timer().startGpuTimer(); - // TODO use `thrust::exclusive_scan` - // example: for device_vectors dv_in and dv_out: - // thrust::exclusive_scan(dv_in.begin(), dv_in.end(), dv_out.begin()); - timer().endGpuTimer(); + thrust::exclusive_scan(thrust_dev_idata.begin(),thrust_dev_idata.end(), thrust_dev_odata.begin()); + timer().endGpuTimer(); + + thrust::copy(thrust_dev_odata.begin(), thrust_dev_odata.end(), odata); } } }