153 changes: 148 additions & 5 deletions README.md
@@ -1,13 +1,156 @@
CUDA Stream Compaction
======================

<img src="https://developer.nvidia.com/sites/all/modules/custom/gpugems/books/GPUGems3/elementLinks/39fig09.jpg" width="500">

**University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 2**

* Daniel Daley-Montgomery
* Tested on: MacBook Pro, OSX 10.12, i7 @ 2.3GHz, 16GB RAM, GT 750M 2048MB (Personal Machine)

## Details

This project is an exploration of a somewhat simple concept, stream compaction, and how it relates to more complex parallel algorithms. Stream compaction removes array elements based on some predicate (in our case, "Is this element equal to 0?") and is used to discard irrelevant data before it reaches a more computationally intensive stage of a program. [My path tracer](https://github.com/illDivino/Project3-CUDA-Path-Tracer), for example, uses stream compaction to avoid launching expensive kernels on paths that would contribute nothing anyway.
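
Concretely, compacting with the predicate "keep nonzero elements" turns:

```
input:  [ 44  0  34  0  5  0  0  30 ]
output: [ 44  34  5  30 ]
```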

#### Scan

This project first tests the effectiveness of several scanning methods, in which array element *x* is replaced with the sum of all elements preceding it. Scan will be used later for stream compaction, but it also provides another opportunity to explore a parallel system. A small worked example appears below, followed by high-level sketches of each algorithm for *n* elements.
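
To pin down terms: an *inclusive* scan keeps each element's own value in its running total, while the *exclusive* variant (what the test output below prints, starting with 0) shifts the totals right by one:

```
input:           [ 3  1   7   0   4 ]
inclusive scan:  [ 3  4  11  11  15 ]
exclusive scan:  [ 0  3   4  11  11 ]
```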

###### Basic CPU:
```
// serial in-place inclusive scan: O(n) additions
for (i = 0; i < n - 1; i++) {
    elements[i + 1] += elements[i]
}
```
This is pretty self-explanatory: it adds each element into its successor in series, and scales linearly with the number of elements.

###### Naive GPU:
```
for (stride = 1; stride < n; stride *= 2) {
    for all i < n in parallel {
        if (i >= stride)
            elements[i] = elements[i - stride] + elements[i]
    }
}
```

![](img/naive.png)

Every iteration of the outer loop sums each element with the element *stride* positions before it. After log2(n) iterations, every element holds its complete prefix sum. Unfortunately, while this takes only log2(n) kernel launches, each launch performs n operations. That O(n log n) total work makes it less work-efficient than the CPU version.
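
As a minimal CUDA sketch (not the project's exact code), one level of the naive scan could look like the kernel below. Note that it reads from one buffer and writes to another: without ping-pong buffers, a thread might read an element another thread has already overwritten in the same launch.

```
// One level of the naive scan; the host swaps idata/odata between launches.
__global__ void kernNaiveScanStep(int n, int stride, int *odata, const int *idata) {
    int i = (blockIdx.x * blockDim.x) + threadIdx.x;
    if (i >= n) return;
    if (i >= stride)
        odata[i] = idata[i - stride] + idata[i]; // add the element `stride` back
    else
        odata[i] = idata[i];                     // already holds its final sum
}
```

Since this produces an inclusive scan, the kernInclusiveToExclusive kernel added to common.cu in this diff shifts the result into the exclusive form the tests expect.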

###### Work-Efficient GPU:
```
// upsweep: build partial sums up the tree
for (stride = 2; stride < n; stride *= 2) {
    for all i < (n / stride) in parallel {
        index = i * stride - 1
        if (index + stride < n)
            elements[index + stride] += elements[index + (stride / 2)]
    }
}
```

![](img/upsweep.png)

```
// downsweep: trade each node's value for the sum of everything to its left
elements[n - 1] = 0
for (stride = n; stride >= 2; stride /= 2) {
    for all i < (n / stride) in parallel {
        left  = (i * stride) + (stride / 2) - 1
        right = (i * stride) + stride - 1
        temp = elements[left]
        elements[left] = elements[right]
        elements[right] += temp
    }
}
```

![](img/downsweep.png)
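
As a concrete (and hypothetical, as far as names and launch details go) CUDA rendering, one level of each phase might look like this, with the host doubling the stride between upsweep launches and halving it between downsweep launches:

```
// One upsweep level: each thread owns one stride-wide block and adds the
// sum of its left half into the block's last element.
__global__ void kernUpsweepStep(int n, int stride, int *elements) {
    int i = (blockIdx.x * blockDim.x) + threadIdx.x;
    int right = (i + 1) * stride - 1;            // last element of this block
    if (right < n)
        elements[right] += elements[right - stride / 2];
}

// One downsweep level: swap the left-half sum with the running total,
// pushing partial sums down the tree.
__global__ void kernDownsweepStep(int n, int stride, int *elements) {
    int i = (blockIdx.x * blockDim.x) + threadIdx.x;
    int right = (i + 1) * stride - 1;
    int left = right - stride / 2;
    if (right < n) {
        int temp = elements[left];
        elements[left] = elements[right];
        elements[right] += temp;
    }
}
```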

This method performs O(n) adds on the upsweep, then O(n) adds and swaps on the downsweep, keeping the total work at the same order as the CPU version while still reaping the parallelism of the GPU. Let's see how they measure up. The project printed the following text on execution, which I collected and averaged to create the graph below:

```
****************
** SCAN TESTS **
****************
a[SIZE]:
[ 44 34 5 30 42 18 12 23 14 1 12 32 1 ... 7 0 ]
a[NPOT]:
[ 44 34 5 30 42 18 12 23 14 1 12 32 1 ... 27 45 ]
==== cpu scan, power-of-two ====
elapsed time: 2.31825ms (std::chrono Measured)
[ 0 44 78 83 113 155 173 185 208 222 223 235 267 ... 25692019 25692026 ]
==== cpu scan, non-power-of-two ====
elapsed time: 2.28568ms (std::chrono Measured)
[ 0 44 78 83 113 155 173 185 208 222 223 235 267 ... 25691912 25691939 ]
passed
==== naive scan, power-of-two ====
elapsed time: 16.4329ms (CUDA Measured)
[ 0 44 78 83 113 155 173 185 208 222 223 235 267 ... 25692019 25692026 ]
passed
==== naive scan, non-power-of-two ====
elapsed time: 16.4599ms (CUDA Measured)
[ 0 44 78 83 113 155 173 185 208 222 223 235 267 ... 25691912 25691939 ]
passed
==== work-efficient scan, power-of-two ====
elapsed time: 2.88461ms (CUDA Measured)
[ 0 44 78 83 113 155 173 185 208 222 223 235 267 ... 25692019 25692026 ]
passed
==== work-efficient scan, non-power-of-two ====
elapsed time: 3.04077ms (CUDA Measured)
[ 0 44 78 83 113 155 173 185 208 222 223 235 267 ... 25691912 25691939 ]
passed
==== thrust scan, power-of-two ====
elapsed time: 2.89341ms (CUDA Measured)
[ 0 44 78 83 113 155 173 185 208 222 223 235 267 ... 25692019 25692026 ]
passed
==== thrust scan, non-power-of-two ====
elapsed time: 0.731264ms (CUDA Measured)
[ 0 44 78 83 113 155 173 185 208 222 223 235 267 ... 25691912 25691939 ]
passed
```

![](img/scantime.png)

Unfortunately, my GPU skills are not quite on par with those of the [Thrust](https://github.com/thrust/thrust) developers, whose library outperformed both of my implementations as well as the plain CPU version. Given how simple the operations inside each kernel are, my expectation is that global memory access bloated my execution time. Using shared memory would avoid much of that expensive global traffic and could perhaps make me competitive with Thrust and the CPU.
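
To make that idea concrete, here is a minimal sketch of what staging a tile in shared memory might look like; it scans within a single block only, so a full solution would still need a second-level scan of the per-block sums:

```
// Per-block inclusive scan staged in shared memory (launch with
// blockDim.x * sizeof(int) bytes of dynamic shared memory).
__global__ void kernSharedScanBlock(int n, int *odata, const int *idata) {
    extern __shared__ int tile[];
    int i = (blockIdx.x * blockDim.x) + threadIdx.x;
    int t = threadIdx.x;
    tile[t] = (i < n) ? idata[i] : 0;            // one global read per thread
    __syncthreads();

    for (int stride = 1; stride < blockDim.x; stride *= 2) {
        int val = (t >= stride) ? tile[t - stride] : 0;
        __syncthreads();                         // everyone reads before anyone writes
        tile[t] += val;
        __syncthreads();
    }
    if (i < n) odata[i] = tile[t];               // one global write per thread
}
```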

#### Stream Compaction

Even if I'm not winning the race, I still wanted to complete the original goal and use my scan to perform stream compaction. The method is simple enough: fill a binary array with 1 (keep) or 0 (remove), run an exclusive scan over it, and then, in parallel, write each kept element into the output at the index the scan produced for it. A sketch of that pipeline follows.
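
The sketch uses the kernMapToBoolean and kernScatter kernels shown further down in this diff; the buffer names and launch configuration are my shorthand, not necessarily the project's exact code:

```
// Host-side orchestration; dev_bools and dev_indices are assumed to be
// pre-allocated device buffers of n ints.
int compact(int n, int *dev_odata, const int *dev_idata) {
    int blockSize = 128;
    int blocks = (n + blockSize - 1) / blockSize;

    // 1. Map each element to 1 (keep) or 0 (remove).
    kernMapToBoolean<<<blocks, blockSize>>>(n, dev_bools, dev_idata);

    // 2. An exclusive scan of the booleans gives each kept element its output index.
    scan(n, dev_indices, dev_bools);   // the work-efficient scan from above

    // 3. Scatter the kept elements to their scanned indices.
    kernScatter<<<blocks, blockSize>>>(n, dev_odata, dev_idata, dev_bools, dev_indices);

    // Compacted length = last index + last boolean.
    int lastIndex, lastBool;
    cudaMemcpy(&lastIndex, dev_indices + n - 1, sizeof(int), cudaMemcpyDeviceToHost);
    cudaMemcpy(&lastBool, dev_bools + n - 1, sizeof(int), cudaMemcpyDeviceToHost);
    return lastIndex + lastBool;
}
```

Let's see how that worked out: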

```
*****************************
** STREAM COMPACTION TESTS **
*****************************
[ 0 0 1 2 2 0 2 1 0 1 0 0 1 ... 1 0 ]
==== cpu compact without scan, power-of-two ====
elapsed time: 2.66989ms (std::chrono Measured)
[ 1 2 2 2 1 1 1 3 2 1 3 2 2 ... 1 1 ]
passed
==== cpu compact without scan, non-power-of-two ====
elapsed time: 2.62928ms (std::chrono Measured)
[ 1 2 2 2 1 1 1 3 2 1 3 2 2 ... 3 3 ]
passed
==== cpu compact with scan ====
elapsed time: 10.7862ms (std::chrono Measured)
[ 1 2 2 2 1 1 1 3 2 1 3 2 2 ... 1 1 ]
passed
==== work-efficient compact, power-of-two ====
elapsed time: 3.89808ms (CUDA Measured)
[ 1 2 2 2 1 1 1 3 2 1 3 2 2 ... 1 1 ]
passed
==== work-efficient compact, non-power-of-two ====
elapsed time: 3.88435ms (CUDA Measured)
[ 1 2 2 2 1 1 1 3 2 1 3 2 2 ... 3 3 ]
passed

```

![](img/compactionTime.png)

Again, the CPU wins. As above, operations this small are especially penalized by kernel-launch overhead and global memory access. Hopefully, as I move on to more complex GPU work, I'll find more opportunities to beat the CPU, as I did with the [boids project](https://github.com/illDivino/Project1-CUDA-Flocking).
Binary file added img/compactionTime.png
Binary file added img/downsweep.png
Binary file added img/naive.png
Binary file added img/scantime.png
Binary file added img/upsweep.png
22 changes: 12 additions & 10 deletions src/main.cpp
@@ -13,12 +13,11 @@
#include <stream_compaction/thrust.h>
#include "testing_helpers.hpp"

const int SIZE = 1 << 8; // feel free to change the size of array
const long long SIZE = 1 << 20; // feel free to change the size of array
const int NPOT = SIZE - 3; // Non-Power-Of-Two
int a[SIZE], b[SIZE], c[SIZE];

int main(int argc, char* argv[]) {
// Scan tests

printf("\n");
printf("****************\n");
@@ -27,7 +26,10 @@ int main(int argc, char* argv[]) {

genArray(SIZE - 1, a, 50); // Leave a 0 at the end to test that edge case
a[SIZE - 1] = 0;
printf("a[SIZE]:\n");
printArray(SIZE, a, true);
printf("a[NPOT]:\n");
printArray(NPOT, a, true);

// initialize b using StreamCompaction::CPU::scan you implement
// We use b for further comparison. Make sure your StreamCompaction::CPU::scan is correct.
@@ -49,42 +51,42 @@ int main(int argc, char* argv[]) {
printDesc("naive scan, power-of-two");
StreamCompaction::Naive::scan(SIZE, c, a);
printElapsedTime(StreamCompaction::Naive::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
//printArray(SIZE, c, true);
printArray(SIZE, c, true);
printCmpResult(SIZE, b, c);

zeroArray(SIZE, c);
printDesc("naive scan, non-power-of-two");
StreamCompaction::Naive::scan(NPOT, c, a);
printElapsedTime(StreamCompaction::Naive::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
//printArray(SIZE, c, true);
printArray(NPOT, c, true);
printCmpResult(NPOT, b, c);

zeroArray(SIZE, c);
printDesc("work-efficient scan, power-of-two");
StreamCompaction::Efficient::scan(SIZE, c, a);
printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
//printArray(SIZE, c, true);
printArray(SIZE, c, true);
printCmpResult(SIZE, b, c);

zeroArray(SIZE, c);
printDesc("work-efficient scan, non-power-of-two");
StreamCompaction::Efficient::scan(NPOT, c, a);
printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
//printArray(NPOT, c, true);
printArray(NPOT, c, true);
printCmpResult(NPOT, b, c);

zeroArray(SIZE, c);
printDesc("thrust scan, power-of-two");
StreamCompaction::Thrust::scan(SIZE, c, a);
printElapsedTime(StreamCompaction::Thrust::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
//printArray(SIZE, c, true);
printArray(SIZE, c, true);
printCmpResult(SIZE, b, c);

zeroArray(SIZE, c);
printDesc("thrust scan, non-power-of-two");
StreamCompaction::Thrust::scan(NPOT, c, a);
printElapsedTime(StreamCompaction::Thrust::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
//printArray(NPOT, c, true);
printArray(NPOT, c, true);
printCmpResult(NPOT, b, c);

printf("\n");
@@ -129,14 +131,14 @@ int main(int argc, char* argv[]) {
printDesc("work-efficient compact, power-of-two");
count = StreamCompaction::Efficient::compact(SIZE, c, a);
printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
//printArray(count, c, true);
printArray(count, c, true);
printCmpLenResult(count, expectedCount, b, c);

zeroArray(SIZE, c);
printDesc("work-efficient compact, non-power-of-two");
count = StreamCompaction::Efficient::compact(NPOT, c, a);
printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
//printArray(count, c, true);
printArray(count, c, true);
printCmpLenResult(count, expectedNPOT, b, c);

system("pause"); // stop Win32 console from closing on exit
35 changes: 22 additions & 13 deletions stream_compaction/common.cu
@@ -1,20 +1,19 @@
#include "common.h"

void checkCUDAErrorFn(const char *msg, const char *file, int line) {
cudaError_t err = cudaGetLastError();
if (cudaSuccess == err) {
return;
}
cudaError_t err = cudaGetLastError();
if (cudaSuccess == err) {
return;
}

fprintf(stderr, "CUDA error");
if (file) {
fprintf(stderr, " (%s:%d)", file, line);
}
fprintf(stderr, ": %s: %s\n", msg, cudaGetErrorString(err));
exit(EXIT_FAILURE);
fprintf(stderr, "CUDA error");
if (file) {
fprintf(stderr, " (%s:%d)", file, line);
}
fprintf(stderr, ": %s: %s\n", msg, cudaGetErrorString(err));
exit(EXIT_FAILURE);
}


namespace StreamCompaction {
namespace Common {

@@ -23,7 +22,9 @@ namespace StreamCompaction {
* which map to 0 will be removed, and elements which map to 1 will be kept.
*/
__global__ void kernMapToBoolean(int n, int *bools, const int *idata) {
// TODO
int index = (blockIdx.x * blockDim.x) + threadIdx.x;
if (index < n)
bools[index] = idata[index] != 0 ? 1 : 0;
}

/**
@@ -32,8 +33,16 @@
*/
__global__ void kernScatter(int n, int *odata,
const int *idata, const int *bools, const int *indices) {
// TODO
int index = (blockIdx.x * blockDim.x) + threadIdx.x;
if (index < n && bools[index] != 0)
odata[indices[index]] = idata[index];

}

/**
 * Converts an inclusive scan result into an exclusive one by shifting
 * each element one slot right and writing the identity (0) at index 0.
 */
__global__ void kernInclusiveToExclusive(int n, int *odata, const int *idata) {
int index = (blockIdx.x * blockDim.x) + threadIdx.x;
if (index < n) odata[index] = index ? idata[index - 1] : 0;
}

}
}