133 changes: 127 additions & 6 deletions README.md
@@ -3,12 +3,133 @@ CUDA Stream Compaction

**University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 2**

* Shreyas Singh
* [LinkedIn](https://linkedin.com/in/shreyassinghiitr), [Personal Website](https://github.com/shreyas3156).
* Tested on: Windows 10, i7-12700 @ 2.1GHz 32GB, T1000 (CETS Lab)

### About
This project implements the all-prefix-sums operation on an array of data, commonly known as
_Scan_, followed by _Stream Compaction_: building a new array containing only the input elements that satisfy a given criterion, while preserving their order. Throughout the project, _scan_ is used interchangeably with _exclusive scan_, where element _k_ of the result
is the sum of all input elements before, but not including, element _k_.

We present three implementations of Scan: a trivial sequential CPU scan, and two CUDA implementations of a parallel
scan algorithm, "Naive" and "Work-Efficient", which exploit the GPU's data parallelism for large performance gains
on big inputs.

Our stream compaction algorithm removes `0`s from an array of `int`s, using both the CPU scan and the work-efficient CUDA scan.
For a detailed treatment of scan and further reading, see [GPU Gems 3, Ch. 39](https://developer.nvidia.com/gpugems/gpugems3/part-vi-gpu-computing/chapter-39-parallel-prefix-sum-scan-cuda).


## Performance Analysis

The algorithms were timed against parameters such as `blockSize` and input length using the custom `timer()` API. Only the algorithms themselves are timed; memory allocation and other setup are excluded, so the comparison reflects compute alone. All times are in milliseconds (ms).

### Optimal Block size
The `blockSize` parameter was roughly optimized for both parallel algorithms at an array length of 2^20, averaging the times for power-of-two (POT) and non-power-of-two (NPOT) inputs.
The optimal block size was found to be 512 for the naive parallel scan and 128 for the work-efficient scan.
![](img/blocksizeopt.png)


### Scan implementations vs Array Size
We compare the Naive, Work-Efficient, and Thrust implementations of scan against the CPU scan algorithm.
Thrust is a C++ template library for CUDA, modeled on the Standard Template Library (STL), for writing
high-performance parallel applications.

Both GPU and CPU timing functions are wrapped in a performance-timer class: `std::chrono` provides
high-precision CPU timing, and CUDA events measure GPU performance.
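A minimal sketch of the CPU side of such a timer (class and method names are illustrative, not the project's exact API; the CUDA path, not shown, would bracket kernel launches with a `cudaEvent_t` pair and `cudaEventElapsedTime`):

```cpp
#include <cassert>
#include <chrono>

// Wall-clock timer in milliseconds via std::chrono, mirroring the
// start/stop pattern used around each scan implementation.
class CpuTimer {
public:
    void start() { t0 = std::chrono::high_resolution_clock::now(); }
    double stopMs() const {
        auto t1 = std::chrono::high_resolution_clock::now();
        return std::chrono::duration<double, std::milli>(t1 - t0).count();
    }
private:
    std::chrono::high_resolution_clock::time_point t0;
};
```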

Power-of-two and non-power-of-two inputs are analyzed separately for better
comparability of the CPU, GPU, and Thrust implementations.


![](img/scan_pot.png)
![](img/perfpot.png)

The analysis for NPOT arrays follows:
![](img/scan_npot.png)
![](img/perfnpot.png)

The CPU implementation is fastest for short inputs, up to an array size of about
2^13. Beyond that, the CPU scan is slightly slower than the naive parallel scan but still faster than the work-efficient parallel scan
until an array size of 2^20. The Thrust scan is the fastest of the GPU algorithms throughout, and beats the CPU scan beyond 2^13.

The *work-efficient scan is optimized for thread usage*: it launches only the
required number of threads at each level, so no idle threads wait on others to terminate. This
is why, at large array lengths, it is faster than everything except the Thrust scan.
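To see why this matters: at up-sweep depth *d*, only n / 2^(d+1) threads do useful work, so launching exactly that many per level (rather than n every pass) shrinks each grid geometrically. A small host-side illustration of the active-thread counts, assuming n is a power of two:

```cpp
#include <cassert>
#include <vector>

// Active threads per up-sweep level of an n-element work-efficient scan
// (n a power of two): n/2, n/4, ..., 1 -- instead of n at every level.
std::vector<int> upsweepThreadCounts(int n) {
    std::vector<int> counts;
    for (int active = n / 2; active >= 1; active /= 2)
        counts.push_back(active);
    return counts;
}
```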

The major *performance bottleneck* for all the parallel algorithms is global-memory I/O. This could be improved
by using shared memory, whose size can be set dynamically per launch in proportion to the block size. The overhead is amortized
at larger array sizes, where the work-efficient and Thrust scans outrun the CPU scan.

The *Thrust* library's `exclusive_scan()` performs remarkably consistently as the input size grows up to
2^17, after which its runtime increases modestly. A likely explanation is Thrust's optimized memory management,
which avoids much of the global-memory I/O overhead; the slowdown at larger array sizes is plausibly due to
allocation and host-device copy latency.
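Thrust's `exclusive_scan` follows the same contract as C++17's `std::exclusive_scan`; a host-side sketch of the call pattern (assumes a C++17 compiler; the wrapper name is illustrative):

```cpp
#include <cassert>
#include <numeric>
#include <vector>

// Host-side analogue of thrust::exclusive_scan(first, last, result, 0):
// writes [0, in[0], in[0]+in[1], ...] into the output.
std::vector<int> hostExclusiveScan(const std::vector<int>& in) {
    std::vector<int> out(in.size());
    std::exclusive_scan(in.begin(), in.end(), out.begin(), 0);
    return out;
}
```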


### Stream Compaction Performance

Here we compare the stream compaction implementations: CPU (with and without scan) and the work-efficient CUDA algorithm.
![](img/compact_perf.png)

As the discussion above suggests, the CPU algorithms take far less time than their GPU
counterparts for smaller arrays (< 2^20). The trend flips at array sizes above 2^20, where the optimized
thread usage of the work-efficient algorithm pays off.
```
****************
** SCAN TESTS **
****************
[ 10 32 23 43 28 33 17 0 35 37 44 0 23 ... 33 0 ]
==== cpu scan, power-of-two ====
elapsed time: 1.7358ms (std::chrono Measured)
[ 0 10 42 65 108 136 169 186 186 221 258 302 302 ... 25672547 25672580 ]
==== cpu scan, non-power-of-two ====
elapsed time: 1.7335ms (std::chrono Measured)
[ 0 10 42 65 108 136 169 186 186 221 258 302 302 ... 25672508 25672513 ]
passed
==== naive scan, power-of-two ====
elapsed time: 1.57238ms (CUDA Measured)
passed
==== naive scan, non-power-of-two ====
elapsed time: 1.51142ms (CUDA Measured)
passed
==== work-efficient scan, power-of-two ====
elapsed time: 1.42707ms (CUDA Measured)
passed
==== work-efficient scan, non-power-of-two ====
elapsed time: 1.36602ms (CUDA Measured)
passed
==== thrust scan, power-of-two ====
elapsed time: 0.596ms (CUDA Measured)
passed
==== thrust scan, non-power-of-two ====
elapsed time: 0.176128ms (CUDA Measured)
passed

*****************************
** STREAM COMPACTION TESTS **
*****************************
[ 3 1 1 2 2 0 1 3 1 1 0 0 0 ... 0 0 ]
==== cpu compact without scan, power-of-two ====
elapsed time: 2.146ms (std::chrono Measured)
[ 3 1 1 2 2 1 3 1 1 3 3 2 1 ... 3 3 ]
passed
==== cpu compact without scan, non-power-of-two ====
elapsed time: 2.171ms (std::chrono Measured)
[ 3 1 1 2 2 1 3 1 1 3 3 2 1 ... 1 3 ]
passed
==== cpu compact with scan ====
elapsed time: 5.0626ms (std::chrono Measured)
[ 3 1 1 2 2 1 3 1 1 3 3 2 1 ... 3 3 ]
passed
==== work-efficient compact, power-of-two ====
elapsed time: 2.25693ms (CUDA Measured)
passed
==== work-efficient compact, non-power-of-two ====
elapsed time: 1.92234ms (CUDA Measured)
passed
Press any key to continue . . .
```

Binary file added img/blocksizeopt.png
Binary file added img/compact_perf.png
Binary file added img/perfnpot.png
Binary file added img/perfpot.png
Binary file added img/scan_npot.png
Binary file added img/scan_pot.png
2 changes: 1 addition & 1 deletion src/main.cpp
@@ -13,7 +13,7 @@
#include <stream_compaction/thrust.h>
#include "testing_helpers.hpp"

const int SIZE = 1 << 20; // feel free to change the size of array
const int NPOT = SIZE - 3; // Non-Power-Of-Two
int *a = new int[SIZE];
int *b = new int[SIZE];
20 changes: 16 additions & 4 deletions stream_compaction/common.cu
@@ -1,6 +1,6 @@
#include "common.h"

void checkCUDAErrorFn(const char* msg, const char* file, int line) {
cudaError_t err = cudaGetLastError();
if (cudaSuccess == err) {
return;
@@ -22,17 +22,29 @@ namespace StreamCompaction {
* Maps an array to an array of 0s and 1s for stream compaction. Elements
* which map to 0 will be removed, and elements which map to 1 will be kept.
*/
__global__ void kernMapToBoolean(int n, int* bools, const int* idata) {
int idx = blockDim.x * blockIdx.x + threadIdx.x;
if (idx >= n) {
return;
}
bools[idx] = (idata[idx]) ? 1 : 0;
}

/**
* Performs scatter on an array. That is, for each element in idata,
* if bools[idx] == 1, it copies idata[idx] to odata[indices[idx]].
*/
__global__ void kernScatter(int n, int* odata,
const int* idata, const int* bools, const int* indices) {
int idx = blockDim.x * blockIdx.x + threadIdx.x;
if (idx >= n) {
return;
}
if (bools[idx]) {
odata[indices[idx]] = idata[idx];
}
}

}
38 changes: 33 additions & 5 deletions stream_compaction/cpu.cu
@@ -19,7 +19,9 @@ namespace StreamCompaction {
*/
void scan(int n, int *odata, const int *idata) {
timer().startCpuTimer();
odata[0] = 0; // exclusive scan: first element is the identity
for (int i = 1; i < n; i++) {
odata[i] = odata[i - 1] + idata[i - 1];
}
timer().endCpuTimer();
}

@@ -30,9 +32,15 @@
*/
int compactWithoutScan(int n, int *odata, const int *idata) {
timer().startCpuTimer();
int j = 0;
for (int i = 0; i < n; i++) {
if (idata[i]) {
odata[j] = idata[i];
j++;
}
}
timer().endCpuTimer();
return (j) ? j : -1;
}

/**
@@ -42,9 +50,29 @@
*/
int compactWithScan(int n, int *odata, const int *idata) {
timer().startCpuTimer();
// map the input array to an array of 0s and 1s
int* temp = new int[n];
for (int i = 0; i < n; i++) {
temp[i] = idata[i] == 0 ? 0 : 1;
}

int* scanOutput = new int[n];
scanOutput[0] = 0;

// scan the temp array
for (int i = 1; i < n; i++) {
scanOutput[i] = scanOutput[i - 1] + temp[i - 1];
}

// exclusive scan excludes the last element, so add its flag back in
int compactLen = scanOutput[n - 1] + temp[n - 1];

for (int i = 0; i < n; i++) {
if (temp[i]) {
odata[scanOutput[i]] = idata[i];
}
}
timer().endCpuTimer();
delete[] temp;
delete[] scanOutput;
return (compactLen) ? compactLen : -1;
}
}
}