153 changes: 147 additions & 6 deletions README.md
@@ -3,12 +3,153 @@ CUDA Stream Compaction

**University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 2**

* (TODO) YOUR NAME HERE
* (TODO) [LinkedIn](), [personal website](), [twitter](), etc.
* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab)
* Kevin Dong
* [LinkedIn](https://www.linkedin.com/in/xingyu-dong)
* Tested on: Windows 11, Intel(R) Core(TM) i7-10750H @ 2.60GHz, GTX 2060

### (TODO: Your README)
This repo implements several versions of the parallel exclusive scan algorithm and uses them to implement stream
compaction. The work-efficient scan algorithm is also used to implement a parallel radix sort.

Include analysis, etc. (Remember, this is public, so don't put
anything here that you don't want to share with the world.)
### Features

#### CPU Scan and Compaction
The CPU scan implementation is a straightforward $O(n)$ algorithm. The compaction algorithms, with and without
scan, also run in $O(n)$ time. All of them perform well on small inputs but slow down as the input size grows.
#### Naive GPU Scan
The naive parallel scan finishes in $O(\log n)$ parallel steps but still performs $O(n \log n)$ additions in
total, so it does no less work than the sequential algorithm.
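The per-level structure can be sketched sequentially, where the inner loop stands in for the $n$ parallel threads launched at each level (an illustrative sketch, not the repo's kernel code):

```cpp
#include <algorithm>
#include <vector>

// Each pass simulates one parallel level: every element with
// i >= offset adds in the value offset positions behind it.
// After ceil(log2(n)) passes this yields an inclusive scan;
// shifting right by one and inserting 0 makes it exclusive.
std::vector<int> naiveScan(std::vector<int> data) {
    size_t n = data.size();
    std::vector<int> buf(n);
    for (size_t offset = 1; offset < n; offset <<= 1) {
        for (size_t i = 0; i < n; ++i) {  // these n iterations run as threads on the GPU
            buf[i] = data[i] + (i >= offset ? data[i - offset] : 0);
        }
        std::swap(data, buf);             // double buffering avoids read/write races
    }
    std::vector<int> out(n, 0);
    for (size_t i = 1; i < n; ++i) out[i] = data[i - 1];
    return out;
}
```

Each of the $\log n$ levels touches all $n$ elements, which is where the $O(n \log n)$ addition count comes from.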
#### Work-Efficient GPU Scan and Compaction
The work-efficient scan improves upon the naive algorithm by treating the array as a balanced binary tree. It first
performs a parallel reduction (the up-sweep), then a down-sweep that yields the final scan result. This algorithm
performs $O(n)$ additions and is much faster than the naive scan. To accommodate input sizes that are not powers of 2,
we enlarge the input array to the next power of 2 and pad the tail with 0s.
#### Work-Efficient Scan Optimization (Part 5 Extra Credit)
We can further optimize the work-efficient scan by reducing the number of threads launched. At deeper tree levels
only a fraction of the nodes need updating in each iteration, so instead of launching a thread per element and
letting most of them exit early, we launch exactly as many threads as there are active nodes.
#### Thrust Scan
The Thrust implementation simply calls `thrust::exclusive_scan` from the Thrust API to perform the scan.
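For reference, C++17's `std::exclusive_scan` has the same call shape on the host as `thrust::exclusive_scan` does on device iterators (a host-side analogue, not the actual device code):

```cpp
#include <numeric>
#include <vector>

// Host analogue of the Thrust call: exclusive_scan(first, last, result, init).
std::vector<int> hostExclusiveScan(const std::vector<int>& in) {
    std::vector<int> out(in.size());
    std::exclusive_scan(in.begin(), in.end(), out.begin(), 0);
    return out;
}
```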
#### Radix Sort (Extra Credit 1)
The parallel radix sort uses the work-efficient exclusive scan as a building block. We iterate over $k$
bits, and each pass's exclusive scan takes $O(\log n)$ parallel steps, so the total parallel time is
$O(k \cdot \log n)$: generating the bit array and scattering are each a single $O(1)$ parallel step.

### Performance Analysis
We compare the performance of the different scan algorithms across input sizes. The graph below is plotted on a
log scale.
![Performance Graph](img/Figure_2.png)
From the graph, we can see that the thrust scan performs best among all the scan algorithms, followed by the
work-efficient scan. When the input size is small, the CPU implementation is the fastest, but it quickly falls
behind as the input grows. The naive scan also degrades on very large inputs.

#### Performance Bottlenecks
Our timers for all the algorithms exclude memory allocation and the final device-to-host copy of the result array.
The measured time therefore mainly reflects computation, plus the memory accesses performed during that computation.
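The same bracketing can be illustrated with `std::chrono` on the host (a sketch of the methodology, not the project's timer class):

```cpp
#include <chrono>
#include <vector>

// Measure only the scan loop itself, mirroring how the project's timers
// bracket the kernel work but not cudaMalloc/cudaMemcpy.
double timeScanMs(const std::vector<int>& in, std::vector<int>& out) {
    out.assign(in.size(), 0);                 // allocation happens before the timer
    auto start = std::chrono::high_resolution_clock::now();
    for (size_t i = 1; i < in.size(); ++i) {  // timed region: computation only
        out[i] = out[i - 1] + in[i - 1];
    }
    auto end = std::chrono::high_resolution_clock::now();
    return std::chrono::duration<double, std::milli>(end - start).count();
}
```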

The CPU implementation is efficient and hard to improve further, since it is already a linear-time algorithm. The
runtime results also show that it is very fast on small inputs, because unlike its GPU counterparts it incurs
almost no launch or transfer overhead.

The naive algorithm generally performs worse than the work-efficient one: it performs more additions than even the
CPU implementation and leaves many threads idle. Since our implementation uses global memory, memory access latency
can also slow it down; a shared-memory version would likely perform better.

The work-efficient algorithm improves on the naive one by organizing the computation as a balanced binary tree, so
the up-sweep and down-sweep together perform only $O(n)$ additions, making the algorithm more efficient. It does
issue more memory accesses, since more steps are involved and all data live in global memory, so moving to a
shared-memory model should greatly improve its performance. In Part 5, we implemented an optimization that launches
fewer threads and replaces the modulo operation with a cheaper comparison, which should make the algorithm more
efficient.

The thrust implementation is the most efficient when the input size is very large. It presumably uses shared memory
and carefully tuned warp and register usage, which makes it very efficient on large-scale data.

![Thrust Analysis](img/Nsight.png)

From the Nsight Systems timeline, we can see that the thrust implementation sustains a high SM warp occupancy,
indicating that it utilizes GPU resources effectively. There are also DRAM read and write operations, likely the
memory traffic from the up-sweep and down-sweep phases. Overall, the thrust implementation appears well optimized.


### Output
This output was generated with an input size of $2^{20}$ and a block size of $256$.
```
****************
** SCAN TESTS **
****************
[ 28 28 11 44 31 11 4 25 23 43 12 12 28 ... 48 0 ]
==== cpu scan, power-of-two ====
elapsed time: 1.5931ms (std::chrono Measured)
[ 0 28 56 67 111 142 153 157 182 205 248 260 272 ... 25666831 25666879 ]
passed
==== cpu scan, non-power-of-two ====
elapsed time: 1.4801ms (std::chrono Measured)
[ 0 28 56 67 111 142 153 157 182 205 248 260 272 ... 25666796 25666799 ]
passed
==== naive scan, power-of-two ====
elapsed time: 1.85376ms (CUDA Measured)
[ 0 28 56 67 111 142 153 157 182 205 248 260 272 ... 25666831 25666879 ]
passed
==== naive scan, non-power-of-two ====
elapsed time: 0.916128ms (CUDA Measured)
[ 0 28 56 67 111 142 153 157 182 205 248 260 272 ... 25666796 25666799 ]
passed
==== work-efficient scan, power-of-two ====
elapsed time: 1.6073ms (CUDA Measured)
[ 0 28 56 67 111 142 153 157 182 205 248 260 272 ... 25666831 25666879 ]
passed
==== work-efficient scan, non-power-of-two ====
elapsed time: 1.02006ms (CUDA Measured)
[ 0 28 56 67 111 142 153 157 182 205 248 260 272 ... 25666796 25666799 ]
passed
==== thrust scan, power-of-two ====
elapsed time: 0.83968ms (CUDA Measured)
[ 0 28 56 67 111 142 153 157 182 205 248 260 272 ... 25666831 25666879 ]
passed
==== thrust scan, non-power-of-two ====
elapsed time: 0.6672ms (CUDA Measured)
[ 0 28 56 67 111 142 153 157 182 205 248 260 272 ... 25666796 25666799 ]
passed

*****************************
** STREAM COMPACTION TESTS **
*****************************
[ 0 2 1 2 3 2 3 2 0 3 2 2 1 ... 2 0 ]
==== cpu compact without scan, power-of-two ====
elapsed time: 2.2588ms (std::chrono Measured)
[ 2 1 2 3 2 3 2 3 2 2 1 1 2 ... 1 2 ]
passed
==== cpu compact without scan, non-power-of-two ====
elapsed time: 2.1694ms (std::chrono Measured)
[ 2 1 2 3 2 3 2 3 2 2 1 1 2 ... 3 3 ]
passed
==== cpu compact with scan ====
elapsed time: 5.4433ms (std::chrono Measured)
[ 2 1 2 3 2 3 2 3 2 2 1 1 2 ... 1 2 ]
passed
==== work-efficient compact, power-of-two ====
elapsed time: 1.83184ms (CUDA Measured)
[ 2 1 2 3 2 3 2 3 2 2 1 1 2 ... 1 2 ]
passed
==== work-efficient compact, non-power-of-two ====
elapsed time: 1.1128ms (CUDA Measured)
[ 2 1 2 3 2 3 2 3 2 2 1 1 2 ... 3 3 ]
passed

**********************
** RADIX SORT TESTS **
**********************
[ 32 26 157 158 195 138 167 198 116 95 114 106 149 ... 30 0 ]
==== radix sort ====
elapsed time: 17.5402ms (CUDA Measured)
[ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 199 199 ]
==== thrust sort ====
elapsed time: 1.57117ms (CUDA Measured)
[ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 199 199 ]
passed
```
Binary file added img/Figure_1.png
Binary file added img/Figure_2.png
Binary file added img/Nsight.png
60 changes: 51 additions & 9 deletions src/main.cpp
@@ -11,14 +11,17 @@
#include <stream_compaction/naive.h>
#include <stream_compaction/efficient.h>
#include <stream_compaction/thrust.h>
#include <stream_compaction/radix.h>
#include "testing_helpers.hpp"

const int SIZE = 1 << 8; // feel free to change the size of array
const int SIZE = 1 << 20; // feel free to change the size of array
const int NPOT = SIZE - 3; // Non-Power-Of-Two
int *a = new int[SIZE];
int *b = new int[SIZE];
int *c = new int[SIZE];

int *d = new int[SIZE];

int main(int argc, char* argv[]) {
// Scan tests

@@ -39,6 +42,7 @@ int main(int argc, char* argv[]) {
StreamCompaction::CPU::scan(SIZE, b, a);
printElapsedTime(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation(), "(std::chrono Measured)");
printArray(SIZE, b, true);
printCmpResult(SIZE, b, b);

zeroArray(SIZE, c);
printDesc("cpu scan, non-power-of-two");
@@ -51,7 +55,7 @@ int main(int argc, char* argv[]) {
printDesc("naive scan, power-of-two");
StreamCompaction::Naive::scan(SIZE, c, a);
printElapsedTime(StreamCompaction::Naive::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
//printArray(SIZE, c, true);
printArray(SIZE, c, true);
printCmpResult(SIZE, b, c);

/* For bug-finding only: Array of 1s to help find bugs in stream compaction or scan
@@ -64,37 +68,50 @@ int main(int argc, char* argv[]) {
printDesc("naive scan, non-power-of-two");
StreamCompaction::Naive::scan(NPOT, c, a);
printElapsedTime(StreamCompaction::Naive::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
//printArray(SIZE, c, true);
printArray(NPOT, c, true);
printCmpResult(NPOT, b, c);

zeroArray(SIZE, c);
printDesc("work-efficient scan, power-of-two");
StreamCompaction::Efficient::scan(SIZE, c, a);
printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
//printArray(SIZE, c, true);
printArray(SIZE, c, true);
printCmpResult(SIZE, b, c);

zeroArray(SIZE, c);
printDesc("work-efficient scan, non-power-of-two");
StreamCompaction::Efficient::scan(NPOT, c, a);
printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
//printArray(NPOT, c, true);
printArray(NPOT, c, true);
printCmpResult(NPOT, b, c);

zeroArray(SIZE, c);
printDesc("thrust scan, power-of-two");
StreamCompaction::Thrust::scan(SIZE, c, a);
printElapsedTime(StreamCompaction::Thrust::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
//printArray(SIZE, c, true);
printArray(SIZE, c, true);
printCmpResult(SIZE, b, c);

zeroArray(SIZE, c);
printDesc("thrust scan, non-power-of-two");
StreamCompaction::Thrust::scan(NPOT, c, a);
printElapsedTime(StreamCompaction::Thrust::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
//printArray(NPOT, c, true);
printArray(NPOT, c, true);
printCmpResult(NPOT, b, c);

// compare results between thrust array and cpu array
// zeroArray(SIZE, b);
// printDesc("compare thrust array and cpu array - CPU");
// StreamCompaction::CPU::scan(NPOT, b, a);
// printElapsedTime(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation(), "(std::chrono Measured)");
// printCmpResult(NPOT, b, b);
//
// zeroArray(SIZE, d);
// printDesc("compare thrust array and cpu array - Thrust");
// StreamCompaction::Thrust::scan(NPOT, d, a);
// printElapsedTime(StreamCompaction::Thrust::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
// printCmpResult(NPOT, d, b);

printf("\n");
printf("*****************************\n");
printf("** STREAM COMPACTION TESTS **\n");
@@ -137,18 +154,43 @@ int main(int argc, char* argv[]) {
printDesc("work-efficient compact, power-of-two");
count = StreamCompaction::Efficient::compact(SIZE, c, a);
printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
//printArray(count, c, true);
printArray(count, c, true);
printCmpLenResult(count, expectedCount, b, c);

zeroArray(SIZE, c);
printDesc("work-efficient compact, non-power-of-two");
count = StreamCompaction::Efficient::compact(NPOT, c, a);
printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
//printArray(count, c, true);
printArray(count, c, true);
printCmpLenResult(count, expectedNPOT, b, c);

printf("\n");
printf("**********************\n");
printf("** RADIX SORT TESTS **\n");
printf("**********************\n");

genArray(SIZE - 1, a, 200); // Leave a 0 at the end to test that edge case
a[SIZE - 1] = 0;
printArray(SIZE, a, true);

int maxBits = StreamCompaction::Radix::getMaxBits(SIZE, a);

zeroArray(SIZE, b);
printDesc("radix sort");
StreamCompaction::Radix::sort(SIZE, b, a, maxBits);
printElapsedTime(StreamCompaction::Radix::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
printArray(SIZE, b, true);

zeroArray(SIZE, c);
printDesc("thrust sort");
StreamCompaction::Thrust::sort(SIZE, c, a);
printElapsedTime(StreamCompaction::Thrust::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
printArray(SIZE, c, true);
printCmpResult(SIZE, b, c);

system("pause"); // stop Win32 console from closing on exit
delete[] a;
delete[] b;
delete[] c;
delete[] d;
}
4 changes: 3 additions & 1 deletion stream_compaction/CMakeLists.txt
@@ -4,6 +4,7 @@ set(headers
"naive.h"
"efficient.h"
"thrust.h"
"radix.h"
)

set(sources
@@ -12,7 +13,8 @@ set(sources
"naive.cu"
"efficient.cu"
"thrust.cu"
)
"radix.cu"
)

list(SORT headers)
list(SORT sources)
16 changes: 14 additions & 2 deletions stream_compaction/common.cu
@@ -23,7 +23,12 @@ namespace StreamCompaction {
* which map to 0 will be removed, and elements which map to 1 will be kept.
*/
__global__ void kernMapToBoolean(int n, int *bools, const int *idata) {
// TODO
int index = threadIdx.x + (blockIdx.x * blockDim.x);
if (index >= n) {
return;
}

bools[index] = idata[index] == 0 ? 0 : 1;
}

/**
@@ -32,7 +37,14 @@ namespace StreamCompaction {
*/
__global__ void kernScatter(int n, int *odata,
const int *idata, const int *bools, const int *indices) {
// TODO
int index = threadIdx.x + (blockIdx.x * blockDim.x);
if (index >= n) {
return;
}

if (bools[index] == 1) {
odata[indices[index]] = idata[index];
}
}

}