152 changes: 146 additions & 6 deletions README.md
@@ -3,12 +3,152 @@ CUDA Stream Compaction

**University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 2**

* (TODO) YOUR NAME HERE
* (TODO) [LinkedIn](), [personal website](), [twitter](), etc.
* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab)
* Yuhan Liu
* [LinkedIn](https://www.linkedin.com/in/yuhan-liu-), [personal website](https://liuyuhan.me/), [twitter](https://x.com/yuhanl_?lang=en), etc.
* Tested on: Windows 11 Pro, Ultra 7 155H @ 1.40 GHz 32GB, RTX 4060 8192MB (Personal Laptop)

### (TODO: Your README)
## README!

Include analysis, etc. (Remember, this is public, so don't put
anything here that you don't want to share with the world.)
### Project Description

* This project investigates parallel algorithms by implementing several versions of scan (prefix sum), stream compaction (removing unwanted elements from an array), and sorting. We begin with CPU implementations of each algorithm, then develop GPU versions, including naive and work-efficient approaches, for comparison.

### Performance Analysis

**Optimizing Block Size**

<img src="https://github.com/yuhanliu-tech/GPU-Stream-Compaction/blob/main/img/block_opt.png" width="600"/>

* To settle on a single block size for all of the parallel implementations, I measured the naive and work-efficient runtimes across increasing block sizes. The previous project established a block-size tradeoff between increasing GPU utilization and limiting the resources available to each block, so the optimal block size sits at the dip in the runtime curve. Although the candidates are close here, I chose a block size of 256 and used it for all subsequent measurements.
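* The launch configuration implied by a chosen block size follows the usual ceiling-division pattern. This is a generic sketch with a hypothetical helper name, not the project's exact code:

```cpp
// Ceiling division: enough blocks of `blockSize` threads to cover n elements.
int fullBlocksPerGrid(int n, int blockSize) {
    return (n + blockSize - 1) / blockSize;
}
// With blockSize = 256 and n = 2^18 elements, a kernel would launch
// 1024 blocks, e.g. kern<<<fullBlocksPerGrid(n, 256), 256>>>(...).
```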

#### Comparison of GPU Scan Implementations

| CPU | Naive | Work-Efficient | Thrust |
| :------------------------------: |:------------------------------: |:-----------------------------------------------: |:-----------------------------------------------:|
| For smaller datasets, the overhead of managing parallel execution on a GPU (e.g., kernel launches, thread synchronization) offsets the parallel advantage, making the CPU more efficient. As the data scales, however, the cost of scanning on the CPU grows linearly: in the graph below, where array sizes increase exponentially, the CPU runtime does as well. | The naive parallel approach rivals the work-efficient approach for most of the array sizes below. However, this method has a memory-latency bottleneck: its heavy use of global memory in the summation causes the runtime to swell as the array size grows. | The work-efficient GPU scan starts with more overhead than the naive implementation for smaller arrays, but as arrays grow larger, it outperforms both the CPU and naive versions. The work-efficient binary-tree structure reduces unnecessary memory operations and optimizes thread usage by performing operations in place. The bottleneck here is thread utilization: the binary tree leaves threads idle as the number of operations at each step of the algorithm shrinks. Because of this, more primitive methods like naive and CPU come close in runtime for smaller array sizes. | Referencing the NVIDIA developer [documentation](https://developer.nvidia.com/gpugems/gpugems3/part-vi-gpu-computing/chapter-39-parallel-prefix-sum-scan-cuda) for exclusive scan, we see that the Thrust implementation employs techniques to improve performance, including shared-memory optimization and load balancing. Setting up these optimizations, like bank-conflict avoidance and loop unrolling, introduces overhead, which makes the more primitive methods more efficient for small array sizes. Even so, the graph below shows that the Thrust scan's runtime grows far more slowly with array size than any of the other implementations. |

<img src="https://github.com/yuhanliu-tech/GPU-Stream-Compaction/blob/main/img/scan_perf.png" width="600"/>
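The up-sweep/down-sweep structure described in the work-efficient column can be sketched on the host as follows. This is a single-threaded illustration assuming a power-of-two length; the actual GPU version runs each tree level as a separate kernel launch over the array in global memory:

```cpp
#include <vector>

// In-place exclusive scan via up-sweep (reduce) and down-sweep,
// mirroring the work-efficient GPU algorithm on one thread.
// Assumes data.size() is a power of two.
void workEfficientScan(std::vector<int>& data) {
    int n = static_cast<int>(data.size());
    // Up-sweep: build partial sums up the binary tree.
    for (int stride = 1; stride < n; stride *= 2) {
        for (int i = 2 * stride - 1; i < n; i += 2 * stride) {
            data[i] += data[i - stride];
        }
    }
    // Down-sweep: zero the root, then push sums back down the tree.
    data[n - 1] = 0;
    for (int stride = n / 2; stride >= 1; stride /= 2) {
        for (int i = 2 * stride - 1; i < n; i += 2 * stride) {
            int t = data[i - stride];
            data[i - stride] = data[i];   // left child takes parent's value
            data[i] += t;                 // right child adds left subtree sum
        }
    }
}
```

Note how the inner loop touches half as many elements at each up-sweep level; on the GPU, this is exactly where threads go idle, which is the underutilization bottleneck described above.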

#### Output of Testing
test array SIZE: 2^18, blockSize: 256

```
****************
** SCAN TESTS **
****************
[ 35 49 36 24 1 48 46 6 49 24 44 47 5 ... 2 0 ]
==== cpu scan, power-of-two ====
elapsed time: 0.4305ms (std::chrono Measured)
[ 0 35 84 120 144 145 193 239 245 294 318 362 409 ... 6412415 6412417 ]
==== cpu scan, non-power-of-two ====
elapsed time: 0.5392ms (std::chrono Measured)
[ 0 35 84 120 144 145 193 239 245 294 318 362 409 ... 6412308 6412343 ]
passed
==== naive scan, power-of-two ====
elapsed time: 0.368768ms (CUDA Measured)
passed
==== naive scan, non-power-of-two ====
elapsed time: 0.219936ms (CUDA Measured)
passed
==== work-efficient scan, power-of-two ====
elapsed time: 0.333856ms (CUDA Measured)
passed
==== work-efficient scan, non-power-of-two ====
elapsed time: 0.366688ms (CUDA Measured)
passed
==== thrust scan, power-of-two ====
elapsed time: 1.55728ms (CUDA Measured)
passed
==== thrust scan, non-power-of-two ====
elapsed time: 0.344288ms (CUDA Measured)
passed

*****************************
** STREAM COMPACTION TESTS **
*****************************
[ 1 3 0 0 3 2 0 2 1 2 0 1 3 ... 2 1 ]
==== cpu compact without scan, power-of-two ====
elapsed time: 0.6932ms (std::chrono Measured)
[ 1 3 3 2 2 1 2 1 3 3 2 2 3 ... 2 1 ]
passed
==== cpu compact without scan, non-power-of-two ====
elapsed time: 0.7058ms (std::chrono Measured)
[ 1 3 3 2 2 1 2 1 3 3 2 2 3 ... 1 1 ]
passed
==== cpu compact with scan ====
elapsed time: 1.3968ms (std::chrono Measured)
[ 1 3 3 2 2 1 2 1 3 3 2 2 3 ... 2 1 ]
passed
==== work-efficient compact, power-of-two ====
elapsed time: 0.445152ms (CUDA Measured)
passed
==== work-efficient compact, non-power-of-two ====
elapsed time: 0.665728ms (CUDA Measured)
passed

*****************************
***** RADIX SORT TESTS ******
*****************************

Radix Sort, hard-coded lecture example test

[ 4 7 2 6 3 5 1 0 ]
==== cpu sort (std::sort) ====
elapsed time: 0.0002ms (std::chrono Measured)
[ 0 1 2 3 4 5 6 7 ]
==== radix sort ====
elapsed time: 5.5552ms (CUDA Measured)
[ 0 1 2 3 4 5 6 7 ]
passed

Radix Sort, pow2 consecutive ints

[ 0 1 2 3 4 5 6 7 8 9 10 11 12 ... 262142 262143 ]
==== cpu sort (std::sort) ====
elapsed time: 1.4753ms (std::chrono Measured)
[ 0 1 2 3 4 5 6 7 8 9 10 11 12 ... 262142 262143 ]
==== radix sort ====
elapsed time: 13.777ms (CUDA Measured)
[ 0 1 2 3 4 5 6 7 8 9 10 11 12 ... 262142 262143 ]
passed

Radix Sort, non-pow2 consecutive ints

[ 0 1 2 3 4 5 6 7 8 9 10 11 12 ... 262139 262140 ]
==== cpu sort (std::sort) ====
elapsed time: 1.3184ms (std::chrono Measured)
[ 0 1 2 3 4 5 6 7 8 9 10 11 12 ... 262139 262140 ]
==== radix sort ====
elapsed time: 13.37ms (CUDA Measured)
[ 0 1 2 3 4 5 6 7 8 9 10 11 12 ... 262139 262140 ]
passed

Radix Sort, non-pow2 shuffled ints

[ 4785 499 16736 27824 14951 31998 8696 23806 7549 30974 31344 14697 29955 ... 21785 7497 ]
==== cpu sort (std::sort) ====
elapsed time: 14.9359ms (std::chrono Measured)
[ 0 0 0 0 0 0 0 0 1 1 1 1 2 ... 32767 32767 ]
==== radix sort ====
elapsed time: 18.8853ms (CUDA Measured)
[ 0 0 0 0 0 0 0 0 1 1 1 1 2 ... 32767 32767 ]
passed
```

### Additional Feature: Radix Sort

* I implemented parallel radix sort as an additional module in the `stream_compaction` subproject, along with a CPU `std::sort` wrapper for comparison. The implementation can be found in `radix.cu` and `radix.h`.

* The radix sort function takes the size of the array along with pointers to the input and output arrays. Below is a code snippet from a radix test in the main method, showing how it is called:

```
StreamCompaction::Radix::sort(SIZE, c, a); // a is the existing input array to be sorted
printArray(SIZE, c, true); // print the sorted output array, which is written to c
```

* I wrote several tests for radix sort, comparing both the correctness and the runtime of my implementation against the CPU equivalent (`std::sort`). First, I compared radix sort with CPU sort on the fixed array [4, 7, 2, 6, 3, 5, 1, 0] from lecture. Then, I evaluated radix sort on consecutive-integer arrays of power-of-two and non-power-of-two sizes. Lastly, I tested radix sort on a shuffled array. For each test, I compared the results and performance of radix sort against CPU sort, measuring CUDA performance where applicable.
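* To illustrate what each pass of the algorithm does, here is a single-threaded sketch of least-significant-digit radix sort with hypothetical helper names; the GPU implementation instead derives the scatter indices from parallel scans over the per-bit boolean arrays:

```cpp
#include <vector>

// One LSD radix pass on a single bit: stably move elements whose
// bit is 0 ahead of elements whose bit is 1.
std::vector<int> radixPass(const std::vector<int>& in, int bit) {
    int n = static_cast<int>(in.size());
    std::vector<int> out(n);
    int zeros = 0;                                 // total elements with bit == 0
    for (int v : in) zeros += !((v >> bit) & 1);
    int zeroIdx = 0, oneIdx = zeros;               // ones start after all zeros
    for (int v : in) {
        if ((v >> bit) & 1) out[oneIdx++]  = v;
        else                out[zeroIdx++] = v;
    }
    return out;
}

// Full sort: repeat the stable pass for each bit, least significant first.
std::vector<int> radixSort(std::vector<int> data, int bits) {
    for (int b = 0; b < bits; ++b) data = radixPass(data, b);
    return data;
}
```

Each pass is stable, which is what makes the bit-by-bit iteration produce a fully sorted array after `bits` passes.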

**Radix Sort Performance Evaluation**
* Radix sort performs worse than `std::sort` for smaller array sizes due to the overhead and initialization costs of GPU processing and memory. However, as the array size grows, radix sort's performance improves relative to `std::sort`. This is consistent with the performance graph: radix sort has a complexity of O(n·k), where k is the number of bits, while the standard CPU sort has a complexity of O(n log n).

<img src="https://github.com/yuhanliu-tech/GPU-Stream-Compaction/blob/main/img/radix_perf.png" width="600"/>
Binary file added img/block_opt.png
Binary file added img/radix_perf.png
Binary file added img/scan_perf.png
130 changes: 124 additions & 6 deletions src/main.cpp
@@ -11,16 +11,18 @@
#include <stream_compaction/naive.h>
#include <stream_compaction/efficient.h>
#include <stream_compaction/thrust.h>
#include <stream_compaction/radix.h>
#include "testing_helpers.hpp"
#include <iostream>

const int SIZE = 1 << 8; // feel free to change the size of array
const int SIZE = 1 << 22; // feel free to change the size of array
const int NPOT = SIZE - 3; // Non-Power-Of-Two
int *a = new int[SIZE];
int *b = new int[SIZE];
int *c = new int[SIZE];

int main(int argc, char* argv[]) {
// Scan tests
// Scan tests --------------------------------------

printf("\n");
printf("****************\n");
@@ -100,10 +102,10 @@ int main(int argc, char* argv[]) {
printf("** STREAM COMPACTION TESTS **\n");
printf("*****************************\n");

// Compaction tests
// Compaction tests --------------------------------------

genArray(SIZE - 1, a, 4); // Leave a 0 at the end to test that edge case
a[SIZE - 1] = 0;
genArray(SIZE - 1, a, 4);
a[SIZE - 1] = 1; // Leave a 1 at the end to test that edge case
printArray(SIZE, a, true);

int count, expectedCount, expectedNPOT;
@@ -147,7 +149,123 @@ int main(int argc, char* argv[]) {
//printArray(count, c, true);
printCmpLenResult(count, expectedNPOT, b, c);

system("pause"); // stop Win32 console from closing on exit

// radix sort tests --------------------------------------

printf("\n");
printf("*****************************\n");
printf("***** RADIX SORT TESTS ******\n");
printf("*****************************\n");

// pow2 test

zeroArray(8, a);
std::cout << " " << std::endl;
std::cout << "Radix Sort, hard-coded lecture example test" << std::endl;
std::cout << " " << std::endl;

// example from slides for debugging
a[0] = 4;
a[1] = 7;
a[2] = 2;
a[3] = 6;
a[4] = 3;
a[5] = 5;
a[6] = 1;
a[7] = 0;

printArray(8, a, true);

zeroArray(8, b);
printDesc("cpu sort (std::sort)");
StreamCompaction::CPU::sort(8, b, a);
printElapsedTime(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation(), "(std::chrono Measured)");
printArray(8, b, true);

zeroArray(8, c);
printDesc("radix sort");
StreamCompaction::Radix::sort(8, c, a);
printElapsedTime(StreamCompaction::Radix::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
printArray(8, c, true);
printCmpResult(8, b, c);

// pow2 test

zeroArray(SIZE, a);
std::cout << " " << std::endl;
std::cout << "Radix Sort, pow2 consecutive ints" << std::endl;
std::cout << " " << std::endl;

for (int i = 0; i < SIZE; i++) {
a[i] = i;
}

printArray(SIZE, a, true);

zeroArray(SIZE, b);
printDesc("cpu sort (std::sort)");
StreamCompaction::CPU::sort(SIZE, b, a);
printElapsedTime(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation(), "(std::chrono Measured)");
printArray(SIZE, b, true);

zeroArray(SIZE, c);
printDesc("radix sort");
StreamCompaction::Radix::sort(SIZE, c, a);
printElapsedTime(StreamCompaction::Radix::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
printArray(SIZE, c, true);
printCmpResult(SIZE, b, c);

// npot test

zeroArray(SIZE, a);
std::cout << " " << std::endl;
std::cout << "Radix Sort, non-pow2 consecutive ints" << std::endl;
std::cout << " " << std::endl;

for (int i = 0; i < NPOT; i++) {
a[i] = i;
}

printArray(NPOT, a, true);

zeroArray(NPOT, b);
printDesc("cpu sort (std::sort)");
StreamCompaction::CPU::sort(NPOT, b, a);
printElapsedTime(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation(), "(std::chrono Measured)");
printArray(NPOT, b, true);

zeroArray(NPOT, c);
printDesc("radix sort");
StreamCompaction::Radix::sort(NPOT, c, a);
printElapsedTime(StreamCompaction::Radix::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
printArray(NPOT, c, true);
printCmpResult(NPOT, b, c);

// npot shuffled

genArray(NPOT, a, NPOT);
std::cout << " " << std::endl;
std::cout << "Radix Sort, non-pow2 shuffled ints" << std::endl;
std::cout << " " << std::endl;

printArray(NPOT, a, true);

zeroArray(NPOT, b);
printDesc("cpu sort (std::sort)");
StreamCompaction::CPU::sort(NPOT, b, a);
printElapsedTime(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation(), "(std::chrono Measured)");
printArray(NPOT, b, true);

zeroArray(NPOT, c);
printDesc("radix sort");
StreamCompaction::Radix::sort(NPOT, c, a);
printElapsedTime(StreamCompaction::Radix::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
printArray(NPOT, c, true);
printCmpResult(NPOT, b, c);

// END TESTS ----------------------------------------

//system("pause"); // stop Win32 console from closing on exit
delete[] a;
delete[] b;
delete[] c;
1 change: 0 additions & 1 deletion src/testing_helpers.hpp
@@ -51,7 +51,6 @@ void onesArray(int n, int *a) {

void genArray(int n, int *a, int maxval) {
srand(time(nullptr));

for (int i = 0; i < n; i++) {
a[i] = rand() % maxval;
}
2 changes: 2 additions & 0 deletions stream_compaction/CMakeLists.txt
@@ -4,6 +4,7 @@ set(headers
"naive.h"
"efficient.h"
"thrust.h"
"radix.h"
)

set(sources
@@ -12,6 +13,7 @@ set(sources
"naive.cu"
"efficient.cu"
"thrust.cu"
"radix.cu"
)

list(SORT headers)
18 changes: 16 additions & 2 deletions stream_compaction/common.cu
@@ -1,4 +1,5 @@
#include "common.h"
#include <device_launch_parameters.h>

void checkCUDAErrorFn(const char *msg, const char *file, int line) {
cudaError_t err = cudaGetLastError();
@@ -23,7 +24,13 @@ namespace StreamCompaction {
* which map to 0 will be removed, and elements which map to 1 will be kept.
*/
__global__ void kernMapToBoolean(int n, int *bools, const int *idata) {
// TODO
int i = (blockIdx.x * blockDim.x) + threadIdx.x;

if (i >= n) {
return;
} else {
bools[i] = (idata[i] == 0) ? 0 : 1;
}
}

/**
@@ -32,7 +39,14 @@
*/
__global__ void kernScatter(int n, int *odata,
const int *idata, const int *bools, const int *indices) {
// TODO

int i = (blockIdx.x * blockDim.x) + threadIdx.x;

if (i >= n) {
return;
} else if (bools[i] == 1) {
odata[indices[i]] = idata[i];
}
}

}