144 changes: 138 additions & 6 deletions README.md
Original file line number Diff line number Diff line change
@@ -3,12 +3,144 @@ CUDA Stream Compaction

**University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 2**

* (TODO) YOUR NAME HERE
* (TODO) [LinkedIn](), [personal website](), [twitter](), etc.
* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab)
* Yifan Lu
* [LinkedIn](https://www.linkedin.com/in/yifan-lu-495559231/), [personal website](http://portfolio.samielouse.icu/)
* Tested on: Windows 11, AMD Ryzen 7 5800H 3.20 GHz, Nvidia GeForce RTX 3060 Laptop GPU (Personal Laptop)

### (TODO: Your README)
## Project Features
- CPU Scan & Stream Compaction
- Naive GPU Scan Algorithm
- Work-Efficient GPU Scan & Stream Compaction
- Thrust Scan
- **[Extra Credit]** Optimized work-efficient scan with bitwise kernel operations and dynamically sized kernel launches

Include analysis, etc. (Remember, this is public, so don't put
anything here that you don't want to share with the world.)
## Introduction

### Scan
A scan processes the elements of an array to produce a new array in which each position holds an aggregation (such as a sum, product, or logical operation) of the elements up to that position in the original array. In this project, we scan for sums.

There are two main variants of scan:

- **Inclusive Scan**

Each element in the output array includes the corresponding element from the input array in the sum.

- **Exclusive Scan**

Each element in the output array does not include the corresponding element from the input array, starting with an initial value (like zero for sum).


### Compaction
Compaction filters unwanted elements out of an array, producing a new array that contains only the elements satisfying a specific condition.

### Parallelization
Parallelizing the scan and compaction process can significantly improve its performance, especially on large datasets. In parallel implementations, the array can be divided among multiple processors or threads, each performing the mapping, scanning, and scattering on a segment of the array.


## Performance Analysis

### Blocksize Optimization

For my implementation and hardware, performance improves until the block size reaches 256; increasing the block size further does not change performance significantly.

This is the GPU time for the work-efficient method on an array of size $2^{15}$; it reaches its lowest point at block size 256.

![](img/blocksize.png)

### Compare GPU Scan Implementations

The following charts show the scan time for the CPU, GPU naive, GPU work-efficient, and Thrust implementations.

![](img/time.png)
![](img/timelog2.png)

Naive scan needs $O(\log_2 n)$ passes, each performing $O(n)$ additions. The work-efficient scan uses a balanced binary-tree structure, so a single traversal of the tree needs only $O(n)$ additions in total.

The work-efficient scan shows a significant improvement as the array size grows.

### Performance Bottlenecks

To trace the GPU bottlenecks during scanning, I profiled the release build with Nsight Systems.

![](img/b4numblock.png)

The screenshot above is taken from Nsight Systems; the highlighted time period corresponds to the work-efficient scan.

The graph shows time gaps between GPU SM executions, probably caused by non-coalesced accesses to global memory.

To speed up each kernel's computation, I replaced operations such as modulo and multiply/divide with bitwise operations.

For example, `index % offset == 0` becomes `(index & (offset - 1)) == 0`. Because `offset` is always a power of 2, `offset - 1` is a run of 1 bits covering exactly the low bits that the modulo would inspect; ANDing it with `index` keeps only those bits, and a result of 0 means `index` is divisible by `offset`.

Also, to avoid too many idle threads, I **dynamically change the number of blocks each kernel launches in the work-efficient method**.

After these optimizations, the timeline looks like the graph below, with longer SM active time. Some gaps remain, however, because of memory latency.

![](img/afternumblock.png)

### Test Output

The test below was run with array size $2^{10}$ and block size 256.

```

****************
** SCAN TESTS **
****************
[ 16 47 5 34 25 42 48 11 36 49 39 23 0 ... 42 0 ]
==== cpu scan, power-of-two ====
elapsed time: 0.0008ms (std::chrono Measured)
[ 0 16 63 68 102 127 169 217 228 264 313 352 375 ... 25401 25443 ]
==== cpu scan, non-power-of-two ====
elapsed time: 0.0008ms (std::chrono Measured)
[ 0 16 63 68 102 127 169 217 228 264 313 352 375 ... 25372 25394 ]
passed
==== naive scan, power-of-two ====
elapsed time: 0.342016ms (CUDA Measured)
[ 0 16 63 68 102 127 169 217 228 264 313 352 375 ... 25401 25443 ]
passed
==== naive scan, non-power-of-two ====
elapsed time: 0.105472ms (CUDA Measured)
[ 0 16 63 68 102 127 169 217 228 264 313 352 375 ... 0 0 ]
passed
==== work-efficient scan, power-of-two ====
elapsed time: 0.897024ms (CUDA Measured)
[ 0 16 63 68 102 127 169 217 228 264 313 352 375 ... 25401 25443 ]
passed
==== work-efficient scan, non-power-of-two ====
elapsed time: 0.27648ms (CUDA Measured)
[ 0 16 63 68 102 127 169 217 228 264 313 352 375 ... 25372 25394 ]
passed
==== thrust scan, power-of-two ====
elapsed time: 0.121856ms (CUDA Measured)
passed
==== thrust scan, non-power-of-two ====
elapsed time: 0.04096ms (CUDA Measured)
passed

*****************************
** STREAM COMPACTION TESTS **
*****************************
[ 3 1 3 1 1 3 1 1 0 1 1 1 3 ... 0 0 ]
==== cpu compact without scan, power-of-two ====
elapsed time: 0.0028ms (std::chrono Measured)
[ 3 1 3 1 1 3 1 1 1 1 1 3 1 ... 1 1 ]
passed
==== cpu compact without scan, non-power-of-two ====
elapsed time: 0.0022ms (std::chrono Measured)
[ 3 1 3 1 1 3 1 1 1 1 1 3 1 ... 1 1 ]
passed
==== cpu compact with scan ====
elapsed time: 0.0005ms (std::chrono Measured)
[ 3 1 3 1 1 3 1 1 1 1 1 3 1 ... 1 1 ]
passed
==== work-efficient compact, power-of-two ====
elapsed time: 0.27648ms (CUDA Measured)
[ 3 1 3 1 1 3 1 1 1 1 1 3 1 ... 1 1 ]
passed
==== work-efficient compact, non-power-of-two ====
elapsed time: 0.27648ms (CUDA Measured)
[ 3 1 3 1 1 3 1 1 1 1 1 3 1 ... 1 1 ]
passed
```

Binary file added img/afternumblock.png
Binary file added img/b4numblock.png
Binary file added img/blocksize.png
Binary file added img/nsystem_trace.png
Binary file added img/nsystem_trace2.png
Binary file added img/time.png
Binary file added img/timelog.png
Binary file added img/timelog2.png
20 changes: 11 additions & 9 deletions src/main.cpp
@@ -13,7 +13,7 @@
#include <stream_compaction/thrust.h>
#include "testing_helpers.hpp"

const int SIZE = 1 << 8; // feel free to change the size of array
const int SIZE = 1 << 10; // feel free to change the size of array
const int NPOT = SIZE - 3; // Non-Power-Of-Two
int *a = new int[SIZE];
int *b = new int[SIZE];
@@ -51,7 +51,7 @@ int main(int argc, char* argv[]) {
printDesc("naive scan, power-of-two");
StreamCompaction::Naive::scan(SIZE, c, a);
printElapsedTime(StreamCompaction::Naive::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
//printArray(SIZE, c, true);
printArray(SIZE, c, true);
printCmpResult(SIZE, b, c);

/* For bug-finding only: Array of 1s to help find bugs in stream compaction or scan
@@ -64,21 +64,21 @@
printDesc("naive scan, non-power-of-two");
StreamCompaction::Naive::scan(NPOT, c, a);
printElapsedTime(StreamCompaction::Naive::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
//printArray(SIZE, c, true);
printArray(SIZE, c, true);
printCmpResult(NPOT, b, c);

zeroArray(SIZE, c);
printDesc("work-efficient scan, power-of-two");
StreamCompaction::Efficient::scan(SIZE, c, a);
StreamCompaction::Efficient::scan_opt(SIZE, c, a);
printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
//printArray(SIZE, c, true);
printArray(SIZE, c, true);
printCmpResult(SIZE, b, c);

zeroArray(SIZE, c);
printDesc("work-efficient scan, non-power-of-two");
StreamCompaction::Efficient::scan(NPOT, c, a);
StreamCompaction::Efficient::scan_opt(NPOT, c, a);
printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
//printArray(NPOT, c, true);
printArray(NPOT, c, true);
printCmpResult(NPOT, b, c);

zeroArray(SIZE, c);
@@ -108,6 +108,7 @@ int main(int argc, char* argv[]) {

int count, expectedCount, expectedNPOT;


// initialize b using StreamCompaction::CPU::compactWithoutScan you implement
// We use b for further comparison. Make sure your StreamCompaction::CPU::compactWithoutScan is correct.
zeroArray(SIZE, b);
@@ -137,16 +138,17 @@
printDesc("work-efficient compact, power-of-two");
count = StreamCompaction::Efficient::compact(SIZE, c, a);
printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
//printArray(count, c, true);
printArray(count, c, true);
printCmpLenResult(count, expectedCount, b, c);

zeroArray(SIZE, c);
printDesc("work-efficient compact, non-power-of-two");
count = StreamCompaction::Efficient::compact(NPOT, c, a);
printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
//printArray(count, c, true);
printArray(count, c, true);
printCmpLenResult(count, expectedNPOT, b, c);


system("pause"); // stop Win32 console from closing on exit
delete[] a;
delete[] b;
5 changes: 4 additions & 1 deletion src/testing_helpers.hpp
@@ -10,7 +10,9 @@ template<typename T>
int cmpArrays(int n, T *a, T *b) {
for (int i = 0; i < n; i++) {
if (a[i] != b[i]) {
printf(" a[%d] = %d, b[%d] = %d\n", i, a[i], i, b[i]);
printf(" a[%d] = %d, b[%d] = %d\n", i, a[i], i, b[i]);
// Print the neighboring entries for context, guarding against
// out-of-bounds access at the ends of the arrays
if (i > 0) printf(" a[%d] = %d, b[%d] = %d\n", i - 1, a[i - 1], i - 1, b[i - 1]);
if (i + 1 < n) printf(" a[%d] = %d, b[%d] = %d\n", i + 1, a[i + 1], i + 1, b[i + 1]);
return 1;
}
}
@@ -58,6 +60,7 @@ void genArray(int n, int *a, int maxval) {
}

void printArray(int n, int *a, bool abridged = false) {
//printf("count: %d \n", n);
printf(" [ ");
for (int i = 0; i < n; i++) {
if (abridged && i + 2 == 15 && n > 16) {
14 changes: 14 additions & 0 deletions stream_compaction/common.cu
@@ -24,6 +24,12 @@ namespace StreamCompaction {
*/
__global__ void kernMapToBoolean(int n, int *bools, const int *idata) {
// TODO
int index = threadIdx.x + (blockIdx.x * blockDim.x);
if (index >= n) {
return;
}
bools[index] = (idata[index] != 0) ? 1 : 0;

}

/**
@@ -33,6 +39,14 @@
__global__ void kernScatter(int n, int *odata,
const int *idata, const int *bools, const int *indices) {
// TODO
int index = threadIdx.x + (blockIdx.x * blockDim.x);
if (index >= n) {
return;
}
if (bools[index] == 1) {
odata[indices[index]] = idata[index];
}

}

}
34 changes: 30 additions & 4 deletions stream_compaction/cpu.cu
@@ -20,6 +20,10 @@ namespace StreamCompaction {
void scan(int n, int *odata, const int *idata) {
timer().startCpuTimer();
// TODO
odata[0] = 0;
for (int i = 1; i < n; i++) {
odata[i] = odata[i - 1] + idata[i - 1];
}
timer().endCpuTimer();
}

@@ -31,8 +35,15 @@
int compactWithoutScan(int n, int *odata, const int *idata) {
timer().startCpuTimer();
// TODO
int count = 0;
for (int i = 0; i < n; i++) {
if (idata[i] != 0) {
odata[count] = idata[i];
count++;
}
}
timer().endCpuTimer();
return -1;
return count;
}

/**
@@ -41,10 +52,25 @@
* @returns the number of elements remaining after compaction.
*/
int compactWithScan(int n, int *odata, const int *idata) {
timer().startCpuTimer();
//timer().startCpuTimer();
// TODO
timer().endCpuTimer();
return -1;
int *bools = new int[n];
for (int i = 0; i < n; i++) {
bools[i] = (idata[i] != 0) ? 1 : 0;
}
int* scanResult = new int[n];
scan(n, scanResult, bools);
// scatter
// exclusive scan: total count = last scanned index + last flag
int count = scanResult[n - 1] + bools[n - 1];
for (int i = 0; i < n; i++) {
if (bools[i] == 1) {
odata[scanResult[i]] = idata[i];
}
}
//timer().endCpuTimer();
delete[] bools;
delete[] scanResult;
return count;
}
}
}