120 changes: 114 additions & 6 deletions README.md
@@ -3,12 +3,120 @@ CUDA Stream Compaction

**University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 2**

* (TODO) YOUR NAME HERE
* (TODO) [LinkedIn](), [personal website](), [twitter](), etc.
* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab)
* Xinyu Niu
* [personal website](https://xinyuniu6.wixsite.com/my-site-1)
* Tested on: Windows 11, i9-13980HX @ 2.20GHz 16GB, RTX 4070 16185MB (Personal)

### (TODO: Your README)

Include analysis, etc. (Remember, this is public, so don't put
anything here that you don't want to share with the world.)
## Introduction

This project implements *Scan* (*Prefix Sum*) and *Stream Compaction* algorithms in CUDA. A scan computes the prefix sums of an array, and stream compaction removes the elements of an array that meet a given condition. In this project, the stream compaction implementations remove `0`s from an array of `int`s.
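
As a quick illustration, here is a minimal CPU sketch of the two operations on a tiny, made-up array (illustration only, not the project's code): an exclusive scan followed by compaction of the non-zero elements.

```
// Illustration only: exclusive scan and stream compaction on a tiny array.
#include <cstdio>

int main() {
    const int n = 8;
    int in[n]  = { 3, 0, 7, 0, 4, 1, 0, 3 };
    int scan[n];
    int out[n];

    // Exclusive prefix sum: scan[i] = sum of in[0..i-1].
    scan[0] = 0;
    for (int i = 1; i < n; i++) scan[i] = scan[i - 1] + in[i - 1];
    // scan = { 0, 3, 3, 10, 10, 14, 15, 15 }

    // Stream compaction: keep the non-zero elements, preserving order.
    int count = 0;
    for (int i = 0; i < n; i++)
        if (in[i] != 0) out[count++] = in[i];
    // out = { 3, 7, 4, 1, 3 }, count = 5

    for (int i = 0; i < count; i++) printf("%d ", out[i]);
    printf("\n");
    return 0;
}
```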

## Implemented Features

In this project, I completed the following features:

* CPU Scan & Stream Compaction
* Naive GPU Scan Algorithm
* Work-Efficient GPU Scan & Stream Compaction
* GPU Scan using Thrust
* Optimized Work-Efficient GPU Scan (Extra Credit)
* Radix Sort (Extra Credit)

## Performance Analysis

![](img/blocksize.png)

**Figure 1:** Elapsed time vs. block size, with the array size fixed at 2^25

From Figure 1, we can observe that below a block size of 32, block size has a significant effect on elapsed time, and a larger block size gives better performance for all three implementations. Once the block size reaches 32, it no longer has a considerable influence on elapsed time. We can also observe that, before optimization, the work-efficient scan is not actually efficient. This is caused by inefficient use of threads: launching the same number of blocks for every iteration leaves most threads idle at the deeper levels of the sweep. Compacting the launched threads (see the sketch under Extra Credit below), using shared memory, etc. can reduce this inefficiency.

![](img/arraysize.png)

**Figure 2:** Elapsed time vs. array size, with the block size fixed at 256

From Figure 2, we can observe that once the array size exceeds a certain threshold (about 2^22 in the collected data), elapsed time increases significantly and the performance differences between the scan methods become clear. From fastest to slowest, the ranking is: ```thrust > work-efficient GPU scan > naive GPU scan > CPU scan```. For smaller arrays, although it is hard to see in the graph, the CPU scan performs slightly better than the two GPU implementations I wrote, while thrust remains the fastest.

![](img/thrust.png)

**Figure 3:** Nsight trace with only the CPU and thrust scans enabled

From Figure 3, we can see that the largest part of the thrust timeline is spent in ```cudaMemcpyAsync``` and ```cudaStreamSynchronize```. ```cudaMemcpyAsync``` returns control to the host thread immediately, so the host does not block while the copy is in flight. Since our own implementations use the blocking ```cudaMemcpy```, that waiting time may be a bottleneck.
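
For reference, the thrust scan being measured boils down to something like the following (a minimal sketch; the buffer handling in the actual implementation may differ, and the vector names are illustrative):

```
// Sketch of an exclusive scan via thrust; vector names are illustrative.
#include <thrust/device_vector.h>
#include <thrust/scan.h>
#include <thrust/copy.h>

void thrustScan(int n, int *odata, const int *idata) {
    thrust::device_vector<int> dv_in(idata, idata + n);  // host -> device copy
    thrust::device_vector<int> dv_out(n);
    thrust::exclusive_scan(dv_in.begin(), dv_in.end(), dv_out.begin());
    thrust::copy(dv_out.begin(), dv_out.end(), odata);   // device -> host copy
}
```

The copies into and out of the ```device_vector```s are likely where most of the memcpy activity in the trace comes from.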

## Output
```
****************
** SCAN TESTS **
****************
[ 0 13 20 46 19 36 9 0 15 22 27 38 11 ... 18 0 ]
==== cpu scan, power-of-two ====
elapsed time: 428.08ms (std::chrono Measured)
[ 0 0 13 33 79 98 134 143 143 158 180 207 245 ... -2015722115 -2015722097 ]
==== cpu scan, non-power-of-two ====
elapsed time: 425.212ms (std::chrono Measured)
[ 0 0 13 33 79 98 134 143 143 158 180 207 245 ... -2015722147 -2015722135 ]
passed
==== naive scan, power-of-two ====
elapsed time: 314.354ms (CUDA Measured)
passed
==== naive scan, non-power-of-two ====
elapsed time: 334.496ms (CUDA Measured)
passed
==== work-efficient scan, power-of-two ====
elapsed time: 96.6155ms (CUDA Measured)
passed
==== work-efficient scan, non-power-of-two ====
elapsed time: 97.165ms (CUDA Measured)
passed
==== thrust scan, power-of-two ====
elapsed time: 10.5397ms (CUDA Measured)
passed
==== thrust scan, non-power-of-two ====
elapsed time: 12.058ms (CUDA Measured)
passed

*****************************
** STREAM COMPACTION TESTS **
*****************************
[ 2 0 0 0 0 0 0 0 2 0 0 2 0 ... 0 0 ]
==== cpu compact without scan, power-of-two ====
elapsed time: 425.212ms (std::chrono Measured)
[ 2 2 2 3 2 1 2 3 1 1 3 3 2 ... 1 3 ]
passed
==== cpu compact without scan, non-power-of-two ====
elapsed time: 425.212ms (std::chrono Measured)
[ 2 2 2 3 2 1 2 3 1 1 3 3 2 ... 1 1 ]
passed
==== cpu compact with scan ====
elapsed time: 2080.47ms (std::chrono Measured)
[ 2 2 2 3 2 1 2 3 1 1 3 3 2 ... 1 3 ]
passed
==== work-efficient compact, power-of-two ====
elapsed time: 202.213ms (CUDA Measured)
passed
==== work-efficient compact, non-power-of-two ====
elapsed time: 188.192ms (CUDA Measured)
passed

```
## Extra Credit
**1. Optimized GPU Efficient Scan**

I optimized performance by adjusting the number of blocks launched at each iteration of the up-sweep and down-sweep loops. During the up-sweep, the block count is halved each iteration; during the down-sweep, it is doubled each iteration, so only as many threads as actually have work are launched.
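
A minimal sketch of the idea, assuming the usual work-efficient kernels (the kernel names, the ```ilog2ceil``` helper, and the exact parameterization here are illustrative rather than the project's exact code):

```
// Sketch only: size the grid to the number of nodes active at each depth.
int levels = ilog2ceil(n);
int pow2n = 1 << levels;          // array padded to a power of two
int blockSize = 256;

// Up-sweep: active nodes halve each level, so the grid shrinks.
for (int d = 0; d < levels; d++) {
    int active = pow2n >> (d + 1);
    dim3 grid((active + blockSize - 1) / blockSize);
    kernUpSweep<<<grid, blockSize>>>(active, d, dev_data);
}

// Down-sweep: active nodes double each level, so the grid grows back.
for (int d = levels - 1; d >= 0; d--) {
    int active = pow2n >> (d + 1);
    dim3 grid((active + blockSize - 1) / blockSize);
    kernDownSweep<<<grid, blockSize>>>(active, d, dev_data);
}
```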

**2. Radix Sort**

I have implemented the Radix Sort algorithm, which can be called using ```StreamCompaction::Efficient::radixSort()```.
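
Each pass of the sort splits the array on one bit using the scan primitive. Below is a minimal CPU sketch of a single pass (the function and variable names are illustrative, not the project's API); the GPU version performs the same split with kernels and repeats it for every bit, ping-ponging between buffers.

```
// Sketch only: one LSD radix-sort pass that splits on `bit` using an exclusive scan.
#include <vector>

void radixPass(int n, int bit, const int *in, int *out) {
    std::vector<int> e(n), f(n);
    for (int i = 0; i < n; i++)                 // e[i] = 1 if the bit is 0 ("false" bucket)
        e[i] = ((in[i] >> bit) & 1) ? 0 : 1;
    f[0] = 0;                                   // f = exclusive scan of e
    for (int i = 1; i < n; i++) f[i] = f[i - 1] + e[i - 1];
    int totalFalses = f[n - 1] + e[n - 1];
    for (int i = 0; i < n; i++) {
        // 0-bit elements go to f[i]; 1-bit elements go after all of them, in order.
        int dst = e[i] ? f[i] : (i - f[i] + totalFalses);
        out[dst] = in[i];
    }
}
```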

The output below compares the array sorted by my implementation against the same array sorted with ```std::sort```.

```
*****************************
** RADIX SORT TESTS **
*****************************
[ 0 17 0 16 17 5 18 9 5 18 19 18 10 ... 3 0 ]
==== Radix Sort ====
elapsed time: 8.7022ms (std::chrono Measured)
[ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 19 19 ]
passed
```
Binary file added img/arraysize.png
Binary file added img/blocksize.png
Binary file added img/out1.png
Binary file added img/out2.png
Binary file added img/out3.png
Binary file added img/thrust.png
30 changes: 29 additions & 1 deletion src/main.cpp
@@ -13,12 +13,14 @@
#include <stream_compaction/thrust.h>
#include "testing_helpers.hpp"

const int SIZE = 1 << 8; // feel free to change the size of array
const int SIZE = 1 << 28; // feel free to change the size of array
const int NPOT = SIZE - 3; // Non-Power-Of-Two
int *a = new int[SIZE];
int *b = new int[SIZE];
int *c = new int[SIZE];

#define NOTESTTHRUST 1

int main(int argc, char* argv[]) {
// Scan tests

@@ -47,6 +49,8 @@ int main(int argc, char* argv[]) {
printArray(NPOT, b, true);
printCmpResult(NPOT, b, c);

#if NOTESTTHRUST

zeroArray(SIZE, c);
printDesc("naive scan, power-of-two");
StreamCompaction::Naive::scan(SIZE, c, a);
@@ -81,6 +85,8 @@ int main(int argc, char* argv[]) {
//printArray(NPOT, c, true);
printCmpResult(NPOT, b, c);

#endif

zeroArray(SIZE, c);
printDesc("thrust scan, power-of-two");
StreamCompaction::Thrust::scan(SIZE, c, a);
@@ -95,6 +101,8 @@ int main(int argc, char* argv[]) {
//printArray(NPOT, c, true);
printCmpResult(NPOT, b, c);

#if NOTESTTHRUST

printf("\n");
printf("*****************************\n");
printf("** STREAM COMPACTION TESTS **\n");
@@ -147,6 +155,26 @@ int main(int argc, char* argv[]) {
//printArray(count, c, true);
printCmpLenResult(count, expectedNPOT, b, c);

printf("\n");
printf("*****************************\n");
printf("** RADIX SORT TESTS **\n");
printf("*****************************\n");

// Radixsort tests

genArray(SIZE - 1, a, 20); // Leave a 0 at the end to test that edge case
printArray(SIZE, a, true);

zeroArray(SIZE, b);
printDesc("Radix Sort");
StreamCompaction::Efficient::radixSort(SIZE, b, a);
std::sort(a, a + SIZE);
printElapsedTime(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation(), "(std::chrono Measured)");
printArray(SIZE, a, true);
printCmpResult(NPOT, a, b);

#endif

system("pause"); // stop Win32 console from closing on exit
delete[] a;
delete[] b;
6 changes: 6 additions & 0 deletions stream_compaction/common.cu
@@ -24,6 +24,9 @@ namespace StreamCompaction {
*/
__global__ void kernMapToBoolean(int n, int *bools, const int *idata) {
// TODO
int idx = (blockIdx.x * blockDim.x) + threadIdx.x;
if (idx < n)
bools[idx] = (idata[idx] != 0 ? 1 : 0); // 1 = keep this element, 0 = discard
}

/**
@@ -33,6 +36,9 @@
__global__ void kernScatter(int n, int *odata,
const int *idata, const int *bools, const int *indices) {
// TODO
int idx = (blockIdx.x * blockDim.x) + threadIdx.x;
if (idx < n && bools[idx] == 1)
odata[indices[idx]] = idata[idx]; // write each kept element to its compacted position
}

}
50 changes: 44 additions & 6 deletions stream_compaction/cpu.cu
@@ -3,6 +3,8 @@

#include "common.h"

int gNonCompact = 1; // set to 0 while compactWithScan() runs so scan() does not restart the already-running CPU timer

namespace StreamCompaction {
namespace CPU {
using StreamCompaction::Common::PerformanceTimer;
@@ -18,9 +20,16 @@ namespace StreamCompaction {
* (Optional) For better understanding before starting moving to GPU, you can simulate your GPU scan in this function first.
*/
void scan(int n, int *odata, const int *idata) {
timer().startCpuTimer();
if(gNonCompact)
timer().startCpuTimer();
// TODO
timer().endCpuTimer();
// exclusive prefix sum: odata[i] = sum of idata[0..i-1]
odata[0] = 0;
for (int i = 1; i < n; i++)
{
odata[i] = odata[i - 1] + idata[i - 1];
}
if(gNonCompact)
timer().endCpuTimer();
}

/**
@@ -29,10 +38,19 @@
* @returns the number of elements remaining after compaction.
*/
int compactWithoutScan(int n, int *odata, const int *idata) {
timer().startCpuTimer();
timer().startCpuTimer();
// TODO
timer().endCpuTimer();
return -1;
int count = 0;
for (int i = 0; i < n; i++)
{
if (idata[i] != 0)
{
odata[count] = idata[i];
count++;
}
}
timer().endCpuTimer();
return count;
}

/**
@@ -41,10 +59,30 @@
* @returns the number of elements remaining after compaction.
*/
int compactWithScan(int n, int *odata, const int *idata) {
gNonCompact = 0;
timer().startCpuTimer();
// TODO
// Step1: map
int* boolmap = new int[n];
for (int i = 0; i < n; i++)
{
boolmap[i] = (idata[i] != 0 ? 1 : 0);
}
// Step2: scan
scan(n, odata, boolmap);
int count = odata[n - 1] + boolmap[n - 1]; // exclusive scan, so add the last element's flag
// Step3: scatter in place (safe: each destination odata[k] is <= k, so no unread scan value is overwritten)
for (int k = 0; k < n; k++)
{
if (boolmap[k] != 0)
{
odata[odata[k]] = idata[k];
}
}
delete[] boolmap;

timer().endCpuTimer();
gNonCompact = 1; // re-enable scan()'s own timer for later calls
return -1;
return count;
}
}
}