107 changes: 101 additions & 6 deletions README.md
@@ -3,12 +3,107 @@ CUDA Stream Compaction

**University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 2**

* (TODO) YOUR NAME HERE
* (TODO) [LinkedIn](), [personal website](), [twitter](), etc.
* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab)
* Zixiao Wang
* [LinkedIn](https://www.linkedin.com/in/zixiao-wang-826a5a255/)
* Tested on: Windows 11, i7-14700K @ 3.40 GHz, 32GB RAM, RTX 4070 Ti 12GB, 4K monitor (Personal PC)

### (TODO: Your README)
### Background

Include analysis, etc. (Remember, this is public, so don't put
anything here that you don't want to share with the world.)
* In this project, I implemented GPU stream compaction in CUDA from scratch. This algorithm is widely used and will be important for accelerating a path tracer. I implemented several versions of the Scan (Prefix Sum) algorithm, a critical building block for stream compaction:
  * CPU Implementation: a sequential CPU version of the Scan algorithm as a baseline.
  * GPU Implementations:
    * Naive Scan: a straightforward GPU version that parallelizes the Scan operation.
    * Work-Efficient Scan: an optimized GPU implementation that makes better use of parallel resources, minimizing idle threads and redundant computations.

  * GPU Stream Compaction: uses the Scan implementations to build a GPU-based stream compaction algorithm that efficiently removes zeros from an integer array.
* For more detail, see [GPU Gems 3, Chapter 39](https://developer.nvidia.com/gpugems/gpugems3/part-vi-gpu-computing/chapter-39-parallel-prefix-sum-scan-cuda).

### Prefix Sum (Scan)

![](img/ScanSpeed.png)
The chart compares the execution time (in milliseconds) of the scan implementations (CPU, Naive, Work-Efficient, and Thrust) across array sizes.
* The CPU implementation's running time grows significantly with array size; it is the slowest of all implementations, particularly for larger arrays.
* The Naive GPU scan performs better than the CPU version, but its running time still grows noticeably with array size. It uses no memory optimizations, and its iterative doubling approach leads to suboptimal performance, especially for larger arrays.
* The Work-Efficient scan outperforms the Naive scan but is slightly slower than the Thrust implementation at larger array sizes. It is optimized for better memory access patterns and performs fewer total operations.
* The Thrust library implementation is the fastest across all array sizes. Its execution time remains almost constant as the array size increases, indicating very efficient scaling and use of GPU resources.
#### Summary
Thrust is the clear winner for larger data sets, offering the best performance and scalability. For smaller arrays the CPU implementation's overhead is less visible, but as the array size increases the GPU's parallelism yields a dramatic improvement in running time. The Work-Efficient scan addresses some of the Naive algorithm's computational inefficiencies by reducing redundant calculations and improving parallelization. However, memory I/O remains a limiting factor because the implementation relies on frequent reads and writes to global memory.
![](img/NsightCompute.png)

* Memory-bound issues: low memory throughput across kernels suggests suboptimal memory access patterns, likely caused by excessive global memory operations. Optimizing with shared memory could improve performance substantially.
* Compute-bound issues: low compute throughput in the kernels indicates underutilization of GPU compute resources. This might be improved by increasing occupancy and reducing divergence between threads.
### Extra Feature
#### Optimized Work-Efficient Approach
![](img/Upsweep.png)

As the Up-Sweep phase picture shows, the number of threads needed is not fixed across depths: most launched threads become idle because only a few elements still need to be processed. For example, at level d, only N / 2^d threads out of the total launched are doing useful work. I therefore adjust the number of threads and blocks launched at each level to match the workload:
```cpp
int step = 1 << (d + 1);
int threads = new_n / step;
dim3 fullBlocksPerGrid((threads + block_size - 1) / block_size);
```
However, once these values are no longer fixed, the index calculation in the kernel must also be modified accordingly:
```cpp
int k = (blockIdx.x * blockDim.x) + threadIdx.x;
int step = 1 << (d + 1);
int index = k * step + (step - 1);
```
### Test Result
Tested with array size 1 << 22 and block size 128:
```bash
****************
** SCAN TESTS **
****************
[ 21 19 39 37 37 32 3 36 8 22 22 5 49 ... 20 0 ]
==== cpu scan, power-of-two ====
elapsed time: 5.5752ms (std::chrono Measured)
[ 0 21 40 79 116 153 185 188 224 232 254 276 281 ... 102730753 102730773 ]
==== cpu scan, non-power-of-two ====
elapsed time: 5.5342ms (std::chrono Measured)
[ 0 21 40 79 116 153 185 188 224 232 254 276 281 ... 102730634 102730677 ]
passed
==== naive scan, power-of-two ====
elapsed time: 0.57712ms (CUDA Measured)
passed
==== naive scan, non-power-of-two ====
elapsed time: 0.496128ms (CUDA Measured)
passed
==== work-efficient scan, power-of-two ====
elapsed time: 0.489792ms (CUDA Measured)
passed
==== work-efficient scan, non-power-of-two ====
elapsed time: 0.549632ms (CUDA Measured)
passed
==== thrust scan, power-of-two ====
elapsed time: 1.10934ms (CUDA Measured)
passed
==== thrust scan, non-power-of-two ====
elapsed time: 0.57584ms (CUDA Measured)
passed

*****************************
** STREAM COMPACTION TESTS **
*****************************
[ 1 1 3 1 1 0 1 0 0 2 2 1 3 ... 2 0 ]
==== cpu compact without scan, power-of-two ====
elapsed time: 6.5848ms (std::chrono Measured)
[ 1 1 3 1 1 1 2 2 1 3 1 2 1 ... 2 2 ]
passed
==== cpu compact without scan, non-power-of-two ====
elapsed time: 6.5552ms (std::chrono Measured)
[ 1 1 3 1 1 1 2 2 1 3 1 2 1 ... 1 2 ]
passed
==== cpu compact with scan ====
elapsed time: 17.0166ms (std::chrono Measured)
[ 1 1 3 1 1 1 2 2 1 3 1 2 1 ... 2 2 ]
passed
==== work-efficient compact, power-of-two ====
elapsed time: 0.525504ms (CUDA Measured)
passed
==== work-efficient compact, non-power-of-two ====
elapsed time: 0.521216ms (CUDA Measured)
passed
```
### Modification to CMakeLists
Added `radixsort.cu` and `radixsort.h` and added entries for them in `CMakeLists.txt`.

Binary file added img/NsightCompute.png
Binary file added img/ScanSpeed.png
Binary file added img/Upsweep.png
2 changes: 1 addition & 1 deletion src/main.cpp
@@ -13,7 +13,7 @@
#include <stream_compaction/thrust.h>
#include "testing_helpers.hpp"

const int SIZE = 1 << 8; // feel free to change the size of array
const int SIZE = 1 << 22; // feel free to change the size of array
const int NPOT = SIZE - 3; // Non-Power-Of-Two
int *a = new int[SIZE];
int *b = new int[SIZE];
14 changes: 14 additions & 0 deletions stream_compaction/common.cu
@@ -24,6 +24,12 @@ namespace StreamCompaction {
*/
__global__ void kernMapToBoolean(int n, int *bools, const int *idata) {
// TODO

int index = (blockIdx.x * blockDim.x) + threadIdx.x;

if (index < n) {
bools[index] = (idata[index] != 0) ? 1 : 0;
}
}

/**
@@ -33,6 +39,14 @@ namespace StreamCompaction {
__global__ void kernScatter(int n, int *odata,
const int *idata, const int *bools, const int *indices) {
// TODO
int index = (blockIdx.x * blockDim.x) + threadIdx.x;

if (index < n) {
if (bools[index] == 1) {
odata[indices[index]] = idata[index];
}
}

}

}
58 changes: 56 additions & 2 deletions stream_compaction/cpu.cu
@@ -2,6 +2,7 @@
#include "cpu.h"

#include "common.h"
#include <iostream>

namespace StreamCompaction {
namespace CPU {
@@ -20,6 +21,18 @@ namespace StreamCompaction {
void scan(int n, int *odata, const int *idata) {
timer().startCpuTimer();
// TODO
if (n == 0) return;


odata[0] = 0;


for (int i = 1; i < n; i++) {
odata[i] = odata[i - 1] + idata[i - 1];


}

timer().endCpuTimer();
}

@@ -31,8 +44,16 @@ namespace StreamCompaction {
int compactWithoutScan(int n, int *odata, const int *idata) {
timer().startCpuTimer();
// TODO
int count = 0;

for (int i = 0; i < n; i++) {
if (idata[i] != 0) {
odata[count] = idata[i];
count++;
}
}
timer().endCpuTimer();
return -1;
return count;
}

/**
@@ -41,10 +62,43 @@
* @returns the number of elements remaining after compaction.
*/
int compactWithScan(int n, int *odata, const int *idata) {
int* flag = new int[n];
int* scannedFlag = new int[n];


timer().startCpuTimer();
// TODO
for (int i = 0; i < n; i++) {
flag[i] = (idata[i] != 0) ? 1 : 0;

}



scannedFlag[0] = 0;


for (int i = 1; i < n; i++) {
scannedFlag[i] = scannedFlag[i - 1] + flag[i - 1];
}


int count = 0;

for (int i = 0; i < n; i++) {
if (flag[i] == 1) {
if (scannedFlag[i] < n) {
odata[scannedFlag[i]] = idata[i];
}

count++;
}
}

timer().endCpuTimer();
return -1;
delete[] flag;
delete[] scannedFlag;
return count;
}
}
}
120 changes: 117 additions & 3 deletions stream_compaction/efficient.cu
@@ -11,15 +11,82 @@ namespace StreamCompaction {
static PerformanceTimer timer;
return timer;
}
int block_size = 128;

__global__ void Upsweep_kernel(int n, int* data, int d) {
int k = (blockIdx.x * blockDim.x) + threadIdx.x;
int step = 1 << (d + 1);
int index = k * step + (step - 1);

if (index < n) {
int offset = 1 << d;
data[index] += data[index - offset];

}

}
__global__ void Downsweep_kernel(int n, int* data, int d)
{
int k = (blockIdx.x * blockDim.x) + threadIdx.x;
int step = 1 << (d + 1);
int index = k * step + (step - 1);

if (index < n) {
int offset = 1 << d;
int t = data[index - offset];
data[index - offset] = data[index];
data[index] += t;
}
}



/**
* Performs prefix-sum (aka scan) on idata, storing the result into odata.
*/
void scan(int n, int *odata, const int *idata) {
int d = ilog2ceil(n);
int new_n = 1 << d;

int* dev_data;
cudaMalloc((void**)&dev_data, new_n * sizeof(int));

cudaMemset(dev_data, 0, new_n * sizeof(int));
cudaMemcpy(dev_data, idata, n * sizeof(int), cudaMemcpyHostToDevice);


timer().startGpuTimer();
// TODO
//Upsweep
for (int h = 0; h < ilog2ceil(new_n); h++) {
int step = 1 << (h + 1);
int threads = new_n / step;
dim3 fullBlocksPerGrid((threads + block_size - 1) / block_size);
Upsweep_kernel<<<fullBlocksPerGrid, block_size>>>(new_n, dev_data, h);

}

cudaMemset(&dev_data[new_n - 1], 0, sizeof(int));

for (int d = ilog2ceil(new_n) - 1; d >= 0; d--) {
int step = 1 << (d + 1);
int threads = new_n / step;
dim3 fullBlocksPerGrid((threads + block_size - 1) / block_size);
Downsweep_kernel<<<fullBlocksPerGrid, block_size>>>(new_n, dev_data, d);


}


timer().endGpuTimer();

cudaMemcpy(odata, dev_data, n * sizeof(int), cudaMemcpyDeviceToHost);

cudaFree(dev_data);

}



/**
* Performs stream compaction on idata, storing the result into odata.
@@ -31,10 +98,57 @@
* @returns The number of elements remaining after compaction.
*/
int compact(int n, int *odata, const int *idata) {
timer().startGpuTimer();


int* dev_idata;
int* dev_odata;
int* dev_bools;
int* dev_indices;

cudaMalloc((void**)&dev_idata, n * sizeof(int));
cudaMalloc((void**)&dev_bools, n * sizeof(int));
cudaMalloc((void**)&dev_indices, n * sizeof(int));
cudaMalloc((void**)&dev_odata, n * sizeof(int));


cudaMemcpy(dev_idata, idata, n*sizeof(int), cudaMemcpyHostToDevice);
dim3 fullBlocksPerGrid((n + block_size - 1) / block_size);




//timer().startGpuTimer();
// TODO
timer().endGpuTimer();
return -1;
StreamCompaction::Common::kernMapToBoolean<<<fullBlocksPerGrid, block_size>>>(n, dev_bools, dev_idata);
cudaDeviceSynchronize();

scan(n, dev_indices, dev_bools);
cudaDeviceSynchronize();

StreamCompaction::Common::kernScatter<<<fullBlocksPerGrid, block_size>>>(n, dev_odata, dev_idata, dev_bools, dev_indices);
cudaDeviceSynchronize();




//timer().endGpuTimer();



int last_bool;
int last_index;
cudaMemcpy(&last_bool, dev_bools + n - 1, sizeof(int), cudaMemcpyDeviceToHost);
cudaMemcpy(&last_index, dev_indices + n - 1, sizeof(int), cudaMemcpyDeviceToHost);



cudaMemcpy(odata, dev_odata, (last_index + last_bool)* sizeof(int), cudaMemcpyDeviceToHost);

cudaFree(dev_idata);
cudaFree(dev_bools);
cudaFree(dev_indices);
cudaFree(dev_odata);
return last_index + last_bool;
}
}
}