101 changes: 94 additions & 7 deletions README.md
@@ -1,13 +1,100 @@
CUDA Stream Compaction
======================

**University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 2**

* Ziyu Li
* Tested on: Windows 7, i7-3840QM @ 2.8GHz, 16GB, Nvidia Quadro K4000M 4096MB (Personal Laptop)
## Performance Analysis
#### Efficient Scan without Optimization
This implementation performs an up-sweep (reduction) followed by a down-sweep on the GPU. It is much faster than the naive scan, but still slower than the CPU version.

For the benchmark and results, see the **Performance** section.
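The up-sweep/down-sweep structure can be sketched on the CPU as follows. This is a minimal reference for a power-of-two array, not the project's actual kernels; on the GPU, each level of each loop becomes one kernel launch.

```cpp
#include <vector>

// Exclusive scan via up-sweep (reduction) and down-sweep,
// mirroring the structure of the GPU kernels. n must be a power of two.
std::vector<int> efficientScan(std::vector<int> data) {
    const int n = static_cast<int>(data.size());
    // Up-sweep: build partial sums in place, doubling the stride each level.
    for (int d = 2; d <= n; d <<= 1) {
        for (int k = 0; k < n; k += d) {
            data[k + d - 1] += data[k + d / 2 - 1];
        }
    }
    // Down-sweep: zero the root, then push partial sums back down the tree.
    data[n - 1] = 0;
    for (int d = n; d >= 2; d >>= 1) {
        for (int k = 0; k < n; k += d) {
            int t = data[k + d / 2 - 1];
            data[k + d / 2 - 1] = data[k + d - 1];
            data[k + d - 1] += t;
        }
    }
    return data;
}
```

For example, `efficientScan({1, 2, 3, 4})` yields the exclusive prefix sums `{0, 1, 3, 6}`.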

#### Efficient Scan with Optimization
The basic efficient scan launches many threads that do no useful work at the deeper levels of the tree. By changing the kernels' indexing pattern so that only the active threads are launched, this optimization eliminates a large number of idle threads and improves overall performance.

For the benchmark and results, see the **Performance** section.
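The remapping can be illustrated with the up-sweep: at a level with stride `2^(d+1)`, only `n / 2^(d+1)` threads are needed, and a compact thread index `k` maps directly to the one element it updates. The sketch below is a hypothetical CPU rendering of that idea, not the project's actual kernel; the down-sweep is remapped analogously.

```cpp
#include <vector>

// Up-sweep with compact indexing: at each level, "thread" k is remapped to
// the single element it must update, so no idle threads are launched.
void upSweepCompact(std::vector<int>& data) {
    const int n = static_cast<int>(data.size()); // assumed power of two
    for (int stride = 2; stride <= n; stride <<= 1) {
        int active = n / stride;           // number of threads actually needed
        for (int k = 0; k < active; ++k) { // each k maps directly to its element
            int i = (k + 1) * stride - 1;
            data[i] += data[i - stride / 2];
        }
    }
}
```

With the naive indexing, every level would still launch `n` threads and mask most of them off with a divisibility test; here the launch count halves at every level.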


#### More Efficient Scan with Shared Memory
The optimized method above is still not efficient enough. Performing the operations in shared memory gets much closer to peak performance. The implementation splits into three parts:

* Scan each block separately and store each block's sum in an auxiliary array
* Scan the array of block sums
* Add each scanned block sum to every element of the corresponding block

![](img/39fig06.jpg)

(Figure 1: Algorithm for Performing a Sum Scan on a Large Array of Values, Nvidia GPU Gems)
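The three steps above can be sketched in host code as follows. Each step corresponds to a kernel launch in the actual implementation; the names here are illustrative, and `scanBlock` stands in for the shared-memory per-block scan kernel.

```cpp
#include <algorithm>
#include <vector>

// Serial exclusive scan, standing in for the shared-memory block-scan kernel.
static std::vector<int> scanBlock(const std::vector<int>& in) {
    std::vector<int> out(in.size(), 0);
    for (size_t i = 1; i < in.size(); ++i) out[i] = out[i - 1] + in[i - 1];
    return out;
}

// Hypothetical host-side sketch of the three-phase large-array scan.
std::vector<int> largeScan(const std::vector<int>& data, int blockSize) {
    const int n = static_cast<int>(data.size());
    const int numBlocks = (n + blockSize - 1) / blockSize;
    std::vector<int> out(n), blockSums(numBlocks);

    // 1. Scan each block separately and record each block's total.
    for (int b = 0; b < numBlocks; ++b) {
        int lo = b * blockSize, hi = std::min(n, lo + blockSize);
        std::vector<int> block(data.begin() + lo, data.begin() + hi);
        std::vector<int> scanned = scanBlock(block);
        std::copy(scanned.begin(), scanned.end(), out.begin() + lo);
        blockSums[b] = scanned.back() + block.back(); // block total
    }
    // 2. Scan the block sums to get each block's global offset.
    std::vector<int> sumOffsets = scanBlock(blockSums);
    // 3. Add each scanned block sum to every element of its block.
    for (int b = 0; b < numBlocks; ++b) {
        int lo = b * blockSize, hi = std::min(n, lo + blockSize);
        for (int i = lo; i < hi; ++i) out[i] += sumOffsets[b];
    }
    return out;
}
```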


This implementation is relatively easy to achieve; however, shared memory can suffer from bank conflicts, which hurt performance significantly on every access. To avoid bank conflicts, padding is added to the shared memory every certain number of elements. The offset can be implemented with a macro:

```c++
#define CONFLICT_FREE_OFFSET(n) \
    ((n) >> NUM_BANKS + (n) >> (2 * LOG_NUM_BANKS))
```
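Note that the macro as published in GPU Gems has an operator-precedence quirk (`+` binds tighter than `>>`). A common simplification pads once per bank, and the sketch below uses that simplified variant (my assumption, not necessarily the project's exact code) to show how the offset shifts shared-memory indices apart:

```cpp
#define LOG_NUM_BANKS 5  // 32 banks on modern Nvidia GPUs

// Simplified padding variant: one extra element per 2^LOG_NUM_BANKS elements,
// so indices that would map to the same bank are spread across banks.
#define CONFLICT_FREE_OFFSET(n) ((n) >> LOG_NUM_BANKS)

// How a kernel would apply it: pad every shared-memory index before use.
int paddedIndex(int i) { return i + CONFLICT_FREE_OFFSET(i); }
```

With this padding, index 32 (which would collide with index 0 in the same bank) becomes 33, landing in a different bank.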

For the benchmark and results, see the **Performance** section.

#### Radix Sort
One of the most significant applications of GPU scan is radix sort, a sorting algorithm well suited to parallel processors.

To use the radix sort, call the function below:
```c++
StreamCompaction::Radix::sort(int n, int *odata, int *idata);
```
* The first argument is the array size. (input)
* The second argument is the sorted array. (output)
* The third argument is the unsorted array. (input)
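Internally, a scan-based radix sort processes one bit per pass: map each element to a flag for the current bit, scan the flags to get scatter addresses, then scatter. The CPU sketch below shows that split primitive for non-negative ints; the actual `StreamCompaction::Radix::sort` is assumed to run the same steps as GPU kernels.

```cpp
#include <vector>

// One stable split pass per bit: elements whose current bit is 0 are
// scattered before those whose bit is 1. On the GPU, the running counts
// below would come from an exclusive scan over the bit flags.
void radixSort(std::vector<int>& a, int numBits = 31) {
    std::vector<int> tmp(a.size());
    for (int bit = 0; bit < numBits; ++bit) {
        int zeros = 0;                              // total elements with bit == 0
        for (int x : a) zeros += !((x >> bit) & 1);
        int zi = 0, oi = zeros;                     // scatter cursors
        for (int x : a) {
            if ((x >> bit) & 1) tmp[oi++] = x;
            else                tmp[zi++] = x;
        }
        a.swap(tmp);
    }
}
```

Because each pass is stable, after the last bit is processed the array is fully sorted.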

For the benchmark and results, see the **Performance** section.

## Performance
#### Scan Performance Measurement and Result
The benchmark performs the scan operation with 128 threads per block for array sizes from 2^4 to 2^22. (Since there is only one grid, 2^22 is the maximum array size for a block size of 128.)

The benchmark also compares the running times of the CPU, GPU Naive, GPU Efficient, GPU Efficient with Optimization, GPU Efficient with Shared Memory, and GPU Thrust scans.

![](img/scan_power_2.PNG)

![](img/scan_power_not_2.PNG)

![](img/scan.PNG)

(For detailed results, see the data in the **benchmark** folder.)

#### Compact Performance Measurement and Result
The benchmark performs the compaction operation with 128-512 threads per block for array sizes from 2^4 to 2^24 (block size 128 for array sizes 2^4 to 2^22, 256 for 2^23, and 512 for 2^24).

The benchmark also compares the running times of the CPU without scan, the CPU with scan, and the GPU with scan.

![](img/compaction_2.PNG)

![](img/compact.PNG)

(For detailed results, see the data in the **benchmark** folder.)

#### Radix Sort Performance Measurement and Result
The benchmark performs the radix sort with 128 threads per block for array sizes from 2^4 to 2^24.

The benchmark compares the running times of the CPU 3-part hybrid sort (the standard sort function in the STL) and the GPU radix sort.

![](img/radix_c.PNG)

![](img/radix_result.PNG)

## Questions
#### Roughly optimize the block sizes of each of your implementations for minimal run time on your GPU.

Based on benchmark results for large arrays (larger than 2^20), the optimal block sizes for each implementation are:

| Methods | Naive | Efficient | Efficient (Optimized) | Efficient (SharedMem) | Thrust |
|:----------:|-------|-----------|----------------------|----------------------|--------|
| Block Size | 1024 | 128 | 128 | 256 | 1024 |

#### Compare all of these GPU Scan implementations (Naive, Work-Efficient, and Thrust) to the serial CPU version of Scan. Plot a graph of the comparison

For the Thrust implementation, the largest share of GPU time goes to cudaMalloc- and cudaMemcpy-related calls, based on the Nsight timeline. The three most significant functions in the Thrust scan, *accumulate_tiles*, *exclusive_scan_n*, and *exclusive_downsweep*, do not actually use much GPU time.

I believe the performance bottleneck of the Thrust scan is memory bandwidth; its computation time is trivial compared to its memory I/O time. As for my implementations, the basic efficient method wastes a huge amount of time launching non-working threads, and for the efficient method with optimization, memory I/O becomes the most inefficient part of the whole system. Using shared memory greatly improves memory I/O efficiency and reduces memory latency, approaching maximum efficiency.

For the benchmarks and graphs, see the **Performance** section.
Binary file added img/39fig06.jpg
Binary file added img/compact.PNG
Binary file added img/compaction_2.PNG
Binary file added img/radix_c.PNG
Binary file added img/radix_result.PNG
Binary file added img/scan.PNG
Binary file added img/scan_power_2.PNG
Binary file added img/scan_power_not_2.PNG
119 changes: 110 additions & 9 deletions src/main.cpp
@@ -11,15 +11,39 @@
#include <stream_compaction/naive.h>
#include <stream_compaction/efficient.h>
#include <stream_compaction/thrust.h>
#include <stream_compaction/radix.h>
#include <algorithm>
#include "testing_helpers.hpp"

int SIZE = 1 << 20; // feel free to change the size of array
int NPOT = SIZE - 3; // Non-Power-Of-Two


using StreamCompaction::Common::PerformanceTimer;
PerformanceTimer& timer()
{
static PerformanceTimer timer;
return timer;
}

int main(int argc, char* argv[]) {
// Allow overriding the array-size exponent from the command line
// before any buffers are allocated, and keep NPOT in sync with SIZE.
if (argc == 2 && atoi(argv[1]) < 0) {
printf("---------------------------------------------------\n");
SIZE = 1 << (-1 * atoi(argv[1]));
NPOT = SIZE - 3;
}

int *a = (int *)malloc(SIZE * sizeof(int));
int *b = (int *)malloc(SIZE * sizeof(int));
int *c = (int *)malloc(SIZE * sizeof(int));

// Scan tests


printf("\n");
printf("****************\n");
printf("** SCAN TESTS **\n");
@@ -32,6 +56,7 @@ int main(int argc, char* argv[]) {
// initialize b using StreamCompaction::CPU::scan you implement
// We use b for further comparison. Make sure your StreamCompaction::CPU::scan is correct.
// At first all cases passed because b && c are all zeroes.

zeroArray(SIZE, b);
printDesc("cpu scan, power-of-two");
StreamCompaction::CPU::scan(SIZE, b, a);
@@ -59,20 +84,49 @@ int main(int argc, char* argv[]) {
//printArray(SIZE, c, true);
printCmpResult(NPOT, b, c);

zeroArray(SIZE, c);
printDesc("work-efficient scan, power-of-two");
StreamCompaction::Efficient::scan0(SIZE, c, a);
printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
//printArray(SIZE, c, true);
printCmpResult(SIZE, b, c);

zeroArray(SIZE, c);
printDesc("work-efficient scan, non-power-of-two");
StreamCompaction::Efficient::scan0(NPOT, c, a);
printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
//printArray(NPOT, c, true);
printCmpResult(NPOT, b, c);

zeroArray(SIZE, c);
printDesc("work-efficient scan with optimization, power-of-two");
StreamCompaction::Efficient::scan(SIZE, c, a);
printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
//printArray(SIZE, c, true);
printCmpResult(SIZE, b, c);

zeroArray(SIZE, c);
printDesc("work-efficient scan with optimization, non-power-of-two");
StreamCompaction::Efficient::scan(NPOT, c, a);
printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
//printArray(NPOT, c, true);
printCmpResult(NPOT, b, c);

zeroArray(SIZE, c);
printDesc("work-efficient scan with SHARE MEMORY and optimization, power-of-two");
StreamCompaction::Efficient::scan_s(SIZE, c, a);
printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
//printArray(SIZE, c, true);
printCmpResult(SIZE, b, c);

zeroArray(SIZE, c);
printDesc("work-efficient scan with SHARE MEMORY and optimization, non-power-of-two");
StreamCompaction::Efficient::scan_s(NPOT, c, a);
printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
//printArray(NPOT, c, true);
printCmpResult(NPOT, b, c);


zeroArray(SIZE, c);
printDesc("thrust scan, power-of-two");
StreamCompaction::Thrust::scan(SIZE, c, a);
@@ -86,7 +140,7 @@ int main(int argc, char* argv[]) {
printElapsedTime(StreamCompaction::Thrust::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
//printArray(NPOT, c, true);
printCmpResult(NPOT, b, c);

printf("\n");
printf("*****************************\n");
printf("** STREAM COMPACTION TESTS **\n");
@@ -119,12 +173,20 @@ int main(int argc, char* argv[]) {
printCmpLenResult(count, expectedNPOT, b, c);

zeroArray(SIZE, c);
printDesc("cpu compact with scan, power-of-two");
count = StreamCompaction::CPU::compactWithScan(SIZE, c, a);
printElapsedTime(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation(), "(std::chrono Measured)");
printArray(count, c, true);
printCmpLenResult(count, expectedCount, b, c);

zeroArray(SIZE, c);
printDesc("cpu compact with scan, non-power-of-two");
count = StreamCompaction::CPU::compactWithScan(NPOT, c, a);
printElapsedTime(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation(), "(std::chrono Measured)");
expectedNPOT = count;
printArray(count, c, true);
printCmpLenResult(count, expectedNPOT, b, c);

zeroArray(SIZE, c);
printDesc("work-efficient compact, power-of-two");
count = StreamCompaction::Efficient::compact(SIZE, c, a);
@@ -139,5 +201,44 @@ int main(int argc, char* argv[]) {
//printArray(count, c, true);
printCmpLenResult(count, expectedNPOT, b, c);



printf("\n");
printf("*****************************\n");
printf("** RADIX SORT TESTS **\n");
printf("*****************************\n");

genArray(SIZE - 1, a, 50); // Leave a 0 at the end to test that edge case
a[SIZE - 1] = 0;
printArray(SIZE, a, true);

memcpy(c, a, SIZE * sizeof(int));
timer().startCpuTimer();
std::sort(c, c + SIZE);
timer().endCpuTimer();
printElapsedTime(timer().getCpuElapsedTimeForPreviousOperation(), "(std::chrono Measured)");

zeroArray(SIZE, b);
printDesc("radix sort, power-of-two");
count = SIZE;
StreamCompaction::Radix::sort(SIZE, b, a);
printElapsedTime(StreamCompaction::Radix::timer().getGpuElapsedTimeForPreviousOperation(), "(cuda Measured)");
expectedCount = count;
printArray(count, b, true);
printCmpLenResult(count, expectedCount, b, c);

memcpy(c, a, SIZE * sizeof(int));
std::sort(c, c + NPOT);
zeroArray(SIZE, b);
printDesc("radix sort, not-power-of-two");
StreamCompaction::Radix::sort(NPOT, b, a);
printElapsedTime(StreamCompaction::Radix::timer().getGpuElapsedTimeForPreviousOperation(), "(cuda Measured)");
printArray(NPOT, b, true);
printCmpLenResult(NPOT, NPOT, b, c);


free(a);
free(b);
free(c);
system("pause"); // stop Win32 console from closing on exit
}
2 changes: 2 additions & 0 deletions stream_compaction/CMakeLists.txt
@@ -9,6 +9,8 @@ set(SOURCE_FILES
"efficient.cu"
"thrust.h"
"thrust.cu"
"radix.h"
"radix.cu"
)

cuda_add_library(stream_compaction
15 changes: 13 additions & 2 deletions stream_compaction/common.cu
@@ -23,7 +23,12 @@ namespace StreamCompaction {
* which map to 0 will be removed, and elements which map to 1 will be kept.
*/
__global__ void kernMapToBoolean(int n, int *bools, const int *idata) {
// TODO
int idx = threadIdx.x + (blockIdx.x * blockDim.x);
if (idx >= n) {
return;
}

bools[idx] = (bool)idata[idx];
}

/**
@@ -32,7 +37,13 @@
*/
__global__ void kernScatter(int n, int *odata,
const int *idata, const int *bools, const int *indices) {
// TODO
int idx = threadIdx.x + (blockIdx.x * blockDim.x);
if (idx >= n) {
return;
}
if (bools[idx]) {
odata[indices[idx]] = idata[idx];
}
}

}