152 changes: 146 additions & 6 deletions README.md
@@ -3,12 +3,152 @@ CUDA Stream Compaction

**University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 2**

* (TODO) YOUR NAME HERE
* (TODO) [LinkedIn](), [personal website](), [twitter](), etc.
* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab)
* Yuhan Liu
* [LinkedIn](https://www.linkedin.com/in/yuhan-liu-), [personal website](https://liuyuhan.me/), [twitter](https://x.com/yuhanl_?lang=en), etc.
* Tested on: Windows 11 Pro, Ultra 7 155H @ 1.40 GHz 32GB, RTX 4060 8192MB (Personal Laptop)

### (TODO: Your README)
## README!

Include analysis, etc. (Remember, this is public, so don't put
anything here that you don't want to share with the world.)
### Project Description

* This project investigates parallel algorithms by implementing several versions of scan (prefix sum), stream compaction (removing unwanted elements from an array), and sorting. We begin with CPU implementations of each algorithm, then develop GPU versions, including naive and work-efficient approaches, for comparison.

### Performance Analysis

**Optimizing Block Size**

<img src="https://github.com/yuhanliu-tech/GPU-Stream-Compaction/blob/main/img/block_opt.png" width="600"/>

* To settle on a single block size for all of the parallel implementations, I measured the naive and work-efficient runtimes across increasing block sizes. The previous project established a block-size tradeoff between increasing GPU utilization and limiting the resources available to each block, so the optimal block size sits at the dip in the runtime curve. Although the candidates are close here, I chose a block size of 256 and used it for all subsequent measurements.
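* The launch configuration implied by a chosen block size follows the usual ceiling-division pattern. This is a generic sketch with a hypothetical helper name, not the project's exact code:

```cpp
// Ceiling division: enough blocks of `blockSize` threads to cover n elements.
int fullBlocksPerGrid(int n, int blockSize) {
    return (n + blockSize - 1) / blockSize;
}
// With blockSize = 256 and n = 2^18 elements, a kernel would launch
// 1024 blocks, e.g. kern<<<fullBlocksPerGrid(n, 256), 256>>>(...).
```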

#### Comparison of GPU Scan Implementations

| CPU | Naive | Work-Efficient | Thrust |
| :------------------------------: |:------------------------------: |:-----------------------------------------------: |:-----------------------------------------------:|
| For smaller datasets, the overhead of managing parallel execution on a GPU (e.g., kernel launches, thread synchronization) offsets the parallel advantage, making the CPU more efficient. As the data scales, however, the cost of scanning on the CPU grows linearly: in the graph below, where array sizes increase exponentially, the CPU runtime does as well. | The naive parallel approach rivals the work-efficient approach for most of the array sizes below. However, this method has a memory-latency bottleneck: its heavy use of global memory in the summation causes the runtime to swell as the array size grows. | The work-efficient GPU scan starts with more overhead than the naive implementation for smaller arrays, but as arrays grow larger, it outperforms both the CPU and naive versions. The work-efficient binary-tree structure reduces unnecessary memory operations and optimizes thread usage by performing operations in place. The bottleneck here is thread utilization: the binary tree leaves threads idle as the number of operations at each step of the algorithm shrinks. Because of this, more primitive methods like naive and CPU come close in runtime for smaller array sizes. | Referencing the NVIDIA developer [documentation](https://developer.nvidia.com/gpugems/gpugems3/part-vi-gpu-computing/chapter-39-parallel-prefix-sum-scan-cuda) for exclusive scan, we see that the Thrust implementation employs techniques to improve performance, including shared-memory optimization and load balancing. Setting up these optimizations, like bank-conflict avoidance and loop unrolling, introduces overhead, which makes the more primitive methods more efficient for small array sizes. Even so, the graph below shows that the Thrust scan's runtime grows far more slowly with array size than any of the other implementations. |

<img src="https://github.com/yuhanliu-tech/GPU-Stream-Compaction/blob/main/img/scan_perf.png" width="600"/>
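The up-sweep/down-sweep structure described in the work-efficient column can be sketched on the host as follows. This is a single-threaded illustration assuming a power-of-two length; the actual GPU version runs each tree level as a separate kernel launch over the array in global memory:

```cpp
#include <vector>

// In-place exclusive scan via up-sweep (reduce) and down-sweep,
// mirroring the work-efficient GPU algorithm on one thread.
// Assumes data.size() is a power of two.
void workEfficientScan(std::vector<int>& data) {
    int n = static_cast<int>(data.size());
    // Up-sweep: build partial sums up the binary tree.
    for (int stride = 1; stride < n; stride *= 2) {
        for (int i = 2 * stride - 1; i < n; i += 2 * stride) {
            data[i] += data[i - stride];
        }
    }
    // Down-sweep: zero the root, then push sums back down the tree.
    data[n - 1] = 0;
    for (int stride = n / 2; stride >= 1; stride /= 2) {
        for (int i = 2 * stride - 1; i < n; i += 2 * stride) {
            int t = data[i - stride];
            data[i - stride] = data[i];   // left child takes parent's value
            data[i] += t;                 // right child adds left subtree sum
        }
    }
}
```

Note how the inner loop touches half as many elements at each up-sweep level; on the GPU, this is exactly where threads go idle, which is the underutilization bottleneck described above.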

#### Output of Testing
test array SIZE: 2^18, blockSize: 256

```
****************
** SCAN TESTS **
****************
[ 35 49 36 24 1 48 46 6 49 24 44 47 5 ... 2 0 ]
==== cpu scan, power-of-two ====
elapsed time: 0.4305ms (std::chrono Measured)
[ 0 35 84 120 144 145 193 239 245 294 318 362 409 ... 6412415 6412417 ]
==== cpu scan, non-power-of-two ====
elapsed time: 0.5392ms (std::chrono Measured)
[ 0 35 84 120 144 145 193 239 245 294 318 362 409 ... 6412308 6412343 ]
passed
==== naive scan, power-of-two ====
elapsed time: 0.368768ms (CUDA Measured)
passed
==== naive scan, non-power-of-two ====
elapsed time: 0.219936ms (CUDA Measured)
passed
==== work-efficient scan, power-of-two ====
elapsed time: 0.333856ms (CUDA Measured)
passed
==== work-efficient scan, non-power-of-two ====
elapsed time: 0.366688ms (CUDA Measured)
passed
==== thrust scan, power-of-two ====
elapsed time: 1.55728ms (CUDA Measured)
passed
==== thrust scan, non-power-of-two ====
elapsed time: 0.344288ms (CUDA Measured)
passed

*****************************
** STREAM COMPACTION TESTS **
*****************************
[ 1 3 0 0 3 2 0 2 1 2 0 1 3 ... 2 1 ]
==== cpu compact without scan, power-of-two ====
elapsed time: 0.6932ms (std::chrono Measured)
[ 1 3 3 2 2 1 2 1 3 3 2 2 3 ... 2 1 ]
passed
==== cpu compact without scan, non-power-of-two ====
elapsed time: 0.7058ms (std::chrono Measured)
[ 1 3 3 2 2 1 2 1 3 3 2 2 3 ... 1 1 ]
passed
==== cpu compact with scan ====
elapsed time: 1.3968ms (std::chrono Measured)
[ 1 3 3 2 2 1 2 1 3 3 2 2 3 ... 2 1 ]
passed
==== work-efficient compact, power-of-two ====
elapsed time: 0.445152ms (CUDA Measured)
passed
==== work-efficient compact, non-power-of-two ====
elapsed time: 0.665728ms (CUDA Measured)
passed

*****************************
***** RADIX SORT TESTS ******
*****************************

Radix Sort, hard-coded lecture example test

[ 4 7 2 6 3 5 1 0 ]
==== cpu sort (std::sort) ====
elapsed time: 0.0002ms (std::chrono Measured)
[ 0 1 2 3 4 5 6 7 ]
==== radix sort ====
elapsed time: 5.5552ms (CUDA Measured)
[ 0 1 2 3 4 5 6 7 ]
passed

Radix Sort, pow2 consecutive ints

[ 0 1 2 3 4 5 6 7 8 9 10 11 12 ... 262142 262143 ]
==== cpu sort (std::sort) ====
elapsed time: 1.4753ms (std::chrono Measured)
[ 0 1 2 3 4 5 6 7 8 9 10 11 12 ... 262142 262143 ]
==== radix sort ====
elapsed time: 13.777ms (CUDA Measured)
[ 0 1 2 3 4 5 6 7 8 9 10 11 12 ... 262142 262143 ]
passed

Radix Sort, non-pow2 consecutive ints

[ 0 1 2 3 4 5 6 7 8 9 10 11 12 ... 262139 262140 ]
==== cpu sort (std::sort) ====
elapsed time: 1.3184ms (std::chrono Measured)
[ 0 1 2 3 4 5 6 7 8 9 10 11 12 ... 262139 262140 ]
==== radix sort ====
elapsed time: 13.37ms (CUDA Measured)
[ 0 1 2 3 4 5 6 7 8 9 10 11 12 ... 262139 262140 ]
passed

Radix Sort, non-pow2 shuffled ints

[ 4785 499 16736 27824 14951 31998 8696 23806 7549 30974 31344 14697 29955 ... 21785 7497 ]
==== cpu sort (std::sort) ====
elapsed time: 14.9359ms (std::chrono Measured)
[ 0 0 0 0 0 0 0 0 1 1 1 1 2 ... 32767 32767 ]
==== radix sort ====
elapsed time: 18.8853ms (CUDA Measured)
[ 0 0 0 0 0 0 0 0 1 1 1 1 2 ... 32767 32767 ]
passed
```

### Additional Feature: Radix Sort

* I implemented parallel radix sort as an additional module in the `stream_compaction` subproject, along with a CPU `std::sort` wrapper for comparison. The implementation can be found in `radix.cu` and `radix.h`.

* The radix sort function takes the size of the array along with pointers to the input and output arrays. Below is a code snippet from a radix test in the main method, showing how it is called:

```
StreamCompaction::Radix::sort(SIZE, c, a); // a is the existing input array to be sorted
printArray(SIZE, c, true); // print the sorted output array, which is written to c
```

* I wrote several tests for radix sort, comparing both the correctness and the runtime of my implementation against the CPU equivalent (`std::sort`). First, I compared radix sort with CPU sort on the fixed array [4, 7, 2, 6, 3, 5, 1, 0] from lecture. Then, I evaluated radix sort on consecutive-integer arrays of power-of-two and non-power-of-two sizes. Lastly, I tested radix sort on a shuffled array. For each test, I compared the results and performance of radix sort against CPU sort, measuring CUDA performance where applicable.
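* To illustrate what each pass of the algorithm does, here is a single-threaded sketch of least-significant-digit radix sort with hypothetical helper names; the GPU implementation instead derives the scatter indices from parallel scans over the per-bit boolean arrays:

```cpp
#include <vector>

// One LSD radix pass on a single bit: stably move elements whose
// bit is 0 ahead of elements whose bit is 1.
std::vector<int> radixPass(const std::vector<int>& in, int bit) {
    int n = static_cast<int>(in.size());
    std::vector<int> out(n);
    int zeros = 0;                                 // total elements with bit == 0
    for (int v : in) zeros += !((v >> bit) & 1);
    int zeroIdx = 0, oneIdx = zeros;               // ones start after all zeros
    for (int v : in) {
        if ((v >> bit) & 1) out[oneIdx++]  = v;
        else                out[zeroIdx++] = v;
    }
    return out;
}

// Full sort: repeat the stable pass for each bit, least significant first.
std::vector<int> radixSort(std::vector<int> data, int bits) {
    for (int b = 0; b < bits; ++b) data = radixPass(data, b);
    return data;
}
```

Each pass is stable, which is what makes the bit-by-bit iteration produce a fully sorted array after `bits` passes.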

**Radix Sort Performance Evaluation**
* Radix sort performs worse than `std::sort` for smaller array sizes due to the overhead and initialization costs of GPU processing and memory. However, as the array size grows, radix sort's performance improves relative to `std::sort`. This is consistent with the performance graph: radix sort has a complexity of O(n·k), where k is the number of bits, while the standard CPU sort has a complexity of O(n log n).

<img src="https://github.com/yuhanliu-tech/GPU-Stream-Compaction/blob/main/img/radix_perf.png" width="600"/>
Binary file added img/block_opt.png
Binary file added img/radix_perf.png
Binary file added img/scan_perf.png
130 changes: 124 additions & 6 deletions src/main.cpp
@@ -11,16 +11,18 @@
#include <stream_compaction/naive.h>
#include <stream_compaction/efficient.h>
#include <stream_compaction/thrust.h>
#include <stream_compaction/radix.h>
#include "testing_helpers.hpp"
#include <iostream>

const int SIZE = 1 << 8; // feel free to change the size of array
const int SIZE = 1 << 22; // feel free to change the size of array
const int NPOT = SIZE - 3; // Non-Power-Of-Two
int *a = new int[SIZE];
int *b = new int[SIZE];
int *c = new int[SIZE];

int main(int argc, char* argv[]) {
// Scan tests
// Scan tests --------------------------------------

printf("\n");
printf("****************\n");
@@ -100,10 +102,10 @@ int main(int argc, char* argv[]) {
printf("** STREAM COMPACTION TESTS **\n");
printf("*****************************\n");

// Compaction tests
// Compaction tests --------------------------------------

genArray(SIZE - 1, a, 4); // Leave a 0 at the end to test that edge case
a[SIZE - 1] = 0;
genArray(SIZE - 1, a, 4);
a[SIZE - 1] = 1; // Leave a 1 at the end to test that edge case
printArray(SIZE, a, true);

int count, expectedCount, expectedNPOT;
@@ -147,7 +149,123 @@ int main(int argc, char* argv[]) {
//printArray(count, c, true);
printCmpLenResult(count, expectedNPOT, b, c);

system("pause"); // stop Win32 console from closing on exit

// radix sort tests --------------------------------------

printf("\n");
printf("*****************************\n");
printf("***** RADIX SORT TESTS ******\n");
printf("*****************************\n");

// pow2 test

zeroArray(8, a);
std::cout << " " << std::endl;
std::cout << "Radix Sort, hard-coded lecture example test" << std::endl;
std::cout << " " << std::endl;

// example from slides for debugging
a[0] = 4;
a[1] = 7;
a[2] = 2;
a[3] = 6;
a[4] = 3;
a[5] = 5;
a[6] = 1;
a[7] = 0;

printArray(8, a, true);

zeroArray(8, b);
printDesc("cpu sort (std::sort)");
StreamCompaction::CPU::sort(8, b, a);
printElapsedTime(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation(), "(std::chrono Measured)");
printArray(8, b, true);

zeroArray(8, c);
printDesc("radix sort");
StreamCompaction::Radix::sort(8, c, a);
printElapsedTime(StreamCompaction::Radix::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
printArray(8, c, true);
printCmpResult(8, b, c);

// pow2 test

zeroArray(SIZE, a);
std::cout << " " << std::endl;
std::cout << "Radix Sort, pow2 consecutive ints" << std::endl;
std::cout << " " << std::endl;

for (int i = 0; i < SIZE; i++) {
a[i] = i;
}

printArray(SIZE, a, true);

zeroArray(SIZE, b);
printDesc("cpu sort (std::sort)");
StreamCompaction::CPU::sort(SIZE, b, a);
printElapsedTime(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation(), "(std::chrono Measured)");
printArray(SIZE, b, true);

zeroArray(SIZE, c);
printDesc("radix sort");
StreamCompaction::Radix::sort(SIZE, c, a);
printElapsedTime(StreamCompaction::Radix::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
printArray(SIZE, c, true);
printCmpResult(SIZE, b, c);

// npot test

zeroArray(SIZE, a);
std::cout << " " << std::endl;
std::cout << "Radix Sort, non-pow2 consecutive ints" << std::endl;
std::cout << " " << std::endl;

for (int i = 0; i < NPOT; i++) {
a[i] = i;
}

printArray(NPOT, a, true);

zeroArray(NPOT, b);
printDesc("cpu sort (std::sort)");
StreamCompaction::CPU::sort(NPOT, b, a);
printElapsedTime(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation(), "(std::chrono Measured)");
printArray(NPOT, b, true);

zeroArray(NPOT, c);
printDesc("radix sort");
StreamCompaction::Radix::sort(NPOT, c, a);
printElapsedTime(StreamCompaction::Radix::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
printArray(NPOT, c, true);
printCmpResult(NPOT, b, c);

// npot shuffled

genArray(NPOT, a, NPOT);
std::cout << " " << std::endl;
std::cout << "Radix Sort, non-pow2 shuffled ints" << std::endl;
std::cout << " " << std::endl;

printArray(NPOT, a, true);

zeroArray(NPOT, b);
printDesc("cpu sort (std::sort)");
StreamCompaction::CPU::sort(NPOT, b, a);
printElapsedTime(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation(), "(std::chrono Measured)");
printArray(NPOT, b, true);

zeroArray(NPOT, c);
printDesc("radix sort");
StreamCompaction::Radix::sort(NPOT, c, a);
printElapsedTime(StreamCompaction::Radix::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
printArray(NPOT, c, true);
printCmpResult(NPOT, b, c);

// END TESTS ----------------------------------------

//system("pause"); // stop Win32 console from closing on exit
delete[] a;
delete[] b;
delete[] c;
1 change: 0 additions & 1 deletion src/testing_helpers.hpp
@@ -51,7 +51,6 @@ void onesArray(int n, int *a) {

void genArray(int n, int *a, int maxval) {
srand(time(nullptr));

for (int i = 0; i < n; i++) {
a[i] = rand() % maxval;
}
2 changes: 2 additions & 0 deletions stream_compaction/CMakeLists.txt
@@ -4,6 +4,7 @@ set(headers
"naive.h"
"efficient.h"
"thrust.h"
"radix.h"
)

set(sources
@@ -12,6 +13,7 @@ set(sources
"naive.cu"
"efficient.cu"
"thrust.cu"
"radix.cu"
)

list(SORT headers)
18 changes: 16 additions & 2 deletions stream_compaction/common.cu
@@ -1,4 +1,5 @@
#include "common.h"
#include <device_launch_parameters.h>

void checkCUDAErrorFn(const char *msg, const char *file, int line) {
cudaError_t err = cudaGetLastError();
@@ -23,7 +24,13 @@ namespace StreamCompaction {
* which map to 0 will be removed, and elements which map to 1 will be kept.
*/
__global__ void kernMapToBoolean(int n, int *bools, const int *idata) {
// TODO
int i = (blockIdx.x * blockDim.x) + threadIdx.x;

if (i >= n) {
return;
} else {
bools[i] = (idata[i] == 0) ? 0 : 1;
}
}

/**
@@ -32,7 +39,14 @@
*/
__global__ void kernScatter(int n, int *odata,
const int *idata, const int *bools, const int *indices) {
// TODO

int i = (blockIdx.x * blockDim.x) + threadIdx.x;

if (i >= n) {
return;
} else if (bools[i] == 1) {
odata[indices[i]] = idata[i];
}
}

}