153 changes: 147 additions & 6 deletions README.md
@@ -3,12 +3,153 @@ CUDA Stream Compaction

**University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 2**

* (TODO) YOUR NAME HERE
* (TODO) [LinkedIn](), [personal website](), [twitter](), etc.
* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab)
* Kevin Dong
* [LinkedIn](https://www.linkedin.com/in/xingyu-dong)
* Tested on: Windows 11, Intel(R) Core(TM) i7-10750H @ 2.60GHz, GTX 2060

### (TODO: Your README)
This repo implements several versions of the parallel exclusive scan algorithm and uses them to implement stream
compaction. The work-efficient scan algorithm is also used to implement a parallel radix sort.

Include analysis, etc. (Remember, this is public, so don't put
anything here that you don't want to share with the world.)
### Features

#### CPU Scan and Compaction
The CPU scan implementation is a straightforward $O(n)$ algorithm. The compaction algorithms, with and without
scan, also run in $O(n)$ time. All of them perform well on small inputs but slow down as the input size grows.
#### Naive GPU Scan
The naive parallel scan finishes in $O(\log n)$ parallel steps but still performs $O(n \log n)$ additions in
total, so it does no less work than the sequential algorithm.
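The per-level structure can be sketched sequentially, where the inner loop stands in for the $n$ parallel threads launched at each level (an illustrative sketch, not the repo's kernel code):

```cpp
#include <algorithm>
#include <vector>

// Each pass simulates one parallel level: every element with
// i >= offset adds in the value offset positions behind it.
// After ceil(log2(n)) passes this yields an inclusive scan;
// shifting right by one and inserting 0 makes it exclusive.
std::vector<int> naiveScan(std::vector<int> data) {
    size_t n = data.size();
    std::vector<int> buf(n);
    for (size_t offset = 1; offset < n; offset <<= 1) {
        for (size_t i = 0; i < n; ++i) {  // these n iterations run as threads on the GPU
            buf[i] = data[i] + (i >= offset ? data[i - offset] : 0);
        }
        std::swap(data, buf);             // double buffering avoids read/write races
    }
    std::vector<int> out(n, 0);
    for (size_t i = 1; i < n; ++i) out[i] = data[i - 1];
    return out;
}
```

Each of the $\log n$ levels touches all $n$ elements, which is where the $O(n \log n)$ addition count comes from.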
#### Work-Efficient GPU Scan and Compaction
The work-efficient scan improves upon the naive algorithm by treating the array as a balanced binary tree. It first
performs a parallel reduction (the up-sweep), then a down-sweep that yields the final scan result. This algorithm
performs $O(n)$ additions and is much faster than the naive scan. To accommodate input sizes that are not powers of 2,
we enlarge the input array to the next power of 2 and pad the tail with 0s.
#### Work-Efficient Scan Optimization (Part 5 Extra Credit)
We can further optimize the work-efficient scan by reducing the number of threads launched. At deeper tree levels
only a fraction of the nodes need updating in each iteration, so instead of launching a thread per element and
letting most of them exit early, we launch exactly as many threads as there are active nodes.
#### Thrust Scan
The Thrust implementation simply calls `thrust::exclusive_scan` from the Thrust API to perform the scan.
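For reference, C++17's `std::exclusive_scan` has the same call shape on the host as `thrust::exclusive_scan` does on device iterators (a host-side analogue, not the actual device code):

```cpp
#include <numeric>
#include <vector>

// Host analogue of the Thrust call: exclusive_scan(first, last, result, init).
std::vector<int> hostExclusiveScan(const std::vector<int>& in) {
    std::vector<int> out(in.size());
    std::exclusive_scan(in.begin(), in.end(), out.begin(), 0);
    return out;
}
```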
#### Radix Sort (Extra Credit 1)
The parallel radix sort uses the work-efficient exclusive scan as a building block. We iterate over $k$
bits, and each pass's exclusive scan takes $O(\log n)$ parallel steps, so the total parallel time is
$O(k \cdot \log n)$: generating the bit array and scattering are each a single $O(1)$ parallel step.

### Performance Analysis
We compare the performance of the different scan algorithms across input sizes. The graph below is plotted on a
log scale.
![Performance Graph](img/Figure_2.png)
From the graph, we can see that the thrust scan performs best among all the scan algorithms, followed by the
work-efficient scan. When the input size is small, the CPU implementation is the fastest, but it quickly falls
behind as the input grows. The naive scan also degrades on very large inputs.

#### Performance Bottlenecks
Our timers for all the algorithms exclude memory allocation and the final device-to-host copy of the result array.
The measured time therefore mainly reflects computation, plus the memory accesses performed during that computation.
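The same bracketing can be illustrated with `std::chrono` on the host (a sketch of the methodology, not the project's timer class):

```cpp
#include <chrono>
#include <vector>

// Measure only the scan loop itself, mirroring how the project's timers
// bracket the kernel work but not cudaMalloc/cudaMemcpy.
double timeScanMs(const std::vector<int>& in, std::vector<int>& out) {
    out.assign(in.size(), 0);                 // allocation happens before the timer
    auto start = std::chrono::high_resolution_clock::now();
    for (size_t i = 1; i < in.size(); ++i) {  // timed region: computation only
        out[i] = out[i - 1] + in[i - 1];
    }
    auto end = std::chrono::high_resolution_clock::now();
    return std::chrono::duration<double, std::milli>(end - start).count();
}
```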

The CPU implementation is efficient and hard to improve further, since it is already a linear-time algorithm. The
runtime results also show that it is very fast on small inputs, because unlike its GPU counterparts it incurs
almost no launch or transfer overhead.

The naive algorithm generally performs worse than the work-efficient one: it performs more additions than even the
CPU implementation and leaves many threads idle. Since our implementation uses global memory, memory access latency
can also slow it down; a shared-memory version would likely perform better.

The work-efficient algorithm improves on the naive one by organizing the computation as a balanced binary tree, so
the up-sweep and down-sweep together perform only $O(n)$ additions, making the algorithm more efficient. It does
issue more memory accesses, since more steps are involved and all data live in global memory, so moving to a
shared-memory model should greatly improve its performance. In Part 5, we implemented an optimization that launches
fewer threads and replaces the modulo operation with a cheaper comparison, which should make the algorithm more
efficient.

The thrust implementation is the most efficient when the input size is very large. It presumably uses shared memory
and carefully tuned warp and register usage, which makes it very efficient on large-scale data.

![Thrust Analysis](img/Nsight.png)

From the Nsight Systems timeline, we can see that the thrust implementation sustains a high SM warp occupancy,
indicating that it utilizes GPU resources effectively. There are also DRAM read and write operations, likely the
memory traffic from the up-sweep and down-sweep phases. Overall, the thrust implementation appears well optimized.


### Output
This output was generated with an input size of $2^{20}$ and a block size of $256$.
```
****************
** SCAN TESTS **
****************
[ 28 28 11 44 31 11 4 25 23 43 12 12 28 ... 48 0 ]
==== cpu scan, power-of-two ====
elapsed time: 1.5931ms (std::chrono Measured)
[ 0 28 56 67 111 142 153 157 182 205 248 260 272 ... 25666831 25666879 ]
passed
==== cpu scan, non-power-of-two ====
elapsed time: 1.4801ms (std::chrono Measured)
[ 0 28 56 67 111 142 153 157 182 205 248 260 272 ... 25666796 25666799 ]
passed
==== naive scan, power-of-two ====
elapsed time: 1.85376ms (CUDA Measured)
[ 0 28 56 67 111 142 153 157 182 205 248 260 272 ... 25666831 25666879 ]
passed
==== naive scan, non-power-of-two ====
elapsed time: 0.916128ms (CUDA Measured)
[ 0 28 56 67 111 142 153 157 182 205 248 260 272 ... 25666796 25666799 ]
passed
==== work-efficient scan, power-of-two ====
elapsed time: 1.6073ms (CUDA Measured)
[ 0 28 56 67 111 142 153 157 182 205 248 260 272 ... 25666831 25666879 ]
passed
==== work-efficient scan, non-power-of-two ====
elapsed time: 1.02006ms (CUDA Measured)
[ 0 28 56 67 111 142 153 157 182 205 248 260 272 ... 25666796 25666799 ]
passed
==== thrust scan, power-of-two ====
elapsed time: 0.83968ms (CUDA Measured)
[ 0 28 56 67 111 142 153 157 182 205 248 260 272 ... 25666831 25666879 ]
passed
==== thrust scan, non-power-of-two ====
elapsed time: 0.6672ms (CUDA Measured)
[ 0 28 56 67 111 142 153 157 182 205 248 260 272 ... 25666796 25666799 ]
passed

*****************************
** STREAM COMPACTION TESTS **
*****************************
[ 0 2 1 2 3 2 3 2 0 3 2 2 1 ... 2 0 ]
==== cpu compact without scan, power-of-two ====
elapsed time: 2.2588ms (std::chrono Measured)
[ 2 1 2 3 2 3 2 3 2 2 1 1 2 ... 1 2 ]
passed
==== cpu compact without scan, non-power-of-two ====
elapsed time: 2.1694ms (std::chrono Measured)
[ 2 1 2 3 2 3 2 3 2 2 1 1 2 ... 3 3 ]
passed
==== cpu compact with scan ====
elapsed time: 5.4433ms (std::chrono Measured)
[ 2 1 2 3 2 3 2 3 2 2 1 1 2 ... 1 2 ]
passed
==== work-efficient compact, power-of-two ====
elapsed time: 1.83184ms (CUDA Measured)
[ 2 1 2 3 2 3 2 3 2 2 1 1 2 ... 1 2 ]
passed
==== work-efficient compact, non-power-of-two ====
elapsed time: 1.1128ms (CUDA Measured)
[ 2 1 2 3 2 3 2 3 2 2 1 1 2 ... 3 3 ]
passed

**********************
** RADIX SORT TESTS **
**********************
[ 32 26 157 158 195 138 167 198 116 95 114 106 149 ... 30 0 ]
==== radix sort ====
elapsed time: 17.5402ms (CUDA Measured)
[ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 199 199 ]
==== thrust sort ====
elapsed time: 1.57117ms (CUDA Measured)
[ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 199 199 ]
passed
```
Binary file added img/Figure_1.png
Binary file added img/Figure_2.png
Binary file added img/Nsight.png
60 changes: 51 additions & 9 deletions src/main.cpp
@@ -11,14 +11,17 @@
#include <stream_compaction/naive.h>
#include <stream_compaction/efficient.h>
#include <stream_compaction/thrust.h>
#include <stream_compaction/radix.h>
#include "testing_helpers.hpp"

const int SIZE = 1 << 8; // feel free to change the size of array
const int SIZE = 1 << 20; // feel free to change the size of array
const int NPOT = SIZE - 3; // Non-Power-Of-Two
int *a = new int[SIZE];
int *b = new int[SIZE];
int *c = new int[SIZE];

int *d = new int[SIZE];

int main(int argc, char* argv[]) {
// Scan tests

@@ -39,6 +42,7 @@ int main(int argc, char* argv[]) {
StreamCompaction::CPU::scan(SIZE, b, a);
printElapsedTime(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation(), "(std::chrono Measured)");
printArray(SIZE, b, true);
printCmpResult(SIZE, b, b);

zeroArray(SIZE, c);
printDesc("cpu scan, non-power-of-two");
@@ -51,7 +55,7 @@ int main(int argc, char* argv[]) {
printDesc("naive scan, power-of-two");
StreamCompaction::Naive::scan(SIZE, c, a);
printElapsedTime(StreamCompaction::Naive::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
//printArray(SIZE, c, true);
printArray(SIZE, c, true);
printCmpResult(SIZE, b, c);

/* For bug-finding only: Array of 1s to help find bugs in stream compaction or scan
@@ -64,37 +68,50 @@ int main(int argc, char* argv[]) {
printDesc("naive scan, non-power-of-two");
StreamCompaction::Naive::scan(NPOT, c, a);
printElapsedTime(StreamCompaction::Naive::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
//printArray(SIZE, c, true);
printArray(NPOT, c, true);
printCmpResult(NPOT, b, c);

zeroArray(SIZE, c);
printDesc("work-efficient scan, power-of-two");
StreamCompaction::Efficient::scan(SIZE, c, a);
printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
//printArray(SIZE, c, true);
printArray(SIZE, c, true);
printCmpResult(SIZE, b, c);

zeroArray(SIZE, c);
printDesc("work-efficient scan, non-power-of-two");
StreamCompaction::Efficient::scan(NPOT, c, a);
printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
//printArray(NPOT, c, true);
printArray(NPOT, c, true);
printCmpResult(NPOT, b, c);

zeroArray(SIZE, c);
printDesc("thrust scan, power-of-two");
StreamCompaction::Thrust::scan(SIZE, c, a);
printElapsedTime(StreamCompaction::Thrust::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
//printArray(SIZE, c, true);
printArray(SIZE, c, true);
printCmpResult(SIZE, b, c);

zeroArray(SIZE, c);
printDesc("thrust scan, non-power-of-two");
StreamCompaction::Thrust::scan(NPOT, c, a);
printElapsedTime(StreamCompaction::Thrust::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
//printArray(NPOT, c, true);
printArray(NPOT, c, true);
printCmpResult(NPOT, b, c);

// compare results between thrust array and cpu array
// zeroArray(SIZE, b);
// printDesc("compare thrust array and cpu array - CPU");
// StreamCompaction::CPU::scan(NPOT, b, a);
// printElapsedTime(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation(), "(std::chrono Measured)");
// printCmpResult(NPOT, b, b);
//
// zeroArray(SIZE, d);
// printDesc("compare thrust array and cpu array - Thrust");
// StreamCompaction::Thrust::scan(NPOT, d, a);
// printElapsedTime(StreamCompaction::Thrust::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
// printCmpResult(NPOT, d, b);

printf("\n");
printf("*****************************\n");
printf("** STREAM COMPACTION TESTS **\n");
@@ -137,18 +154,43 @@ int main(int argc, char* argv[]) {
printDesc("work-efficient compact, power-of-two");
count = StreamCompaction::Efficient::compact(SIZE, c, a);
printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
//printArray(count, c, true);
printArray(count, c, true);
printCmpLenResult(count, expectedCount, b, c);

zeroArray(SIZE, c);
printDesc("work-efficient compact, non-power-of-two");
count = StreamCompaction::Efficient::compact(NPOT, c, a);
printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
//printArray(count, c, true);
printArray(count, c, true);
printCmpLenResult(count, expectedNPOT, b, c);

printf("\n");
printf("**********************\n");
printf("** RADIX SORT TESTS **\n");
printf("**********************\n");

genArray(SIZE - 1, a, 200); // Leave a 0 at the end to test that edge case
a[SIZE - 1] = 0;
printArray(SIZE, a, true);

int maxBits = StreamCompaction::Radix::getMaxBits(SIZE, a);

zeroArray(SIZE, b);
printDesc("radix sort");
StreamCompaction::Radix::sort(SIZE, b, a, maxBits);
printElapsedTime(StreamCompaction::Radix::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
printArray(SIZE, b, true);

zeroArray(SIZE, c);
printDesc("thrust sort");
StreamCompaction::Thrust::sort(SIZE, c, a);
printElapsedTime(StreamCompaction::Thrust::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
printArray(SIZE, c, true);
printCmpResult(SIZE, b, c);

system("pause"); // stop Win32 console from closing on exit
delete[] a;
delete[] b;
delete[] c;
delete[] d;
}
4 changes: 3 additions & 1 deletion stream_compaction/CMakeLists.txt
@@ -4,6 +4,7 @@ set(headers
"naive.h"
"efficient.h"
"thrust.h"
"radix.h"
)

set(sources
@@ -12,7 +13,8 @@ set(sources
"naive.cu"
"efficient.cu"
"thrust.cu"
)
"radix.cu"
)

list(SORT headers)
list(SORT sources)
16 changes: 14 additions & 2 deletions stream_compaction/common.cu
@@ -23,7 +23,12 @@ namespace StreamCompaction {
* which map to 0 will be removed, and elements which map to 1 will be kept.
*/
__global__ void kernMapToBoolean(int n, int *bools, const int *idata) {
// TODO
int index = threadIdx.x + (blockIdx.x * blockDim.x);
if (index >= n) {
return;
}

bools[index] = idata[index] == 0 ? 0 : 1;
}

/**
@@ -32,7 +37,14 @@ namespace StreamCompaction {
*/
__global__ void kernScatter(int n, int *odata,
const int *idata, const int *bools, const int *indices) {
// TODO
int index = threadIdx.x + (blockIdx.x * blockDim.x);
if (index >= n) {
return;
}

if (bools[index] == 1) {
odata[indices[index]] = idata[index];
}
}

}