120 changes: 114 additions & 6 deletions README.md
@@ -3,12 +3,120 @@ CUDA Stream Compaction

**University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 2**

* (TODO) YOUR NAME HERE
* (TODO) [LinkedIn](), [personal website](), [twitter](), etc.
* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab)
* Xinyu Niu
* [personal website](https://xinyuniu6.wixsite.com/my-site-1)
* Tested on: Windows 11, i9-13980HX @ 2.20GHz 16GB, RTX 4070 16185MB (Personal)

### (TODO: Your README)

Include analysis, etc. (Remember, this is public, so don't put
anything here that you don't want to share with the world.)
## Introduction

This project implements *Scan* (*Prefix Sum*) and *Stream Compaction* algorithms in CUDA. A scan computes the prefix sums of an array, and stream compaction removes the elements of an array that meet a given condition. In this project, the stream compaction implementations remove `0`s from an array of `int`s.
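
As a quick illustration, here is a minimal CPU sketch of the two operations on a tiny, made-up array (illustration only, not the project's code): an exclusive scan followed by compaction of the non-zero elements.

```
// Illustration only: exclusive scan and stream compaction on a tiny array.
#include <cstdio>

int main() {
    const int n = 8;
    int in[n]  = { 3, 0, 7, 0, 4, 1, 0, 3 };
    int scan[n];
    int out[n];

    // Exclusive prefix sum: scan[i] = sum of in[0..i-1].
    scan[0] = 0;
    for (int i = 1; i < n; i++) scan[i] = scan[i - 1] + in[i - 1];
    // scan = { 0, 3, 3, 10, 10, 14, 15, 15 }

    // Stream compaction: keep the non-zero elements, preserving order.
    int count = 0;
    for (int i = 0; i < n; i++)
        if (in[i] != 0) out[count++] = in[i];
    // out = { 3, 7, 4, 1, 3 }, count = 5

    for (int i = 0; i < count; i++) printf("%d ", out[i]);
    printf("\n");
    return 0;
}
```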

## Implemented Features

In this project, I completed the following features:

* CPU Scan & Stream Compaction
* Naive GPU Scan Algorithm
* Work-Efficient GPU Scan & Stream Compaction
* GPU Scan using Thrust
* Optimized Work-Efficient GPU Scan (Extra Credit)
* Radix Sort (Extra Credit)

## Performance Analysis

![](img/blocksize.png)

**Figure 1:** Elapsed time vs. block size, with the array size fixed at 2^25

From Figure 1, we can observe that below a block size of 32, block size has a significant effect on elapsed time, and a larger block size gives better performance for all three implementations. Once the block size reaches 32, it no longer has a considerable influence on elapsed time. We can also observe that, before optimization, the work-efficient scan is not actually efficient. This is caused by inefficient use of threads: launching the same number of blocks for every iteration leaves most threads idle at the deeper levels of the sweep. Compacting the launched threads (see the sketch under Extra Credit below), using shared memory, etc. can reduce this inefficiency.

![](img/arraysize.png)

**Figure 2:** Elapsed time vs. array size, with the block size fixed at 256

From Figure 2, we can observe that once the array size exceeds a certain threshold (about 2^22 in the collected data), elapsed time increases significantly and the performance differences between the scan methods become clear. From fastest to slowest, the ranking is: ```thrust > work-efficient GPU scan > naive GPU scan > CPU scan```. For smaller arrays, although it is hard to see in the graph, the CPU scan performs slightly better than the two GPU implementations I wrote, while thrust remains the fastest.

![](img/thrust.png)

**Figure 3:** Nsight trace with only the CPU and thrust scans enabled

From Figure 3, we can see that the largest part of the thrust timeline is spent in ```cudaMemcpyAsync``` and ```cudaStreamSynchronize```. ```cudaMemcpyAsync``` returns control to the host thread immediately, so the host does not block while the copy is in flight. Since our own implementations use the blocking ```cudaMemcpy```, that waiting time may be a bottleneck.
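
For reference, the thrust scan being measured boils down to something like the following (a minimal sketch; the buffer handling in the actual implementation may differ, and the vector names are illustrative):

```
// Sketch of an exclusive scan via thrust; vector names are illustrative.
#include <thrust/device_vector.h>
#include <thrust/scan.h>
#include <thrust/copy.h>

void thrustScan(int n, int *odata, const int *idata) {
    thrust::device_vector<int> dv_in(idata, idata + n);  // host -> device copy
    thrust::device_vector<int> dv_out(n);
    thrust::exclusive_scan(dv_in.begin(), dv_in.end(), dv_out.begin());
    thrust::copy(dv_out.begin(), dv_out.end(), odata);   // device -> host copy
}
```

The copies into and out of the ```device_vector```s are likely where most of the memcpy activity in the trace comes from.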

## Output
```
****************
** SCAN TESTS **
****************
[ 0 13 20 46 19 36 9 0 15 22 27 38 11 ... 18 0 ]
==== cpu scan, power-of-two ====
elapsed time: 428.08ms (std::chrono Measured)
[ 0 0 13 33 79 98 134 143 143 158 180 207 245 ... -2015722115 -2015722097 ]
==== cpu scan, non-power-of-two ====
elapsed time: 425.212ms (std::chrono Measured)
[ 0 0 13 33 79 98 134 143 143 158 180 207 245 ... -2015722147 -2015722135 ]
passed
==== naive scan, power-of-two ====
elapsed time: 314.354ms (CUDA Measured)
passed
==== naive scan, non-power-of-two ====
elapsed time: 334.496ms (CUDA Measured)
passed
==== work-efficient scan, power-of-two ====
elapsed time: 96.6155ms (CUDA Measured)
passed
==== work-efficient scan, non-power-of-two ====
elapsed time: 97.165ms (CUDA Measured)
passed
==== thrust scan, power-of-two ====
elapsed time: 10.5397ms (CUDA Measured)
passed
==== thrust scan, non-power-of-two ====
elapsed time: 12.058ms (CUDA Measured)
passed

*****************************
** STREAM COMPACTION TESTS **
*****************************
[ 2 0 0 0 0 0 0 0 2 0 0 2 0 ... 0 0 ]
==== cpu compact without scan, power-of-two ====
elapsed time: 425.212ms (std::chrono Measured)
[ 2 2 2 3 2 1 2 3 1 1 3 3 2 ... 1 3 ]
passed
==== cpu compact without scan, non-power-of-two ====
elapsed time: 425.212ms (std::chrono Measured)
[ 2 2 2 3 2 1 2 3 1 1 3 3 2 ... 1 1 ]
passed
==== cpu compact with scan ====
elapsed time: 2080.47ms (std::chrono Measured)
[ 2 2 2 3 2 1 2 3 1 1 3 3 2 ... 1 3 ]
passed
==== work-efficient compact, power-of-two ====
elapsed time: 202.213ms (CUDA Measured)
passed
==== work-efficient compact, non-power-of-two ====
elapsed time: 188.192ms (CUDA Measured)
passed

```
## Extra Credit
**1. Optimized GPU Efficient Scan**

I optimized performance by adjusting the number of blocks launched at each iteration of the up-sweep and down-sweep loops. During the up-sweep, the block count is halved each iteration; during the down-sweep, it is doubled each iteration, so only as many threads as actually have work are launched.
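
A minimal sketch of the idea, assuming the usual work-efficient kernels (the kernel names, the ```ilog2ceil``` helper, and the exact parameterization here are illustrative rather than the project's exact code):

```
// Sketch only: size the grid to the number of nodes active at each depth.
int levels = ilog2ceil(n);
int pow2n = 1 << levels;          // array padded to a power of two
int blockSize = 256;

// Up-sweep: active nodes halve each level, so the grid shrinks.
for (int d = 0; d < levels; d++) {
    int active = pow2n >> (d + 1);
    dim3 grid((active + blockSize - 1) / blockSize);
    kernUpSweep<<<grid, blockSize>>>(active, d, dev_data);
}

// Down-sweep: active nodes double each level, so the grid grows back.
for (int d = levels - 1; d >= 0; d--) {
    int active = pow2n >> (d + 1);
    dim3 grid((active + blockSize - 1) / blockSize);
    kernDownSweep<<<grid, blockSize>>>(active, d, dev_data);
}
```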

**2. Radix Sort**

I have implemented the Radix Sort algorithm, which can be called using ```StreamCompaction::Efficient::radixSort()```.
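
Each pass of the sort splits the array on one bit using the scan primitive. Below is a minimal CPU sketch of a single pass (the function and variable names are illustrative, not the project's API); the GPU version performs the same split with kernels and repeats it for every bit, ping-ponging between buffers.

```
// Sketch only: one LSD radix-sort pass that splits on `bit` using an exclusive scan.
#include <vector>

void radixPass(int n, int bit, const int *in, int *out) {
    std::vector<int> e(n), f(n);
    for (int i = 0; i < n; i++)                 // e[i] = 1 if the bit is 0 ("false" bucket)
        e[i] = ((in[i] >> bit) & 1) ? 0 : 1;
    f[0] = 0;                                   // f = exclusive scan of e
    for (int i = 1; i < n; i++) f[i] = f[i - 1] + e[i - 1];
    int totalFalses = f[n - 1] + e[n - 1];
    for (int i = 0; i < n; i++) {
        // 0-bit elements go to f[i]; 1-bit elements go after all of them, in order.
        int dst = e[i] ? f[i] : (i - f[i] + totalFalses);
        out[dst] = in[i];
    }
}
```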

The output below compares the array sorted by my implementation against the same array sorted with ```std::sort```.

```
*****************************
** RADIX SORT TESTS **
*****************************
[ 0 17 0 16 17 5 18 9 5 18 19 18 10 ... 3 0 ]
==== Radix Sort ====
elapsed time: 8.7022ms (std::chrono Measured)
[ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 19 19 ]
passed
```
Binary file added img/arraysize.png
Binary file added img/blocksize.png
Binary file added img/out1.png
Binary file added img/out2.png
Binary file added img/out3.png
Binary file added img/thrust.png
30 changes: 29 additions & 1 deletion src/main.cpp
@@ -13,12 +13,14 @@
#include <stream_compaction/thrust.h>
#include "testing_helpers.hpp"

const int SIZE = 1 << 8; // feel free to change the size of array
const int SIZE = 1 << 28; // feel free to change the size of array
const int NPOT = SIZE - 3; // Non-Power-Of-Two
int *a = new int[SIZE];
int *b = new int[SIZE];
int *c = new int[SIZE];

#define NOTESTTHRUST 1

int main(int argc, char* argv[]) {
// Scan tests

@@ -47,6 +49,8 @@ int main(int argc, char* argv[]) {
printArray(NPOT, b, true);
printCmpResult(NPOT, b, c);

#if NOTESTTHRUST

zeroArray(SIZE, c);
printDesc("naive scan, power-of-two");
StreamCompaction::Naive::scan(SIZE, c, a);
@@ -81,6 +85,8 @@ int main(int argc, char* argv[]) {
//printArray(NPOT, c, true);
printCmpResult(NPOT, b, c);

#endif

zeroArray(SIZE, c);
printDesc("thrust scan, power-of-two");
StreamCompaction::Thrust::scan(SIZE, c, a);
@@ -95,6 +101,8 @@ int main(int argc, char* argv[]) {
//printArray(NPOT, c, true);
printCmpResult(NPOT, b, c);

#if NOTESTTHRUST

printf("\n");
printf("*****************************\n");
printf("** STREAM COMPACTION TESTS **\n");
@@ -147,6 +155,26 @@ int main(int argc, char* argv[]) {
//printArray(count, c, true);
printCmpLenResult(count, expectedNPOT, b, c);

printf("\n");
printf("*****************************\n");
printf("** RADIX SORT TESTS **\n");
printf("*****************************\n");

// Radixsort tests

genArray(SIZE - 1, a, 20); // Leave a 0 at the end to test that edge case
printArray(SIZE, a, true);

zeroArray(SIZE, b);
printDesc("Radix Sort");
StreamCompaction::Efficient::radixSort(SIZE, b, a);
std::sort(a, a + SIZE);
printElapsedTime(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation(), "(std::chrono Measured)");
printArray(SIZE, a, true);
printCmpResult(NPOT, a, b);

#endif

system("pause"); // stop Win32 console from closing on exit
delete[] a;
delete[] b;
6 changes: 6 additions & 0 deletions stream_compaction/common.cu
@@ -24,6 +24,9 @@ namespace StreamCompaction {
*/
__global__ void kernMapToBoolean(int n, int *bools, const int *idata) {
// TODO
int idx = (blockIdx.x * blockDim.x) + threadIdx.x;
if (idx < n)
bools[idx] = (idata[idx] != 0 ? 1 : 0); // 1 = keep this element, 0 = discard
}

/**
@@ -33,6 +36,9 @@
__global__ void kernScatter(int n, int *odata,
const int *idata, const int *bools, const int *indices) {
// TODO
int idx = (blockIdx.x * blockDim.x) + threadIdx.x;
if (idx < n && bools[idx] == 1)
odata[indices[idx]] = idata[idx]; // write each kept element to its compacted position
}

}
50 changes: 44 additions & 6 deletions stream_compaction/cpu.cu
@@ -3,6 +3,8 @@

#include "common.h"

int gNonCompact = 1; // set to 0 while compactWithScan() runs so scan() does not restart the already-running CPU timer

namespace StreamCompaction {
namespace CPU {
using StreamCompaction::Common::PerformanceTimer;
@@ -18,9 +20,16 @@ namespace StreamCompaction {
* (Optional) For better understanding before starting moving to GPU, you can simulate your GPU scan in this function first.
*/
void scan(int n, int *odata, const int *idata) {
timer().startCpuTimer();
if(gNonCompact)
timer().startCpuTimer();
// TODO
timer().endCpuTimer();
// exclusive prefix sum: odata[i] = sum of idata[0..i-1]
odata[0] = 0;
for (int i = 1; i < n; i++)
{
odata[i] = odata[i - 1] + idata[i - 1];
}
if(gNonCompact)
timer().endCpuTimer();
}

/**
@@ -29,10 +38,19 @@
* @returns the number of elements remaining after compaction.
*/
int compactWithoutScan(int n, int *odata, const int *idata) {
timer().startCpuTimer();
timer().startCpuTimer();
// TODO
timer().endCpuTimer();
return -1;
int count = 0;
for (int i = 0; i < n; i++)
{
if (idata[i] != 0)
{
odata[count] = idata[i];
count++;
}
}
timer().endCpuTimer();
return count;
}

/**
@@ -41,10 +59,30 @@
* @returns the number of elements remaining after compaction.
*/
int compactWithScan(int n, int *odata, const int *idata) {
gNonCompact = 0;
timer().startCpuTimer();
// TODO
// Step1: map
int* boolmap = new int[n];
for (int i = 0; i < n; i++)
{
boolmap[i] = (idata[i] != 0 ? 1 : 0);
}
// Step2: scan
scan(n, odata, boolmap);
int count = odata[n - 1] + boolmap[n - 1]; // exclusive scan, so add the last element's flag
// Step3: scatter in place (safe: each destination odata[k] is <= k, so no unread scan value is overwritten)
for (int k = 0; k < n; k++)
{
if (boolmap[k] != 0)
{
odata[odata[k]] = idata[k];
}
}
delete[] boolmap;

timer().endCpuTimer();
gNonCompact = 1; // re-enable scan()'s own timer for later calls
return -1;
return count;
}
}
}