133 changes: 127 additions & 6 deletions README.md
@@ -3,12 +3,133 @@ CUDA Stream Compaction

**University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 2**

* Shreyas Singh
* [LinkedIn](https://linkedin.com/in/shreyassinghiitr), [Personal Website](https://github.com/shreyas3156).
* Tested on: Windows 10, i7-12700 @ 2.1GHz 32GB, T1000 (CETS Lab)

### About
This project implements the all-prefix-sums operation on an array of data, commonly known as
_Scan_, followed by _Stream Compaction_: building a new array containing only the input elements that satisfy a given criterion, while preserving their order. Throughout the project, _scan_ is used interchangeably with _exclusive scan_, where element _k_ of the result
is the sum of all input elements before, but not including, element _k_.

We present three implementations of Scan: a trivial sequential CPU scan, and two CUDA implementations of a parallel
scan algorithm, "Naive" and "Work-Efficient", which exploit the GPU's data parallelism for large performance gains
on big inputs.

Our stream compaction algorithm removes `0`s from an array of `int`s, using both the CPU scan and the work-efficient CUDA scan.
For a detailed treatment of scan and further reading, see [GPU Gems 3, Ch. 39](https://developer.nvidia.com/gpugems/gpugems3/part-vi-gpu-computing/chapter-39-parallel-prefix-sum-scan-cuda).


## Performance Analysis

The algorithms were timed against parameters such as `blockSize` and input length using the custom `timer()` API. Only the algorithms themselves are timed; memory allocation and other setup are excluded, so the comparison reflects compute alone. All times are in milliseconds (ms).

### Optimal Block size
The `blockSize` parameter was roughly optimized for both parallel algorithms at an array length of 2^20, averaging the times for power-of-two (POT) and non-power-of-two (NPOT) inputs.
The optimal block size was found to be 512 for the naive parallel scan and 128 for the work-efficient scan.
![](img/blocksizeopt.png)


### Scan implementations vs Array Size
We compare the Naive, Work-Efficient, and Thrust implementations of scan against the CPU scan algorithm.
Thrust is a C++ template library for CUDA, modeled on the Standard Template Library (STL), for writing
high-performance parallel applications.

Both GPU and CPU timing functions are wrapped in a performance-timer class: `std::chrono` provides
high-precision CPU timing, and CUDA events measure GPU performance.
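A minimal sketch of the CPU side of such a timer (class and method names are illustrative, not the project's exact API; the CUDA path, not shown, would bracket kernel launches with a `cudaEvent_t` pair and `cudaEventElapsedTime`):

```cpp
#include <cassert>
#include <chrono>

// Wall-clock timer in milliseconds via std::chrono, mirroring the
// start/stop pattern used around each scan implementation.
class CpuTimer {
public:
    void start() { t0 = std::chrono::high_resolution_clock::now(); }
    double stopMs() const {
        auto t1 = std::chrono::high_resolution_clock::now();
        return std::chrono::duration<double, std::milli>(t1 - t0).count();
    }
private:
    std::chrono::high_resolution_clock::time_point t0;
};
```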

Power-of-two and non-power-of-two inputs are analyzed separately for better
comparability of the CPU, GPU, and Thrust implementations.


![](img/scan_pot.png)
![](img/perfpot.png)

The analysis for NPOT arrays follows:
![](img/scan_npot.png)
![](img/perfnpot.png)

The CPU implementation is fastest for short inputs, up to an array size of about
2^13. Beyond that, the CPU scan is slightly slower than the naive parallel scan but still faster than the work-efficient parallel scan
until an array size of 2^20. The Thrust scan is the fastest of the GPU algorithms throughout, and beats the CPU scan beyond 2^13.

The *work-efficient scan is optimized for thread usage*: it launches only the
required number of threads at each level, so no idle threads wait on others to terminate. This
is why, at large array lengths, it is faster than everything except the Thrust scan.
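To see why this matters: at up-sweep depth *d*, only n / 2^(d+1) threads do useful work, so launching exactly that many per level (rather than n every pass) shrinks each grid geometrically. A small host-side illustration of the active-thread counts, assuming n is a power of two:

```cpp
#include <cassert>
#include <vector>

// Active threads per up-sweep level of an n-element work-efficient scan
// (n a power of two): n/2, n/4, ..., 1 -- instead of n at every level.
std::vector<int> upsweepThreadCounts(int n) {
    std::vector<int> counts;
    for (int active = n / 2; active >= 1; active /= 2)
        counts.push_back(active);
    return counts;
}
```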

The major *performance bottleneck* for all the parallel algorithms is global-memory I/O. This could be improved
by using shared memory, whose size can be set dynamically per launch in proportion to the block size. The overhead is amortized
at larger array sizes, where the work-efficient and Thrust scans outrun the CPU scan.

The *Thrust* library's `exclusive_scan()` performs remarkably consistently as the input size grows up to
2^17, after which its runtime increases modestly. A likely explanation is Thrust's optimized memory management,
which avoids much of the global-memory I/O overhead; the slowdown at larger array sizes is plausibly due to
allocation and host-device copy latency.
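Thrust's `exclusive_scan` follows the same contract as C++17's `std::exclusive_scan`; a host-side sketch of the call pattern (assumes a C++17 compiler; the wrapper name is illustrative):

```cpp
#include <cassert>
#include <numeric>
#include <vector>

// Host-side analogue of thrust::exclusive_scan(first, last, result, 0):
// writes [0, in[0], in[0]+in[1], ...] into the output.
std::vector<int> hostExclusiveScan(const std::vector<int>& in) {
    std::vector<int> out(in.size());
    std::exclusive_scan(in.begin(), in.end(), out.begin(), 0);
    return out;
}
```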


### Stream Compaction Performance

Here we compare the stream compaction implementations: CPU (with and without scan) and the work-efficient CUDA algorithm.
![](img/compact_perf.png)

As the discussion above suggests, the CPU algorithms take far less time than their GPU
counterparts for smaller arrays (< 2^20). The trend flips at array sizes above 2^20, where the optimized
thread usage of the work-efficient algorithm pays off.
```
****************
** SCAN TESTS **
****************
[ 10 32 23 43 28 33 17 0 35 37 44 0 23 ... 33 0 ]
==== cpu scan, power-of-two ====
elapsed time: 1.7358ms (std::chrono Measured)
[ 0 10 42 65 108 136 169 186 186 221 258 302 302 ... 25672547 25672580 ]
==== cpu scan, non-power-of-two ====
elapsed time: 1.7335ms (std::chrono Measured)
[ 0 10 42 65 108 136 169 186 186 221 258 302 302 ... 25672508 25672513 ]
passed
==== naive scan, power-of-two ====
elapsed time: 1.57238ms (CUDA Measured)
passed
==== naive scan, non-power-of-two ====
elapsed time: 1.51142ms (CUDA Measured)
passed
==== work-efficient scan, power-of-two ====
elapsed time: 1.42707ms (CUDA Measured)
passed
==== work-efficient scan, non-power-of-two ====
elapsed time: 1.36602ms (CUDA Measured)
passed
==== thrust scan, power-of-two ====
elapsed time: 0.596ms (CUDA Measured)
passed
==== thrust scan, non-power-of-two ====
elapsed time: 0.176128ms (CUDA Measured)
passed

*****************************
** STREAM COMPACTION TESTS **
*****************************
[ 3 1 1 2 2 0 1 3 1 1 0 0 0 ... 0 0 ]
==== cpu compact without scan, power-of-two ====
elapsed time: 2.146ms (std::chrono Measured)
[ 3 1 1 2 2 1 3 1 1 3 3 2 1 ... 3 3 ]
passed
==== cpu compact without scan, non-power-of-two ====
elapsed time: 2.171ms (std::chrono Measured)
[ 3 1 1 2 2 1 3 1 1 3 3 2 1 ... 1 3 ]
passed
==== cpu compact with scan ====
elapsed time: 5.0626ms (std::chrono Measured)
[ 3 1 1 2 2 1 3 1 1 3 3 2 1 ... 3 3 ]
passed
==== work-efficient compact, power-of-two ====
elapsed time: 2.25693ms (CUDA Measured)
passed
==== work-efficient compact, non-power-of-two ====
elapsed time: 1.92234ms (CUDA Measured)
passed
Press any key to continue . . .
```

Binary file added img/blocksizeopt.png
Binary file added img/compact_perf.png
Binary file added img/perfnpot.png
Binary file added img/perfpot.png
Binary file added img/scan_npot.png
Binary file added img/scan_pot.png
2 changes: 1 addition & 1 deletion src/main.cpp
@@ -13,7 +13,7 @@
#include <stream_compaction/thrust.h>
#include "testing_helpers.hpp"

const int SIZE = 1 << 20; // feel free to change the size of array
const int NPOT = SIZE - 3; // Non-Power-Of-Two
int *a = new int[SIZE];
int *b = new int[SIZE];
20 changes: 16 additions & 4 deletions stream_compaction/common.cu
@@ -1,6 +1,6 @@
#include "common.h"

void checkCUDAErrorFn(const char* msg, const char* file, int line) {
cudaError_t err = cudaGetLastError();
if (cudaSuccess == err) {
return;
@@ -22,17 +22,29 @@ namespace StreamCompaction {
* Maps an array to an array of 0s and 1s for stream compaction. Elements
* which map to 0 will be removed, and elements which map to 1 will be kept.
*/
__global__ void kernMapToBoolean(int n, int* bools, const int* idata) {
int idx = blockDim.x * blockIdx.x + threadIdx.x;
if (idx >= n) {
return;
}
bools[idx] = (idata[idx]) ? 1 : 0;
}

/**
* Performs scatter on an array. That is, for each element in idata,
* if bools[idx] == 1, it copies idata[idx] to odata[indices[idx]].
*/
__global__ void kernScatter(int n, int* odata,
const int* idata, const int* bools, const int* indices) {
int idx = blockDim.x * blockIdx.x + threadIdx.x;
if (idx >= n) {
return;
}
if (bools[idx]) {
odata[indices[idx]] = idata[idx];
}
}

}
38 changes: 33 additions & 5 deletions stream_compaction/cpu.cu
@@ -19,7 +19,9 @@ namespace StreamCompaction {
*/
void scan(int n, int *odata, const int *idata) {
timer().startCpuTimer();
odata[0] = 0; // exclusive scan: first element is the identity
for (int i = 1; i < n; i++) {
odata[i] = odata[i - 1] + idata[i - 1];
}
timer().endCpuTimer();
}

@@ -30,9 +32,15 @@
*/
int compactWithoutScan(int n, int *odata, const int *idata) {
timer().startCpuTimer();
int j = 0;
for (int i = 0; i < n; i++) {
if (idata[i]) {
odata[j] = idata[i];
j++;
}
}
timer().endCpuTimer();
return (j) ? j : -1;
}

/**
@@ -42,9 +50,29 @@
*/
int compactWithScan(int n, int *odata, const int *idata) {
timer().startCpuTimer();
// map the input array to an array of 0s and 1s
int* temp = new int[n];
for (int i = 0; i < n; i++) {
temp[i] = idata[i] == 0 ? 0 : 1;
}

int* scanOutput = new int[n];
scanOutput[0] = 0;

// scan the temp array
for (int i = 1; i < n; i++) {
scanOutput[i] = scanOutput[i - 1] + temp[i - 1];
}

// exclusive scan excludes the last element, so add its flag back in
int compactLen = scanOutput[n - 1] + temp[n - 1];

for (int i = 0; i < n; i++) {
if (temp[i]) {
odata[scanOutput[i]] = idata[i];
}
}
timer().endCpuTimer();
delete[] temp;
delete[] scanOutput;
return (compactLen) ? compactLen : -1;
}
}
}