CIS5650-Fall-2024 · nadnane · Sep 18, 2024 · Sep 18, 2024 · Sep 19, 2024 · Sep 19, 2024
diff --git a/README.md b/README.md
@@ -3,12 +3,107 @@ CUDA Stream Compaction
 
 **University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 2**
 
-* (TODO) YOUR NAME HERE
-  * (TODO) [LinkedIn](), [personal website](), [twitter](), etc.
-* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab)
+* Nadine Adnane
+  * [LinkedIn](https://www.linkedin.com/in/nadnane/)
+* Tested on my personal laptop (ASUS ROG Zephyrus M16):
+* **OS:** Windows 11
+* **Processor:** 12th Gen Intel(R) Core(TM) i9-12900H, 2500 Mhz, 14 Core(s), 20 Logical Processor(s) 
+* **GPU:** NVIDIA GeForce RTX 3070 Ti Laptop GPU
 
-### (TODO: Your README)
+Note: I used a late day on this assignment. I also need a few more hours to finish up my analysis.
 
-Include analysis, etc. (Remember, this is public, so don't put
-anything here that you don't want to share with the world.)
+### Introduction
 
+In this project, I set out to implement a few different versions of the Scan (Prefix Sum) algorithm and GPU stream compaction in CUDA. I first implemented the algorithms on the CPU as a basis for comparison and to reinforce my understanding of the algorithm. Then, I implemented the "naive" and "work-efficient" versions of the algorithm on the GPU using CUDA. Finally, I utilized some of my earlier implementations to implement GPU stream compaction. Through this project, I analyze the trade-offs between CPU and GPU performance and explore the benefits of parallel programming.
+
+### Features
+
+1. **CPU Scan & Stream Compaction**
+   - **CPU Scan:** Implemented a straightforward exclusive prefix sum using a for loop for sequential processing.
+   - **Compaction Without Scan:** A basic method that filters out zero values without employing a scan operation.
+   - **Compaction With Scan:** An optimized approach that leverages the scan algorithm to enhance the efficiency of the compaction process.
+
+2. **Naive GPU Scan**
+   - Developed a naive GPU scan algorithm following the method outlined in *GPU Gems 3*, Section 39.2.1. This implementation utilizes global memory and alternates between input/output arrays through multiple kernel invocations.
+
+3. **Work-Efficient GPU Scan & Stream Compaction**
+   - **Work-Efficient Scan:** Implemented an optimized scan using a tree-based approach, as described in *GPU Gems 3*, Section 39.2.2, for better performance.
+   - **Stream Compaction with Scan:** Built upon the work-efficient scan by first mapping the input to a boolean array, scanning it, and then scattering the elements that satisfy the condition to achieve compaction.
+   - Efficiently handles arrays that are not sized to a power of two.
+
+4. **Thrust Library Integration**
+   - Integrated the Thrust library’s `exclusive_scan` function to perform stream compaction utilizing the GPU-accelerated primitives offered by Thrust.
+
+## Performance Analysis
+
+# Block Size Optimization
+<img src="images/graph1.png" width="900">
+
+# Scan Implementation Comparison
+<img src="images/graph2.png" width="900">
+
+# Stream Compaction Implementation Comparison
+<img src="images/graph3.png" width="900">
+
+
+
+# Test Program Output
+
+The following test output was generated by running the Scan and Stream compaction algorithms on:
+- an array of size 2^21
+- an array of size 2^21 - 3
+
+with a block size of 128.
+
+```****************
+** SCAN TESTS **
+****************
+    [  47   4  13  41  40  12  10  35  21  19  24  19   8 ...  46   0 ]
+==== cpu scan, power-of-two ====
+   elapsed time: 6.6257ms    (std::chrono Measured)
+    [   0  47  51  64 105 145 157 167 202 223 242 266 285 ... 51377593 51377639 ]
+==== cpu scan, non-power-of-two ====
+   elapsed time: 6.5007ms    (std::chrono Measured)
+    [   0  47  51  64 105 145 157 167 202 223 242 266 285 ... 51377490 51377518 ]
+    passed
+==== naive scan, power-of-two ====
+   elapsed time: 1.49398ms    (CUDA Measured)
+    passed
+==== naive scan, non-power-of-two ====
+   elapsed time: 1.31354ms    (CUDA Measured)
+    passed
+==== work-efficient scan, power-of-two ====
+   elapsed time: 1.48496ms    (CUDA Measured)
+    passed
+==== work-efficient scan, non-power-of-two ====
+   elapsed time: 1.13254ms    (CUDA Measured)
+    passed
+==== thrust scan, power-of-two ====
+   elapsed time: 0.857024ms    (CUDA Measured)
+    passed
+==== thrust scan, non-power-of-two ====
+   elapsed time: 0.697344ms    (CUDA Measured)
+    passed
+
+*****************************
+** STREAM COMPACTION TESTS **
+*****************************
+    [   2   1   3   2   1   2   1   2   3   1   0   3   1 ...   1   0 ]
+==== cpu compact without scan, power-of-two ====
+   elapsed time: 9.0749ms    (std::chrono Measured)
+    [   2   1   3   2   1   2   1   2   3   1   3   1   3 ...   3   1 ]
+    passed
+==== cpu compact without scan, non-power-of-two ====
+   elapsed time: 8.3302ms    (std::chrono Measured)
+    [   2   1   3   2   1   2   1   2   3   1   3   1   3 ...   3   3 ]
+    passed
+==== cpu compact with scan ====
+   elapsed time: 12.1135ms    (std::chrono Measured)
+    [   2   1   3   2   1   2   1   2   3   1   3   1   3 ...   3   1 ]
+    passed
+==== work-efficient compact, power-of-two ====
+   elapsed time: 1.50966ms    (CUDA Measured)
+    passed
+==== work-efficient compact, non-power-of-two ====
+   elapsed time: 1.41312ms    (CUDA Measured)
+    passed```
diff --git a/images/graph1.png b/images/graph1.png
diff --git a/src/main.cpp b/src/main.cpp
@@ -13,7 +13,7 @@
 #include <stream_compaction/thrust.h>
 #include "testing_helpers.hpp"
 
-const int SIZE = 1 << 8; // feel free to change the size of array
+const int SIZE = 1 << 21; // feel free to change the size of array
 const int NPOT = SIZE - 3; // Non-Power-Of-Two
 int *a = new int[SIZE];
 int *b = new int[SIZE];

diff --git a/stream_compaction/common.cu b/stream_compaction/common.cu
@@ -23,7 +23,16 @@ namespace StreamCompaction {
          * which map to 0 will be removed, and elements which map to 1 will be kept.
          */
         __global__ void kernMapToBoolean(int n, int *bools, const int *idata) {
-            // TODO
+            // DONE
+            int idx = (blockIdx.x * blockDim.x) + threadIdx.x;
+
+            if (idx >= n)
+            {
+                return;
+            }
+
+            // Map non-zero values to 1, otherwise 0
+            bools[idx] = idata[idx] != 0 ? 1 : 0;
         }
 
         /**
@@ -32,7 +41,18 @@ namespace StreamCompaction {
          */
         __global__ void kernScatter(int n, int *odata,
                 const int *idata, const int *bools, const int *indices) {
-            // TODO
+            // DONE
+            int idx = (blockIdx.x * blockDim.x) + threadIdx.x;
+
+            if (idx >= n)
+            {
+                return;
+            }
+
+            if (bools[idx])
+            {
+                odata[indices[idx]] = idata[idx];
+            }
         }
 
     }

diff --git a/stream_compaction/common.h b/stream_compaction/common.h
@@ -17,6 +17,7 @@
  * Check for CUDA errors; print and exit if there was a problem.
  */
 void checkCUDAErrorFn(const char *msg, const char *file = NULL, int line = -1);
+const int blockSize = 512;
 
 inline int ilog2(int x) {
     int lg = 0;

diff --git a/stream_compaction/cpu.cu b/stream_compaction/cpu.cu
@@ -19,7 +19,14 @@ namespace StreamCompaction {
          */
         void scan(int n, int *odata, const int *idata) {
             timer().startCpuTimer();
-            // TODO
+            // DONE
+            // Exclusive - Include the identity in the output
+            odata[0] = 0;
+            for(int k = 1; k < n; k++)
+            {
+                odata[k] = odata[k - 1] + idata[k - 1];
+            }
+
             timer().endCpuTimer();
         }
 
@@ -30,9 +37,21 @@ namespace StreamCompaction {
          */
         int compactWithoutScan(int n, int *odata, const int *idata) {
             timer().startCpuTimer();
-            // TODO
+
+            // DONE
+            int j = 0;
+            for (int i = 0; i < n; ++i)
+            {
+                if (idata[i] != 0)
+                {
+                    odata[j] = idata[i];
+                    ++j;
+                }
+            }
+
+
             timer().endCpuTimer();
-            return -1;
+            return j;
         }
 
         /**
@@ -42,9 +61,49 @@ namespace StreamCompaction {
          */
         int compactWithScan(int n, int *odata, const int *idata) {
             timer().startCpuTimer();
-            // TODO
+            // DONE
+
+            int numRemaining;
+
+            // Step 1: Compute temporary array containing
+            // either 1 or 0, depending on if element meets criteria
+            int* temp = new int[n];
+            for (int i = 0; i < n; ++i)
+            {
+                if (idata[i] == 0)
+                {
+                    temp[i] = 0;
+                }
+                else
+                {
+                    temp[i] = 1;
+                }
+            }
+
+
+            // Step 2: Run exclusive scan on the temp array
+            // Exclusive - Insert the identity
+            odata[0] = 0;
+            // Start at 1 since we inserted the identity and are shifting to the right
+            for (int k = 1; k < n; ++k)
+            {
+                odata[k] = odata[k - 1] + temp[k - 1];
+            }
+
+            // Step 3: Scatter!
+            // Result of scan is index into the final array
+            numRemaining = odata[n - 1];
+            for (int i = 0; i < n; ++i)
+            {
+                if (temp[i] == 1)
+                {
+                    odata[odata[i]] = idata[i];
+                }
+            }
+
             timer().endCpuTimer();
-            return -1;
+
+            return numRemaining;
         }
     }
 }