144 changes: 138 additions & 6 deletions README.md
Original file line number Diff line number Diff line change
@@ -3,12 +3,144 @@ CUDA Stream Compaction

**University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 2**

* (TODO) YOUR NAME HERE
* (TODO) [LinkedIn](), [personal website](), [twitter](), etc.
* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab)
* Yifan Lu
* [LinkedIn](https://www.linkedin.com/in/yifan-lu-495559231/), [personal website](http://portfolio.samielouse.icu/)
* Tested on: Windows 11, AMD Ryzen 7 5800H 3.20 GHz, Nvidia GeForce RTX 3060 Laptop GPU (Personal Laptop)

### (TODO: Your README)
## Project Features
- CPU Scan & Stream Compaction
- Naive GPU Scan Algorithm
- Work-Efficient GPU Scan & Stream Compaction
- Thrust Scan
- **[Extra Credit]** Optimized work-efficient scan with bitwise kernel operations and dynamically sized kernel launches

Include analysis, etc. (Remember, this is public, so don't put
anything here that you don't want to share with the world.)
## Introduction

### Scan
A scan processes the elements of an array to produce a new array in which each position holds an aggregation (such as a sum, product, or logical operation) of the elements up to that position in the original array. In this project, we scan for sums.

There are two main variants of scan:

- **Inclusive Scan**

Each element in the output array includes the corresponding element from the input array in the sum.

- **Exclusive Scan**

Each element in the output array does not include the corresponding element from the input array, starting with an initial value (like zero for sum).


### Compaction
Compaction filters unwanted elements out of an array, producing a new array that contains only the elements satisfying a specific condition.

### Parallelization
Parallelizing the scan and compaction process can significantly improve its performance, especially on large datasets. In parallel implementations, the array can be divided among multiple processors or threads, each performing the mapping, scanning, and scattering on a segment of the array.


## Performance Analysis

### Blocksize Optimization

For my implementation and hardware, performance improves until the block size reaches 256; increasing the block size further does not change performance significantly.

This is the GPU time for the work-efficient method on an array of size $2^{15}$; it reaches its lowest point at block size 256.

![](img/blocksize.png)

### Compare GPU Scan Implementations

The following charts show the scan time for the CPU, GPU naive, GPU work-efficient, and Thrust implementations.

![](img/time.png)
![](img/timelog2.png)

Naive scan needs $O(\log_2 n)$ passes, each performing $O(n)$ additions. The work-efficient scan uses a balanced binary-tree structure, so a single traversal of the tree needs only $O(n)$ additions in total.

The work-efficient scan shows a significant improvement as the array size grows.

### Performance Bottlenecks

To trace the GPU bottlenecks during scanning, I profiled the release build with Nsight Systems.

![](img/b4numblock.png)

The screenshot above is taken from Nsight Systems; the highlighted time period corresponds to the work-efficient scan.

The graph shows time gaps between GPU SM executions, probably caused by non-coalesced accesses to global memory.

To speed up each kernel's computation, I replaced operations such as modulo and multiply/divide with bitwise operations.

For example, `index % offset == 0` becomes `(index & (offset - 1)) == 0`. Because `offset` is always a power of 2, `offset - 1` is a run of 1 bits covering exactly the low bits that the modulo would inspect; ANDing it with `index` keeps only those bits, and a result of 0 means `index` is divisible by `offset`.

Also, to avoid too many idle threads, I **dynamically change the number of blocks each kernel launches in the work-efficient method**.

After these optimizations, the timeline looks like the graph below, with longer SM active time. Some gaps remain, however, because of memory latency.

![](img/afternumblock.png)

### Test Output

The test below was run with array size $2^{10}$ and block size 256.

```

****************
** SCAN TESTS **
****************
[ 16 47 5 34 25 42 48 11 36 49 39 23 0 ... 42 0 ]
==== cpu scan, power-of-two ====
elapsed time: 0.0008ms (std::chrono Measured)
[ 0 16 63 68 102 127 169 217 228 264 313 352 375 ... 25401 25443 ]
==== cpu scan, non-power-of-two ====
elapsed time: 0.0008ms (std::chrono Measured)
[ 0 16 63 68 102 127 169 217 228 264 313 352 375 ... 25372 25394 ]
passed
==== naive scan, power-of-two ====
elapsed time: 0.342016ms (CUDA Measured)
[ 0 16 63 68 102 127 169 217 228 264 313 352 375 ... 25401 25443 ]
passed
==== naive scan, non-power-of-two ====
elapsed time: 0.105472ms (CUDA Measured)
[ 0 16 63 68 102 127 169 217 228 264 313 352 375 ... 0 0 ]
passed
==== work-efficient scan, power-of-two ====
elapsed time: 0.897024ms (CUDA Measured)
[ 0 16 63 68 102 127 169 217 228 264 313 352 375 ... 25401 25443 ]
passed
==== work-efficient scan, non-power-of-two ====
elapsed time: 0.27648ms (CUDA Measured)
[ 0 16 63 68 102 127 169 217 228 264 313 352 375 ... 25372 25394 ]
passed
==== thrust scan, power-of-two ====
elapsed time: 0.121856ms (CUDA Measured)
passed
==== thrust scan, non-power-of-two ====
elapsed time: 0.04096ms (CUDA Measured)
passed

*****************************
** STREAM COMPACTION TESTS **
*****************************
[ 3 1 3 1 1 3 1 1 0 1 1 1 3 ... 0 0 ]
==== cpu compact without scan, power-of-two ====
elapsed time: 0.0028ms (std::chrono Measured)
[ 3 1 3 1 1 3 1 1 1 1 1 3 1 ... 1 1 ]
passed
==== cpu compact without scan, non-power-of-two ====
elapsed time: 0.0022ms (std::chrono Measured)
[ 3 1 3 1 1 3 1 1 1 1 1 3 1 ... 1 1 ]
passed
==== cpu compact with scan ====
elapsed time: 0.0005ms (std::chrono Measured)
[ 3 1 3 1 1 3 1 1 1 1 1 3 1 ... 1 1 ]
passed
==== work-efficient compact, power-of-two ====
elapsed time: 0.27648ms (CUDA Measured)
[ 3 1 3 1 1 3 1 1 1 1 1 3 1 ... 1 1 ]
passed
==== work-efficient compact, non-power-of-two ====
elapsed time: 0.27648ms (CUDA Measured)
[ 3 1 3 1 1 3 1 1 1 1 1 3 1 ... 1 1 ]
passed
```

Binary file added img/afternumblock.png
Binary file added img/b4numblock.png
Binary file added img/blocksize.png
Binary file added img/nsystem_trace.png
Binary file added img/nsystem_trace2.png
Binary file added img/time.png
Binary file added img/timelog.png
Binary file added img/timelog2.png
20 changes: 11 additions & 9 deletions src/main.cpp
@@ -13,7 +13,7 @@
#include <stream_compaction/thrust.h>
#include "testing_helpers.hpp"

const int SIZE = 1 << 8; // feel free to change the size of array
const int SIZE = 1 << 10; // feel free to change the size of array
const int NPOT = SIZE - 3; // Non-Power-Of-Two
int *a = new int[SIZE];
int *b = new int[SIZE];
@@ -51,7 +51,7 @@ int main(int argc, char* argv[]) {
printDesc("naive scan, power-of-two");
StreamCompaction::Naive::scan(SIZE, c, a);
printElapsedTime(StreamCompaction::Naive::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
//printArray(SIZE, c, true);
printArray(SIZE, c, true);
printCmpResult(SIZE, b, c);

/* For bug-finding only: Array of 1s to help find bugs in stream compaction or scan
@@ -64,21 +64,21 @@
printDesc("naive scan, non-power-of-two");
StreamCompaction::Naive::scan(NPOT, c, a);
printElapsedTime(StreamCompaction::Naive::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
//printArray(SIZE, c, true);
printArray(SIZE, c, true);
printCmpResult(NPOT, b, c);

zeroArray(SIZE, c);
printDesc("work-efficient scan, power-of-two");
StreamCompaction::Efficient::scan(SIZE, c, a);
StreamCompaction::Efficient::scan_opt(SIZE, c, a);
printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
//printArray(SIZE, c, true);
printArray(SIZE, c, true);
printCmpResult(SIZE, b, c);

zeroArray(SIZE, c);
printDesc("work-efficient scan, non-power-of-two");
StreamCompaction::Efficient::scan(NPOT, c, a);
StreamCompaction::Efficient::scan_opt(NPOT, c, a);
printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
//printArray(NPOT, c, true);
printArray(NPOT, c, true);
printCmpResult(NPOT, b, c);

zeroArray(SIZE, c);
@@ -108,6 +108,7 @@ int main(int argc, char* argv[]) {

int count, expectedCount, expectedNPOT;


// initialize b using StreamCompaction::CPU::compactWithoutScan you implement
// We use b for further comparison. Make sure your StreamCompaction::CPU::compactWithoutScan is correct.
zeroArray(SIZE, b);
@@ -137,16 +138,17 @@
printDesc("work-efficient compact, power-of-two");
count = StreamCompaction::Efficient::compact(SIZE, c, a);
printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
//printArray(count, c, true);
printArray(count, c, true);
printCmpLenResult(count, expectedCount, b, c);

zeroArray(SIZE, c);
printDesc("work-efficient compact, non-power-of-two");
count = StreamCompaction::Efficient::compact(NPOT, c, a);
printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
//printArray(count, c, true);
printArray(count, c, true);
printCmpLenResult(count, expectedNPOT, b, c);


system("pause"); // stop Win32 console from closing on exit
delete[] a;
delete[] b;
5 changes: 4 additions & 1 deletion src/testing_helpers.hpp
@@ -10,7 +10,9 @@ template<typename T>
int cmpArrays(int n, T *a, T *b) {
for (int i = 0; i < n; i++) {
if (a[i] != b[i]) {
printf(" a[%d] = %d, b[%d] = %d\n", i, a[i], i, b[i]);
printf(" a[%d] = %d, b[%d] = %d\n", i, a[i], i, b[i]);
// Print the neighboring entries for context, guarding against
// out-of-bounds access at the ends of the arrays
if (i > 0) printf(" a[%d] = %d, b[%d] = %d\n", i - 1, a[i - 1], i - 1, b[i - 1]);
if (i + 1 < n) printf(" a[%d] = %d, b[%d] = %d\n", i + 1, a[i + 1], i + 1, b[i + 1]);
return 1;
}
}
@@ -58,6 +60,7 @@ void genArray(int n, int *a, int maxval) {
}

void printArray(int n, int *a, bool abridged = false) {
//printf("count: %d \n", n);
printf(" [ ");
for (int i = 0; i < n; i++) {
if (abridged && i + 2 == 15 && n > 16) {
14 changes: 14 additions & 0 deletions stream_compaction/common.cu
@@ -24,6 +24,12 @@ namespace StreamCompaction {
*/
__global__ void kernMapToBoolean(int n, int *bools, const int *idata) {
// TODO
int index = threadIdx.x + (blockIdx.x * blockDim.x);
if (index >= n) {
return;
}
bools[index] = (idata[index] != 0) ? 1 : 0;

}

/**
@@ -33,6 +39,14 @@
__global__ void kernScatter(int n, int *odata,
const int *idata, const int *bools, const int *indices) {
// TODO
int index = threadIdx.x + (blockIdx.x * blockDim.x);
if (index >= n) {
return;
}
if (bools[index] == 1) {
odata[indices[index]] = idata[index];
}

}

}
34 changes: 30 additions & 4 deletions stream_compaction/cpu.cu
@@ -20,6 +20,10 @@ namespace StreamCompaction {
void scan(int n, int *odata, const int *idata) {
timer().startCpuTimer();
// TODO
odata[0] = 0;
for (int i = 1; i < n; i++) {
odata[i] = odata[i - 1] + idata[i - 1];
}
timer().endCpuTimer();
}

@@ -31,8 +35,15 @@
int compactWithoutScan(int n, int *odata, const int *idata) {
timer().startCpuTimer();
// TODO
int count = 0;
for (int i = 0; i < n; i++) {
if (idata[i] != 0) {
odata[count] = idata[i];
count++;
}
}
timer().endCpuTimer();
return -1;
return count;
}

/**
@@ -41,10 +52,25 @@
* @returns the number of elements remaining after compaction.
*/
int compactWithScan(int n, int *odata, const int *idata) {
timer().startCpuTimer();
//timer().startCpuTimer();
// TODO
timer().endCpuTimer();
return -1;
int *bools = new int[n];
for (int i = 0; i < n; i++) {
bools[i] = (idata[i] != 0) ? 1 : 0;
}
int* scanResult = new int[n];
scan(n, scanResult, bools);
// scatter
// exclusive scan: total count = last scanned index + last flag
int count = scanResult[n - 1] + bools[n - 1];
for (int i = 0; i < n; i++) {
if (bools[i] == 1) {
odata[scanResult[i]] = idata[i];
}
}
//timer().endCpuTimer();
delete[] bools;
delete[] scanResult;
return count;
}
}
}