CIS565-Fall-2017 · MegSesh · Sep 14, 2017 · Sep 14, 2017 · Sep 15, 2017 · Sep 15, 2017
diff --git a/README.md b/README.md
@@ -1,13 +1,160 @@
-CUDA Stream Compaction
+Project 2: CUDA Stream Compaction
 ======================
 
-**University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 2**
+**Course project for CIS 565: GPU Programming and Architecture, University of Pennsylvania**
 
-* (TODO) YOUR NAME HERE
-* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab)
+* Name: Meghana Seshadri
+* Tested on: Windows 10, i7-4870HQ @ 2.50GHz 16GB, GeForce GT 750M 2048MB (personal computer)
 
-### (TODO: Your README)
 
-Include analysis, etc. (Remember, this is public, so don't put
-anything here that you don't want to share with the world.)
+## Project Overview
 
+The goal of this project was to get an introduction to the GPU stream compaction in CUDA. This algorithm is often used to quickly process large sets of data according to certain conditions - hence very useful in parallel programming. In this project, the stream compaction will remove any instance of a zero from an array of int's. 
+
+
+The following four versions of stream compaction were implemented:
+
+* CPU algorithm
+* Naive parallel algorithm 
+* Work-efficient algorithm 
+* Thrust-based algorithm
+
+
+The stream compaction algorithm breaks down into a scan function and a compat function. The scan function computes an exclusive prefix sum. The compact function maps an input array to an array of zero's and one's, scans it, and then uses scatter to produce the output. 
+
+The following descriptions dive a bit deeper into the different approaches:
+
+### CPU
+* `n - 1` adds for an array of length `n` --> O(n) number of adds
+
+
+### Naive Parallel
+* log2(n) passes on n elements --> O(n log2 (n)) numer of adds
+
+
+### Work-Efficient
+* This algorithm organizes the data into a balanced binary tree
+* The occurs in two phases: 
+	- Up-sweep = implementing parallel reduction 
+	- Down-sweep = traversing back down the tree using partial sums to build the scan in place
+* Up Sweep = O(n - 1) adds
+* Down Sweep = O(n - 1) adds, O(n - 1) swaps
+
+
+### Thrust-Based
+* The Thrust library has a "thrust::exclusive_scan()" function that we use to compare the above algorithms to.
+
+
+## Performance Analysis 
+
+### Block Comparison
+
+Below is a graph that shows which would be the best block size to test on. The times below are for the work-efficient algorithm, comparing 64 elements and 65536 elements.
+
+![](images/blockcomparison.PNG)
+
+Based on the chart above, block size 128 on average, seems to be the most efficient. The rest of the analysis below will be using that block size.
+
+
+### Algorithm Comparison
+
+Below are charts showing tests across all 4 algorithms, both testing with power-of-two sized arrays and non-power-of-two sized arrays.
+
+
+![](images/algorithmcomparison.PNG)
+
+
+![](images/algorithmcomparisontable.PNG)
+
+
+First things first - why is the GPU implementation so much slower than the Naive Parallel and CPU versions? Isn't the GPU supposed to be faster? 
+
+Take a look at the example diagram of the up sweep process as a part of the work-efficient algorithm below. 
+
+![](images/upsweep.PNG)
+
+After iteration 0, only 4 out of the 8 elements are being computed and written to the output array. After iteration 1, only two are being computed and written, and after iteration 2, only one final answer is being outputted. 
+
+Within each iteration, the same number of blocks are being launched. Essentially, every index is being allocated a block of memory (aka 32 threads). However, only one thread in 4 blocks for iteration 0, 2 blocks for iteration 1, and 1 block for iteration 2 are actually being used. This means that at least 4 blocks are not being used at all per iteration, and most of the threads per block are not even doing any work. So, really, the actual power of the GPU isn't being used at all. 
+
+In order to optimize the code, it would be better to try and allocate only the necessary blocks needed per iteration, and to try and push computation to be done on earlier threads to achieve early termination of the block. In this example, this means that at iteration 0: 
+
+* Having the computation and write at index 1 use thread #0 instead of thread #1
+* Having the computation and write at index 3 use thread #1 instead of thread #3
+* Having the computation and write at index 5 use thread #2 instead of thread #5
+* Having the computation and write at index 7 use thread #3 instead of thread #7
+
+This way you only use the number of threads required and push them to be the first couple of ones so that the block can retire early and be used elsewhere. 
+
+
+Despite this explanation, it can be seen that Thrust implementation, up until size power of 14, becomes more efficient with increasing array size. It's interesting to note that there's a drastic increase in computation time across all algorithms with array size power of 14 and above.
+
+
+### Scan and Stream Compaction Test Results
+
+The following tests were run on 128 elements at a block size of 128.
+
+```
+****************
+** SCAN TESTS **
+****************
+    [  40  12   3  21  27   2  32  20   1   4  41  34  15 ...  34   0 ]
+==== cpu scan, power-of-two ====
+   elapsed time: 0.000821ms    (std::chrono Measured)
+    [   0  40  52  55  76 103 105 137 157 158 162 203 237 ... 6112 6146 ]
+==== cpu scan, non-power-of-two ====
+   elapsed time: 0.000821ms    (std::chrono Measured)
+    [   0  40  52  55  76 103 105 137 157 158 162 203 237 ... 6063 6064 ]
+    passed
+==== naive scan, power-of-two ====
+   elapsed time: 0.44064ms    (CUDA Measured)
+    [   0  40  52  55  76 103 105 137 157 158 162 203 237 ... 6112 6146 ]
+    passed
+==== naive scan, non-power-of-two ====
+   elapsed time: 0.433024ms    (CUDA Measured)
+    [   0  40  52  55  76 103 105 137 157 158 162 203 237 ... 6063 6064 ]
+    passed
+==== work-efficient scan, power-of-two ====
+   elapsed time: 0.781184ms    (CUDA Measured)
+    [   0  40  52  55  76 103 105 137 157 158 162 203 237 ... 6112 6146 ]
+    passed
+==== work-efficient scan, non-power-of-two ====
+   elapsed time: 1.20867ms    (CUDA Measured)
+    [   0  40  52  55  76 103 105 137 157 158 162 203 237 ... 6099 6099 ]
+    [   0  40  52  55  76 103 105 137 157 158 162 203 237 ... 6063 6064 ]
+    passed
+==== thrust scan, power-of-two ====
+   elapsed time: 2.34442ms    (CUDA Measured)
+    [   0  40  52  55  76 103 105 137 157 158 162 203 237 ... 6112 6146 ]
+    passed
+==== thrust scan, non-power-of-two ====
+   elapsed time: 0.019392ms    (CUDA Measured)
+    [   0  40  52  55  76 103 105 137 157 158 162 203 237 ... 6063 6064 ]
+    passed
+
+*****************************
+** STREAM COMPACTION TESTS **
+*****************************
+    [   2   0   1   3   1   0   0   2   1   0   1   0   1 ...   0   0 ]
+==== cpu compact without scan, power-of-two ====
+   elapsed time: 0.000821ms    (std::chrono Measured)
+    [   2   1   3   1   2   1   1   1   3   3   3   3   3 ...   1   1 ]
+    passed
+==== cpu compact without scan, non-power-of-two ====
+   elapsed time: 0.000821ms    (std::chrono Measured)
+    [   2   1   3   1   2   1   1   1   3   3   3   3   3 ...   3   1 ]
+    passed
+==== cpu compact with scan ====
+   elapsed time: 0.962322ms    (std::chrono Measured)
+    [   2   1   3   1   2   1   1   1   3   3   3   3   3 ...   1   1 ]
+    passed
+==== work-efficient compact, power-of-two ====
+   elapsed time: 2.22262ms    (CUDA Measured)
+    [   2   1   3   1   2   1   1   1   3   3   3   3   3 ...   1   1 ]
+    passed
+==== work-efficient compact, non-power-of-two ====
+   elapsed time: 2.10656ms    (CUDA Measured)
+    [   2   1   3   1   2   1   1   1   3   3   3   3   3 ...   3   1 ]
+    passed
+Press any key to continue . . .
+```
diff --git a/images/algorithmcomparison.PNG b/images/algorithmcomparison.PNG
diff --git a/images/algorithmcomparisontable.PNG b/images/algorithmcomparisontable.PNG
diff --git a/images/blockcomparison.PNG b/images/blockcomparison.PNG
diff --git a/images/kernelindexmanipulationtests.PNG b/images/kernelindexmanipulationtests.PNG
diff --git a/images/upsweep.PNG b/images/upsweep.PNG
diff --git a/src/main.cpp b/src/main.cpp
@@ -13,9 +13,28 @@
 #include <stream_compaction/thrust.h>
 #include "testing_helpers.hpp"
 
-const int SIZE = 1 << 8; // feel free to change the size of array
+
+const int SIZE = 1 << 22;
+//const int SIZE = 1 << 16;
+//const int SIZE = 1 << 14;
+//const int SIZE = 1 << 12;
+//const int SIZE = 1 << 10;
+//const int SIZE = 1 << 8;
+//const int SIZE = 1 << 6;
+
+//const int SIZE = 1 << 8; // = 256, feel free to change the size of array
+//const int SIZE = 1 << 5; // = 32, feel free to change the size of array
+//const int SIZE = 1 << 4; // = 16, feel free to change the size of array
+//const int SIZE = 1 << 3; // = 8, feel free to change the size of array
+//const int SIZE = 1 << 2; // = 4, feel free to change the size of array
+
 const int NPOT = SIZE - 3; // Non-Power-Of-Two
-int a[SIZE], b[SIZE], c[SIZE];
+//int a[SIZE], b[SIZE], c[SIZE];
+
+int *a = new int[SIZE];
+int *b = new int[SIZE];
+int *c = new int[SIZE];
+
 
 int main(int argc, char* argv[]) {
     // Scan tests
@@ -49,42 +68,44 @@ int main(int argc, char* argv[]) {
     printDesc("naive scan, power-of-two");
     StreamCompaction::Naive::scan(SIZE, c, a);
     printElapsedTime(StreamCompaction::Naive::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
-    //printArray(SIZE, c, true);
+    printArray(SIZE, c, true);
     printCmpResult(SIZE, b, c);
 
     zeroArray(SIZE, c);
     printDesc("naive scan, non-power-of-two");
     StreamCompaction::Naive::scan(NPOT, c, a);
     printElapsedTime(StreamCompaction::Naive::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
-    //printArray(SIZE, c, true);
+    printArray(NPOT, c, true);
     printCmpResult(NPOT, b, c);
 
     zeroArray(SIZE, c);
     printDesc("work-efficient scan, power-of-two");
     StreamCompaction::Efficient::scan(SIZE, c, a);
     printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
-    //printArray(SIZE, c, true);
+    printArray(SIZE, c, true);
     printCmpResult(SIZE, b, c);
 
     zeroArray(SIZE, c);
     printDesc("work-efficient scan, non-power-of-two");
     StreamCompaction::Efficient::scan(NPOT, c, a);
     printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
-    //printArray(NPOT, c, true);
+
+	printArray(SIZE, c, true);
+    printArray(NPOT, c, true);
     printCmpResult(NPOT, b, c);
 
     zeroArray(SIZE, c);
     printDesc("thrust scan, power-of-two");
     StreamCompaction::Thrust::scan(SIZE, c, a);
     printElapsedTime(StreamCompaction::Thrust::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
-    //printArray(SIZE, c, true);
+    printArray(SIZE, c, true);
     printCmpResult(SIZE, b, c);
 
     zeroArray(SIZE, c);
     printDesc("thrust scan, non-power-of-two");
     StreamCompaction::Thrust::scan(NPOT, c, a);
     printElapsedTime(StreamCompaction::Thrust::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
-    //printArray(NPOT, c, true);
+    printArray(NPOT, c, true);
     printCmpResult(NPOT, b, c);
 
     printf("\n");
@@ -129,15 +150,25 @@ int main(int argc, char* argv[]) {
     printDesc("work-efficient compact, power-of-two");
     count = StreamCompaction::Efficient::compact(SIZE, c, a);
     printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
-    //printArray(count, c, true);
+    printArray(count, c, true);
     printCmpLenResult(count, expectedCount, b, c);
 
     zeroArray(SIZE, c);
     printDesc("work-efficient compact, non-power-of-two");
     count = StreamCompaction::Efficient::compact(NPOT, c, a);
     printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
-    //printArray(count, c, true);
+    printArray(count, c, true);
     printCmpLenResult(count, expectedNPOT, b, c);
 
     system("pause"); // stop Win32 console from closing on exit
+
+	//Copying this b/c aparently it stops from my exe window from crashing on the first go
+	//Not sure why
+	system("pause"); // stop Win32 console from closing on exit
+
+
+	free(a);
+	free(b);
+	free(c);
 }
+
diff --git a/stream_compaction/common.cu b/stream_compaction/common.cu
@@ -1,5 +1,6 @@
 #include "common.h"
 
+
 void checkCUDAErrorFn(const char *msg, const char *file, int line) {
     cudaError_t err = cudaGetLastError();
     if (cudaSuccess == err) {
@@ -23,7 +24,22 @@ namespace StreamCompaction {
          * which map to 0 will be removed, and elements which map to 1 will be kept.
          */
         __global__ void kernMapToBoolean(int n, int *bools, const int *idata) {
-            // TODO
+
+			// TODO
+
+			int index = (blockIdx.x * blockDim.x) + threadIdx.x;
+
+			if (index < n)
+			{
+				if (idata[index] == 0)
+				{
+					bools[index] = 0;
+				}
+				else
+				{
+					bools[index] = 1;
+				}
+			}//end kernMapToBoolean function
         }
 
         /**
@@ -33,7 +49,17 @@ namespace StreamCompaction {
         __global__ void kernScatter(int n, int *odata,
                 const int *idata, const int *bools, const int *indices) {
             // TODO
-        }
+
+			int index = (blockIdx.x * blockDim.x) + threadIdx.x;
+
+			if (index < n)
+			{
+				if (bools[index] == 1)
+				{
+					odata[indices[index]] = idata[index];
+				}
+			}
+        }//end kernScatter function
 
     }
 }