2 changes: 1 addition & 1 deletion INSTRUCTION.md
@@ -310,4 +310,4 @@ The template of the comment section of your pull request is attached below, you
+ You can try multi-threading on CPU if you want (not required and not our focus)
+ for each element input[i] in original array
- if it's non-zero (given by mapped array)
- then put it at output[index], where index = scanned[i]
- then put it at output[index], where index = scanned[i]
34 changes: 30 additions & 4 deletions README.md
@@ -3,11 +3,37 @@ CUDA Stream Compaction

**University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 2**

* (TODO) YOUR NAME HERE
* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab)
* Anton Khabbaz
* pennkey: akhabbaz
* Tested on: Windows 10, Surface Book, i7-6600U @ 2.66 GHz, GTX 965M (personal computer)

### (TODO: Your README)

Include analysis, etc. (Remember, this is public, so don't put
anything here that you don't want to share with the world.)
This project explores parallel algorithms and how they are implemented on the GPU. A scan computes a running sum over an array of integers; here I implemented four versions of scan: a serial CPU version, a naive GPU version, a work-efficient GPU version, and a version that uses the Thrust library.
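
As a reference point for the GPU versions, here is a minimal sketch of a serial exclusive scan (the loop in cpu.cu may differ in details; the function name is illustrative):

```cpp
// Serial exclusive scan (prefix sum): odata[i] = sum of idata[0..i-1].
// Assumes odata and idata each hold n ints.
void cpuExclusiveScan(int n, int *odata, const int *idata) {
    int sum = 0;
    for (int i = 0; i < n; ++i) {
        odata[i] = sum;   // exclusive: element i does not include idata[i]
        sum += idata[i];
    }
}
```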

For the work-efficient scan I also implemented the extra credit: all threads are contiguous in both the up-sweep and the down-sweep. Furthermore, I trimmed the number of threads launched so that the grid was only as large as the number of threads actually needed at each level. With roughly 10^8 elements, my work-efficient scan matched the CPU scan in speed. The GPU and CPU timers measured only computation, not the time to allocate and transfer data.
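
A minimal sketch of the contiguous-thread idea for one up-sweep level, under the assumption that indices are computed in 64 bits (the kernel name and exact arithmetic in efficient.cu may differ):

```cuda
// One level of the up-sweep. Thread t (contiguous: 0, 1, 2, ...) updates
// the element at (t + 1) * 2 * stride - 1, so every launched thread has work.
// The index is computed in 64 bits to avoid the int overflow described below.
__global__ void kernUpSweepLevel(int n, int stride, int *data) {
    long long t = threadIdx.x + (long long)blockIdx.x * blockDim.x;
    long long index = (t + 1) * 2LL * stride - 1;   // right element of this thread's pair
    if (index >= n) {
        return;   // threads past the end of the array do nothing
    }
    data[index] += data[index - stride];
}
```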

All tests passed for every array size I tried.

![](img/RunTimes.png)
This figure plots the log10 of the run time per element in the array, a measure that stayed relatively constant as N increased. The work-efficient scan starts out slower than the CPU scan, but as the number of elements grows it first overtakes the naive scan and then matches the CPU scan. Thrust is the fastest scan for large inputs, by a factor of about 6.
The non-power-of-two scans were slightly faster than, but comparable in speed to, the power-of-two scans.

Stream compaction is one application of scan: it removes unwanted elements from a stream. At around 10^5 elements, the work-efficient compaction actually beat the other compactions. A compaction built on a Thrust scan would beat the work-efficient one, since the majority of the compaction time is spent computing the exclusive scan.
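
A hedged sketch of how the compaction fits together; the real version lives in efficient.cu and uses the kernMapToBoolean and kernScatter kernels shown later in this diff, while scanDevice here is a placeholder for the work-efficient scan operating on device buffers:

```cuda
// Stream compaction = map to booleans, exclusive scan, scatter.
// dev_in and dev_out are device buffers of n ints; names are illustrative.
int compactSketch(int n, int *dev_out, const int *dev_in) {
    const int blockSize = 128;
    const int numBlocks = (n + blockSize - 1) / blockSize;

    int *dev_bools, *dev_indices;
    cudaMalloc(&dev_bools,   n * sizeof(int));
    cudaMalloc(&dev_indices, n * sizeof(int));

    // 1. Map each element to 1 if it is kept (non-zero), else 0.
    StreamCompaction::Common::kernMapToBoolean<<<numBlocks, blockSize>>>(n, dev_bools, dev_in);
    // 2. Exclusive scan of the booleans gives each kept element its output slot.
    //    This step dominates the compaction time.
    scanDevice(n, dev_indices, dev_bools);   // placeholder for the device-side scan
    // 3. Scatter kept elements to their slots.
    StreamCompaction::Common::kernScatter<<<numBlocks, blockSize>>>(n, dev_out, dev_in, dev_bools, dev_indices);

    // Count of kept elements = last scanned index + last boolean.
    int lastIndex = 0, lastBool = 0;
    cudaMemcpy(&lastIndex, dev_indices + n - 1, sizeof(int), cudaMemcpyDeviceToHost);
    cudaMemcpy(&lastBool,  dev_bools   + n - 1, sizeof(int), cudaMemcpyDeviceToHost);

    cudaFree(dev_bools);
    cudaFree(dev_indices);
    return lastIndex + lastBool;
}
```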

![](img/StreamCompactionTimes.png)


I implemented the efficient scan using contiguous threads. It worked perfectly within one block, but beyond one block it failed: threads in different blocks do not synchronize.

I ran into several difficulties. At first I used __syncthreads() to synchronize, which only works when a single block holds all the threads. I then changed the code to iterate over the strides on the CPU, launching one kernel per level, which keeps all the blocks synchronized.
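
A minimal sketch of that host-side loop for the up-sweep, assuming the kernUpSweepLevel kernel sketched above and a power-of-two padded length n (the down-sweep uses the same pattern with the strides reversed):

```cuda
// Host-side up-sweep: one kernel launch per level. Each launch only starts
// after the previous one has finished on the default stream, which gives the
// grid-wide synchronization that __syncthreads() cannot.
void upSweep(int n, int *dev_data, int blockSize) {
    for (int stride = 1; stride < n; stride *= 2) {
        // Only n / (2 * stride) threads have work at this level, so size the
        // grid for exactly that many (this is the thread trimming above).
        int threadsNeeded = n / (2 * stride);
        int numBlocks = (threadsNeeded + blockSize - 1) / blockSize;
        kernUpSweepLevel<<<numBlocks, blockSize>>>(n, stride, dev_data);
    }
}
```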

Another major issue was that at around 2^15 elements I got an allocation error. I traced it to the work-efficient scan and used the CUDA memory checker to catch an out-of-bounds access. It showed that I was multiplying two integers of roughly 65K each to compute the index (some threads had high indices), which overflowed to a negative index. I fixed this by returning early based on the stride, so the computed index can never exceed the available array. I also chose the grid size so that only the minimum number of threads needed at each stride is launched; this culled the thread count and sped up the run time.
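
To make the overflow concrete, a small illustration with made-up values (not taken from the actual run):

```cpp
// The per-thread index (t + 1) * 2 * stride - 1 is a product of two values in
// the tens of thousands, which can exceed what a 32-bit signed int can hold.
#include <climits>
#include <cstdio>

int main() {
    long long stride = 1 << 16;                  // 65,536
    long long t = 20000;                         // a high thread id at this level
    long long index = (t + 1) * 2 * stride - 1;  // about 2.6 * 10^9
    printf("index = %lld, INT_MAX = %d\n", index, INT_MAX);
    // Wrapped into 32 bits and read back as a signed int, the same value
    // comes out negative -- hence the out-of-bounds write.
    printf("as 32-bit: %d\n", (int)(unsigned int)index);
    return 0;
}
```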


A final issue: in the compaction I did not return early when the thread index was beyond the array size. That caused multiple threads to write to the same location, which was hard to debug; the fix is the bounds check at the top of kernScatter in common.cu below.

The code now works for every N I tried, up to the limit of my GPU's memory (about 2 GB).

Binary file added RunTimes.xlsx
Binary file not shown.
Binary file added img/RunTimes.pdf
Binary file not shown.
Binary file added img/RunTimes.png
Binary file added img/StreamCompactionTimes.pdf
Binary file not shown.
Binary file added img/StreamCompactionTimes.png
33 changes: 21 additions & 12 deletions src/main.cpp
@@ -12,19 +12,23 @@
#include <stream_compaction/efficient.h>
#include <stream_compaction/thrust.h>
#include "testing_helpers.hpp"
#include <iostream>

const int SIZE = 1 << 8; // feel free to change the size of array
const int SIZE = 1 << 26; // feel free to change the size of array
const int NPOT = SIZE - 3; // Non-Power-Of-Two
int a[SIZE], b[SIZE], c[SIZE];
//int a[SIZE], b[SIZE], c[SIZE];

int main(int argc, char* argv[]) {
// Scan tests

// Scan tests
int* a = static_cast<int*>(malloc(SIZE * sizeof(int)));
int* b = static_cast<int*>(malloc(SIZE * sizeof(int)));
int* c = static_cast<int*>(malloc(SIZE * sizeof(int)));
printf("\n");
printf("****************\n");
printf("** SCAN TESTS **\n");
printf("****************\n");

printf("Size: %d\n", SIZE);
genArray(SIZE - 1, a, 50); // Leave a 0 at the end to test that edge case
a[SIZE - 1] = 0;
printArray(SIZE, a, true);
@@ -45,12 +49,14 @@ int main(int argc, char* argv[]) {
printArray(NPOT, b, true);
printCmpResult(NPOT, b, c);

zeroArray(SIZE, c);
printDesc("naive scan, power-of-two");


printDesc("naive scan, power-of-two");
StreamCompaction::Naive::scan(SIZE, c, a);
printElapsedTime(StreamCompaction::Naive::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
//printArray(SIZE, c, true);
printElapsedTime(StreamCompaction::Naive::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
printArray(SIZE, c, true);
printCmpResult(SIZE, b, c);


zeroArray(SIZE, c);
printDesc("naive scan, non-power-of-two");
@@ -63,14 +69,14 @@ int main(int argc, char* argv[]) {
printDesc("work-efficient scan, power-of-two");
StreamCompaction::Efficient::scan(SIZE, c, a);
printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
//printArray(SIZE, c, true);
printArray(SIZE, c, true);
printCmpResult(SIZE, b, c);

zeroArray(SIZE, c);
printDesc("work-efficient scan, non-power-of-two");
StreamCompaction::Efficient::scan(NPOT, c, a);
printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
//printArray(NPOT, c, true);
printArray(NPOT, c, true);
printCmpResult(NPOT, b, c);

zeroArray(SIZE, c);
@@ -129,15 +135,18 @@ int main(int argc, char* argv[]) {
printDesc("work-efficient compact, power-of-two");
count = StreamCompaction::Efficient::compact(SIZE, c, a);
printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
//printArray(count, c, true);
printArray(count, c, true);
printCmpLenResult(count, expectedCount, b, c);

zeroArray(SIZE, c);
printDesc("work-efficient compact, non-power-of-two");
count = StreamCompaction::Efficient::compact(NPOT, c, a);
printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
//printArray(count, c, true);
printArray(count, c, true);
printCmpLenResult(count, expectedNPOT, b, c);

system("pause"); // stop Win32 console from closing on exit
free(a);
free(b);
free(c);
}
2 changes: 1 addition & 1 deletion src/testing_helpers.hpp
@@ -44,7 +44,7 @@ void zeroArray(int n, int *a) {
}

void genArray(int n, int *a, int maxval) {
srand(time(nullptr));
// srand(time(nullptr));

for (int i = 0; i < n; i++) {
a[i] = rand() % maxval;
2 changes: 1 addition & 1 deletion stream_compaction/CMakeLists.txt
@@ -13,5 +13,5 @@ set(SOURCE_FILES

cuda_add_library(stream_compaction
${SOURCE_FILES}
OPTIONS -arch=sm_20
OPTIONS -arch=sm_52
)
14 changes: 12 additions & 2 deletions stream_compaction/common.cu
@@ -23,7 +23,11 @@ namespace StreamCompaction {
* which map to 0 will be removed, and elements which map to 1 will be kept.
*/
__global__ void kernMapToBoolean(int n, int *bools, const int *idata) {
// TODO
int index = threadIdx.x + blockIdx.x * blockDim.x;
if (index >= n) {
return;
}
bools[index] = ( idata[index] == 0) ? 0 : 1;
}

/**
@@ -32,7 +36,13 @@
*/
__global__ void kernScatter(int n, int *odata,
const int *idata, const int *bools, const int *indices) {
// TODO
int index = threadIdx.x + blockIdx.x * blockDim.x;
if (index >= n) {
return;
}
if ( bools[index] == 1) {
odata[indices[index]] = idata[index];
}
}

}
71 changes: 37 additions & 34 deletions stream_compaction/common.h
@@ -1,42 +1,45 @@
#pragma once
#pragma once

#include <cuda.h>
#include <cuda_runtime.h>

#include <cstdio>
#include <cstring>
#include <cmath>
#include <algorithm>
#include <chrono>
#include <chrono>
#include <stdexcept>

#define FILENAME (strrchr(__FILE__, '/') ? strrchr(__FILE__, '/') + 1 : __FILE__)
#define checkCUDAError(msg) checkCUDAErrorFn(msg, FILENAME, __LINE__)

/**
* Check for CUDA errors; print and exit if there was a problem.
*/
void checkCUDAErrorFn(const char *msg, const char *file = NULL, int line = -1);

inline int ilog2(int x) {
int lg = 0;
while (x >>= 1) {
++lg;
}
return lg;
}

inline int ilog2ceil(int x) {
return ilog2(x - 1) + 1;
}

namespace StreamCompaction {
namespace Common {
__global__ void kernMapToBoolean(int n, int *bools, const int *idata);

__global__ void kernScatter(int n, int *odata,
const int *idata, const int *bools, const int *indices);


#define FILENAME (strrchr(__FILE__, '/') ? strrchr(__FILE__, '/') + 1 : __FILE__)
#define checkCUDAError(msg) checkCUDAErrorFn(msg, FILENAME, __LINE__)

/**
* Check for CUDA errors; print and exit if there was a problem.
*/
void checkCUDAErrorFn(const char *msg, const char *file = NULL, int line = -1);

inline int ilog2(int x) {
if (x < 0) {
return 0;
}
int lg = 0;
while (x >>= 1) {
++lg;
}
return lg;
}

inline int ilog2ceil(int x) {
return ilog2(x - 1) + 1;
}

namespace StreamCompaction {
namespace Common {
__global__ void kernMapToBoolean(int n, int *bools, const int *idata);

__global__ void kernScatter(int n, int *odata,
const int *idata, const int *bools, const int *indices);

/**
* This class is used for timing the performance
* Uncopyable and unmovable
@@ -127,6 +130,6 @@ namespace StreamCompaction {

float prev_elapsed_time_cpu_milliseconds = 0.f;
float prev_elapsed_time_gpu_milliseconds = 0.f;
};
}
}
};
}
}