101 changes: 94 additions & 7 deletions README.md
@@ -1,13 +1,100 @@
CUDA Stream Compaction
======================

**University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 2**

* Ziyu Li
* Tested on: Windows 7, i7-3840QM @ 2.8GHz, 16GB, Nvidia Quadro K4000M 4096MB (Personal Laptop)
## Performance Analysis
#### Efficient Scan without Optimization
This implementation performs an up-sweep (reduction) followed by a down-sweep on the GPU. It is much faster than the naive scan, but still slower than the CPU version.

For the benchmark and results, see the **Performance** section.
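The up-sweep/down-sweep structure can be sketched on the CPU as follows. This is a minimal reference for a power-of-two array, not the project's actual kernels; on the GPU, each level of each loop becomes one kernel launch.

```cpp
#include <vector>

// Exclusive scan via up-sweep (reduction) and down-sweep,
// mirroring the structure of the GPU kernels. n must be a power of two.
std::vector<int> efficientScan(std::vector<int> data) {
    const int n = static_cast<int>(data.size());
    // Up-sweep: build partial sums in place, doubling the stride each level.
    for (int d = 2; d <= n; d <<= 1) {
        for (int k = 0; k < n; k += d) {
            data[k + d - 1] += data[k + d / 2 - 1];
        }
    }
    // Down-sweep: zero the root, then push partial sums back down the tree.
    data[n - 1] = 0;
    for (int d = n; d >= 2; d >>= 1) {
        for (int k = 0; k < n; k += d) {
            int t = data[k + d / 2 - 1];
            data[k + d / 2 - 1] = data[k + d - 1];
            data[k + d - 1] += t;
        }
    }
    return data;
}
```

For example, `efficientScan({1, 2, 3, 4})` yields the exclusive prefix sums `{0, 1, 3, 6}`.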

#### Efficient Scan with Optimization
The basic efficient scan launches many threads that do no useful work at the deeper levels of the tree. By changing the kernels' indexing pattern so that only the active threads are launched, this optimization eliminates a large number of idle threads and improves overall performance.

For the benchmark and results, see the **Performance** section.
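The remapping can be illustrated with the up-sweep: at a level with stride `2^(d+1)`, only `n / 2^(d+1)` threads are needed, and a compact thread index `k` maps directly to the one element it updates. The sketch below is a hypothetical CPU rendering of that idea, not the project's actual kernel; the down-sweep is remapped analogously.

```cpp
#include <vector>

// Up-sweep with compact indexing: at each level, "thread" k is remapped to
// the single element it must update, so no idle threads are launched.
void upSweepCompact(std::vector<int>& data) {
    const int n = static_cast<int>(data.size()); // assumed power of two
    for (int stride = 2; stride <= n; stride <<= 1) {
        int active = n / stride;           // number of threads actually needed
        for (int k = 0; k < active; ++k) { // each k maps directly to its element
            int i = (k + 1) * stride - 1;
            data[i] += data[i - stride / 2];
        }
    }
}
```

With the naive indexing, every level would still launch `n` threads and mask most of them off with a divisibility test; here the launch count halves at every level.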


#### More Efficient Scan with Shared Memory
The optimized method above is still not efficient enough. Performing the operations in shared memory gets much closer to peak performance. The implementation splits into three parts:

* Scan each block separately and store each block's sum in an auxiliary array
* Scan the array of block sums
* Add each scanned block sum to every element of the corresponding block

![](img/39fig06.jpg)

(Figure 1: Algorithm for Performing a Sum Scan on a Large Array of Values, Nvidia GPU Gems)
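The three steps above can be sketched in host code as follows. Each step corresponds to a kernel launch in the actual implementation; the names here are illustrative, and `scanBlock` stands in for the shared-memory per-block scan kernel.

```cpp
#include <algorithm>
#include <vector>

// Serial exclusive scan, standing in for the shared-memory block-scan kernel.
static std::vector<int> scanBlock(const std::vector<int>& in) {
    std::vector<int> out(in.size(), 0);
    for (size_t i = 1; i < in.size(); ++i) out[i] = out[i - 1] + in[i - 1];
    return out;
}

// Hypothetical host-side sketch of the three-phase large-array scan.
std::vector<int> largeScan(const std::vector<int>& data, int blockSize) {
    const int n = static_cast<int>(data.size());
    const int numBlocks = (n + blockSize - 1) / blockSize;
    std::vector<int> out(n), blockSums(numBlocks);

    // 1. Scan each block separately and record each block's total.
    for (int b = 0; b < numBlocks; ++b) {
        int lo = b * blockSize, hi = std::min(n, lo + blockSize);
        std::vector<int> block(data.begin() + lo, data.begin() + hi);
        std::vector<int> scanned = scanBlock(block);
        std::copy(scanned.begin(), scanned.end(), out.begin() + lo);
        blockSums[b] = scanned.back() + block.back(); // block total
    }
    // 2. Scan the block sums to get each block's global offset.
    std::vector<int> sumOffsets = scanBlock(blockSums);
    // 3. Add each scanned block sum to every element of its block.
    for (int b = 0; b < numBlocks; ++b) {
        int lo = b * blockSize, hi = std::min(n, lo + blockSize);
        for (int i = lo; i < hi; ++i) out[i] += sumOffsets[b];
    }
    return out;
}
```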


This implementation is relatively easy to achieve; however, shared memory can suffer from bank conflicts, which hurt performance significantly on every access. To avoid bank conflicts, padding is added to the shared memory every certain number of elements. The offset can be implemented with a macro:

```c++
#define CONFLICT_FREE_OFFSET(n) \
    ((n) >> NUM_BANKS + (n) >> (2 * LOG_NUM_BANKS))
```
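Note that the macro as published in GPU Gems has an operator-precedence quirk (`+` binds tighter than `>>`). A common simplification pads once per bank, and the sketch below uses that simplified variant (my assumption, not necessarily the project's exact code) to show how the offset shifts shared-memory indices apart:

```cpp
#define LOG_NUM_BANKS 5  // 32 banks on modern Nvidia GPUs

// Simplified padding variant: one extra element per 2^LOG_NUM_BANKS elements,
// so indices that would map to the same bank are spread across banks.
#define CONFLICT_FREE_OFFSET(n) ((n) >> LOG_NUM_BANKS)

// How a kernel would apply it: pad every shared-memory index before use.
int paddedIndex(int i) { return i + CONFLICT_FREE_OFFSET(i); }
```

With this padding, index 32 (which would collide with index 0 in the same bank) becomes 33, landing in a different bank.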

For the benchmark and results, see the **Performance** section.

#### Radix Sort
One of the most significant applications of GPU scan is radix sort, a sorting algorithm well suited to parallel processors.

To use the radix sort, call the function below:
```c++
StreamCompaction::Radix::sort(int n, int *odata, int *idata);
```
* The first argument is the array size. (input)
* The second argument is the sorted array. (output)
* The third argument is the unsorted array. (input)
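Internally, a scan-based radix sort processes one bit per pass: map each element to a flag for the current bit, scan the flags to get scatter addresses, then scatter. The CPU sketch below shows that split primitive for non-negative ints; the actual `StreamCompaction::Radix::sort` is assumed to run the same steps as GPU kernels.

```cpp
#include <vector>

// One stable split pass per bit: elements whose current bit is 0 are
// scattered before those whose bit is 1. On the GPU, the running counts
// below would come from an exclusive scan over the bit flags.
void radixSort(std::vector<int>& a, int numBits = 31) {
    std::vector<int> tmp(a.size());
    for (int bit = 0; bit < numBits; ++bit) {
        int zeros = 0;                              // total elements with bit == 0
        for (int x : a) zeros += !((x >> bit) & 1);
        int zi = 0, oi = zeros;                     // scatter cursors
        for (int x : a) {
            if ((x >> bit) & 1) tmp[oi++] = x;
            else                tmp[zi++] = x;
        }
        a.swap(tmp);
    }
}
```

Because each pass is stable, after the last bit is processed the array is fully sorted.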

For the benchmark and results, see the **Performance** section.

## Performance
#### Scan Performance Measurement and Result
The benchmark performs the scan operation with 128 threads per block for array sizes from 2^4 to 2^22. (Since there is only one grid, 2^22 is the maximum array size for a block size of 128.)

The benchmark also compares the running times of the CPU, GPU Naive, GPU Efficient, GPU Efficient with Optimization, GPU Efficient with Shared Memory, and GPU Thrust scans.

![](img/scan_power_2.PNG)

![](img/scan_power_not_2.PNG)

![](img/scan.PNG)

(For detailed results, see the data in the **benchmark** folder.)

#### Compact Performance Measurement and Result
The benchmark performs the compaction operation with 128-512 threads per block for array sizes from 2^4 to 2^24 (block size 128 for array sizes 2^4 to 2^22, 256 for 2^23, and 512 for 2^24).

The benchmark also compares the running times of the CPU without scan, the CPU with scan, and the GPU with scan.

![](img/compaction_2.PNG)

![](img/compact.PNG)

(For detailed results, see the data in the **benchmark** folder.)

#### Radix Sort Performance Measurement and Result
The benchmark performs the radix sort with 128 threads per block for array sizes from 2^4 to 2^24.

The benchmark compares the running times of the CPU 3-part hybrid sort (the standard sort function in the STL) and the GPU radix sort.

![](img/radix_c.PNG)

![](img/radix_result.PNG)

## Questions
#### Roughly optimize the block sizes of each of your implementations for minimal run time on your GPU.

Based on benchmark results for large arrays (larger than 2^20), the optimal block sizes for each implementation are:

| Methods | Naive | Efficient | Efficient (Optimized) | Efficient (SharedMem) | Thrust |
|:----------:|-------|-----------|----------------------|----------------------|--------|
| Block Size | 1024 | 128 | 128 | 256 | 1024 |

#### Compare all of these GPU Scan implementations (Naive, Work-Efficient, and Thrust) to the serial CPU version of Scan. Plot a graph of the comparison

For the Thrust implementation, the largest share of GPU time goes to cudaMalloc- and cudaMemcpy-related calls, based on the Nsight timeline. The three most significant functions in the Thrust scan, *accumulate_tiles*, *exclusive_scan_n*, and *exclusive_downsweep*, do not actually use much GPU time.

I believe the performance bottleneck of the Thrust scan is memory bandwidth; its computation time is trivial compared to its memory I/O time. As for my implementations, the basic efficient method wastes a huge amount of time launching non-working threads, and for the efficient method with optimization, memory I/O becomes the most inefficient part of the whole system. Using shared memory greatly improves memory I/O efficiency and reduces memory latency, approaching maximum efficiency.

For the benchmarks and graphs, see the **Performance** section.
Binary file added img/39fig06.jpg
Binary file added img/compact.PNG
Binary file added img/compaction_2.PNG
Binary file added img/radix_c.PNG
Binary file added img/radix_result.PNG
Binary file added img/scan.PNG
Binary file added img/scan_power_2.PNG
Binary file added img/scan_power_not_2.PNG
119 changes: 110 additions & 9 deletions src/main.cpp
@@ -11,15 +11,39 @@
#include <stream_compaction/naive.h>
#include <stream_compaction/efficient.h>
#include <stream_compaction/thrust.h>
#include <stream_compaction/radix.h>
#include <algorithm>
#include "testing_helpers.hpp"

int SIZE = 1 << 20; // feel free to change the size of array
int NPOT = SIZE - 3; // Non-Power-Of-Two


using StreamCompaction::Common::PerformanceTimer;
PerformanceTimer& timer()
{
static PerformanceTimer timer;
return timer;
}

int main(int argc, char* argv[]) {
// Allow overriding the array-size exponent from the command line
// before any buffers are allocated, and keep NPOT in sync with SIZE.
if (argc == 2 && atoi(argv[1]) < 0) {
printf("---------------------------------------------------\n");
SIZE = 1 << (-1 * atoi(argv[1]));
NPOT = SIZE - 3;
}

int *a = (int *)malloc(SIZE * sizeof(int));
int *b = (int *)malloc(SIZE * sizeof(int));
int *c = (int *)malloc(SIZE * sizeof(int));

// Scan tests


printf("\n");
printf("****************\n");
printf("** SCAN TESTS **\n");
@@ -32,6 +56,7 @@ int main(int argc, char* argv[]) {
// initialize b using StreamCompaction::CPU::scan you implement
// We use b for further comparison. Make sure your StreamCompaction::CPU::scan is correct.
// At first all cases passed because b && c are all zeroes.

zeroArray(SIZE, b);
printDesc("cpu scan, power-of-two");
StreamCompaction::CPU::scan(SIZE, b, a);
@@ -59,20 +84,49 @@ int main(int argc, char* argv[]) {
//printArray(SIZE, c, true);
printCmpResult(NPOT, b, c);

zeroArray(SIZE, c);
printDesc("work-efficient scan, power-of-two");
StreamCompaction::Efficient::scan0(SIZE, c, a);
printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
//printArray(SIZE, c, true);
printCmpResult(SIZE, b, c);

zeroArray(SIZE, c);
printDesc("work-efficient scan, non-power-of-two");
StreamCompaction::Efficient::scan0(NPOT, c, a);
printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
//printArray(NPOT, c, true);
printCmpResult(NPOT, b, c);

zeroArray(SIZE, c);
printDesc("work-efficient scan with optimization, power-of-two");
StreamCompaction::Efficient::scan(SIZE, c, a);
printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
//printArray(SIZE, c, true);
printCmpResult(SIZE, b, c);

zeroArray(SIZE, c);
printDesc("work-efficient scan with optimization, non-power-of-two");
StreamCompaction::Efficient::scan(NPOT, c, a);
printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
//printArray(NPOT, c, true);
printCmpResult(NPOT, b, c);

zeroArray(SIZE, c);
printDesc("work-efficient scan with SHARE MEMORY and optimization, power-of-two");
StreamCompaction::Efficient::scan_s(SIZE, c, a);
printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
//printArray(SIZE, c, true);
printCmpResult(SIZE, b, c);

zeroArray(SIZE, c);
printDesc("work-efficient scan with SHARE MEMORY and optimization, non-power-of-two");
StreamCompaction::Efficient::scan_s(NPOT, c, a);
printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
//printArray(NPOT, c, true);
printCmpResult(NPOT, b, c);


zeroArray(SIZE, c);
printDesc("thrust scan, power-of-two");
StreamCompaction::Thrust::scan(SIZE, c, a);
@@ -86,7 +140,7 @@ int main(int argc, char* argv[]) {
printElapsedTime(StreamCompaction::Thrust::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
//printArray(NPOT, c, true);
printCmpResult(NPOT, b, c);

printf("\n");
printf("*****************************\n");
printf("** STREAM COMPACTION TESTS **\n");
@@ -119,12 +173,20 @@ int main(int argc, char* argv[]) {
printCmpLenResult(count, expectedNPOT, b, c);

zeroArray(SIZE, c);
printDesc("cpu compact with scan, power-of-two");
count = StreamCompaction::CPU::compactWithScan(SIZE, c, a);
printElapsedTime(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation(), "(std::chrono Measured)");
printArray(count, c, true);
printCmpLenResult(count, expectedCount, b, c);

zeroArray(SIZE, c);
printDesc("cpu compact with scan, non-power-of-two");
count = StreamCompaction::CPU::compactWithScan(NPOT, c, a);
printElapsedTime(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation(), "(std::chrono Measured)");
expectedNPOT = count;
printArray(count, c, true);
printCmpLenResult(count, expectedNPOT, b, c);

zeroArray(SIZE, c);
printDesc("work-efficient compact, power-of-two");
count = StreamCompaction::Efficient::compact(SIZE, c, a);
@@ -139,5 +201,44 @@ int main(int argc, char* argv[]) {
//printArray(count, c, true);
printCmpLenResult(count, expectedNPOT, b, c);



printf("\n");
printf("*****************************\n");
printf("** RADIX SORT TESTS **\n");
printf("*****************************\n");

genArray(SIZE - 1, a, 50); // Leave a 0 at the end to test that edge case
a[SIZE - 1] = 0;
printArray(SIZE, a, true);

memcpy(c, a, SIZE * sizeof(int));
timer().startCpuTimer();
std::sort(c, c + SIZE);
timer().endCpuTimer();
printElapsedTime(timer().getCpuElapsedTimeForPreviousOperation(), "(std::chrono Measured)");

zeroArray(SIZE, b);
printDesc("radix sort, power-of-two");
count = SIZE;
StreamCompaction::Radix::sort(SIZE, b, a);
printElapsedTime(StreamCompaction::Radix::timer().getGpuElapsedTimeForPreviousOperation(), "(cuda Measured)");
expectedCount = count;
printArray(count, b, true);
printCmpLenResult(count, expectedCount, b, c);

memcpy(c, a, SIZE * sizeof(int));
std::sort(c, c + NPOT);
zeroArray(SIZE, b);
printDesc("radix sort, not-power-of-two");
StreamCompaction::Radix::sort(NPOT, b, a);
printElapsedTime(StreamCompaction::Radix::timer().getGpuElapsedTimeForPreviousOperation(), "(cuda Measured)");
printArray(NPOT, b, true);
printCmpLenResult(NPOT, NPOT, b, c);


free(a);
free(b);
free(c);
system("pause"); // stop Win32 console from closing on exit
}
2 changes: 2 additions & 0 deletions stream_compaction/CMakeLists.txt
@@ -9,6 +9,8 @@ set(SOURCE_FILES
"efficient.cu"
"thrust.h"
"thrust.cu"
"radix.h"
"radix.cu"
)

cuda_add_library(stream_compaction
15 changes: 13 additions & 2 deletions stream_compaction/common.cu
@@ -23,7 +23,12 @@ namespace StreamCompaction {
* which map to 0 will be removed, and elements which map to 1 will be kept.
*/
__global__ void kernMapToBoolean(int n, int *bools, const int *idata) {
// TODO
int idx = threadIdx.x + (blockIdx.x * blockDim.x);
if (idx >= n) {
return;
}

bools[idx] = (bool)idata[idx];
}

/**
@@ -32,7 +37,13 @@
*/
__global__ void kernScatter(int n, int *odata,
const int *idata, const int *bools, const int *indices) {
// TODO
int idx = threadIdx.x + (blockIdx.x * blockDim.x);
if (idx >= n) {
return;
}
if (bools[idx]) {
odata[indices[idx]] = idata[idx];
}
}

}