diff --git a/.vscode/.BROWSE.VC.DB b/.vscode/.BROWSE.VC.DB
deleted file mode 100644
index e46997b..0000000
Binary files a/.vscode/.BROWSE.VC.DB and /dev/null differ
diff --git a/.vscode/.BROWSE.VC.DB-wal b/.vscode/.BROWSE.VC.DB-wal
deleted file mode 100644
index 45fc655..0000000
Binary files a/.vscode/.BROWSE.VC.DB-wal and /dev/null differ
diff --git a/README.md b/README.md
index b71c458..7e568d3 100644
--- a/README.md
+++ b/README.md
@@ -1,13 +1,153 @@
-CUDA Stream Compaction
-======================
+# University of Pennsylvania, CIS 565: GPU Programming and Architecture
+Project 2 CUDA: Stream Compaction
+====================
-**University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 2**
+## User resources
+- **Name:** David Grosman.
+- **Tested on:** Microsoft Windows 7 Professional, i7-5600U @ 2.6GHz, 256GB, GeForce 840M (personal laptop).
-* (TODO) YOUR NAME HERE
-* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab)
+## Project description
+The purpose of this project was to gain experience writing parallel algorithms as CUDA kernels, using them, and analyzing their performance.
+I implemented several GPU scan variants (naive, work-efficient, and Thrust-based) and stream compaction methods built on top of these scan algorithms. I also implemented the same functionality on the CPU, both as a baseline and as an easy way to verify that all my GPU implementations return the expected output. As extra credit, I improved the performance of my work-efficient scan and implemented radix sort on the GPU, which I then compared against Thrust's version.
+Finally, I implemented a simple Timer that supports timing both CUDA code (through CUDA events) and C++ code (through the C++11 std::chrono API).
-### (TODO: Your README)
+## Extra credit
+I implemented all of the extra credit offered for this project.
+
+### Why is my GPU approach so slow?
+I first implemented my work-efficient scan by following the class slides closely, but I was unsatisfied with its performance -- it was around 8 times slower than my CPU approach.
+![](images/Capture.JPG)
+Looking at the Nsight report above, I noticed that most threads were stalled yet still occupied the CUDA scheduler. The scan is composed of two phases: an up-sweep followed by a down-sweep. In the up-sweep, the number of threads doing useful work halves at each iteration; symmetrically, the down-sweep doubles the number of active threads at each iteration. Since the algorithm already tracks the cluster size at each iteration, I only needed to launch exactly as many threads as are active, calling kernSumUpSweep and kernSumDownSweep as follows:
+``` C++
+int d = 0;
+for (; d < maxDepth; ++d)
+{
+    const size_t clusterSize = 1 << (d + 1);
+    const size_t numThreads = (n >= clusterSize) ? n / clusterSize : n;
+
+    dim3 fullBlocksPerGrid((numThreads + blockSize - 1) / blockSize);
+    kernSumUpSweep<<<fullBlocksPerGrid, blockSize>>>(n, clusterSize, cudaBuffer);
+}
+
+cudaMemset(cudaBuffer + (n - 1), 0, sizeof(int));
+for (; d >= 0; --d)
+{
+    const size_t clusterSize = 1 << (d + 1);
+    const size_t numThreads = (n >= clusterSize) ? n / clusterSize : n;
+
+    dim3 fullBlocksPerGrid((numThreads + blockSize - 1) / blockSize);
+    kernSumDownSweep<<<fullBlocksPerGrid, blockSize>>>(n, clusterSize, cudaBuffer);
+}
+```
+I also had to compute the thread index as follows in both kernels' bodies so that each thread fetches the memory it is responsible for:
+```
+const unsigned long iThreadIdx = clusterSize * ( threadIdx.x + (blockIdx.x * blockDim.x) );
+```
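+For reference, here is a lightly condensed sketch of what the two kernels then look like (the full versions in efficient.cu additionally mask-check the index against the cluster size, which becomes redundant once the index is pre-multiplied by clusterSize):
+``` C++
+__global__ void kernSumUpSweep(int N, size_t clusterSize, int* buffer)
+{
+    // One thread per active node: iSelf is the first element of this thread's cluster.
+    const unsigned long iSelf = clusterSize * (threadIdx.x + (blockIdx.x * blockDim.x));
+    if (iSelf < N)
+    {
+        // Fold the left child's partial sum into this cluster's root.
+        buffer[iSelf + clusterSize - 1] += buffer[iSelf + (clusterSize >> 1) - 1];
+    }
+}
+
+__global__ void kernSumDownSweep(int N, size_t clusterSize, int* buffer)
+{
+    const unsigned long iSelf = clusterSize * (threadIdx.x + (blockIdx.x * blockDim.x));
+    if (iSelf < N)
+    {
+        // Push this node's value down to the left child and accumulate it on the right.
+        const int leftChildVal = buffer[iSelf + (clusterSize >> 1) - 1];
+        buffer[iSelf + (clusterSize >> 1) - 1] = buffer[iSelf + clusterSize - 1];
+        buffer[iSelf + clusterSize - 1] += leftChildVal;
+    }
+}
+```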
+
+The improved efficient GPU scan gave me the following Nsight report:
+![](images/CaptureNew.JPG)
+
+### Radix Sort
+I added radix_sort.h and radix_sort.cu to the stream_compaction subproject, where I implemented radix sort on top of my work-efficient scan. I tested it using the same format as the provided unit tests and got it to work with arrays of up to 65536 random integers. I also compared its performance against Thrust's radix sort (invoked through thrust::stable_sort); here are the results:
+
+![](images/RadixSortRuntimePerformanceGivenArraySize.JPG)
+
+It is interesting to note that the running time of Thrust's method is roughly constant, whereas my version is faster for smaller arrays and much slower for larger inputs. It would have been interesting to run the algorithm on even bigger inputs, since the two curves might converge again at some point.
+
+Note that my radix sort's function declaration matches the other functions implemented in this project:
+```C++
+void StreamCompaction::RadixSort::sort(int n, int *odata, const int *idata);
+```
+
+---
+### Performance Analysis
+The following statistics were captured by calling each function 1000 times (with the default parameters given at the start of the project) and averaging the results.
+Note that I used CUDA events to time GPU code, and for comparability I did not include any initial/final memory operations (cudaMalloc, cudaMemcpy) in my performance measurements.
+![](images/ScanRuntimePerformanceGivenArraySize.JPG)
+
+It is interesting to note that the CPU version is the fastest. This is most probably because the CPU algorithm is O(n) and sequential access to contiguous memory is very fast on the CPU compared with global-memory access on the GPU. The GPU running times also grow more quickly as the array size increases, which confirms that memory access is the main bottleneck of the GPU implementations.
+It is nice to see that the work-efficient scan performs better than the naive implementation, even though it does not outperform Thrust's version, which likely includes further optimizations.
+
+![](images/CompactRuntimePerformanceGivenArraySize.JPG)
+
+---
+### Test Program Output
+```
+****************
+** TESTS INFO **
+****************
+
+ArraySize = 65536; ArrayOddSize: 65533
+****************
+** SCAN TESTS **
+****************
+    [ 38 19 38 37 5 47 15 35 0 12 3 0 42 ... 35 0 ]
+==== cpu scan, power-of-two ====
+    [ 0 38 57 95 132 137 184 199 234 234 246 249 249 ... 1604374 1604409 ]
+(null) - Elapsed Time:0.139000 ms.
+==== cpu scan, non-power-of-two ====
+    [ 0 38 57 95 132 137 184 199 234 234 246 249 249 ... 1604305 1604316 ]
+    passed
+(null) - Elapsed Time:0.129000 ms.
+==== naive scan, power-of-two ====
+    passed
+(null) - Elapsed Time:0.609576 ms.
+==== naive scan, non-power-of-two ====
+    passed
+(null) - Elapsed Time:0.609687 ms.
+==== work-efficient scan, power-of-two ====
+    passed
+(null) - Elapsed Time:0.535494 ms.
+==== work-efficient scan, non-power-of-two ====
+    passed
+(null) - Elapsed Time:0.535663 ms.
+==== thrust scan, power-of-two ====
+    passed
+(null) - Elapsed Time:0.466199 ms.
+==== thrust scan, non-power-of-two ====
+    passed
+(null) - Elapsed Time:0.478206 ms.
+ +***************************** +** STREAM COMPACTION TESTS ** +***************************** + [ 2 3 2 1 3 1 1 1 2 0 1 0 2 ... 1 0 ] +==== cpu compact without scan, power-of-two ==== + [ 2 3 2 1 3 1 1 1 2 1 2 1 1 ... 1 1 ] + passed +(null) - Elapsed Time:0.178000 ms. +==== cpu compact without scan, non-power-of-two ==== + [ 2 3 2 1 3 1 1 1 2 1 2 1 1 ... 3 1 ] + passed +(null) - Elapsed Time:0.177000 ms. +==== cpu compact with scan ==== + [ 2 3 2 1 3 1 1 1 2 1 2 1 1 ... 1 1 ] + passed +(null) - Elapsed Time:0.374000 ms. +==== work-efficient compact, power-of-two ==== + passed +(null) - Elapsed Time:1.581446 ms. +==== work-efficient compact, non-power-of-two ==== + passed +(null) - Elapsed Time:1.613630 ms. + +***************************** +** RADIX SORT TESTS ** +***************************** + [ 38 7719 21238 2437 8855 11797 8365 32285 10450 30612 5853 28100 1142 ... +15085 0 ] +==== thrust radix-sort, power-of-two ==== + [ 0 1 1 1 1 2 2 2 3 3 4 4 4 ... 32766 32767 ] +(null) - Elapsed Time:1.960000 ms. +==== my radix-sort, power-of-two ==== + passed +(null) - Elapsed Time:3.173000 ms. +==== thrust radix-sort, non-power-of-two ==== + [ 1 1 1 1 2 2 2 3 3 4 4 4 4 ... 32766 32767 ] +(null) - Elapsed Time:1.901000 ms. +==== my radix-sort, non-power-of-two ==== + passed +(null) - Elapsed Time:3.126000 ms. +``` -Include analysis, etc. (Remember, this is public, so don't put -anything here that you don't want to share with the world.) diff --git a/images/Capture.JPG b/images/Capture.JPG new file mode 100644 index 0000000..47a48b8 Binary files /dev/null and b/images/Capture.JPG differ diff --git a/images/CaptureNew.JPG b/images/CaptureNew.JPG new file mode 100644 index 0000000..47e3fd1 Binary files /dev/null and b/images/CaptureNew.JPG differ diff --git a/images/CompactRuntimePerformanceGivenArraySize.JPG b/images/CompactRuntimePerformanceGivenArraySize.JPG new file mode 100644 index 0000000..9ce751a Binary files /dev/null and b/images/CompactRuntimePerformanceGivenArraySize.JPG differ diff --git a/images/PerfGraphs.xlsx b/images/PerfGraphs.xlsx new file mode 100644 index 0000000..e77c419 Binary files /dev/null and b/images/PerfGraphs.xlsx differ diff --git a/images/RadixSortRuntimePerformanceGivenArraySize.JPG b/images/RadixSortRuntimePerformanceGivenArraySize.JPG new file mode 100644 index 0000000..4f1e434 Binary files /dev/null and b/images/RadixSortRuntimePerformanceGivenArraySize.JPG differ diff --git a/images/ScanRuntimePerformanceGivenArraySize.JPG b/images/ScanRuntimePerformanceGivenArraySize.JPG new file mode 100644 index 0000000..e68a109 Binary files /dev/null and b/images/ScanRuntimePerformanceGivenArraySize.JPG differ diff --git a/img/example-1.png b/img/example-1.png deleted file mode 100644 index 28633a6..0000000 Binary files a/img/example-1.png and /dev/null differ diff --git a/img/example-2.jpg b/img/example-2.jpg deleted file mode 100644 index 984c2fd..0000000 Binary files a/img/example-2.jpg and /dev/null differ diff --git a/img/figure-39-2.jpg b/img/figure-39-2.jpg deleted file mode 100644 index bc9f9da..0000000 Binary files a/img/figure-39-2.jpg and /dev/null differ diff --git a/img/figure-39-4.jpg b/img/figure-39-4.jpg deleted file mode 100644 index 5888f20..0000000 Binary files a/img/figure-39-4.jpg and /dev/null differ diff --git a/src/main.cpp b/src/main.cpp index 675da35..6cdfec4 100644 --- a/src/main.cpp +++ b/src/main.cpp @@ -7,74 +7,145 @@ */ #include + +#include #include #include #include #include +#include + #include "testing_helpers.hpp" -int main(int argc, char* 
argv[]) { - const int SIZE = 1 << 8; - const int NPOT = SIZE - 3; - int a[SIZE], b[SIZE], c[SIZE]; +int main(int argc, char* argv[]) +{ + const int SIZE = 1 << 16; + const int NPOT = SIZE - 3; + int* a = new int[SIZE]; + int* b = new int[SIZE]; + int* c = new int[SIZE]; - // Scan tests + Timer::initializeTimer(); + const int numTestRepeat = 1000; - printf("\n"); - printf("****************\n"); - printf("** SCAN TESTS **\n"); - printf("****************\n"); + // Scan tests + printf("****************\n"); + printf("** TESTS INFO **\n"); + printf("****************\n"); + printf("\n"); + printf("ArraySize = %d; ArrayOddSize: %d\n", SIZE, NPOT); - genArray(SIZE - 1, a, 50); // Leave a 0 at the end to test that edge case - a[SIZE - 1] = 0; - printArray(SIZE, a, true); + printf("****************\n"); + printf("** SCAN TESTS **\n"); + printf("****************\n"); + + genArray(SIZE - 1, a, 50); // Leave a 0 at the end to test that edge case + a[SIZE - 1] = 0; + printArray(SIZE, a, true); + + printDesc("cpu scan, power-of-two"); + { + Timer::resetTimer(false); + for (size_t i = 0; i < numTestRepeat; ++i) + { + zeroArray(SIZE, b); + StreamCompaction::CPU::scan(SIZE, b, a); + } + printArray(SIZE, b, true); + Timer::printTimer(NULL, 1.0f / numTestRepeat); + } + + printDesc("cpu scan, non-power-of-two"); + { + Timer::resetTimer(false); + for (size_t i = 0; i < numTestRepeat; ++i) + { + zeroArray(SIZE, c); + StreamCompaction::CPU::scan(NPOT, c, a); + } + printArray(NPOT, b, true); + printCmpResult(NPOT, b, c); + Timer::printTimer(NULL, 1.0f / numTestRepeat); + } - zeroArray(SIZE, b); - printDesc("cpu scan, power-of-two"); - StreamCompaction::CPU::scan(SIZE, b, a); - printArray(SIZE, b, true); - - zeroArray(SIZE, c); - printDesc("cpu scan, non-power-of-two"); - StreamCompaction::CPU::scan(NPOT, c, a); - printArray(NPOT, b, true); - printCmpResult(NPOT, b, c); - - zeroArray(SIZE, c); - printDesc("naive scan, power-of-two"); - StreamCompaction::Naive::scan(SIZE, c, a); - //printArray(SIZE, c, true); - printCmpResult(SIZE, b, c); - - zeroArray(SIZE, c); - printDesc("naive scan, non-power-of-two"); - StreamCompaction::Naive::scan(NPOT, c, a); - //printArray(SIZE, c, true); - printCmpResult(NPOT, b, c); - - zeroArray(SIZE, c); - printDesc("work-efficient scan, power-of-two"); - StreamCompaction::Efficient::scan(SIZE, c, a); - //printArray(SIZE, c, true); - printCmpResult(SIZE, b, c); - - zeroArray(SIZE, c); - printDesc("work-efficient scan, non-power-of-two"); - StreamCompaction::Efficient::scan(NPOT, c, a); - //printArray(NPOT, c, true); - printCmpResult(NPOT, b, c); - - zeroArray(SIZE, c); - printDesc("thrust scan, power-of-two"); - StreamCompaction::Thrust::scan(SIZE, c, a); - //printArray(SIZE, c, true); - printCmpResult(SIZE, b, c); - - zeroArray(SIZE, c); - printDesc("thrust scan, non-power-of-two"); - StreamCompaction::Thrust::scan(NPOT, c, a); - //printArray(NPOT, c, true); - printCmpResult(NPOT, b, c); + printDesc("naive scan, power-of-two"); + { + Timer::resetTimer(); + for (size_t i = 0; i < numTestRepeat; ++i) + { + zeroArray(SIZE, c); + StreamCompaction::Naive::scan(SIZE, c, a); + } + //printArray(SIZE, c, true); + printCmpResult(SIZE, b, c); + Timer::printTimer(NULL, 1.0f / numTestRepeat); + } + + printDesc("naive scan, non-power-of-two"); + { + Timer::resetTimer(); + for (size_t i = 0; i < numTestRepeat; ++i) + { + zeroArray(SIZE, c); + StreamCompaction::Naive::scan(NPOT, c, a); + } + //printArray(SIZE, c, true); + printCmpResult(NPOT, b, c); + Timer::printTimer(NULL, 1.0f / numTestRepeat); + } 
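+
+    // Note: Timer::resetTimer() defaults to CUDA-event timing for the GPU tests,
+    // while the CPU tests call Timer::resetTimer(false) to time with std::chrono.
+    // The 1.0f / numTestRepeat factor passed to printTimer averages the accumulated time.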
+ + printDesc("work-efficient scan, power-of-two"); + { + Timer::resetTimer(); + for (size_t i = 0; i < numTestRepeat; ++i) + { + zeroArray(SIZE, c); + StreamCompaction::Efficient::scan(SIZE, c, a); + } + //printArray(SIZE, c, true); + printCmpResult(SIZE, b, c); + Timer::printTimer(NULL, 1.0f / numTestRepeat); + } + + printDesc("work-efficient scan, non-power-of-two"); + { + Timer::resetTimer(); + for (size_t i = 0; i < numTestRepeat; ++i) + { + zeroArray(SIZE, c); + StreamCompaction::Efficient::scan(NPOT, c, a); + } + //printArray(NPOT, c, true); + printCmpResult(NPOT, b, c); + Timer::printTimer(NULL, 1.0f / numTestRepeat); + } + + printDesc("thrust scan, power-of-two"); + { + Timer::resetTimer(); + for (size_t i = 0; i < numTestRepeat; ++i) + { + zeroArray(SIZE, c); + StreamCompaction::Thrust::scan(SIZE, c, a); + } + //printArray(SIZE, c, true); + printCmpResult(SIZE, b, c); + Timer::printTimer(NULL, 1.0f / numTestRepeat); + } + + printDesc("thrust scan, non-power-of-two"); + { + Timer::resetTimer(); + for (size_t i = 0; i < numTestRepeat; ++i) + { + zeroArray(SIZE, c); + + StreamCompaction::Thrust::scan(NPOT, c, a); + } + //printArray(NPOT, c, true); + printCmpResult(NPOT, b, c); + Timer::printTimer(NULL, 1.0f / numTestRepeat); + } printf("\n"); printf("*****************************\n"); @@ -89,35 +160,138 @@ int main(int argc, char* argv[]) { int count, expectedCount, expectedNPOT; - zeroArray(SIZE, b); - printDesc("cpu compact without scan, power-of-two"); - count = StreamCompaction::CPU::compactWithoutScan(SIZE, b, a); - expectedCount = count; - printArray(count, b, true); - printCmpLenResult(count, expectedCount, b, b); - - zeroArray(SIZE, c); - printDesc("cpu compact without scan, non-power-of-two"); - count = StreamCompaction::CPU::compactWithoutScan(NPOT, c, a); - expectedNPOT = count; - printArray(count, c, true); - printCmpLenResult(count, expectedNPOT, b, c); - - zeroArray(SIZE, c); - printDesc("cpu compact with scan"); - count = StreamCompaction::CPU::compactWithScan(SIZE, c, a); - printArray(count, c, true); - printCmpLenResult(count, expectedCount, b, c); - - zeroArray(SIZE, c); - printDesc("work-efficient compact, power-of-two"); - count = StreamCompaction::Efficient::compact(SIZE, c, a); - //printArray(count, c, true); - printCmpLenResult(count, expectedCount, b, c); - - zeroArray(SIZE, c); - printDesc("work-efficient compact, non-power-of-two"); - count = StreamCompaction::Efficient::compact(NPOT, c, a); - //printArray(count, c, true); - printCmpLenResult(count, expectedNPOT, b, c); + printDesc("cpu compact without scan, power-of-two"); + { + Timer::resetTimer(false); + for (size_t i = 0; i < numTestRepeat; ++i) + { + zeroArray(SIZE, b); + + count = StreamCompaction::CPU::compactWithoutScan(SIZE, b, a); + expectedCount = count; + } + printArray(count, b, true); + printCmpLenResult(count, expectedCount, b, b); + Timer::printTimer(NULL, 1.0f / numTestRepeat); + } + + printDesc("cpu compact without scan, non-power-of-two"); + { + Timer::resetTimer(false); + for (size_t i = 0; i < numTestRepeat; ++i) + { + zeroArray(SIZE, c); + + count = StreamCompaction::CPU::compactWithoutScan(NPOT, c, a); + expectedNPOT = count; + } + printArray(count, c, true); + printCmpLenResult(count, expectedNPOT, b, c); + Timer::printTimer(NULL, 1.0f / numTestRepeat); + } + + printDesc("cpu compact with scan"); + { + Timer::resetTimer(false); + for (size_t i = 0; i < numTestRepeat; ++i) + { + zeroArray(SIZE, c); + count = StreamCompaction::CPU::compactWithScan(SIZE, c, a); + } + printArray(count, 
c, true); + printCmpLenResult(count, expectedCount, b, c); + Timer::printTimer(NULL, 1.0f / numTestRepeat); + } + + printDesc("work-efficient compact, power-of-two"); + { + Timer::resetTimer(); + for (size_t i = 0; i < numTestRepeat; ++i) + { + zeroArray(SIZE, c); + count = StreamCompaction::Efficient::compact(SIZE, c, a); + } + //printArray(count, c, true); + printCmpLenResult(count, expectedCount, b, c); + Timer::printTimer(NULL, 1.0f / numTestRepeat); + } + + printDesc("work-efficient compact, non-power-of-two"); + { + Timer::resetTimer(); + for (size_t i = 0; i < numTestRepeat; ++i) + { + zeroArray(SIZE, c); + count = StreamCompaction::Efficient::compact(NPOT, c, a); + } + //printArray(count, c, true); + printCmpLenResult(count, expectedNPOT, b, c); + Timer::printTimer(NULL, 1.0f / numTestRepeat); + } + + printf("\n"); + printf("*****************************\n"); + printf("** RADIX SORT TESTS **\n"); + printf("*****************************\n"); + + // RdixSort tests + genArray(SIZE - 1, a, SIZE); // Leave a 0 at the end to test that edge case + a[SIZE - 1] = 0; + printArray(SIZE, a, true); + + printDesc("thrust radix-sort, power-of-two"); + { + Timer::resetTimer(false); + for (size_t i = 0; i < numTestRepeat; ++i) + { + zeroArray(SIZE, b); + StreamCompaction::Thrust::radixSort(SIZE, b, a); + } + printArray(SIZE, b, true); + Timer::printTimer(NULL, 1.0f / numTestRepeat); + } + + printDesc("my radix-sort, power-of-two"); + { + Timer::resetTimer(false); + for (size_t i = 0; i < numTestRepeat; ++i) + { + zeroArray(SIZE, c); + StreamCompaction::RadixSort::sort(SIZE, c, a); + } + //printArray(SIZE, c, true); + printCmpResult(SIZE, b, c); + Timer::printTimer(NULL, 1.0f / numTestRepeat); + } + + printDesc("thrust radix-sort, non-power-of-two"); + { + Timer::resetTimer(false); + for (size_t i = 0; i < numTestRepeat; ++i) + { + zeroArray(SIZE, b); + StreamCompaction::Thrust::radixSort(NPOT, b, a); + } + printArray(NPOT, b, true); + Timer::printTimer(NULL, 1.0f / numTestRepeat); + } + + printDesc("my radix-sort, non-power-of-two"); + { + Timer::resetTimer(false); + for (size_t i = 0; i < numTestRepeat; ++i) + { + zeroArray(SIZE, c); + StreamCompaction::RadixSort::sort(NPOT, c, a); + } + //printArray(SIZE, c, true); + printCmpResult(NPOT, b, c); + Timer::printTimer(NULL, 1.0f / numTestRepeat); + } + + Timer::shutdownTimer(); + + delete[] a; + delete[] b; + delete[] c; } diff --git a/stream_compaction/CMakeLists.txt b/stream_compaction/CMakeLists.txt index cdbef77..ae09203 100644 --- a/stream_compaction/CMakeLists.txt +++ b/stream_compaction/CMakeLists.txt @@ -1,6 +1,8 @@ set(SOURCE_FILES "common.h" "common.cu" + "timer.h" + "timer.cu" "cpu.h" "cpu.cu" "naive.h" @@ -9,9 +11,11 @@ set(SOURCE_FILES "efficient.cu" "thrust.h" "thrust.cu" + "radix_sort.h" + "radix_sort.cu" ) cuda_add_library(stream_compaction ${SOURCE_FILES} - OPTIONS -arch=sm_20 + OPTIONS -arch=sm_50 ) diff --git a/stream_compaction/common.cu b/stream_compaction/common.cu index fe872d4..3488929 100644 --- a/stream_compaction/common.cu +++ b/stream_compaction/common.cu @@ -14,7 +14,6 @@ void checkCUDAErrorFn(const char *msg, const char *file, int line) { exit(EXIT_FAILURE); } - namespace StreamCompaction { namespace Common { @@ -22,8 +21,14 @@ namespace Common { * Maps an array to an array of 0s and 1s for stream compaction. Elements * which map to 0 will be removed, and elements which map to 1 will be kept. 
*/ -__global__ void kernMapToBoolean(int n, int *bools, const int *idata) { - // TODO +__global__ void kernMapToBoolean(int n, int *bools, const int *idata) +{ + const int iSelf = threadIdx.x + (blockIdx.x * blockDim.x); + if (0 <= iSelf && iSelf < n) + { + const bool elementIsValid = (idata[iSelf] != 0); + bools[iSelf] = elementIsValid; + } } /** @@ -31,8 +36,43 @@ __global__ void kernMapToBoolean(int n, int *bools, const int *idata) { * if bools[idx] == 1, it copies idata[idx] to odata[indices[idx]]. */ __global__ void kernScatter(int n, int *odata, - const int *idata, const int *bools, const int *indices) { - // TODO + const int *idata, const int *bools, const int *indices) +{ + const int iSelf = threadIdx.x + (blockIdx.x * blockDim.x); + if (0 <= iSelf && iSelf < n) + { + if (bools[iSelf] == 1) + { + const int dataIdx = indices[iSelf]; + odata[dataIdx] = idata[iSelf]; + } + } +} + +__global__ void kernScatter(int n, int *odata, const int *idata, const int *indices) +{ + const int iSelf = threadIdx.x + (blockIdx.x * blockDim.x); + if (0 <= iSelf && iSelf < n) + { + const int dataIdx = indices[iSelf]; + odata[dataIdx] = idata[iSelf]; + } +} + +__global__ void convertInclusiveToExclusiveScan(int N, int* inInclusiveScan, int* outExclusiveScan) +{ + const int iSelf = threadIdx.x + (blockIdx.x * blockDim.x); + if (iSelf < N) + { + if (iSelf > 0) + { + outExclusiveScan[iSelf] = inInclusiveScan[iSelf - 1]; + } + else + { + outExclusiveScan[iSelf] = 0; + } + } } } diff --git a/stream_compaction/common.h b/stream_compaction/common.h index 4f52663..b4ec3e5 100644 --- a/stream_compaction/common.h +++ b/stream_compaction/common.h @@ -4,8 +4,10 @@ #include #include + #define FILENAME (strrchr(__FILE__, '/') ? strrchr(__FILE__, '/') + 1 : __FILE__) #define checkCUDAError(msg) checkCUDAErrorFn(msg, FILENAME, __LINE__) +#define checkCUDAErrorWithLine(msg) checkCUDAError(msg, __LINE__) /** * Check for CUDA errors; print and exit if there was a problem. @@ -24,12 +26,15 @@ inline int ilog2ceil(int x) { return ilog2(x - 1) + 1; } - namespace StreamCompaction { namespace Common { __global__ void kernMapToBoolean(int n, int *bools, const int *idata); __global__ void kernScatter(int n, int *odata, const int *idata, const int *bools, const int *indices); + + __global__ void kernScatter(int n, int *odata, const int *idata, const int *indices); + + __global__ void convertInclusiveToExclusiveScan(int N, int* inInclusiveScan, int* outExclusiveScan); } -} +} \ No newline at end of file diff --git a/stream_compaction/cpu.cu b/stream_compaction/cpu.cu index e600c29..3c5fb83 100644 --- a/stream_compaction/cpu.cu +++ b/stream_compaction/cpu.cu @@ -1,4 +1,5 @@ #include +#include "timer.h" #include "cpu.h" namespace StreamCompaction { @@ -7,19 +8,37 @@ namespace CPU { /** * CPU scan (prefix sum). */ -void scan(int n, int *odata, const int *idata) { - // TODO - printf("TODO\n"); -} + void scan(int n, int *odata, const int *idata) + { + Timer::playTimer(); + { + odata[0] = 0; + for (size_t i = 1; i < n; ++i) + { + odata[i] = odata[i - 1] + idata[i - 1]; + } + } + Timer::pauseTimer(); + } /** * CPU stream compaction without using the scan function. * * @returns the number of elements remaining after compaction. 
*/ -int compactWithoutScan(int n, int *odata, const int *idata) { - // TODO - return -1; +int compactWithoutScan(int n, int *odata, const int *idata) +{ + int j = 0; + Timer::playTimer(); + { + for (int i = 0; i < n; ++i) + { + if (idata[i] != 0) + odata[j++] = idata[i]; + } + } + Timer::pauseTimer(); + return j; } /** @@ -27,9 +46,32 @@ int compactWithoutScan(int n, int *odata, const int *idata) { * * @returns the number of elements remaining after compaction. */ -int compactWithScan(int n, int *odata, const int *idata) { - // TODO - return -1; +int compactWithScan(int n, int *odata, const int *idata) +{ + Timer::playTimer(); + int* filterData = new int[n]; + //1) Compute temporary array containing 1 if corresponding element meets + //criteria, 0 otherwise. + for (int i = 0; i < n; ++i) + { + filterData[i] = (idata[i] != 0); + } + + //2) Run exclusive scan on temporary array. + scan(n, odata, filterData); + const int numElementsAfterCompaction = odata[n - 1]; + + //3) Result of scan is index into final array. + //Only write an element if temporary array has a 1 + for (int i = 0; i < n; ++i) + { + if (filterData[i] == 1) + odata[odata[i]] = idata[i]; + } + delete[] filterData; + Timer::pauseTimer(); + + return numElementsAfterCompaction; } } diff --git a/stream_compaction/efficient.cu b/stream_compaction/efficient.cu index b2f739b..8ad6be2 100644 --- a/stream_compaction/efficient.cu +++ b/stream_compaction/efficient.cu @@ -1,19 +1,100 @@ #include #include #include "common.h" +#include "timer.h" #include "efficient.h" namespace StreamCompaction { namespace Efficient { -// TODO: __global__ + __global__ void kernSumUpSweep(int N, size_t clusterSize, int* buffer) + { + const unsigned long iSelf = clusterSize * (threadIdx.x + (blockIdx.x * blockDim.x)); + const unsigned long clusterBitMask = clusterSize - 1; + if (0 <= iSelf && iSelf < N && ( (iSelf & clusterBitMask) == 0) ) + { + buffer[iSelf + clusterSize - 1] += buffer[iSelf + (clusterSize >> 1) - 1]; + } + } + + __global__ void kernSumDownSweep(int N, size_t clusterSize, int* buffer) + { + const unsigned long iSelf = clusterSize * ( threadIdx.x + (blockIdx.x * blockDim.x) ); + const unsigned long clusterBitMask = clusterSize - 1; + if (0 <= iSelf && iSelf < N && ((iSelf & clusterBitMask) == 0)) + { + int leftChildVal = buffer[iSelf + (clusterSize >> 1) - 1]; + buffer[iSelf + (clusterSize >> 1) - 1] = buffer[iSelf + clusterSize - 1]; // Set left child to this node’s value + buffer[iSelf + clusterSize - 1] += leftChildVal; // Set right child to old left value + this node’s value + } + } + + + void _scanInHostPlace(int n, int *cudaBuffer, const int *idata) + { + const int blockSize = 128; + + const int maxDepth = ilog2ceil(n); + const int nextPowerOf2 = 1 << maxDepth; + + cudaMemset(cudaBuffer, 0, nextPowerOf2 * sizeof(int)); + cudaMemcpy(cudaBuffer, idata, n * sizeof(int), cudaMemcpyHostToDevice); + + n = nextPowerOf2; + + Timer::playTimer(); + int d = 0; + for (; d < maxDepth; ++d) + { + const size_t clusterSize = 1 << (d + 1); + const size_t numThreads = (n >= clusterSize) ? n / clusterSize : n; + + dim3 fullBlocksPerGrid((numThreads + blockSize - 1) / blockSize); + kernSumUpSweep << > >(n, clusterSize, cudaBuffer); + } + + cudaMemset(cudaBuffer + (n - 1), 0, sizeof(int)); + for (; d >= 0; --d) + { + const size_t clusterSize = 1 << (d + 1); + const size_t numThreads = (n >= clusterSize) ? 
n / clusterSize : n; + + dim3 fullBlocksPerGrid((numThreads + blockSize - 1) / blockSize); + kernSumDownSweep << > >(n, clusterSize, cudaBuffer); + } + Timer::pauseTimer(); + } + +int* scanInHostPlace(int n, const int *idata) +{ + if (n <= 0 || idata == NULL) + return NULL; + checkCUDAErrorWithLine("cudaMalloc cudaIn failed!"); + const int maxDepth = ilog2ceil(n); + const int nextPowerOf2 = 1 << maxDepth; + + int* cudaBuffer = NULL; + checkCUDAErrorWithLine("cudaMalloc cudaIn failed!"); + cudaMalloc((void**)&cudaBuffer, nextPowerOf2 * sizeof(int)); + checkCUDAErrorWithLine("cudaMalloc cudaIn failed!"); + _scanInHostPlace(n, cudaBuffer, idata); + + return cudaBuffer; +} /** - * Performs prefix-sum (aka scan) on idata, storing the result into odata. - */ -void scan(int n, int *odata, const int *idata) { - // TODO - printf("TODO\n"); +* Performs prefix-sum (aka scan) on idata, storing the result into odata. +*/ + +void scan(int n, int *odata, const int *idata) +{ + if (n <= 0 || odata == NULL || idata == NULL) + return; + + int* cudaBuffer = scanInHostPlace(n, idata); + + cudaMemcpy(odata, cudaBuffer, n * sizeof(int), cudaMemcpyDeviceToHost); + cudaFree(cudaBuffer); } /** @@ -25,9 +106,55 @@ void scan(int n, int *odata, const int *idata) { * @param idata The array of elements to compact. * @returns The number of elements remaining after compaction. */ -int compact(int n, int *odata, const int *idata) { - // TODO - return -1; +int compact(int n, int *odata, const int *idata) +{ + if (n <= 0 || odata == NULL || idata == NULL) + return 0; + + const int blockSize = 128; + dim3 fullBlocksPerGrid((n + blockSize - 1) / blockSize); + + int* cudaInBuffer = NULL; + cudaMalloc((void**)&cudaInBuffer, n * sizeof(int)); + checkCUDAErrorWithLine("cudaMalloc cudaIn failed!"); + cudaMemcpy(cudaInBuffer, idata, n * sizeof(int), cudaMemcpyHostToDevice); + + int* cudaBitMask = NULL; + cudaMalloc((void**)&cudaBitMask, n * sizeof(int)); + checkCUDAErrorWithLine("cudaMalloc cudaIn failed!"); + cudaMemset(cudaBitMask, 0, n * sizeof(int)); + + Timer::playTimer(); + + Common::kernMapToBoolean << > >(n, cudaBitMask, cudaInBuffer); + cudaMemcpy(odata, cudaBitMask, n * sizeof(int), cudaMemcpyDeviceToHost); + int endsWith1; + cudaMemcpy(&endsWith1, &cudaBitMask[n - 1], sizeof(int), cudaMemcpyDeviceToHost); + + int* cudaScanResult = scanInHostPlace(n, cudaBitMask); + cudaMemcpy(odata, cudaScanResult, n * sizeof(int), cudaMemcpyDeviceToHost); + + int outNumElements = 0; + cudaMemcpy(&outNumElements, cudaScanResult + n - 1, sizeof(int), cudaMemcpyDeviceToHost); + outNumElements += endsWith1; + + int* cudaOutBuffer = NULL; + cudaMalloc((void**)&cudaOutBuffer, n * sizeof(int)); + checkCUDAErrorWithLine("cudaMalloc cudaOutBuffer failed!"); + cudaMemset(cudaOutBuffer, 0, n * sizeof(int)); + + Common::kernScatter << > >(n, cudaOutBuffer, cudaInBuffer, cudaBitMask, cudaScanResult); + + Timer::pauseTimer(); + + cudaMemcpy(odata, cudaOutBuffer, outNumElements * sizeof(int), cudaMemcpyDeviceToHost); + + cudaFree(cudaOutBuffer); + cudaFree(cudaScanResult); + cudaFree(cudaBitMask); + cudaFree(cudaInBuffer); + + return outNumElements; } } diff --git a/stream_compaction/efficient.h b/stream_compaction/efficient.h index 395ba10..390157a 100644 --- a/stream_compaction/efficient.h +++ b/stream_compaction/efficient.h @@ -2,7 +2,9 @@ namespace StreamCompaction { namespace Efficient { + void scan(int n, int *odata, const int *idata); + int* scanInHostPlace(int n, const int *idata); int compact(int n, int *odata, const int *idata); } diff --git 
a/stream_compaction/naive.cu b/stream_compaction/naive.cu index 3d86b60..2f93e6f 100644 --- a/stream_compaction/naive.cu +++ b/stream_compaction/naive.cu @@ -1,20 +1,61 @@ #include #include #include "common.h" +#include "timer.h" #include "naive.h" namespace StreamCompaction { namespace Naive { -// TODO: __global__ + __global__ void kernSumUp(int N, int inStartIdx, int* inArray, int* outArray) + { + const int iSelf = threadIdx.x + (blockIdx.x * blockDim.x); + if (inStartIdx <= iSelf && iSelf < N) + { + outArray[iSelf] = inArray[iSelf - inStartIdx] + inArray[iSelf]; + } + } /** * Performs prefix-sum (aka scan) on idata, storing the result into odata. */ -void scan(int n, int *odata, const int *idata) { - // TODO - printf("TODO\n"); -} + void scan(int n, int *odata, const int *idata) + { + if (n <= 0 || odata == NULL || idata == NULL) + return; + + const int blockSize = 96; + dim3 fullBlocksPerGrid((n + blockSize - 1) / blockSize); + + int* cudaIn = NULL; + cudaMalloc((void**)&cudaIn, n * sizeof(int)); + checkCUDAErrorWithLine("cudaMalloc cudaIn failed!"); + cudaMemcpy(cudaIn, idata, n * sizeof(int), cudaMemcpyHostToDevice); + + int* cudaOut = NULL; + cudaMalloc((void**)&cudaOut, n * sizeof(int)); + checkCUDAErrorWithLine("cudaMalloc cudaOut failed!"); + + int maxDepth = ilog2ceil(n); + Timer::playTimer(); + for (int d = 0; d < maxDepth; ++d) + { + const int inStartIdx = 1 << d; + cudaMemcpy(cudaOut, cudaIn, inStartIdx * sizeof(int), cudaMemcpyHostToDevice); + + kernSumUp << > >(n, inStartIdx, cudaIn, cudaOut); + + // Ping-pong the buffers + int* cudaTemp = cudaIn; cudaIn = cudaOut; cudaOut = cudaTemp; + } + Common::convertInclusiveToExclusiveScan << > > (n, cudaIn, cudaOut); + Timer::pauseTimer(); + + cudaMemcpy(odata, cudaOut, n * sizeof(int), cudaMemcpyDeviceToHost); + + cudaFree(cudaOut); + cudaFree(cudaIn); + } } } diff --git a/stream_compaction/radix_sort.cu b/stream_compaction/radix_sort.cu new file mode 100644 index 0000000..00180ca --- /dev/null +++ b/stream_compaction/radix_sort.cu @@ -0,0 +1,121 @@ +#include +#include +#include "common.h" +#include "efficient.h" + +#include "radix_sort.h" + +namespace StreamCompaction { + namespace RadixSort { + + __global__ void kernInitEBitMap(int N, int bit, int* inBuffer, int* outEBitMap) + { + const int iSelf = threadIdx.x + (blockIdx.x * blockDim.x); + if (0 <= iSelf && iSelf < N) + { + outEBitMap[iSelf] = 1 - ((inBuffer[iSelf] & (1 << bit)) != 0); + } + } + + __global__ void kernInitTArray(int N, int numTotalFalses, int* cudaFBuffer, int* outTBuffer) + { + const int iSelf = threadIdx.x + (blockIdx.x * blockDim.x); + if (0 <= iSelf && iSelf < N) + { + outTBuffer[iSelf] = iSelf - cudaFBuffer[iSelf] + numTotalFalses; + } + } + + __global__ void kernInitDArray(int N, int* cudaEBuffer, int* cudaTBuffer, int* cudaFBuffer, int* outDBuffer) + { + const int iSelf = threadIdx.x + (blockIdx.x * blockDim.x); + if (0 <= iSelf && iSelf < N) + { + outDBuffer[iSelf] = cudaEBuffer[iSelf] ? cudaFBuffer[iSelf] : cudaTBuffer[iSelf]; + } + } + + /** + * Performs Parallel Radix Sort. 
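+     * The per-bit split-and-scatter steps below follow the scan-based radix sort:
+     *   e[i] = 1 if bit b of in[i] is 0, else 0          (kernInitEBitMap)
+     *   f    = exclusive scan of e                       (Efficient::scanInHostPlace)
+     *   totalFalses = e[n-1] + f[n-1]
+     *   t[i] = i - f[i] + totalFalses                    (kernInitTArray)
+     *   d[i] = e[i] ? f[i] : t[i]                        (kernInitDArray)
+     *   out[d[i]] = in[i], then copy out back into in    (Common::kernScatter)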
+ */ + void sort(int n, int *odata, const int *idata) + { + if (n <= 0 || odata == NULL || idata == NULL) + return; + + const int blockSize = 128; + dim3 fullBlocksPerGrid((n + blockSize - 1) / blockSize); + + int* cudaInBuffer = NULL; + cudaMalloc((void**)&cudaInBuffer, n * sizeof(int)); + checkCUDAErrorWithLine("cudaMalloc cudaInBuffer failed!"); + cudaMemcpy(cudaInBuffer, idata, n * sizeof(int), cudaMemcpyHostToDevice); + + int* cudaEBuffer = NULL; + cudaMalloc((void**)&cudaEBuffer, n * sizeof(int)); + checkCUDAErrorWithLine("cudaMalloc cudaEBuffer failed!"); + cudaMemset(cudaEBuffer, 0, n * sizeof(int)); + + int* cudaFBuffer = NULL; + + int* cudaTBuffer = NULL; + cudaMalloc((void**)&cudaTBuffer, n * sizeof(int)); + checkCUDAErrorWithLine("cudaMalloc cudaTBuffer failed!"); + cudaMemset(cudaTBuffer, 0, n * sizeof(int)); + + int* cudaDBuffer = NULL; + cudaMalloc((void**)&cudaDBuffer, n * sizeof(int)); + checkCUDAErrorWithLine("cudaMalloc cudaDBuffer failed!"); + cudaMemset(cudaDBuffer, 0, n * sizeof(int)); + + int* cudaOutBuffer = NULL; + cudaMalloc((void**)&cudaOutBuffer, n * sizeof(int)); + checkCUDAErrorWithLine("cudaMalloc cudaOutBuffer failed!"); + cudaMemset(cudaOutBuffer, 0, n * sizeof(int)); + + const int maxNumBits = ilog2ceil(n) + 1; + for (int bitIdx = 0; bitIdx < maxNumBits; ++bitIdx) + { + checkCUDAErrorWithLine("cudaMalloc cudaIn failed!"); + kernInitEBitMap << > > (n, bitIdx, cudaInBuffer, cudaEBuffer); + checkCUDAErrorWithLine("cudaMalloc cudaIn failed!"); + cudaFBuffer = StreamCompaction::Efficient::scanInHostPlace(n, cudaEBuffer); + checkCUDAErrorWithLine("cudaMalloc cudaIn failed!"); + + int numTotalFalses = 0; + { + checkCUDAErrorWithLine("cudaMalloc cudaIn failed!"); + int lastEElement = 0; + checkCUDAErrorWithLine("cudaMalloc cudaIn failed!"); + cudaMemcpy(&lastEElement, cudaEBuffer + n - 1, sizeof(int), cudaMemcpyDeviceToHost); + checkCUDAErrorWithLine("cudaMalloc cudaIn failed!"); + int lastFElement = 0; + checkCUDAErrorWithLine("cudaMalloc cudaIn failed!"); + cudaMemcpy(&lastFElement, cudaFBuffer + n - 1, sizeof(int), cudaMemcpyDeviceToHost); + numTotalFalses = lastEElement + lastFElement; + } + checkCUDAErrorWithLine("cudaMalloc cudaIn failed!"); + kernInitTArray << > > (n, numTotalFalses, cudaFBuffer, cudaTBuffer); + checkCUDAErrorWithLine("cudaMalloc cudaIn failed!"); + kernInitDArray << > > (n, cudaEBuffer, cudaTBuffer, cudaFBuffer, cudaDBuffer); + checkCUDAErrorWithLine("cudaMalloc cudaIn failed!"); + StreamCompaction::Common::kernScatter << > > (n, cudaOutBuffer, cudaInBuffer, cudaDBuffer); + checkCUDAErrorWithLine("cudaMalloc cudaIn failed!"); + cudaMemcpy(odata, cudaOutBuffer, n * sizeof(int), cudaMemcpyDeviceToHost); + cudaMemcpy(cudaInBuffer, cudaOutBuffer, n * sizeof(int), cudaMemcpyDeviceToDevice); + checkCUDAErrorWithLine("cudaMalloc cudaIn failed!"); + + cudaFree(cudaFBuffer); + } + + + cudaFree(cudaOutBuffer); + cudaFree(cudaDBuffer); + cudaFree(cudaTBuffer); + cudaFree(cudaEBuffer); + cudaFree(cudaInBuffer); + + } + + } +} diff --git a/stream_compaction/radix_sort.h b/stream_compaction/radix_sort.h new file mode 100644 index 0000000..0ec8249 --- /dev/null +++ b/stream_compaction/radix_sort.h @@ -0,0 +1,7 @@ +#pragma once + +namespace StreamCompaction { + namespace RadixSort { + void sort(int n, int *odata, const int *idata); + } +} diff --git a/stream_compaction/thrust.cu b/stream_compaction/thrust.cu index d8dbb32..89ef5a5 100644 --- a/stream_compaction/thrust.cu +++ b/stream_compaction/thrust.cu @@ -2,8 +2,10 @@ #include #include #include 
+#include #include #include "common.h" +#include "timer.h" #include "thrust.h" namespace StreamCompaction { @@ -12,10 +14,23 @@ namespace Thrust { /** * Performs prefix-sum (aka scan) on idata, storing the result into odata. */ -void scan(int n, int *odata, const int *idata) { - // TODO use `thrust::exclusive_scan` - // example: for device_vectors dv_in and dv_out: - // thrust::exclusive_scan(dv_in.begin(), dv_in.end(), dv_out.begin()); +void scan(int n, int *odata, const int *idata) +{ + thrust::device_vector dv_in(idata, idata + n); + thrust::device_vector dv_out(odata, odata + n); + Timer::playTimer(); + thrust::exclusive_scan(dv_in.begin(), dv_in.end(), dv_out.begin()); + Timer::pauseTimer(); + thrust::copy(dv_out.begin(), dv_out.end(), odata); +} + +void radixSort(int n, int *odata, const int *idata) +{ + thrust::device_vector dv_in(idata, idata + n); + Timer::playTimer(); + thrust::stable_sort(dv_in.begin(), dv_in.end()); + Timer::pauseTimer(); + thrust::copy(dv_in.begin(), dv_in.end(), odata); } } diff --git a/stream_compaction/thrust.h b/stream_compaction/thrust.h index 06707f3..2ec04d3 100644 --- a/stream_compaction/thrust.h +++ b/stream_compaction/thrust.h @@ -3,5 +3,6 @@ namespace StreamCompaction { namespace Thrust { void scan(int n, int *odata, const int *idata); + void radixSort(int n, int *odata, const int *idata); } } diff --git a/stream_compaction/timer.cu b/stream_compaction/timer.cu new file mode 100644 index 0000000..113b24e --- /dev/null +++ b/stream_compaction/timer.cu @@ -0,0 +1,140 @@ + +#include +#include + +#include "timer.h" + + +MyTimer* Timer::m_myTimer = NULL; + +class MyTimer +{ +public: + using Clock = std::chrono::high_resolution_clock; + using TimePoint = std::chrono::time_point; + +public: + MyTimer() + { + m_refCount = 0; + m_useGPU = true; + m_elapsedTimeInms = 0.0f; + + cudaEventCreate(&m_start); + cudaEventCreate(&m_stop); + + m_startTime = Clock::now(); + m_stopTime = Clock::now(); + } + + ~MyTimer() + { + cudaEventDestroy(m_start); + cudaEventDestroy(m_stop); + } + +public: + + void resetTimer(bool useGPU = true) + { + m_useGPU = useGPU; + m_elapsedTimeInms = 0.0f; + } + + void playTimer() + { + if (m_refCount++ == 0) + { + if (m_useGPU) + { + cudaEventRecord(m_start); + } + else + { + m_startTime = Clock::now(); + } + } + } + + bool pauseTimer() + { + bool bPaused = false; + if (--m_refCount == 0) + { + float newElapsedTime = 0.0f; + if (m_useGPU) + { + cudaEventRecord(m_stop); + cudaEventSynchronize(m_stop); + cudaEventElapsedTime(&newElapsedTime, m_start, m_stop); + } + else + { + m_stopTime = Clock::now(); + newElapsedTime = std::chrono::duration_cast(m_stopTime - m_startTime).count(); + } + m_elapsedTimeInms += newElapsedTime; + bPaused = true; + } + return bPaused; + } + + float printTimer(const char* timerHeader, float timerFactor) + { + float elapsedTime = timerFactor * m_elapsedTimeInms; + printf("%s - Elapsed Time:%f ms.\n", timerHeader, elapsedTime); + return elapsedTime; + } + +private: + size_t m_refCount; + bool m_useGPU; + float m_elapsedTimeInms; +private: + cudaEvent_t m_start; + cudaEvent_t m_stop; + +private: + TimePoint m_startTime; + TimePoint m_stopTime; +}; + +Timer::Timer() +{ +} + +Timer::~Timer() +{ +} + +void Timer::initializeTimer() +{ + if (m_myTimer == NULL) + m_myTimer = new MyTimer; +} + +void Timer::shutdownTimer() +{ + if (m_myTimer != NULL) + delete m_myTimer; +} + +void Timer::resetTimer(bool useGPU) +{ + m_myTimer->resetTimer(useGPU); +} + +void Timer::playTimer() +{ + m_myTimer->playTimer(); +} + +void 
Timer::pauseTimer() +{ + m_myTimer->pauseTimer(); +} + +void Timer::printTimer(const char* timerHeader, float timerFactor) +{ + m_myTimer->printTimer(timerHeader, timerFactor); +} diff --git a/stream_compaction/timer.h b/stream_compaction/timer.h new file mode 100644 index 0000000..76cab78 --- /dev/null +++ b/stream_compaction/timer.h @@ -0,0 +1,23 @@ +#pragma once + +class MyTimer; + +class Timer +{ +public: + Timer(); + ~Timer(); + +public: + static void initializeTimer(); + static void shutdownTimer(); + +public: + static void resetTimer(bool useGPU = true); + static void playTimer(); + static void pauseTimer(); + static void printTimer(const char* timerHeader, float timerFactor); + +private: + static MyTimer* m_myTimer; +}; \ No newline at end of file