diff --git a/README.md b/README.md
index 0e38ddb..9021546 100644
--- a/README.md
+++ b/README.md
@@ -3,12 +3,153 @@ CUDA Stream Compaction
 **University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 2**
 
-* (TODO) YOUR NAME HERE
-  * (TODO) [LinkedIn](), [personal website](), [twitter](), etc.
-* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab)
+* Kevin Dong
+  * [LinkedIn](www.linkedin.com/in/xingyu-dong)
+* Tested on: Windows 11, Intel(R) Core(TM) i7-10750H CPU @ 2.60GHz, RTX 2060
 
-### (TODO: Your README)
+This repo implements several versions of the parallel exclusive scan algorithm and uses them to implement stream
+compaction. The work-efficient scan is also used to implement a parallel radix sort.
 
-Include analysis, etc. (Remember, this is public, so don't put
-anything here that you don't want to share with the world.)
+### Features
+#### CPU Scan and Compaction
+The CPU scan implementation is a straightforward $O(n)$ algorithm. The compaction algorithms, with and without
+scan, also run in $O(n)$ time. All of them perform well on small inputs but slow down as the input size grows.
+#### Naive GPU Scan
+The naive parallel scan finishes in $O(\log n)$ steps but still performs $O(n \log n)$ additions, so it does no
+less total work than the sequential algorithm.
+#### Work-Efficient GPU Scan and Compaction
+The work-efficient scan improves upon the naive algorithm by treating the array as a balanced binary tree. It first
+performs a parallel reduction (the up-sweep), followed by a down-sweep that yields the final scan result. This
+algorithm performs only $O(n)$ additions and is much faster than the naive scan. To accommodate input sizes that are
+not powers of 2, we enlarge the input array to the next power of 2 and pad the rest of the array with 0s.
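The up-sweep/down-sweep pair described above can be illustrated with a sequential C++ sketch. This is a CPU illustration of the algorithm's structure, not the repo's CUDA kernels, and `workEfficientScanSketch` is a hypothetical name:

```cpp
#include <cassert>
#include <vector>

// Sequential sketch of the work-efficient (Blelloch) exclusive scan.
// Assumes data.size() is a power of two; the GPU version guarantees this
// by padding the input with 0s up to the next power of two.
void workEfficientScanSketch(std::vector<int>& data) {
    const int n = static_cast<int>(data.size());
    // Up-sweep: build partial sums up a balanced binary tree.
    for (int offset = 2; offset <= n; offset <<= 1) {
        for (int pos = 0; pos < n; pos += offset) {
            data[pos + offset - 1] += data[pos + offset / 2 - 1];
        }
    }
    // Down-sweep: zero the root, then push partial sums back down the tree,
    // turning the reduction into an exclusive prefix sum.
    data[n - 1] = 0;
    for (int offset = n; offset >= 2; offset >>= 1) {
        for (int pos = 0; pos < n; pos += offset) {
            int left = data[pos + offset / 2 - 1];
            data[pos + offset / 2 - 1] = data[pos + offset - 1];
            data[pos + offset - 1] += left;
        }
    }
}
```

On the GPU, each inner loop iteration maps to one thread, so each level of the tree is one kernel launch.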
+#### Work-Efficient Scan Optimization (Part 5 Extra Credit)
+We can further optimize the work-efficient scan by reducing the amount of work performed. Most threads are not
+needed, because only a subset of the nodes has to be updated during each iteration, so we launch only as many
+threads as there are nodes to update.
+#### Thrust Scan
+The Thrust implementation simply calls `thrust::exclusive_scan` from the Thrust API.
+#### Radix Sort (Extra Credit 1)
+The parallel radix sort uses the work-efficient exclusive scan as part of its implementation. We iterate through $k$
+bits, and each pass's exclusive scan takes $O(\log n)$ steps; generating the bit array and scattering each take a
+constant number of parallel steps. The total step complexity is therefore $O(k \cdot \log n)$.
+
+### Performance Analysis
+We compare the performance of the different scan algorithms across input sizes. The graph below is in log scale.
+![Performance Graph](img/Figure_2.png)
+From the graph, the Thrust scan performs best among all the scan algorithms, followed by the work-efficient scan.
+When the input size is small, the CPU implementation is the fastest, but it quickly falls behind as the input size
+increases. The naive scan also becomes slow once the size gets very large.
+
+#### Performance Bottlenecks
+Our timers for all the algorithms exclude memory allocation and the final copy of the result array back to the host.
+The measured time should therefore mainly reflect the computation time plus the memory accesses made during the
+computation.
+
+The CPU implementation is fairly efficient and can hardly be improved further, since it is already a linear-time
+algorithm. The runtime results also show that it is very fast on small inputs, since, unlike its GPU counterparts,
+it incurs little launch and transfer overhead.
+
+The naive algorithm generally performs worse than the work-efficient algorithm: it performs more additions than even
+the CPU implementation, and many of its threads sit idle rather than doing useful work. Since our implementation uses
+global memory, memory access latency could also slow it down; a shared-memory version would likely perform better.
+
+The work-efficient algorithm improves on the naive one by organizing the computation as a balanced binary tree.
+The number of additions across the up-sweep and the down-sweep becomes $O(n)$, making the algorithm more efficient.
+It does require more memory accesses, since more steps are involved and all data lives in global memory, so moving
+to a shared-memory model should greatly improve its performance. In part 5, we implemented an optimization that
+reduces the number of threads launched and replaces the modulo operation with a value comparison, which should make
+the algorithm more efficient.
+
+The Thrust implementation is the most efficient when the input size is very large. It presumably uses shared memory
+and carefully optimizes its use of warps and registers, making it very efficient on large-scale data.
+
+![Thrust Analysis](img/Nsight.png)
+
+From the Nsight Systems timeline, we can see that the Thrust implementation sustains a very high SM warp occupancy,
+indicating that it uses GPU resources efficiently. There are also some DRAM read and write operations, which likely
+correspond to memory accesses during the up-sweep and down-sweep phases. Overall, the Thrust implementation looks
+very efficient and well optimized.
+
+
+### Output
+This output is generated with $2^{20}$ elements and a block size of $256$.
+``` +**************** +** SCAN TESTS ** +**************** + [ 28 28 11 44 31 11 4 25 23 43 12 12 28 ... 48 0 ] +==== cpu scan, power-of-two ==== + elapsed time: 1.5931ms (std::chrono Measured) + [ 0 28 56 67 111 142 153 157 182 205 248 260 272 ... 25666831 25666879 ] + passed +==== cpu scan, non-power-of-two ==== + elapsed time: 1.4801ms (std::chrono Measured) + [ 0 28 56 67 111 142 153 157 182 205 248 260 272 ... 25666796 25666799 ] + passed +==== naive scan, power-of-two ==== + elapsed time: 1.85376ms (CUDA Measured) + [ 0 28 56 67 111 142 153 157 182 205 248 260 272 ... 25666831 25666879 ] + passed +==== naive scan, non-power-of-two ==== + elapsed time: 0.916128ms (CUDA Measured) + [ 0 28 56 67 111 142 153 157 182 205 248 260 272 ... 25666796 25666799 ] + passed +==== work-efficient scan, power-of-two ==== + elapsed time: 1.6073ms (CUDA Measured) + [ 0 28 56 67 111 142 153 157 182 205 248 260 272 ... 25666831 25666879 ] + passed +==== work-efficient scan, non-power-of-two ==== + elapsed time: 1.02006ms (CUDA Measured) + [ 0 28 56 67 111 142 153 157 182 205 248 260 272 ... 25666796 25666799 ] + passed +==== thrust scan, power-of-two ==== + elapsed time: 0.83968ms (CUDA Measured) + [ 0 28 56 67 111 142 153 157 182 205 248 260 272 ... 25666831 25666879 ] + passed +==== thrust scan, non-power-of-two ==== + elapsed time: 0.6672ms (CUDA Measured) + [ 0 28 56 67 111 142 153 157 182 205 248 260 272 ... 25666796 25666799 ] + passed + +***************************** +** STREAM COMPACTION TESTS ** +***************************** + [ 0 2 1 2 3 2 3 2 0 3 2 2 1 ... 2 0 ] +==== cpu compact without scan, power-of-two ==== + elapsed time: 2.2588ms (std::chrono Measured) + [ 2 1 2 3 2 3 2 3 2 2 1 1 2 ... 1 2 ] + passed +==== cpu compact without scan, non-power-of-two ==== + elapsed time: 2.1694ms (std::chrono Measured) + [ 2 1 2 3 2 3 2 3 2 2 1 1 2 ... 3 3 ] + passed +==== cpu compact with scan ==== + elapsed time: 5.4433ms (std::chrono Measured) + [ 2 1 2 3 2 3 2 3 2 2 1 1 2 ... 
1 2 ] + passed +==== work-efficient compact, power-of-two ==== + elapsed time: 1.83184ms (CUDA Measured) + [ 2 1 2 3 2 3 2 3 2 2 1 1 2 ... 1 2 ] + passed +==== work-efficient compact, non-power-of-two ==== + elapsed time: 1.1128ms (CUDA Measured) + [ 2 1 2 3 2 3 2 3 2 2 1 1 2 ... 3 3 ] + passed + +********************** +** RADIX SORT TESTS ** +********************** + [ 32 26 157 158 195 138 167 198 116 95 114 106 149 ... 30 0 ] +==== radix sort ==== + elapsed time: 17.5402ms (CUDA Measured) + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 199 199 ] +==== thrust sort ==== + elapsed time: 1.57117ms (CUDA Measured) + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 199 199 ] + passed +``` \ No newline at end of file diff --git a/img/Figure_1.png b/img/Figure_1.png new file mode 100644 index 0000000..42c4bb8 Binary files /dev/null and b/img/Figure_1.png differ diff --git a/img/Figure_2.png b/img/Figure_2.png new file mode 100644 index 0000000..2686593 Binary files /dev/null and b/img/Figure_2.png differ diff --git a/img/Nsight.png b/img/Nsight.png new file mode 100644 index 0000000..d269e81 Binary files /dev/null and b/img/Nsight.png differ diff --git a/src/main.cpp b/src/main.cpp index 896ac2b..3418c1f 100644 --- a/src/main.cpp +++ b/src/main.cpp @@ -11,14 +11,17 @@ #include #include #include +#include #include "testing_helpers.hpp" -const int SIZE = 1 << 8; // feel free to change the size of array +const int SIZE = 1 << 20; // feel free to change the size of array const int NPOT = SIZE - 3; // Non-Power-Of-Two int *a = new int[SIZE]; int *b = new int[SIZE]; int *c = new int[SIZE]; +int *d = new int[SIZE]; + int main(int argc, char* argv[]) { // Scan tests @@ -39,6 +42,7 @@ int main(int argc, char* argv[]) { StreamCompaction::CPU::scan(SIZE, b, a); printElapsedTime(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation(), "(std::chrono Measured)"); printArray(SIZE, b, true); + printCmpResult(SIZE, b, b); zeroArray(SIZE, c); printDesc("cpu scan, non-power-of-two"); @@ -51,7 +55,7 @@ 
int main(int argc, char* argv[]) { printDesc("naive scan, power-of-two"); StreamCompaction::Naive::scan(SIZE, c, a); printElapsedTime(StreamCompaction::Naive::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); - //printArray(SIZE, c, true); + printArray(SIZE, c, true); printCmpResult(SIZE, b, c); /* For bug-finding only: Array of 1s to help find bugs in stream compaction or scan @@ -64,37 +68,50 @@ int main(int argc, char* argv[]) { printDesc("naive scan, non-power-of-two"); StreamCompaction::Naive::scan(NPOT, c, a); printElapsedTime(StreamCompaction::Naive::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); - //printArray(SIZE, c, true); + printArray(NPOT, c, true); printCmpResult(NPOT, b, c); zeroArray(SIZE, c); printDesc("work-efficient scan, power-of-two"); StreamCompaction::Efficient::scan(SIZE, c, a); printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); - //printArray(SIZE, c, true); + printArray(SIZE, c, true); printCmpResult(SIZE, b, c); zeroArray(SIZE, c); printDesc("work-efficient scan, non-power-of-two"); StreamCompaction::Efficient::scan(NPOT, c, a); printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); - //printArray(NPOT, c, true); + printArray(NPOT, c, true); printCmpResult(NPOT, b, c); zeroArray(SIZE, c); printDesc("thrust scan, power-of-two"); StreamCompaction::Thrust::scan(SIZE, c, a); printElapsedTime(StreamCompaction::Thrust::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); - //printArray(SIZE, c, true); + printArray(SIZE, c, true); printCmpResult(SIZE, b, c); zeroArray(SIZE, c); printDesc("thrust scan, non-power-of-two"); StreamCompaction::Thrust::scan(NPOT, c, a); printElapsedTime(StreamCompaction::Thrust::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); - //printArray(NPOT, c, true); + printArray(NPOT, c, true); printCmpResult(NPOT, b, c); + // 
compare results between thrust array and cpu array +// zeroArray(SIZE, b); +// printDesc("compare thrust array and cpu array - CPU"); +// StreamCompaction::CPU::scan(NPOT, b, a); +// printElapsedTime(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation(), "(std::chrono Measured)"); +// printCmpResult(NPOT, b, b); +// +// zeroArray(SIZE, d); +// printDesc("compare thrust array and cpu array - Thrust"); +// StreamCompaction::Thrust::scan(NPOT, d, a); +// printElapsedTime(StreamCompaction::Thrust::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); +// printCmpResult(NPOT, d, b); + printf("\n"); printf("*****************************\n"); printf("** STREAM COMPACTION TESTS **\n"); @@ -137,18 +154,43 @@ int main(int argc, char* argv[]) { printDesc("work-efficient compact, power-of-two"); count = StreamCompaction::Efficient::compact(SIZE, c, a); printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); - //printArray(count, c, true); + printArray(count, c, true); printCmpLenResult(count, expectedCount, b, c); zeroArray(SIZE, c); printDesc("work-efficient compact, non-power-of-two"); count = StreamCompaction::Efficient::compact(NPOT, c, a); printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); - //printArray(count, c, true); + printArray(count, c, true); printCmpLenResult(count, expectedNPOT, b, c); + printf("\n"); + printf("**********************\n"); + printf("** RADIX SORT TESTS **\n"); + printf("**********************\n"); + + genArray(SIZE - 1, a, 200); // Leave a 0 at the end to test that edge case + a[SIZE - 1] = 0; + printArray(SIZE, a, true); + + int maxBits = StreamCompaction::Radix::getMaxBits(SIZE, a); + + zeroArray(SIZE, b); + printDesc("radix sort"); + StreamCompaction::Radix::sort(SIZE, b, a, maxBits); + printElapsedTime(StreamCompaction::Radix::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA 
Measured)"); + printArray(SIZE, b, true); + + zeroArray(SIZE, c); + printDesc("thrust sort"); + StreamCompaction::Thrust::sort(SIZE, c, a); + printElapsedTime(StreamCompaction::Thrust::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); + printArray(SIZE, c, true); + printCmpResult(SIZE, b, c); + system("pause"); // stop Win32 console from closing on exit delete[] a; delete[] b; delete[] c; + delete[] d; } diff --git a/stream_compaction/CMakeLists.txt b/stream_compaction/CMakeLists.txt index 19511ca..d5529e7 100644 --- a/stream_compaction/CMakeLists.txt +++ b/stream_compaction/CMakeLists.txt @@ -4,6 +4,7 @@ set(headers "naive.h" "efficient.h" "thrust.h" + "radix.h" ) set(sources @@ -12,7 +13,8 @@ set(sources "naive.cu" "efficient.cu" "thrust.cu" - ) + "radix.cu" +) list(SORT headers) list(SORT sources) diff --git a/stream_compaction/common.cu b/stream_compaction/common.cu index 2ed6d63..9b9f14a 100644 --- a/stream_compaction/common.cu +++ b/stream_compaction/common.cu @@ -23,7 +23,12 @@ namespace StreamCompaction { * which map to 0 will be removed, and elements which map to 1 will be kept. */ __global__ void kernMapToBoolean(int n, int *bools, const int *idata) { - // TODO + int index = threadIdx.x + (blockIdx.x * blockDim.x); + if (index >= n) { + return; + } + + bools[index] = idata[index] == 0 ? 
0 : 1; } /** @@ -32,7 +37,14 @@ namespace StreamCompaction { */ __global__ void kernScatter(int n, int *odata, const int *idata, const int *bools, const int *indices) { - // TODO + int index = threadIdx.x + (blockIdx.x * blockDim.x); + if (index >= n) { + return; + } + + if (bools[index] == 1) { + odata[indices[index]] = idata[index]; + } } } diff --git a/stream_compaction/cpu.cu b/stream_compaction/cpu.cu index 719fa11..941bca7 100644 --- a/stream_compaction/cpu.cu +++ b/stream_compaction/cpu.cu @@ -19,7 +19,12 @@ namespace StreamCompaction { */ void scan(int n, int *odata, const int *idata) { timer().startCpuTimer(); - // TODO + if (n > 0) { + odata[0] = 0; + for (int i = 1; i < n; i++) { + odata[i] = odata[i - 1] + idata[i - 1]; + } + } timer().endCpuTimer(); } @@ -30,9 +35,15 @@ namespace StreamCompaction { */ int compactWithoutScan(int n, int *odata, const int *idata) { timer().startCpuTimer(); - // TODO + int pos = 0; + for (int i = 0; i < n; i++) { + if (idata[i] != 0) { + odata[pos] = idata[i]; + pos++; + } + } timer().endCpuTimer(); - return -1; + return pos; } /** @@ -41,10 +52,40 @@ namespace StreamCompaction { * @returns the number of elements remaining after compaction. */ int compactWithScan(int n, int *odata, const int *idata) { + // construct the temporary array + int *temp = new int[n]; + int *scanResult = new int[n]; + timer().startCpuTimer(); - // TODO + // map + for (int i = 0; i < n; i++) { + temp[i] = idata[i] == 0 ? 
0 : 1;
+            }
+
+            // scan
+            if (n > 0) {
+                scanResult[0] = 0;
+                for (int i = 1; i < n; i++) {
+                    scanResult[i] = scanResult[i - 1] + temp[i - 1];
+                }
+            }
+
+            // scatter (count accumulates the number of kept elements)
+            int count = 0;
+            for (int i = 0; i < n; i++) {
+                if (temp[i] == 1) {
+                    odata[scanResult[i]] = idata[i];
+                    count++;
+                }
+            }
             timer().endCpuTimer();
-            return -1;
+
+            // free memory
+            delete[] temp;
+            delete[] scanResult;
+
+            return count;
         }
     }
 }
diff --git a/stream_compaction/efficient.cu b/stream_compaction/efficient.cu
index 2db346e..953f664 100644
--- a/stream_compaction/efficient.cu
+++ b/stream_compaction/efficient.cu
@@ -3,6 +3,8 @@
 #include "common.h"
 #include "efficient.h"
 
+#define blockSize 256
+
 namespace StreamCompaction {
     namespace Efficient {
         using StreamCompaction::Common::PerformanceTimer;
@@ -12,13 +14,65 @@ namespace StreamCompaction {
             return timer;
         }
 
+        __global__ void kernUpSweep(int n, int d, int* data) {
+            int index = threadIdx.x + (blockIdx.x * blockDim.x);
+            int offset = 1 << (d + 1); // 2^(d+1)
+            int pos = index * offset;
+
+            if (pos >= n) {
+                return;
+            }
+
+            data[pos + offset - 1] += data[pos + (offset >> 1) - 1];
+        }
+
+        __global__ void kernDownSweep(int n, int d, int* data) {
+            int index = threadIdx.x + (blockIdx.x * blockDim.x);
+            int offset = 1 << (d + 1); // 2^(d+1)
+            int pos = index * offset;
+
+            if (pos >= n) {
+                return;
+            }
+
+            int t = data[pos + (offset >> 1) - 1];
+            data[pos + (offset >> 1) - 1] = data[pos + offset - 1];
+            data[pos + offset - 1] += t;
+        }
+
         /**
          * Performs prefix-sum (aka scan) on idata, storing the result into odata.
         */
        void scan(int n, int *odata, const int *idata) {
+            int enlargedSize = 1 << ilog2ceil(n); // enlarge the size to the next power of 2
+            int* dev_idata;
+
+            cudaMalloc((void**)&dev_idata, enlargedSize * sizeof(int));
+            checkCUDAError("cudaMalloc dev_idata failed!");
+
+            // copy the input to GPU (size n data) and zero the padded tail,
+            // since cudaMalloc does not initialize memory
+            cudaMemcpy(dev_idata, idata, n * sizeof(int), cudaMemcpyHostToDevice);
+            cudaMemset(dev_idata + n, 0, (enlargedSize - n) * sizeof(int));
+
             timer().startGpuTimer();
-            // TODO
+            // up-sweep
+            for (int d = 0; d <= ilog2ceil(enlargedSize) - 1; d++) {
+                int fullBlocksPerGrid = (enlargedSize / (1 << (d + 1)) + blockSize - 1) / blockSize;
+                kernUpSweep<<<fullBlocksPerGrid, blockSize>>>(enlargedSize, d, dev_idata);
+            }
+
+            // down-sweep
+            cudaMemset(dev_idata + enlargedSize - 1, 0, sizeof(int));
+            for (int d = ilog2ceil(enlargedSize) - 1; d >= 0; d--) {
+                int fullBlocksPerGrid = (enlargedSize / (1 << (d + 1)) + blockSize - 1) / blockSize;
+                kernDownSweep<<<fullBlocksPerGrid, blockSize>>>(enlargedSize, d, dev_idata);
+            }
             timer().endGpuTimer();
+
+            // copy the result to odata (size n data)
+            cudaMemcpy(odata, dev_idata, n * sizeof(int), cudaMemcpyDeviceToHost);
+
+            // free memory
+            cudaFree(dev_idata);
         }
 
         /**
@@ -31,10 +85,66 @@ namespace StreamCompaction {
          * @returns The number of elements remaining after compaction.
         */
        int compact(int n, int *odata, const int *idata) {
+            int* dev_idata;
+            int* dev_odata;
+            int* dev_temp;
+            int* dev_scan;
+            int fullBlocksPerGrid = (n + blockSize - 1) / blockSize;
+
+            int enlargedSize = 1 << ilog2ceil(n); // enlarge the size to the next power of 2
+
+            cudaMalloc((void**)&dev_idata, n * sizeof(int));
+            checkCUDAError("cudaMalloc dev_idata failed!");
+            cudaMalloc((void**)&dev_odata, n * sizeof(int));
+            checkCUDAError("cudaMalloc dev_odata failed!");
+            cudaMalloc((void**)&dev_temp, enlargedSize * sizeof(int));
+            checkCUDAError("cudaMalloc dev_temp failed!");
+            cudaMalloc((void**)&dev_scan, enlargedSize * sizeof(int));
+            checkCUDAError("cudaMalloc dev_scan failed!");
+
+            // copy the input to GPU
+            cudaMemcpy(dev_idata, idata, n * sizeof(int), cudaMemcpyHostToDevice);
+
             timer().startGpuTimer();
-            // TODO
+            // map
+            StreamCompaction::Common::kernMapToBoolean<<<fullBlocksPerGrid, blockSize>>>(n, dev_temp, dev_idata);
+
+            // scan (implemented again); zero the padded tail, since cudaMalloc
+            // does not initialize memory
+            cudaMemcpy(dev_scan, dev_temp, n * sizeof(int), cudaMemcpyDeviceToDevice);
+            cudaMemset(dev_scan + n, 0, (enlargedSize - n) * sizeof(int));
+            // up-sweep
+            for (int d = 0; d <= ilog2ceil(enlargedSize) - 1; d++) {
+                int fullBlocksPerGridEnlarged = (enlargedSize / (1 << (d + 1)) + blockSize - 1) / blockSize;
+                kernUpSweep<<<fullBlocksPerGridEnlarged, blockSize>>>(enlargedSize, d, dev_scan);
+            }
+
+            // down-sweep
+            cudaMemset(dev_scan + enlargedSize - 1, 0, sizeof(int));
+            for (int d = ilog2ceil(enlargedSize) - 1; d >= 0; d--) {
+                int fullBlocksPerGridEnlarged = (enlargedSize / (1 << (d + 1)) + blockSize - 1) / blockSize;
+                kernDownSweep<<<fullBlocksPerGridEnlarged, blockSize>>>(enlargedSize, d, dev_scan);
+            }
+
+            // scatter
+            StreamCompaction::Common::kernScatter<<<fullBlocksPerGrid, blockSize>>>(n, dev_odata, dev_idata, dev_temp, dev_scan);
             timer().endGpuTimer();
-            return -1;
+
+            // calculate count
+            int lastScanValue = 0;
+            int lastTempValue = 0;
+            cudaMemcpy(&lastScanValue, dev_scan + n - 1, sizeof(int), cudaMemcpyDeviceToHost);
+            cudaMemcpy(&lastTempValue, dev_temp + n - 1, sizeof(int), cudaMemcpyDeviceToHost);
+            int count = lastScanValue + lastTempValue;
+
+            // copy the result to odata
+
cudaMemcpy(odata, dev_odata, count * sizeof(int), cudaMemcpyDeviceToHost); + + // free memory + cudaFree(dev_idata); + cudaFree(dev_odata); + cudaFree(dev_temp); + cudaFree(dev_scan); + + return count; } } } diff --git a/stream_compaction/naive.cu b/stream_compaction/naive.cu index 4308876..8cb8787 100644 --- a/stream_compaction/naive.cu +++ b/stream_compaction/naive.cu @@ -3,6 +3,8 @@ #include "common.h" #include "naive.h" +#define blockSize 256 + namespace StreamCompaction { namespace Naive { using StreamCompaction::Common::PerformanceTimer; @@ -11,15 +13,62 @@ namespace StreamCompaction { static PerformanceTimer timer; return timer; } - // TODO: __global__ + + __global__ void kernNaiveScan(int n, int d, int* odata, const int* idata) { + int index = threadIdx.x + (blockIdx.x * blockDim.x); + if (index >= n) { + return; + } + + int offset = 1 << (d - 1); + if (index >= offset) { + odata[index] = idata[index - offset] + idata[index]; + } + else { + odata[index] = idata[index]; + } + } + + void ChangeToExclusive(int n, int* odata) { + for (int i = n - 1; i > 0; i--) { + odata[i] = odata[i - 1]; + } + odata[0] = 0; + } /** * Performs prefix-sum (aka scan) on idata, storing the result into odata. 
         */
        void scan(int n, int *odata, const int *idata) {
+            int* dev_idata;
+            int* dev_odata;
+            int fullBlocksPerGrid = (n + blockSize - 1) / blockSize;
+
+            cudaMalloc((void**)&dev_idata, n * sizeof(int));
+            checkCUDAError("cudaMalloc dev_idata failed!");
+            cudaMalloc((void**)&dev_odata, n * sizeof(int));
+            checkCUDAError("cudaMalloc dev_odata failed!");
+
+            // copy the input to GPU
+            cudaMemcpy(dev_idata, idata, n * sizeof(int), cudaMemcpyHostToDevice);
+
             timer().startGpuTimer();
-            // TODO
+            // outer loop
+            for (int d = 1; d <= ilog2ceil(n); d++) {
+                // parallel process
+                kernNaiveScan<<<fullBlocksPerGrid, blockSize>>>(n, d, dev_odata, dev_idata);
+                std::swap(dev_odata, dev_idata);
+            }
             timer().endGpuTimer();
+
+            cudaMemcpy(odata, dev_idata, n * sizeof(int), cudaMemcpyDeviceToHost);
+
+            // shift right and add 0 to the beginning to acquire the exclusive scan
+            ChangeToExclusive(n, odata);
+
+            // free memory
+            cudaFree(dev_idata);
+            cudaFree(dev_odata);
         }
     }
 }
diff --git a/stream_compaction/radix.cu b/stream_compaction/radix.cu
new file mode 100644
index 0000000..e0a682b
--- /dev/null
+++ b/stream_compaction/radix.cu
@@ -0,0 +1,139 @@
+#include
+#include
+#include
+#include "radix.h"
+
+#include "common.h"
+#include "efficient.h"
+
+#define blockSize 128
+
+namespace StreamCompaction {
+    namespace Radix {
+        using StreamCompaction::Common::PerformanceTimer;
+        PerformanceTimer& timer()
+        {
+            static PerformanceTimer timer;
+            return timer;
+        }
+
+        __global__ void kernMapToCurrBits(int n, int *odata, const int *idata, int bit) {
+            int index = threadIdx.x + (blockIdx.x * blockDim.x);
+            if (index >= n) {
+                return;
+            }
+
+            // get the bit-th bit of the number, but also flip the bit to produce the e array
+            odata[index] = !((idata[index] >> bit) & 1);
+        }
+
+        __global__ void kernComputeT(int n, int *odata, const int *idata, const int totalFalses) {
+            int index = threadIdx.x + (blockIdx.x * blockDim.x);
+            if (index >= n) {
+                return;
+            }
+
+            odata[index] = index - idata[index] + totalFalses;
+        }
+
+        __global__
void kernComputeD(int n, int *odata, const int *be, const int *t, const int *f) {
+            int index = threadIdx.x + (blockIdx.x * blockDim.x);
+            if (index >= n) {
+                return;
+            }
+
+            // be is stored as the e array. To access b, reverse the condition
+            odata[index] = be[index] ? f[index] : t[index];
+        }
+
+        __global__ void scatter(int n, int *odata, const int *idata, const int *d) {
+            int index = threadIdx.x + (blockIdx.x * blockDim.x);
+            if (index >= n) {
+                return;
+            }
+
+            odata[d[index]] = idata[index];
+        }
+
+        int getMaxBits(int n, const int *idata) {
+            // find the maximum number in the array
+            int maxNum = 0;
+            for (int i = 0; i < n; i++) {
+                maxNum = std::max(maxNum, idata[i]);
+            }
+            // count the number of bits in the maximum number
+            int maxBits = 0;
+            while (maxNum > 0) {
+                maxNum >>= 1;
+                maxBits++;
+            }
+            return maxBits;
+        }
+
+        /**
+         * Performs radix sort on idata, storing the result into odata.
+         * @param n the number of elements in idata
+         * @param odata output data
+         * @param idata input data
+         * @param maxBits the maximum number of bits of a given number in the array
+         */
+        void sort(int n, int *odata, const int *idata, int maxBits) {
+            int* dev_idata;
+            int* dev_odata;
+            int* dev_be;
+            int* dev_f;
+            int* dev_t;
+            int* dev_d;
+            int fullBlocksPerGrid = (n + blockSize - 1) / blockSize;
+
+            cudaMalloc((void**)&dev_idata, n * sizeof(int));
+            checkCUDAError("cudaMalloc dev_idata failed!");
+            cudaMalloc((void**)&dev_odata, n * sizeof(int));
+            checkCUDAError("cudaMalloc dev_odata failed!");
+            cudaMalloc((void**)&dev_be, n * sizeof(int));
+            checkCUDAError("cudaMalloc dev_be failed!");
+            cudaMalloc((void**)&dev_f, n * sizeof(int));
+            checkCUDAError("cudaMalloc dev_f failed!");
+            cudaMalloc((void**)&dev_t, n * sizeof(int));
+            checkCUDAError("cudaMalloc dev_t failed!");
+            cudaMalloc((void**)&dev_d, n * sizeof(int));
+            checkCUDAError("cudaMalloc dev_d failed!");
+
+            // copy the input to GPU (size n data)
+            cudaMemcpy(dev_idata, idata, n * sizeof(int),
cudaMemcpyHostToDevice);
+
+            timer().startGpuTimer();
+            for (int i = 0; i < maxBits; i++) {
+                // b & e array (b can be acquired by flipping back)
+                kernMapToCurrBits<<<fullBlocksPerGrid, blockSize>>>(n, dev_be, dev_idata, i);
+                // exclusive scan f array
+                StreamCompaction::Efficient::scan(n, dev_f, dev_be);
+                // calculate totalFalses
+                int lastFVal = 0;
+                int lastEVal = 0;
+                cudaMemcpy(&lastEVal, dev_be + n - 1, sizeof(int), cudaMemcpyDeviceToHost);
+                cudaMemcpy(&lastFVal, dev_f + n - 1, sizeof(int), cudaMemcpyDeviceToHost);
+                int totalFalses = lastFVal + lastEVal;
+                // t array
+                kernComputeT<<<fullBlocksPerGrid, blockSize>>>(n, dev_t, dev_f, totalFalses);
+                // d array
+                kernComputeD<<<fullBlocksPerGrid, blockSize>>>(n, dev_d, dev_be, dev_t, dev_f);
+                // scatter
+                scatter<<<fullBlocksPerGrid, blockSize>>>(n, dev_odata, dev_idata, dev_d);
+                // swap
+                std::swap(dev_idata, dev_odata);
+            }
+            timer().endGpuTimer();
+
+            // copy the result to odata (size n data)
+            cudaMemcpy(odata, dev_idata, n * sizeof(int), cudaMemcpyDeviceToHost);
+
+            cudaFree(dev_idata);
+            cudaFree(dev_odata);
+            cudaFree(dev_be);
+            cudaFree(dev_f);
+            cudaFree(dev_t);
+            cudaFree(dev_d);
+        }
+    }
+}
\ No newline at end of file
diff --git a/stream_compaction/radix.h b/stream_compaction/radix.h
new file mode 100644
index 0000000..3942817
--- /dev/null
+++ b/stream_compaction/radix.h
@@ -0,0 +1,13 @@
+#pragma once
+
+#include "common.h"
+
+namespace StreamCompaction {
+    namespace Radix {
+        StreamCompaction::Common::PerformanceTimer& timer();
+
+        int getMaxBits(int n, const int *idata);
+
+        void sort(int n, int *odata, const int *idata, int maxBits);
+    }
+}
diff --git a/stream_compaction/thrust.cu b/stream_compaction/thrust.cu
index 1def45e..b74bb69 100644
--- a/stream_compaction/thrust.cu
+++ b/stream_compaction/thrust.cu
@@ -3,6 +3,7 @@
 #include
 #include
 #include
+#include
 #include "common.h"
 #include "thrust.h"
@@ -18,11 +19,21 @@ namespace StreamCompaction {
          * Performs prefix-sum (aka scan) on idata, storing the result into odata.
         */
        void scan(int n, int *odata, const int *idata) {
+            thrust::device_vector<int> dv_in(idata, idata + n);
+            thrust::device_vector<int> dv_out(n);
             timer().startGpuTimer();
-            // TODO use `thrust::exclusive_scan`
-            // example: for device_vectors dv_in and dv_out:
-            // thrust::exclusive_scan(dv_in.begin(), dv_in.end(), dv_out.begin());
+            thrust::exclusive_scan(dv_in.begin(), dv_in.end(), dv_out.begin());
             timer().endGpuTimer();
+            thrust::copy(dv_out.begin(), dv_out.end(), odata);
+        }
+
+        void sort(int n, int *odata, const int *idata) {
+            thrust::device_vector<int> dv_in(idata, idata + n);
+            timer().startGpuTimer();
+            thrust::sort(dv_in.begin(), dv_in.end());
+            timer().endGpuTimer();
+            thrust::copy(dv_in.begin(), dv_in.end(), odata);
         }
     }
 }
diff --git a/stream_compaction/thrust.h b/stream_compaction/thrust.h
index fe98206..e12c10f 100644
--- a/stream_compaction/thrust.h
+++ b/stream_compaction/thrust.h
@@ -7,5 +7,7 @@ namespace StreamCompaction {
         StreamCompaction::Common::PerformanceTimer& timer();
 
         void scan(int n, int *odata, const int *idata);
+
+        void sort(int n, int *odata, const int *idata);
     }
 }