diff --git a/README.md b/README.md index 0e38ddb..ee30f79 100644 --- a/README.md +++ b/README.md @@ -3,12 +3,97 @@ CUDA Stream Compaction **University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 2** -* (TODO) YOUR NAME HERE - * (TODO) [LinkedIn](), [personal website](), [twitter](), etc. -* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab) +* Yian Chen + * [LinkedIn](https://www.linkedin.com/in/yian-chen-33a31a1a8/), [personal website](https://sydianandrewchen.github.io/), etc. +* Tested on: Windows 10, AMD Ryzen 5800HS with Radeon Graphics CPU @ 3.20GHz 16GB, NVIDIA GeForce RTX 3060 Laptop 8GB -### (TODO: Your README) +### Questions -Include analysis, etc. (Remember, this is public, so don't put -anything here that you don't want to share with the world.) +* Compare all of these GPU Scan implementations (Naive, Work-Efficient, and + Thrust) to the serial CPU version of Scan. Plot a graph of the comparison + (with array size on the independent axis). + ![](img/scan-power-of-two.png) + ![](img/scan-non-power-of-two.png) + * To guess at what might be happening inside the Thrust implementation (e.g. + allocation, memory copy), take a look at the Nsight timeline for its + execution. Your analysis here doesn't have to be detailed, since you aren't + even looking at the code for the implementation. + - Timeline of `thrust::exclusive_scan`: + ![](img/image.png) + From the timeline we can see that three scan kernels are launched together and finish together. + My guess is that `thrust::exclusive_scan` implements an algorithm that runs the up-sweep, down-sweep, and shift phases together. + +* Write a brief explanation of the phenomena you see here. + - As the data size grows exponentially, the runtimes of the CPU scan and the GPU naive scan increase the fastest. + - The CPU scan can even beat the GPU naive scan when the data size is large. 
This is probably because the CPU scan fully exploits locality, whereas the GPU naive scan incurs a large number of global memory accesses. + - The runtimes of the GPU efficient scan and the `thrust` scan grow much more slowly. `thrust` performs the best. +* Can you find the performance bottlenecks? Is it memory I/O? Computation? Is it different for each implementation? + The performance bottlenecks differ with the design of each implementation. + - For the CPU method, the bottleneck is the lack of parallelism compared with the GPU methods. + - For Naive Scan, the bottlenecks appear because of: + - Double buffering. On the positive side, this trick saves a large amount of time copying memory. However, the two large buffers cause a steady stream of global memory accesses in every thread. + - Time complexity. + - For Efficient Scan: + - Redundant thread launches (settled). + - Shared memory. + + +* Paste the output of the test program into a triple-backtick block in your + README. + - Output of the scan (`SIZE = 1 << 26`) +``` +**************** +** SCAN TESTS ** +**************** + [ 6 12 22 4 21 5 23 32 32 36 46 37 34 ... 47 0 ] +==== cpu scan, power-of-two ==== + elapsed time: 48.9518ms (std::chrono Measured) + [ 0 6 18 40 44 65 70 93 125 157 193 239 276 ... 1643792958 1643793005 ] +==== cpu scan, non-power-of-two ==== + elapsed time: 45.263ms (std::chrono Measured) + [ 0 6 18 40 44 65 70 93 125 157 193 239 276 ... 1643792920 1643792939 ] + passed +==== naive scan, power-of-two ==== + elapsed time: 63.2586ms (CUDA Measured) + [ 0 6 18 40 44 65 70 93 125 157 193 239 276 ... 
1643792958 1643793005 ] + passed +==== naive scan, non-power-of-two ==== + elapsed time: 62.9893ms (CUDA Measured) + passed +==== work-efficient scan, power-of-two ==== + elapsed time: 24.2463ms (CUDA Measured) + passed +==== work-efficient scan, non-power-of-two ==== + elapsed time: 22.6304ms (CUDA Measured) + passed +==== thrust scan, power-of-two ==== + elapsed time: 2.46502ms (CUDA Measured) + passed +==== thrust scan, non-power-of-two ==== + elapsed time: 2.53338ms (CUDA Measured) + passed + +***************************** +** STREAM COMPACTION TESTS ** +***************************** + [ 1 0 1 1 3 0 1 0 1 0 0 3 1 ... 2 0 ] +==== cpu compact without scan, power-of-two ==== + elapsed time: 120.001ms (std::chrono Measured) + [ 1 1 1 3 1 1 3 1 3 3 2 3 3 ... 3 2 ] + passed +==== cpu compact without scan, non-power-of-two ==== + elapsed time: 123.921ms (std::chrono Measured) + [ 1 1 1 3 1 1 3 1 3 3 2 3 3 ... 3 3 ] + passed +==== cpu compact with scan ==== + elapsed time: 219.142ms (std::chrono Measured) + [ 1 1 1 3 1 1 3 1 3 3 2 3 3 ... 
3 2 ] + passed +==== work-efficient compact, power-of-two ==== + elapsed time: 22.6714ms (CUDA Measured) + passed +==== work-efficient compact, non-power-of-two ==== + elapsed time: 22.6632ms (CUDA Measured) + passed +``` \ No newline at end of file diff --git a/img/image.png b/img/image.png new file mode 100644 index 0000000..14b4dcd Binary files /dev/null and b/img/image.png differ diff --git a/img/scan-non-power-of-two.png b/img/scan-non-power-of-two.png new file mode 100644 index 0000000..9c4f042 Binary files /dev/null and b/img/scan-non-power-of-two.png differ diff --git a/img/scan-power-of-two.png b/img/scan-power-of-two.png new file mode 100644 index 0000000..730551a Binary files /dev/null and b/img/scan-power-of-two.png differ diff --git a/src/main.cpp b/src/main.cpp index 896ac2b..3203388 100644 --- a/src/main.cpp +++ b/src/main.cpp @@ -13,13 +13,13 @@ #include #include "testing_helpers.hpp" -const int SIZE = 1 << 8; // feel free to change the size of array +const int SIZE = 1 << 26; // feel free to change the size of array const int NPOT = SIZE - 3; // Non-Power-Of-Two int *a = new int[SIZE]; int *b = new int[SIZE]; int *c = new int[SIZE]; -int main(int argc, char* argv[]) { +void scanTest() { // Scan tests printf("\n"); @@ -51,7 +51,7 @@ int main(int argc, char* argv[]) { printDesc("naive scan, power-of-two"); StreamCompaction::Naive::scan(SIZE, c, a); printElapsedTime(StreamCompaction::Naive::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); - //printArray(SIZE, c, true); + printArray(SIZE, c, true); printCmpResult(SIZE, b, c); /* For bug-finding only: Array of 1s to help find bugs in stream compaction or scan @@ -94,6 +94,9 @@ int main(int argc, char* argv[]) { printElapsedTime(StreamCompaction::Thrust::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); //printArray(NPOT, c, true); printCmpResult(NPOT, b, c); +} + +void compactionTest() { printf("\n"); printf("*****************************\n"); @@ -146,8 +149,16 @@ 
int main(int argc, char* argv[]) { printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); //printArray(count, c, true); printCmpLenResult(count, expectedNPOT, b, c); +} +void test() { + scanTest(); + compactionTest(); system("pause"); // stop Win32 console from closing on exit +} + +int main(int argc, char* argv[]) { + test(); delete[] a; delete[] b; delete[] c; diff --git a/stream_compaction/common.cu b/stream_compaction/common.cu index 2ed6d63..1bf6708 100644 --- a/stream_compaction/common.cu +++ b/stream_compaction/common.cu @@ -24,6 +24,10 @@ namespace StreamCompaction { */ __global__ void kernMapToBoolean(int n, int *bools, const int *idata) { // TODO + int index = (blockDim.x * blockIdx.x) + threadIdx.x; + if (index < n) { + bools[index] = bool(idata[index]); + } } /** @@ -33,6 +37,12 @@ namespace StreamCompaction { __global__ void kernScatter(int n, int *odata, const int *idata, const int *bools, const int *indices) { // TODO + int index = (blockDim.x * blockIdx.x) + threadIdx.x; + if (index < n) { + if (bools[index]) { + odata[indices[index]] = idata[index]; + } + } } } diff --git a/stream_compaction/common.h b/stream_compaction/common.h index d2c1fed..85b7665 100644 --- a/stream_compaction/common.h +++ b/stream_compaction/common.h @@ -12,7 +12,7 @@ #define FILENAME (strrchr(__FILE__, '/') ? strrchr(__FILE__, '/') + 1 : __FILE__) #define checkCUDAError(msg) checkCUDAErrorFn(msg, FILENAME, __LINE__) - +#define blockSize 512 /** * Check for CUDA errors; print and exit if there was a problem. 
*/ diff --git a/stream_compaction/cpu.cu b/stream_compaction/cpu.cu index 719fa11..f6f6cb4 100644 --- a/stream_compaction/cpu.cu +++ b/stream_compaction/cpu.cu @@ -12,6 +12,17 @@ namespace StreamCompaction { return timer; } + void scanCore(int n, int* odata, const int* idata) { + //for (int i = 0; i < n; ++i) { + // for (int j = 0; j < i /* exclusive prefix sum */; ++j) { + // odata[i] += idata[j]; + // } + //} + for (int i = 1; i < n; ++i) { + odata[i] = odata[i - 1] + idata[i-1]; + } + } + /** * CPU scan (prefix sum). * For performance analysis, this is supposed to be a simple for loop. @@ -20,6 +31,7 @@ namespace StreamCompaction { void scan(int n, int *odata, const int *idata) { timer().startCpuTimer(); // TODO + scanCore(n, odata, idata); timer().endCpuTimer(); } @@ -31,8 +43,12 @@ namespace StreamCompaction { int compactWithoutScan(int n, int *odata, const int *idata) { timer().startCpuTimer(); // TODO + int oPtr = 0; + for (int i = 0; i < n; ++i) { + if (idata[i]) odata[oPtr++] = idata[i]; + } timer().endCpuTimer(); - return -1; + return oPtr; } /** @@ -41,10 +57,21 @@ namespace StreamCompaction { * @returns the number of elements remaining after compaction. */ int compactWithScan(int n, int *odata, const int *idata) { + int* odata_tmp = new int[n]; timer().startCpuTimer(); - // TODO + + for (int i = 0; i < n; ++i) { + odata_tmp[i] = !(!idata[i]); + if (i) odata_tmp[i] += odata_tmp[i - 1]; + if (idata[i]) { + odata[odata_tmp[i] - 1] = idata[i]; + } + } + int oSize = odata_tmp[n-1]; timer().endCpuTimer(); - return -1; + delete [] odata_tmp; + + return oSize; } } } diff --git a/stream_compaction/efficient.cu b/stream_compaction/efficient.cu index 2db346e..20dc735 100644 --- a/stream_compaction/efficient.cu +++ b/stream_compaction/efficient.cu @@ -15,10 +15,75 @@ namespace StreamCompaction { /** * Performs prefix-sum (aka scan) on idata, storing the result into odata. 
*/ + + void scanCore(int n, int* dev_odata) { + + dim3 fullBlocksPerGrid((n + blockSize - 1) / blockSize); + // Reduce + int offset = 1; + for (int d = 0; d < ilog2(n); d++) { + int operation_number = n / (offset * 2); + dim3 blocksPerGrid((operation_number + (blockSize - 1)) / blockSize); + //printf("%d\n", blocksPerGrid.x); + //printf("Cut off unnecessary threads\n"); + if (blocksPerGrid.x == 1) { + kernUpSweep<<<1, operation_number>>>(n, offset, dev_odata); + } + else + kernUpSweep<<<blocksPerGrid, blockSize>>>(n, offset, dev_odata); + //kernUpSweep<<<fullBlocksPerGrid, blockSize>>>(n, offset, dev_odata); // 0.31504 for power of two + checkCUDAError("kernUpSweep failed"); + offset <<= 1; + } + + // Down sweep + for (int d = ilog2(n) - 1; d >= 0; d--) { + offset = (1 << d); + int operation_number = n / (offset * 2); + dim3 blocksPerGrid((operation_number + (blockSize - 1)) / blockSize); + //printf("%d\n", blocksPerGrid.x); + //printf("Cut off unnecessary threads\n"); + if (blocksPerGrid.x == 1) { + kernDownSweep<<<1, operation_number>>>(n, offset, dev_odata); + } + else + kernDownSweep<<<blocksPerGrid, blockSize>>>(n, offset, dev_odata); + //kernDownSweep<<<fullBlocksPerGrid, blockSize>>>(n, offset, dev_odata); + checkCUDAError("kernDownSweep failed"); + } + + } void scan(int n, int *odata, const int *idata) { + int padded_n = (1 << ilog2ceil(n)); + int* dev_odata; + cudaMalloc(&dev_odata, padded_n * sizeof(int)); + cudaMemcpy(dev_odata, idata, n * sizeof(int), cudaMemcpyKind::cudaMemcpyHostToDevice); timer().startGpuTimer(); - // TODO + scanCore(padded_n, dev_odata); timer().endGpuTimer(); + cudaMemcpy(odata, dev_odata, n * sizeof(int), cudaMemcpyKind::cudaMemcpyDeviceToHost); + cudaFree(dev_odata); + } + + __global__ void kernUpSweep(int n, int offset, int* odata1) { + int index = (blockDim.x * blockIdx.x) + threadIdx.x; + //printf("%d, %d, %d, %d\n", n, index, offset, index*offset*2); + int arrIndex = index * (offset * 2); + if (arrIndex < n) { + odata1[arrIndex + offset * 2 - 1] += odata1[arrIndex + offset - 1]; + odata1[n-1] = 0; // the root is only written, never read, during the up-sweep, so zeroing it here leaves it cleared for the down-sweep + } + } + + 
__global__ void kernDownSweep(int n, int offset, int* odata1) { + int index = (blockDim.x * blockIdx.x) + threadIdx.x; + //printf("%d, %d, %d, %d\n", n, index, offset, index*offset*2); + int arrIndex = index * (offset * 2); + if (arrIndex < n) { + int t = odata1[arrIndex + offset - 1]; + odata1[arrIndex + offset - 1] = odata1[arrIndex + offset * 2 - 1]; + odata1[arrIndex + offset * 2 - 1] += t; + } } /** @@ -31,10 +96,59 @@ namespace StreamCompaction { * @returns The number of elements remaining after compaction. */ int compact(int n, int *odata, const int *idata) { - timer().startGpuTimer(); // TODO + /* + bools, indices should only be allocated on device + odata and idata need to be copied to device + */ + int padded_n = (1 << ilog2ceil(n)); + + int* dev_bools; + /* TODO: Check if remaining part is also zero OR DOESN'T MATTER? */ + cudaMalloc(&dev_bools, padded_n * sizeof(int)); + checkCUDAError("cudaMalloc dev_bools failed"); + + int* dev_idata; + cudaMalloc(&dev_idata, n * sizeof(int)); + checkCUDAError("cudaMalloc dev_idata failed"); + + cudaMemcpy(dev_idata, idata, n * sizeof(int), cudaMemcpyKind::cudaMemcpyHostToDevice); + checkCUDAError("cudaMemcpy from idata to dev_idata failed"); + + int* dev_indices; + cudaMalloc(&dev_indices, padded_n * sizeof(int)); + checkCUDAError("cudaMalloc dev_indices failed"); + + int* dev_odata; + cudaMalloc(&dev_odata, n * sizeof(int)); + checkCUDAError("cudaMalloc dev_odata failed"); + + dim3 fullBlocksPerGrid((n + blockSize - 1) / blockSize); + StreamCompaction::Common::kernMapToBoolean<<<fullBlocksPerGrid, blockSize>>>(n, dev_bools, dev_idata); + checkCUDAError("kernMapToBoolean failed"); + + cudaMemcpy(dev_indices, dev_bools, padded_n * sizeof(int), cudaMemcpyKind::cudaMemcpyDeviceToDevice); + + timer().startGpuTimer(); + scanCore(padded_n, dev_indices); timer().endGpuTimer(); - return -1; + StreamCompaction::Common::kernScatter<<<fullBlocksPerGrid, blockSize>>>(n, dev_odata, dev_idata, dev_bools, dev_indices); + checkCUDAError("kernScatter failed"); + + /* Still got problem here! 
*/ + int length, last_element; + cudaMemcpy(&length, dev_indices + n - 1, sizeof(int), cudaMemcpyKind::cudaMemcpyDeviceToHost); + cudaMemcpy(&last_element, dev_bools + n - 1, sizeof(int), cudaMemcpyKind::cudaMemcpyDeviceToHost); + checkCUDAError("cudaMemcpy from dev_indices[n-1] to length failed"); + length += last_element; + cudaMemcpy(odata, dev_odata, n * sizeof(int), cudaMemcpyKind::cudaMemcpyDeviceToHost); + + cudaFree(dev_idata); + cudaFree(dev_odata); + cudaFree(dev_indices); + cudaFree(dev_bools); + + return length; } } } diff --git a/stream_compaction/efficient.h b/stream_compaction/efficient.h index 803cb4f..b3fc851 100644 --- a/stream_compaction/efficient.h +++ b/stream_compaction/efficient.h @@ -1,13 +1,15 @@ -#pragma once - -#include "common.h" - -namespace StreamCompaction { - namespace Efficient { - StreamCompaction::Common::PerformanceTimer& timer(); - - void scan(int n, int *odata, const int *idata); - - int compact(int n, int *odata, const int *idata); - } -} +#pragma once + +#include "common.h" + +namespace StreamCompaction { + namespace Efficient { + StreamCompaction::Common::PerformanceTimer& timer(); + void scanCore(int n, int* dev_odata); + void scan(int n, int *odata, const int *idata); + + int compact(int n, int *odata, const int *idata); + __global__ void kernUpSweep(int n, int offset, int* odata1); + __global__ void kernDownSweep(int n, int offset, int* odata1); + } +} \ No newline at end of file diff --git a/stream_compaction/naive.cu b/stream_compaction/naive.cu index 4308876..88779d1 100644 --- a/stream_compaction/naive.cu +++ b/stream_compaction/naive.cu @@ -17,9 +17,53 @@ namespace StreamCompaction { * Performs prefix-sum (aka scan) on idata, storing the result into odata. 
*/ void scan(int n, int *odata, const int *idata) { - timer().startGpuTimer(); // TODO + int* dev_odata1; + cudaMalloc(&dev_odata1, n*sizeof(int)); + checkCUDAError("cudaMalloc dev_odata1 failed"); + cudaMemcpy(dev_odata1, idata, n*sizeof(int), cudaMemcpyKind::cudaMemcpyHostToDevice); + checkCUDAError("cudaMemcpy from idata to dev_odata1 failed"); + int *dev_odata2; + cudaMalloc(&dev_odata2, n*sizeof(int)); + checkCUDAError("cudaMalloc dev_odata2 failed"); + cudaMemcpy(dev_odata2, idata, n*sizeof(int), cudaMemcpyKind::cudaMemcpyHostToDevice); + checkCUDAError("cudaMemcpy from idata to dev_odata2 failed"); + + timer().startGpuTimer(); + dim3 fullBlocksPerGrid((n + blockSize - 1) / blockSize + 1); + int d = ilog2ceil(n); + int offset = 1; + for (int k = 1; k <= d; ++k) { + //offset <<= 1; + kernScan<<<fullBlocksPerGrid, blockSize>>>(n, (1 << (k-1)), dev_odata1, dev_odata2); + std::swap(dev_odata1, dev_odata2); + } + checkCUDAError("kernScan failed"); + timer().endGpuTimer(); + cudaMemcpy(odata, dev_odata1, n*sizeof(int), cudaMemcpyKind::cudaMemcpyDeviceToHost); + for (int i = n - 1; i >= 0; --i) { + if (i == 0) odata[i] = 0; + else odata[i] = odata[i - 1]; + } + + checkCUDAError("cudaMemcpy from dev_odata1 to odata failed"); + + cudaFree(dev_odata1); + cudaFree(dev_odata2); + + } + + __global__ void kernScan(int n, int offset, int *odata1, int *odata2) { + int index = (blockDim.x * blockIdx.x) + threadIdx.x; + if (index >= n) return; + if (index >= offset) { /* no need to check if (index < n) */ + //odata2[index + offset] = odata1[index] + odata1[index + offset]; + odata2[index] = odata1[index] + odata1[index - offset]; + } + else { + odata2[index] = odata1[index]; + } } } } diff --git a/stream_compaction/naive.h b/stream_compaction/naive.h index 37dcb06..de8c6d7 100644 --- a/stream_compaction/naive.h +++ b/stream_compaction/naive.h @@ -7,5 +7,7 @@ namespace StreamCompaction { StreamCompaction::Common::PerformanceTimer& timer(); void scan(int n, int *odata, const int *idata); + __global__ 
void kernScan(int n, int offset, int* odata1, int* odata2); + } } diff --git a/stream_compaction/thrust.cu b/stream_compaction/thrust.cu index 1def45e..79c94a3 100644 --- a/stream_compaction/thrust.cu +++ b/stream_compaction/thrust.cu @@ -18,11 +18,30 @@ namespace StreamCompaction { * Performs prefix-sum (aka scan) on idata, storing the result into odata. */ void scan(int n, int *odata, const int *idata) { + int* dev_idata; + int* dev_odata; + + cudaMalloc(&dev_idata, n * sizeof(int)); + checkCUDAError("cudaMalloc dev_idata failed!"); + cudaMalloc(&dev_odata, n * sizeof(int)); + checkCUDAError("cudaMalloc dev_odata failed!"); + + cudaMemcpy(dev_idata, idata, n * sizeof(int), cudaMemcpyKind::cudaMemcpyHostToDevice); + + thrust::device_ptr<int> dev_in_ptr(dev_idata); + thrust::device_ptr<int> dev_out_ptr(dev_odata); + thrust::device_vector<int> dv_in(dev_in_ptr, dev_in_ptr + n); + thrust::device_vector<int> dv_out(dev_out_ptr, dev_out_ptr + n); + timer().startGpuTimer(); // TODO use `thrust::exclusive_scan` // example: for device_vectors dv_in and dv_out: // thrust::exclusive_scan(dv_in.begin(), dv_in.end(), dv_out.begin()); + thrust::exclusive_scan(dv_in.begin(), dv_in.end(), dv_out.begin()); timer().endGpuTimer(); + cudaMemcpy(odata, dv_out.data().get(), n * sizeof(int), cudaMemcpyKind::cudaMemcpyDeviceToHost); + cudaFree(dev_idata); + cudaFree(dev_odata); } } } diff --git a/tools/data_processing.py b/tools/data_processing.py new file mode 100644 index 0000000..e3b25bd --- /dev/null +++ b/tools/data_processing.py @@ -0,0 +1,90 @@ +import matplotlib.pyplot as plt + +N = [] +for i in range(18, 28, 2): + xlabel = "$2^{" + xlabel += str(i) + xlabel += '}$' + N.append(xlabel) +# N = [r"$2^{}$".format(i) for i in range(18, 28, 2)] + +# Scan timings (power-of-two sizes) +cpu = [0.1506, 0.5339, 2.3771, 13.9042, 56.4294] +gpu_naive = [0.106496, 0.754688, 3.23482, 14.9873, 63.2361] +gpu_efficient = [0.157696, 0.443392, 1.40288, 5.29101, 20.8722] +gpu_thrust = [0.160768, 0.18832, 0.319488, 
0.975808, 2.20467] +plt.xlabel("N count") +plt.ylabel("Time (ms)") +plt.plot(N, cpu, label="CPU", color="r", marker="o") +plt.plot(N, gpu_naive, label="Naive GPU", color="g", marker="o") +plt.plot(N, gpu_efficient, label="Efficient GPU", color="b", marker="o") +plt.plot(N, gpu_thrust, label="Thrust", color="y", marker="o") +plt.legend(labels=["CPU", "Naive GPU", "Efficient GPU", "Thrust"]) +plt.title("Time of scan using different methods (power-of-two)") +plt.savefig("../img/scan-power-of-two.png") +plt.show() +plt.clf() + +cpu = [0.1685, 0.6555, 3.2271, 17.4222, 44.574] +gpu_naive = [0.1024, 0.765952, 3.24301, 15.574, 74.4376] +gpu_efficient = [0.166912, 0.477184, 1.41312, 6.2505, 20.8456] +gpu_thrust = [0.166944, 0.191584, 0.401504, 0.777952, 2.18931] +plt.xlabel("N count") +plt.ylabel("Time (ms)") +plt.plot(N, cpu, label="CPU", color="r", marker="o") +plt.plot(N, gpu_naive, label="Naive GPU", color="g", marker="o") +plt.plot(N, gpu_efficient, label="Efficient GPU", color="b", marker="o") +plt.plot(N, gpu_thrust, label="Thrust", color="y", marker="o") +plt.legend(labels=["CPU", "Naive GPU", "Efficient GPU", "Thrust"]) +plt.title("Time of scan using different methods (non-power-of-two)") +plt.savefig("../img/scan-non-power-of-two.png") +plt.show() +plt.clf() + +exit() + +# Drawing increasing boids, no visuals +fps_brutalForce = [8297.48, 1227.13, 42.882, 0.46393, 0.0] +fps_scatteredGrid = [7666.14, 7279.71, 4348.93, 380.379, 18.6696] +fps_coherentGrid = [7383.36, 7034.57, 4743.67, 683.372, 57.848] + +plt.xlabel("N count") +plt.ylabel("FPS") +plt.plot(N, fps_brutalForce, label="brutalForce", color="r", marker="o") +plt.plot(N, fps_scatteredGrid, label="scatteredGrid", color="g", marker="o") +plt.plot(N, fps_coherentGrid, label="coherentGrid", color="b", marker="o") +plt.legend(labels=["brutalForce", "scatteredGrid", "coherentGrid"]) +plt.title("FPS of kernel under different methods (No visualization)") +# plt.show() +plt.savefig("../images/boids_no_visual.png") + 
+plt.clf() + +# Drawing 27-cell vs 8-cell +plt.xlabel("N count") +plt.ylabel("FPS") +fps_scatteredGrid27Neighbors = [7285.03, 6472.57, 3955.12, 353.385, 18.667] +fps_scatteredGrid8Neighbors = [7666.14, 7279.71, 4348.93, 380.379, 18.6696] +plt.plot(N, fps_scatteredGrid27Neighbors, label="27-cell", marker="o") +plt.plot(N, fps_scatteredGrid8Neighbors, label="8-cell", marker="o") +plt.legend(labels=["27-cell", "8-cell"]) +plt.title("FPS of 27-cell vs 8-cell (No visualization)") +# plt.show() +plt.savefig("../images/27vs8_no_visual.png") + +plt.clf() + + +# Drawing increasing blockSize + +blockSize = [r"$2^{" + str(i) + r"}$" for i in range(2, 11)] +fps_coherentGridBlockSizeChanged = [11.7512, 18.7417, 28.524, 41.9919, 57.0016, 62.4268, 64.3508, 61.0085, 59.5032] +plt.xlabel("Block size") +plt.ylabel("FPS") +plt.plot(blockSize, fps_coherentGridBlockSizeChanged, marker="o") +plt.title("FPS of coherent grid over increasing block size (With visualization)") +# plt.show() +plt.savefig("../images/increasing_gridSize.png") + +if __name__ == "__main__": + pass
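
As a sanity check on the logic recorded in this patch, the exclusive-scan recurrence used in `cpu.cu` (`odata[i] = odata[i-1] + idata[i-1]`) and the map-scan-scatter compaction scheme from `efficient.cu` can be sketched as a small host-side reference model. This is an illustrative sketch only, not part of the patch; the function names here are hypothetical:

```python
# Reference model for exclusive scan and scan-based stream compaction.
# Mirrors cpu.cu's scanCore and the kernMapToBoolean -> scanCore ->
# kernScatter pipeline in efficient.cu, on plain Python lists.

def exclusive_scan(idata):
    # odata[0] = 0; odata[i] = odata[i-1] + idata[i-1]
    odata = [0] * len(idata)
    for i in range(1, len(idata)):
        odata[i] = odata[i - 1] + idata[i - 1]
    return odata

def compact_with_scan(idata):
    bools = [1 if x else 0 for x in idata]    # kernMapToBoolean
    indices = exclusive_scan(bools)           # scanCore
    # compacted length = last index + last bool (as in compact())
    odata = [0] * (indices[-1] + bools[-1])
    for i, keep in enumerate(bools):          # kernScatter
        if keep:
            odata[indices[i]] = idata[i]
    return odata

print(exclusive_scan([6, 12, 22, 4]))         # [0, 6, 18, 40]
print(compact_with_scan([1, 0, 1, 1, 3, 0]))  # [1, 1, 1, 3]
```

The first printed result matches the prefix of the test output above (`[ 0 6 18 40 ... ]` for input `[ 6 12 22 4 ... ]`), which makes this a convenient cross-check when debugging the CUDA kernels.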