diff --git a/README.md b/README.md
index 0e38ddb..b8d2b0b 100644
--- a/README.md
+++ b/README.md
@@ -3,12 +3,133 @@ CUDA Stream Compaction
 
 **University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 2**
 
-* (TODO) YOUR NAME HERE
-  * (TODO) [LinkedIn](), [personal website](), [twitter](), etc.
-* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab)
+* Shreyas Singh
+  * [LinkedIn](https://linkedin.com/in/shreyassinghiitr), [Personal Website](https://github.com/shreyas3156).
+* Tested on: Windows 10, i7-12700 @ 2.1GHz 32GB, T1000 (CETS Lab)
 
-### (TODO: Your README)
+### About
+This project implements the all-prefix-sums operation on an array of data, commonly known as
+_Scan_, followed by _Stream Compaction_, which creates a new array containing only those input elements that satisfy a given criterion, preserving their order in the process. We use _scan_ interchangeably with _exclusive scan_ throughout the project: each element _k_ of the result is the sum
+of all elements before, but not including, _k_ in the input array.
 
-Include analysis, etc. (Remember, this is public, so don't put
-anything here that you don't want to share with the world.)
+We present three implementations of Scan: a trivial sequential CPU scan, and "Naive" and "Work-Efficient" CUDA implementations of a parallel scan algorithm that leverage the GPU's data parallelism, yielding large performance improvements
+for large inputs.
+
+Our Stream Compaction algorithm removes `0`s from an array of `int`s, using the CPU scan and the Work-Efficient parallel scan in CUDA.
+For a detailed treatment of Scan and further reading, see [GPU Gems 3, Ch-39](https://developer.nvidia.com/gpugems/gpugems3/part-vi-gpu-computing/chapter-39-parallel-prefix-sum-scan-cuda).
+
+
+## Performance Analysis
+
+The algorithms were timed with respect to parameters such as `blockSize` and input length using the custom `timer()` API. Only the algorithms themselves were timed, excluding memory allocations and other setup,
+so the comparison reflects compute alone. Time units are milliseconds (ms).
+
+### Optimal Block Size
+The `blockSize` parameter was roughly optimized for both parallel algorithms at an array length of 2^20. I averaged the times for power-of-two (POT) and non-power-of-two (NPOT) inputs.
+The optimal block size was found to be 512 for the Naive parallel scan and 128 for the Work-Efficient scan.
+![](img/blocksizeopt.png)
+
+
+### Scan Implementations vs. Array Size
+We compare the various Scan implementations (Naive, Work-Efficient, and Thrust) against the CPU scan.
+Thrust is a C++ template library for CUDA, modeled on the Standard Template Library (STL), for writing
+high-performance parallel applications.
+
+Both GPU and CPU timing functions are wrapped in a performance timer class: `std::chrono` provides
+high-precision CPU timing, and CUDA events measure GPU performance.
+
+Power-of-two and non-power-of-two inputs were analyzed separately for better
+comparability of the CPU, GPU, and Thrust implementations.
+
+
+![](img/scan_pot.png)
+![](img/perfpot.png)
+
+Analysis for NPOT arrays follows:
+![](img/scan_npot.png)
+![](img/perfnpot.png)
+
+The CPU implementation is faster for shorter inputs, up to an array size of about
+2^13. Beyond that, the CPU scan is slightly slower than the Naive parallel scan but still faster than the Work-Efficient parallel scan
+up to an array size of 2^20. The Thrust scan is the fastest of the GPU implementations throughout,
+and the fastest overall beyond 2^13.
+
+The *Work-Efficient scan has been optimized for thread usage*: it launches only the
+required number of threads, so no idle threads have to wait on others to terminate. This
+is why, for large arrays, it is faster than every implementation except the Thrust scan.
+
+The major *performance bottleneck* for all the parallel algorithms is global-memory I/O. This could be improved
+by utilising shared memory, whose size can be set dynamically in relation to the block size. The overhead is amortized
+at larger array sizes, where the Work-Efficient and Thrust scans run faster than the CPU scan.
+
+The *Thrust* library's `exclusive_scan()` shows remarkably consistent performance as the input size increases up to
+2^17, after which the runtime increases marginally. A possible explanation is optimal memory management
+by Thrust, which avoids global-memory I/O overheads. The slower performance at larger array sizes could be due to
+memory allocation and copy latency between the host and the device.
+
+
+### Stream Compaction Performance
+
+Here we compare the stream compaction algorithms implemented on the CPU (with and without scan) and with the Work-Efficient algorithm.
+![](img/compact_perf.png)
+
+As one would expect from the discussion above, the CPU algorithms take much less time than their GPU
+counterparts for smaller array sizes (< 2^20). This trend flips at array sizes above 2^20, as the optimized
+thread usage of the Work-Efficient algorithm produces improved performance.
+```
+****************
+** SCAN TESTS **
+****************
+    [ 10 32 23 43 28 33 17 0 35 37 44 0 23 ... 33 0 ]
+==== cpu scan, power-of-two ====
+   elapsed time: 1.7358ms    (std::chrono Measured)
+    [ 0 10 42 65 108 136 169 186 186 221 258 302 302 ... 25672547 25672580 ]
+==== cpu scan, non-power-of-two ====
+   elapsed time: 1.7335ms    (std::chrono Measured)
+    [ 0 10 42 65 108 136 169 186 186 221 258 302 302 ... 25672508 25672513 ]
+    passed
+==== naive scan, power-of-two ====
+   elapsed time: 1.57238ms    (CUDA Measured)
+    passed
+==== naive scan, non-power-of-two ====
+   elapsed time: 1.51142ms    (CUDA Measured)
+    passed
+==== work-efficient scan, power-of-two ====
+   elapsed time: 1.42707ms    (CUDA Measured)
+    passed
+==== work-efficient scan, non-power-of-two ====
+   elapsed time: 1.36602ms    (CUDA Measured)
+    passed
+==== thrust scan, power-of-two ====
+   elapsed time: 0.596ms    (CUDA Measured)
+    passed
+==== thrust scan, non-power-of-two ====
+   elapsed time: 0.176128ms    (CUDA Measured)
+    passed
+
+*****************************
+** STREAM COMPACTION TESTS **
+*****************************
+    [ 3 1 1 2 2 0 1 3 1 1 0 0 0 ... 0 0 ]
+==== cpu compact without scan, power-of-two ====
+   elapsed time: 2.146ms    (std::chrono Measured)
+    [ 3 1 1 2 2 1 3 1 1 3 3 2 1 ... 3 3 ]
+    passed
+==== cpu compact without scan, non-power-of-two ====
+   elapsed time: 2.171ms    (std::chrono Measured)
+    [ 3 1 1 2 2 1 3 1 1 3 3 2 1 ... 1 3 ]
+    passed
+==== cpu compact with scan ====
+   elapsed time: 5.0626ms    (std::chrono Measured)
+    [ 3 1 1 2 2 1 3 1 1 3 3 2 1 ... 3 3 ]
+    passed
+==== work-efficient compact, power-of-two ====
+   elapsed time: 2.25693ms    (CUDA Measured)
+    passed
+==== work-efficient compact, non-power-of-two ====
+   elapsed time: 1.92234ms    (CUDA Measured)
+    passed
+Press any key to continue . . .
+```
diff --git a/img/blocksizeopt.png b/img/blocksizeopt.png
new file mode 100644
index 0000000..e93953e
Binary files /dev/null and b/img/blocksizeopt.png differ
diff --git a/img/compact_perf.png b/img/compact_perf.png
new file mode 100644
index 0000000..2795578
Binary files /dev/null and b/img/compact_perf.png differ
diff --git a/img/perfnpot.png b/img/perfnpot.png
new file mode 100644
index 0000000..e99835a
Binary files /dev/null and b/img/perfnpot.png differ
diff --git a/img/perfpot.png b/img/perfpot.png
new file mode 100644
index 0000000..574e4a7
Binary files /dev/null and b/img/perfpot.png differ
diff --git a/img/scan_npot.png b/img/scan_npot.png
new file mode 100644
index 0000000..4ad67b8
Binary files /dev/null and b/img/scan_npot.png differ
diff --git a/img/scan_pot.png b/img/scan_pot.png
new file mode 100644
index 0000000..f725648
Binary files /dev/null and b/img/scan_pot.png differ
diff --git a/src/main.cpp b/src/main.cpp
index 896ac2b..600c29f 100644
--- a/src/main.cpp
+++ b/src/main.cpp
@@ -13,7 +13,7 @@
 #include
 #include "testing_helpers.hpp"
 
-const int SIZE = 1 << 8; // feel free to change the size of array
+const int SIZE = 1 << 20; // feel free to change the size of array
 const int NPOT = SIZE - 3; // Non-Power-Of-Two
 int *a = new int[SIZE];
 int *b = new int[SIZE];
diff --git a/stream_compaction/common.cu b/stream_compaction/common.cu
index 2ed6d63..de8e11a 100644
--- a/stream_compaction/common.cu
+++ b/stream_compaction/common.cu
@@ -1,6 +1,6 @@
 #include "common.h"
 
-void checkCUDAErrorFn(const char *msg, const char *file, int line) {
+void checkCUDAErrorFn(const char* msg, const char* file, int line) {
     cudaError_t err = cudaGetLastError();
     if (cudaSuccess == err) {
         return;
@@ -22,17 +22,29 @@ namespace StreamCompaction {
         * Maps an array to an array of 0s and 1s for stream compaction. Elements
         * which map to 0 will be removed, and elements which map to 1 will be kept.
         */
-        __global__ void kernMapToBoolean(int n, int *bools, const int *idata) {
+        __global__ void kernMapToBoolean(int n, int* bools, const int* idata) {
             // TODO
+            int idx = blockDim.x * blockIdx.x + threadIdx.x;
+            if (idx >= n) {
+                return;
+            }
+            bools[idx] = (idata[idx]) ? 1 : 0;
         }
 
        /**
         * Performs scatter on an array. That is, for each element in idata,
         * if bools[idx] == 1, it copies idata[idx] to odata[indices[idx]].
         */
-        __global__ void kernScatter(int n, int *odata,
-                const int *idata, const int *bools, const int *indices) {
+        __global__ void kernScatter(int n, int* odata,
+                const int* idata, const int* bools, const int* indices) {
             // TODO
+            int idx = blockDim.x * blockIdx.x + threadIdx.x;
+            if (idx >= n) {
+                return;
+            }
+            if (bools[idx]) {
+                odata[indices[idx]] = idata[idx];
+            }
         }
    }
}
diff --git a/stream_compaction/cpu.cu b/stream_compaction/cpu.cu
index 719fa11..36ae628 100644
--- a/stream_compaction/cpu.cu
+++ b/stream_compaction/cpu.cu
@@ -19,7 +19,9 @@ namespace StreamCompaction {
         */
        void scan(int n, int *odata, const int *idata) {
            timer().startCpuTimer();
-            // TODO
+            odata[0] = 0;
+            for (int i = 1; i < n; i++) {
+                odata[i] = odata[i - 1] + idata[i - 1];
+            }
            timer().endCpuTimer();
        }
 
@@ -30,9 +32,15 @@ namespace StreamCompaction {
         */
        int compactWithoutScan(int n, int *odata, const int *idata) {
            timer().startCpuTimer();
-            // TODO
+            int j = 0;
+            for (int i = 0; i < n; i++) {
+                if (idata[i]) {
+                    odata[j] = idata[i];
+                    j++;
+                }
+            }
            timer().endCpuTimer();
-            return -1;
+            return j;
        }
 
@@ -42,9 +50,29 @@ namespace StreamCompaction {
         */
        int compactWithScan(int n, int *odata, const int *idata) {
            timer().startCpuTimer();
-            // TODO
+            // map the input array to an array of 0s and 1s
+            int* temp = new int[n];
+            for (int i = 0; i < n; i++) {
+                temp[i] = idata[i] == 0 ? 0 : 1;
+            }
+
+            // exclusive scan of the temp array
+            int* scanOutput = new int[n];
+            scanOutput[0] = 0;
+            for (int i = 1; i < n; i++) {
+                scanOutput[i] = scanOutput[i - 1] + temp[i - 1];
+            }
+
+            // the exclusive scan excludes the last element, so add it back in
+            int compactLen = scanOutput[n - 1] + temp[n - 1];
+
+            // scatter the kept elements to their scanned indices
+            for (int i = 0; i < n; i++) {
+                if (temp[i]) {
+                    odata[scanOutput[i]] = idata[i];
+                }
+            }
+            delete[] temp;
+            delete[] scanOutput;
            timer().endCpuTimer();
-            return -1;
+            return compactLen;
        }
    }
}
diff --git a/stream_compaction/efficient.cu b/stream_compaction/efficient.cu
index 2db346e..80f5502 100644
--- a/stream_compaction/efficient.cu
+++ b/stream_compaction/efficient.cu
@@ -12,13 +12,84 @@ namespace StreamCompaction {
            return timer;
        }
 
+        __global__ void kernUpSweep(int n, int* A, int offset) {
+            int idx = blockDim.x * blockIdx.x + threadIdx.x;
+
+            // Each thread handles one group of `offset` elements
+            if (idx >= n / offset) {
+                return;
+            }
+            idx *= offset;
+            A[idx + offset - 1] += A[idx + offset / 2 - 1];
+        }
+
+        __global__ void kernDownSweep(int n, int* A, int offset) {
+            int idx = blockDim.x * blockIdx.x + threadIdx.x;
+            if (idx >= n / offset) {
+                return;
+            }
+            idx *= offset;
+
+            int temp = A[idx + offset / 2 - 1];
+            A[idx + offset / 2 - 1] = A[idx + offset - 1];
+            A[idx + offset - 1] += temp;
+        }
+
        /**
         * Performs prefix-sum (aka scan) on idata, storing the result into odata.
         */
-        void scan(int n, int *odata, const int *idata) {
-            timer().startGpuTimer();
-            // TODO
-            timer().endGpuTimer();
+        void scan(int n, int* odata, const int* idata, bool timeFlag) {
+
+            unsigned int blockSize = 128;
+
+            // pad the array up to the next power of two
+            int padding = 1 << ilog2ceil(n);
+
+            int* dev_odata;
+            size_t arraySize = n * sizeof(int);
+            size_t paddedSize = padding * sizeof(int);
+            cudaMalloc((void**)&dev_odata, paddedSize);
+            checkCUDAError("cudaMalloc dev_odata failed!");
+
+            cudaMemcpy(dev_odata, idata, arraySize, cudaMemcpyHostToDevice);
+            checkCUDAError("cudaMemcpy idata to dev_odata failed!");
+
+            cudaMemset(dev_odata + n, 0, paddedSize - arraySize);
+            checkCUDAError("cudaMemset padding of dev_odata failed!");
+
+            int numThreads = padding;
+
+            if (timeFlag)
+                timer().startGpuTimer();
+            for (int i = 0; i < ilog2ceil(n); i++) {
+                int offset = 1 << (i + 1);
+                numThreads /= 2;
+
+                dim3 fullBlocksPerGrid((numThreads + blockSize - 1) / blockSize);
+                kernUpSweep<<<fullBlocksPerGrid, blockSize>>>(padding, dev_odata, offset);
+                cudaDeviceSynchronize();
+                checkCUDAError("kernUpSweep failed!");
+            }
+
+            // assign 0 to the root of the tree for the Down-Sweep
+            cudaMemset(dev_odata + padding - 1, 0, sizeof(int));
+            cudaDeviceSynchronize();
+            checkCUDAError("cudaMemset to dev_odata failed!");
+
+            for (int i = ilog2ceil(n) - 1; i >= 0; i--) {
+                int offset = 1 << (i + 1);
+                numThreads *= 2;
+                dim3 fullBlocksPerGrid((numThreads + blockSize - 1) / blockSize);
+                kernDownSweep<<<fullBlocksPerGrid, blockSize>>>(padding, dev_odata, offset);
+                checkCUDAError("kernDownSweep failed!");
+            }
+            if (timeFlag)
+                timer().endGpuTimer();
+
+            cudaMemcpy(odata, dev_odata, arraySize, cudaMemcpyDeviceToHost);
+            cudaDeviceSynchronize();
+            checkCUDAError("cudaMemcpy dev_odata to odata failed!");
+
+            cudaFree(dev_odata);
        }
 
        /**
@@ -30,11 +101,68 @@ namespace StreamCompaction {
         * @param idata The array of elements to compact.
         * @returns The number of elements remaining after compaction.
         */
-        int compact(int n, int *odata, const int *idata) {
+        int compact(int n, int* odata, const int* idata) {
+
+            // Create device arrays
+            int* dev_idata;
+            int* dev_odata;
+            int* dev_bools;
+            int* dev_indices;
+            int padding = 1 << ilog2ceil(n);
+            size_t arraySize = n * sizeof(int);
+            size_t paddedSize = padding * sizeof(int);
+
+            cudaMalloc((void**)&dev_idata, arraySize);
+            cudaDeviceSynchronize();
+            checkCUDAError("cudaMalloc dev_idata failed!");
+
+            cudaMalloc((void**)&dev_bools, paddedSize);
+            cudaDeviceSynchronize();
+            checkCUDAError("cudaMalloc dev_bools failed!");
+
+            cudaMalloc((void**)&dev_indices, paddedSize);
+            cudaMalloc((void**)&dev_odata, arraySize);
+            cudaDeviceSynchronize();
+            checkCUDAError("cudaMalloc failed!");
+
+            cudaMemcpy(dev_idata, idata, arraySize, cudaMemcpyHostToDevice);
+            checkCUDAError("cudaMemcpy idata failed!");
+
+            cudaMemset(dev_bools + n, 0, paddedSize - arraySize);
+            checkCUDAError("cudaMemset dev_bools failed!");
+
+            unsigned int blockSize = 128;
+            dim3 fullBlocksPerGrid((n + blockSize - 1) / blockSize);
+
            timer().startGpuTimer();
-            // TODO
+
+            StreamCompaction::Common::kernMapToBoolean<<<fullBlocksPerGrid, blockSize>>>(n, dev_bools, dev_idata);
+
+            scan(n, dev_indices, dev_bools, false);
+            StreamCompaction::Common::kernScatter<<<fullBlocksPerGrid, blockSize>>>(n, dev_odata, dev_idata, dev_bools, dev_indices);
+            cudaDeviceSynchronize();
+            checkCUDAError("kernel calls failed!");
+
            timer().endGpuTimer();
-            return -1;
+
+            cudaMemcpy(odata, dev_odata, arraySize, cudaMemcpyDeviceToHost);
+            checkCUDAError("cudaMemcpy dev_odata to odata failed!");
+
+            // Check whether the last element of idata is kept, via dev_bools.
+            // If it is, its output index is compactLen - 1; otherwise
+            // dev_indices[n - 1] already equals compactLen.
+            int isLastElemValid;
+            cudaMemcpy(&isLastElemValid, dev_bools + n - 1, sizeof(int), cudaMemcpyDeviceToHost);
+
+            int lastElemIdx;
+            cudaMemcpy(&lastElemIdx, dev_indices + n - 1, sizeof(int), cudaMemcpyDeviceToHost);
+
+            int compactLen = (isLastElemValid) ? lastElemIdx + 1 : lastElemIdx;
+
+            cudaFree(dev_bools);
+            cudaFree(dev_indices);
+            cudaFree(dev_idata);
+            cudaFree(dev_odata);
+
+            return compactLen;
        }
    }
}
diff --git a/stream_compaction/efficient.h b/stream_compaction/efficient.h
index 803cb4f..4341042 100644
--- a/stream_compaction/efficient.h
+++ b/stream_compaction/efficient.h
@@ -6,7 +6,7 @@ namespace StreamCompaction {
    namespace Efficient {
        StreamCompaction::Common::PerformanceTimer& timer();
 
-        void scan(int n, int *odata, const int *idata);
+        void scan(int n, int* odata, const int* idata, bool timeFlag = true);
 
        int compact(int n, int *odata, const int *idata);
    }
diff --git a/stream_compaction/naive.cu b/stream_compaction/naive.cu
index 4308876..6163109 100644
--- a/stream_compaction/naive.cu
+++ b/stream_compaction/naive.cu
@@ -1,6 +1,6 @@
 #include <cuda.h>
 #include <cuda_runtime.h>
-#include "common.h"
+#include "common.h"
 #include "naive.h"
 
@@ -11,15 +11,61 @@ namespace StreamCompaction {
            static PerformanceTimer timer;
            return timer;
        }
-        // TODO: __global__
+
+        __global__ void kernNaiveScan(int n, int *B, int *A, int d) {
+            int idx = blockDim.x * blockIdx.x + threadIdx.x;
+
+            if (idx >= n) {
+                return;
+            }
+            int offset = 1 << (d - 1);
+
+            if (idx >= offset) {
+                B[idx] = A[idx - offset] + A[idx];
+            }
+            else {
+                B[idx] = A[idx];
+            }
+        }
 
        /**
         * Performs prefix-sum (aka scan) on idata, storing the result into odata.
         */
        void scan(int n, int *odata, const int *idata) {
+
+            unsigned int blockSize = 128;
+            dim3 fullBlocksPerGrid((n + blockSize - 1) / blockSize);
+
+            // define two device arrays for ping-ponging between levels
+            int* A;
+            int* B;
+
+            size_t arraySize = n * sizeof(int);
+            cudaMalloc((void**)&A, arraySize);
+            checkCUDAError("cudaMalloc A failed!");
+            cudaMalloc((void**)&B, arraySize);
+            checkCUDAError("cudaMalloc B failed!");
+
+            cudaMemcpy(A, idata, arraySize, cudaMemcpyHostToDevice);
+            checkCUDAError("cudaMemcpy idata to A failed!");
+
            timer().startGpuTimer();
-            // TODO
+
+            for (int i = 1; i <= ilog2ceil(n); i++) {
+                kernNaiveScan<<<fullBlocksPerGrid, blockSize>>>(n, B, A, i);
+
+                // swap the two arrays
+                std::swap(B, A);
+            }
+
            timer().endGpuTimer();
+
+            // shift the inclusive result right by one to make it exclusive;
+            // copy only n - 1 elements so we stay within odata's bounds
+            odata[0] = 0;
+            cudaMemcpy(odata + 1, A, (n - 1) * sizeof(int), cudaMemcpyDeviceToHost);
+            checkCUDAError("cudaMemcpy odata failed!");
+
+            cudaFree(A);
+            cudaFree(B);
        }
    }
}
diff --git a/stream_compaction/thrust.cu b/stream_compaction/thrust.cu
index 1def45e..58f5d8c 100644
--- a/stream_compaction/thrust.cu
+++ b/stream_compaction/thrust.cu
@@ -18,11 +18,22 @@ namespace StreamCompaction {
         * Performs prefix-sum (aka scan) on idata, storing the result into odata.
         */
        void scan(int n, int *odata, const int *idata) {
+
+            thrust::host_vector<int> h_idata(idata, idata + n);
+
+            thrust::device_vector<int> dv_in = h_idata;
+            thrust::device_vector<int> dv_out(n);
+
+            cudaDeviceSynchronize();
            timer().startGpuTimer();
            // TODO use `thrust::exclusive_scan`
            // example: for device_vectors dv_in and dv_out:
            // thrust::exclusive_scan(dv_in.begin(), dv_in.end(), dv_out.begin());
+            thrust::exclusive_scan(dv_in.begin(), dv_in.end(), dv_out.begin());
            timer().endGpuTimer();
+
+            thrust::copy(dv_out.begin(), dv_out.end(), odata);
        }
    }
}