diff --git a/Project2-Character-Recognition/CMakeLists.txt b/Project2-Character-Recognition/CMakeLists.txt
index 09e9198..d1a3746 100644
--- a/Project2-Character-Recognition/CMakeLists.txt
+++ b/Project2-Character-Recognition/CMakeLists.txt
@@ -22,6 +22,7 @@ if(${CMAKE_SYSTEM_NAME} MATCHES "Darwin")
 endif()

 include_directories(.)
+link_directories(${CUDA_TOOLKIT_ROOT_DIR}/lib/x64)
 add_subdirectory(character_recognition)

 cuda_add_executable(${CMAKE_PROJECT_NAME}
@@ -32,4 +33,6 @@ cuda_add_executable(${CMAKE_PROJECT_NAME}
 target_link_libraries(${CMAKE_PROJECT_NAME}
     character_recognition
     ${CORELIBS}
+    cublas
+    curand
     )
diff --git a/Project2-Character-Recognition/README.md b/Project2-Character-Recognition/README.md
index 4503fac..7cf87bb 100644
--- a/Project2-Character-Recognition/README.md
+++ b/Project2-Character-Recognition/README.md
@@ -3,12 +3,126 @@ CUDA Character Recognition

 **University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 2**

-* (TODO) YOUR NAME HERE
-  * (TODO) [LinkedIn](), [personal website](), [twitter](), etc.
-* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab)
+* Dewang Sultania
+  * [LinkedIn](https://www.linkedin.com/in/dewang-sultania/)
+* Tested on: Windows 10, Intel Xeon E-2176M @ 2.70GHz 16GB, Quadro P2000 4GB (Personal Computer)
+* Dependencies: cuBLAS, cuRAND, OpenCV

-### (TODO: Your README)
+### Table of Contents
+1. [Overview](#overview)
+2. [Multilayer Perceptron](#mlp)
+3. [Architecture](#architecture)
+4. [Workflow Diagram](#workflow)
+5. [Performance Analysis](#performance)
+6. [Extending it to MNIST Dataset](#mnist)
+7. [Future Work](#future)

-Include analysis, etc. (Remember, this is public, so don't put
-anything here that you don't want to share with the world.)
+<a name="overview"></a>
+## Overview
+
+This repository contains code for a fully connected neural network in CUDA, including forward propagation, back propagation, and kernels for the activation functions. The network is trained to identify characters of the English alphabet, and was later extended to train on the MNIST dataset as well.
+
+<a name="mlp"></a>
+## Multilayer Perceptron
+
+Neural Network | Neuron
+:-------------------------:|:-------------------------:
+![](img/neural_network.JPG) | ![](img/neuron.JPG)
+
+Neural networks are modeled as collections of neurons connected in an acyclic graph: the outputs of some neurons become inputs to other neurons. Cycles are not allowed, since a cycle would imply an infinite loop in the forward pass of the network. Instead of amorphous blobs of connected neurons, neural network models are usually organized into distinct layers. For regular neural networks, the most common layer type is the fully-connected layer, in which neurons in two adjacent layers are fully pairwise connected but neurons within a single layer share no connections.
+
+A multilayer perceptron (MLP) is a class of feedforward artificial neural network. An MLP consists of at least three layers of nodes: an input layer, a hidden layer, and an output layer. Except for the input nodes, each node is a neuron that uses a nonlinear activation function. An MLP is trained with a supervised learning technique called backpropagation. Its multiple layers and non-linear activations distinguish an MLP from a linear perceptron: it can separate data that is not linearly separable.
+
+##### Terminology
+
+There are certain terms that are widely used when describing neural networks.
+I will try to give concise definitions of the terms used throughout this README below.
+
+- **Input Layer**: the inputs to the neural network; its dimensions are the same as those of the input data.
+
+- **Hidden Layer**: a layer between the input and output layers, where artificial neurons take in a set of weighted inputs and produce an output through an activation function.
+
+- **Output Layer**: the last layer of neurons; it produces the network's outputs.
+
+- **Activation Functions**: every activation function (or non-linearity) takes a single number and performs a fixed mathematical operation on it. The non-linearity is critical computationally: if we left it out, consecutive weight matrices could be collapsed into a single matrix, and the predicted class scores would again be a linear function of the input. The non-linearity is where we get the wiggle.
+
+You can learn more about them [here](http://cs231n.github.io).
+
+<a name="architecture"></a>
+## Architecture
+
+The network has a 196-dimensional input layer, 100 units in the first hidden layer, 50 in the second, 25 in the third, and finally 52 outputs for the 52 classes. The hidden layers use the ReLU activation and the final layer uses softmax, which turns the final-layer outputs into a probability of belonging to each class.
+
+There are certain advantages to using ReLU over its counterparts like sigmoid and tanh:
+
+- It was found to greatly accelerate (e.g. by a factor of 6 in [Krizhevsky et al.](http://www.cs.toronto.edu/~fritz/absps/imagenet.pdf)) the convergence of stochastic gradient descent compared to the sigmoid/tanh functions. It is argued that this is due to its linear, non-saturating form.
+- Compared to tanh/sigmoid neurons, which involve expensive operations (exponentials, etc.), the ReLU can be implemented by simply thresholding a matrix of activations at zero.
+
+**Weight Initialization**: For neural networks it is critical to start in the right place, because the loss function is not convex and gradient descent is therefore not guaranteed to find the optimal solution. Since the network uses the ReLU activation function, I used He-normal initialization for the starting weights. More details can be found [here](https://medium.com/@prateekvishnu/xavier-and-he-normal-he-et-al-initialization-8e3d7a087528).
+
+The forward and backward propagation equations are shown in the figure:
+ ![](img/for_back.JPG)
+
+The update equations are:
+$$
+w^{[l]} = w^{[l]} - \alpha\, dw^{[l]}\\
+b^{[l]} = b^{[l]} - \alpha\, db^{[l]}
+$$
+
+The matrix multiplication steps were done with the cuBLAS gemm API (cublasDgemm here, since the network uses doubles), which was not easy to get right for the transposed multiplications: the lda, ldb, and ldc parameters are poorly explained in the documentation and took a long time to figure out. A short sketch of the call convention is given below, just before the learning-rate figures.
+
+Alpha here is the learning rate. The learning rate is a hyper-parameter that controls how much we adjust the weights of the network with respect to the loss gradient. The lower the value, the slower we travel along the downward slope. While a low learning rate might be a good idea for making sure we do not miss any local minima, it can also mean a long time to converge, especially if we get stuck on a plateau region.
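+
+Since the column-major convention is the main stumbling block with gemm, here is a minimal sketch of the no-transpose case (my summary of the cuBLAS convention, not code from this project; `d_A`, `d_B`, `d_C` are hypothetical device pointers). For C(m,n) = A(m,k) * B(k,n), the leading dimensions are simply the stored row counts of A, B, and C:
+
+```cpp
+// Column-major C(m,n) = A(m,k) * B(k,n), no transposes:
+// lda = m (rows of A), ldb = k (rows of B), ldc = m (rows of C).
+const double one = 1.0, zero = 0.0;
+cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
+            m, n, k, &one, d_A, m, d_B, k, &zero, d_C, m);
+```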
+
+Optimal Learning Rate | Learning Rate affects
+:-------------------------:|:-------------------------:
+![](img/learning_rate.png) | ![](img/lr.png)
+
+Doing this over and over again for 1000 epochs over the whole training set reaches 100% accuracy.
+
+<a name="workflow"></a>
+## Workflow Diagram
+
+![](img/workflow.png)
+
+<a name="performance"></a>
+## Performance Analysis
+
+I trained the network for 100,000 iterations with a learning rate of 0.0005, randomly sampling from the given data. The loss curve for the network is:
+
+![](img/loss_charac.png)
+
+Accuracy: 100%
+
+<a name="mnist"></a>
+## Extending the network to the MNIST dataset (Extra Credit)
+
+The MNIST database of handwritten digits has a training set of 60,000 examples and a test set of 10,000 examples. It is a good database for people who want to try learning techniques and pattern recognition methods on real-world data while spending minimal effort on preprocessing and formatting.
+
+The way the code is structured made it easy to extend to other datasets that can be read as OpenCV Mat images, so I extended the network to train on MNIST. Only a few lines had to be added to main.cpp to read in the dataset; no changes were required in any other file. To toggle training between MNIST and character recognition, set the mnist flag on line 18 of main.cpp to 1 or 0.
+
+The loss curve for the MNIST dataset:
+
+![](img/loss_mnist.png)
+
+Training Accuracy: 96.48%
+Testing Accuracy: 95.65%
+
+<a name="future"></a>
+## Future Work
+
+Deep learning is highly sensitive to its many hyperparameters: architecture, learning rate, and so on. There are also many other things to try, such as learning-rate decay, optimizers like Adam and momentum, and regularization techniques like weight decay, lasso, batch norm, and dropout. The only way to know what will and won't work is to try them. I have played with all of these choices in PyTorch and TensorFlow, but they are out of scope for this project.
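+
+For reference, here is the end-to-end training loop, condensed from src/main.cpp in this repository (identifiers as in that file; data loading and logging omitted):
+
+```cpp
+// One forward/backward pass per iteration on a randomly sampled example.
+vector<int> layers = { 100, 50, 25 };                 // hidden layer sizes
+CharacterRecognition::NeuralNet nn(input_dim, classes, layers);
+for (int i = 0; i < 120000; i++) {
+    int j = rand() % images.size();                   // sample a training example
+    double *y_pred = nn.forward(images[j]);           // host copy of the softmax output
+    loss_epoch += nn.calculateLoss(y_pred, labels[j], classes);
+    nn.backward(labels[j]);                           // gradients + SGD update
+}
+```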
+
+#### References
+1. https://www.deciphertechnic.com/install-opencv-with-visual-studio/
+2. http://cs231n.github.io
+3. http://yann.lecun.com/exdb/mnist/
diff --git a/Project2-Character-Recognition/character_recognition/CMakeLists.txt b/Project2-Character-Recognition/character_recognition/CMakeLists.txt
index 7446175..9e834c1 100644
--- a/Project2-Character-Recognition/character_recognition/CMakeLists.txt
+++ b/Project2-Character-Recognition/character_recognition/CMakeLists.txt
@@ -7,5 +7,5 @@ set(SOURCE_FILES

 cuda_add_library(character_recognition
     ${SOURCE_FILES}
-    OPTIONS -arch=sm_20
+    OPTIONS -arch=sm_61
     )
diff --git a/Project2-Character-Recognition/character_recognition/common.h b/Project2-Character-Recognition/character_recognition/common.h
index 6aede64..eb1dca6 100644
--- a/Project2-Character-Recognition/character_recognition/common.h
+++ b/Project2-Character-Recognition/character_recognition/common.h
@@ -89,16 +89,16 @@ namespace Common {
             if (!gpu_timer_started) { throw std::runtime_error("GPU timer not started"); }

-            cudaEventElapsedTime(&prev_elapsed_time_gpu_milliseconds, event_start, event_end);
+            float elapsed_ms = 0.f;
+            cudaEventElapsedTime(&elapsed_ms, event_start, event_end);
+            prev_elapsed_time_gpu_milliseconds = elapsed_ms; // widen to double; casting a double* to float* would corrupt the stored value
             gpu_timer_started = false;
         }

-        float getCpuElapsedTimeForPreviousOperation() //noexcept //(damn I need VS 2015
+        double getCpuElapsedTimeForPreviousOperation() //noexcept //(damn I need VS 2015
         {
             return prev_elapsed_time_cpu_milliseconds;
         }

-        float getGpuElapsedTimeForPreviousOperation() //noexcept
+        double getGpuElapsedTimeForPreviousOperation() //noexcept
         {
             return prev_elapsed_time_gpu_milliseconds;
         }
@@ -120,7 +120,7 @@ namespace Common {
         bool cpu_timer_started = false;
         bool gpu_timer_started = false;

-        float prev_elapsed_time_cpu_milliseconds = 0.f;
-        float prev_elapsed_time_gpu_milliseconds = 0.f;
+        double prev_elapsed_time_cpu_milliseconds = 0.0;
+        double prev_elapsed_time_gpu_milliseconds = 0.0;
     };
 }
diff --git a/Project2-Character-Recognition/character_recognition/mlp.cu b/Project2-Character-Recognition/character_recognition/mlp.cu
index 5a3ed7f..2dedb9a 100644
--- a/Project2-Character-Recognition/character_recognition/mlp.cu
+++ b/Project2-Character-Recognition/character_recognition/mlp.cu
@@ -2,7 +2,15 @@
 #include <cuda_runtime.h>
 #include "common.h"
 #include "mlp.h"
+#include <cublas_v2.h>
+#include <curand.h>
+#include <vector>
+#include <string>
+#include <iostream>
+
+#define blockSize 128

+cublasHandle_t handle;
 namespace CharacterRecognition {
     using Common::PerformanceTimer;
     PerformanceTimer& timer()
@@ -10,7 +18,366 @@ namespace CharacterRecognition {
         static PerformanceTimer timer;
         return timer;
     }
-
+
+    __global__ void kernAddVectors(int n, double* g, double* bias, double* result) {
+        int index = (blockIdx.x * blockDim.x) + threadIdx.x;
+        if (index >= n)
+            return;
+        result[index] = g[index] + bias[index];
+    }
+
+    __global__ void kernUpdateParameters(int n, double *input, double *grad, double alpha) {
+        int index = (blockIdx.x * blockDim.x) + threadIdx.x;
+        if (index >= n)
+            return;
+        input[index] = input[index] - alpha * grad[index];
+    }
+
+    __global__ void kernSubVectors(int n, double* y, double* yhat, double* result) {
+        int index = (blockIdx.x * blockDim.x) + threadIdx.x;
+        if (index >= n)
+            return;
+        result[index] = yhat[index] - y[index];
+    }
+
+    __global__ void kernInitBiasVectors(int n, double* b, double value) {
+        int index = (blockIdx.x * blockDim.x) + threadIdx.x;
+        if (index >= n)
+            return;
+        b[index] = value;
+    }
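+
+    // All of the element-wise kernels in this file follow the same pattern:
+    // one thread per element, an index guard against running past n, and a
+    // launch grid of (n + blockSize - 1) / blockSize blocks of blockSize threads.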
+
+    __global__ void kernUpSweep(int n, int d, double *itemp) {
+        int k = (blockIdx.x * blockDim.x) + threadIdx.x;
+        if (k > (n - 1)) {
+            return;
+        }
+        int power = 1 << (d + 1);
+        int power_2 = 1 << d;
+        if (k % power == 0 && k + power - 1 < n && k + power_2 - 1 < n)
+            itemp[k + power - 1] += itemp[k + power_2 - 1];
+    }
+
+    __global__ void kernSoftmaxActivation(int n, double *g, double *output, double exp_sum) {
+        int index = blockIdx.x * blockDim.x + threadIdx.x;
+        if (index >= n)
+            return;
+        output[index] = exp(g[index]) / exp_sum; // exp, not expf: these are doubles
+    }
+
+    __global__ void kernSoftmaxDerivative(int n, double *input, double *grad, double exp_sum) {
+        int index = blockIdx.x * blockDim.x + threadIdx.x;
+        if (index >= n)
+            return;
+        grad[index] = (exp_sum * exp(input[index]) - exp(input[index]) * exp(input[index])) / (exp_sum * exp_sum);
+    }
+
+    __global__ void kernReluActivationForward(int n, double* g, double* a) {
+        int index = blockIdx.x * blockDim.x + threadIdx.x;
+        if (index >= n)
+            return;
+        a[index] = fmax(g[index], 0.0);
+    }
+
+    __global__ void kernReluDerivative(int n, double* input, double* grad) {
+        int index = blockIdx.x * blockDim.x + threadIdx.x;
+        if (index >= n)
+            return;
+        grad[index] = (input[index] > 0) ? 1 : 0;
+    }
+
+    // Note: despite the name, this writes exp(g) into output rather than a plain copy.
+    __global__ void kernCopyVectors(int n, double *g, double *output) {
+        int index = blockIdx.x * blockDim.x + threadIdx.x;
+        if (index >= n)
+            return;
+        output[index] = exp(g[index]);
+    }
+
+    __global__ void kernDerivativeLoss(int n, double *y, double *y_pred, double *output) {
+        int index = blockIdx.x * blockDim.x + threadIdx.x;
+        if (index >= n) {
+            return;
+        }
+        output[index] = -y[index] / y_pred[index] + (1 - y[index]) / (1 - y_pred[index]);
+    }
+
+    __global__ void kernElementWiseMultiplication(int n, double *input1, double *input2, double *output) {
+        int index = blockIdx.x * blockDim.x + threadIdx.x;
+        if (index >= n) {
+            return;
+        }
+        output[index] = input1[index] * input2[index];
+    }
+
+    void random_init(double * A, int rows, int cols) {
+        curandGenerator_t prng;
+        curandCreateGenerator(&prng, CURAND_RNG_PSEUDO_DEFAULT);
+        curandSetPseudoRandomGeneratorSeed(prng, (unsigned long long) clock());
+        // He-style scale: normal with mean 0 and standard deviation 2.0/rows
+        curandGenerateNormalDouble(prng, A, rows * cols, 0, 2.0 / rows);
+    }
+
+    // C(m,n) = A(m,k) * B(k,n), column-major; lda/ldb/ldc are the leading dimensions
+    void mmul(const double* A, const double* B, double* C, const int m, const int k, const int n, int a_trans_flag, int b_trans_flag,
+        int lda, int ldb, int ldc) {
+        const double alf = 1;
+        const double bet = 0;
+        const double *alpha = &alf;
+        const double *beta = &bet;
+        if (a_trans_flag == 0 && b_trans_flag == 0)
+            cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k, alpha, A, lda, B, ldb, beta, C, ldc);
+        else if (a_trans_flag == 0 && b_trans_flag == 1)
+            cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_T, m, n, k, alpha, A, lda, B, ldb, beta, C, ldc);
+        else if (a_trans_flag == 1 && b_trans_flag == 0)
+            cublasDgemm(handle, CUBLAS_OP_T, CUBLAS_OP_N, m, n, k, alpha, A, lda, B, ldb, beta, C, ldc);
+    }
+
+    void printCuda(double *a1, int n, std::string name) {
+        double *print_a = new double[n];
+        std::cout << name.c_str() << std::endl;
+        std::cout << "{" << std::endl;
+        cudaMemcpy(print_a, a1, n * sizeof(double), cudaMemcpyDeviceToHost);
+        for (int i = 0; i < n; i++) {
+            std::cout << "\t" << print_a[i] << std::endl;
+        }
+        std::cout << "}" << std::endl;
+        delete[] print_a;
+    }
+
+    // Cross-entropy loss between the predicted and true distributions.
+    double NeuralNet::calculateLoss(double *y_pred, double *y, int classes) {
+        double loss = 0.0;
+        for (int i = 0; i < classes; i++) {
+            loss += y[i] * log2(y_pred[i]);
+        }
+        return -1.0 * loss;
+    }
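+
+    // Storage layout (summary): NeuralNet keeps one device pointer per layer in
+    // each vector below. w[i] is a layer_sizes[i] x layer_sizes[i-1] matrix,
+    // stored column-major so it can feed mmul directly; index 0 holds dummy
+    // allocations so that layer l always lives at index l.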
+    NeuralNet::NeuralNet(int input_size, int classes, vector<int> layers) {
+        layer_sizes.push_back(input_size);
+
+        // Set all layer sizes
+        for (int i = 0; i < layers.size(); i++)
+            layer_sizes.push_back(layers[i]);
+        layer_sizes.push_back(classes);
+        // Temporary pointers to be pushed.
+        double *z_t, *dz_t, *a_t, *da_t, *w_t, *dw_t, *b_t, *db_t, *ghat_t;
+        // Some dummy mallocs to be pushed for the 0th (input) layer.
+        // We treat a0 as the input layer.
+        cudaMalloc((void**)&z_t, sizeof(double));
+        checkCUDAError("Cuda Malloc for z failed.");
+        z.push_back(z_t);
+
+        cudaMalloc((void**)&dz_t, sizeof(double));
+        checkCUDAError("Cuda Malloc for dz failed.");
+        dz.push_back(dz_t);
+
+        cudaMalloc((void**)&a_t, layer_sizes[0] * 1 * sizeof(double));
+        checkCUDAError("Cuda Malloc for a failed.");
+        a.push_back(a_t);
+
+        cudaMalloc((void**)&da_t, layer_sizes[0] * 1 * sizeof(double));
+        checkCUDAError("Cuda Malloc for da failed.");
+        da.push_back(da_t);
+
+        cudaMalloc((void**)&w_t, sizeof(double));
+        checkCUDAError("Cuda Malloc for weights failed.");
+        w.push_back(w_t);
+
+        cudaMalloc((void**)&dw_t, sizeof(double));
+        checkCUDAError("Cuda Malloc for derivative of weights failed.");
+        dw.push_back(dw_t);
+
+        cudaMalloc((void**)&b_t, sizeof(double));
+        checkCUDAError("Cuda Malloc for bias failed.");
+        b.push_back(b_t);
+
+        cudaMalloc((void**)&db_t, sizeof(double));
+        checkCUDAError("Cuda Malloc for derivatives of bias failed.");
+        db.push_back(db_t);
+
+        cudaMalloc((void**)&ghat_t, sizeof(double));
+        checkCUDAError("Cuda Malloc for derivatives of activations failed.");
+        ghat.push_back(ghat_t);
+
+        // The following loop allocates all the weight, bias, a and z vectors and their gradients.
+        for (int i = 1; i < layer_sizes.size(); i++) {
+            cudaMalloc((void**)&z_t, layer_sizes[i] * 1 * sizeof(double));
+            checkCUDAError("Cuda Malloc for z failed.");
+            z.push_back(z_t);
+
+            cudaMalloc((void**)&dz_t, layer_sizes[i] * 1 * sizeof(double));
+            checkCUDAError("Cuda Malloc for dz failed.");
+            dz.push_back(dz_t);
+
+            cudaMalloc((void**)&a_t, layer_sizes[i] * 1 * sizeof(double));
+            checkCUDAError("Cuda Malloc for a failed.");
+            a.push_back(a_t);
+
+            cudaMalloc((void**)&da_t, layer_sizes[i] * 1 * sizeof(double));
+            checkCUDAError("Cuda Malloc for da failed.");
+            da.push_back(da_t);
+
+            cudaMalloc((void**)&w_t, layer_sizes[i] * layer_sizes[i - 1] * sizeof(double));
+            checkCUDAError("Cuda Malloc for weights failed.");
+            w.push_back(w_t);
+
+            cudaMalloc((void**)&dw_t, layer_sizes[i] * layer_sizes[i - 1] * sizeof(double));
+            checkCUDAError("Cuda Malloc for derivative of weights failed.");
+            dw.push_back(dw_t);
+
+            cudaMalloc((void**)&b_t, layer_sizes[i] * 1 * sizeof(double));
+            checkCUDAError("Cuda Malloc for bias failed.");
+            b.push_back(b_t);
+
+            cudaMalloc((void**)&db_t, layer_sizes[i] * 1 * sizeof(double));
+            checkCUDAError("Cuda Malloc for derivatives of bias failed.");
+            db.push_back(db_t);
+
+            cudaMalloc((void**)&ghat_t, layer_sizes[i] * 1 * sizeof(double));
+            checkCUDAError("Cuda Malloc for derivatives of activations failed");
+            ghat.push_back(ghat_t);
+        }
+
+        dim3 fullBlocksPerGrid;
+        // Initialize the weights from a normal distribution.
+        // He-normal-style initialization is used because of the ReLU activation function.
+        for (int i = 1; i < layer_sizes.size(); i++) {
+            random_init(w[i], layer_sizes[i], layer_sizes[i - 1]);
+        }
+        // The following loop initializes the bias to a small value.
+        // It invokes a kernel which fills the bias vector with the desired value.
+        for (int i = 1; i < layer_sizes.size(); i++) {
+            fullBlocksPerGrid = ((layer_sizes[i] + blockSize - 1) / blockSize);
+            kernInitBiasVectors <<<fullBlocksPerGrid, blockSize>>> (layer_sizes[i], b[i], 0.001);
+        }
+        // Create a cublas handle for matrix multiplication
+        cublasCreate(&handle);
+    }
+
+    double* NeuralNet::forward(double *input) {
+        // a^[0] will be the input
+        cudaMemcpy(a[0], input, layer_sizes[0] * sizeof(double), cudaMemcpyHostToDevice);
+        // The activation for every layer but the last is ReLU, so the steps are the same.
+        // The equations here are:
+        //   z[l] = w[l]a[l-1] + b[l]
+        //   a[l] = relu(z[l])
+        dim3 fullBlocksPerGrid;
+        int L = layer_sizes.size() - 1;
+        for (int i = 1; i < L; i++) {
+            // Matrix multiply w[l]a[l-1] and store the result in z[l]
+            mmul(w[i], a[i - 1], z[i], layer_sizes[i], layer_sizes[i - 1], 1, 0, 0, layer_sizes[i], layer_sizes[i - 1], layer_sizes[i]);
+            // Add the bias vector to it
+            fullBlocksPerGrid = ((layer_sizes[i] + blockSize - 1) / blockSize);
+            kernAddVectors <<<fullBlocksPerGrid, blockSize>>> (layer_sizes[i], b[i], z[i], z[i]);
+            // Apply the ReLU activation function
+            kernReluActivationForward <<<fullBlocksPerGrid, blockSize>>> (layer_sizes[i], z[i], a[i]);
+        }
+        // The softmax output of the final layer gives the probability of belonging to each class.
+        // First calculate z for the final layer
+        mmul(w[L], a[L - 1], z[L], layer_sizes[L], layer_sizes[L - 1], 1, 0, 0, layer_sizes[L], layer_sizes[L - 1], layer_sizes[L]);
+        fullBlocksPerGrid = ((layer_sizes[L] + blockSize - 1) / blockSize);
+        kernAddVectors <<<fullBlocksPerGrid, blockSize>>> (layer_sizes[L], b[L], z[L], z[L]);
+        // Then calculate sum(e^(z[L])).
+        // Done on the CPU because, as in the stream compaction code, the CPU is faster for small inputs.
+        double *y_pred = new double[layer_sizes[L]];
+        cudaMemcpy(y_pred, z[L], layer_sizes[L] * sizeof(double), cudaMemcpyDeviceToHost);
+        double exp_sum = 0;
+        for (int i = 0; i < layer_sizes[L]; i++) {
+            exp_sum += exp(y_pred[i]);
+        }
+        // Now apply the softmax activation
+        fullBlocksPerGrid = ((layer_sizes[L] + blockSize - 1) / blockSize);
+        kernSoftmaxActivation <<<fullBlocksPerGrid, blockSize>>> (layer_sizes[L], z[L], a[L], exp_sum);
+        cudaMemcpy(y_pred, a[L], layer_sizes[L] * sizeof(double), cudaMemcpyDeviceToHost);
+        return y_pred;
+    }
+
+    void NeuralNet::backward(double *y) {
+        int L = layer_sizes.size() - 1;
+        // We will first populate da[L] as the derivative of loss with respect to y_pred.
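+        // (For softmax with cross-entropy, the chain from da[L] to dz[L] collapses
+        // to dz[L] = y_pred - y; that is why kernSubVectors is used below and the
+        // explicit softmax-derivative path is left commented out.)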
+        double *y_cuda;
+        cudaMalloc((void**)&y_cuda, layer_sizes[L] * sizeof(double));
+        cudaMemcpy(y_cuda, y, layer_sizes[L] * sizeof(double), cudaMemcpyHostToDevice);
+        dim3 fullBlocksPerGrid;
+        fullBlocksPerGrid = ((layer_sizes[L] + blockSize - 1) / blockSize);
+        kernDerivativeLoss <<<fullBlocksPerGrid, blockSize>>> (layer_sizes[L], y_cuda, a[L], da[L]);
+        // The backpropagation equations are:
+        //   dz[l] = da[l] * g'[l](z[l])   (element-wise)
+        //   dw[l] = dz[l] a[l-1].T
+        //   db[l] = dz[l]
+        //   da[l-1] = W[l].T dz[l]
+        // Softmax derivative for the last layer (unused; see the note above):
+        /*double *sum_cp = new double[layer_sizes[L]];
+        cudaMemcpy(sum_cp, z[L], layer_sizes[L] * sizeof(double), cudaMemcpyDeviceToHost);
+        double exp_sum = 0;
+        for (int i = 0; i < layer_sizes[L]; i++) {
+            exp_sum += exp(sum_cp[i]);
+        }*/
+        //kernSoftmaxDerivative <<<fullBlocksPerGrid, blockSize>>> (layer_sizes[L], z[L], ghat[L], exp_sum);
+        //kernElementWiseMultiplication <<<fullBlocksPerGrid, blockSize>>> (layer_sizes[L], da[L], ghat[L], dz[L]);
+        kernSubVectors <<<fullBlocksPerGrid, blockSize>>> (layer_sizes[L], y_cuda, a[L], dz[L]);
+        // dw[l] = dz[l] a[l-1].T
+        mmul(dz[L], a[L - 1], dw[L], layer_sizes[L], 1, layer_sizes[L - 1], 0, 1, layer_sizes[L], layer_sizes[L - 1], layer_sizes[L]);
+        // db[l] = dz[l]
+        cudaMemcpy(db[L], dz[L], layer_sizes[L] * sizeof(double), cudaMemcpyDeviceToDevice);
+        // da[l-1] = W[l].T dz[l]
+        mmul(w[L], dz[L], da[L - 1], layer_sizes[L - 1], layer_sizes[L], 1, 1, 0, layer_sizes[L], layer_sizes[L], layer_sizes[L - 1]);
+        // Now for the ReLU layers
+        for (int i = L - 1; i > 0; i--) {
+            fullBlocksPerGrid = ((layer_sizes[i] + blockSize - 1) / blockSize); // resize the grid for this layer
+            kernReluDerivative <<<fullBlocksPerGrid, blockSize>>> (layer_sizes[i], z[i], ghat[i]);
+            kernElementWiseMultiplication <<<fullBlocksPerGrid, blockSize>>> (layer_sizes[i], da[i], ghat[i], dz[i]);
+            mmul(dz[i], a[i - 1], dw[i], layer_sizes[i], 1, layer_sizes[i - 1], 0, 1, layer_sizes[i], layer_sizes[i - 1], layer_sizes[i]);
+            cudaMemcpy(db[i], dz[i], layer_sizes[i] * sizeof(double), cudaMemcpyDeviceToDevice);
+            mmul(w[i], dz[i], da[i - 1], layer_sizes[i - 1], layer_sizes[i], 1, 1, 0, layer_sizes[i], layer_sizes[i], layer_sizes[i - 1]);
+        }
+        // Now update the weights and biases
+        //printCuda(dw[1], layer_sizes[1] * layer_sizes[0], "dw1");
+        //printCuda(w[1], layer_sizes[1] * layer_sizes[0], "w1");
+        //printCuda(w[2], layer_sizes[2] * layer_sizes[1], "W2");
+        //printCuda(w[3], layer_sizes[3] * layer_sizes[2], "W3");
+        for (int i = 1; i <= L; i++) {
+            fullBlocksPerGrid = ((layer_sizes[i] * layer_sizes[i - 1] + blockSize - 1) / blockSize);
+            kernUpdateParameters <<<fullBlocksPerGrid, blockSize>>> (layer_sizes[i] * layer_sizes[i - 1], w[i], dw[i], 0.0005);
+            fullBlocksPerGrid = ((layer_sizes[i] + blockSize - 1) / blockSize);
+            kernUpdateParameters <<<fullBlocksPerGrid, blockSize>>> (layer_sizes[i], b[i], db[i], 0.0005);
+        }
+        // Avoid memory leaks
+        cudaFree(y_cuda);
+    }
+
+    NeuralNet::~NeuralNet() {
+        // The destructor frees all the device allocations.
+        for (auto x : w)
+            cudaFree(x);
+        for (auto x : dw)
+            cudaFree(x);
+        for (auto x : b)
+            cudaFree(x);
+        for (auto x : db)
+            cudaFree(x);
+        for (auto x : z)
+            cudaFree(x);
+        for (auto x : dz)
+            cudaFree(x);
+        for (auto x : a)
+            cudaFree(x);
+        for (auto x : da)
+            cudaFree(x);
+        for (auto x : ghat)
+            cudaFree(x);
+
+        cublasDestroy(handle);
+    }

     // TODO: __global__

     /**
diff --git a/Project2-Character-Recognition/character_recognition/mlp.h b/Project2-Character-Recognition/character_recognition/mlp.h
index 2096228..0f5c5c6 100644
--- a/Project2-Character-Recognition/character_recognition/mlp.h
+++ b/Project2-Character-Recognition/character_recognition/mlp.h
@@ -1,9 +1,47 @@
 #pragma once

 #include "common.h"
+#include <vector>
+#include <string>
+using namespace std;

 namespace CharacterRecognition {
     Common::PerformanceTimer& timer();
+    class NeuralNet {
+        // The weight matrices
+        vector<double*> w;
+        // The bias vectors
+        vector<double*> b;
+        // z^[l] = W^[l]a^[l-1] + b^[l]
+        vector<double*> z;
+        // a^[l] = g^[l](z^[l]); a^[0] is the input
+        vector<double*> a;
+        // The derivatives of the weight matrices
+        vector<double*> dw;
+        // The derivatives of the bias vectors
+        vector<double*> db;
+        // The derivatives of z
+        vector<double*> dz;
+        // The derivatives of a
+        vector<double*> da;
+        // The layer sizes
+        vector<int> layer_sizes;
+        // The activation-derivative vectors
+        vector<double*> ghat;
+    public:
+        // The constructor takes the input size and the number of output classes
+        NeuralNet(int input_size, int classes, vector<int> layers);
+        // forward() does the forward propagation
+        double* forward(double *input);
+        // backward() calculates the gradients and applies the update formula
+        //   w^[l] = w^[l] - alpha*dw^[l]
+        void backward(double* y);
+        // Calculates the loss given the predicted and actual values of y and returns it
+        double calculateLoss(double *ypred, double* y, int classes);
+        // The destructor frees the memory
+        ~NeuralNet();
+    };

-    // TODO: implement required elements for MLP sections 1 and 2 here
 }
diff --git a/Project2-Character-Recognition/data-set/t10k-images.idx3-ubyte b/Project2-Character-Recognition/data-set/t10k-images.idx3-ubyte
new file mode 100644
index 0000000..1170b2c
Binary files /dev/null and b/Project2-Character-Recognition/data-set/t10k-images.idx3-ubyte differ
diff --git a/Project2-Character-Recognition/data-set/t10k-labels.idx1-ubyte b/Project2-Character-Recognition/data-set/t10k-labels.idx1-ubyte
new file mode 100644
index 0000000..d1c3a97
Binary files /dev/null and b/Project2-Character-Recognition/data-set/t10k-labels.idx1-ubyte differ
diff --git a/Project2-Character-Recognition/data-set/train-images.idx3-ubyte b/Project2-Character-Recognition/data-set/train-images.idx3-ubyte
new file mode 100644
index 0000000..bbce276
Binary files /dev/null and b/Project2-Character-Recognition/data-set/train-images.idx3-ubyte differ
diff --git a/Project2-Character-Recognition/data-set/train-labels.idx1-ubyte b/Project2-Character-Recognition/data-set/train-labels.idx1-ubyte
new file mode 100644
index 0000000..d6b4c5d
Binary files /dev/null and b/Project2-Character-Recognition/data-set/train-labels.idx1-ubyte differ
diff --git a/Project2-Character-Recognition/img/accuracy.JPG b/Project2-Character-Recognition/img/accuracy.JPG
new file mode 100644
index 0000000..eb805bc
Binary files /dev/null and b/Project2-Character-Recognition/img/accuracy.JPG differ
diff --git a/Project2-Character-Recognition/img/for_back.JPG b/Project2-Character-Recognition/img/for_back.JPG
new file mode 100644
index 0000000..3f7a0c5
Binary files /dev/null and b/Project2-Character-Recognition/img/for_back.JPG differ
diff --git a/Project2-Character-Recognition/img/learning_rate.png b/Project2-Character-Recognition/img/learning_rate.png
new file mode 100644
index 0000000..7ea59e7
Binary files /dev/null and b/Project2-Character-Recognition/img/learning_rate.png differ
diff --git a/Project2-Character-Recognition/img/loss.JPG b/Project2-Character-Recognition/img/loss.JPG
new file mode 100644
index 0000000..bfe31ce
Binary files /dev/null and b/Project2-Character-Recognition/img/loss.JPG differ
diff --git a/Project2-Character-Recognition/img/loss_charac.png b/Project2-Character-Recognition/img/loss_charac.png
new file mode 100644
index 0000000..20e3727
Binary files /dev/null and b/Project2-Character-Recognition/img/loss_charac.png differ
diff --git a/Project2-Character-Recognition/img/loss_mnist.png b/Project2-Character-Recognition/img/loss_mnist.png
new file mode 100644
index 0000000..bf5206c
Binary files /dev/null and b/Project2-Character-Recognition/img/loss_mnist.png differ
diff --git a/Project2-Character-Recognition/img/lr.png b/Project2-Character-Recognition/img/lr.png
new file mode 100644
index 0000000..8700e34
Binary files /dev/null and b/Project2-Character-Recognition/img/lr.png differ
diff --git a/Project2-Character-Recognition/img/neural_network.JPG b/Project2-Character-Recognition/img/neural_network.JPG
new file mode 100644
index 0000000..e74cdd4
Binary files /dev/null and b/Project2-Character-Recognition/img/neural_network.JPG differ
diff --git a/Project2-Character-Recognition/img/neuron.JPG b/Project2-Character-Recognition/img/neuron.JPG
new file mode 100644
index 0000000..0e816f7
Binary files /dev/null and b/Project2-Character-Recognition/img/neuron.JPG differ
diff --git a/Project2-Character-Recognition/img/relu.png b/Project2-Character-Recognition/img/relu.png
new file mode 100644
index 0000000..cd23337
Binary files /dev/null and b/Project2-Character-Recognition/img/relu.png differ
diff --git a/Project2-Character-Recognition/img/workflow.png b/Project2-Character-Recognition/img/workflow.png
new file mode 100644
index 0000000..4006176
Binary files /dev/null and b/Project2-Character-Recognition/img/workflow.png differ
diff --git a/Project2-Character-Recognition/src/main.cpp b/Project2-Character-Recognition/src/main.cpp
index 11dd534..df53681 100644
--- a/Project2-Character-Recognition/src/main.cpp
+++ b/Project2-Character-Recognition/src/main.cpp
@@ -1,152 +1,230 @@
-/**
- * @file      main.cpp
- * @brief     Stream compaction test program
- * @authors   Kai Ninomiya
- * @date      2015
- * @copyright University of Pennsylvania
- */
-
 #include <cstdio>
 #include <character_recognition/mlp.h>
 #include "testing_helpers.hpp"
+#include "cublas_v2.h"
+#include <windows.h>
+#include <opencv2/opencv.hpp>
+#include <iostream>
+#include <fstream>
+#include <vector>
+#include <string>
+#include <algorithm>
+#include <cstdlib>
+#include <ctime>
+using namespace std;
+
+#define mnist 0
+
+std::vector<std::string> read_directory(const std::string& name)
+{
+    std::vector<std::string> v;
+    std::string pattern(name);
+    pattern.append("\\*");
+    WIN32_FIND_DATA data;
+    HANDLE hFind;
+    if ((hFind = FindFirstFile(pattern.c_str(), &data)) != INVALID_HANDLE_VALUE) {
+        do {
+            if (string(data.cFileName).find("bmp") != std::string::npos)
+                v.push_back(name + data.cFileName);
+        } while (FindNextFile(hFind, &data) != 0);
+        FindClose(hFind);
+    }
+    return v;
+}
+
+// http://eric-yuan.me/cpp-read-mnist/
+int ReverseInt(int i)
+{
+    unsigned char ch1, ch2, ch3, ch4;
+    ch1 = i & 255;
+    ch2 = (i >> 8) & 255;
+    ch3 = (i >> 16) & 255;
+    ch4 = (i >> 24) & 255;
+    return ((int)ch1 << 24) + ((int)ch2 << 16) + ((int)ch3 << 8) + ch4;
+}
+
+// http://eric-yuan.me/cpp-read-mnist/
+void read_Mnist_Label(string filename, vector<double> &vec)
+{
+    ifstream file(filename, ios::binary);
+    if (file.is_open()) {
+        int magic_number = 0;
+        int number_of_images = 0;
+        int n_rows = 0;
+        int n_cols = 0;
+        file.read((char*)&magic_number, sizeof(magic_number));
+        magic_number = ReverseInt(magic_number);
+        file.read((char*)&number_of_images, sizeof(number_of_images));
+        number_of_images = ReverseInt(number_of_images);
+        for (int i = 0; i < number_of_images; ++i)
+        {
+            unsigned char temp = 0;
+            file.read((char*)&temp, sizeof(temp));
+            vec[i] = (double)temp;
+        }
+    }
+}
+
+// http://eric-yuan.me/cpp-read-mnist/
+void read_Mnist(string filename, vector<cv::Mat> &vec) {
+    ifstream file(filename, ios::binary);
+    if (file.is_open())
+    {
+        int magic_number = 0;
+        int number_of_images = 0;
+        int n_rows = 0;
+        int n_cols = 0;
+        file.read((char*)&magic_number, sizeof(magic_number));
+        magic_number = ReverseInt(magic_number);
+        file.read((char*)&number_of_images, sizeof(number_of_images));
+        number_of_images = ReverseInt(number_of_images);
+        file.read((char*)&n_rows, sizeof(n_rows));
+        n_rows = ReverseInt(n_rows);
+        file.read((char*)&n_cols, sizeof(n_cols));
+        n_cols = ReverseInt(n_cols);
+        for (int i = 0; i < number_of_images; ++i)
+        {
+            cv::Mat tp = cv::Mat::zeros(n_rows, n_cols, CV_8UC1);
+            for (int r = 0; r < n_rows; ++r)
+            {
+                for (int c = 0; c < n_cols; ++c)
+                {
+                    unsigned char temp = 0;
+                    file.read((char*)&temp, sizeof(temp));
+                    tp.at<uchar>(r, c) = (int)temp;
+                }
+            }
+            vec.push_back(tp);
+        }
+    }
+}

-const int SIZE = 1 << 8; // feel free to change the size of array
-const int NPOT = SIZE - 3; // Non-Power-Of-Two
-int *a = new int[SIZE];
-int *b = new int[SIZE];
-int *c = new int[SIZE];
+
+void csv_write(double loss, int iteration_count, string file_name) {
+    std::ofstream outfile;
+    outfile.open(file_name, std::ios_base::app);
+    outfile << iteration_count << "," << loss << endl;
+}
+
+int main(int argc, char* argv[]) {
+    vector<double*> images;
+    vector<double*> labels;
+    vector<double*> images_test;
+    vector<double*> labels_test;
+    int classes = 0, input_dim = 0;
+    if (mnist) {
+        string filename = R"(..\data-set\train-images.idx3-ubyte)";
+        int number_of_images = 60000;
+        int image_size = 28 * 28;
+        classes = 10;
+        input_dim = image_size;
+        vector<cv::Mat> images_train;
+        read_Mnist(filename, images_train);
+        filename = R"(..\data-set\train-labels.idx1-ubyte)";
+        number_of_images = 60000;
+        vector<double> labels_train(number_of_images);
+        read_Mnist_Label(filename, labels_train);
+
+        for (auto train : images_train) {
+            images.push_back(mattodouble(train));
+        }
+        for (auto label : labels_train) {
+            labels.push_back(onehotmnist(label));
+        }
+
+        filename = R"(..\data-set\t10k-images.idx3-ubyte)";
+        number_of_images = 10000;
+        vector<cv::Mat> images_te;
+        read_Mnist(filename, images_te);
+        filename = R"(..\data-set\t10k-labels.idx1-ubyte)";
+        vector<double> labels_te(number_of_images);
+        read_Mnist_Label(filename, labels_te);
+        for (auto img : images_te) {
+            images_test.push_back(mattodouble(img));
+        }
+        for (auto label : labels_te) {
+            labels_test.push_back(onehotmnist(label));
+        }
+    }
+    else {
+        classes = 52;
+        input_dim = 14 * 14;
+        std::string path = "..\\data-set\\";
+        std::vector<std::string> v = read_directory(path);
+        for (auto x : v) {
+            cv::Mat image = cv::imread(x, 0);
+            cv::Size size(14, 14);
+            std::string output = x.substr(12, 2);
+            int label_index = stoi(output);
+            cv::resize(image, image, size);
+            if (image.isContinuous()) {
+                images.push_back(new double[image.rows*image.cols]);
+                for (int i = 0; i < image.rows*image.cols; i++) {
+                    images.back()[i] = (double)image.data[i];
+                }
+                labels.push_back(new double[classes]);
+                memset(labels.back(), 0, classes * sizeof(double));
+                labels.back()[label_index - 1] = 1;
+            }
+        }
+        images_test = images;
+        labels_test = labels;
+    }
+    vector<int> layers = { 100, 50, 25 };
+    vector<double> losses;
+    vector<double> accuracy;
+    CharacterRecognition::NeuralNet nn(input_dim, classes, layers);
+    double loss_epoch = 0;
+    int cnt = 0;
+    for (int i = 0; i < 120000; i++) {
+        int j = rand() % images.size();
+
+        double *output = nn.forward(images[j]);
+        loss_epoch += nn.calculateLoss(output, labels[j], classes);
+        if (i % 500 == 0)
+        {
+            losses.push_back(loss_epoch / images.size());
+            cnt++;
+            //csv_write(loss_epoch, cnt, R"(..\output_mnist_loss_500.csv)");
+            loss_epoch = 0.0;
+        }
+        nn.backward(labels[j]);
+    }
+    double train_correct = 0;
+    double test_correct = 0;
+    for (int j = 0; j < images.size(); j++) {
+        double *output = nn.forward(images[j]);
+        train_correct += std::distance(labels[j], std::max_element(labels[j], labels[j] + classes)) == std::distance(output, std::max_element(output, output + classes));
+    }
+    cout << "Train Accuracy: " << train_correct / images.size() << endl;
+
+    for (int j = 0; j < images_test.size(); j++) {
+        double *output = nn.forward(images_test[j]);
+        test_correct += std::distance(labels_test[j], std::max_element(labels_test[j], labels_test[j] + classes)) == std::distance(output, std::max_element(output, output + classes));
+    }
+    cout << "Test Accuracy: " << test_correct / images_test.size() << endl;
+    return 0;
 }
diff --git a/Project2-Stream-Compaction/README.md b/Project2-Stream-Compaction/README.md
index 0e38ddb..c8df7d7 100644
--- a/Project2-Stream-Compaction/README.md
+++ b/Project2-Stream-Compaction/README.md
@@ -1,14 +1,96 @@
 CUDA Stream Compaction
 ======================
-
 **University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 2**

-* (TODO) YOUR NAME HERE
-  * (TODO) [LinkedIn](), [personal website](), [twitter](), etc.
-* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab)
+* Dewang Sultania
+  * [LinkedIn](https://www.linkedin.com/in/dewang-sultania/)
+* Tested on: Windows 10, Intel Xeon E-2176M @ 2.70GHz 16GB, Quadro P2000 4GB (Personal Computer)
+
+### Table of Contents
+1. [Overview](#overview)
+2. [CPU Scan and Stream Compaction](#cpu)
+3. [Naive GPU Scan](#naive)
+4. [Work-Efficient GPU Scan and Stream Compaction](#work)
+5. [Thrust Implementation](#thrust)
+6. [Results](#results)
+7. [Performance Analysis](#performance)
+8. [Future Work](#future)
+
+<a name="overview"></a>
+## Overview
+
+In this project I implemented stream compaction from scratch in CUDA. Stream compaction has applications in ray tracing and collision detection, among others; the method implemented here removes the 0s from an array of ints.
+
+<a name="cpu"></a>
+## CPU Scan and Stream Compaction
+
+The CPU compaction algorithm is implemented in two ways:
+
+- Without scanning: this version walks the input array linearly while maintaining an output index; every nonzero element is written to the output array and the index is incremented.
+
+- With scanning: this version emulates the GPU algorithm on the CPU by computing an exclusive scan of a 0/1 mask array that marks which elements to keep. The scan value at each kept element is exactly the index at which that element must be stored in the compacted array. (A small worked example follows this list.)
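+
+For example, with input `[3, 0, 1, 0, 2]` the mask is `[1, 0, 1, 0, 1]`, whose exclusive scan is `[0, 1, 1, 2, 2]`; scattering each kept element to its scan value produces `[3, 1, 2]`.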
+The algorithm is illustrated as follows:
+
+![](img/stream_compaction.JPG)
+
+<a name="naive"></a>
+## Naïve GPU Scan
+
+The naïve scan adds pairs of elements at a fixed offset, doubling the offset with each level of depth. The following figures show the algorithm and a pictorial representation of it:
+
+Naïve Scan | Algorithm
+:-------------------------:|:-------------------------:
+![](img/naive.JPG) | ![](img/naive_algo.JPG)
+
+<a name="work"></a>
+## Work-Efficient GPU Scan and Stream Compaction
+
+The work-efficient scan uses a balanced binary tree (in concept) to perform the scan in two phases:
+- Up-Sweep (parallel reduction)
+- Down-Sweep
+
+The up-sweep algorithm is illustrated in the pictures below:
+
+Up Sweep | Algorithm
+:-------------------------:|:-------------------------:
+![](img/up_sweep.JPG) | ![](img/up_sweep_algo.JPG)
+
+In the down-sweep we traverse back down the tree, using the partial sums to build the scan in place. The following steps are performed:
+- Set the root to zero.
+- At each pass, a node passes its value to its left child and sets the right child to the sum of the previous left child's value and its own value.
+
+Down Sweep | Algorithm
+:-------------------------:|:-------------------------:
+![](img/down_sweep.JPG) | ![](img/down_sweep_algo.JPG)
+
+This computes an exclusive scan, which we can use for stream compaction just as on the CPU: the work-efficient scan of the mask array gives the index of each kept element in the output array.
+
+<a name="thrust"></a>
+## Thrust Implementation
+
+The Thrust implementation uses thrust::exclusive_scan to compute the exclusive scan of the array. It serves as a benchmark against the hand-written implementations.
+
+<a name="results"></a>
+## Results
+
+![](img/results.JPG)
+
+<a name="performance"></a>
+## Performance Analysis
+
+![](img/Array_pow_2.png)
+
+![](img/Array_non_2.png)
+
+![](img/thrust_2.png)
+
+![](img/thrust_non_2.png)
+
+The CPU implementation beat the work-efficient scan for small arrays but fell behind for large ones; the parallelism only pays off once array sizes get really large. Thrust was surprisingly slow for me. I need to dig deeper, but my initial hunch is that allocating and copying into the thrust vectors accounts for a significant share of the measured time.
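+
+One way to test that hunch (a sketch, not something measured here): move the device_vector allocation and copies out of the timed region in thrust.cu, so that only the scan itself is measured.
+
+```cpp
+// Hedged sketch: time only the scan, excluding allocation and transfers.
+thrust::device_vector<int> dev_in(idata, idata + n);  // alloc + H2D copy, untimed
+thrust::device_vector<int> dev_out(n);
+timer().startGpuTimer();
+thrust::exclusive_scan(dev_in.begin(), dev_in.end(), dev_out.begin());
+timer().endGpuTimer();
+thrust::copy(dev_out.begin(), dev_out.end(), odata);  // D2H copy, untimed
+```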
\ No newline at end of file
diff --git a/Project2-Stream-Compaction/img/Array_non_2.png b/Project2-Stream-Compaction/img/Array_non_2.png
new file mode 100644
index 0000000..fbee4dd
Binary files /dev/null and b/Project2-Stream-Compaction/img/Array_non_2.png differ
diff --git a/Project2-Stream-Compaction/img/Array_pow_2.png b/Project2-Stream-Compaction/img/Array_pow_2.png
new file mode 100644
index 0000000..ad0def5
Binary files /dev/null and b/Project2-Stream-Compaction/img/Array_pow_2.png differ
diff --git a/Project2-Stream-Compaction/img/CPU_Non_2.png b/Project2-Stream-Compaction/img/CPU_Non_2.png
new file mode 100644
index 0000000..50d715d
Binary files /dev/null and b/Project2-Stream-Compaction/img/CPU_Non_2.png differ
diff --git a/Project2-Stream-Compaction/img/CPU_Power_2.png b/Project2-Stream-Compaction/img/CPU_Power_2.png
new file mode 100644
index 0000000..44dce76
Binary files /dev/null and b/Project2-Stream-Compaction/img/CPU_Power_2.png differ
diff --git a/Project2-Stream-Compaction/img/down_sweep.JPG b/Project2-Stream-Compaction/img/down_sweep.JPG
new file mode 100644
index 0000000..8ac2a00
Binary files /dev/null and b/Project2-Stream-Compaction/img/down_sweep.JPG differ
diff --git a/Project2-Stream-Compaction/img/down_sweep_algo.JPG b/Project2-Stream-Compaction/img/down_sweep_algo.JPG
new file mode 100644
index 0000000..0a2f11c
Binary files /dev/null and b/Project2-Stream-Compaction/img/down_sweep_algo.JPG differ
diff --git a/Project2-Stream-Compaction/img/eff.png b/Project2-Stream-Compaction/img/eff.png
new file mode 100644
index 0000000..c0200b4
Binary files /dev/null and b/Project2-Stream-Compaction/img/eff.png differ
diff --git a/Project2-Stream-Compaction/img/naive.JPG b/Project2-Stream-Compaction/img/naive.JPG
new file mode 100644
index 0000000..222b762
Binary files /dev/null and b/Project2-Stream-Compaction/img/naive.JPG differ
diff --git a/Project2-Stream-Compaction/img/naive_algo.JPG b/Project2-Stream-Compaction/img/naive_algo.JPG
new file mode 100644
index 0000000..fb8defd
Binary files /dev/null and b/Project2-Stream-Compaction/img/naive_algo.JPG differ
diff --git a/Project2-Stream-Compaction/img/results.JPG b/Project2-Stream-Compaction/img/results.JPG
new file mode 100644
index 0000000..d6162c5
Binary files /dev/null and b/Project2-Stream-Compaction/img/results.JPG differ
diff --git a/Project2-Stream-Compaction/img/stream_compaction.JPG b/Project2-Stream-Compaction/img/stream_compaction.JPG
new file mode 100644
index 0000000..d4678a1
Binary files /dev/null and b/Project2-Stream-Compaction/img/stream_compaction.JPG differ
diff --git a/Project2-Stream-Compaction/img/thrust_2.png b/Project2-Stream-Compaction/img/thrust_2.png
new file mode 100644
index 0000000..711312b
Binary files /dev/null and b/Project2-Stream-Compaction/img/thrust_2.png differ
diff --git a/Project2-Stream-Compaction/img/thrust_non_2.png b/Project2-Stream-Compaction/img/thrust_non_2.png
new file mode 100644
index 0000000..3898178
Binary files /dev/null and b/Project2-Stream-Compaction/img/thrust_non_2.png differ
diff --git a/Project2-Stream-Compaction/img/up_sweep.JPG b/Project2-Stream-Compaction/img/up_sweep.JPG
new file mode 100644
index 0000000..cbb3378
Binary files /dev/null and b/Project2-Stream-Compaction/img/up_sweep.JPG differ
diff --git a/Project2-Stream-Compaction/img/up_sweep_algo.JPG b/Project2-Stream-Compaction/img/up_sweep_algo.JPG
new file mode 100644
index 0000000..f5f38cd
Binary files /dev/null and b/Project2-Stream-Compaction/img/up_sweep_algo.JPG differ
diff --git a/Project2-Stream-Compaction/src/main.cpp b/Project2-Stream-Compaction/src/main.cpp
index d016553..fd98f07 100644
--- a/Project2-Stream-Compaction/src/main.cpp
+++ b/Project2-Stream-Compaction/src/main.cpp
@@ -13,7 +13,7 @@
 #include <stream_compaction/thrust.h>
 #include "testing_helpers.hpp"

-const int SIZE = 1 << 8; // feel free to change the size of array
+const int SIZE = 1 << 3; // feel free to change the size of array
 const int NPOT = SIZE - 3; // Non-Power-Of-Two
 int *a = new int[SIZE];
 int *b = new int[SIZE];
diff --git a/Project2-Stream-Compaction/stream_compaction/CMakeLists.txt b/Project2-Stream-Compaction/stream_compaction/CMakeLists.txt
index cdbef77..4bb0dc2 100644
--- a/Project2-Stream-Compaction/stream_compaction/CMakeLists.txt
+++ b/Project2-Stream-Compaction/stream_compaction/CMakeLists.txt
@@ -13,5 +13,5 @@ set(SOURCE_FILES

 cuda_add_library(stream_compaction
     ${SOURCE_FILES}
-    OPTIONS -arch=sm_20
+    OPTIONS -arch=sm_61
     )
diff --git a/Project2-Stream-Compaction/stream_compaction/common.cu b/Project2-Stream-Compaction/stream_compaction/common.cu
index 2ed6d63..c3cd64e 100644
--- a/Project2-Stream-Compaction/stream_compaction/common.cu
+++ b/Project2-Stream-Compaction/stream_compaction/common.cu
@@ -23,7 +23,13 @@ namespace StreamCompaction {
          * which map to 0 will be removed, and elements which map to 1 will be kept.
          */
        __global__ void kernMapToBoolean(int n, int *bools, const int *idata) {
-            // TODO
+            int index = (blockIdx.x * blockDim.x) + threadIdx.x;
+            if (index >= n)
+                return;
+            if (idata[index] != 0)
+                bools[index] = 1;
+            else
+                bools[index] = 0;
        }

        /**
@@ -32,8 +38,12 @@ namespace StreamCompaction {
          */
        __global__ void kernScatter(int n, int *odata,
                const int *idata, const int *bools, const int *indices) {
-            // TODO
-        }
+            int index = (blockIdx.x * blockDim.x) + threadIdx.x;
+            if (index >= n)
+                return;
+            if (bools[index] == 1)
+                odata[indices[index]] = idata[index];
+        }
    }
 }
diff --git a/Project2-Stream-Compaction/stream_compaction/cpu.cu b/Project2-Stream-Compaction/stream_compaction/cpu.cu
index a2d3e6c..6d7ecc2 100644
--- a/Project2-Stream-Compaction/stream_compaction/cpu.cu
+++ b/Project2-Stream-Compaction/stream_compaction/cpu.cu
@@ -1,4 +1,5 @@
 #include <cstdio>
+#include <iostream>
 #include "cpu.h"
 #include "common.h"
@@ -18,9 +19,20 @@ namespace StreamCompaction {
          * (Optional) For better understanding before starting moving to GPU, you can simulate your GPU scan in this function first.
          */
        void scan(int n, int *odata, const int *idata) {
-            timer().startCpuTimer();
-            // TODO
-            timer().endCpuTimer();
+            int timer_flag = 0;
+            try {
+                timer().startCpuTimer();
+            }
+            catch (...) {
+                timer_flag = 1;
+                std::cout << "Timer had already started" << std::endl; // inside the catch, so it only prints when the timer was already running
+            }
+            odata[0] = 0;
+            for (int i = 1; i < n; i++) {
+                odata[i] = odata[i - 1] + idata[i - 1];
+            }
+            if (timer_flag == 0)
+                timer().endCpuTimer();
        }

        /**
@@ -30,9 +42,13 @@ namespace StreamCompaction {
          */
        int compactWithoutScan(int n, int *odata, const int *idata) {
            timer().startCpuTimer();
-            // TODO
+            int index_out = 0;
+            for (int i = 0; i < n; i++) {
+                if (idata[i] != 0)
+                    odata[index_out++] = idata[i];
+            }
            timer().endCpuTimer();
-            return -1;
+            return index_out;
        }

        /**
@@ -42,9 +58,26 @@ namespace StreamCompaction {
          */
        int compactWithScan(int n, int *odata, const int *idata) {
            timer().startCpuTimer();
-            // TODO
-            timer().endCpuTimer();
-            return -1;
+            int *mask = new int[n];
+            int *mask_scan = new int[n];
+            memset(mask, 0, n * sizeof(int));      // n * sizeof(int), not sizeof(mask): sizeof of a pointer is not the array size
+            memset(mask_scan, 0, n * sizeof(int));
+            for (int i = 0; i < n; i++) {
+                if (idata[i] == 0)
+                    mask[i] = 0;
+                else
+                    mask[i] = 1;
+            }
+            scan(n, mask_scan, mask); // the nested timer start is caught inside scan
+            int max_index = 0;
+            for (int i = 0; i < n; i++) {
+                if (mask[i] == 1) {
+                    odata[mask_scan[i]] = idata[i];
+                    max_index = mask_scan[i];
+                }
+            }
+            timer().endCpuTimer();
+            delete[] mask;      // free the scratch buffers
+            delete[] mask_scan;
+            return max_index + 1;
        }
    }
 }
diff --git a/Project2-Stream-Compaction/stream_compaction/efficient.cu b/Project2-Stream-Compaction/stream_compaction/efficient.cu
index 2db346e..a0100de 100644
--- a/Project2-Stream-Compaction/stream_compaction/efficient.cu
+++ b/Project2-Stream-Compaction/stream_compaction/efficient.cu
@@ -1,40 +1,112 @@
 #include <cuda.h>
 #include <cuda_runtime.h>
+#include <device_launch_parameters.h>
 #include "common.h"
 #include "efficient.h"
+#define blockSize 128

 namespace StreamCompaction {
-    namespace Efficient {
-        using StreamCompaction::Common::PerformanceTimer;
-        PerformanceTimer& timer()
-        {
-            static PerformanceTimer timer;
-            return timer;
-        }
+    namespace Efficient {
+        using StreamCompaction::Common::PerformanceTimer;
+        PerformanceTimer& timer()
+        {
+            static PerformanceTimer timer;
+            return timer;
+        }
+        __global__ void kernUpSweep(int n, int d, int *itemp) {
+            int power = 1 << (d + 1);
+            int k = ((blockIdx.x * blockDim.x) + threadIdx.x) * power;
+            if (k >= n) {
+                return;
+            }
+            int power_2 = 1 << d;
+            itemp[k + power - 1] += itemp[k + power_2 - 1];
+        }
+        __global__ void kernDownSweep(int n, int d, int *itemp) {
+            int power = 1 << (d + 1);
+            int k = ((blockIdx.x * blockDim.x) + threadIdx.x) * power;
+            if (k >= n) {
+                return;
+            }
+            int power_2 = 1 << d;
+            int temp = itemp[k + power_2 - 1];
+            itemp[k + power_2 - 1] = itemp[k + power - 1];
+            itemp[k + power - 1] += temp;
+        }
+        /**
+         * Performs prefix-sum (aka scan) on idata, storing the result into odata.
+         */
+        void scan(int n, int *odata, const int *idata) {
+            int *scan;
+            int D = ilog2ceil(n);
+            int tot_size = (1 << D);
+            cudaMalloc((void**)&scan, tot_size * sizeof(int));
+            checkCUDAError("CUDA Malloc failed");
+            cudaMemset(scan, 0, tot_size * sizeof(int));                      // zero the padding beyond n
+            cudaMemcpy(scan, idata, n * sizeof(int), cudaMemcpyHostToDevice); // copy only the n valid input elements
+            checkCUDAError("Copy Failed");
+            int timer_flag = 0;
+            try {
+                timer().startGpuTimer();
+            }
+            catch (...) {
+                timer_flag = 1;
+            }
+            for (int d = 0; d < ilog2ceil(tot_size); d++) {
+                dim3 fullBlocksPerGrid(((1 << (D - d - 1)) + blockSize - 1) / blockSize);
+                kernUpSweep <<<fullBlocksPerGrid, blockSize>>> (tot_size, d, scan);
+            }
+            cudaMemset(scan + tot_size - 1, 0, sizeof(int));
+            for (int d = ilog2ceil(tot_size) - 1; d >= 0; d--) {
+                dim3 fullBlocksPerGrid(((1 << (D - d - 1)) + blockSize - 1) / blockSize);
+                kernDownSweep <<<fullBlocksPerGrid, blockSize>>> (tot_size, d, scan);
+            }
+            if (timer_flag == 0)
+                timer().endGpuTimer();
+            cudaMemcpy(odata, scan, n * sizeof(int), cudaMemcpyDeviceToHost);
+            cudaFree(scan);
+        }

-        /**
-         * Performs prefix-sum (aka scan) on idata, storing the result into odata.
-         */
-        void scan(int n, int *odata, const int *idata) {
-            timer().startGpuTimer();
-            // TODO
-            timer().endGpuTimer();
-        }

-        /**
-         * Performs stream compaction on idata, storing the result into odata.
-         * All zeroes are discarded.
-         *
-         * @param n      The number of elements in idata.
-         * @param odata  The array into which to store elements.
-         * @param idata  The array of elements to compact.
-         * @returns      The number of elements remaining after compaction.
-         */
-        int compact(int n, int *odata, const int *idata) {
-            timer().startGpuTimer();
-            // TODO
-            timer().endGpuTimer();
-            return -1;
-        }
-    }
-}
+        /**
+         * Performs stream compaction on idata, storing the result into odata.
+         * All zeroes are discarded.
+         *
+         * @param n      The number of elements in idata.
+         * @param odata  The array into which to store elements.
+         * @param idata  The array of elements to compact.
+         * @returns      The number of elements remaining after compaction.
+         */
+        int compact(int n, int *odata, const int *idata) {
+            dim3 fullBlocksPerGrid((n + blockSize - 1) / blockSize);
+            int *mask, *mask_scan, *itemp, *otemp;
+            cudaMalloc((void**)&mask, n * sizeof(int));
+            checkCUDAError("CUDA Malloc failed");
+            cudaMalloc((void**)&mask_scan, n * sizeof(int));
+            checkCUDAError("CUDA Malloc failed");
+            cudaMalloc((void**)&itemp, n * sizeof(int));
+            checkCUDAError("CUDA Malloc failed");
+            cudaMalloc((void**)&otemp, n * sizeof(int));
+            checkCUDAError("CUDA Malloc failed");

+            cudaMemcpy(itemp, idata, n * sizeof(int), cudaMemcpyHostToDevice);
+            checkCUDAError("Copy Failed");
+            timer().startGpuTimer();
+            StreamCompaction::Common::kernMapToBoolean <<<fullBlocksPerGrid, blockSize>>> (n, mask, itemp);
+            int *cpu_otemp = new int[n];
+            int *cpu_itemp = new int[n];
+            cudaMemcpy(cpu_itemp, mask, n * sizeof(int), cudaMemcpyDeviceToHost);
+            scan(n, cpu_otemp, cpu_itemp);
+            int elements = cpu_itemp[n - 1] == 0 ? cpu_otemp[n - 1] : cpu_otemp[n - 1] + 1;
+            cudaMemcpy(mask_scan, cpu_otemp, n * sizeof(int), cudaMemcpyHostToDevice);
+            StreamCompaction::Common::kernScatter <<<fullBlocksPerGrid, blockSize>>> (n, otemp, itemp, mask, mask_scan);
+            timer().endGpuTimer();

+            cudaMemcpy(odata, otemp, n * sizeof(int), cudaMemcpyDeviceToHost);
+            delete[] cpu_otemp; // free the host staging buffers
+            delete[] cpu_itemp;
+            cudaFree(mask);
+            cudaFree(itemp);
+            cudaFree(otemp);
+            cudaFree(mask_scan);
+            return elements;
+        }
+    }
+}
\ No newline at end of file
diff --git a/Project2-Stream-Compaction/stream_compaction/naive.cu b/Project2-Stream-Compaction/stream_compaction/naive.cu
index 4308876..b82c7d4 100644
--- a/Project2-Stream-Compaction/stream_compaction/naive.cu
+++ b/Project2-Stream-Compaction/stream_compaction/naive.cu
@@ -2,7 +2,9 @@
 #include <cuda_runtime.h>
 #include "common.h"
 #include "naive.h"
+#include "device_launch_parameters.h"

+#define blockSize 32
 namespace StreamCompaction {
    namespace Naive {
        using StreamCompaction::Common::PerformanceTimer;
@@ -13,13 +15,39 @@ namespace StreamCompaction {
        }

        // TODO: __global__
+        __global__ void kernNaive(int n, int d, int *itemp, int *otemp) {
+            int k = (blockIdx.x * blockDim.x) + threadIdx.x;
+            if (k >= n) {
+                return;
+            }
+            if (k >= (1 << (d - 1)))
+                otemp[k] = itemp[k - (1 << (d - 1))] + itemp[k];
+            else
+                otemp[k] = itemp[k];
+        }

        /**
         * Performs prefix-sum (aka scan) on idata, storing the result into odata.
         */
        void scan(int n, int *odata, const int *idata) {
-            timer().startGpuTimer();
-            // TODO
-            timer().endGpuTimer();
+            int *itemp, *otemp;
+            cudaMalloc((void**)&itemp, n * sizeof(int));
+            checkCUDAError("Malloc for input temp array failed");
+            cudaMalloc((void**)&otemp, n * sizeof(int));
+            checkCUDAError("Malloc for output temp array failed");
+            cudaMemcpy(itemp, idata, n * sizeof(int), cudaMemcpyHostToDevice);
+            dim3 fullBlocksPerGrid((n + blockSize - 1) / blockSize);
+            timer().startGpuTimer();
+            for (int d = 1; d <= ilog2ceil(n); d++) {
+                kernNaive <<<fullBlocksPerGrid, blockSize>>> (n, d, itemp, otemp);
+                std::swap(itemp, otemp);
+            }
+            timer().endGpuTimer();
+            std::swap(itemp, otemp); // undo the final swap so otemp holds the inclusive scan
+            cudaMemcpy(odata + 1, otemp, (n - 1) * sizeof(int), cudaMemcpyDeviceToHost);
+            odata[0] = 0; // shift right to convert the inclusive scan to an exclusive one
+            cudaFree(itemp);
+            cudaFree(otemp);
        }
    }
 }
diff --git a/Project2-Stream-Compaction/stream_compaction/thrust.cu b/Project2-Stream-Compaction/stream_compaction/thrust.cu
index 1def45e..65ea286 100644
--- a/Project2-Stream-Compaction/stream_compaction/thrust.cu
+++ b/Project2-Stream-Compaction/stream_compaction/thrust.cu
@@ -22,6 +22,12 @@ namespace StreamCompaction {
            // TODO use `thrust::exclusive_scan`
            // example: for device_vectors dv_in and dv_out:
            // thrust::exclusive_scan(dv_in.begin(), dv_in.end(), dv_out.begin());
+            thrust::host_vector<int> host_in(idata, idata + n);
+            thrust::device_vector<int> dev_in(n);
+            thrust::device_vector<int> dev_out(n);
+            thrust::copy(host_in.begin(), host_in.end(), dev_in.begin());
+            thrust::exclusive_scan(dev_in.begin(), dev_in.end(), dev_out.begin());
+            thrust::copy(dev_out.begin(), dev_out.end(), odata);
            timer().endGpuTimer();
        }
    }
diff --git a/README.md b/README.md
index 3a0b2fe..76b1532 100644
--- a/README.md
+++ b/README.md
@@ -3,14 +3,10 @@ CUDA Number Algorithms

 **University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 2**

-* (TODO) YOUR NAME HERE
-  * (TODO) [LinkedIn](), [personal website](), [twitter](), etc.
-* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab)
+* Dewang Sultania
+  * [LinkedIn](https://www.linkedin.com/in/dewang-sultania/)
+* Tested on: Windows 10, Intel Xeon E-2176M @ 2.70GHz 16GB, Quadro P2000 4GB (Personal Computer)

-### (TODO: Your README)
-
-Link to the readmes of the other two subprojects.
-
-Add anything else you think is relevant up to this point.
-(Remember, this is public, so don't put anything here that you don't want to share with the world.)
+1) [Stream Compaction](./Project2-Stream-Compaction/README.md)
+2) [Character Recognition](./Project2-Character-Recognition/README.md)