diff --git a/README.md b/README.md
index 6e02afa..0d60152 100644
--- a/README.md
+++ b/README.md
@@ -1,133 +1,24 @@
-Project-2
-=========
-
 A Study in Parallel Algorithms : Stream Compaction
 
-# INTRODUCTION
-Many of the algorithms you have learned thus far in your career have typically
-been developed from a serial standpoint. When it comes to GPUs, we are mainly
-looking at massively parallel work. Thus, it is necessary to reorient our
-thinking. In this project, we will be implementing a couple different versions
-of prefix sum. We will start with a simple single thread serial CPU version,
-and then move to a naive GPU version. Each part of this homework is meant to
-follow the logic of the previous parts, so please do not do this homework out of
-order.
-
-This project will serve as a stream compaction library that you may use (and
-will want to use) in your
-future projects. For that reason, we suggest you create proper header and CUDA
-files so that you can reuse this code later. You may want to create a separate
-cpp file that contains your main function so that you can test the code you
-write.
-
-# OVERVIEW
-Stream compaction is broken down into two parts: (1) scan, and (2) scatter.
-
-## SCAN
-Scan or prefix sum is the summation of the elements in an array such that the
-resulting array is the summation of the terms before it. Prefix sum can either
-be inclusive, meaning the current term is a summation of all the elements before
-it and itself, or exclusive, meaning the current term is a summation of all
-elements before it excluding itself.
-
-Inclusive:
-
-In : [ 3 4 6 7 9 10 ]
-
-Out : [ 3 7 13 20 29 39 ]
-
-Exclusive
-
-In : [ 3 4 6 7 9 10 ]
-
-Out : [ 0 3 7 13 20 29 ]
-
-Note that the resulting prefix sum will always be n + 1 elements if the input
-array is of length n. Similarly, the first element of the exclusive prefix sum
-will always be 0. In the following sections, all references to prefix sum will
-be to the exclusive version of prefix sum.
-
-## SCATTER
-The scatter section of stream compaction takes the results of the previous scan
-in order to reorder the elements to form a compact array.
-
-For example, let's say we have the following array:
-[ 0 0 3 4 0 6 6 7 0 1 ]
-
-We would only like to consider the non-zero elements in this zero, so we would
-like to compact it into the following array:
-[ 3 4 6 6 7 1 ]
-
-We can perform a transform on input array to transform it into a boolean array:
-
-In : [ 0 0 3 4 0 6 6 7 0 1 ]
-
-Out : [ 0 0 1 1 0 1 1 1 0 1 ]
-
-Performing a scan on the output, we get the following array :
-
-In : [ 0 0 1 1 0 1 1 1 0 1 ]
-
-Out : [ 0 0 0 1 2 2 3 4 5 5 ]
-
-Notice that the output array produces a corresponding index array that we can
-use to create the resulting array for stream compaction.
-
-# PART 1 : REVIEW OF PREFIX SUM
-Given the definition of exclusive prefix sum, please write a serial CPU version
-of prefix sum. You may write this in the cpp file to separate this from the
-CUDA code you will be writing in your .cu file.
-
-# PART 2 : NAIVE PREFIX SUM
-We will now parallelize this the previous section's code. Recall from lecture
-that we can parallelize this using a series of kernel calls. In this portion,
-you are NOT allowed to use shared memory.
-
-### Questions
-* Compare this version to the serial version of exclusive prefix scan. Please
-  include a table of how the runtimes compare on different lengths of arrays.
-* Plot a graph of the comparison and write a short explanation of the phenomenon you
-  see here.
-
-# PART 3 : OPTIMIZING PREFIX SUM
-In the previous section we did not take into account shared memory. In the
-previous section, we kept everything in global memory, which is much slower than
-shared memory.
-
-## PART 3a : Write prefix sum for a single block
-Shared memory is accessible to threads of a block. Please write a version of
-prefix sum that works on a single block.
+There are two main components of stream compaction: scan and scatter.
+
-## PART 3b : Generalizing to arrays of any length.
-Taking the previous portion, please write a version that generalizes prefix sum
-to arbitrary length arrays, this includes arrays that will not fit on one block.
+Here is a comparison of the various methods I used to scan:
+
-### Questions
-* Compare this version to the parallel prefix sum using global memory.
-* Plot a graph of the comparison and write a short explanation of the phenomenon
-  you see here.
+![](http://i.imgur.com/AaR3gk0.png)
+
-# PART 4 : ADDING SCATTER
-First create a serial version of scatter by expanding the serial version of
-prefix sum. Then create a GPU version of scatter. Combine the function call
-such that, given an array, you can call stream compact and it will compact the
-array for you. Finally, write a version using thrust.
+As you can see, the serial version is faster for small arrays, but is quickly outmatched as the array length grows. The global
+memory version is always slightly slower than the shared memory version, which makes sense: the only difference between the two is the
+cost of the more frequent global memory fetches. The work-efficient algorithm I implemented must have a bug in it, because
+it only becomes comparable to the naive shared memory version once the array is over 10 million elements long. Further investigation is
+needed.
+
-### Questions
-* Compare your version of stream compact to your version using thrust. How do
-  they compare? How might you optimize yours more, or how might thrust's stream
-  compact be optimized.
+And here is a comparison of my scatter implementation and thrust's. I suspect I am calling a slow
+thrust version of this, because I don't think my basic CUDA version should be as fast as thrust's. But, to be honest,
+I'm not sure how else to optimize my implementation of scatter any further. It has three global memory reads that are absolutely necessary,
+and a branch.
+
-# EXTRA CREDIT (+10)
-For extra credit, please optimize your prefix sum for work parallelism and to
-deal with bank conflicts. Information on this can be found in the GPU Gems
-chapter listed in the references.
+![](http://i.imgur.com/V55kt3w.png)
+
-# SUBMISSION
-Please answer all the questions in each of the subsections above and write your
-answers in the README by overwriting the README file. In future projects, we
-expect your analysis to be similar to the one we have led you through in this
-project. Like other projects, please open a pull request and email Harmony.
 # REFERENCES
 "Parallel Prefix Sum (Scan) with CUDA." GPU Gems 3.
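For reference, the scatter step timed above boils down to a kernel of this shape. This is a minimal sketch with illustrative names, not the repo's kernel (the real one is `streamCompaction` in `src/streamCompaction.cu` below); it shows exactly the three mandatory global reads (`flags[k]`, `indices[k]`, `in[k]`) and the single branch mentioned in the README:

```cuda
// Minimal scatter sketch: flags marks survivors, indices holds their
// exclusive-scan output slots. Three global reads and one branch per thread.
__global__ void scatterSketch(const int* in, const int* flags,
                              const int* indices, int* out, int n){
    int k = (blockIdx.x * blockDim.x) + threadIdx.x;
    if (k < n && flags[k]){          // the unavoidable branch
        out[indices[k]] = in[k];     // write survivor to its compacted slot
    }
}
```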
diff --git a/src/main.cpp b/src/main.cpp
new file mode 100644
index 0000000..fcaa421
--- /dev/null
+++ b/src/main.cpp
@@ -0,0 +1,309 @@
+#include <iostream>
+
+#include "streamCompaction.h"
+
+using namespace std;
+
+void serialSum(){
+    int numElements = 256;
+
+    dataPacket * ints = new dataPacket[numElements];
+    for (int i=0; i<numElements; i+=1){
+        ints[i] = dataPacket(i);
+    }
+
+    // (the serial scan body was lost in transcription)
+
+    delete [] ints;
+}
+
+// NOTE: the test harness below is reconstructed from surviving fragments;
+// the function names are placeholders, not the original identifiers.
+void testKillAll(){
+    int numElements = 256;
+
+    dataPacket * ints = new dataPacket[numElements];
+    for (int i=0; i<numElements; i+=1){
+        ints[i] = dataPacket(i);
+    }
+
+    DataStream ds(numElements, ints);
+    while (ds.numAlive() > 0){
+        int toKill = rand() % ds.numAlive();
+        ds.kill(toKill);
+        ds.compactWorkEfficientArbitrary ();
+
+        cout<<"killing "<<toKill<<endl;
+    }
+}
+
+void testKillBounded(){
+    int numElements = 256;
+
+    dataPacket * ints = new dataPacket[numElements];
+    for (int i=0; i<numElements; i+=1){
+        ints[i] = dataPacket(i);
+    }
+
+    DataStream ds(numElements, ints);
+
+    int bound = 0;
+    while (ds.numAlive() > 0 && bound < 10){
+
+        int toKill = rand() % ds.numAlive();
+        ds.kill(toKill);
+        dataPacket cur;
+        ds.getData(toKill, cur);
+        cout<<"killed "<<cur.index<<endl;
+        bound += 1;
+    }
+}
+
+void testKillAndCompactBounded(){
+    int numElements = 256;
+
+    dataPacket * ints = new dataPacket[numElements];
+    for (int i=0; i<numElements; i+=1){
+        ints[i] = dataPacket(i);
+    }
+
+    DataStream ds(numElements, ints);
+
+    int bound = 0;
+    while (ds.numAlive() > 0 && bound < 20){
+        int toKill = rand() % ds.numAlive();
+        // toKill = 10;
+        ds.kill(toKill);
+        ds.compactWorkEfficientArbitrary ();
+
+        dataPacket cur;
+        ds.getData(toKill, cur);
+        cout<<"killing "<<cur.index<<endl;
+        bound += 1;
+    }
+}
+
+int main(int argc, char** argv){
+    testStreamCompaction();
+    return 0;
+}
diff --git a/src/streamCompaction.cu b/src/streamCompaction.cu
new file mode 100755
--- /dev/null
+++ b/src/streamCompaction.cu
+#include <stdio.h>
+#include <cuda_runtime.h>
+#include <cmath>
+#include <ctime>
+#include <iostream>
+
+#include "streamCompaction.h"
+
+using namespace std;
+#include <thrust/copy.h>
+
+#define NUM_BANKS 16
+#define LOG_NUM_BANKS 4
+// Note: + binds tighter than >>, so this expands as (n) >> (NUM_BANKS + (n)) >> ...;
+// this is a known erratum carried over from the GPU Gems 3 listing.
+#define CONFLICT_FREE_OFFSET(n) \
+    ((n) >> NUM_BANKS + (n) >> (2 * LOG_NUM_BANKS))
+
+void checkCUDAError(const char *msg) {
+    cudaError_t err = cudaGetLastError();
+    if( cudaSuccess != err) {
+        fprintf(stderr, "Cuda error: %s: %s.\n", msg, cudaGetErrorString( err) );
+    }
+}
+
+// One step of the naive scan: every element at k >= d1 adds in the element d1 slots back.
+__global__ void sum(int* in, int* out, int n, int d1){
+    int k = (blockIdx.x * blockDim.x) + threadIdx.x;
+    if (k<n){
+        int ink = in[k];
+        if (k>=d1){
+            out[k] = in[k-d1] + ink;
+        }
+        else{
+            out[k] = ink;
+        }
+    }
+}
+
+// Turns an inclusive scan into an exclusive one by shifting right and seeding a 0.
+__global__ void shift(int* in, int* out, int n){
+    int k = (blockIdx.x * blockDim.x) + threadIdx.x;
+
+    out[0] = 0;
+    if (k<n && k>0){
+        out[k] = in[k-1];
+    }
+}
+
+__global__ void naiveSumGlobal(int* in, int* out, int n){
+
+    int index = (blockIdx.x * blockDim.x) + threadIdx.x;
+
+    int logn = ceil(log(float(n))/log(2.0f));
+    for (int d=1; d<=logn; d++){
+
+        int offset = powf(2.0f, d-1);
+
+        if (index >= offset){
+            out[index] = in[index-offset] + in[index];
+        }
+        else{
+            out[index] = in[index];
+        }
+        __syncthreads();
+
+        // Caution: this pointer swap is per-thread, and __syncthreads() only
+        // synchronizes within one block, so this loop is only safe on a single block.
+        int* temp = in;
+        in = out;
+        out = temp;
+    }
+}
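Because `__syncthreads()` only synchronizes one block, the naive scan needs the host to provide the grid-wide barrier: one `sum` launch per step, ping-ponging the buffers between launches. A minimal driver sketch (my names and launch sizes; the real loops live in the `DataStream::compact*` methods further down):

```cuda
// Host-side driver for the naive scan: ceil(log2(n)) launches of sum,
// swapping in/out after each so every step reads the previous step's output,
// then one shift launch to turn the inclusive result into an exclusive scan.
void naiveScanDriver(int* dIn, int* dOut, int n){
    int blocks = (n + THREADS_PER_BLOCK - 1) / THREADS_PER_BLOCK;
    int steps = (int)ceil(log((float)n) / log(2.0f));
    for (int d = 1; d <= steps; d++){
        sum<<<blocks, THREADS_PER_BLOCK>>>(dIn, dOut, n, (int)powf(2.0f, d - 1));
        cudaDeviceSynchronize();                 // grid-wide barrier between steps
        int* tmp = dIn; dIn = dOut; dOut = tmp;  // ping-pong the buffers
    }
    shift<<<blocks, THREADS_PER_BLOCK>>>(dIn, dOut, n);
}
```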
+
+__global__ void naiveSumSharedSingleBlock(int* in, int* out, int n){
+
+    int index = threadIdx.x;
+
+    if (index >= n) return;
+
+    extern __shared__ int shared[];
+    int *tempIn = &shared[0];
+    int *tempOut = &shared[n];
+
+    // Load shifted right by one so the result is an exclusive scan.
+    tempOut[index] = (index > 0) ? in[index-1] : 0;
+
+    __syncthreads();
+
+    for (int offset = 1; offset <= n; offset *= 2){
+        int* temp = tempIn;
+        tempIn = tempOut;
+        tempOut = temp;
+
+        if (index >= offset){
+            tempOut[index] = tempIn[index-offset] + tempIn[index];
+        }
+        else{
+            tempOut[index] = tempIn[index];
+        }
+        __syncthreads();
+    }
+    out[index] = tempOut[index];
+}
+
+__global__ void naiveSumSharedArbitrary(int* in, int* out, int n, int* sums=0){
+
+    int localIndex = threadIdx.x;
+    int globalIndex = (blockIdx.x * blockDim.x) + threadIdx.x;
+
+    extern __shared__ int shared[];
+    int *tempIn = &shared[0];
+    int *tempOut = &shared[n];
+
+    tempOut[localIndex] = in[globalIndex];
+
+    __syncthreads();
+
+    for (int offset = 1; offset < n; offset *= 2){
+        int* temp = tempIn;
+        tempIn = tempOut;
+        tempOut = temp;
+
+        if (localIndex >= offset){
+            tempOut[localIndex] = tempIn[localIndex-offset] + tempIn[localIndex];
+        }
+        else{
+            tempOut[localIndex] = tempIn[localIndex];
+        }
+        __syncthreads();
+    }
+
+    // Each block records its total so the per-block scans can be stitched together.
+    if (sums) sums[blockIdx.x] = tempOut[n-1];
+    out[globalIndex] = tempOut[localIndex];
+}
+
+__global__ void workEfficientSumSingleBlock(int* in, int* out, int n){
+
+    extern __shared__ float temp[];
+    int index = (blockIdx.x * blockDim.x) + threadIdx.x;
+    int offset = 1;
+
+    if (2*index+1<=n){
+        temp[2*index] = in[2*index];
+        temp[2*index+1] = in[2*index+1];
+
+        // Up-sweep (reduce) phase.
+        for (int d = n>>1; d>0; d >>= 1){
+            __syncthreads();
+            if (index < d){
+                int ai = offset * (2*index+1) - 1;
+                int bi = offset * (2*index+2) - 1;
+
+                temp[bi] += temp[ai];
+            }
+            offset *= 2;
+        }
+
+        if (index == 0) temp[n - 1] = 0;
+
+        // Down-sweep phase.
+        for (int d = 1; d < n; d *= 2){
+            offset >>= 1;
+            __syncthreads();
+            if (index < d){
+
+                int ai = offset * (2*index+1) - 1;
+                int bi = offset * (2*index+2) - 1;
+
+                if (ai < n && bi < n){
+                    float t = temp[ai];
+                    temp[ai] = temp[bi];
+                    temp[bi] += t;
+                }
+            }
+        }
+        __syncthreads();
+
+        out[2*index] = temp[2*index];
+        out[2*index+1] = temp[2*index+1];
+    }
+
+}
+
+__global__ void workEfficientArbitrary(int* in, int* out, int n, int* sums=0){
+
+    extern __shared__ float temp[];
+
+    int offset = 1;
+    int index = threadIdx.x;
+
+    int indexA = index;
+    int indexB = index + (n/2);
+    int bankOffsetA = CONFLICT_FREE_OFFSET(indexA);
+    int bankOffsetB = CONFLICT_FREE_OFFSET(indexB);
+    // Caution: these loads (and the stores at the bottom) index in/out with the
+    // local offsets only, so every block reads and writes the first n elements;
+    // a per-block base offset (blockIdx.x * n) appears to be missing, which may
+    // be the bug mentioned in the README.
+    temp[indexA + bankOffsetA] = in[indexA];
+    temp[indexB + bankOffsetB] = in[indexB];
+
+    for (int d = n>>1; d>0; d >>= 1){
+        __syncthreads();
+        if (index < d){
+            int ai = offset * (2*index+1) - 1;
+            int bi = offset * (2*index+2) - 1;
+
+            ai += CONFLICT_FREE_OFFSET(ai);
+            bi += CONFLICT_FREE_OFFSET(bi);
+
+            temp[bi] += temp[ai];
+        }
+        offset *= 2;
+    }
+
+    if (index == 0){
+        if (sums) sums[blockIdx.x] = temp[n - 1 + CONFLICT_FREE_OFFSET(n - 1)];
+        temp[n - 1 + CONFLICT_FREE_OFFSET(n - 1)] = 0;
+    }
+
+    for (int d = 1; d < n; d *= 2){
+        offset >>= 1;
+        __syncthreads();
+        if (index < d){
+
+            int ai = offset * (2*index+1) - 1;
+            int bi = offset * (2*index+2) - 1;
+
+            ai += CONFLICT_FREE_OFFSET(ai);
+            bi += CONFLICT_FREE_OFFSET(bi);
+
+            if (ai < n && bi < n){
+                float t = temp[ai];
+                temp[ai] = temp[bi];
+                temp[bi] += t;
+            }
+        }
+    }
+    __syncthreads();
+
+    out[indexA] = temp[indexA + bankOffsetA];
+    out[indexB] = temp[indexB + bankOffsetB];
+}
+
+// Adds each block's scanned block-total onto every element of that block.
+__global__ void addIncs(int* cudaAuxIncs, int* cudaIndicesB, int n){
+    int index = (blockIdx.x * blockDim.x) + threadIdx.x;
+
+
+    // if (index < n){
+    // cudaIndicesB[index] = blockIdx.x; //cudaAuxIncs[blockIdx.x];
+    cudaIndicesB[index] += cudaAuxIncs[blockIdx.x];
+    // }
+}
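The `sums`/`addIncs` pair above is what generalizes the single-block scans to arbitrary lengths: scan each block locally, scan the per-block totals, then add each block's scanned total back onto its elements. A sketch of that sequence, assuming n is a multiple of the block size (the launch sizes here are my reconstruction, not the repo's exact configs):

```cuda
// Three-phase scan for arrays larger than one block.
void scanLargeSketch(int* dIn, int* dOut, int* dSums, int* dIncs, int n){
    int elemsPerBlock = THREADS_PER_BLOCK * 2;
    int numBlocks = n / elemsPerBlock;            // assumes n % elemsPerBlock == 0
    size_t shmem = 2 * elemsPerBlock * sizeof(int);
    // 1) per-block inclusive scans; each block's total lands in dSums
    naiveSumSharedArbitrary<<<numBlocks, elemsPerBlock, shmem>>>(dIn, dOut, elemsPerBlock, dSums);
    // 2) exclusive-scan the numBlocks totals in dSums into dIncs
    //    (any of the scan methods above works for this small array)
    // 3) add block b's scanned total to every element of block b
    addIncs<<<numBlocks, elemsPerBlock>>>(dIncs, dOut, n);
}
```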
+
+// Scatter: each surviving element writes itself to its scanned output slot.
+__global__ void streamCompaction(dataPacket* inRays, int* indices, dataPacket* outRays, int numElements){
+    int k = (blockIdx.x * blockDim.x) + threadIdx.x;
+
+    if (k<numElements && inRays[k].alive){
+        outRays[indices[k]] = inRays[k];
+    }
+}
+
+// NOTE: the two kernels below were lost in transcription and are reconstructed
+// from their call sites: kill flags one element dead, reset re-arms the flags.
+__global__ void killStream(int toKill, dataPacket* data, int* indices, int n){
+    int k = (blockIdx.x * blockDim.x) + threadIdx.x;
+    if (k<n){
+        if (k == toKill) data[k].alive = false;
+        indices[k] = data[k].alive ? 1 : 0;
+    }
+}
+
+__global__ void resetStreams(dataPacket* data, int* indices, int n){
+    int k = (blockIdx.x * blockDim.x) + threadIdx.x;
+    if (k<n){
+        indices[k] = 1;
+    }
+}
+
+void testStreamCompaction(){
+    int numElements = 256;
+
+    // (buffer setup reconstructed; the original allocation/initialization code was lost)
+    dataPacket* cudaArrayA;
+    dataPacket* cudaArrayB;
+    int* testin;
+    int* testout;
+    int* cputest = new int[numElements];
+    cudaMalloc((void**)&cudaArrayA, numElements*sizeof(dataPacket));
+    cudaMalloc((void**)&cudaArrayB, numElements*sizeof(dataPacket));
+    cudaMalloc((void**)&testin, numElements*sizeof(int));
+    cudaMalloc((void**)&testout, numElements*sizeof(int));
+
+    dim3 threadsPerBlock(THREADS_PER_BLOCK);
+    dim3 fullBlocksPerGrid(int(ceil(float(numElements)/float(THREADS_PER_BLOCK))));
+
+    //Scan
+    for (int d=1; d<=ceil(log(float(numElements))/log(2.0f)); d++){
+        sum<<<fullBlocksPerGrid, threadsPerBlock>>>(testin, testout, numElements, int(pow(2.0f,d-1)));
+        cudaThreadSynchronize();
+        cudaMemcpy(cputest, testout, numElements*sizeof(int), cudaMemcpyDeviceToHost);
+
+
+        int* temp = testin;
+        testin=testout;
+        testout=temp;
+    }
+    //Compact
+    streamCompaction<<<fullBlocksPerGrid, threadsPerBlock>>>(cudaArrayA, testin, cudaArrayB, numElements);
+    cudaArrayA = cudaArrayB;
+    cudaThreadSynchronize();
+
+    cudaMemcpy(&numElements, &testin[numElements-1], 1*sizeof(int), cudaMemcpyDeviceToHost);
+
+    std::cout<<"number of rays left: "<<numElements<<std::endl;
+}
+
+void DataStream::globalSum(int* in, int* out, int n){
+
+    dim3 threadsPerBlockL(THREADS_PER_BLOCK);
+    dim3 fullBlocksPerGridL(int(ceil(float(n)/float(THREADS_PER_BLOCK))));
+
+    for (int d=1; d<=ceil(log(float(n))/log(2.0f)); d++){
+        sum<<<fullBlocksPerGridL, threadsPerBlockL>>>(in, out, m_numElementsAlive, powf(2.0f, d-1));
+        cudaThreadSynchronize();
+        int* temp = in;
+        in = out;
+        out = temp;
+    }
+    shift<<<fullBlocksPerGridL, threadsPerBlockL>>>(in, out, m_numElementsAlive);
+}
+
+// Predicate for thrust::copy_if below (reconstructed; the original definition was lost).
+struct isOne{
+    __host__ __device__ bool operator()(const int x) const { return x == 1; }
+};
+
+void DataStream::thrustStreamCompact(){
+    clock_t t = clock ();
+    thrust::copy_if (m_data, m_data+m_numElements, m_indices, m_data, isOne());
+    t = clock() - t;
+    cout<<(float)t/CLOCKS_PER_SEC<<endl;
+}
+
+void DataStream::compactWorkEfficientArbitrary(){
+
+    // (launch setup reconstructed from the calls below)
+    int threadsPerBlock = THREADS_PER_BLOCK;
+    int procsPefBlock = threadsPerBlock*2;
+    int sumSize = m_numElements/procsPefBlock;
+
+    dim3 threadsPerBlockOld(threadsPerBlock);
+    dim3 fullBlocksPerGridOld(int(ceil(float(sumSize)/float(threadsPerBlock))));
+
+    workEfficientArbitrary<<<sumSize, threadsPerBlock, 2*procsPefBlock*sizeof(float)>>>(cudaIndicesA, cudaIndicesB, procsPefBlock, cudaAuxSums);
+
+    for (int d=1; d<=ceil(log(sumSize)/log(2)); d++){
+        sum<<<fullBlocksPerGridOld, threadsPerBlockOld>>>(cudaAuxSums, cudaAuxIncs, sumSize, powf(2.0f, d-1));
+        cudaThreadSynchronize();
+        int* temp = cudaAuxSums;
+        cudaAuxSums = cudaAuxIncs;
+        cudaAuxIncs = temp;
+    }
+    shift<<<fullBlocksPerGridOld, threadsPerBlockOld>>>(cudaAuxSums, cudaAuxIncs, m_numElements/(THREADS_PER_BLOCK*2));
+    addIncs<<<sumSize, procsPefBlock>>>(cudaAuxIncs, cudaIndicesB, m_numElements);
+
+    //Stream compaction from A into B, then save back into A
+    streamCompaction<<<sumSize, procsPefBlock>>>(cudaDataA, cudaIndicesB, cudaDataB, m_numElementsAlive);
+    dataPacket * temp = cudaDataA;
+    cudaDataA = cudaDataB;
+    cudaDataB = temp;
+
+    // update numrays
+    cudaMemcpy(&m_numElementsAlive, &cudaIndicesB[m_numElementsAlive], sizeof(int), cudaMemcpyDeviceToHost);
+
+    resetStreams<<<sumSize, procsPefBlock>>>(cudaDataA, cudaIndicesA, m_numElementsAlive);
+}
+
+void DataStream::compactNaiveSumGlobal(){
+
+    int threadsPerBlock = THREADS_PER_BLOCK;
+
+    dim3 threadsPerBlockL(threadsPerBlock);
+    dim3 fullBlocksPerGridL(m_numElements/threadsPerBlock);
+
+    clock_t t = clock();
+    for (int d=1; d<=ceil(log(m_numElementsAlive)/log(2)); d++){
+        sum<<<fullBlocksPerGridL, threadsPerBlockL>>>(cudaIndicesA, cudaIndicesB, m_numElementsAlive, powf(2.0f, d-1));
+        checkCUDAError("kernel failed 1 !");
+        cudaThreadSynchronize();
+        int* temp = cudaIndicesA;
+        cudaIndicesA = cudaIndicesB;
+        cudaIndicesB = temp;
+    }
+    shift<<<fullBlocksPerGridL, threadsPerBlockL>>>(cudaIndicesA, cudaIndicesB, m_numElementsAlive);
+    checkCUDAError("kernel failed 1 !");
+    t = clock() - t;
+
+    cudaMemcpy(m_indices, cudaIndicesB, m_numElementsAlive*sizeof(int), cudaMemcpyDeviceToHost);
+
+    cout<<(float)t/CLOCKS_PER_SEC<<endl;
+
+    //Stream compaction from A into B, then save back into A
+    streamCompaction<<<fullBlocksPerGridL, threadsPerBlockL>>>(cudaDataA, cudaIndicesB, cudaDataB, m_numElementsAlive);
+    dataPacket * temp = cudaDataA;
+    cudaDataA = cudaDataB;
+    cudaDataB = temp;
+
+    // update numrays
+    cudaMemcpy(&m_numElementsAlive, &cudaIndicesA[m_numElementsAlive-1], sizeof(int), cudaMemcpyDeviceToHost);
+    resetStreams<<<fullBlocksPerGridL, threadsPerBlockL>>>(cudaDataA, cudaIndicesA, m_numElementsAlive);
+}
+
+void DataStream::compactNaiveSumSharedSingleBlock(){
+
+    int threadsPerBlock = THREADS_PER_BLOCK;
+
+    dim3 threadsPerBlockL(threadsPerBlock);
+    dim3 fullBlocksPerGridL(int(ceil(float(m_numElementsAlive)/float(threadsPerBlock))));
+
+    naiveSumSharedSingleBlock<<<fullBlocksPerGridL, threadsPerBlockL, 2*m_numElements*sizeof(int)>>>(cudaIndicesA, cudaIndicesB, m_numElements);
+    checkCUDAError("kernel failed!");
+
+    cudaMemcpy(m_indices, cudaIndicesB, m_numElements*sizeof(int), cudaMemcpyDeviceToHost);
+}
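One caveat on the timings these methods print: kernel launches return before the GPU finishes, so bracketing them with `clock()` can under-measure unless every launch is followed by a synchronize. A CUDA event pair is the usual fix; a sketch of how any of the compaction paths above could be timed instead:

```cuda
// Times one compaction call with CUDA events instead of clock().
float timeCompactionMs(DataStream& ds){
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    ds.compactNaiveSumGlobal();        // any of the compact* variants
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);        // wait until the GPU work is done
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}
```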
+
+void DataStream::compactNaiveSumSharedArbitrary(){
+
+    ////////////////////////////////////////////////////////////////////////////////////////
+    int threadsPerBlock = THREADS_PER_BLOCK;
+
+    dim3 threadsPerBlockL(threadsPerBlock*2);
+    dim3 fullBlocksPerGridL(m_numElements/(threadsPerBlock*2));
+
+    naiveSumSharedArbitrary<<<fullBlocksPerGridL, threadsPerBlockL, 2*(threadsPerBlock*2)*sizeof(int)>>>(cudaIndicesA, cudaIndicesB, threadsPerBlock*2, cudaAuxSums);
+    checkCUDAError("kernel failed 1 !");
+    ////////////////////////////////////////////////////////////////////////////////////////
+
+    ////////////////////////////////////////////////////////////////////////////////////////
+    int sumSize = m_numElements/(THREADS_PER_BLOCK*2);
+    dim3 initialScanThreadsPerBlock2(threadsPerBlock);
+    dim3 initialScanBlocksPerGrid2(sumSize/threadsPerBlock+1);
+
+    dim3 threadsPerBlockOld(threadsPerBlock);
+    dim3 fullBlocksPerGridOld(int(ceil(float(sumSize)/float(threadsPerBlock))));
+
+    // cudaMemcpy(cudaAuxIncs, cudaAuxSums, m_numElements/(THREADS_PER_BLOCK*2)*sizeof(int), cudaMemcpyDeviceToDevice);
+
+    for (int d=1; d<=ceil(log(sumSize)/log(2)); d++){
+        sum<<<fullBlocksPerGridOld, threadsPerBlockOld>>>(cudaAuxSums, cudaAuxIncs, sumSize, powf(2.0f, d-1));
+        cudaThreadSynchronize();
+        int* temp = cudaAuxSums;
+        cudaAuxSums = cudaAuxIncs;
+        cudaAuxIncs = temp;
+    }
+    shift<<<fullBlocksPerGridOld, threadsPerBlockOld>>>(cudaAuxSums, cudaAuxIncs, m_numElements/(THREADS_PER_BLOCK*2));
+
+    addIncs<<<fullBlocksPerGridL, threadsPerBlockL>>>(cudaAuxIncs, cudaIndicesB, m_numElements);
+
+    shift<<<fullBlocksPerGridL, threadsPerBlockL>>>(cudaIndicesB, cudaIndicesA, m_numElementsAlive);
+    int * temp = cudaIndicesA;
+    cudaIndicesA = cudaIndicesB;
+    cudaIndicesB = temp;
+
+    dim3 threadsPerBlockLL(threadsPerBlock);
+    dim3 fullBlocksPerGridLL(m_numElements/threadsPerBlock);
+
+    clock_t t = clock();
+    //Stream compaction from A into B, then save back into A
+    streamCompaction<<<fullBlocksPerGridLL, threadsPerBlockLL>>>(cudaDataA, cudaIndicesB, cudaDataB, m_numElementsAlive);
+    dataPacket * tempDP = cudaDataA;
+    cudaDataA = cudaDataB;
+    cudaDataB = tempDP;
+    t = clock() - t;
+    cout<<(float)t / CLOCKS_PER_SEC<<endl;
+
+    resetStreams<<<fullBlocksPerGridLL, threadsPerBlockLL>>>(cudaDataA, cudaIndicesA, m_numElementsAlive);
+}
+
+bool DataStream::getData(int index, dataPacket& data){
+
+    if (index >= m_numElements) return false;
+
+    data = m_data[index];
+    return true;
+}
+
+int DataStream::numAlive(){
+    return m_numElementsAlive;
+}
+
+void DataStream::fetchDataFromGPU(){
+    cudaMemcpy(m_data, cudaDataA, m_numElementsAlive*sizeof(dataPacket), cudaMemcpyDeviceToHost);
+}
+
+void DataStream::kill(int index){
+    if (index >= m_numElementsAlive) return;
+
+    dim3 threadsPerBlockL(64);
+    dim3 fullBlocksPerGridL(int(ceil(float(m_numElementsAlive)/64.0f)));
+
+    killStream<<<fullBlocksPerGridL, threadsPerBlockL>>>(index, cudaDataA, cudaIndicesA, m_numElementsAlive);
+
+    cudaMemcpy(m_indices, cudaIndicesA, m_numElementsAlive*sizeof(int), cudaMemcpyDeviceToHost);
+}
\ No newline at end of file
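Putting it together, a typical round trip through the `DataStream` API (declared in the header below) looks like this sketch, modeled on the tests in `src/main.cpp`:

```cpp
// Kill random elements and compact until nothing is left alive.
dataPacket* data = new dataPacket[1024];
for (int i = 0; i < 1024; i++) data[i] = dataPacket(i);

DataStream ds(1024, data);
while (ds.numAlive() > 0){
    int toKill = rand() % ds.numAlive();
    ds.kill(toKill);                      // flag one element dead on the GPU
    ds.compactNaiveSumSharedArbitrary();  // scan the flags, scatter the survivors
    ds.fetchDataFromGPU();                // pull the compacted data back to host
}
delete [] data;
```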
diff --git a/src/streamCompaction.h b/src/streamCompaction.h
new file mode 100755
index 0000000..5585e70
--- /dev/null
+++ b/src/streamCompaction.h
@@ -0,0 +1,81 @@
+// CIS565 CUDA Raytracer: A parallel raytracer for Patrick Cozzi's CIS565: GPU Computing at the University of Pennsylvania
+// Written by Yining Karl Li, Copyright (c) 2012 University of Pennsylvania
+// This file includes code from:
+// Rob Farber for CUDA-GL interop, from CUDA Supercomputing For The Masses: http://www.drdobbs.com/architecture-and-design/cuda-supercomputing-for-the-masses-part/222600097
+// Peter Kutz and Yining Karl Li's GPU Pathtracer: http://gpupathtracer.blogspot.com/
+// Yining Karl Li's TAKUA Render, a massively parallel pathtracing renderer: http://www.yiningkarlli.com
+
+#ifndef STREAM_COMPACTION_H
+#define STREAM_COMPACTION_H
+
+// (the eight original include lines lost their header names in transcription;
+// these are the headers this project's code actually relies on)
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <cuda.h>
+#include <cuda_runtime.h>
+#include <cmath>
+#include <ctime>
+#include <iostream>
+
+#define THREADS_PER_BLOCK 256
+
+struct dataPacket{
+    int index;
+    bool alive;
+    dataPacket(){
+        index = -1;
+        alive = true;
+    }
+    dataPacket(int i){
+        index = i;
+        alive = true;
+    }
+};
+
+class DataStream{
+private:
+
+    dataPacket * m_data;
+
+    dataPacket * cudaDataA;
+    dataPacket * cudaDataB;
+
+    int * cudaIndicesA;
+    int * cudaIndicesB;
+
+    int * cudaAuxSums;
+    int * cudaAuxIncs;
+
+    void globalSum(int* in, int* out, int n);
+    void thrustStreamCompact();
+
+public:
+    int * m_indices;
+    int * m_auxSums;
+
+    int m_numElementsAlive, m_numElements;
+
+    DataStream(int numElements, dataPacket * data);
+    ~DataStream();
+
+    void serialScan();
+    void serialScatter();
+
+    void compactWorkEfficientArbitrary();
+    void compactNaiveSumGlobal();
+    void compactNaiveSumSharedSingleBlock();
+    void compactNaiveSumSharedArbitrary();
+    bool getData(int index, dataPacket& data);
+    int numAlive();
+    void kill(int index);
+    void fetchDataFromGPU();
+
+};
+
+void cudaVectorSum(int * indicesA, int * indicesB, int numElements, float k);
+
+void cudaInit (dataPacket * a, dataPacket * b, int * ia, int * ib);
+
+void testStreamCompaction();
+
+#endif
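For comparison with `thrustStreamCompact()` above, the same compaction can be done standalone with `thrust::copy_if`, which fuses the flag/scan/scatter pipeline into a single call. A sketch using `device_vector` (my names; the class itself passes raw host pointers):

```cpp
#include <thrust/device_vector.h>
#include <thrust/copy.h>

// Keep the nonzero elements of a device vector, shrinking it in place.
struct keepNonzero{
    __host__ __device__ bool operator()(const int x) const { return x != 0; }
};

int compactWithThrust(thrust::device_vector<int>& data){
    thrust::device_vector<int> out(data.size());
    thrust::device_vector<int>::iterator end =
        thrust::copy_if(data.begin(), data.end(), out.begin(), keepNonzero());
    out.resize(end - out.begin());   // trim to the survivors
    data.swap(out);
    return (int)data.size();
}
```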