diff --git a/README.md b/README.md index a82ea0f..2d14392 100644 --- a/README.md +++ b/README.md @@ -1,213 +1,118 @@ -CUDA Stream Compaction -====================== +# CUDA Stream Compaction **University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 2** -* (TODO) YOUR NAME HERE -* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab) +Terry Sun; Arch Linux, Intel i5-4670, GTX 750 -### (TODO: Your README) +## Library -Include analysis, etc. (Remember, this is public, so don't put -anything here that you don't want to share with the world.) +This project contains a `stream_compaction` library and some associated tests. -Instructions (delete me) -======================== +`CPU`: A CPU implementation of `scan` and `scatter`, for reference and +performance comparisons. Runs in O(n) / O(n) adds. -This is due Sunday, September 13 at midnight. - -**Summary:** In this project, you'll implement GPU stream compaction in CUDA, -from scratch. This algorithm is widely used, and will be important for -accelerating your path tracer project. - -Your stream compaction implementations in this project will simply remove `0`s -from an array of `int`s. In the path tracer, you will remove terminated paths -from an array of rays. - -In addition to being useful for your path tracer, this project is meant to -reorient your algorithmic thinking to the way of the GPU. On GPUs, many -algorithms can benefit from massive parallelism and, in particular, data -parallelism: executing the same code many times simultaneously with different -data. - -You'll implement a few different versions of the *Scan* (*Prefix Sum*) -algorithm. First, you'll implement a CPU version of the algorithm to reinforce -your understanding. Then, you'll write a few GPU implementations: "naive" and -"work-efficient." Finally, you'll use some of these to implement GPU stream -compaction. 
- -**Algorithm overview & details:** There are two primary references for details -on the implementation of scan and stream compaction. - -* The [slides on Parallel Algorithms](https://github.com/CIS565-Fall-2015/cis565-fall-2015.github.io/raw/master/lectures/2-Parallel-Algorithms.pptx) - for Scan, Stream Compaction, and Work-Efficient Parallel Scan. -* GPU Gems 3, Chapter 39 - [Parallel Prefix Sum (Scan) with CUDA](http://http.developer.nvidia.com/GPUGems3/gpugems3_ch39.html). - -Your GPU stream compaction implementation will live inside of the -`stream_compaction` subproject. This way, you will be able to easily copy it -over for use in your GPU path tracer. - - -## Part 0: The Usual - -This project (and all other CUDA projects in this course) requires an NVIDIA -graphics card with CUDA capability. Any card with Compute Capability 2.0 -(`sm_20`) or greater will work. Check your GPU on this -[compatibility table](https://developer.nvidia.com/cuda-gpus). -If you do not have a personal machine with these specs, you may use those -computers in the Moore 100B/C which have supported GPUs. - -**HOWEVER**: If you need to use the lab computer for your development, you will -not presently be able to do GPU performance profiling. This will be very -important for debugging performance bottlenecks in your program. - -### Useful existing code - -* `stream_compaction/common.h` - * `checkCUDAError` macro: checks for CUDA errors and exits if there were any. - * `ilog2ceil(x)`: computes the ceiling of log2(x), as an integer. -* `main.cpp` - * Some testing code for your implementations. - - -## Part 1: CPU Scan & Stream Compaction - -This stream compaction method will remove `0`s from an array of `int`s. - -In `stream_compaction/cpu.cu`, implement: - -* `StreamCompaction::CPU::scan`: compute an exclusive prefix sum. -* `StreamCompaction::CPU::compactWithoutScan`: stream compaction without using - the `scan` function. 
-* `StreamCompaction::CPU::compactWithScan`: stream compaction using the `scan` - function. Map the input array to an array of 0s and 1s, scan it, and use - scatter to produce the output. You will need a **CPU** scatter implementation - for this (see slides or GPU Gems chapter for an explanation). - -These implementations should only be a few lines long. - - -## Part 2: Naive GPU Scan Algorithm - -In `stream_compaction/naive.cu`, implement `StreamCompaction::Naive::scan` - -This uses the "Naive" algorithm from GPU Gems 3, Section 39.2.1. We haven't yet -taught shared memory, and you **shouldn't use it yet**. Example 39-1 uses -shared memory, but is limited to operating on very small arrays! Instead, write -this using global memory only. As a result of this, you will have to do -`ilog2ceil(n)` separate kernel invocations. - -Beware of errors in Example 39-1 in the book; both the pseudocode and the CUDA -code in the online version of Chapter 39 are known to have a few small errors -(in superscripting, missing braces, bad indentation, etc.) - -Since the parallel scan algorithm operates on a binary tree structure, it works -best with arrays with power-of-two length. Make sure your implementation works -on non-power-of-two sized arrays (see `ilog2ceil`). This requires extra memory -- your intermediate array sizes will need to be rounded to the next power of -two. - - -## Part 3: Work-Efficient GPU Scan & Stream Compaction - -### 3.1. Scan - -In `stream_compaction/efficient.cu`, implement -`StreamCompaction::Efficient::scan` - -All of the text in Part 2 applies. - -* This uses the "Work-Efficient" algorithm from GPU Gems 3, Section 39.2.2. -* Beware of errors in Example 39-2. -* Test non-power-of-two sized arrays. - -### 3.2. Stream Compaction - -This stream compaction method will remove `0`s from an array of `int`s. 
- -In `stream_compaction/efficient.cu`, implement -`StreamCompaction::Efficient::compact` - -For compaction, you will also need to implement the scatter algorithm presented -in the slides and the GPU Gems chapter. - -In `stream_compaction/common.cu`, implement these for use in `compact`: - -* `StreamCompaction::Common::kernMapToBoolean` -* `StreamCompaction::Common::kernScatter` - - -## Part 4: Using Thrust's Implementation - -In `stream_compaction/thrust.cu`, implement: - -* `StreamCompaction::Thrust::scan` - -This should be a very short function which wraps a call to the Thrust library -function `thrust::exclusive_scan(first, last, result)`. - -To measure timing, be sure to exclude memory operations by passing -`exclusive_scan` a `thrust::device_vector` (which is already allocated on the -GPU). You can create a `thrust::device_vector` by creating a -`thrust::host_vector` from the given pointer, then casting it. - - -## Part 5: Radix Sort (Extra Credit) (+10) - -Add an additional module to the `stream_compaction` subproject. Implement radix -sort using one of your scan implementations. Add tests to check its correctness. - - -## Write-up - -1. Update all of the TODOs at the top of this README. -2. Add a description of this project including a list of its features. -3. Add your performance analysis (see below). - -All extra credit features must be documented in your README, explaining its -value (with performance comparison, if applicable!) and showing an example how -it works. For radix sort, show how it is called and an example of its output. - -Always profile with Release mode builds and run without debugging. - -### Questions - -* Roughly optimize the block sizes of each of your implementations for minimal - run time on your GPU. - * (You shouldn't compare unoptimized implementations to each other!) - -* Compare all of these GPU Scan implementations (Naive, Work-Efficient, and - Thrust) to the serial CPU version of Scan. 
Plot a graph of the comparison
-  (with array size on the independent axis).
-  * You should use CUDA events for timing. Be sure **not** to include any
-    explicit memory operations in your performance measurements, for
-    comparability.
-  * To guess at what might be happening inside the Thrust implementation, take
-    a look at the Nsight timeline for its execution.
-
-* Write a brief explanation of the phenomena you see here.
-  * Can you find the performance bottlenecks? Is it memory I/O? Computation? Is
-    it different for each implementation?
-
-* Paste the output of the test program into a triple-backtick block in your
-  README.
-  * If you add your own tests (e.g. for radix sort or to test additional corner
-    cases), be sure to mention it explicitly.
-
-These questions should help guide you in performance analysis on future
-assignments, as well.
-
-## Submit
-
-If you have modified any of the `CMakeLists.txt` files at all (aside from the
-list of `SOURCE_FILES`), you must test that your project can build in Moore
-100B/C. Beware of any build issues discussed on the Google Group.
-
-1. Open a GitHub pull request so that we can see that you have finished.
-   The title should be "Submission: YOUR NAME".
-2. Send an email to the TA (gmail: kainino1+cis565@) with:
-   * **Subject**: in the form of `[CIS565] Project 2: PENNKEY`
-   * Direct link to your pull request on GitHub
-   * In the form of a grade (0-100+) with comments, evaluate your own
-     performance on the project.
-   * Feedback on the project itself, if any.
+`Naive`: A naive (non-work-efficient) implementation of `scan`, performing
+O(n log n) adds over log n kernel iterations.
+
+`Efficient`: A work-efficient implementation of `scan` and `compact`. Also
+contains `dv_scan`, the actual in-place scan implementation, which takes a
+device memory pointer directly (useful for other CUDA functions that need scan,
+bypassing the host-memory pointers that `Efficient::scan`
+would take).
Performs O(n) adds over 2 log n kernel iterations.
+
+`Common`:
+* `kernMapToBoolean`, used as the filtering predicate in `Efficient::compact`.
+* `kernScatter`, used in `Efficient::compact` and `Radix::sort`.
+
+`Radix`: `sort` is so close to working but... doesn't work :(
+
+## Performance Analysis
+
+I did performance analysis with `cudaEvent_t`s for the GPU algorithm
+implementations and `std::chrono` for the CPU implementations. As before, code
+for this can be found on the `performance` branch (to avoid cluttering the main
+codebase). Raw data (csv format) can be found in `data/`.
+
+### Some fun charts
+
+Measuring the performance of scan with a block size of 128 (where applicable).
+
+![](data/scan_perf_zoomed_out.png)
+
+I cut the top of the CPU line off and my chart is still horribly skewed. Let's
+try again:
+
+![](data/scan_perf_zoomed_in.png)
+
+Interestingly, the sharp(ish) increase in `thrust::scan` around N = 2^14 is
+consistent between runs. Maybe it has to do with an increase in memory
+allocation around that size.
+
+`Naive` performs about twice as well as `Efficient`, which makes sense: the
+work-efficient scan launches twice as many kernel iterations. I suspect a
+smarter method of spawning threads (creating only as many as are needed instead
+of creating 2^N every time and using only a subset) would improve the efficient
+algorithm, as more threads would then execute the exact same instruction
+sequence. That gain should be larger for `Efficient` than for `Naive`, since
+`Efficient` runs more iterations but needs fewer active threads in each one --
+which is exactly why a work-efficient algorithm is preferable. (I was planning
+on testing this but -- as you can see -- I ran out of time.)
+
+There is a small device-to-host memory copy in
+`Efficient::scan` -- I don't know if it has an appreciable impact, since it only
+needs to copy `sizeof(int)`.
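For reference, the up-sweep/down-sweep structure that the work-efficient scan parallelizes can be sketched sequentially (plain C++ rather than CUDA; power-of-two length assumed, and the function name here is illustrative, not part of the library):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sequential sketch of the work-efficient (Blelloch) exclusive scan:
// an up-sweep (reduce) phase followed by a down-sweep phase, for
// 2 * log2(n) passes total. n must be a power of two here.
std::vector<int> blellochScan(std::vector<int> data) {
    const std::size_t n = data.size();
    assert(n > 0 && (n & (n - 1)) == 0);  // power-of-two length
    // Up-sweep: build partial sums in place.
    for (std::size_t stride = 1; stride < n; stride *= 2) {
        for (std::size_t k = 0; k < n; k += 2 * stride) {
            data[k + 2 * stride - 1] += data[k + stride - 1];
        }
    }
    // Down-sweep: zero the root, then push prefix sums back down.
    data[n - 1] = 0;
    for (std::size_t stride = n / 2; stride >= 1; stride /= 2) {
        for (std::size_t k = 0; k < n; k += 2 * stride) {
            int t = data[k + stride - 1];
            data[k + stride - 1] = data[k + 2 * stride - 1];
            data[k + 2 * stride - 1] += t;
        }
    }
    return data;
}
```

Each `stride` level of the two loops corresponds to one kernel launch on the GPU, which is where the 2 log n iterations come from.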
`Efficient::compact` has even more memory copying +to retrieve the size of the compacted stream. + +![](data/gpu_by_block_size.png) + +Tested on an array size of 2^16. `Naive::scan` and `Efficient::scan` are both +roughly optimal at a block size of 128. + +The performance of `Efficient::compact` is dominated by `Efficient::scan`. The +only other computation happening in `compact` is `kernMapToBoolean` and +`kernScatter`, both of which are constant (in fact, 1 operation per thread), and +memory copying (see above). + +Compact performance goes much the same way, to nobody's surprise: + +![](data/compact_by_array_size.png) + + +## Test output + +``` +**************** +** SCAN TESTS ** +**************** + [ 33 36 27 15 43 35 36 42 49 21 12 27 40 ... 28 0 ] +==== cpu scan, power-of-two ==== + [ 0 33 69 96 111 154 189 225 267 316 337 349 376 ... 6371 6399 ] +==== cpu scan, non-power-of-two ==== + [ 0 33 69 96 111 154 189 225 267 316 337 349 376 ... 6329 6330 ] + passed +==== naive scan, power-of-two ==== + [ 0 33 69 96 111 154 189 225 267 316 337 349 376 ... 6371 6399 ] + passed +==== naive scan, non-power-of-two ==== + passed +==== work-efficient scan, power-of-two ==== + [ 0 33 69 96 111 154 189 225 267 316 337 349 376 ... 6371 6399 ] + passed +==== work-efficient scan, non-power-of-two ==== + [ 0 33 69 96 111 154 189 225 267 316 337 349 376 ... 6329 6330 ] + passed +==== thrust scan, power-of-two ==== + passed +==== thrust scan, non-power-of-two ==== + passed + +***************************** +** STREAM COMPACTION TESTS ** +***************************** + [ 1 0 1 1 1 1 0 0 1 1 0 1 0 ... 
0 0 ] +==== work-efficient compact, power-of-two ==== + passed +==== work-efficient compact, non-power-of-two ==== + passed +``` diff --git a/data/compact_by_array_size.png b/data/compact_by_array_size.png new file mode 100644 index 0000000..d426344 Binary files /dev/null and b/data/compact_by_array_size.png differ diff --git a/data/cpu_by_arr_size.csv b/data/cpu_by_arr_size.csv new file mode 100644 index 0000000..ee8a805 --- /dev/null +++ b/data/cpu_by_arr_size.csv @@ -0,0 +1,21 @@ +size, scan, compactWithoutScan, compactWithScan +0, 0.018390, 0.017990, 0.062040 +1, 0.036900, 0.036660, 0.125410 +2, 0.056000, 0.056050, 0.191390 +3, 0.075810, 0.077990, 0.260930 +4, 0.098610, 0.103510, 0.342300 +5, 0.130480, 0.136290, 0.460860 +6, 0.176600, 0.193570, 0.614170 +7, 0.249980, 0.288220, 0.837180 +8, 0.379200, 0.451990, 1.180500 +9, 0.595360, 0.787670, 1.739130 +10, 1.010410, 1.459940, 2.873860 +11, 1.808420, 2.799750, 4.983760 +12, 3.382930, 5.454100, 9.189020 +13, 6.617830, 10.682090, 17.652109 +14, 13.106260, 20.930850, 34.538570 +15, 26.239680, 41.440238, 95.065141 +16, 53.197820, 82.259148, 232.273063 +17, 106.942953, 163.591766, 524.566375 +18, 214.327141, 326.009719, 1154.182750 +19, 444.831375, 655.437125, 2664.925000 diff --git a/data/gpu_by_array_size.csv b/data/gpu_by_array_size.csv new file mode 100644 index 0000000..a9dd910 --- /dev/null +++ b/data/gpu_by_array_size.csv @@ -0,0 +1,17 @@ +block size, naive::scan, efficient::scan, efficient::compact, thrust::scan +4, 0.019901, 0.032131, 0.042268, 0.014861 +5, 0.022966, 0.039091, 0.048816, 0.013974 +6, 0.026545, 0.046097, 0.055637, 0.013844 +7, 0.029734, 0.052625, 0.062891, 0.014107 +8, 0.032616, 0.059878, 0.069612, 0.013777 +9, 0.035588, 0.066927, 0.077829, 0.014007 +10, 0.039123, 0.074742, 0.085684, 0.013909 +11, 0.042336, 0.084190, 0.094274, 0.017210 +12, 0.049155, 0.093326, 0.103361, 0.018924 +13, 0.051340, 0.112999, 0.124146, 0.025679 +14, 0.066044, 0.153436, 0.166579, 0.040478 +15, 0.092430, 0.233036, 
0.248662, 0.040362 +16, 0.145964, 0.397543, 0.418071, 0.049354 +17, 0.249689, 0.733529, 0.763885, 0.072880 +18, 0.521292, 1.435356, 1.482930, 0.095224 +19, 1.806152, 3.125029, 3.233285, 0.178910 diff --git a/data/gpu_by_block_size.csv b/data/gpu_by_block_size.csv new file mode 100644 index 0000000..3bf25a8 --- /dev/null +++ b/data/gpu_by_block_size.csv @@ -0,0 +1,10 @@ +block size, naive::scan, efficient::scan, efficient::compact, thrust::scan +2, 1.705963, 3.371424, 3.477096, 0.152979 +3, 0.869121, 1.678385, 1.735070, 0.153120 +4, 0.475849, 0.892212, 0.926461, 0.156361 +5, 0.268811, 0.502830, 0.531393, 0.159694 +6, 0.156100, 0.404411, 0.432535, 0.163947 +7, 0.144814, 0.393823, 0.414350, 0.167775 +8, 0.146012, 0.393572, 0.414207, 0.172095 +9, 0.148004, 0.393960, 0.411924, 0.175363 +10, 0.163155, 0.406132, 0.423445, 0.179432 diff --git a/data/gpu_by_block_size.png b/data/gpu_by_block_size.png new file mode 100644 index 0000000..03565ab Binary files /dev/null and b/data/gpu_by_block_size.png differ diff --git a/data/scan_perf_zoomed_in.png b/data/scan_perf_zoomed_in.png new file mode 100644 index 0000000..d9f63ac Binary files /dev/null and b/data/scan_perf_zoomed_in.png differ diff --git a/data/scan_perf_zoomed_out.png b/data/scan_perf_zoomed_out.png new file mode 100644 index 0000000..5214b17 Binary files /dev/null and b/data/scan_perf_zoomed_out.png differ diff --git a/src/main.cpp b/src/main.cpp index 675da35..90ef5cf 100644 --- a/src/main.cpp +++ b/src/main.cpp @@ -6,20 +6,148 @@ * @copyright University of Pennsylvania */ +#include #include + #include #include #include #include +#include #include "testing_helpers.hpp" +#define BENCHMARK + +void benchmarkCPU() { + const int iterations = 100; + int totalScan = 0; + int totalCompactWithout = 0; + int totalCompactWith = 0; + printf("size, scan, compactWithoutScan, compactWithScan\n"); + for (int s = 4; s < 20; s++) { + int SIZE = 1 << s; + int a[SIZE]; + int b[SIZE]; + + for (int i = 0; i < iterations; i++) { + 
genArray(SIZE - 1, a, 50);  // Leave a 0 at the end to test that edge case
+            a[SIZE - 1] = 0;
+
+            auto begin = std::chrono::high_resolution_clock::now();
+            StreamCompaction::CPU::scan(SIZE, b, a);
+            auto end = std::chrono::high_resolution_clock::now();
+            int diff = std::chrono::duration_cast<std::chrono::microseconds>(end-begin).count();
+            totalScan += diff;
+
+            begin = std::chrono::high_resolution_clock::now();
+            StreamCompaction::CPU::compactWithoutScan(SIZE, b, a);
+            end = std::chrono::high_resolution_clock::now();
+            diff = std::chrono::duration_cast<std::chrono::microseconds>(end-begin).count();
+            totalCompactWithout += diff;
+
+            begin = std::chrono::high_resolution_clock::now();
+            StreamCompaction::CPU::compactWithScan(SIZE, b, a);
+            end = std::chrono::high_resolution_clock::now();
+            diff = std::chrono::duration_cast<std::chrono::microseconds>(end-begin).count();
+            totalCompactWith += diff;
+        }
+        printf("%d, %f, %f, %f\n", s,
+                (float)totalScan / iterations / 1000.0,
+                (float)totalCompactWithout / iterations / 1000.0,
+                (float)totalCompactWith / iterations / 1000.0
+              );
+    }
+}
+
+void benchmarkGPUBlockSize() {
+    const int iterations = 100;
+    printf("block size, naive::scan, efficient::scan, efficient::compact\n");
+    int SIZE = 1 << 16;
+    int a[SIZE];
+    int b[SIZE];
+    for (int block = 2; block < 11; block++) {
+        int blockSize = 1 << block;
+
+        float totalNaive = 0;
+        float totalEfficientScan = 0;
+        float totalEfficientCompact = 0;
+        for (int i = 0; i < iterations; i++) {
+            genArray(SIZE - 1, a, 50);
+            a[SIZE - 1] = 0;
+            zeroArray(SIZE, b);
+
+            float timeElapsed;
+
+            StreamCompaction::Naive::scan(SIZE, b, a, &timeElapsed, blockSize);
+            totalNaive += timeElapsed;
+
+            StreamCompaction::Efficient::scan(SIZE, b, a, &timeElapsed, blockSize);
+            totalEfficientScan += timeElapsed;
+
+            StreamCompaction::Efficient::compact(SIZE, b, a, &timeElapsed, blockSize);
+            totalEfficientCompact += timeElapsed;
+        }
+        printf("%d, %f, %f, %f\n", block,
+                totalNaive / iterations,
+                totalEfficientScan / iterations,
+                totalEfficientCompact / iterations
+              );
+    }
+}
+
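(Aside: the three timing blocks in `benchmarkCPU` repeat the same `std::chrono` pattern; it could be factored into a small helper. The sketch below is illustrative only -- the helper name `timeMicros` is not part of the project's code.)

```cpp
#include <chrono>
#include <utility>

// Sketch of the std::chrono timing pattern used in benchmarkCPU: run a
// callable once and return the elapsed wall-clock time in microseconds.
// The name `timeMicros` is hypothetical.
template <typename F>
long long timeMicros(F&& fn) {
    auto begin = std::chrono::high_resolution_clock::now();
    std::forward<F>(fn)();  // code under test
    auto end = std::chrono::high_resolution_clock::now();
    return std::chrono::duration_cast<std::chrono::microseconds>(end - begin)
            .count();
}
```

Usage would then be, e.g., `totalScan += timeMicros([&] { StreamCompaction::CPU::scan(SIZE, b, a); });`.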
+void benchmarkGPUArraySize() { + const int iterations = 100; + printf("block size, naive::scan, efficient::scan, efficient::compact, thrust::scan\n"); + for (int s = 4; s < 20; s++) { + int SIZE = 1 << s; + int a[SIZE]; + int b[SIZE]; + + int blockSize = 1 << 7; + + float totalNaive = 0; + float totalEfficientScan = 0; + float totalEfficientCompact = 0; + float totalThrust = 0; + for (int i = 0; i < iterations; i++) { + genArray(SIZE - 1, a, 50); + a[SIZE - 1] = 0; + zeroArray(SIZE, b); + + float timeElapsed; + + StreamCompaction::Naive::scan(SIZE, b, a, &timeElapsed, blockSize); + totalNaive += timeElapsed; + + StreamCompaction::Efficient::scan(SIZE, b, a, &timeElapsed, blockSize); + totalEfficientScan += timeElapsed; + + StreamCompaction::Efficient::compact(SIZE, b, a, &timeElapsed, blockSize); + totalEfficientCompact += timeElapsed; + + StreamCompaction::Thrust::scan(SIZE, b, a, &timeElapsed); + totalThrust += timeElapsed; + } + printf("%d, %f, %f, %f, %f\n", s, + totalNaive / iterations, + totalEfficientScan / iterations, + totalEfficientCompact / iterations, + totalThrust / iterations + ); + } +} + int main(int argc, char* argv[]) { +#ifdef BENCHMARK + benchmarkCPU(); + benchmarkGPUBlockSize(); + benchmarkGPUArraySize(); +#else const int SIZE = 1 << 8; - const int NPOT = SIZE - 3; + //const int SIZE = 4; + const int NPOT_SIZE = SIZE - 3; int a[SIZE], b[SIZE], c[SIZE]; - // Scan tests - printf("\n"); printf("****************\n"); printf("** SCAN TESTS **\n"); @@ -36,33 +164,33 @@ int main(int argc, char* argv[]) { zeroArray(SIZE, c); printDesc("cpu scan, non-power-of-two"); - StreamCompaction::CPU::scan(NPOT, c, a); - printArray(NPOT, b, true); - printCmpResult(NPOT, b, c); + StreamCompaction::CPU::scan(NPOT_SIZE, c, a); + printArray(NPOT_SIZE, b, true); + printCmpResult(NPOT_SIZE, b, c); zeroArray(SIZE, c); printDesc("naive scan, power-of-two"); StreamCompaction::Naive::scan(SIZE, c, a); - //printArray(SIZE, c, true); + printArray(SIZE, c, true); 
printCmpResult(SIZE, b, c); zeroArray(SIZE, c); printDesc("naive scan, non-power-of-two"); - StreamCompaction::Naive::scan(NPOT, c, a); + StreamCompaction::Naive::scan(NPOT_SIZE, c, a); //printArray(SIZE, c, true); - printCmpResult(NPOT, b, c); + printCmpResult(NPOT_SIZE, b, c); zeroArray(SIZE, c); printDesc("work-efficient scan, power-of-two"); StreamCompaction::Efficient::scan(SIZE, c, a); - //printArray(SIZE, c, true); + printArray(SIZE, c, true); printCmpResult(SIZE, b, c); zeroArray(SIZE, c); printDesc("work-efficient scan, non-power-of-two"); - StreamCompaction::Efficient::scan(NPOT, c, a); - //printArray(NPOT, c, true); - printCmpResult(NPOT, b, c); + StreamCompaction::Efficient::scan(NPOT_SIZE, c, a); + printArray(NPOT_SIZE, c, true); + printCmpResult(NPOT_SIZE, b, c); zeroArray(SIZE, c); printDesc("thrust scan, power-of-two"); @@ -72,42 +200,40 @@ int main(int argc, char* argv[]) { zeroArray(SIZE, c); printDesc("thrust scan, non-power-of-two"); - StreamCompaction::Thrust::scan(NPOT, c, a); - //printArray(NPOT, c, true); - printCmpResult(NPOT, b, c); + StreamCompaction::Thrust::scan(NPOT_SIZE, c, a); + //printArray(NPOT_SIZE, c, true); + printCmpResult(NPOT_SIZE, b, c); printf("\n"); printf("*****************************\n"); printf("** STREAM COMPACTION TESTS **\n"); printf("*****************************\n"); - // Compaction tests - - genArray(SIZE - 1, a, 4); // Leave a 0 at the end to test that edge case + genArray(SIZE - 1, a, 2); // Leave a 0 at the end to test that edge case a[SIZE - 1] = 0; printArray(SIZE, a, true); - int count, expectedCount, expectedNPOT; + int count, expectedCount, expectedNPOT_SIZE; zeroArray(SIZE, b); - printDesc("cpu compact without scan, power-of-two"); +// printDesc("cpu compact without scan, power-of-two"); count = StreamCompaction::CPU::compactWithoutScan(SIZE, b, a); expectedCount = count; - printArray(count, b, true); - printCmpLenResult(count, expectedCount, b, b); +// printArray(count, b, true); +// 
printCmpLenResult(count, expectedCount, b, b); zeroArray(SIZE, c); - printDesc("cpu compact without scan, non-power-of-two"); - count = StreamCompaction::CPU::compactWithoutScan(NPOT, c, a); - expectedNPOT = count; - printArray(count, c, true); - printCmpLenResult(count, expectedNPOT, b, c); +// printDesc("cpu compact without scan, non-power-of-two"); + count = StreamCompaction::CPU::compactWithoutScan(NPOT_SIZE, c, a); + expectedNPOT_SIZE = count; +// printArray(count, c, true); +// printCmpLenResult(count, expectedNPOT_SIZE, b, c); zeroArray(SIZE, c); - printDesc("cpu compact with scan"); +// printDesc("cpu compact with scan"); count = StreamCompaction::CPU::compactWithScan(SIZE, c, a); - printArray(count, c, true); - printCmpLenResult(count, expectedCount, b, c); +// printArray(count, c, true); +// printCmpLenResult(count, expectedCount, b, c); zeroArray(SIZE, c); printDesc("work-efficient compact, power-of-two"); @@ -117,7 +243,28 @@ int main(int argc, char* argv[]) { zeroArray(SIZE, c); printDesc("work-efficient compact, non-power-of-two"); - count = StreamCompaction::Efficient::compact(NPOT, c, a); + count = StreamCompaction::Efficient::compact(NPOT_SIZE, c, a); //printArray(count, c, true); - printCmpLenResult(count, expectedNPOT, b, c); + printCmpLenResult(count, expectedNPOT_SIZE, b, c); + + printf("\n"); + printf("**********************\n"); + printf("** RADIX SORT TESTS **\n"); + printf("**********************\n"); + + genArray(SIZE - 1, a, 3); // Leave a 0 at the end to test that edge case + a[0] = 0; + a[1] = 2; + a[2] = 3; + a[3] = 1; + //a = { 0, 1, 2, 3 }; + //a[SIZE - 1] = 0; + printArray(SIZE, a, true); + + zeroArray(SIZE, c); + printDesc("radix sort, power-of-two"); + StreamCompaction::Radix::sort(SIZE, c, a); + printArray(SIZE, c, true); + printArrayOrderResult(SIZE, c); +#endif } diff --git a/src/testing_helpers.hpp b/src/testing_helpers.hpp index f6b572f..4c93526 100644 --- a/src/testing_helpers.hpp +++ b/src/testing_helpers.hpp @@ -33,6 
+33,23 @@ void printCmpLenResult(int n, int expN, T *a, T *b) {
             cmpArrays(n, a, b) ? "FAIL VALUE" : "passed");
 }
 
+template <typename T>
+int testArrayOrder(int n, T *a) {
+    for (int i = 0; i < n-1; i++) {
+        if (a[i] < a[i+1]) {
+            printf("    (a[%d] = %d) < (a[%d] = %d)\n", i, a[i], i+1, a[i+1]);
+            return 1;
+        }
+    }
+    return 0;
+}
+
+template <typename T>
+void printArrayOrderResult(int n, T *a) {
+    printf("    %s \n",
+            testArrayOrder(n, a) ? "FAIL VALUE" : "passed");
+}
+
 void zeroArray(int n, int *a) {
     for (int i = 0; i < n; i++) {
         a[i] = 0;
diff --git a/stream_compaction/CMakeLists.txt b/stream_compaction/CMakeLists.txt
index cdbef77..bcc484e 100644
--- a/stream_compaction/CMakeLists.txt
+++ b/stream_compaction/CMakeLists.txt
@@ -9,6 +9,8 @@ set(SOURCE_FILES
     "efficient.cu"
     "thrust.h"
     "thrust.cu"
+    "radix.h"
+    "radix.cu"
     )
 
 cuda_add_library(stream_compaction
diff --git a/stream_compaction/common.cu b/stream_compaction/common.cu
index fe872d4..cf9b41f 100644
--- a/stream_compaction/common.cu
+++ b/stream_compaction/common.cu
@@ -23,16 +23,22 @@ namespace Common {
  * which map to 0 will be removed, and elements which map to 1 will be kept.
  */
 __global__ void kernMapToBoolean(int n, int *bools, const int *idata) {
-    // TODO
+    int k = (blockDim.x * blockIdx.x) + threadIdx.x;
+    if (k >= n) { return; }
+
+    bools[k] = (idata[k] != 0) ? 1 : 0;
 }
 
-/**
- * Performs scatter on an array. That is, for each element in idata,
- * if bools[idx] == 1, it copies idata[idx] to odata[indices[idx]].
- */
-__global__ void kernScatter(int n, int *odata,
-        const int *idata, const int *bools, const int *indices) {
-    // TODO
+__global__ void kernScatter(int n, int *odata, int *indices, int *idata) {
+    int k = (blockDim.x * blockIdx.x) + threadIdx.x;
+    if (k >= n) { return; }
+    if (k == n-1) {
+        // always take the last element
+        // `compact` will adjust size appropriately
+        odata[indices[k]] = idata[k];
+    } else if (indices[k] != indices[k+1]) {
+        odata[indices[k]] = idata[k];
+    }
 }
 }
diff --git a/stream_compaction/common.h b/stream_compaction/common.h
index 4f52663..73fcccd 100644
--- a/stream_compaction/common.h
+++ b/stream_compaction/common.h
@@ -29,7 +29,6 @@ namespace StreamCompaction {
 namespace Common {
     __global__ void kernMapToBoolean(int n, int *bools, const int *idata);
 
-    __global__ void kernScatter(int n, int *odata,
-            const int *idata, const int *bools, const int *indices);
+    __global__ void kernScatter(int n, int *odata, int *indices, int *idata);
 }
 }
diff --git a/stream_compaction/cpu.cu b/stream_compaction/cpu.cu
index e600c29..99c7880 100644
--- a/stream_compaction/cpu.cu
+++ b/stream_compaction/cpu.cu
@@ -8,8 +8,11 @@ namespace CPU {
  * CPU scan (prefix sum).
  */
 void scan(int n, int *odata, const int *idata) {
-    // TODO
-    printf("TODO\n");
+    int total = 0;
+    for (int i = 0; i < n; i++) {
+        odata[i] = total;
+        total += idata[i];
+    }
 }
 
 /**
@@ -18,8 +21,25 @@ void scan(int n, int *odata, const int *idata) {
  * @returns the number of elements remaining after compaction.
  */
 int compactWithoutScan(int n, int *odata, const int *idata) {
-    // TODO
-    return -1;
+    int count = 0;
+    for (int i = 0; i < n; i++) {
+        if (idata[i] != 0) {
+            odata[count++] = idata[i];
+        }
+    }
+    return count;
+}
+
+/**
+ * CPU scatter algorithm.
+ *
+ * @returns the number of elements remaining.
+ */
+int scatter(int n, int *odata, const int *indices, const int *input) {
+    for (int i = 0; i < n; i++) {
+        if (input[i] != 0) { odata[indices[i]] = input[i]; }
+    }
+    // `indices` is an exclusive scan, so count the last element if it is kept.
+    return indices[n-1] + (input[n-1] != 0 ? 1 : 0);
+}
 
 /**
@@ -28,8 +48,20 @@ int compactWithoutScan(int n, int *odata, const int *idata) {
  * @returns the number of elements remaining after compaction.
  */
 int compactWithScan(int n, int *odata, const int *idata) {
-    // TODO
-    return -1;
+    int *predicate_data = (int *)malloc(n * sizeof(int));
+    for (int i = 0; i < n; i++) {
+        predicate_data[i] = idata[i] == 0 ? 0 : 1;
+    }
+
+    int *scan_data = (int *)malloc(n * sizeof(int));
+    scan(n, scan_data, predicate_data);
+
+    int count = scatter(n, odata, scan_data, idata);
+
+    free(predicate_data);
+    free(scan_data);
+
+    return count;
 }
 
 }
diff --git a/stream_compaction/efficient.cu b/stream_compaction/efficient.cu
index b2f739b..2fb27f5 100644
--- a/stream_compaction/efficient.cu
+++ b/stream_compaction/efficient.cu
@@ -6,14 +6,91 @@
 namespace StreamCompaction {
 namespace Efficient {
 
-// TODO: __global__
+int BLOCK_SIZE = (2 << 7);
+
+__global__ void kUpSweep(int n, int d, int *data) {
+    int k = (blockDim.x * blockIdx.x) + threadIdx.x;
+    if (k >= n) { return; }
+    int exp_d = (int)exp2f(d);
+    int exp_d1 = (int)exp2f(d+1);
+    if (k % exp_d1 == 0) {
+        data[k + exp_d1 - 1] += data[k + exp_d - 1];
+    }
+}
+
+__global__ void kDownSweep(int n, int d, int *data) {
+    int k = (blockDim.x * blockIdx.x) + threadIdx.x;
+    if (k >= n) { return; }
+    if (k % (int)exp2f(d+1) == 0) {
+        int left = k + (int)exp2f(d) - 1;
+        int right = k + (int)exp2f(d+1) - 1;
+        int t = data[left];
+        data[left] = data[right];
+        data[right] += t;
+    }
+}
+
+/*
+ * In-place scan on `dev_idata`, which must be a device memory pointer.
+ */
+void dv_scan(int n, int *dev_idata) {
+    int numBlocks = (n + BLOCK_SIZE - 1) / BLOCK_SIZE;
+
+    for (int d = 0; d < ilog2ceil(n)-1; d++) {
+        kUpSweep<<<numBlocks, BLOCK_SIZE>>>(n, d, dev_idata);
+        checkCUDAError("scan");
+    }
+
+    int z = 0;
+    cudaMemcpy(&dev_idata[n-1], &z, sizeof(int), cudaMemcpyHostToDevice);
+
+    for (int d = ilog2ceil(n)-1; d >= 0; d--) {
+        kDownSweep<<<numBlocks, BLOCK_SIZE>>>(n, d, dev_idata);
+        checkCUDAError("scan");
+    }
+}
 
 /**
  * Performs prefix-sum (aka scan) on idata, storing the result into odata.
  */
-void scan(int n, int *odata, const int *idata) {
-    // TODO
-    printf("TODO\n");
+void scan(int size, int *odata, const int *input, float *time, int blockSize) {
+    BLOCK_SIZE = blockSize;
+    int *idata;
+    int n;
+
+    if ((size & (size-1)) != 0) { // if size is not a power of 2
+        n = (int)exp2f(ilog2ceil(size));
+        idata = (int*)malloc(n * sizeof(int));
+        memcpy(idata, input, size * sizeof(int));
+        for (int j = size; j < n; j++) {
+            idata[j] = 0;
+        }
+    } else {
+        n = size;
+        idata = (int*)malloc(n * sizeof(int));
+        memcpy(idata, input, n * sizeof(int));
+    }
+
+    int array_size = n * sizeof(int);
+    int *dv_idata;
+
+    cudaMalloc((void**) &dv_idata, array_size);
+    cudaMemcpy(dv_idata, idata, array_size, cudaMemcpyHostToDevice);
+
+    cudaEvent_t begin, end;
+    cudaEventCreate(&begin);
+    cudaEventCreate(&end);
+    cudaEventRecord(begin, 0);
+
+    dv_scan(n, dv_idata);
+
+    cudaEventRecord(end, 0);
+    cudaEventSynchronize(end);
+    cudaEventElapsedTime(time, begin, end);
+    cudaEventDestroy(begin);
+    cudaEventDestroy(end);
+
+    cudaMemcpy(odata, dv_idata, size * sizeof(int), cudaMemcpyDeviceToHost);
+    cudaFree(dv_idata);
+    free(idata);
 }
 
 /**
@@ -25,9 +102,68 @@ void scan(int n, int *odata, const int *idata) {
  * @param idata The array of elements to compact.
  * @returns The number of elements remaining after compaction.
 */
-int compact(int n, int *odata, const int *idata) {
-    // TODO
-    return -1;
+int compact(int size, int *odata, const int *input, float *time, int blockSize) {
+    BLOCK_SIZE = blockSize;
+
+    int *idata;
+    int n;
+
+    if ((size & (size-1)) != 0) { // if size is not a power of 2
+        n = (int)exp2f(ilog2ceil(size));
+        idata = (int*)malloc(n * sizeof(int));
+        memcpy(idata, input, size * sizeof(int));
+        for (int j = size; j < n; j++) {
+            idata[j] = 0;
+        }
+    } else {
+        n = size;
+        idata = (int*)malloc(n * sizeof(int));
+        memcpy(idata, input, n * sizeof(int));
+    }
+
+    int *dev_indices;
+    int *dev_odata;
+    int *dev_idata;
+    int array_size = n * sizeof(int);
+    int numBlocks = (n + BLOCK_SIZE - 1) / BLOCK_SIZE;
+
+    cudaMalloc((void**) &dev_indices, array_size);
+    cudaMalloc((void**) &dev_odata, array_size);
+
+    cudaMalloc((void**) &dev_idata, array_size);
+    cudaMemcpy(dev_idata, idata, array_size, cudaMemcpyHostToDevice);
+
+    cudaEvent_t begin, end;
+    cudaEventCreate(&begin);
+    cudaEventCreate(&end);
+    cudaEventRecord(begin, 0);
+
+    StreamCompaction::Common::kernMapToBoolean<<<numBlocks, BLOCK_SIZE>>>(n, dev_indices, dev_idata);
+
+    int last;
+    cudaMemcpy(&last, dev_indices + n-1, sizeof(int), cudaMemcpyDeviceToHost);
+
+    dv_scan(n, dev_indices);
+    int streamSize;
+    cudaMemcpy(&streamSize, dev_indices + n-1, sizeof(int), cudaMemcpyDeviceToHost);
+
+    StreamCompaction::Common::kernScatter<<<numBlocks, BLOCK_SIZE>>>(n, dev_odata, dev_indices, dev_idata);
+
+    cudaEventRecord(end, 0);
+    cudaEventSynchronize(end);
+    cudaEventElapsedTime(time, begin, end);
+    cudaEventDestroy(begin);
+    cudaEventDestroy(end);
+
+    cudaMemcpy(odata, dev_odata, size * sizeof(int), cudaMemcpyDeviceToHost);
+
+    free(idata);
+    cudaFree(dev_indices);
+    cudaFree(dev_odata);
+    cudaFree(dev_idata);
+
+    // The kernel always copies the last elt.
+    // Adjust the size to include it if desired.
+    if (last == 1) {
+        streamSize++;
+    }
+
+    free(idata);
+    cudaFree(dev_indices);
+    cudaFree(dev_odata);
+    cudaFree(dev_idata);
+
+    return streamSize;
 }

 }
diff --git a/stream_compaction/efficient.h b/stream_compaction/efficient.h
index 395ba10..76fbc36 100644
--- a/stream_compaction/efficient.h
+++ b/stream_compaction/efficient.h
@@ -2,8 +2,9 @@
 namespace StreamCompaction {
 namespace Efficient {
-    void scan(int n, int *odata, const int *idata);
+    void dv_scan(int n, int *dev_idata);
+    void scan(int size, int *odata, const int *input, float *time, int blockSize);

-    int compact(int n, int *odata, const int *idata);
+    int compact(int n, int *odata, const int *idata, float *time, int blockSize);
 }
 }
diff --git a/stream_compaction/naive.cu b/stream_compaction/naive.cu
index 3d86b60..ba0aa94 100644
--- a/stream_compaction/naive.cu
+++ b/stream_compaction/naive.cu
@@ -6,14 +6,68 @@
 namespace StreamCompaction {
 namespace Naive {

-// TODO: __global__
+__global__ void kScan(int d, int *odata, const int *idata) {
+    int k = (blockDim.x * blockIdx.x) + threadIdx.x;
+    if (k >= (1 << (d-1))) {
+        odata[k] = idata[k - (1 << (d-1))] + idata[k];
+    } else {
+        odata[k] = idata[k];
+    }
+}
+
+__global__ void kShift(int n, int *odata, int *idata) {
+    int k = (blockDim.x * blockIdx.x) + threadIdx.x;
+    if (k >= n) { return; }
+    if (k == 0) {
+        odata[0] = 0;
+    } else {
+        odata[k] = idata[k-1];
+    }
+}

 /**
  * Performs prefix-sum (aka scan) on idata, storing the result into odata.
  */
-void scan(int n, int *odata, const int *idata) {
-    // TODO
-    printf("TODO\n");
+__host__ void scan(int n, int *odata, const int *idata, float *time, int blockSize) {
+    int array_size = n * sizeof(int);
+    int numBlocks = (n + blockSize - 1) / blockSize;
+    // Pad the device buffers out to the full grid so that threads past
+    // n-1 in kScan stay within bounds.
+    int alloc_size = numBlocks * blockSize * sizeof(int);
+
+    int *A;
+    int *B;
+
+    cudaMalloc((void**) &A, alloc_size);
+    cudaMalloc((void**) &B, alloc_size);
+    cudaMemcpy(A, idata, array_size, cudaMemcpyHostToDevice);
+
+    int *in = A;
+    int *out = B;
+
+    cudaEvent_t begin, end;
+    cudaEventCreate(&begin);
+    cudaEventCreate(&end);
+
+    cudaEventRecord(begin, 0);
+
+    for (int d = 1; d < ilog2ceil(n)+1; d++) {
+        in = (d % 2 == 1) ? A : B;
+        out = (d % 2 == 1) ? B : A;
+        kScan<<<numBlocks, blockSize>>>(d, out, in);
+        checkCUDAError("scan");
+    }
+
+    // Shift right by one to turn the inclusive scan into an exclusive scan.
+    kShift<<<numBlocks, blockSize>>>(n, in, out);
+
+    cudaEventRecord(end, 0);
+    cudaEventSynchronize(end);
+    cudaEventElapsedTime(time, begin, end);
+    cudaEventDestroy(begin);
+    cudaEventDestroy(end);
+
+    cudaMemcpy(odata, in, array_size, cudaMemcpyDeviceToHost);
+
+    cudaFree(A);
+    cudaFree(B);
 }

 }
diff --git a/stream_compaction/naive.h b/stream_compaction/naive.h
index 21152d6..3c8578d 100644
--- a/stream_compaction/naive.h
+++ b/stream_compaction/naive.h
@@ -2,6 +2,6 @@
 namespace StreamCompaction {
 namespace Naive {
-    void scan(int n, int *odata, const int *idata);
+    void scan(int n, int *odata, const int *idata, float *time, int blockSize);
 }
 }
diff --git a/stream_compaction/radix.cu b/stream_compaction/radix.cu
new file mode 100644
index 0000000..2369c2d
--- /dev/null
+++ b/stream_compaction/radix.cu
@@ -0,0 +1,147 @@
+#include <cstdio>
+#include <cstdlib>
+#include "common.h"
+#include "radix.h"
+#include "efficient.h"
+
+namespace StreamCompaction {
+namespace Radix {
+
+/*
+ * Extract bit `d` of each element of dv_idata into dv_odata (1 if set).
+ */
+__global__ void kDigit(int n, int d, int *dv_odata, const int *dv_idata) {
+    int k = threadIdx.x;
+    if (k >= n) { return; }
+    dv_odata[k] = (dv_idata[k] & (1 << d)) > 0 ?
1 : 0; +} + +__global__ void kInvert(int n, int *odata, const int *idata) { + int k = threadIdx.x; + if (k >= n) { return; } + odata[k] = idata[k] == 0 ? 1 : 0; +} + +__global__ void kMapToIndex(int n, int *odata, int *b, int *f_indices, int pivot) { + int k = threadIdx.x; + if (k >= n) { return; } + odata[k] = (b[k] == 1) ? (k - f_indices[k] + pivot) : f_indices[k]; +} + +/* + * Implement split on device memory. + * Returns totalFalses (eg. the split point). + */ +__host__ int split(int n, int d, int *dv_odata, int *dv_idata) { + printf("---- split %d %d ----\n", n, d); + int array_size = n * sizeof(int); + int *TMP = (int*)malloc(array_size); + + int *b; + int *e; + int *t; + int *indices; + cudaMalloc((void**) &b, array_size); + cudaMalloc((void**) &e, array_size); + cudaMalloc((void**) &t, array_size); + cudaMalloc((void**) &indices, array_size); + + kDigit<<<1, n>>>(n, d, b, dv_idata); // b + printf("b: "); + cudaMemcpy(TMP, b, array_size, cudaMemcpyDeviceToHost); + for (int i = 0; i < n; i++) { printf("%d\t", TMP[i]); } printf("\n"); + kInvert<<<1, n>>>(n, e, b); // e + printf("e: "); + cudaMemcpy(TMP, e, array_size, cudaMemcpyDeviceToHost); + for (int i = 0; i < n; i++) { printf("%d\t", TMP[i]); } printf("\n"); + + int lastElt; + cudaMemcpy(&lastElt, e + n-1, sizeof(int), cudaMemcpyDeviceToHost); + + StreamCompaction::Efficient::dv_scan(n, e); // f IN PLACE OF e + + int totalFalses; + cudaMemcpy(&totalFalses, e + n-1, sizeof(int), cudaMemcpyDeviceToHost); + totalFalses += lastElt; + + printf("totalFalses = %d\n", totalFalses); + + kMapToIndex<<<1, n>>>(n, indices, b, e, totalFalses); + printf("indices: "); + cudaMemcpy(TMP, indices, array_size, cudaMemcpyDeviceToHost); + for (int i = 0; i < n; i++) { printf("%d\t", TMP[i]); } printf("\n"); + + StreamCompaction::Common::kernScatter<<<1, n>>>(n, dv_odata, indices, dv_idata); // scatter + printf("scattered: "); + cudaMemcpy(TMP, dv_odata, array_size, cudaMemcpyDeviceToHost); + for (int i = 0; i < n; i++) { 
printf("%d\t", TMP[i]); } printf("\n"); + + cudaFree(b); + return totalFalses; +} + +int testArrayOrder(int n, int *a) { + for (int i = 0; i < n-1; i++) { + if (a[i] > a[i+1]) { + return 1; + } + } + return 0; +} + +/* + * odata and idata are device memory points. + */ +__host__ void sortRecursive(int n, int d, int dmax, int *odata, int *idata) { + if (d >= dmax) { return; } + int pivot = split(n, d, odata, idata); + //sortRecursive(n, d+1, dmax, odata, odata); + //if (pivot != 0) { + // sortRecursive(pivot, d+1, dmax, odata, odata); + //} + //if (pivot != n) { + // sortRecursive(n-pivot, d+1, dmax, odata+n, odata+n); + //} +} + +__host__ void sortRecursive2(int n, int d, int dmax, int *odata, int *idata) { + if (d <= 0) { return; } + int pivot = split(n, d, odata, idata); + if (pivot != 0) { + sortRecursive(pivot, d-1, dmax, odata, odata); + } + if (pivot != n) { + sortRecursive(n-pivot, d-1, dmax, odata+n, odata+n); + } +} + +__host__ void sort(int n, int *odata, const int *idata) { + int max = idata[0]; + for (int i = 0; i < n; i++) { + if (idata[i] > max) { + max = idata[i]; + } + } + int maxDigits = ilog2ceil(max); + + int *dv_odata; + int *dv_idata; + int array_size = n * sizeof(int); + + cudaMalloc((void**) &dv_odata, array_size); + cudaMalloc((void**) &dv_idata, array_size); + cudaMemcpy(dv_idata, idata, array_size, cudaMemcpyHostToDevice); + + //sortRecursive(n, 0, maxDigits, dv_odata, dv_idata); + sortRecursive2(n, 0, maxDigits, dv_odata, dv_idata); + + cudaMemcpy(odata, dv_odata, array_size, cudaMemcpyDeviceToHost); + + //for (int i = 0; i < n; i++) { printf("%d\t%d\n", idata[i], odata[i]); } + + cudaFree(dv_odata); + cudaFree(dv_idata); +} + +} +} diff --git a/stream_compaction/radix.h b/stream_compaction/radix.h new file mode 100644 index 0000000..57d21d8 --- /dev/null +++ b/stream_compaction/radix.h @@ -0,0 +1,7 @@ +#pragma once + +namespace StreamCompaction { +namespace Radix { + void sort(int n, int *odata, const int *idata); +} +} diff --git 
a/stream_compaction/thrust.cu b/stream_compaction/thrust.cu
index d8dbb32..9791b97 100644
--- a/stream_compaction/thrust.cu
+++ b/stream_compaction/thrust.cu
@@ -12,10 +12,24 @@
 namespace Thrust {

 /**
  * Performs prefix-sum (aka scan) on idata, storing the result into odata.
  */
-void scan(int n, int *odata, const int *idata) {
-    // TODO use `thrust::exclusive_scan`
-    // example: for device_vectors dv_in and dv_out:
-    // thrust::exclusive_scan(dv_in.begin(), dv_in.end(), dv_out.begin());
+void scan(int n, int *odata, const int *idata, float *time) {
+    thrust::device_vector<int> ivec(idata, idata + n);
+    thrust::device_vector<int> ovec(n);
+
+    cudaEvent_t begin, end;
+    cudaEventCreate(&begin);
+    cudaEventCreate(&end);
+    cudaEventRecord(begin, 0);
+
+    thrust::exclusive_scan(ivec.begin(), ivec.end(), ovec.begin());
+
+    cudaEventRecord(end, 0);
+    cudaEventSynchronize(end);
+    cudaEventElapsedTime(time, begin, end);
+    cudaEventDestroy(begin);
+    cudaEventDestroy(end);
+
+    thrust::copy(ovec.begin(), ovec.end(), odata);
 }

 }
diff --git a/stream_compaction/thrust.h b/stream_compaction/thrust.h
index 06707f3..4408a15 100644
--- a/stream_compaction/thrust.h
+++ b/stream_compaction/thrust.h
@@ -2,6 +2,6 @@
 namespace StreamCompaction {
 namespace Thrust {
-    void scan(int n, int *odata, const int *idata);
+    void scan(int n, int *odata, const int *idata, float *time);
 }
 }