diff --git a/README.md b/README.md
index a82ea0f..75cb93f 100644
--- a/README.md
+++ b/README.md
@@ -3,211 +3,143 @@ CUDA Stream Compaction
**University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 2**
-* (TODO) YOUR NAME HERE
-* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab)
+* SANCHIT GARG
+* Tested on: Mac OS X 10.10.4, i7 @ 2.4 GHz, GT 650M 1GB (Personal Computer)
-### (TODO: Your README)
+### SANCHIT GARG: README
-Include analysis, etc. (Remember, this is public, so don't put
-anything here that you don't want to share with the world.)
+In this project, we implemented the exclusive scan and stream compaction algorithms on both the CPU and the GPU, and then compared their performance.
+
+Reference: [GPU Gems 3, Chapter 39 - Parallel Prefix Sum (Scan) with CUDA](http://http.developer.nvidia.com/GPUGems3/gpugems3_ch39.html)
-Instructions (delete me)
-========================
-
-This is due Sunday, September 13 at midnight.
-
-**Summary:** In this project, you'll implement GPU stream compaction in CUDA,
-from scratch. This algorithm is widely used, and will be important for
-accelerating your path tracer project.
-
-Your stream compaction implementations in this project will simply remove `0`s
-from an array of `int`s. In the path tracer, you will remove terminated paths
-from an array of rays.
-
-In addition to being useful for your path tracer, this project is meant to
-reorient your algorithmic thinking to the way of the GPU. On GPUs, many
-algorithms can benefit from massive parallelism and, in particular, data
-parallelism: executing the same code many times simultaneously with different
-data.
-
-You'll implement a few different versions of the *Scan* (*Prefix Sum*)
-algorithm. First, you'll implement a CPU version of the algorithm to reinforce
-your understanding. Then, you'll write a few GPU implementations: "naive" and
-"work-efficient." Finally, you'll use some of these to implement GPU stream
-compaction.
+## Part 1: CPU Scan & Stream Compaction
-**Algorithm overview & details:** There are two primary references for details
-on the implementation of scan and stream compaction.
+A serial CPU exclusive scan and stream compaction were implemented: `scan` computes the exclusive prefix sum, `compactWithoutScan` removes zeros directly, and `compactWithScan` maps the input to booleans, scans the flags, and scatters.
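The serial exclusive scan is a single dependent pass: each output element is the sum of all inputs before it. A minimal sketch of two of the CPU pieces (function names and signatures here are illustrative, not the exact `cpu.cu` code):

```cpp
#include <vector>

// Exclusive prefix sum: out[i] = in[0] + ... + in[i-1], with out[0] = 0.
std::vector<int> exclusiveScan(const std::vector<int> &in) {
    std::vector<int> out(in.size(), 0);
    for (std::size_t i = 1; i < in.size(); ++i)
        out[i] = out[i - 1] + in[i - 1];
    return out;
}

// Compaction without scan: copy the non-zero elements forward.
std::vector<int> compactWithoutScan(const std::vector<int> &in) {
    std::vector<int> out;
    for (int x : in)
        if (x != 0)
            out.push_back(x);
    return out;
}
```

`compactWithScan` instead maps the input to 0/1 flags, runs the exclusive scan over the flags, and scatters each flagged element to its scanned index.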
-* The [slides on Parallel Algorithms](https://github.com/CIS565-Fall-2015/cis565-fall-2015.github.io/raw/master/lectures/2-Parallel-Algorithms.pptx)
- for Scan, Stream Compaction, and Work-Efficient Parallel Scan.
-* GPU Gems 3, Chapter 39 - [Parallel Prefix Sum (Scan) with CUDA](http://http.developer.nvidia.com/GPUGems3/gpugems3_ch39.html).
-Your GPU stream compaction implementation will live inside of the
-`stream_compaction` subproject. This way, you will be able to easily copy it
-over for use in your GPU path tracer.
+## Part 2: Naive GPU Scan Algorithm
+A naive GPU exclusive scan and stream compaction were implemented. The scan uses only global memory, so it takes `ilog2ceil(n)` separate kernel invocations (GPU Gems 3, Section 39.2.1).
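Each pass of the naive scan adds, for every index, the element `jump` positions to its left; `jump` doubles every pass, so after `ilog2ceil(n)` passes the array holds an inclusive scan (shifted right afterwards to make it exclusive). A sketch of one pass over global memory:

```cuda
// One naive scan step: read the previous pass from idata, write to odata.
__global__ void scanStep(int n, int jump, int *odata, const int *idata) {
    int index = threadIdx.x + (blockIdx.x * blockDim.x);
    if (index >= jump && index < n)
        odata[index] = idata[index] + idata[index - jump]; // add left neighbor
    else if (index < n)
        odata[index] = idata[index];                       // value already final
}
```

Because each pass reads the previous pass's values, the kernel cannot update in place; two buffers are required.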
-## Part 0: The Usual
-This project (and all other CUDA projects in this course) requires an NVIDIA
-graphics card with CUDA capability. Any card with Compute Capability 2.0
-(`sm_20`) or greater will work. Check your GPU on this
-[compatibility table](https://developer.nvidia.com/cuda-gpus).
-If you do not have a personal machine with these specs, you may use those
-computers in the Moore 100B/C which have supported GPUs.
+## Part 3: Work-Efficient GPU Scan & Stream Compaction
-**HOWEVER**: If you need to use the lab computer for your development, you will
-not presently be able to do GPU performance profiling. This will be very
-important for debugging performance bottlenecks in your program.
+The work-efficient GPU exclusive scan and stream compaction were implemented, using the algorithm from GPU Gems 3, Section 39.2.2.
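The work-efficient scan treats the array as a balanced binary tree: an up-sweep (reduce) builds partial sums in place, the last element is zeroed, and a down-sweep pushes the sums back down, leaving an exclusive scan after O(n) additions. A sketch of the two kernels, where the stride `p` doubles on the way up and halves on the way down:

```cuda
// Up-sweep: each node at stride 2p accumulates its left child.
__global__ void upSweep(int n, int p, int *data) {
    int index = threadIdx.x + (blockIdx.x * blockDim.x);
    if (index < n && (index % (2 * p)) == 2 * p - 1)
        data[index] += data[index - p];
}

// Down-sweep: swap-and-add back down the tree (run after zeroing the root).
__global__ void downSweep(int n, int p, int *data) {
    int index = threadIdx.x + (blockIdx.x * blockDim.x);
    if (index < n && (index % (2 * p)) == 2 * p - 1) {
        int t = data[index - p];       // save left child
        data[index - p] = data[index];
        data[index] += t;              // right child gets the running sum
    }
}
```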
-### Useful existing code
-* `stream_compaction/common.h`
- * `checkCUDAError` macro: checks for CUDA errors and exits if there were any.
- * `ilog2ceil(x)`: computes the ceiling of log2(x), as an integer.
-* `main.cpp`
- * Some testing code for your implementations.
+## Part 4: Using Thrust's Implementation
+Thrust's `exclusive_scan` was also wrapped so that its results and performance could be compared with our implementations.
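The Thrust wrapper is only a few lines: move the input into a `thrust::device_vector` and call `thrust::exclusive_scan` on it, so the timed region contains no host-device transfers. A sketch (the function name is illustrative):

```cuda
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/scan.h>
#include <thrust/copy.h>

void thrustScan(int n, int *odata, const int *idata) {
    thrust::device_vector<int> d_in(idata, idata + n); // host-to-device copy
    thrust::device_vector<int> d_out(n);
    thrust::exclusive_scan(d_in.begin(), d_in.end(), d_out.begin()); // timed part
    thrust::copy(d_out.begin(), d_out.end(), odata);   // device-to-host copy
}
```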
-## Part 1: CPU Scan & Stream Compaction
-This stream compaction method will remove `0`s from an array of `int`s.
+## Part 5: Radix Sort
-In `stream_compaction/cpu.cu`, implement:
+Implemented the parallel radix sort algorithm as explained in the reference, using the work-efficient scan to compute scatter destinations. The implementation lives in its own `RadixSort` namespace inside the `stream_compaction` subproject.
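Each radix pass performs a split on one bit: `b[i]` flags elements whose current bit is set, `e[i] = !b[i]`, and `f = exclusiveScan(e)` gives the destination of every 0-bit element; the 1-bit elements follow, starting at `totalFalse = e[n-1] + f[n-1]`. The destination computation can be sketched as one kernel (the name is illustrative):

```cuda
// Destination index for one radix-sort split pass (GPU Gems 3, Section 39.3).
// b[i]: 1 if the current bit of element i is set; f: exclusive scan of e = !b.
__global__ void computeDestination(int n, int totalFalse, int *d,
                                   const int *b, const int *f) {
    int i = threadIdx.x + (blockIdx.x * blockDim.x);
    if (i < n)
        d[i] = b[i] ? (i - f[i] + totalFalse) : f[i]; // t[i] or f[i]
}
```

A scatter by `d` then reorders the array; repeating the split for each bit up to the maximum value yields a fully sorted array.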
-* `StreamCompaction::CPU::scan`: compute an exclusive prefix sum.
-* `StreamCompaction::CPU::compactWithoutScan`: stream compaction without using
- the `scan` function.
-* `StreamCompaction::CPU::compactWithScan`: stream compaction using the `scan`
- function. Map the input array to an array of 0s and 1s, scan it, and use
- scatter to produce the output. You will need a **CPU** scatter implementation
- for this (see slides or GPU Gems chapter for an explanation).
+## Performance Analysis
-These implementations should only be a few lines long.
+My implementation showed the following pattern. All times are in milliseconds. I used 1024 threads per block for all GPU implementations.
+#### Values
-## Part 2: Naive GPU Scan Algorithm
+![Values](images/Values.png)
-In `stream_compaction/naive.cu`, implement `StreamCompaction::Naive::scan`
+#### Graph
-This uses the "Naive" algorithm from GPU Gems 3, Section 39.2.1. We haven't yet
-taught shared memory, and you **shouldn't use it yet**. Example 39-1 uses
-shared memory, but is limited to operating on very small arrays! Instead, write
-this using global memory only. As a result of this, you will have to do
-`ilog2ceil(n)` separate kernel invocations.
+![Performance graph](images/PerformanceGraph.png)
-Beware of errors in Example 39-1 in the book; both the pseudocode and the CUDA
-code in the online version of Chapter 39 are known to have a few small errors
-(in superscripting, missing braces, bad indentation, etc.)
+#### Analysis
-Since the parallel scan algorithm operates on a binary tree structure, it works
-best with arrays with power-of-two length. Make sure your implementation works
-on non-power-of-two sized arrays (see `ilog2ceil`). This requires extra memory
-- your intermediate array sizes will need to be rounded to the next power of
-two.
+The bottleneck in the naive implementation is copying the output array back into the input array after every scan step (one `cudaMemcpy` per pass); simply swapping the two pointers gave incorrect results, so the copy remained.
+
+The bottleneck in the work-efficient implementation is the large number of global-memory accesses performed by its kernels; global memory is slow, which limits the performance of this implementation.
+### Output
-## Part 3: Work-Efficient GPU Scan & Stream Compaction
+The console output of the program is shown below. Note that a test for radix sort was also written: the result of my parallel radix sort is compared against `std::sort` from `<algorithm>` to verify its correctness.
-### 3.1. Scan
+****************
+** SCAN TESTS **
+****************
-In `stream_compaction/efficient.cu`, implement
-`StreamCompaction::Efficient::scan`
+ [ 30 41 15 22 11 41 10 37 48 41 44 30 26 ... 20 0 ]
+==== cpu scan, power-of-two ====
-All of the text in Part 2 applies.
+ [ 0 30 71 86 108 119 160 170 207 255 296 340 370 ... 26119 26139 ]
+==== cpu scan, non-power-of-two ====
-* This uses the "Work-Efficient" algorithm from GPU Gems 3, Section 39.2.2.
-* Beware of errors in Example 39-2.
-* Test non-power-of-two sized arrays.
+ [ 0 30 71 86 108 119 160 170 207 255 296 340 370 ... 26031 26064 ]
+ passed
-### 3.2. Stream Compaction
+==== naive scan, power-of-two ====
-This stream compaction method will remove `0`s from an array of `int`s.
+ [ 0 30 71 86 108 119 160 170 207 255 296 340 370 ... 26119 26139 ]
+ passed
-In `stream_compaction/efficient.cu`, implement
-`StreamCompaction::Efficient::compact`
+==== naive scan, non-power-of-two ====
-For compaction, you will also need to implement the scatter algorithm presented
-in the slides and the GPU Gems chapter.
+ [ 0 30 71 86 108 119 160 170 207 255 296 340 370 ... 0 0 ]
+ passed
-In `stream_compaction/common.cu`, implement these for use in `compact`:
+==== work-efficient scan, power-of-two ====
-* `StreamCompaction::Common::kernMapToBoolean`
-* `StreamCompaction::Common::kernScatter`
+ [ 0 30 71 86 108 119 160 170 207 255 296 340 370 ... 26119 26139 ]
+ passed
+==== work-efficient scan, non-power-of-two ====
-## Part 4: Using Thrust's Implementation
+ [ 0 30 71 86 108 119 160 170 207 255 296 340 370 ... 26031 26064 ]
+ passed
-In `stream_compaction/thrust.cu`, implement:
+==== thrust scan, power-of-two ====
-* `StreamCompaction::Thrust::scan`
+ [ 0 30 71 86 108 119 160 170 207 255 296 340 370 ... 26119 26139 ]
+ passed
-This should be a very short function which wraps a call to the Thrust library
-function `thrust::exclusive_scan(first, last, result)`.
+==== thrust scan, non-power-of-two ====
-To measure timing, be sure to exclude memory operations by passing
-`exclusive_scan` a `thrust::device_vector` (which is already allocated on the
-GPU). You can create a `thrust::device_vector` by creating a
-`thrust::host_vector` from the given pointer, then casting it.
+ [ 0 30 71 86 108 119 160 170 207 255 296 340 370 ... 26031 26064 ]
+ passed
-## Part 5: Radix Sort (Extra Credit) (+10)
+*****************************
+** STREAM COMPACTION TESTS **
+*****************************
-Add an additional module to the `stream_compaction` subproject. Implement radix
-sort using one of your scan implementations. Add tests to check its correctness.
+ [ 2 3 3 0 1 1 2 1 2 1 2 0 2 ... 0 0 ]
+==== cpu compact without scan, power-of-two ====
-## Write-up
+ [ 2 3 3 1 1 2 1 2 1 2 2 2 3 ... 2 1 ]
+ passed
-1. Update all of the TODOs at the top of this README.
-2. Add a description of this project including a list of its features.
-3. Add your performance analysis (see below).
+==== cpu compact without scan, non-power-of-two ====
-All extra credit features must be documented in your README, explaining its
-value (with performance comparison, if applicable!) and showing an example how
-it works. For radix sort, show how it is called and an example of its output.
+ [ 2 3 3 1 1 2 1 2 1 2 2 2 3 ... 1 2 ]
+ passed
-Always profile with Release mode builds and run without debugging.
+==== cpu compact with scan ====
-### Questions
+ [ 2 3 3 1 1 2 1 2 1 2 2 2 3 ... 2 1 ]
+ passed
-* Roughly optimize the block sizes of each of your implementations for minimal
- run time on your GPU.
- * (You shouldn't compare unoptimized implementations to each other!)
+==== work-efficient compact, power-of-two ====
-* Compare all of these GPU Scan implementations (Naive, Work-Efficient, and
- Thrust) to the serial CPU version of Scan. Plot a graph of the comparison
- (with array size on the independent axis).
- * You should use CUDA events for timing. Be sure **not** to include any
- explicit memory operations in your performance measurements, for
- comparability.
- * To guess at what might be happening inside the Thrust implementation, take
- a look at the Nsight timeline for its execution.
+ [ 2 3 3 1 1 2 1 2 1 2 2 2 3 ... 2 1 ]
+ passed
-* Write a brief explanation of the phenomena you see here.
- * Can you find the performance bottlenecks? Is it memory I/O? Computation? Is
- it different for each implementation?
+==== work-efficient compact, non-power-of-two ====
-* Paste the output of the test program into a triple-backtick block in your
- README.
- * If you add your own tests (e.g. for radix sort or to test additional corner
- cases), be sure to mention it explicitly.
+ [ 2 3 3 1 1 2 1 2 1 2 2 2 3 ... 1 2 ]
+ passed
-These questions should help guide you in performance analysis on future
-assignments, as well.
+****************
+** RADIX SORT **
+****************
-## Submit
+==== Radix Sort, sizeAr elements ====
-If you have modified any of the `CMakeLists.txt` files at all (aside from the
-list of `SOURCE_FILES`), you must test that your project can build in Moore
-100B/C. Beware of any build issues discussed on the Google Group.
+ [ 30 91 15 72 61 41 10 37 98 41 94 80 26 96 10 88 ]
+
+ [ 10 10 15 26 30 37 41 41 61 72 80 88 91 94 96 98 ]
+ passed
-1. Open a GitHub pull request so that we can see that you have finished.
- The title should be "Submission: YOUR NAME".
-2. Send an email to the TA (gmail: kainino1+cis565@) with:
- * **Subject**: in the form of `[CIS565] Project 2: PENNKEY`
- * Direct link to your pull request on GitHub
- * In the form of a grade (0-100+) with comments, evaluate your own
- performance on the project.
- * Feedback on the project itself, if any.
diff --git a/cis565_stream_compaction_test.launch b/cis565_stream_compaction_test.launch
index 4267429..07b70cd 100644
--- a/cis565_stream_compaction_test.launch
+++ b/cis565_stream_compaction_test.launch
@@ -8,8 +8,8 @@
-
-
+
+
@@ -18,8 +18,8 @@
-
+
diff --git a/images/PerformanceGraph.png b/images/PerformanceGraph.png
new file mode 100644
index 0000000..bc1c6da
Binary files /dev/null and b/images/PerformanceGraph.png differ
diff --git a/images/Values.png b/images/Values.png
new file mode 100644
index 0000000..d9395a5
Binary files /dev/null and b/images/Values.png differ
diff --git a/src/main.cpp b/src/main.cpp
index 7308451..fa9d9a8 100644
--- a/src/main.cpp
+++ b/src/main.cpp
@@ -11,10 +11,14 @@
#include
#include
#include
+#include <stream_compaction/radixSort.h>
#include "testing_helpers.hpp"
+//For testing Radix Sort
+#include <algorithm>
+
int main(int argc, char* argv[]) {
- const int SIZE = 1 << 8;
+ const int SIZE = 1 << 10;
const int NPOT = SIZE - 3;
int a[SIZE], b[SIZE], c[SIZE];
@@ -42,37 +46,37 @@ int main(int argc, char* argv[]) {
zeroArray(SIZE, c);
printDesc("naive scan, power-of-two");
StreamCompaction::Naive::scan(SIZE, c, a);
- //printArray(SIZE, c, true);
+ printArray(SIZE, c, true);
printCmpResult(SIZE, b, c);
zeroArray(SIZE, c);
printDesc("naive scan, non-power-of-two");
StreamCompaction::Naive::scan(NPOT, c, a);
- //printArray(SIZE, c, true);
+ printArray(SIZE, c, true);
printCmpResult(NPOT, b, c);
zeroArray(SIZE, c);
printDesc("work-efficient scan, power-of-two");
StreamCompaction::Efficient::scan(SIZE, c, a);
- //printArray(SIZE, c, true);
+ printArray(SIZE, c, true);
printCmpResult(SIZE, b, c);
zeroArray(SIZE, c);
printDesc("work-efficient scan, non-power-of-two");
StreamCompaction::Efficient::scan(NPOT, c, a);
- //printArray(NPOT, c, true);
+ printArray(NPOT, c, true);
printCmpResult(NPOT, b, c);
zeroArray(SIZE, c);
printDesc("thrust scan, power-of-two");
StreamCompaction::Thrust::scan(SIZE, c, a);
- //printArray(SIZE, c, true);
+ printArray(SIZE, c, true);
printCmpResult(SIZE, b, c);
zeroArray(SIZE, c);
printDesc("thrust scan, non-power-of-two");
StreamCompaction::Thrust::scan(NPOT, c, a);
- //printArray(NPOT, c, true);
+ printArray(NPOT, c, true);
printCmpResult(NPOT, b, c);
printf("\n");
@@ -110,12 +114,37 @@ int main(int argc, char* argv[]) {
zeroArray(SIZE, c);
printDesc("work-efficient compact, power-of-two");
count = StreamCompaction::Efficient::compact(SIZE, c, a);
- //printArray(count, c, true);
+ printArray(count, c, true);
printCmpLenResult(count, expectedCount, b, c);
zeroArray(SIZE, c);
printDesc("work-efficient compact, non-power-of-two");
count = StreamCompaction::Efficient::compact(NPOT, c, a);
- //printArray(count, c, true);
+ printArray(count, c, true);
printCmpLenResult(count, expectedNPOT, b, c);
+
+ printf("\n");
+ printf("****************\n");
+ printf("** RADIX SORT **\n");
+ printf("****************\n");
+
+ const int sizeAr = 1 << 4;
+ int maxValue = 100;
+ int unsortedArray[sizeAr],
+ sortedArray[sizeAr];
+
+ genArray(sizeAr, unsortedArray, maxValue);
+ zeroArray(sizeAr, sortedArray);
+    printDesc("Radix Sort, sizeAr elements");
+ printArray(sizeAr, unsortedArray, true);
+
+ //Sort using Radix Sort implementation
+ RadixSort::sort(sizeAr, maxValue, sortedArray, unsortedArray);
+ //Sort using (std::sort)
+ std::sort(std::begin(unsortedArray), std::end(unsortedArray));
+
+ printArray(sizeAr, sortedArray, true);
+
+ //Compare results
+ printCmpLenResult(sizeAr, sizeAr, sortedArray, unsortedArray);
}
diff --git a/src/testing_helpers.hpp b/src/testing_helpers.hpp
index f6b572f..88e8de2 100644
--- a/src/testing_helpers.hpp
+++ b/src/testing_helpers.hpp
@@ -19,7 +19,7 @@ void printDesc(const char *desc) {
template <typename T>
void printCmpResult(int n, T *a, T *b) {
- printf(" %s \n",
+ printf(" %s \n\n",
cmpArrays(n, a, b) ? "FAIL VALUE" : "passed");
}
diff --git a/stream_compaction/CMakeLists.txt b/stream_compaction/CMakeLists.txt
index cdbef77..1ebe8aa 100644
--- a/stream_compaction/CMakeLists.txt
+++ b/stream_compaction/CMakeLists.txt
@@ -9,6 +9,8 @@ set(SOURCE_FILES
"efficient.cu"
"thrust.h"
"thrust.cu"
+ "radixSort.h"
+ "radixSort.cu"
)
cuda_add_library(stream_compaction
diff --git a/stream_compaction/common.cu b/stream_compaction/common.cu
index fe872d4..a3b8988 100644
--- a/stream_compaction/common.cu
+++ b/stream_compaction/common.cu
@@ -24,6 +24,13 @@ namespace Common {
*/
__global__ void kernMapToBoolean(int n, int *bools, const int *idata) {
// TODO
+
+ int index = threadIdx.x + (blockIdx.x * blockDim.x);
+
+ if(index < n)
+ {
+ bools[index] = (idata[index] == 0) ? 0 : 1;
+ }
}
/**
@@ -33,6 +40,16 @@ __global__ void kernMapToBoolean(int n, int *bools, const int *idata) {
__global__ void kernScatter(int n, int *odata,
const int *idata, const int *bools, const int *indices) {
// TODO
+
+ int index = threadIdx.x + (blockIdx.x * blockDim.x);
+
+ if(index < n)
+ {
+ if(bools[index] == 1)
+ {
+ odata[indices[index]] = idata[index];
+ }
+ }
}
}
diff --git a/stream_compaction/cpu.cu b/stream_compaction/cpu.cu
index e600c29..771cb5a 100644
--- a/stream_compaction/cpu.cu
+++ b/stream_compaction/cpu.cu
@@ -1,15 +1,35 @@
#include
#include "cpu.h"
+#include <iostream>
namespace StreamCompaction {
namespace CPU {
+#define SHOW_TIMING 0
+
/**
* CPU scan (prefix sum).
*/
void scan(int n, int *odata, const int *idata) {
- // TODO
- printf("TODO\n");
+
+ cudaEvent_t start, stop;
+ cudaEventCreate(&start);
+ cudaEventCreate(&stop);
+ cudaEventRecord(start);
+
+ odata[0] = 0;
+
+	for(int i=1; i<n; ++i)
+	{
+		odata[i] = odata[i-1] + idata[i-1];
+	}
+
+	cudaEventRecord(stop);
+	cudaEventSynchronize(stop);
+	float milliseconds = 0;
+	cudaEventElapsedTime(&milliseconds, start, stop);
+	if(SHOW_TIMING)
+		std::cout<<"Total time in milliseconds : "<<milliseconds<<std::endl;
+}
diff --git a/stream_compaction/efficient.cu b/stream_compaction/efficient.cu
--- a/stream_compaction/efficient.cu
+++ b/stream_compaction/efficient.cu
#include "common.h"
#include "efficient.h"
+#include <iostream>
namespace StreamCompaction {
namespace Efficient {
-// TODO: __global__
+#define SHOW_TIMING 0
+
+int numBlocks, numThreads = 1024;
+
+void printArray(int n, int * a)
+{
+ printf("\n");
+	for(int i=0; i<n; ++i)
+		printf("%d ", a[i]);
+	printf("\n");
+}
+
+//Up-sweep: each node at stride 2p accumulates its left child
+__global__ void upSweep(int n, int p, int *idata)
+{
+	int index = threadIdx.x + (blockIdx.x * blockDim.x);
+
+	if(index < n && (index % (2*p)) == (2*p - 1))
+	{
+		idata[index] += idata[index - p];
+	}
+}
+
+//Down-sweep: swap and add the partial sums back down the tree
+__global__ void downSweep(int n, int p, int *idata)
+{
+	int index = threadIdx.x + (blockIdx.x * blockDim.x);
+
+	if(index < n && (index % (2*p)) == (2*p - 1))
+	{
+		int t = idata[index - p];
+		idata[index - p] = idata[index];
+		idata[index] += t;
+	}
+}
+
+//Writes a single value into the array (used to zero the root)
+__global__ void updateArray(int index, int value, int *idata)
+{
+	idata[index] = value;
+}
+
+//Copies the element count out of the scan result
+__global__ void setK(int *k, const int *scanData, int index)
+{
+	k[0] = scanData[index];
+}
+
+/**
+ * Performs prefix-sum (aka scan) on idata, storing the result into odata.
+ */
+void scan(int n, int *odata, const int *idata) {
+
+	int *dev_idata;
+
+	cudaMalloc((void**)&dev_idata, n * sizeof(int));
+	cudaMemcpy(dev_idata, idata, n * sizeof(int), cudaMemcpyHostToDevice);
+
+	numBlocks = n / numThreads + 1;
+
+	cudaEvent_t start, stop;
+	cudaEventCreate(&start);
+	cudaEventCreate(&stop);
+	cudaEventRecord(start);
+
+	//upSweep
+	int i = 0, p = 1;
+	for(; i<ilog2ceil(n); ++i)
+	{
+		upSweep<<<numBlocks, numThreads>>>(n, p, dev_idata);
+		p <<= 1;
+ }
+
+ //downSweep
+ updateArray<<<1,1>>>(n-1, 0, dev_idata);
+
+ i = ilog2ceil(n)-1;
+ p = 1;
+ p <<= i;
+
+ for(; i>=0; --i)
+ {
+		downSweep<<<numBlocks, numThreads>>>(n, p, dev_idata);
+ p >>= 1;
+ }
+
+ cudaEventRecord(stop);
+ cudaEventSynchronize(stop);
+ float milliseconds = 0;
+ cudaEventElapsedTime(&milliseconds, start, stop);
+ if(SHOW_TIMING)
+		std::cout<<"Total time in milliseconds : "<<milliseconds<<std::endl;
+
+	cudaMemcpy(odata, dev_idata, n * sizeof(int), cudaMemcpyDeviceToHost);
+	cudaFree(dev_idata);
+}
+
+/**
+ * Performs stream compaction on idata, storing the result into odata.
+ * Returns the number of elements remaining.
+ */
+int compact(int n, int *odata, const int *idata) {
+
+	int *dev_idata, *dev_temp, *dev_scanData, *dev_odata, *dev_k;
+	int k[1];
+
+	numBlocks = n / numThreads + 1;
+
+	cudaMalloc((void**)&dev_idata, n * sizeof(int));
+	cudaMalloc((void**)&dev_temp, n * sizeof(int));
+	cudaMalloc((void**)&dev_scanData, n * sizeof(int));
+	cudaMalloc((void**)&dev_k, sizeof(int));
+
+	cudaMemcpy(dev_idata, idata, n * sizeof(int), cudaMemcpyHostToDevice);
+
+	//Map to booleans
+	StreamCompaction::Common::kernMapToBoolean<<<numBlocks, numThreads>>>(n, dev_temp, dev_idata);
+
+ scan(n, dev_scanData, dev_temp);
+
+ setK<<<1,1>>>(dev_k, dev_scanData, n-1);
+
+ cudaMemcpy(k, dev_k, sizeof(int), cudaMemcpyDeviceToHost);
+
+ cudaMalloc((void**)&dev_odata, (*k) * sizeof(int));
+
+	StreamCompaction::Common::kernScatter<<<numBlocks, numThreads>>>(n, dev_odata, dev_idata, dev_temp, dev_scanData);
+
+ cudaMemcpy(odata, dev_odata, (*k) * sizeof(int), cudaMemcpyDeviceToHost);
+
+ cudaFree(dev_idata);
+ cudaFree(dev_temp);
+ cudaFree(dev_scanData);
+ cudaFree(dev_k);
+
+ return (*k);
+
}
}
diff --git a/stream_compaction/naive.cu b/stream_compaction/naive.cu
index 3d86b60..bcf4675 100644
--- a/stream_compaction/naive.cu
+++ b/stream_compaction/naive.cu
@@ -2,18 +2,64 @@
#include
#include "common.h"
#include "naive.h"
+#include <iostream>
+#include
namespace StreamCompaction {
namespace Naive {
-// TODO: __global__
+#define SHOW_TIMING 0
+int numBlocks, numThreads = 256;
+
+int * dev_odata;
+int * dev_idata;
+
+ __global__ void scanStep(int n, int jump, int *odata, int *idata)
+ {
+ int index = threadIdx.x + (blockIdx.x * blockDim.x);
+
+ if(index >= jump && index < n)
+ {
+ odata[index] = idata[index] + idata[index - jump];
+ }
+ }
/**
* Performs prefix-sum (aka scan) on idata, storing the result into odata.
*/
void scan(int n, int *odata, const int *idata) {
- // TODO
- printf("TODO\n");
+
+ cudaMalloc((void**)&dev_odata, n * sizeof(int));
+ cudaMalloc((void**)&dev_idata, n * sizeof(int));
+
+ cudaMemcpy(dev_odata, idata, n * sizeof(int), cudaMemcpyHostToDevice);
+ cudaMemcpy(dev_idata, idata, n * sizeof(int), cudaMemcpyHostToDevice);
+ numBlocks = n / numThreads + 1;
+
+ cudaEvent_t start, stop;
+ cudaEventCreate(&start);
+ cudaEventCreate(&stop);
+ cudaEventRecord(start);
+
+ int p=1;
+ for(int i=1; i< ilog2ceil(n)+1; ++i)
+ {
+		scanStep<<<numBlocks, numThreads>>>(n, p, dev_odata, dev_idata);
+ p <<= 1;
+ cudaMemcpy(dev_idata, dev_odata, n * sizeof(int), cudaMemcpyDeviceToDevice);
+ }
+
+ cudaEventRecord(stop);
+ cudaEventSynchronize(stop);
+ float milliseconds = 0;
+ cudaEventElapsedTime(&milliseconds, start, stop);
+ if(SHOW_TIMING)
+		std::cout<<"Total time in milliseconds : "<<milliseconds<<std::endl;
+
+	//Shift right to make the inclusive result exclusive
+	odata[0] = 0;
+	cudaMemcpy(odata + 1, dev_odata, (n - 1) * sizeof(int), cudaMemcpyDeviceToHost);
+
+	cudaFree(dev_odata);
+	cudaFree(dev_idata);
+}
+
+}
+}
diff --git a/stream_compaction/radixSort.cu b/stream_compaction/radixSort.cu
new file mode 100644
--- /dev/null
+++ b/stream_compaction/radixSort.cu
+#include <cstdio>
+#include "common.h"
+#include "radixSort.h"
+#include "efficient.h"
+
+#define TEST 0
+int numBlocks, numThread = 1024;
+
+namespace RadixSort {
+
+void printArray(int size, int * a)
+{
+ printf("\n");
+	for(int i=0; i<size; ++i)
+		printf("%d ", a[i]);
+	printf("\n");
+}
+
+//b[index] = 1 if the bit being tested is set in idata[index]
+__global__ void createBArray(int n, int bit, int *b, const int *idata)
+{
+	int index = threadIdx.x + (blockIdx.x * blockDim.x);
+
+	if(index < n)
+	{
+		b[index] = (idata[index] & bit) ? 1 : 0;
+	}
+}
+
+//e[index] = !b[index]
+__global__ void createEArray(int n, int *e, const int *b)
+{
+	int index = threadIdx.x + (blockIdx.x * blockDim.x);
+
+	if(index < n)
+	{
+		e[index] = 1 - b[index];
+	}
+}
+
+//totalFalse = e[n-1] + f[n-1]
+__global__ void getTotalFalse(int index, int *totalFalse, const int *f, const int *e)
+{
+	totalFalse[0] = f[index] + e[index];
+}
+
+//t[index] = index - f[index] + totalFalse
+__global__ void createTArray(int n, int *t, const int *f, const int *totalFalse)
+{
+	int index = threadIdx.x + (blockIdx.x * blockDim.x);
+
+	if(index < n)
+	{
+		t[index] = index - f[index] + totalFalse[0];
+	}
+}
+
+//d[index] = b[index] ? t[index] : f[index]
+__global__ void createDArray(int n, int *d, const int *b, const int *t, const int *f)
+{
+	int index = threadIdx.x + (blockIdx.x * blockDim.x);
+
+	if(index < n)
+	{
+		d[index] = b[index] ? t[index] : f[index];
+	}
+}
+
+//Scatter every element to its computed destination
+__global__ void scatter(int n, int *odata, const int *idata, const int *d)
+{
+	int index = threadIdx.x + (blockIdx.x * blockDim.x);
+
+	if(index < n)
+	{
+		odata[d[index]] = idata[index];
+	}
+}
+
+//Split the array on bit i; 0-bit elements come before 1-bit elements
+void split(int n, int i, int *dev_odata, int *dev_idata)
+{
+	int *dev_b, *dev_e, *dev_f, *dev_t, *dev_d;
+	int *hst_temp = (int*)malloc(n * sizeof(int));	//debug buffer, only used when TEST is enabled
+
+	//Create b
+	cudaMalloc((void**)&dev_b, n * sizeof(int));
+	createBArray<<<numBlocks, numThread>>>(n, i, dev_b, dev_idata);
+ if(TEST)
+ {
+ cudaMemcpy(hst_temp, dev_b, n*sizeof(int), cudaMemcpyDeviceToHost);
+ printArray(n, hst_temp);
+ }
+
+ //Create e
+ cudaMalloc((void**)&dev_e, n * sizeof(int));
+	createEArray<<<numBlocks, numThread>>>(n, dev_e, dev_b);
+ if(TEST)
+ {
+ cudaMemcpy(hst_temp, dev_e, n*sizeof(int), cudaMemcpyDeviceToHost);
+ printArray(n, hst_temp);
+ }
+
+ //Create f by using efficient scan
+ cudaMalloc((void**)&dev_f, n * sizeof(int));
+ StreamCompaction::Efficient::scan(n, dev_f, dev_e);
+ if(TEST)
+ {
+ cudaMemcpy(hst_temp, dev_f, n*sizeof(int), cudaMemcpyDeviceToHost);
+ printArray(n, hst_temp);
+ }
+
+ //Finding total false
+ int *dev_totalFalse;
+ cudaMalloc((void**)&dev_totalFalse, sizeof(int));
+ getTotalFalse<<<1, 1>>>(n-1, dev_totalFalse, dev_f, dev_e);
+ if(TEST)
+ {
+ cudaMemcpy(hst_temp, dev_totalFalse, sizeof(int), cudaMemcpyDeviceToHost);
+ printf("\n%d %d\n", hst_temp[0], n-1);
+ }
+
+ //Create t
+ cudaMalloc((void**)&dev_t, n * sizeof(int));
+	createTArray<<<numBlocks, numThread>>>(n, dev_t, dev_f, dev_totalFalse);
+ if(TEST)
+ {
+ cudaMemcpy(hst_temp, dev_t, n*sizeof(int), cudaMemcpyDeviceToHost);
+ printArray(n, hst_temp);
+ }
+
+ //Create d
+ cudaMalloc((void**)&dev_d, n * sizeof(int));
+	createDArray<<<numBlocks, numThread>>>(n, dev_d, dev_b, dev_t, dev_f);
+ if(TEST)
+ {
+ cudaMemcpy(hst_temp, dev_d, n*sizeof(int), cudaMemcpyDeviceToHost);
+ printArray(n, hst_temp);
+ }
+
+ //Shuffle
+	scatter<<<numBlocks, numThread>>>(n, dev_odata, dev_idata, dev_d);
+ if(TEST)
+ {
+ cudaMemcpy(hst_temp, dev_odata, n*sizeof(int), cudaMemcpyDeviceToHost);
+ printArray(n, hst_temp);
+ }
+
+ cudaFree(dev_b);
+ cudaFree(dev_e);
+ cudaFree(dev_f);
+ cudaFree(dev_t);
+ cudaFree(dev_d);
+ cudaFree(dev_totalFalse);
+ }
+
+ void sort(int n, int maxValue, int *odata, const int *idata)
+ {
+ int i = 1,
+ *dev_idata,
+ *dev_odata;
+
+ numBlocks = n / numThread + 1;
+
+ cudaMalloc((void**)&dev_odata, n * sizeof(int));
+ cudaMalloc((void**)&dev_idata, n * sizeof(int));
+
+ cudaMemcpy(dev_idata, idata, n * sizeof(int), cudaMemcpyHostToDevice);
+
+ while(i <= maxValue)
+ {
+ split(n, i, dev_odata, dev_idata);
+
+ cudaMemcpy(dev_idata, dev_odata, n * sizeof(int), cudaMemcpyDeviceToDevice);
+
+ i<<=1;
+ }
+
+ cudaMemcpy(odata, dev_odata, n * sizeof(int), cudaMemcpyDeviceToHost);
+
+ cudaFree(dev_idata);
+ cudaFree(dev_odata);
+
+ }
+}
diff --git a/stream_compaction/radixSort.h b/stream_compaction/radixSort.h
new file mode 100644
index 0000000..9ddca1a
--- /dev/null
+++ b/stream_compaction/radixSort.h
@@ -0,0 +1,7 @@
+#pragma once
+
+namespace RadixSort {
+
+ void sort(int n, int maxValue, int *odata, const int *idata);
+
+}
diff --git a/stream_compaction/thrust.cu b/stream_compaction/thrust.cu
index d8dbb32..a668181 100644
--- a/stream_compaction/thrust.cu
+++ b/stream_compaction/thrust.cu
@@ -9,6 +9,8 @@
namespace StreamCompaction {
namespace Thrust {
+#define SHOW_TIMING 0
+
/**
* Performs prefix-sum (aka scan) on idata, storing the result into odata.
*/
@@ -16,6 +18,27 @@ void scan(int n, int *odata, const int *idata) {
// TODO use `thrust::exclusive_scan`
// example: for device_vectors dv_in and dv_out:
// thrust::exclusive_scan(dv_in.begin(), dv_in.end(), dv_out.begin());
+
+	thrust::host_vector<int> thrustHst_idata(idata, idata + n);
+
+	thrust::device_vector<int> thrustDev_idata(thrustHst_idata);
+	thrust::device_vector<int> thrustDev_odata(n);
+
+ cudaEvent_t start, stop;
+ cudaEventCreate(&start);
+ cudaEventCreate(&stop);
+ cudaEventRecord(start);
+
+ thrust::exclusive_scan(thrustDev_idata.begin(), thrustDev_idata.end(), thrustDev_odata.begin());
+
+ cudaEventRecord(stop);
+ cudaEventSynchronize(stop);
+ float milliseconds = 0;
+ cudaEventElapsedTime(&milliseconds, start, stop);
+ if(SHOW_TIMING)
+		std::cout<<"Total time in milliseconds : "<<milliseconds<<std::endl;
+
+	thrust::copy(thrustDev_odata.begin(), thrustDev_odata.end(), odata);
+}
+
+}
+}