diff --git a/README.md b/README.md
index a82ea0f..75cb93f 100644
--- a/README.md
+++ b/README.md
@@ -3,211 +3,143 @@ CUDA Stream Compaction
 **University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 2**
-* (TODO) YOUR NAME HERE
-* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab)
+* SANCHIT GARG
+* Tested on: Mac OSX 10.10.4, i7 @ 2.4 GHz, GT 650M 1GB (Personal Computer)
-### (TODO: Your README)
+### SANCHIT GARG: ReadMe
-Include analysis, etc. (Remember, this is public, so don't put
-anything here that you don't want to share with the world.)
+In this assignment, we implemented the exclusive scan and stream compaction algorithms on both the CPU and the GPU, and then compared their performance.
+Reference: http://http.developer.nvidia.com/GPUGems3/gpugems3_ch39.html
-Instructions (delete me)
-========================
-
-This is due Sunday, September 13 at midnight.
-
-**Summary:** In this project, you'll implement GPU stream compaction in CUDA,
-from scratch. This algorithm is widely used, and will be important for
-accelerating your path tracer project.
-
-Your stream compaction implementations in this project will simply remove `0`s
-from an array of `int`s. In the path tracer, you will remove terminated paths
-from an array of rays.
-
-In addition to being useful for your path tracer, this project is meant to
-reorient your algorithmic thinking to the way of the GPU. On GPUs, many
-algorithms can benefit from massive parallelism and, in particular, data
-parallelism: executing the same code many times simultaneously with different
-data.
-
-You'll implement a few different versions of the *Scan* (*Prefix Sum*)
-algorithm. First, you'll implement a CPU version of the algorithm to reinforce
-your understanding. Then, you'll write a few GPU implementations: "naive" and
-"work-efficient." Finally, you'll use some of these to implement GPU stream
-compaction.
+## Part 1: CPU Scan & Stream Compaction
-**Algorithm overview & details:** There are two primary references for details
-on the implementation of scan and stream compaction.
+A serial CPU exclusive scan and stream compaction were implemented.
-* The [slides on Parallel Algorithms](https://github.com/CIS565-Fall-2015/cis565-fall-2015.github.io/raw/master/lectures/2-Parallel-Algorithms.pptx)
-  for Scan, Stream Compaction, and Work-Efficient Parallel Scan.
-* GPU Gems 3, Chapter 39 - [Parallel Prefix Sum (Scan) with CUDA](http://http.developer.nvidia.com/GPUGems3/gpugems3_ch39.html).
-Your GPU stream compaction implementation will live inside of the
-`stream_compaction` subproject. This way, you will be able to easily copy it
-over for use in your GPU path tracer.
+## Part 2: Naive GPU Scan Algorithm
+A naive GPU exclusive scan and stream compaction were implemented.
-## Part 0: The Usual
-This project (and all other CUDA projects in this course) requires an NVIDIA
-graphics card with CUDA capability. Any card with Compute Capability 2.0
-(`sm_20`) or greater will work. Check your GPU on this
-[compatibility table](https://developer.nvidia.com/cuda-gpus).
-If you do not have a personal machine with these specs, you may use those
-computers in the Moore 100B/C which have supported GPUs.
+## Part 3: Work-Efficient GPU Scan & Stream Compaction
-**HOWEVER**: If you need to use the lab computer for your development, you will
-not presently be able to do GPU performance profiling. This will be very
-important for debugging performance bottlenecks in your program.
+The work-efficient GPU exclusive scan and stream compaction were implemented.
-### Useful existing code
-* `stream_compaction/common.h`
-  * `checkCUDAError` macro: checks for CUDA errors and exits if there were any.
-  * `ilog2ceil(x)`: computes the ceiling of log2(x), as an integer.
-* `main.cpp`
-  * Some testing code for your implementations.
+## Part 4: Using Thrust's Implementation
+The Thrust library's exclusive scan was also used, to compare its results and performance with our implementations.
-## Part 1: CPU Scan & Stream Compaction
-This stream compaction method will remove `0`s from an array of `int`s.
+## Part 5: Radix Sort
-In `stream_compaction/cpu.cu`, implement:
+Implemented the parallel radix sort algorithm as explained in the reference.
+The implementation lives in its own `RadixSort` namespace.
-* `StreamCompaction::CPU::scan`: compute an exclusive prefix sum.
-* `StreamCompaction::CPU::compactWithoutScan`: stream compaction without using
-  the `scan` function.
-* `StreamCompaction::CPU::compactWithScan`: stream compaction using the `scan`
-  function. Map the input array to an array of 0s and 1s, scan it, and use
-  scatter to produce the output. You will need a **CPU** scatter implementation
-  for this (see slides or GPU Gems chapter for an explanation).
+## Performance Analysis
-These implementations should only be a few lines long.
+My implementations showed the following behavior. All times are in milliseconds. I used 1024 threads per block for all GPU implementations.
+#### Values
-## Part 2: Naive GPU Scan Algorithm
+![](images/Values.png)
-In `stream_compaction/naive.cu`, implement `StreamCompaction::Naive::scan`
+#### Graph
-This uses the "Naive" algorithm from GPU Gems 3, Section 39.2.1. We haven't yet
-taught shared memory, and you **shouldn't use it yet**. Example 39-1 uses
-shared memory, but is limited to operating on very small arrays! Instead, write
-this using global memory only. As a result of this, you will have to do
-`ilog2ceil(n)` separate kernel invocations.
+![](images/PerformanceGraph.png)
-Beware of errors in Example 39-1 in the book; both the pseudocode and the CUDA
-code in the online version of Chapter 39 are known to have a few small errors
-(in superscripting, missing braces, bad indentation, etc.)
+#### Analysis
-Since the parallel scan algorithm operates on a binary tree structure, it works
-best with arrays with power-of-two length. Make sure your implementation works
-on non-power-of-two sized arrays (see `ilog2ceil`). This requires extra memory
-- your intermediate array sizes will need to be rounded to the next power of
-two.
+The bottleneck in the naive implementation is the device-to-device copy of the output array after every scan step so that it can serve as the next step's input. Simply switching the two device arrays was giving incorrect results, so the copy was kept (a pointer-swap sketch that avoids the copy follows the README portion of this patch).
+The bottleneck for the work-efficient implementation is most likely the large amount of global-memory access in the up-sweep and down-sweep kernels; this traffic is slow and limits the performance of the implementation.
+### Output
-## Part 3: Work-Efficient GPU Scan & Stream Compaction
+The console output of the program is shown below. Note that a test for radix sort was also added: the input is sorted with `std::sort` from the `<algorithm>` header and compared with my implementation's result to verify the correctness of the parallel radix sort.
-### 3.1. Scan
+****************
+** SCAN TESTS **
+****************
-In `stream_compaction/efficient.cu`, implement
-`StreamCompaction::Efficient::scan`
+    [ 30 41 15 22 11 41 10 37 48 41 44 30 26 ... 20 0 ]
+==== cpu scan, power-of-two ====
-All of the text in Part 2 applies.
+    [ 0 30 71 86 108 119 160 170 207 255 296 340 370 ... 26119 26139 ]
+==== cpu scan, non-power-of-two ====
-* This uses the "Work-Efficient" algorithm from GPU Gems 3, Section 39.2.2.
-* Beware of errors in Example 39-2.
-* Test non-power-of-two sized arrays.
+    [ 0 30 71 86 108 119 160 170 207 255 296 340 370 ... 26031 26064 ]
+    passed
-### 3.2. Stream Compaction
+==== naive scan, power-of-two ====
-This stream compaction method will remove `0`s from an array of `int`s.
+    [ 0 30 71 86 108 119 160 170 207 255 296 340 370 ... 26119 26139 ]
+    passed
-In `stream_compaction/efficient.cu`, implement
-`StreamCompaction::Efficient::compact`
+==== naive scan, non-power-of-two ====
-For compaction, you will also need to implement the scatter algorithm presented
-in the slides and the GPU Gems chapter.
+    [ 0 30 71 86 108 119 160 170 207 255 296 340 370 ... 0 0 ]
+    passed
-In `stream_compaction/common.cu`, implement these for use in `compact`:
+==== work-efficient scan, power-of-two ====
-* `StreamCompaction::Common::kernMapToBoolean`
-* `StreamCompaction::Common::kernScatter`
+    [ 0 30 71 86 108 119 160 170 207 255 296 340 370 ... 26119 26139 ]
+    passed
+==== work-efficient scan, non-power-of-two ====
-## Part 4: Using Thrust's Implementation
+    [ 0 30 71 86 108 119 160 170 207 255 296 340 370 ... 26031 26064 ]
+    passed
-In `stream_compaction/thrust.cu`, implement:
+==== thrust scan, power-of-two ====
-* `StreamCompaction::Thrust::scan`
+    [ 0 30 71 86 108 119 160 170 207 255 296 340 370 ... 26119 26139 ]
+    passed
-This should be a very short function which wraps a call to the Thrust library
-function `thrust::exclusive_scan(first, last, result)`.
+==== thrust scan, non-power-of-two ====
-To measure timing, be sure to exclude memory operations by passing
-`exclusive_scan` a `thrust::device_vector` (which is already allocated on the
-GPU). You can create a `thrust::device_vector` by creating a
-`thrust::host_vector` from the given pointer, then casting it.
+    [ 0 30 71 86 108 119 160 170 207 255 296 340 370 ... 26031 26064 ]
+    passed
-## Part 5: Radix Sort (Extra Credit) (+10)
+*****************************
+** STREAM COMPACTION TESTS **
+*****************************
-Add an additional module to the `stream_compaction` subproject. Implement radix
-sort using one of your scan implementations. Add tests to check its correctness.
+    [ 2 3 3 0 1 1 2 1 2 1 2 0 2 ... 0 0 ]
+==== cpu compact without scan, power-of-two ====
-## Write-up
+    [ 2 3 3 1 1 2 1 2 1 2 2 2 3 ... 2 1 ]
+    passed
-1. Update all of the TODOs at the top of this README.
-2. Add a description of this project including a list of its features.
-3. Add your performance analysis (see below).
+==== cpu compact without scan, non-power-of-two ====
-All extra credit features must be documented in your README, explaining its
-value (with performance comparison, if applicable!) and showing an example how
-it works. For radix sort, show how it is called and an example of its output.
+    [ 2 3 3 1 1 2 1 2 1 2 2 2 3 ... 1 2 ]
+    passed
-Always profile with Release mode builds and run without debugging.
+==== cpu compact with scan ====
-### Questions
+    [ 2 3 3 1 1 2 1 2 1 2 2 2 3 ... 2 1 ]
+    passed
-* Roughly optimize the block sizes of each of your implementations for minimal
-  run time on your GPU.
-  * (You shouldn't compare unoptimized implementations to each other!)
+==== work-efficient compact, power-of-two ====
-* Compare all of these GPU Scan implementations (Naive, Work-Efficient, and
-  Thrust) to the serial CPU version of Scan. Plot a graph of the comparison
-  (with array size on the independent axis).
-  * You should use CUDA events for timing. Be sure **not** to include any
-    explicit memory operations in your performance measurements, for
-    comparability.
-  * To guess at what might be happening inside the Thrust implementation, take
-    a look at the Nsight timeline for its execution.
+    [ 2 3 3 1 1 2 1 2 1 2 2 2 3 ... 2 1 ]
+    passed
-* Write a brief explanation of the phenomena you see here.
-  * Can you find the performance bottlenecks? Is it memory I/O? Computation? Is
-    it different for each implementation?
+==== work-efficient compact, non-power-of-two ====
-* Paste the output of the test program into a triple-backtick block in your
-  README.
-  * If you add your own tests (e.g. for radix sort or to test additional corner
-    cases), be sure to mention it explicitly.
+    [ 2 3 3 1 1 2 1 2 1 2 2 2 3 ... 1 2 ]
+    passed
-These questions should help guide you in performance analysis on future
-assignments, as well.
+****************
+** RADIX SORT **
+****************
-## Submit
+==== Radix Sort, sizeAr elements ====
-If you have modified any of the `CMakeLists.txt` files at all (aside from the
-list of `SOURCE_FILES`), you must test that your project can build in Moore
-100B/C. Beware of any build issues discussed on the Google Group.
+    [ 30 91 15 72 61 41 10 37 98 41 94 80 26 96 10 88 ]
+
+    [ 10 10 15 26 30 37 41 41 61 72 80 88 91 94 96 98 ]
+    passed
-1. Open a GitHub pull request so that we can see that you have finished.
-   The title should be "Submission: YOUR NAME".
-2. Send an email to the TA (gmail: kainino1+cis565@) with:
-   * **Subject**: in the form of `[CIS565] Project 2: PENNKEY`
-   * Direct link to your pull request on GitHub
-   * In the form of a grade (0-100+) with comments, evaluate your own
-     performance on the project.
-   * Feedback on the project itself, if any.
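The naive-scan bottleneck called out in the analysis above (a full device-to-device copy after every pass) can usually be removed by ping-ponging the two device buffers. A bare pointer swap misbehaves with a `scanStep`-style kernel because that kernel only writes elements with `index >= jump`, so the buffer being written into keeps stale values from two passes earlier at the lower indices. Below is a minimal, untested sketch of the idea; it borrows the `numBlocks`/`numThreads` naming from `naive.cu`, and the names `scanStepPingPong` and `naiveScanPingPong` are illustrative only.

```cuda
#include <cuda_runtime.h>
#include <utility>  // std::swap

// Sketch: every thread writes its slot, so the buffers can be swapped
// instead of copied between passes.
__global__ void scanStepPingPong(int n, int jump, int *odata, const int *idata) {
    int index = threadIdx.x + (blockIdx.x * blockDim.x);
    if (index >= n) {
        return;
    }
    // Elements below `jump` must be carried over explicitly; skipping them is
    // exactly what makes a plain buffer swap produce incorrect results.
    odata[index] = (index >= jump) ? idata[index] + idata[index - jump]
                                   : idata[index];
}

// Host-side loop, assuming dev_bufA initially holds the input values.
void naiveScanPingPong(int n, int numBlocks, int numThreads,
                       int *dev_bufA, int *dev_bufB) {
    for (int jump = 1; jump < n; jump <<= 1) {
        scanStepPingPong<<<numBlocks, numThreads>>>(n, jump, dev_bufB, dev_bufA);
        std::swap(dev_bufA, dev_bufB);  // this pass's output feeds the next pass
    }
    // dev_bufA now points at the buffer holding the inclusive scan.
}
```

Because the pointers are swapped once more after the final pass, the result ends up behind `dev_bufA`; the copy back to the host (and the shift that turns the inclusive scan into an exclusive one) would read from that pointer rather than from a fixed `dev_odata`.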
diff --git a/cis565_stream_compaction_test.launch b/cis565_stream_compaction_test.launch index 4267429..07b70cd 100644 --- a/cis565_stream_compaction_test.launch +++ b/cis565_stream_compaction_test.launch @@ -8,8 +8,8 @@ - - + + @@ -18,8 +18,8 @@ - + diff --git a/images/PerformanceGraph.png b/images/PerformanceGraph.png new file mode 100644 index 0000000..bc1c6da Binary files /dev/null and b/images/PerformanceGraph.png differ diff --git a/images/Values.png b/images/Values.png new file mode 100644 index 0000000..d9395a5 Binary files /dev/null and b/images/Values.png differ diff --git a/src/main.cpp b/src/main.cpp index 7308451..fa9d9a8 100644 --- a/src/main.cpp +++ b/src/main.cpp @@ -11,10 +11,14 @@ #include #include #include +#include #include "testing_helpers.hpp" +//For testing Radix Sort +#include + int main(int argc, char* argv[]) { - const int SIZE = 1 << 8; + const int SIZE = 1 << 10; const int NPOT = SIZE - 3; int a[SIZE], b[SIZE], c[SIZE]; @@ -42,37 +46,37 @@ int main(int argc, char* argv[]) { zeroArray(SIZE, c); printDesc("naive scan, power-of-two"); StreamCompaction::Naive::scan(SIZE, c, a); - //printArray(SIZE, c, true); + printArray(SIZE, c, true); printCmpResult(SIZE, b, c); zeroArray(SIZE, c); printDesc("naive scan, non-power-of-two"); StreamCompaction::Naive::scan(NPOT, c, a); - //printArray(SIZE, c, true); + printArray(SIZE, c, true); printCmpResult(NPOT, b, c); zeroArray(SIZE, c); printDesc("work-efficient scan, power-of-two"); StreamCompaction::Efficient::scan(SIZE, c, a); - //printArray(SIZE, c, true); + printArray(SIZE, c, true); printCmpResult(SIZE, b, c); zeroArray(SIZE, c); printDesc("work-efficient scan, non-power-of-two"); StreamCompaction::Efficient::scan(NPOT, c, a); - //printArray(NPOT, c, true); + printArray(NPOT, c, true); printCmpResult(NPOT, b, c); zeroArray(SIZE, c); printDesc("thrust scan, power-of-two"); StreamCompaction::Thrust::scan(SIZE, c, a); - //printArray(SIZE, c, true); + printArray(SIZE, c, true); printCmpResult(SIZE, b, c); zeroArray(SIZE, c); printDesc("thrust scan, non-power-of-two"); StreamCompaction::Thrust::scan(NPOT, c, a); - //printArray(NPOT, c, true); + printArray(NPOT, c, true); printCmpResult(NPOT, b, c); printf("\n"); @@ -110,12 +114,37 @@ int main(int argc, char* argv[]) { zeroArray(SIZE, c); printDesc("work-efficient compact, power-of-two"); count = StreamCompaction::Efficient::compact(SIZE, c, a); - //printArray(count, c, true); + printArray(count, c, true); printCmpLenResult(count, expectedCount, b, c); zeroArray(SIZE, c); printDesc("work-efficient compact, non-power-of-two"); count = StreamCompaction::Efficient::compact(NPOT, c, a); - //printArray(count, c, true); + printArray(count, c, true); printCmpLenResult(count, expectedNPOT, b, c); + + printf("\n"); + printf("****************\n"); + printf("** RADIX SORT **\n"); + printf("****************\n"); + + const int sizeAr = 1 << 4; + int maxValue = 100; + int unsortedArray[sizeAr], + sortedArray[sizeAr]; + + genArray(sizeAr, unsortedArray, maxValue); + zeroArray(sizeAr, sortedArray); + printDesc("Radix Sort, elements"); + printArray(sizeAr, unsortedArray, true); + + //Sort using Radix Sort implementation + RadixSort::sort(sizeAr, maxValue, sortedArray, unsortedArray); + //Sort using (std::sort) + std::sort(std::begin(unsortedArray), std::end(unsortedArray)); + + printArray(sizeAr, sortedArray, true); + + //Compare results + printCmpLenResult(sizeAr, sizeAr, sortedArray, unsortedArray); } diff --git a/src/testing_helpers.hpp b/src/testing_helpers.hpp index f6b572f..88e8de2 100644 
--- a/src/testing_helpers.hpp +++ b/src/testing_helpers.hpp @@ -19,7 +19,7 @@ void printDesc(const char *desc) { template void printCmpResult(int n, T *a, T *b) { - printf(" %s \n", + printf(" %s \n\n", cmpArrays(n, a, b) ? "FAIL VALUE" : "passed"); } diff --git a/stream_compaction/CMakeLists.txt b/stream_compaction/CMakeLists.txt index cdbef77..1ebe8aa 100644 --- a/stream_compaction/CMakeLists.txt +++ b/stream_compaction/CMakeLists.txt @@ -9,6 +9,8 @@ set(SOURCE_FILES "efficient.cu" "thrust.h" "thrust.cu" + "radixSort.h" + "radixSort.cu" ) cuda_add_library(stream_compaction diff --git a/stream_compaction/common.cu b/stream_compaction/common.cu index fe872d4..a3b8988 100644 --- a/stream_compaction/common.cu +++ b/stream_compaction/common.cu @@ -24,6 +24,13 @@ namespace Common { */ __global__ void kernMapToBoolean(int n, int *bools, const int *idata) { // TODO + + int index = threadIdx.x + (blockIdx.x * blockDim.x); + + if(index < n) + { + bools[index] = (idata[index] == 0) ? 0 : 1; + } } /** @@ -33,6 +40,16 @@ __global__ void kernMapToBoolean(int n, int *bools, const int *idata) { __global__ void kernScatter(int n, int *odata, const int *idata, const int *bools, const int *indices) { // TODO + + int index = threadIdx.x + (blockIdx.x * blockDim.x); + + if(index < n) + { + if(bools[index] == 1) + { + odata[indices[index]] = idata[index]; + } + } } } diff --git a/stream_compaction/cpu.cu b/stream_compaction/cpu.cu index e600c29..771cb5a 100644 --- a/stream_compaction/cpu.cu +++ b/stream_compaction/cpu.cu @@ -1,15 +1,35 @@ #include #include "cpu.h" +#include namespace StreamCompaction { namespace CPU { +#define SHOW_TIMING 0 + /** * CPU scan (prefix sum). */ void scan(int n, int *odata, const int *idata) { - // TODO - printf("TODO\n"); + + cudaEvent_t start, stop; + cudaEventCreate(&start); + cudaEventCreate(&stop); + cudaEventRecord(start); + + odata[0] = 0; + + for(int i=1; i #include "common.h" #include "efficient.h" +#include namespace StreamCompaction { namespace Efficient { -// TODO: __global__ +#define SHOW_TIMING 0 + +int numBlocks, numThreads = 1024; + +void printArray(int n, int * a) +{ + printf("\n"); + for(int i=0; i>>(n, p, dev_idata); + } + + //downSweep + updateArray<<<1,1>>>(n-1, 0, dev_idata); + + i = ilog2ceil(n)-1; + p = 1; + p <<= i; + + for(; i>=0; --i) + { + downSweep<<>>(n, p, dev_idata); + p >>= 1; + } + + cudaEventRecord(stop); + cudaEventSynchronize(stop); + float milliseconds = 0; + cudaEventElapsedTime(&milliseconds, start, stop); + if(SHOW_TIMING) + std::cout<<"Total time in milliseconds : "<>>(n, dev_temp, dev_idata); + + scan(n, dev_scanData, dev_temp); + + setK<<<1,1>>>(dev_k, dev_scanData, n-1); + + cudaMemcpy(k, dev_k, sizeof(int), cudaMemcpyDeviceToHost); + + cudaMalloc((void**)&dev_odata, (*k) * sizeof(int)); + + StreamCompaction::Common::kernScatter<<>>(n, dev_odata, dev_idata, dev_temp, dev_scanData); + + cudaMemcpy(odata, dev_odata, (*k) * sizeof(int), cudaMemcpyDeviceToHost); + + cudaFree(dev_idata); + cudaFree(dev_temp); + cudaFree(dev_scanData); + cudaFree(dev_k); + + return (*k); + } } diff --git a/stream_compaction/naive.cu b/stream_compaction/naive.cu index 3d86b60..bcf4675 100644 --- a/stream_compaction/naive.cu +++ b/stream_compaction/naive.cu @@ -2,18 +2,64 @@ #include #include "common.h" #include "naive.h" +#include +#include namespace StreamCompaction { namespace Naive { -// TODO: __global__ +#define SHOW_TIMING 0 +int numBlocks, numThreads = 256; + +int * dev_odata; +int * dev_idata; + + __global__ void scanStep(int n, int jump, int *odata, 
int *idata) + { + int index = threadIdx.x + (blockIdx.x * blockDim.x); + + if(index >= jump && index < n) + { + odata[index] = idata[index] + idata[index - jump]; + } + } /** * Performs prefix-sum (aka scan) on idata, storing the result into odata. */ void scan(int n, int *odata, const int *idata) { - // TODO - printf("TODO\n"); + + cudaMalloc((void**)&dev_odata, n * sizeof(int)); + cudaMalloc((void**)&dev_idata, n * sizeof(int)); + + cudaMemcpy(dev_odata, idata, n * sizeof(int), cudaMemcpyHostToDevice); + cudaMemcpy(dev_idata, idata, n * sizeof(int), cudaMemcpyHostToDevice); + numBlocks = n / numThreads + 1; + + cudaEvent_t start, stop; + cudaEventCreate(&start); + cudaEventCreate(&stop); + cudaEventRecord(start); + + int p=1; + for(int i=1; i< ilog2ceil(n)+1; ++i) + { + scanStep<<>>(n, p, dev_odata, dev_idata); + p <<= 1; + cudaMemcpy(dev_idata, dev_odata, n * sizeof(int), cudaMemcpyDeviceToDevice); + } + + cudaEventRecord(stop); + cudaEventSynchronize(stop); + float milliseconds = 0; + cudaEventElapsedTime(&milliseconds, start, stop); + if(SHOW_TIMING) + std::cout<<"Total time in milliseconds : "< +#include +#include "common.h" +#include "radixSort.h" +#include "efficient.h" + +#define TEST 0 +int numBlocks, numThread = 1024; + +namespace RadixSort { + +void printArray(int size, int * a) +{ + printf("\n"); + for(int i=0; i>>(n, i, dev_b, dev_idata); + if(TEST) + { + cudaMemcpy(hst_temp, dev_b, n*sizeof(int), cudaMemcpyDeviceToHost); + printArray(n, hst_temp); + } + + //Create e + cudaMalloc((void**)&dev_e, n * sizeof(int)); + createEArray<<>>(n, dev_e, dev_b); + if(TEST) + { + cudaMemcpy(hst_temp, dev_e, n*sizeof(int), cudaMemcpyDeviceToHost); + printArray(n, hst_temp); + } + + //Create f by using efficient scan + cudaMalloc((void**)&dev_f, n * sizeof(int)); + StreamCompaction::Efficient::scan(n, dev_f, dev_e); + if(TEST) + { + cudaMemcpy(hst_temp, dev_f, n*sizeof(int), cudaMemcpyDeviceToHost); + printArray(n, hst_temp); + } + + //Finding total false + int *dev_totalFalse; + cudaMalloc((void**)&dev_totalFalse, sizeof(int)); + getTotalFalse<<<1, 1>>>(n-1, dev_totalFalse, dev_f, dev_e); + if(TEST) + { + cudaMemcpy(hst_temp, dev_totalFalse, sizeof(int), cudaMemcpyDeviceToHost); + printf("\n%d %d\n", hst_temp[0], n-1); + } + + //Create t + cudaMalloc((void**)&dev_t, n * sizeof(int)); + createTArray<<>>(n, dev_t, dev_f, dev_totalFalse); + if(TEST) + { + cudaMemcpy(hst_temp, dev_t, n*sizeof(int), cudaMemcpyDeviceToHost); + printArray(n, hst_temp); + } + + //Create d + cudaMalloc((void**)&dev_d, n * sizeof(int)); + createDArray<<>>(n, dev_d, dev_b, dev_t, dev_f); + if(TEST) + { + cudaMemcpy(hst_temp, dev_d, n*sizeof(int), cudaMemcpyDeviceToHost); + printArray(n, hst_temp); + } + + //Shuffle + scatter<<>>(n, dev_odata, dev_idata, dev_d); + if(TEST) + { + cudaMemcpy(hst_temp, dev_odata, n*sizeof(int), cudaMemcpyDeviceToHost); + printArray(n, hst_temp); + } + + cudaFree(dev_b); + cudaFree(dev_e); + cudaFree(dev_f); + cudaFree(dev_t); + cudaFree(dev_d); + cudaFree(dev_totalFalse); + } + + void sort(int n, int maxValue, int *odata, const int *idata) + { + int i = 1, + *dev_idata, + *dev_odata; + + numBlocks = n / numThread + 1; + + cudaMalloc((void**)&dev_odata, n * sizeof(int)); + cudaMalloc((void**)&dev_idata, n * sizeof(int)); + + cudaMemcpy(dev_idata, idata, n * sizeof(int), cudaMemcpyHostToDevice); + + while(i <= maxValue) + { + split(n, i, dev_odata, dev_idata); + + cudaMemcpy(dev_idata, dev_odata, n * sizeof(int), cudaMemcpyDeviceToDevice); + + i<<=1; + } + + cudaMemcpy(odata, dev_odata, n * 
sizeof(int), cudaMemcpyDeviceToHost); + + cudaFree(dev_idata); + cudaFree(dev_odata); + + } +} diff --git a/stream_compaction/radixSort.h b/stream_compaction/radixSort.h new file mode 100644 index 0000000..9ddca1a --- /dev/null +++ b/stream_compaction/radixSort.h @@ -0,0 +1,7 @@ +#pragma once + +namespace RadixSort { + + void sort(int n, int maxValue, int *odata, const int *idata); + +} diff --git a/stream_compaction/thrust.cu b/stream_compaction/thrust.cu index d8dbb32..a668181 100644 --- a/stream_compaction/thrust.cu +++ b/stream_compaction/thrust.cu @@ -9,6 +9,8 @@ namespace StreamCompaction { namespace Thrust { +#define SHOW_TIMING 0 + /** * Performs prefix-sum (aka scan) on idata, storing the result into odata. */ @@ -16,6 +18,27 @@ void scan(int n, int *odata, const int *idata) { // TODO use `thrust::exclusive_scan` // example: for device_vectors dv_in and dv_out: // thrust::exclusive_scan(dv_in.begin(), dv_in.end(), dv_out.begin()); + + thrust::host_vector thrustHst_idata(idata, idata+n); + + thrust::device_vector thrustDev_idata(thrustHst_idata); + thrust::device_vector thrustDev_odata(n); + + cudaEvent_t start, stop; + cudaEventCreate(&start); + cudaEventCreate(&stop); + cudaEventRecord(start); + + thrust::exclusive_scan(thrustDev_idata.begin(), thrustDev_idata.end(), thrustDev_odata.begin()); + + cudaEventRecord(stop); + cudaEventSynchronize(stop); + float milliseconds = 0; + cudaEventElapsedTime(&milliseconds, start, stop); + if(SHOW_TIMING) + std::cout<<"Total time in milliseconds : "<
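The README's analysis attributes the work-efficient scan's cost to memory access in the kernels. A related cost in tree-based scans is that a kernel launched with one thread per array element does useful work in only a shrinking fraction of those threads at each level. One common refinement, not part of this patch, is to index threads by tree node so that each level launches only as many threads as it has active nodes. A rough, untested sketch, assuming the array length has been padded to a power of two as the assignment requires:

```cuda
// Hypothetical stride-indexed kernels: thread k handles the k-th active node
// at this level instead of one fixed array element.
__global__ void upSweepByNode(int n, int stride, int *data) {
    int k = threadIdx.x + (blockIdx.x * blockDim.x);
    int idx = (k + 1) * 2 * stride - 1;      // right child of node k at this level
    if (idx < n) {
        data[idx] += data[idx - stride];
    }
}

__global__ void downSweepByNode(int n, int stride, int *data) {
    int k = threadIdx.x + (blockIdx.x * blockDim.x);
    int idx = (k + 1) * 2 * stride - 1;
    if (idx < n) {
        int t = data[idx - stride];          // old left child
        data[idx - stride] = data[idx];      // parent value goes down-left
        data[idx] += t;                      // parent + old left goes down-right
    }
}
```

The per-level launch then shrinks with the stride, e.g. `(n / (2 * stride) + blockSize - 1) / blockSize` blocks, while the per-level memory accesses stay the same as in the up-sweep/down-sweep pattern used here.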
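The b/e/f/t/d buffers built by the kernels in `radixSort.cu` follow the split step from the GPU Gems chapter cited in the README. As a reading aid, and as a possible host-side check to diff against the device buffers, here is a minimal CPU sketch of one split pass; the function name and signature are illustrative and are not part of the patch.

```cpp
#include <vector>

// Host-side reference for one radix "split" pass on bit mask `bit`; `out` must
// already have size n. Keys whose bit is 0 keep their relative order at the
// front, keys whose bit is 1 move behind them.
void splitReference(int n, int bit, std::vector<int> &out, const std::vector<int> &in) {
    std::vector<int> b(n), e(n), f(n);
    for (int i = 0; i < n; ++i) {
        b[i] = (in[i] & bit) ? 1 : 0;  // 1 where the current bit is set
        e[i] = 1 - b[i];               // 1 where the current bit is clear ("false")
    }
    f[0] = 0;                          // exclusive scan of e
    for (int i = 1; i < n; ++i) {
        f[i] = f[i - 1] + e[i - 1];
    }
    int totalFalse = f[n - 1] + e[n - 1];
    for (int i = 0; i < n; ++i) {
        int t = i - f[i] + totalFalse; // destination when the bit is set
        int d = b[i] ? t : f[i];       // final scatter address
        out[d] = in[i];
    }
}
```

Running this with bit masks 1, 2, 4, and so on up to the value bound (the same loop `RadixSort::sort` drives with `i <<= 1`) gives a complete LSD radix sort on the host that can be compared element by element with the GPU output.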