diff --git a/README.md b/README.md index a82ea0f..2d14392 100644 --- a/README.md +++ b/README.md @@ -1,213 +1,118 @@ -CUDA Stream Compaction -====================== +# CUDA Stream Compaction **University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 2** -* (TODO) YOUR NAME HERE -* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab) +Terry Sun; Arch Linux, Intel i5-4670, GTX 750 -### (TODO: Your README) +## Library -Include analysis, etc. (Remember, this is public, so don't put -anything here that you don't want to share with the world.) +This project contains a `stream_compaction` library and some associated tests. -Instructions (delete me) -======================== +`CPU`: A CPU implementation of `scan` and `scatter`, for reference and +performance comparisons. Runs in O(n) / O(n) adds. -This is due Sunday, September 13 at midnight. - -**Summary:** In this project, you'll implement GPU stream compaction in CUDA, -from scratch. This algorithm is widely used, and will be important for -accelerating your path tracer project. - -Your stream compaction implementations in this project will simply remove `0`s -from an array of `int`s. In the path tracer, you will remove terminated paths -from an array of rays. - -In addition to being useful for your path tracer, this project is meant to -reorient your algorithmic thinking to the way of the GPU. On GPUs, many -algorithms can benefit from massive parallelism and, in particular, data -parallelism: executing the same code many times simultaneously with different -data. - -You'll implement a few different versions of the *Scan* (*Prefix Sum*) -algorithm. First, you'll implement a CPU version of the algorithm to reinforce -your understanding. Then, you'll write a few GPU implementations: "naive" and -"work-efficient." Finally, you'll use some of these to implement GPU stream -compaction. 
- -**Algorithm overview & details:** There are two primary references for details -on the implementation of scan and stream compaction. - -* The [slides on Parallel Algorithms](https://github.com/CIS565-Fall-2015/cis565-fall-2015.github.io/raw/master/lectures/2-Parallel-Algorithms.pptx) - for Scan, Stream Compaction, and Work-Efficient Parallel Scan. -* GPU Gems 3, Chapter 39 - [Parallel Prefix Sum (Scan) with CUDA](http://http.developer.nvidia.com/GPUGems3/gpugems3_ch39.html). - -Your GPU stream compaction implementation will live inside of the -`stream_compaction` subproject. This way, you will be able to easily copy it -over for use in your GPU path tracer. - - -## Part 0: The Usual - -This project (and all other CUDA projects in this course) requires an NVIDIA -graphics card with CUDA capability. Any card with Compute Capability 2.0 -(`sm_20`) or greater will work. Check your GPU on this -[compatibility table](https://developer.nvidia.com/cuda-gpus). -If you do not have a personal machine with these specs, you may use those -computers in the Moore 100B/C which have supported GPUs. - -**HOWEVER**: If you need to use the lab computer for your development, you will -not presently be able to do GPU performance profiling. This will be very -important for debugging performance bottlenecks in your program. - -### Useful existing code - -* `stream_compaction/common.h` - * `checkCUDAError` macro: checks for CUDA errors and exits if there were any. - * `ilog2ceil(x)`: computes the ceiling of log2(x), as an integer. -* `main.cpp` - * Some testing code for your implementations. - - -## Part 1: CPU Scan & Stream Compaction - -This stream compaction method will remove `0`s from an array of `int`s. - -In `stream_compaction/cpu.cu`, implement: - -* `StreamCompaction::CPU::scan`: compute an exclusive prefix sum. -* `StreamCompaction::CPU::compactWithoutScan`: stream compaction without using - the `scan` function. 
-* `StreamCompaction::CPU::compactWithScan`: stream compaction using the `scan` - function. Map the input array to an array of 0s and 1s, scan it, and use - scatter to produce the output. You will need a **CPU** scatter implementation - for this (see slides or GPU Gems chapter for an explanation). - -These implementations should only be a few lines long. - - -## Part 2: Naive GPU Scan Algorithm - -In `stream_compaction/naive.cu`, implement `StreamCompaction::Naive::scan` - -This uses the "Naive" algorithm from GPU Gems 3, Section 39.2.1. We haven't yet -taught shared memory, and you **shouldn't use it yet**. Example 39-1 uses -shared memory, but is limited to operating on very small arrays! Instead, write -this using global memory only. As a result of this, you will have to do -`ilog2ceil(n)` separate kernel invocations. - -Beware of errors in Example 39-1 in the book; both the pseudocode and the CUDA -code in the online version of Chapter 39 are known to have a few small errors -(in superscripting, missing braces, bad indentation, etc.) - -Since the parallel scan algorithm operates on a binary tree structure, it works -best with arrays with power-of-two length. Make sure your implementation works -on non-power-of-two sized arrays (see `ilog2ceil`). This requires extra memory -- your intermediate array sizes will need to be rounded to the next power of -two. - - -## Part 3: Work-Efficient GPU Scan & Stream Compaction - -### 3.1. Scan - -In `stream_compaction/efficient.cu`, implement -`StreamCompaction::Efficient::scan` - -All of the text in Part 2 applies. - -* This uses the "Work-Efficient" algorithm from GPU Gems 3, Section 39.2.2. -* Beware of errors in Example 39-2. -* Test non-power-of-two sized arrays. - -### 3.2. Stream Compaction - -This stream compaction method will remove `0`s from an array of `int`s. 
- -In `stream_compaction/efficient.cu`, implement -`StreamCompaction::Efficient::compact` - -For compaction, you will also need to implement the scatter algorithm presented -in the slides and the GPU Gems chapter. - -In `stream_compaction/common.cu`, implement these for use in `compact`: - -* `StreamCompaction::Common::kernMapToBoolean` -* `StreamCompaction::Common::kernScatter` - - -## Part 4: Using Thrust's Implementation - -In `stream_compaction/thrust.cu`, implement: - -* `StreamCompaction::Thrust::scan` - -This should be a very short function which wraps a call to the Thrust library -function `thrust::exclusive_scan(first, last, result)`. - -To measure timing, be sure to exclude memory operations by passing -`exclusive_scan` a `thrust::device_vector` (which is already allocated on the -GPU). You can create a `thrust::device_vector` by creating a -`thrust::host_vector` from the given pointer, then casting it. - - -## Part 5: Radix Sort (Extra Credit) (+10) - -Add an additional module to the `stream_compaction` subproject. Implement radix -sort using one of your scan implementations. Add tests to check its correctness. - - -## Write-up - -1. Update all of the TODOs at the top of this README. -2. Add a description of this project including a list of its features. -3. Add your performance analysis (see below). - -All extra credit features must be documented in your README, explaining its -value (with performance comparison, if applicable!) and showing an example how -it works. For radix sort, show how it is called and an example of its output. - -Always profile with Release mode builds and run without debugging. - -### Questions - -* Roughly optimize the block sizes of each of your implementations for minimal - run time on your GPU. - * (You shouldn't compare unoptimized implementations to each other!) - -* Compare all of these GPU Scan implementations (Naive, Work-Efficient, and - Thrust) to the serial CPU version of Scan. 
Plot a graph of the comparison
-  (with array size on the independent axis).
-  * You should use CUDA events for timing. Be sure **not** to include any
-    explicit memory operations in your performance measurements, for
-    comparability.
-  * To guess at what might be happening inside the Thrust implementation, take
-    a look at the Nsight timeline for its execution.
-
-* Write a brief explanation of the phenomena you see here.
-  * Can you find the performance bottlenecks? Is it memory I/O? Computation? Is
-    it different for each implementation?
-
-* Paste the output of the test program into a triple-backtick block in your
-  README.
-  * If you add your own tests (e.g. for radix sort or to test additional corner
-    cases), be sure to mention it explicitly.
-
-These questions should help guide you in performance analysis on future
-assignments, as well.
-
-## Submit
-
-If you have modified any of the `CMakeLists.txt` files at all (aside from the
-list of `SOURCE_FILES`), you must test that your project can build in Moore
-100B/C. Beware of any build issues discussed on the Google Group.
-
-1. Open a GitHub pull request so that we can see that you have finished.
-   The title should be "Submission: YOUR NAME".
-2. Send an email to the TA (gmail: kainino1+cis565@) with:
-   * **Subject**: in the form of `[CIS565] Project 2: PENNKEY`
-   * Direct link to your pull request on GitHub
-   * In the form of a grade (0-100+) with comments, evaluate your own
-     performance on the project.
-   * Feedback on the project itself, if any.
+`Naive`: A naive (non-work-efficient) implementation of `scan`, performing
+O(n log n) adds over log n kernel iterations.
+
+`Efficient`: A work-efficient implementation of `scan` and `compact`. Also
+contains `dv_scan`, the actual in-place scan implementation, which takes a
+device memory pointer directly (useful for other CUDA functions that need scan,
+bypassing the host-memory pointers that `Efficient::scan`
+would take).
Performs O(n) adds over 2 log n kernel iterations.
+
+`Common`:
+* `kernMapToBoolean`, used as the filtering predicate in `Efficient::compact`.
+* `kernScatter`, used in `Efficient::compact` and `Radix::sort`.
+
+`Radix`: `sort` is so close to working but... doesn't work :(
+
+## Performance Analysis
+
+I did performance analysis with `cudaEvent_t`s for the GPU algorithm
+implementations and `std::chrono` for the CPU implementations. As before, code
+for this can be found on the `performance` branch (to avoid cluttering the main
+codebase). Raw data (csv format) can be found in `data/`.
+
+### Some fun charts
+
+Measuring the performance of scan with a block size of 128 (where applicable).
+
+![](data/scan_perf_zoomed_out.png)
+
+I cut the top of the CPU line off and my chart is still horribly skewed. Let's
+try again:
+
+![](data/scan_perf_zoomed_in.png)
+
+Interestingly, the sharp(ish) increase in `thrust::scan` around N = 2^14 is
+consistent between runs. Maybe it has to do with an increase in memory
+allocation around that size.
+
+`Naive` performs about twice as well as `Efficient`, which makes sense: the
+work-efficient scan launches twice as many kernel iterations. I suspect a
+smarter method of spawning threads (creating only as many as are needed instead
+of creating 2^N every time and using only a subset) would improve the efficient
+algorithm, as more threads would then execute the exact same instruction
+sequence. That gain should be larger for `Efficient` than for `Naive`, since
+`Efficient` runs more iterations but needs fewer active threads in each one --
+which is exactly why a work-efficient algorithm is preferable. (I was planning
+on testing this but -- as you can see -- I ran out of time.)
+
+There is a small device-to-host memory copy in
+`Efficient::scan` -- I don't know if it has an appreciable impact, since it only
+needs to copy `sizeof(int)`.
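For reference, the up-sweep/down-sweep structure that the work-efficient scan parallelizes can be sketched sequentially (plain C++ rather than CUDA; power-of-two length assumed, and the function name here is illustrative, not part of the library):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sequential sketch of the work-efficient (Blelloch) exclusive scan:
// an up-sweep (reduce) phase followed by a down-sweep phase, for
// 2 * log2(n) passes total. n must be a power of two here.
std::vector<int> blellochScan(std::vector<int> data) {
    const std::size_t n = data.size();
    assert(n > 0 && (n & (n - 1)) == 0);  // power-of-two length
    // Up-sweep: build partial sums in place.
    for (std::size_t stride = 1; stride < n; stride *= 2) {
        for (std::size_t k = 0; k < n; k += 2 * stride) {
            data[k + 2 * stride - 1] += data[k + stride - 1];
        }
    }
    // Down-sweep: zero the root, then push prefix sums back down.
    data[n - 1] = 0;
    for (std::size_t stride = n / 2; stride >= 1; stride /= 2) {
        for (std::size_t k = 0; k < n; k += 2 * stride) {
            int t = data[k + stride - 1];
            data[k + stride - 1] = data[k + 2 * stride - 1];
            data[k + 2 * stride - 1] += t;
        }
    }
    return data;
}
```

Each `stride` level of the two loops corresponds to one kernel launch on the GPU, which is where the 2 log n iterations come from.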
`Efficient::compact` has even more memory copying +to retrieve the size of the compacted stream. + +![](data/gpu_by_block_size.png) + +Tested on an array size of 2^16. `Naive::scan` and `Efficient::scan` are both +roughly optimal at a block size of 128. + +The performance of `Efficient::compact` is dominated by `Efficient::scan`. The +only other computation happening in `compact` is `kernMapToBoolean` and +`kernScatter`, both of which are constant (in fact, 1 operation per thread), and +memory copying (see above). + +Compact performance goes much the same way, to nobody's surprise: + +![](data/compact_by_array_size.png) + + +## Test output + +``` +**************** +** SCAN TESTS ** +**************** + [ 33 36 27 15 43 35 36 42 49 21 12 27 40 ... 28 0 ] +==== cpu scan, power-of-two ==== + [ 0 33 69 96 111 154 189 225 267 316 337 349 376 ... 6371 6399 ] +==== cpu scan, non-power-of-two ==== + [ 0 33 69 96 111 154 189 225 267 316 337 349 376 ... 6329 6330 ] + passed +==== naive scan, power-of-two ==== + [ 0 33 69 96 111 154 189 225 267 316 337 349 376 ... 6371 6399 ] + passed +==== naive scan, non-power-of-two ==== + passed +==== work-efficient scan, power-of-two ==== + [ 0 33 69 96 111 154 189 225 267 316 337 349 376 ... 6371 6399 ] + passed +==== work-efficient scan, non-power-of-two ==== + [ 0 33 69 96 111 154 189 225 267 316 337 349 376 ... 6329 6330 ] + passed +==== thrust scan, power-of-two ==== + passed +==== thrust scan, non-power-of-two ==== + passed + +***************************** +** STREAM COMPACTION TESTS ** +***************************** + [ 1 0 1 1 1 1 0 0 1 1 0 1 0 ... 
0 0 ] +==== work-efficient compact, power-of-two ==== + passed +==== work-efficient compact, non-power-of-two ==== + passed +``` diff --git a/data/compact_by_array_size.png b/data/compact_by_array_size.png new file mode 100644 index 0000000..d426344 Binary files /dev/null and b/data/compact_by_array_size.png differ diff --git a/data/cpu_by_arr_size.csv b/data/cpu_by_arr_size.csv new file mode 100644 index 0000000..ee8a805 --- /dev/null +++ b/data/cpu_by_arr_size.csv @@ -0,0 +1,21 @@ +size, scan, compactWithoutScan, compactWithScan +0, 0.018390, 0.017990, 0.062040 +1, 0.036900, 0.036660, 0.125410 +2, 0.056000, 0.056050, 0.191390 +3, 0.075810, 0.077990, 0.260930 +4, 0.098610, 0.103510, 0.342300 +5, 0.130480, 0.136290, 0.460860 +6, 0.176600, 0.193570, 0.614170 +7, 0.249980, 0.288220, 0.837180 +8, 0.379200, 0.451990, 1.180500 +9, 0.595360, 0.787670, 1.739130 +10, 1.010410, 1.459940, 2.873860 +11, 1.808420, 2.799750, 4.983760 +12, 3.382930, 5.454100, 9.189020 +13, 6.617830, 10.682090, 17.652109 +14, 13.106260, 20.930850, 34.538570 +15, 26.239680, 41.440238, 95.065141 +16, 53.197820, 82.259148, 232.273063 +17, 106.942953, 163.591766, 524.566375 +18, 214.327141, 326.009719, 1154.182750 +19, 444.831375, 655.437125, 2664.925000 diff --git a/data/gpu_by_array_size.csv b/data/gpu_by_array_size.csv new file mode 100644 index 0000000..a9dd910 --- /dev/null +++ b/data/gpu_by_array_size.csv @@ -0,0 +1,17 @@ +block size, naive::scan, efficient::scan, efficient::compact, thrust::scan +4, 0.019901, 0.032131, 0.042268, 0.014861 +5, 0.022966, 0.039091, 0.048816, 0.013974 +6, 0.026545, 0.046097, 0.055637, 0.013844 +7, 0.029734, 0.052625, 0.062891, 0.014107 +8, 0.032616, 0.059878, 0.069612, 0.013777 +9, 0.035588, 0.066927, 0.077829, 0.014007 +10, 0.039123, 0.074742, 0.085684, 0.013909 +11, 0.042336, 0.084190, 0.094274, 0.017210 +12, 0.049155, 0.093326, 0.103361, 0.018924 +13, 0.051340, 0.112999, 0.124146, 0.025679 +14, 0.066044, 0.153436, 0.166579, 0.040478 +15, 0.092430, 0.233036, 
0.248662, 0.040362 +16, 0.145964, 0.397543, 0.418071, 0.049354 +17, 0.249689, 0.733529, 0.763885, 0.072880 +18, 0.521292, 1.435356, 1.482930, 0.095224 +19, 1.806152, 3.125029, 3.233285, 0.178910 diff --git a/data/gpu_by_block_size.csv b/data/gpu_by_block_size.csv new file mode 100644 index 0000000..3bf25a8 --- /dev/null +++ b/data/gpu_by_block_size.csv @@ -0,0 +1,10 @@ +block size, naive::scan, efficient::scan, efficient::compact, thrust::scan +2, 1.705963, 3.371424, 3.477096, 0.152979 +3, 0.869121, 1.678385, 1.735070, 0.153120 +4, 0.475849, 0.892212, 0.926461, 0.156361 +5, 0.268811, 0.502830, 0.531393, 0.159694 +6, 0.156100, 0.404411, 0.432535, 0.163947 +7, 0.144814, 0.393823, 0.414350, 0.167775 +8, 0.146012, 0.393572, 0.414207, 0.172095 +9, 0.148004, 0.393960, 0.411924, 0.175363 +10, 0.163155, 0.406132, 0.423445, 0.179432 diff --git a/data/gpu_by_block_size.png b/data/gpu_by_block_size.png new file mode 100644 index 0000000..03565ab Binary files /dev/null and b/data/gpu_by_block_size.png differ diff --git a/data/scan_perf_zoomed_in.png b/data/scan_perf_zoomed_in.png new file mode 100644 index 0000000..d9f63ac Binary files /dev/null and b/data/scan_perf_zoomed_in.png differ diff --git a/data/scan_perf_zoomed_out.png b/data/scan_perf_zoomed_out.png new file mode 100644 index 0000000..5214b17 Binary files /dev/null and b/data/scan_perf_zoomed_out.png differ diff --git a/src/main.cpp b/src/main.cpp index 675da35..90ef5cf 100644 --- a/src/main.cpp +++ b/src/main.cpp @@ -6,20 +6,148 @@ * @copyright University of Pennsylvania */ +#include #include + #include #include #include #include +#include #include "testing_helpers.hpp" +#define BENCHMARK + +void benchmarkCPU() { + const int iterations = 100; + int totalScan = 0; + int totalCompactWithout = 0; + int totalCompactWith = 0; + printf("size, scan, compactWithoutScan, compactWithScan\n"); + for (int s = 4; s < 20; s++) { + int SIZE = 1 << s; + int a[SIZE]; + int b[SIZE]; + + for (int i = 0; i < iterations; i++) { + 
genArray(SIZE - 1, a, 50);  // Leave a 0 at the end to test that edge case
+            a[SIZE - 1] = 0;
+
+            auto begin = std::chrono::high_resolution_clock::now();
+            StreamCompaction::CPU::scan(SIZE, b, a);
+            auto end = std::chrono::high_resolution_clock::now();
+            int diff = std::chrono::duration_cast<std::chrono::microseconds>(end-begin).count();
+            totalScan += diff;
+
+            begin = std::chrono::high_resolution_clock::now();
+            StreamCompaction::CPU::compactWithoutScan(SIZE, b, a);
+            end = std::chrono::high_resolution_clock::now();
+            diff = std::chrono::duration_cast<std::chrono::microseconds>(end-begin).count();
+            totalCompactWithout += diff;
+
+            begin = std::chrono::high_resolution_clock::now();
+            StreamCompaction::CPU::compactWithScan(SIZE, b, a);
+            end = std::chrono::high_resolution_clock::now();
+            diff = std::chrono::duration_cast<std::chrono::microseconds>(end-begin).count();
+            totalCompactWith += diff;
+        }
+        printf("%d, %f, %f, %f\n", s,
+                (float)totalScan / iterations / 1000.0,
+                (float)totalCompactWithout / iterations / 1000.0,
+                (float)totalCompactWith / iterations / 1000.0
+              );
+    }
+}
+
+void benchmarkGPUBlockSize() {
+    const int iterations = 100;
+    printf("block size, naive::scan, efficient::scan, efficient::compact\n");
+    int SIZE = 1 << 16;
+    int a[SIZE];
+    int b[SIZE];
+    for (int block = 2; block < 11; block++) {
+        int blockSize = 1 << block;
+
+        float totalNaive = 0;
+        float totalEfficientScan = 0;
+        float totalEfficientCompact = 0;
+        for (int i = 0; i < iterations; i++) {
+            genArray(SIZE - 1, a, 50);
+            a[SIZE - 1] = 0;
+            zeroArray(SIZE, b);
+
+            float timeElapsed;
+
+            StreamCompaction::Naive::scan(SIZE, b, a, &timeElapsed, blockSize);
+            totalNaive += timeElapsed;
+
+            StreamCompaction::Efficient::scan(SIZE, b, a, &timeElapsed, blockSize);
+            totalEfficientScan += timeElapsed;
+
+            StreamCompaction::Efficient::compact(SIZE, b, a, &timeElapsed, blockSize);
+            totalEfficientCompact += timeElapsed;
+        }
+        printf("%d, %f, %f, %f\n", block,
+                totalNaive / iterations,
+                totalEfficientScan / iterations,
+                totalEfficientCompact / iterations
+              );
+    }
+}
+
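(Aside: the three timing blocks in `benchmarkCPU` repeat the same `std::chrono` pattern; it could be factored into a small helper. The sketch below is illustrative only -- the helper name `timeMicros` is not part of the project's code.)

```cpp
#include <chrono>
#include <utility>

// Sketch of the std::chrono timing pattern used in benchmarkCPU: run a
// callable once and return the elapsed wall-clock time in microseconds.
// The name `timeMicros` is hypothetical.
template <typename F>
long long timeMicros(F&& fn) {
    auto begin = std::chrono::high_resolution_clock::now();
    std::forward<F>(fn)();  // code under test
    auto end = std::chrono::high_resolution_clock::now();
    return std::chrono::duration_cast<std::chrono::microseconds>(end - begin)
            .count();
}
```

Usage would then be, e.g., `totalScan += timeMicros([&] { StreamCompaction::CPU::scan(SIZE, b, a); });`.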
+void benchmarkGPUArraySize() { + const int iterations = 100; + printf("block size, naive::scan, efficient::scan, efficient::compact, thrust::scan\n"); + for (int s = 4; s < 20; s++) { + int SIZE = 1 << s; + int a[SIZE]; + int b[SIZE]; + + int blockSize = 1 << 7; + + float totalNaive = 0; + float totalEfficientScan = 0; + float totalEfficientCompact = 0; + float totalThrust = 0; + for (int i = 0; i < iterations; i++) { + genArray(SIZE - 1, a, 50); + a[SIZE - 1] = 0; + zeroArray(SIZE, b); + + float timeElapsed; + + StreamCompaction::Naive::scan(SIZE, b, a, &timeElapsed, blockSize); + totalNaive += timeElapsed; + + StreamCompaction::Efficient::scan(SIZE, b, a, &timeElapsed, blockSize); + totalEfficientScan += timeElapsed; + + StreamCompaction::Efficient::compact(SIZE, b, a, &timeElapsed, blockSize); + totalEfficientCompact += timeElapsed; + + StreamCompaction::Thrust::scan(SIZE, b, a, &timeElapsed); + totalThrust += timeElapsed; + } + printf("%d, %f, %f, %f, %f\n", s, + totalNaive / iterations, + totalEfficientScan / iterations, + totalEfficientCompact / iterations, + totalThrust / iterations + ); + } +} + int main(int argc, char* argv[]) { +#ifdef BENCHMARK + benchmarkCPU(); + benchmarkGPUBlockSize(); + benchmarkGPUArraySize(); +#else const int SIZE = 1 << 8; - const int NPOT = SIZE - 3; + //const int SIZE = 4; + const int NPOT_SIZE = SIZE - 3; int a[SIZE], b[SIZE], c[SIZE]; - // Scan tests - printf("\n"); printf("****************\n"); printf("** SCAN TESTS **\n"); @@ -36,33 +164,33 @@ int main(int argc, char* argv[]) { zeroArray(SIZE, c); printDesc("cpu scan, non-power-of-two"); - StreamCompaction::CPU::scan(NPOT, c, a); - printArray(NPOT, b, true); - printCmpResult(NPOT, b, c); + StreamCompaction::CPU::scan(NPOT_SIZE, c, a); + printArray(NPOT_SIZE, b, true); + printCmpResult(NPOT_SIZE, b, c); zeroArray(SIZE, c); printDesc("naive scan, power-of-two"); StreamCompaction::Naive::scan(SIZE, c, a); - //printArray(SIZE, c, true); + printArray(SIZE, c, true); 
printCmpResult(SIZE, b, c); zeroArray(SIZE, c); printDesc("naive scan, non-power-of-two"); - StreamCompaction::Naive::scan(NPOT, c, a); + StreamCompaction::Naive::scan(NPOT_SIZE, c, a); //printArray(SIZE, c, true); - printCmpResult(NPOT, b, c); + printCmpResult(NPOT_SIZE, b, c); zeroArray(SIZE, c); printDesc("work-efficient scan, power-of-two"); StreamCompaction::Efficient::scan(SIZE, c, a); - //printArray(SIZE, c, true); + printArray(SIZE, c, true); printCmpResult(SIZE, b, c); zeroArray(SIZE, c); printDesc("work-efficient scan, non-power-of-two"); - StreamCompaction::Efficient::scan(NPOT, c, a); - //printArray(NPOT, c, true); - printCmpResult(NPOT, b, c); + StreamCompaction::Efficient::scan(NPOT_SIZE, c, a); + printArray(NPOT_SIZE, c, true); + printCmpResult(NPOT_SIZE, b, c); zeroArray(SIZE, c); printDesc("thrust scan, power-of-two"); @@ -72,42 +200,40 @@ int main(int argc, char* argv[]) { zeroArray(SIZE, c); printDesc("thrust scan, non-power-of-two"); - StreamCompaction::Thrust::scan(NPOT, c, a); - //printArray(NPOT, c, true); - printCmpResult(NPOT, b, c); + StreamCompaction::Thrust::scan(NPOT_SIZE, c, a); + //printArray(NPOT_SIZE, c, true); + printCmpResult(NPOT_SIZE, b, c); printf("\n"); printf("*****************************\n"); printf("** STREAM COMPACTION TESTS **\n"); printf("*****************************\n"); - // Compaction tests - - genArray(SIZE - 1, a, 4); // Leave a 0 at the end to test that edge case + genArray(SIZE - 1, a, 2); // Leave a 0 at the end to test that edge case a[SIZE - 1] = 0; printArray(SIZE, a, true); - int count, expectedCount, expectedNPOT; + int count, expectedCount, expectedNPOT_SIZE; zeroArray(SIZE, b); - printDesc("cpu compact without scan, power-of-two"); +// printDesc("cpu compact without scan, power-of-two"); count = StreamCompaction::CPU::compactWithoutScan(SIZE, b, a); expectedCount = count; - printArray(count, b, true); - printCmpLenResult(count, expectedCount, b, b); +// printArray(count, b, true); +// 
printCmpLenResult(count, expectedCount, b, b); zeroArray(SIZE, c); - printDesc("cpu compact without scan, non-power-of-two"); - count = StreamCompaction::CPU::compactWithoutScan(NPOT, c, a); - expectedNPOT = count; - printArray(count, c, true); - printCmpLenResult(count, expectedNPOT, b, c); +// printDesc("cpu compact without scan, non-power-of-two"); + count = StreamCompaction::CPU::compactWithoutScan(NPOT_SIZE, c, a); + expectedNPOT_SIZE = count; +// printArray(count, c, true); +// printCmpLenResult(count, expectedNPOT_SIZE, b, c); zeroArray(SIZE, c); - printDesc("cpu compact with scan"); +// printDesc("cpu compact with scan"); count = StreamCompaction::CPU::compactWithScan(SIZE, c, a); - printArray(count, c, true); - printCmpLenResult(count, expectedCount, b, c); +// printArray(count, c, true); +// printCmpLenResult(count, expectedCount, b, c); zeroArray(SIZE, c); printDesc("work-efficient compact, power-of-two"); @@ -117,7 +243,28 @@ int main(int argc, char* argv[]) { zeroArray(SIZE, c); printDesc("work-efficient compact, non-power-of-two"); - count = StreamCompaction::Efficient::compact(NPOT, c, a); + count = StreamCompaction::Efficient::compact(NPOT_SIZE, c, a); //printArray(count, c, true); - printCmpLenResult(count, expectedNPOT, b, c); + printCmpLenResult(count, expectedNPOT_SIZE, b, c); + + printf("\n"); + printf("**********************\n"); + printf("** RADIX SORT TESTS **\n"); + printf("**********************\n"); + + genArray(SIZE - 1, a, 3); // Leave a 0 at the end to test that edge case + a[0] = 0; + a[1] = 2; + a[2] = 3; + a[3] = 1; + //a = { 0, 1, 2, 3 }; + //a[SIZE - 1] = 0; + printArray(SIZE, a, true); + + zeroArray(SIZE, c); + printDesc("radix sort, power-of-two"); + StreamCompaction::Radix::sort(SIZE, c, a); + printArray(SIZE, c, true); + printArrayOrderResult(SIZE, c); +#endif } diff --git a/src/testing_helpers.hpp b/src/testing_helpers.hpp index f6b572f..4c93526 100644 --- a/src/testing_helpers.hpp +++ b/src/testing_helpers.hpp @@ -33,6 
+33,23 @@ void printCmpLenResult(int n, int expN, T *a, T *b) {
             cmpArrays(n, a, b) ? "FAIL VALUE" : "passed");
 }
 
+template <typename T>
+int testArrayOrder(int n, T *a) {
+    for (int i = 0; i < n-1; i++) {
+        if (a[i] < a[i+1]) {
+            printf("    (a[%d] = %d) < (a[%d] = %d)\n", i, a[i], i+1, a[i+1]);
+            return 1;
+        }
+    }
+    return 0;
+}
+
+template <typename T>
+void printArrayOrderResult(int n, T *a) {
+    printf("    %s \n",
+            testArrayOrder(n, a) ? "FAIL VALUE" : "passed");
+}
+
 void zeroArray(int n, int *a) {
     for (int i = 0; i < n; i++) {
         a[i] = 0;
diff --git a/stream_compaction/CMakeLists.txt b/stream_compaction/CMakeLists.txt
index cdbef77..bcc484e 100644
--- a/stream_compaction/CMakeLists.txt
+++ b/stream_compaction/CMakeLists.txt
@@ -9,6 +9,8 @@ set(SOURCE_FILES
     "efficient.cu"
     "thrust.h"
     "thrust.cu"
+    "radix.h"
+    "radix.cu"
     )
 
 cuda_add_library(stream_compaction
diff --git a/stream_compaction/common.cu b/stream_compaction/common.cu
index fe872d4..cf9b41f 100644
--- a/stream_compaction/common.cu
+++ b/stream_compaction/common.cu
@@ -23,16 +23,22 @@ namespace Common {
  * which map to 0 will be removed, and elements which map to 1 will be kept.
  */
 __global__ void kernMapToBoolean(int n, int *bools, const int *idata) {
-    // TODO
+    int k = (blockDim.x * blockIdx.x) + threadIdx.x;
+    if (k >= n) { return; }
+
+    bools[k] = (idata[k] != 0) ? 1 : 0;
 }
 
-/**
- * Performs scatter on an array. That is, for each element in idata,
- * if bools[idx] == 1, it copies idata[idx] to odata[indices[idx]].
- */
-__global__ void kernScatter(int n, int *odata,
-        const int *idata, const int *bools, const int *indices) {
-    // TODO
+__global__ void kernScatter(int n, int *odata, int *indices, int *idata) {
+    int k = (blockDim.x * blockIdx.x) + threadIdx.x;
+    if (k >= n) { return; }
+    if (k == n-1) {
+        // always take the last element
+        // `compact` will adjust size appropriately
+        odata[indices[k]] = idata[k];
+    } else if (indices[k] != indices[k+1]) {
+        odata[indices[k]] = idata[k];
+    }
 }
 }
diff --git a/stream_compaction/common.h b/stream_compaction/common.h
index 4f52663..73fcccd 100644
--- a/stream_compaction/common.h
+++ b/stream_compaction/common.h
@@ -29,7 +29,6 @@ namespace StreamCompaction {
 namespace Common {
     __global__ void kernMapToBoolean(int n, int *bools, const int *idata);
 
-    __global__ void kernScatter(int n, int *odata,
-            const int *idata, const int *bools, const int *indices);
+    __global__ void kernScatter(int n, int *odata, int *indices, int *idata);
 }
 }
diff --git a/stream_compaction/cpu.cu b/stream_compaction/cpu.cu
index e600c29..99c7880 100644
--- a/stream_compaction/cpu.cu
+++ b/stream_compaction/cpu.cu
@@ -8,8 +8,11 @@ namespace CPU {
  * CPU scan (prefix sum).
  */
 void scan(int n, int *odata, const int *idata) {
-    // TODO
-    printf("TODO\n");
+    int total = 0;
+    for (int i = 0; i < n; i++) {
+        odata[i] = total;
+        total += idata[i];
+    }
 }
 
 /**
@@ -18,8 +21,25 @@ void scan(int n, int *odata, const int *idata) {
  * @returns the number of elements remaining after compaction.
  */
 int compactWithoutScan(int n, int *odata, const int *idata) {
-    // TODO
-    return -1;
+    int count = 0;
+    for (int i = 0; i < n; i++) {
+        if (idata[i] != 0) {
+            odata[count++] = idata[i];
+        }
+    }
+    return count;
+}
+
+/**
+ * CPU scatter algorithm.
+ *
+ * @returns the number of elements remaining.
+ */
+int scatter(int n, int *odata, const int *indices, const int *input) {
+    for (int i = 0; i < n; i++) {
+        if (input[i] != 0) { odata[indices[i]] = input[i]; }
+    }
+    // `indices` is an exclusive scan, so count the last element if it is kept.
+    return indices[n-1] + (input[n-1] != 0 ? 1 : 0);
+}
 
 /**
@@ -28,8 +48,20 @@ int compactWithoutScan(int n, int *odata, const int *idata) {
  * @returns the number of elements remaining after compaction.
  */
 int compactWithScan(int n, int *odata, const int *idata) {
-    // TODO
-    return -1;
+    int *predicate_data = (int *)malloc(n * sizeof(int));
+    for (int i = 0; i < n; i++) {
+        predicate_data[i] = idata[i] == 0 ? 0 : 1;
+    }
+
+    int *scan_data = (int *)malloc(n * sizeof(int));
+    scan(n, scan_data, predicate_data);
+
+    int count = scatter(n, odata, scan_data, idata);
+
+    free(predicate_data);
+    free(scan_data);
+
+    return count;
 }
 
 }
diff --git a/stream_compaction/efficient.cu b/stream_compaction/efficient.cu
index b2f739b..2fb27f5 100644
--- a/stream_compaction/efficient.cu
+++ b/stream_compaction/efficient.cu
@@ -6,14 +6,91 @@
 namespace StreamCompaction {
 namespace Efficient {
 
-// TODO: __global__
+int BLOCK_SIZE = (2 << 7);
+
+__global__ void kUpSweep(int n, int d, int *data) {
+    int k = (blockDim.x * blockIdx.x) + threadIdx.x;
+    if (k >= n) { return; }
+    int exp_d = (int)exp2f(d);
+    int exp_d1 = (int)exp2f(d+1);
+    if (k % exp_d1 == 0) {
+        data[k + exp_d1 - 1] += data[k + exp_d - 1];
+    }
+}
+
+__global__ void kDownSweep(int n, int d, int *data) {
+    int k = (blockDim.x * blockIdx.x) + threadIdx.x;
+    if (k >= n) { return; }
+    if (k % (int)exp2f(d+1) == 0) {
+        int left = k + (int)exp2f(d) - 1;
+        int right = k + (int)exp2f(d+1) - 1;
+        int t = data[left];
+        data[left] = data[right];
+        data[right] += t;
+    }
+}
+
+/*
+ * In-place scan on `dev_idata`, which must be a device memory pointer.
+ */
+void dv_scan(int n, int *dev_idata) {
+    int numBlocks = (n + BLOCK_SIZE - 1) / BLOCK_SIZE;
+
+    for (int d = 0; d < ilog2ceil(n)-1; d++) {
+        kUpSweep<<<numBlocks, BLOCK_SIZE>>>(n, d, dev_idata);
+        checkCUDAError("scan");
+    }
+
+    int z = 0;
+    cudaMemcpy(&dev_idata[n-1], &z, sizeof(int), cudaMemcpyHostToDevice);
+
+    for (int d = ilog2ceil(n)-1; d >= 0; d--) {
+        kDownSweep<<<numBlocks, BLOCK_SIZE>>>(n, d, dev_idata);
+        checkCUDAError("scan");
+    }
+}
 
 /**
  * Performs prefix-sum (aka scan) on idata, storing the result into odata.
  */
-void scan(int n, int *odata, const int *idata) {
-    // TODO
-    printf("TODO\n");
+void scan(int size, int *odata, const int *input, float *time, int blockSize) {
+    BLOCK_SIZE = blockSize;
+    int *idata;
+    int n;
+
+    if ((size & (size-1)) != 0) { // if size is not a power of 2
+        n = (int)exp2f(ilog2ceil(size));
+        idata = (int*)malloc(n * sizeof(int));
+        memcpy(idata, input, size * sizeof(int));
+        for (int j = size; j < n; j++) {
+            idata[j] = 0;
+        }
+    } else {
+        n = size;
+        idata = (int*)malloc(n * sizeof(int));
+        memcpy(idata, input, n * sizeof(int));
+    }
+
+    int array_size = n * sizeof(int);
+    int *dv_idata;
+
+    cudaMalloc((void**) &dv_idata, array_size);
+    cudaMemcpy(dv_idata, idata, array_size, cudaMemcpyHostToDevice);
+
+    cudaEvent_t begin, end;
+    cudaEventCreate(&begin);
+    cudaEventCreate(&end);
+    cudaEventRecord(begin, 0);
+
+    dv_scan(n, dv_idata);
+
+    cudaEventRecord(end, 0);
+    cudaEventSynchronize(end);
+    cudaEventElapsedTime(time, begin, end);
+    cudaEventDestroy(begin);
+    cudaEventDestroy(end);
+
+    cudaMemcpy(odata, dv_idata, size * sizeof(int), cudaMemcpyDeviceToHost);
+    cudaFree(dv_idata);
+    free(idata);
 }
 
 /**
@@ -25,9 +102,68 @@ void scan(int n, int *odata, const int *idata) {
  * @param idata The array of elements to compact.
  * @returns The number of elements remaining after compaction.
 */
-int compact(int n, int *odata, const int *idata) {
-    // TODO
-    return -1;
+int compact(int size, int *odata, const int *input, float *time, int blockSize) {
+    BLOCK_SIZE = blockSize;
+
+    int *idata;
+    int n;
+
+    if ((size & (size-1)) != 0) { // if size is not a power of 2
+        n = (int)exp2f(ilog2ceil(size));
+        idata = (int*)malloc(n * sizeof(int));
+        memcpy(idata, input, size * sizeof(int));
+        for (int j = size; j < n; j++) {
+            idata[j] = 0;
+        }
+    } else {
+        n = size;
+        idata = (int*)malloc(n * sizeof(int));
+        memcpy(idata, input, n * sizeof(int));
+    }
+
+    int *dev_indices;
+    int *dev_odata;
+    int *dev_idata;
+    int array_size = n * sizeof(int);
+    int numBlocks = (n + BLOCK_SIZE - 1) / BLOCK_SIZE;
+
+    cudaMalloc((void**) &dev_indices, array_size);
+    cudaMalloc((void**) &dev_odata, array_size);
+
+    cudaMalloc((void**) &dev_idata, array_size);
+    cudaMemcpy(dev_idata, idata, array_size, cudaMemcpyHostToDevice);
+
+    cudaEvent_t begin, end;
+    cudaEventCreate(&begin);
+    cudaEventCreate(&end);
+    cudaEventRecord(begin, 0);
+
+    StreamCompaction::Common::kernMapToBoolean<<<numBlocks, BLOCK_SIZE>>>(n, dev_indices, dev_idata);
+
+    int last;
+    cudaMemcpy(&last, dev_indices + n-1, sizeof(int), cudaMemcpyDeviceToHost);
+
+    dv_scan(n, dev_indices);
+    int streamSize;
+    cudaMemcpy(&streamSize, dev_indices + n-1, sizeof(int), cudaMemcpyDeviceToHost);
+
+    StreamCompaction::Common::kernScatter<<<numBlocks, BLOCK_SIZE>>>(n, dev_odata, dev_indices, dev_idata);
+
+    cudaEventRecord(end, 0);
+    cudaEventSynchronize(end);
+    cudaEventElapsedTime(time, begin, end);
+    cudaEventDestroy(begin);
+    cudaEventDestroy(end);
+
+    cudaMemcpy(odata, dev_odata, size * sizeof(int), cudaMemcpyDeviceToHost);
+
+    free(idata);
+    cudaFree(dev_indices);
+    cudaFree(dev_odata);
+    cudaFree(dev_idata);
+
+    // The kernel always copies the last elt.
+    // Adjust the size to include it if desired.
+    if (last == 1) {
+        streamSize++;
+    }
+
+    free(idata);
+    cudaFree(dev_indices);
+    cudaFree(dev_odata);
+    cudaFree(dev_idata);
+
+    return streamSize;
 }

 }
diff --git a/stream_compaction/efficient.h b/stream_compaction/efficient.h
index 395ba10..76fbc36 100644
--- a/stream_compaction/efficient.h
+++ b/stream_compaction/efficient.h
@@ -2,8 +2,9 @@
 namespace StreamCompaction {
 namespace Efficient {
-    void scan(int n, int *odata, const int *idata);
+    void dv_scan(int n, int *dev_idata);
+    void scan(int size, int *odata, const int *input, float *time, int blockSize);

-    int compact(int n, int *odata, const int *idata);
+    int compact(int n, int *odata, const int *idata, float *time, int blockSize);
 }
 }
diff --git a/stream_compaction/naive.cu b/stream_compaction/naive.cu
index 3d86b60..ba0aa94 100644
--- a/stream_compaction/naive.cu
+++ b/stream_compaction/naive.cu
@@ -6,14 +6,68 @@
 namespace StreamCompaction {
 namespace Naive {

-// TODO: __global__
+__global__ void kScan(int d, int *odata, const int *idata) {
+    int k = (blockDim.x * blockIdx.x) + threadIdx.x;
+    if (k >= (1 << (d-1))) {
+        odata[k] = idata[k - (1 << (d-1))] + idata[k];
+    } else {
+        odata[k] = idata[k];
+    }
+}
+
+__global__ void kShift(int n, int *odata, int *idata) {
+    int k = (blockDim.x * blockIdx.x) + threadIdx.x;
+    if (k >= n) { return; }
+    if (k == 0) {
+        odata[0] = 0;
+    } else {
+        odata[k] = idata[k-1];
+    }
+}

 /**
  * Performs prefix-sum (aka scan) on idata, storing the result into odata.
  */
-void scan(int n, int *odata, const int *idata) {
-    // TODO
-    printf("TODO\n");
+__host__ void scan(int n, int *odata, const int *idata, float *time, int blockSize) {
+    int array_size = n * sizeof(int);
+    int numBlocks = (n + blockSize - 1) / blockSize;
+    // Pad the device buffers out to the full grid so that threads past
+    // n-1 in kScan stay within bounds.
+    int alloc_size = numBlocks * blockSize * sizeof(int);
+
+    int *A;
+    int *B;
+
+    cudaMalloc((void**) &A, alloc_size);
+    cudaMalloc((void**) &B, alloc_size);
+    cudaMemcpy(A, idata, array_size, cudaMemcpyHostToDevice);
+
+    int *in = A;
+    int *out = B;
+
+    cudaEvent_t begin, end;
+    cudaEventCreate(&begin);
+    cudaEventCreate(&end);
+
+    cudaEventRecord(begin, 0);
+
+    for (int d = 1; d < ilog2ceil(n)+1; d++) {
+        in = (d % 2 == 1) ? A : B;
+        out = (d % 2 == 1) ? B : A;
+        kScan<<<numBlocks, blockSize>>>(d, out, in);
+        checkCUDAError("scan");
+    }
+
+    // Shift right by one to turn the inclusive scan into an exclusive scan.
+    kShift<<<numBlocks, blockSize>>>(n, in, out);
+
+    cudaEventRecord(end, 0);
+    cudaEventSynchronize(end);
+    cudaEventElapsedTime(time, begin, end);
+    cudaEventDestroy(begin);
+    cudaEventDestroy(end);
+
+    cudaMemcpy(odata, in, array_size, cudaMemcpyDeviceToHost);
+
+    cudaFree(A);
+    cudaFree(B);
 }

 }
diff --git a/stream_compaction/naive.h b/stream_compaction/naive.h
index 21152d6..3c8578d 100644
--- a/stream_compaction/naive.h
+++ b/stream_compaction/naive.h
@@ -2,6 +2,6 @@
 namespace StreamCompaction {
 namespace Naive {
-    void scan(int n, int *odata, const int *idata);
+    void scan(int n, int *odata, const int *idata, float *time, int blockSize);
 }
 }
diff --git a/stream_compaction/radix.cu b/stream_compaction/radix.cu
new file mode 100644
index 0000000..2369c2d
--- /dev/null
+++ b/stream_compaction/radix.cu
@@ -0,0 +1,147 @@
+#include <cstdio>
+#include <cstdlib>
+#include "common.h"
+#include "radix.h"
+#include "efficient.h"
+
+namespace StreamCompaction {
+namespace Radix {
+
+/*
+ * Extract bit `d` of each element of dv_idata into dv_odata (1 if set).
+ */
+__global__ void kDigit(int n, int d, int *dv_odata, const int *dv_idata) {
+    int k = threadIdx.x;
+    if (k >= n) { return; }
+    dv_odata[k] = (dv_idata[k] & (1 << d)) > 0 ?
1 : 0; +} + +__global__ void kInvert(int n, int *odata, const int *idata) { + int k = threadIdx.x; + if (k >= n) { return; } + odata[k] = idata[k] == 0 ? 1 : 0; +} + +__global__ void kMapToIndex(int n, int *odata, int *b, int *f_indices, int pivot) { + int k = threadIdx.x; + if (k >= n) { return; } + odata[k] = (b[k] == 1) ? (k - f_indices[k] + pivot) : f_indices[k]; +} + +/* + * Implement split on device memory. + * Returns totalFalses (eg. the split point). + */ +__host__ int split(int n, int d, int *dv_odata, int *dv_idata) { + printf("---- split %d %d ----\n", n, d); + int array_size = n * sizeof(int); + int *TMP = (int*)malloc(array_size); + + int *b; + int *e; + int *t; + int *indices; + cudaMalloc((void**) &b, array_size); + cudaMalloc((void**) &e, array_size); + cudaMalloc((void**) &t, array_size); + cudaMalloc((void**) &indices, array_size); + + kDigit<<<1, n>>>(n, d, b, dv_idata); // b + printf("b: "); + cudaMemcpy(TMP, b, array_size, cudaMemcpyDeviceToHost); + for (int i = 0; i < n; i++) { printf("%d\t", TMP[i]); } printf("\n"); + kInvert<<<1, n>>>(n, e, b); // e + printf("e: "); + cudaMemcpy(TMP, e, array_size, cudaMemcpyDeviceToHost); + for (int i = 0; i < n; i++) { printf("%d\t", TMP[i]); } printf("\n"); + + int lastElt; + cudaMemcpy(&lastElt, e + n-1, sizeof(int), cudaMemcpyDeviceToHost); + + StreamCompaction::Efficient::dv_scan(n, e); // f IN PLACE OF e + + int totalFalses; + cudaMemcpy(&totalFalses, e + n-1, sizeof(int), cudaMemcpyDeviceToHost); + totalFalses += lastElt; + + printf("totalFalses = %d\n", totalFalses); + + kMapToIndex<<<1, n>>>(n, indices, b, e, totalFalses); + printf("indices: "); + cudaMemcpy(TMP, indices, array_size, cudaMemcpyDeviceToHost); + for (int i = 0; i < n; i++) { printf("%d\t", TMP[i]); } printf("\n"); + + StreamCompaction::Common::kernScatter<<<1, n>>>(n, dv_odata, indices, dv_idata); // scatter + printf("scattered: "); + cudaMemcpy(TMP, dv_odata, array_size, cudaMemcpyDeviceToHost); + for (int i = 0; i < n; i++) { 
printf("%d\t", TMP[i]); } printf("\n"); + + cudaFree(b); + return totalFalses; +} + +int testArrayOrder(int n, int *a) { + for (int i = 0; i < n-1; i++) { + if (a[i] > a[i+1]) { + return 1; + } + } + return 0; +} + +/* + * odata and idata are device memory points. + */ +__host__ void sortRecursive(int n, int d, int dmax, int *odata, int *idata) { + if (d >= dmax) { return; } + int pivot = split(n, d, odata, idata); + //sortRecursive(n, d+1, dmax, odata, odata); + //if (pivot != 0) { + // sortRecursive(pivot, d+1, dmax, odata, odata); + //} + //if (pivot != n) { + // sortRecursive(n-pivot, d+1, dmax, odata+n, odata+n); + //} +} + +__host__ void sortRecursive2(int n, int d, int dmax, int *odata, int *idata) { + if (d <= 0) { return; } + int pivot = split(n, d, odata, idata); + if (pivot != 0) { + sortRecursive(pivot, d-1, dmax, odata, odata); + } + if (pivot != n) { + sortRecursive(n-pivot, d-1, dmax, odata+n, odata+n); + } +} + +__host__ void sort(int n, int *odata, const int *idata) { + int max = idata[0]; + for (int i = 0; i < n; i++) { + if (idata[i] > max) { + max = idata[i]; + } + } + int maxDigits = ilog2ceil(max); + + int *dv_odata; + int *dv_idata; + int array_size = n * sizeof(int); + + cudaMalloc((void**) &dv_odata, array_size); + cudaMalloc((void**) &dv_idata, array_size); + cudaMemcpy(dv_idata, idata, array_size, cudaMemcpyHostToDevice); + + //sortRecursive(n, 0, maxDigits, dv_odata, dv_idata); + sortRecursive2(n, 0, maxDigits, dv_odata, dv_idata); + + cudaMemcpy(odata, dv_odata, array_size, cudaMemcpyDeviceToHost); + + //for (int i = 0; i < n; i++) { printf("%d\t%d\n", idata[i], odata[i]); } + + cudaFree(dv_odata); + cudaFree(dv_idata); +} + +} +} diff --git a/stream_compaction/radix.h b/stream_compaction/radix.h new file mode 100644 index 0000000..57d21d8 --- /dev/null +++ b/stream_compaction/radix.h @@ -0,0 +1,7 @@ +#pragma once + +namespace StreamCompaction { +namespace Radix { + void sort(int n, int *odata, const int *idata); +} +} diff --git 
a/stream_compaction/thrust.cu b/stream_compaction/thrust.cu
index d8dbb32..9791b97 100644
--- a/stream_compaction/thrust.cu
+++ b/stream_compaction/thrust.cu
@@ -12,10 +12,24 @@
 namespace Thrust {

 /**
  * Performs prefix-sum (aka scan) on idata, storing the result into odata.
  */
-void scan(int n, int *odata, const int *idata) {
-    // TODO use `thrust::exclusive_scan`
-    // example: for device_vectors dv_in and dv_out:
-    // thrust::exclusive_scan(dv_in.begin(), dv_in.end(), dv_out.begin());
+void scan(int n, int *odata, const int *idata, float *time) {
+    thrust::device_vector<int> ivec(idata, idata + n);
+    thrust::device_vector<int> ovec(n);
+
+    cudaEvent_t begin, end;
+    cudaEventCreate(&begin);
+    cudaEventCreate(&end);
+    cudaEventRecord(begin, 0);
+
+    thrust::exclusive_scan(ivec.begin(), ivec.end(), ovec.begin());
+
+    cudaEventRecord(end, 0);
+    cudaEventSynchronize(end);
+    cudaEventElapsedTime(time, begin, end);
+    cudaEventDestroy(begin);
+    cudaEventDestroy(end);
+
+    thrust::copy(ovec.begin(), ovec.end(), odata);
 }

 }
diff --git a/stream_compaction/thrust.h b/stream_compaction/thrust.h
index 06707f3..4408a15 100644
--- a/stream_compaction/thrust.h
+++ b/stream_compaction/thrust.h
@@ -2,6 +2,6 @@
 namespace StreamCompaction {
 namespace Thrust {
-    void scan(int n, int *odata, const int *idata);
+    void scan(int n, int *odata, const int *idata, float *time);
 }
 }