317 changes: 111 additions & 206 deletions README.md
@@ -1,213 +1,118 @@
CUDA Stream Compaction
======================
# CUDA Stream Compaction

**University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 2**

* (TODO) YOUR NAME HERE
* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab)
Terry Sun; Arch Linux, Intel i5-4670, GTX 750

### (TODO: Your README)
## Library

Include analysis, etc. (Remember, this is public, so don't put
anything here that you don't want to share with the world.)
This project contains a `stream_compaction` library and some associated tests.

Instructions (delete me)
========================
`CPU`: A CPU implementation of `scan` and `scatter`, for reference and
performance comparisons. Both run in O(n) time; the scan performs O(n) adds.

This is due Sunday, September 13 at midnight.

**Summary:** In this project, you'll implement GPU stream compaction in CUDA,
from scratch. This algorithm is widely used, and will be important for
accelerating your path tracer project.

Your stream compaction implementations in this project will simply remove `0`s
from an array of `int`s. In the path tracer, you will remove terminated paths
from an array of rays.

In addition to being useful for your path tracer, this project is meant to
reorient your algorithmic thinking to the way of the GPU. On GPUs, many
algorithms can benefit from massive parallelism and, in particular, data
parallelism: executing the same code many times simultaneously with different
data.

You'll implement a few different versions of the *Scan* (*Prefix Sum*)
algorithm. First, you'll implement a CPU version of the algorithm to reinforce
your understanding. Then, you'll write a few GPU implementations: "naive" and
"work-efficient." Finally, you'll use some of these to implement GPU stream
compaction.

**Algorithm overview & details:** There are two primary references for details
on the implementation of scan and stream compaction.

* The [slides on Parallel Algorithms](https://github.com/CIS565-Fall-2015/cis565-fall-2015.github.io/raw/master/lectures/2-Parallel-Algorithms.pptx)
for Scan, Stream Compaction, and Work-Efficient Parallel Scan.
* GPU Gems 3, Chapter 39 - [Parallel Prefix Sum (Scan) with CUDA](http://http.developer.nvidia.com/GPUGems3/gpugems3_ch39.html).

Your GPU stream compaction implementation will live inside of the
`stream_compaction` subproject. This way, you will be able to easily copy it
over for use in your GPU path tracer.


## Part 0: The Usual

This project (and all other CUDA projects in this course) requires an NVIDIA
graphics card with CUDA capability. Any card with Compute Capability 2.0
(`sm_20`) or greater will work. Check your GPU on this
[compatibility table](https://developer.nvidia.com/cuda-gpus).
If you do not have a personal machine with these specs, you may use the
computers in Moore 100B/C, which have supported GPUs.

**HOWEVER**: If you need to use the lab computer for your development, you will
not presently be able to do GPU performance profiling. This will be very
important for debugging performance bottlenecks in your program.

### Useful existing code

* `stream_compaction/common.h`
* `checkCUDAError` macro: checks for CUDA errors and exits if there were any.
* `ilog2ceil(x)`: computes the ceiling of log2(x), as an integer (one possible implementation is sketched after this list).
* `main.cpp`
* Some testing code for your implementations.
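
For reference, here is one way `ilog2ceil` might be implemented (a sketch; the
actual version in `common.h` may differ):

```
// ilog2(x): floor(log2(x)) for x >= 1, by counting right-shifts.
inline int ilog2(int x) {
    int lg = 0;
    while (x >>= 1) {
        ++lg;
    }
    return lg;
}

// ilog2ceil(x): ceil(log2(x)); equals ilog2(x - 1) + 1 for x > 1.
inline int ilog2ceil(int x) {
    return x == 1 ? 0 : ilog2(x - 1) + 1;
}
```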


## Part 1: CPU Scan & Stream Compaction

This stream compaction method will remove `0`s from an array of `int`s.

In `stream_compaction/cpu.cu`, implement:

* `StreamCompaction::CPU::scan`: compute an exclusive prefix sum.
* `StreamCompaction::CPU::compactWithoutScan`: stream compaction without using
the `scan` function.
* `StreamCompaction::CPU::compactWithScan`: stream compaction using the `scan`
function. Map the input array to an array of 0s and 1s, scan it, and use
scatter to produce the output. You will need a **CPU** scatter implementation
for this (see slides or GPU Gems chapter for an explanation).

These implementations should only be a few lines long.
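
For concreteness, here is a minimal sketch of what these three functions could
look like (the exact signatures in the starter code may differ):

```
#include <vector>

// Exclusive prefix sum: odata[0] = 0; odata[i] = odata[i-1] + idata[i-1].
void scan(int n, int *odata, const int *idata) {
    int sum = 0;
    for (int i = 0; i < n; ++i) {
        odata[i] = sum;
        sum += idata[i];
    }
}

// Compaction without scan: copy through, skipping zeroes.
int compactWithoutScan(int n, int *odata, const int *idata) {
    int count = 0;
    for (int i = 0; i < n; ++i) {
        if (idata[i] != 0) {
            odata[count++] = idata[i];
        }
    }
    return count;  // number of elements kept
}

// Compaction as map -> scan -> scatter.
int compactWithScan(int n, int *odata, const int *idata) {
    if (n == 0) { return 0; }
    std::vector<int> bools(n), indices(n);
    for (int i = 0; i < n; ++i) {           // map: 1 = keep, 0 = remove
        bools[i] = (idata[i] != 0) ? 1 : 0;
    }
    scan(n, indices.data(), bools.data());  // exclusive scan of the flags
    for (int i = 0; i < n; ++i) {           // scatter kept elements
        if (bools[i]) {
            odata[indices[i]] = idata[i];
        }
    }
    return indices[n - 1] + bools[n - 1];   // total number kept
}
```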


## Part 2: Naive GPU Scan Algorithm

In `stream_compaction/naive.cu`, implement `StreamCompaction::Naive::scan`

This uses the "Naive" algorithm from GPU Gems 3, Section 39.2.1. We haven't yet
taught shared memory, and you **shouldn't use it yet**. Example 39-1 uses
shared memory, but is limited to operating on very small arrays! Instead, write
this using global memory only. As a result of this, you will have to do
`ilog2ceil(n)` separate kernel invocations.

Beware of errors in Example 39-1 in the book; both the pseudocode and the CUDA
code in the online version of Chapter 39 are known to have a few small errors
(in superscripting, missing braces, bad indentation, etc.)

Since the parallel scan algorithm operates on a binary tree structure, it works
best with arrays with power-of-two length. Make sure your implementation works
on non-power-of-two sized arrays (see `ilog2ceil`). This requires extra
memory: your intermediate array sizes will need to be rounded up to the next
power of two.
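
A sketch of the kind of pass this calls for (kernel and variable names are
illustrative, not prescriptive):

```
// One pass of the naive scan, using only global memory. After the pass
// with a given offset, odata[k] holds the sum of (up to) 2*offset
// elements ending at index k.
__global__ void kernNaiveScanPass(int n, int offset,
                                  int *odata, const int *idata) {
    int k = blockIdx.x * blockDim.x + threadIdx.x;
    if (k >= n) { return; }
    odata[k] = (k >= offset) ? idata[k - offset] + idata[k] : idata[k];
}

// Host side (sketch): ping-pong two device buffers for ilog2ceil(n)
// passes, then shift the inclusive result right by one (inserting 0 at
// index 0) to produce an exclusive scan.
//
//   for (int offset = 1; offset < n; offset *= 2) {
//       kernNaiveScanPass<<<numBlocks, blockSize>>>(n, offset, dev_out, dev_in);
//       std::swap(dev_in, dev_out);  // output of this pass feeds the next
//   }
```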


## Part 3: Work-Efficient GPU Scan & Stream Compaction

### 3.1. Scan

In `stream_compaction/efficient.cu`, implement
`StreamCompaction::Efficient::scan`

All of the text in Part 2 applies.

* This uses the "Work-Efficient" algorithm from GPU Gems 3, Section 39.2.2.
* Beware of errors in Example 39-2.
* Test non-power-of-two sized arrays.
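
For orientation, a sketch of the two sweep phases on a buffer already padded
to a power-of-two length `n` (kernel names and launch details are
illustrative):

```
// Up-sweep (parallel reduce): at each depth, stride = 2^(d+1) and each
// thread adds the left child's partial sum into the right child.
__global__ void kernUpSweep(int n, int stride, int *data) {
    int k = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    if (k + stride - 1 < n) {
        data[k + stride - 1] += data[k + stride / 2 - 1];
    }
}

// Down-sweep: after zeroing the root (data[n-1] = 0), traverse back down,
// passing partial sums to the left child and accumulating on the right.
__global__ void kernDownSweep(int n, int stride, int *data) {
    int k = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    if (k + stride - 1 < n) {
        int left = data[k + stride / 2 - 1];
        data[k + stride / 2 - 1] = data[k + stride - 1];
        data[k + stride - 1] += left;
    }
}

// Host side (sketch), with n a power of two:
//   for (int stride = 2; stride <= n; stride *= 2)
//       kernUpSweep<<<blocksFor(n / stride), blockSize>>>(n, stride, dev_data);
//   cudaMemset(dev_data + n - 1, 0, sizeof(int));  // identity at the root
//   for (int stride = n; stride >= 2; stride /= 2)
//       kernDownSweep<<<blocksFor(n / stride), blockSize>>>(n, stride, dev_data);
```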

### 3.2. Stream Compaction

This stream compaction method will remove `0`s from an array of `int`s.

In `stream_compaction/efficient.cu`, implement
`StreamCompaction::Efficient::compact`

For compaction, you will also need to implement the scatter algorithm presented
in the slides and the GPU Gems chapter.

In `stream_compaction/common.cu`, implement these for use in `compact`:

* `StreamCompaction::Common::kernMapToBoolean`
* `StreamCompaction::Common::kernScatter`
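
Possible bodies for these two kernels (the real signatures are declared in
`common.h`; this is just the shape of the computation):

```
// Map each element to a flag: 1 if it should be kept, 0 otherwise.
__global__ void kernMapToBoolean(int n, int *bools, const int *idata) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        bools[i] = (idata[i] != 0) ? 1 : 0;
    }
}

// Scatter: each kept element writes itself to the destination given by
// the exclusive scan of the boolean flags.
__global__ void kernScatter(int n, int *odata, const int *idata,
                            const int *bools, const int *indices) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && bools[i]) {
        odata[indices[i]] = idata[i];
    }
}
```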


## Part 4: Using Thrust's Implementation

In `stream_compaction/thrust.cu`, implement:

* `StreamCompaction::Thrust::scan`

This should be a very short function which wraps a call to the Thrust library
function `thrust::exclusive_scan(first, last, result)`.

To measure timing, be sure to exclude memory operations by passing
`exclusive_scan` a `thrust::device_vector` (which is already allocated on the
GPU). You can create a `thrust::device_vector` by creating a
`thrust::host_vector` from the given pointer, then constructing the
`device_vector` from it (the conversion copies the data to the GPU).
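
A sketch of the whole wrapper (names are illustrative); only the
`exclusive_scan` call itself should fall inside your timed region:

```
#include <thrust/copy.h>
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/scan.h>

void scan(int n, int *odata, const int *idata) {
    thrust::host_vector<int> hv(idata, idata + n);
    thrust::device_vector<int> dv_in = hv;  // copies host -> device
    thrust::device_vector<int> dv_out(n);
    // Time only this call; the copies above and below would dominate it.
    thrust::exclusive_scan(dv_in.begin(), dv_in.end(), dv_out.begin());
    thrust::copy(dv_out.begin(), dv_out.end(), odata);  // device -> host
}
```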


## Part 5: Radix Sort (Extra Credit) (+10)

Add an additional module to the `stream_compaction` subproject. Implement radix
sort using one of your scan implementations. Add tests to check its correctness.
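
One common construction (the "split" primitive from the GPU Gems scan
chapter) sorts one bit at a time: compute `e[i] = 1` where the current bit of
element `i` is 0, exclusive-scan `e` into `f`, let
`totalFalses = e[n-1] + f[n-1]`, and scatter. A sketch of the scatter step of
one pass (names here are illustrative):

```
// Scatter one split pass: 0-bit keys go to f[i]; 1-bit keys go after all
// of them, in order, at i - f[i] + totalFalses. The pass is stable, so
// looping it from the lowest bit to the highest yields a full sort.
__global__ void kernSplitScatter(int n, int bit, int totalFalses,
                                 int *odata, const int *idata,
                                 const int *f) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) { return; }
    int b = (idata[i] >> bit) & 1;
    odata[b ? (i - f[i] + totalFalses) : f[i]] = idata[i];
}
```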


## Write-up

1. Update all of the TODOs at the top of this README.
2. Add a description of this project including a list of its features.
3. Add your performance analysis (see below).

All extra credit features must be documented in your README, explaining their
value (with a performance comparison, if applicable!) and showing an example of
how they work. For radix sort, show how it is called and an example of its
output.

Always profile with Release mode builds and run without debugging.

### Questions

* Roughly optimize the block sizes of each of your implementations for minimal
run time on your GPU.
* (You shouldn't compare unoptimized implementations to each other!)

* Compare all of these GPU Scan implementations (Naive, Work-Efficient, and
Thrust) to the serial CPU version of Scan. Plot a graph of the comparison
(with array size on the independent axis).
* You should use CUDA events for timing, as in the sketch following this
list. Be sure **not** to include any explicit memory operations in your
performance measurements, for comparability.
* To guess at what might be happening inside the Thrust implementation, take
a look at the Nsight timeline for its execution.

* Write a brief explanation of the phenomena you see here.
* Can you find the performance bottlenecks? Is it memory I/O? Computation? Is
it different for each implementation?

* Paste the output of the test program into a triple-backtick block in your
README.
* If you add your own tests (e.g. for radix sort or to test additional corner
cases), be sure to mention it explicitly.
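
A minimal CUDA-event timing sketch (bracketing only the kernel work, with
`cudaMemcpy`s kept outside the timed region):

```
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
// ... kernel launches only; no cudaMemcpy inside the timed region ...
cudaEventRecord(stop);
cudaEventSynchronize(stop);  // wait for the stop event to complete

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);  // elapsed milliseconds

cudaEventDestroy(start);
cudaEventDestroy(stop);
```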

These questions should help guide you in performance analysis on future
assignments, as well.

## Submit

If you have modified any of the `CMakeLists.txt` files at all (aside from the
list of `SOURCE_FILES`), you must test that your project can build in Moore
100B/C. Beware of any build issues discussed on the Google Group.

1. Open a GitHub pull request so that we can see that you have finished.
The title should be "Submission: YOUR NAME".
2. Send an email to the TA (gmail: kainino1+cis565@) with:
* **Subject**: in the form of `[CIS565] Project 2: PENNKEY`
* Direct link to your pull request on GitHub
* In the form of a grade (0-100+) with comments, evaluate your own
performance on the project.
* Feedback on the project itself, if any.
`Naive`: A naive (non-work-efficient) implementation of `scan`, performing
O(n log n) adds over O(log n) kernel iterations.

`Efficient`: A work-efficient implementation of `scan` and `compact`. Also
contains `dv_scan`, the actual in-place scan implementation, which takes a
device memory pointer directly (useful for other CUDA code that needs a scan,
bypassing the host-memory pointers that `Efficient::scan` expects). Performs
O(n) adds over 2 log n iterations.
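
For illustration, this is the kind of call site `dv_scan` enables (the
signature here is a guess, not the actual one in `efficient.cu`):

```
// Hypothetical usage: scan data already resident on the GPU, in place,
// without bouncing it through host memory first. pow2n is n rounded up
// to the next power of two.
int *dev_data;
cudaMalloc(&dev_data, pow2n * sizeof(int));
// ... some earlier kernel fills dev_data ...
StreamCompaction::Efficient::dv_scan(pow2n, dev_data);  // in-place scan
// dev_data can now feed kernScatter, radix sort, etc. directly.
```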

`Common`:
* `kernMapToBoolean`, used as the `filter`ing function in `Efficient::compact`.
* `kernScatter`, used in `Efficient::compact` and `Radix::sort`.

`Radix`: `sort` is so close to working but... doesn't work :(

## Performance Analysis

I did performance analysis with CUDA events (`cudaEvent_t`) for the GPU
algorithm implementations and `std::chrono` for the CPU implementations. As
before, the code for this can be found on the `performance` branch (to avoid
cluttering the main codebase). Raw data (CSV format) can be found in `data/`.
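
The CPU side is the usual `std::chrono` pattern, roughly (a sketch, not the
exact harness):

```
#include <chrono>

auto begin = std::chrono::high_resolution_clock::now();
StreamCompaction::CPU::scan(n, odata, idata);  // code under test
auto end = std::chrono::high_resolution_clock::now();
double ms =
    std::chrono::duration<double, std::milli>(end - begin).count();
```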

### Some fun charts

Measuring the performance of scan with a block size of 128 (where applicable).

![](data/scan_perf_zoomed_out.png)

I cut the top of the CPU line off and my chart is still horribly skewed. Let's
try again:

![](data/scan_perf_zoomed_in.png)

Interestingly, the sharp(ish) increase in `thrust::scan` around N=14 (arrays
of 2^14 elements) is consistent between runs. Maybe it has to do with an
increase in memory allocation around that size.

`Naive` performs about twice as well as `Efficient`, which makes sense, as the
work-efficient scan runs twice as many iterations of kernel calls. I suspect a
smarter method of spawning threads (launching only as many as are needed at
each depth, instead of launching 2^N every time and using only a subset) would
improve performance on the efficient algorithm, as more of the launched
threads would execute exactly the same instruction sequence. The gain would
likely be greater for `Efficient` than for `Naive`, because `Efficient` runs
more iterations but needs fewer active threads in each one, which would
explain why the work-efficient algorithm is preferable in practice. (I was
planning on testing this but, as you can see, I ran out of time.)

There's a small amount of device-to-host memory transfer in
`Efficient::scan`; I don't know if that has an appreciable impact, since it
only needs to copy `sizeof(int)` bytes. `Efficient::compact` does even more
memory copying, to retrieve the size of the compacted stream.

![](data/gpu_by_block_size.png)

Tested on an array size of 2^16. `Naive::scan` and `Efficient::scan` are both
roughly optimal at a block size of 128.

The performance of `Efficient::compact` is dominated by `Efficient::scan`. The
only other computation happening in `compact` is `kernMapToBoolean` and
`kernScatter`, both of which do constant work per element (in fact, one
operation per thread), plus the memory copying discussed above.

Compact performance goes much the same way, to nobody's surprise:

![](data/compact_by_array_size.png)


## Test output

```
****************
** SCAN TESTS **
****************
[ 33 36 27 15 43 35 36 42 49 21 12 27 40 ... 28 0 ]
==== cpu scan, power-of-two ====
[ 0 33 69 96 111 154 189 225 267 316 337 349 376 ... 6371 6399 ]
==== cpu scan, non-power-of-two ====
[ 0 33 69 96 111 154 189 225 267 316 337 349 376 ... 6329 6330 ]
passed
==== naive scan, power-of-two ====
[ 0 33 69 96 111 154 189 225 267 316 337 349 376 ... 6371 6399 ]
passed
==== naive scan, non-power-of-two ====
passed
==== work-efficient scan, power-of-two ====
[ 0 33 69 96 111 154 189 225 267 316 337 349 376 ... 6371 6399 ]
passed
==== work-efficient scan, non-power-of-two ====
[ 0 33 69 96 111 154 189 225 267 316 337 349 376 ... 6329 6330 ]
passed
==== thrust scan, power-of-two ====
passed
==== thrust scan, non-power-of-two ====
passed

*****************************
** STREAM COMPACTION TESTS **
*****************************
[ 1 0 1 1 1 1 0 0 1 1 0 1 0 ... 0 0 ]
==== work-efficient compact, power-of-two ====
passed
==== work-efficient compact, non-power-of-two ====
passed
```
Binary file added data/compact_by_array_size.png
21 changes: 21 additions & 0 deletions data/cpu_by_arr_size.csv
@@ -0,0 +1,21 @@
log2(array size), scan, compactWithoutScan, compactWithScan
0, 0.018390, 0.017990, 0.062040
1, 0.036900, 0.036660, 0.125410
2, 0.056000, 0.056050, 0.191390
3, 0.075810, 0.077990, 0.260930
4, 0.098610, 0.103510, 0.342300
5, 0.130480, 0.136290, 0.460860
6, 0.176600, 0.193570, 0.614170
7, 0.249980, 0.288220, 0.837180
8, 0.379200, 0.451990, 1.180500
9, 0.595360, 0.787670, 1.739130
10, 1.010410, 1.459940, 2.873860
11, 1.808420, 2.799750, 4.983760
12, 3.382930, 5.454100, 9.189020
13, 6.617830, 10.682090, 17.652109
14, 13.106260, 20.930850, 34.538570
15, 26.239680, 41.440238, 95.065141
16, 53.197820, 82.259148, 232.273063
17, 106.942953, 163.591766, 524.566375
18, 214.327141, 326.009719, 1154.182750
19, 444.831375, 655.437125, 2664.925000
17 changes: 17 additions & 0 deletions data/gpu_by_array_size.csv
@@ -0,0 +1,17 @@
log2(array size), naive::scan, efficient::scan, efficient::compact, thrust::scan
4, 0.019901, 0.032131, 0.042268, 0.014861
5, 0.022966, 0.039091, 0.048816, 0.013974
6, 0.026545, 0.046097, 0.055637, 0.013844
7, 0.029734, 0.052625, 0.062891, 0.014107
8, 0.032616, 0.059878, 0.069612, 0.013777
9, 0.035588, 0.066927, 0.077829, 0.014007
10, 0.039123, 0.074742, 0.085684, 0.013909
11, 0.042336, 0.084190, 0.094274, 0.017210
12, 0.049155, 0.093326, 0.103361, 0.018924
13, 0.051340, 0.112999, 0.124146, 0.025679
14, 0.066044, 0.153436, 0.166579, 0.040478
15, 0.092430, 0.233036, 0.248662, 0.040362
16, 0.145964, 0.397543, 0.418071, 0.049354
17, 0.249689, 0.733529, 0.763885, 0.072880
18, 0.521292, 1.435356, 1.482930, 0.095224
19, 1.806152, 3.125029, 3.233285, 0.178910
10 changes: 10 additions & 0 deletions data/gpu_by_block_size.csv
@@ -0,0 +1,10 @@
log2(block size), naive::scan, efficient::scan, efficient::compact, thrust::scan
2, 1.705963, 3.371424, 3.477096, 0.152979
3, 0.869121, 1.678385, 1.735070, 0.153120
4, 0.475849, 0.892212, 0.926461, 0.156361
5, 0.268811, 0.502830, 0.531393, 0.159694
6, 0.156100, 0.404411, 0.432535, 0.163947
7, 0.144814, 0.393823, 0.414350, 0.167775
8, 0.146012, 0.393572, 0.414207, 0.172095
9, 0.148004, 0.393960, 0.411924, 0.175363
10, 0.163155, 0.406132, 0.423445, 0.179432
Binary file added data/gpu_by_block_size.png
Binary file added data/scan_perf_zoomed_in.png
Binary file added data/scan_perf_zoomed_out.png