124 changes: 118 additions & 6 deletions README.md
@@ -3,12 +3,124 @@ CUDA Stream Compaction

**University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 2**

* Matt Schwartz
* [LinkedIn](https://www.linkedin.com/in/matthew-schwartz-37019016b/)
* [Personal website](https://mattzschwartz.web.app/)
* Tested on: Windows 10 22H2, Intel(R) Core(TM) i7-10750H CPU @ 2.60GHz, NVIDIA GeForce RTX 2060

<p align="center">
<img src="img/downsweep.jpg" alt="Downsweep">
</p>

*Image source [GPU Gems 3 Chapter 39](https://developer.nvidia.com/gpugems/gpugems3/part-vi-gpu-computing/chapter-39-parallel-prefix-sum-scan-cuda)*


# Background

In this repository, I implement several variations of a prescan algorithm: a CPU-based version, a naive implementation on the GPU, and a work-efficient GPU implementation. I also run the built-in Thrust library implementation for comparison. Afterwards, I build upon this prescan algorithm to develop a parallel stream-compaction implementation (again, compared against the CPU).

Prescans and stream compaction have a variety of uses in data analysis and computer graphics. To learn more about these algorithms, give the [GPU Gems 3 chapter](https://developer.nvidia.com/gpugems/gpugems3/part-vi-gpu-computing/chapter-39-parallel-prefix-sum-scan-cuda) a read.

# The Data

Let's take a high-level glance at the data, and then follow up with more detail on some of the steps taken along the way to optimize these results.

A note on measurement taking: each test is repeated 20 times, and the median of those 20 runs is recorded and plotted. I chose the median because I found significant variance in the results from run to run, and I wanted to lessen the impact of outliers.
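To illustrate why the median is the more robust summary here, below is a minimal sketch in Python. The timings are hypothetical (they are not measurements from this project), and `summarize_runs` is a name made up for this example.

```python
import statistics


def summarize_runs(times_ms):
    """Return (mean, median) for a list of timings in milliseconds."""
    return statistics.mean(times_ms), statistics.median(times_ms)


# Hypothetical data: nineteen steady runs plus one slow outlier.
runs = [0.50] * 19 + [5.00]
mean_ms, median_ms = summarize_runs(runs)

# The single outlier drags the mean upward, while the median stays put.
print(f"mean = {mean_ms:.3f} ms, median = {median_ms:.3f} ms")
```

A single slow run shifts the mean noticeably but leaves the median at the typical value, which is the behavior I wanted for noisy GPU timings.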

Before recording any data, I ran the (automated) test suite with varying kernel block sizes (for the naive and work-efficient algorithms), and used that data to determine the best block size for each algorithm. This way, we can compare apples to apples. For the naive scan, the best size was around 256 threads per block; for the work-efficient scan, it was around 512 threads per block. In both cases, performance was minimally affected by block sizes ranging from 128 to 1024.

## Prefix Scan

Perhaps unsurprisingly, the Thrust library is the clear winner at every array size tested, from ~1 thousand elements to ~16 million. However, my work-efficient implementation isn't so far off: it is about 2x slower than the Thrust implementation. (In fact, each successively slower algorithm is about 2x slower than the previous one; a fun coincidence.)

Interestingly, the CPU scan is actually faster across the board than the naive scan! This is attributable both to the extra work the naive scan performs (on the order of n log n additions, compared with n for a sequential scan, albeit in parallel) and to its excessive use of global memory.
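The difference in work can be made concrete with a small CPU-side model. This is a sketch in Python, not my CUDA kernel: it assumes the naive scan follows the Hillis-Steele pattern described in GPU Gems 3, and it simply counts additions. The function names are made up for this example.

```python
def cpu_exclusive_scan(data):
    """Sequential exclusive scan; returns (result, number_of_additions)."""
    out, running, adds = [], 0, 0
    for x in data:
        out.append(running)
        running += x
        adds += 1
    return out, adds


def naive_exclusive_scan(data):
    """Hillis-Steele style scan; returns (result, number_of_additions).

    Each pass of the while loop models one kernel launch; the inner loop
    models the threads of that launch, reading one buffer and writing another.
    """
    n = len(data)
    if n == 0:
        return [], 0
    buf = list(data)
    adds = 0
    offset = 1
    while offset < n:
        nxt = list(buf)  # ping-pong buffer, so reads never see new writes
        for i in range(offset, n):
            nxt[i] = buf[i - offset] + buf[i]
            adds += 1
        buf = nxt
        offset *= 2
    # The passes above produce an inclusive scan; shift right to make it exclusive.
    return [0] + buf[:-1], adds


for n in (8, 1024):
    data = list(range(n))
    _, cpu_adds = cpu_exclusive_scan(data)
    _, naive_adds = naive_exclusive_scan(data)
    print(f"n = {n}: sequential adds = {cpu_adds}, naive adds = {naive_adds}")
```

For 1024 elements the model performs 9217 additions in the naive scan against 1024 in the sequential one, and on the GPU each of those additions also costs global memory traffic.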

Originally, the work-efficient scan was also slower than the CPU scan, by quite a bit. I made three major improvements (plus one bonus tweak) to get the time down:
1. I switched from global memory accesses to shared memory. This required completely restructuring the scan algorithm so that it runs in independent blocks, and then combines those independent partial scan results to compute the full scan. Also, instead of doing the two phases of the scan (upsweep and downsweep) in separate kernel invocations, it became advantageous to do both in a single invocation; this way, global memory only has to be read from and written to once.
2. Shared memory comes with a caveat: bank conflicts. To address this, I added a variable offset to where data is stored and read from in shared memory. (See the images after the graph for a visual!)
3. To accommodate arbitrarily large arrays, there's a recursive step that computes and joins the results of the different partial scans. I was using a pinned memory transfer here for some data that needs to be carried over between recursive invocations, but I realized I could work out a priori how much memory I would need for all iterations, and pre-allocate it.
4. (Bonus) Because I hate my readers, I tried to inline all math expressions in my kernels. Just kidding - I did it to avoid the use of extra registers, not to hurt readability :(
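The block-wise structure from items 1 and 3 can be sketched on the CPU as follows. This is a simplified Python model under stated assumptions, not my actual kernel code: each "block" is a power-of-two-sized slice scanned with an upsweep and downsweep, the per-block totals are scanned recursively, and the resulting offsets are added back. The function names and the default `block_size` are made up for this example.

```python
def blelloch_block_scan(block):
    """Exclusive scan of a power-of-two-length list, in place.

    Returns the total of the block. Models what one thread block does
    in shared memory: upsweep (reduce), then downsweep.
    """
    n = len(block)
    # Upsweep: build partial sums up a balanced tree.
    stride = 1
    while stride < n:
        for i in range(2 * stride - 1, n, 2 * stride):
            block[i] += block[i - stride]
        stride *= 2
    total = block[n - 1]
    block[n - 1] = 0  # seed the root with the identity
    # Downsweep: push prefixes back down the tree.
    stride = n // 2
    while stride >= 1:
        for i in range(2 * stride - 1, n, 2 * stride):
            left = block[i - stride]
            block[i - stride] = block[i]
            block[i] += left
        stride //= 2
    return total


def blocked_exclusive_scan(data, block_size=8):
    """Exclusive scan of any-length input; block_size must be a power of two >= 2."""
    n = len(data)
    if n == 0:
        return []
    num_blocks = (n + block_size - 1) // block_size
    # Pad with zeros so every block is full (handles non-power-of-two sizes).
    padded = list(data) + [0] * (num_blocks * block_size - n)
    block_sums = []
    for b in range(num_blocks):
        lo, hi = b * block_size, (b + 1) * block_size
        chunk = padded[lo:hi]
        block_sums.append(blelloch_block_scan(chunk))
        padded[lo:hi] = chunk
    if num_blocks > 1:
        # Recursive step: scan the block totals, then add each block's offset.
        offsets = blocked_exclusive_scan(block_sums, block_size)
        for b in range(num_blocks):
            for i in range(b * block_size, (b + 1) * block_size):
                padded[i] += offsets[b]
    return padded[:n]


print(blocked_exclusive_scan([3, 1, 7, 0, 4, 1, 6, 3, 2, 5], block_size=4))
```

Because the number of block totals at each recursion level is known from the input length alone, the scratch space for every level can be sized up front, which is the idea behind the pre-allocation in item 3.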


<p align="center">
<img src="img/Prescan%20times%20for%20various-sized%20arrays.svg" alt="Prescan">
</p>

According to Nsight Compute, adding a variable offset to avoid shared memory bank conflicts decreased my "excessive shared wavefronts" from 88.7% to 12.2%!

<p align="center">
<img src="img/perf1.webp" alt="Shared memory access before optimization">
</p>

<p align="center">
<img src="img/perf2.webp" alt="Shared memory access after optimization">
</p>
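To show why an index offset helps, here is a small Python model of shared memory banking. It is a sketch under assumptions: it uses 32 four-byte banks, the padding scheme from GPU Gems 3 (one extra slot per 32 elements), and an upsweep-style access pattern for one warp. My kernel's exact offset may differ, and the function names are made up for this example.

```python
NUM_BANKS = 32      # assumed: 32 banks, one 4-byte word wide
LOG_NUM_BANKS = 5


def padded_index(i):
    """Shift index i by one slot for every NUM_BANKS elements before it."""
    return i + (i >> LOG_NUM_BANKS)


def worst_conflict(indices):
    """Largest number of accesses in one warp that land in the same bank."""
    counts = {}
    for idx in indices:
        bank = idx % NUM_BANKS
        counts[bank] = counts.get(bank, 0) + 1
    return max(counts.values())


def upsweep_indices(stride, num_threads=32):
    """Element touched by each thread of a warp at a given upsweep stride."""
    return [(2 * t + 2) * stride - 1 for t in range(num_threads)]


for stride in (1, 16):
    raw = upsweep_indices(stride)
    padded = [padded_index(i) for i in raw]
    print(f"stride {stride}: {worst_conflict(raw)}-way without padding, "
          f"{worst_conflict(padded)}-way with padding")
```

In this model, a stride of 16 sends all 32 threads of a warp to the same bank without padding, while the padded indices spread them across 32 distinct banks.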

## Compact

Not much to say here - since the compact algorithm depends on the prescan, most of the speed of the GPU compaction is due to the optimizations crafted in the scan itself!
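For reference, the dependency on the scan looks like this in a CPU-side Python sketch. It assumes compaction means "remove the zeros", and it uses a plain sequential scan as a stand-in for the GPU scan; the function names are made up for this example.

```python
def exclusive_scan(values):
    """Sequential exclusive scan, standing in for the GPU scan."""
    out, running = [], 0
    for v in values:
        out.append(running)
        running += v
    return out


def compact(data):
    """Remove zeros from data using map -> scan -> scatter."""
    if not data:
        return []
    # Map: flag each element that should be kept.
    flags = [1 if x != 0 else 0 for x in data]
    # Scan: the exclusive scan of the flags gives each kept element its output slot.
    positions = exclusive_scan(flags)
    count = positions[-1] + flags[-1]
    # Scatter: every kept element writes itself to its slot (independent per element).
    out = [0] * count
    for i, x in enumerate(data):
        if flags[i]:
            out[positions[i]] = x
    return out


print(compact([1, 0, 3, 1, 2, 0, 0, 2]))
```

The map and scatter steps are one independent operation per element, so on the GPU the scan in the middle is where nearly all of the interesting work happens.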

<p align="center">
<img src="img/Compact%20Times%20for%20various-sized%20arrays.svg" alt="Compact">
</p>


# Sample output (2^20 elements)

```
****************
** SCAN TESTS **
****************
[ 27 36 43 45 26 29 25 32 2 46 49 26 19 ... 5 0 ]
==== cpu scan, power-of-two ====
elapsed time: 0.6881ms (std::chrono Measured)
==== cpu scan, non-power-of-two ====
elapsed time: 0.505ms (std::chrono Measured)
passed
==== naive scan, power-of-two ====
elapsed time: 1.11667ms (CUDA Measured)
passed
==== naive scan, non-power-of-two ====
elapsed time: 0.898528ms (CUDA Measured)
passed
==== work-efficient scan, power-of-two ====
elapsed time: 0.435392ms (CUDA Measured)
passed
==== work-efficient scan, non-power-of-two ====
elapsed time: 0.340064ms (CUDA Measured)
passed
==== thrust scan, power-of-two ====
elapsed time: 0.294304ms (CUDA Measured)
passed
==== thrust scan, non-power-of-two ====
elapsed time: 0.264192ms (CUDA Measured)
passed

*****************************
** STREAM COMPACTION TESTS **
*****************************
[ 1 0 3 1 2 1 1 2 0 2 1 2 1 ... 3 0 ]
==== cpu compact without scan, power-of-two ====
elapsed time: 2.191ms (std::chrono Measured)
passed
==== cpu compact without scan, non-power-of-two ====
elapsed time: 2.0626ms (std::chrono Measured)
passed
==== cpu compact with scan ====
elapsed time: 5.9357ms (std::chrono Measured)
passed
==== work-efficient compact, power-of-two ====
elapsed time: 0.717088ms (CUDA Measured)
passed
==== work-efficient compact, non-power-of-two ====
elapsed time: 0.518976ms (CUDA Measured)
passed

*****************************
** RADIX TESTS **
*****************************
[ 147 231 53 245 206 59 5 137 97 116 69 136 169 ... 10 0 ]
==== radix sort, power-of-two ====
elapsed time: 22.694ms (CUDA Measured)
passed
```
1 change: 1 addition & 0 deletions img/Compact Times for various-sized arrays.svg
1 change: 1 addition & 0 deletions img/Prescan times for various-sized arrays.svg
Binary file added img/downsweep.jpg
Binary file added img/perf1.webp
Binary file added img/perf2.webp
109 changes: 109 additions & 0 deletions performance_automator.sh
@@ -0,0 +1,109 @@
#!/bin/bash

# Run each performance test NUM_TESTS times, and write the median values to a csv file
NUM_TESTS=20

cpu_scan_time_pot=()
cpu_scan_time_npot=()
cpu_compact_without_scan_time_pot=()
cpu_compact_without_scan_time_npot=()
cpu_compact_with_scan_time=()

naive_scan_time_pot=()
naive_scan_time_npot=()

efficient_scan_time_pot=()
efficient_scan_time_npot=()
efficient_compact_time_pot=()
efficient_compact_time_npot=()

thrust_scan_time_pot=()
thrust_scan_time_npot=()

for i in $(seq 1 $NUM_TESTS)
do
echo -e "Test $i\n"
result=$(./bin/cis5650_stream_compaction_test cpu)

elapsed_time=$(echo "$result" | grep -A 1 "cpu scan, power-of-two" | grep -oP 'elapsed time: \K[0-9]+\.[0-9]+')
cpu_scan_time_pot+=($elapsed_time)

elapsed_time=$(echo "$result" | grep -A 1 "cpu scan, non-power-of-two" | grep -oP 'elapsed time: \K[0-9]+\.[0-9]+')
cpu_scan_time_npot+=($elapsed_time)

elapsed_time=$(echo "$result" | grep -A 1 "cpu compact without scan, power-of-two" | grep -oP 'elapsed time: \K[0-9]+\.[0-9]+')
cpu_compact_without_scan_time_pot+=($elapsed_time)

elapsed_time=$(echo "$result" | grep -A 1 "cpu compact without scan, non-power-of-two" | grep -oP 'elapsed time: \K[0-9]+\.[0-9]+')
cpu_compact_without_scan_time_npot+=($elapsed_time)

elapsed_time=$(echo "$result" | grep -A 1 "cpu compact with scan" | grep -oP 'elapsed time: \K[0-9]+\.[0-9]+')
cpu_compact_with_scan_time+=($elapsed_time)

result=$(./bin/cis5650_stream_compaction_test naive)

elapsed_time=$(echo "$result" | grep -A 1 "naive scan, power-of-two" | grep -oP 'elapsed time: \K[0-9]+\.[0-9]+')
naive_scan_time_pot+=($elapsed_time)

elapsed_time=$(echo "$result" | grep -A 1 "naive scan, non-power-of-two" | grep -oP 'elapsed time: \K[0-9]+\.[0-9]+')
naive_scan_time_npot+=($elapsed_time)

result=$(./bin/cis5650_stream_compaction_test efficient)

elapsed_time=$(echo "$result" | grep -A 1 "efficient scan, power-of-two" | grep -oP 'elapsed time: \K[0-9]+\.[0-9]+')
efficient_scan_time_pot+=($elapsed_time)

elapsed_time=$(echo "$result" | grep -A 1 "efficient scan, non-power-of-two" | grep -oP 'elapsed time: \K[0-9]+\.[0-9]+')
efficient_scan_time_npot+=($elapsed_time)

elapsed_time=$(echo "$result" | grep -A 1 "efficient compact, power-of-two" | grep -oP 'elapsed time: \K[0-9]+\.[0-9]+')
efficient_compact_time_pot+=($elapsed_time)

elapsed_time=$(echo "$result" | grep -A 1 "efficient compact, non-power-of-two" | grep -oP 'elapsed time: \K[0-9]+\.[0-9]+')
efficient_compact_time_npot+=($elapsed_time)

result=$(./bin/cis5650_stream_compaction_test thrust)

elapsed_time=$(echo "$result" | grep -A 1 "thrust scan, power-of-two" | grep -oP 'elapsed time: \K[0-9]+\.[0-9]+')
thrust_scan_time_pot+=($elapsed_time)

elapsed_time=$(echo "$result" | grep -A 1 "thrust scan, non-power-of-two" | grep -oP 'elapsed time: \K[0-9]+\.[0-9]+')
thrust_scan_time_npot+=($elapsed_time)

done

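calculate_median() {
    # Sort the arguments numerically, then take the middle value
    # (or the mean of the two middle values when the count is even).
    local arr len
    arr=($(printf '%s\n' "${@}" | sort -n))
    len=${#arr[@]}
    if (( $len % 2 == 0 )); then
        echo "scale=5; (${arr[$len/2-1]} + ${arr[$len/2]}) / 2" | bc
    else
        echo "${arr[$len/2]}"
    fi
}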

median_cpu_scan_time_pot=$(calculate_median "${cpu_scan_time_pot[@]}")
median_cpu_scan_time_npot=$(calculate_median "${cpu_scan_time_npot[@]}")
median_cpu_compact_without_scan_time_pot=$(calculate_median "${cpu_compact_without_scan_time_pot[@]}")
median_cpu_compact_without_scan_time_npot=$(calculate_median "${cpu_compact_without_scan_time_npot[@]}")
median_cpu_compact_with_scan_time=$(calculate_median "${cpu_compact_with_scan_time[@]}")

median_naive_scan_time_pot=$(calculate_median "${naive_scan_time_pot[@]}")
median_naive_scan_time_npot=$(calculate_median "${naive_scan_time_npot[@]}")

median_efficient_scan_time_pot=$(calculate_median "${efficient_scan_time_pot[@]}")
median_efficient_scan_time_npot=$(calculate_median "${efficient_scan_time_npot[@]}")
median_efficient_compact_time_pot=$(calculate_median "${efficient_compact_time_pot[@]}")
median_efficient_compact_time_npot=$(calculate_median "${efficient_compact_time_npot[@]}")

median_thrust_scan_time_pot=$(calculate_median "${thrust_scan_time_pot[@]}")
median_thrust_scan_time_npot=$(calculate_median "${thrust_scan_time_npot[@]}")

# Now write the results to a csv file
echo ",CPU,Naive,Efficient,Thrust" > performance_results.csv
echo -e "Scan Time Power of Two,$median_cpu_scan_time_pot,$median_naive_scan_time_pot,$median_efficient_scan_time_pot,$median_thrust_scan_time_pot" >> performance_results.csv
echo -e "Scan Time Non-Power of Two,$median_cpu_scan_time_npot,$median_naive_scan_time_npot,$median_efficient_scan_time_npot,$median_thrust_scan_time_npot" >> performance_results.csv
echo -e "Compact Time Power of Two,$median_cpu_compact_without_scan_time_pot,,$median_efficient_compact_time_pot," >> performance_results.csv
echo -e "Compact Time Non-Power of Two,$median_cpu_compact_without_scan_time_npot,,$median_efficient_compact_time_npot," >> performance_results.csv
echo -e "(CPU) Compact Time With Scan,$median_cpu_compact_with_scan_time,,," >> performance_results.csv
