diff --git a/README.md b/README.md
index 0e38ddb..0ba3443 100644
--- a/README.md
+++ b/README.md
@@ -3,12 +3,124 @@ CUDA Stream Compaction
**University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 2**
-* (TODO) YOUR NAME HERE
- * (TODO) [LinkedIn](), [personal website](), [twitter](), etc.
-* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab)
+* Matt Schwartz
+ * [LinkedIn](https://www.linkedin.com/in/matthew-schwartz-37019016b/)
+ * [Personal website](https://mattzschwartz.web.app/)
+* Tested on: Windows 10 22H2, Intel(R) Core(TM) i7-10750H CPU @ 2.60GHz, NVIDIA GeForce RTX 2060
-### (TODO: Your README)
+
+
+
-Include analysis, etc. (Remember, this is public, so don't put
-anything here that you don't want to share with the world.)
+*Image source [GPU Gems 3 Chapter 39](https://developer.nvidia.com/gpugems/gpugems3/part-vi-gpu-computing/chapter-39-parallel-prefix-sum-scan-cuda)*
+
+# Background
+
+In this repository, I implement several variations of a prescan algorithm: a CPU-based version, a naive GPU implementation, and a work-efficient GPU implementation, alongside the built-in Thrust library implementation (for comparison). Afterwards, I build on this prescan algorithm to develop a parallel stream-compaction implementation (again, compared against the CPU).
+
+Prescans and stream compaction have a variety of uses in data analysis and computer graphics. To learn more about these algorithms, give the [GPU Gems 3 chapter](https://developer.nvidia.com/gpugems/gpugems3/part-vi-gpu-computing/chapter-39-parallel-prefix-sum-scan-cuda) a read.
+
+# The Data
+
+Let's take a high-level look at the data, then dig into some of the steps taken to optimize these results.
+
+A note on measurement: each test is repeated 20 times, and the median data point from each test is recorded and plotted. I chose the median because results varied significantly from run to run, and I wanted to lessen the impact of outliers.
+
+Before recording any data, I ran the (automated) test suite with varying kernel block sizes (for the naive and work-efficient algorithms) and used that data to determine the optimal size for each algorithm, so the comparison is apples-to-apples. For the naive scan, the optimum was around 256 threads per block; for the work-efficient scan, around 512. In both cases, performance was minimally affected by block sizes ranging from 128 to 1024.
+
+## Prefix Scan
+
+Perhaps unsurprisingly, the Thrust library is the clear winner at every array size, from ~1 thousand elements to ~16 million. However, my work-efficient implementation isn't far off - about 2x slower than Thrust (in fact, each successively slower algorithm is about a factor of 2x slower than the previous; a fun coincidence).
+
+Interestingly, the CPU scan is actually faster across the board than the naive scan! This is attributable both to the naive scan's higher work complexity (it performs O(n log n) additions versus the CPU's O(n), albeit in parallel) and to its heavy use of global memory.
+
+Originally, the work-efficient scan was also slower than the CPU scan, by quite a bit. There were three major improvements I made to get the time down:
+1. Instead of using global memory reads, I switched to shared memory. This required completely restructuring the scan algorithm so that it could run in independent blocks, then use those partial per-block scan results to compute the full scan. Also, instead of doing the two phases of the scan (upsweep and downsweep) in separate kernel invocations, it became advantageous to do both in a single invocation; this way, global memory only has to be read from and written to once.
+2. Shared memory comes with a caveat: bank conflicts. To address this, I added a variable offset to where data is stored and read in shared memory (see the images after the graph for a visual!).
+3. To accommodate arbitrarily large arrays, there's a recursive step that computes and joins the results of the partial scans. I was using a pinned memory transfer here for data that needs to be carried over between recursive invocations, but I realized I could determine a priori how much memory all iterations would need and pre-allocate it.
+4. (Bonus) Because I hate my readers, I inlined all math expressions in my kernels. Just kidding - I did it to avoid using extra registers, not to hurt readability :(
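The padding scheme from step 2 can be illustrated on the host. This mirrors the `CONFLICT_FREE_OFFSET` macro used in the kernels: with 32 shared-memory banks, every 32nd index gets one extra slot of offset, so strided accesses that would otherwise hit the same bank land in different banks. The function names here are illustrative, not from the repo:

```cpp
// Padding to avoid shared-memory bank conflicts (32 banks => log2 = 5).
constexpr int LOG_NUM_BANKS = 5;

// One extra slot of offset per 32 indices, same as CONFLICT_FREE_OFFSET.
constexpr int conflictFreeOffset(int i) { return i >> LOG_NUM_BANKS; }

// The index actually used when reading/writing shared memory.
constexpr int paddedIndex(int i) { return i + conflictFreeOffset(i); }
```

Indices 0-31 are unchanged, index 32 shifts to 33, index 64 to 66, and so on, which is why the shared-memory allocation is sized `2 * blockSize + CONFLICT_FREE_OFFSET(2 * blockSize - 1)`.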
+
+
+
+
+
+
+According to Nsight Compute, adding a variable offset to avoid shared memory bank conflicts decreased my "excessive shared wavefronts" from 88.7% to 12.2%!
+
+
+
+
+
+
+
+
+
+## Compact
+
+Not much to say here - since the compact algorithm depends on the prescan, most of the GPU compaction's speed comes from the optimizations made in the scan itself!
+
+
+
+
+
+
+# Sample output (2^20 elements)
+
+```
+****************
+** SCAN TESTS **
+****************
+ [ 27 36 43 45 26 29 25 32 2 46 49 26 19 ... 5 0 ]
+==== cpu scan, power-of-two ====
+ elapsed time: 0.6881ms (std::chrono Measured)
+==== cpu scan, non-power-of-two ====
+ elapsed time: 0.505ms (std::chrono Measured)
+ passed
+==== naive scan, power-of-two ====
+ elapsed time: 1.11667ms (CUDA Measured)
+ passed
+==== naive scan, non-power-of-two ====
+ elapsed time: 0.898528ms (CUDA Measured)
+ passed
+==== work-efficient scan, power-of-two ====
+ elapsed time: 0.435392ms (CUDA Measured)
+ passed
+==== work-efficient scan, non-power-of-two ====
+ elapsed time: 0.340064ms (CUDA Measured)
+ passed
+==== thrust scan, power-of-two ====
+ elapsed time: 0.294304ms (CUDA Measured)
+ passed
+==== thrust scan, non-power-of-two ====
+ elapsed time: 0.264192ms (CUDA Measured)
+ passed
+
+*****************************
+** STREAM COMPACTION TESTS **
+*****************************
+ [ 1 0 3 1 2 1 1 2 0 2 1 2 1 ... 3 0 ]
+==== cpu compact without scan, power-of-two ====
+ elapsed time: 2.191ms (std::chrono Measured)
+ passed
+==== cpu compact without scan, non-power-of-two ====
+ elapsed time: 2.0626ms (std::chrono Measured)
+ passed
+==== cpu compact with scan ====
+ elapsed time: 5.9357ms (std::chrono Measured)
+ passed
+==== work-efficient compact, power-of-two ====
+ elapsed time: 0.717088ms (CUDA Measured)
+ passed
+==== work-efficient compact, non-power-of-two ====
+ elapsed time: 0.518976ms (CUDA Measured)
+ passed
+
+*****************************
+** RADIX TESTS **
+*****************************
+ [ 147 231 53 245 206 59 5 137 97 116 69 136 169 ... 10 0 ]
+==== radix sort, power-of-two ====
+ elapsed time: 22.694ms (CUDA Measured)
+ passed
+```
\ No newline at end of file
diff --git a/img/Compact Times for various-sized arrays.svg b/img/Compact Times for various-sized arrays.svg
new file mode 100644
index 0000000..4b2e055
--- /dev/null
+++ b/img/Compact Times for various-sized arrays.svg
@@ -0,0 +1 @@
+
\ No newline at end of file
diff --git a/img/Prescan times for various-sized arrays.svg b/img/Prescan times for various-sized arrays.svg
new file mode 100644
index 0000000..47433ee
--- /dev/null
+++ b/img/Prescan times for various-sized arrays.svg
@@ -0,0 +1 @@
+
\ No newline at end of file
diff --git a/img/downsweep.jpg b/img/downsweep.jpg
new file mode 100644
index 0000000..88be7ff
Binary files /dev/null and b/img/downsweep.jpg differ
diff --git a/img/perf1.webp b/img/perf1.webp
new file mode 100644
index 0000000..ff6ce51
Binary files /dev/null and b/img/perf1.webp differ
diff --git a/img/perf2.webp b/img/perf2.webp
new file mode 100644
index 0000000..10db27c
Binary files /dev/null and b/img/perf2.webp differ
diff --git a/performance_automator.sh b/performance_automator.sh
new file mode 100755
index 0000000..d6c5419
--- /dev/null
+++ b/performance_automator.sh
@@ -0,0 +1,109 @@
+#!/bin/bash
+
+# Run each performance test NUM_TESTS times and write the median values to a csv file
+NUM_TESTS=20
+
+cpu_scan_time_pot=()
+cpu_scan_time_npot=()
+cpu_compact_without_scan_time_pot=()
+cpu_compact_without_scan_time_npot=()
+cpu_compact_with_scan_time=()
+
+naive_scan_time_pot=()
+naive_scan_time_npot=()
+
+efficient_scan_time_pot=()
+efficient_scan_time_npot=()
+efficient_compact_time_pot=()
+efficient_compact_time_npot=()
+
+thrust_scan_time_pot=()
+thrust_scan_time_npot=()
+
+for i in $(seq 1 $NUM_TESTS)
+do
+ echo -e "Test $i\n"
+ result=$(./bin/cis5650_stream_compaction_test cpu)
+
+ elapsed_time=$(echo "$result" | grep -A 1 "cpu scan, power-of-two" | grep -oP 'elapsed time: \K[0-9]+\.[0-9]+')
+ cpu_scan_time_pot+=($elapsed_time)
+
+ elapsed_time=$(echo "$result" | grep -A 1 "cpu scan, non-power-of-two" | grep -oP 'elapsed time: \K[0-9]+\.[0-9]+')
+ cpu_scan_time_npot+=($elapsed_time)
+
+ elapsed_time=$(echo "$result" | grep -A 1 "cpu compact without scan, power-of-two" | grep -oP 'elapsed time: \K[0-9]+\.[0-9]+')
+ cpu_compact_without_scan_time_pot+=($elapsed_time)
+
+ elapsed_time=$(echo "$result" | grep -A 1 "cpu compact without scan, non-power-of-two" | grep -oP 'elapsed time: \K[0-9]+\.[0-9]+')
+ cpu_compact_without_scan_time_npot+=($elapsed_time)
+
+ elapsed_time=$(echo "$result" | grep -A 1 "cpu compact with scan" | grep -oP 'elapsed time: \K[0-9]+\.[0-9]+')
+ cpu_compact_with_scan_time+=($elapsed_time)
+
+ result=$(./bin/cis5650_stream_compaction_test naive)
+
+ elapsed_time=$(echo "$result" | grep -A 1 "naive scan, power-of-two" | grep -oP 'elapsed time: \K[0-9]+\.[0-9]+')
+ naive_scan_time_pot+=($elapsed_time)
+
+ elapsed_time=$(echo "$result" | grep -A 1 "naive scan, non-power-of-two" | grep -oP 'elapsed time: \K[0-9]+\.[0-9]+')
+ naive_scan_time_npot+=($elapsed_time)
+
+ result=$(./bin/cis5650_stream_compaction_test efficient)
+
+ elapsed_time=$(echo "$result" | grep -A 1 "efficient scan, power-of-two" | grep -oP 'elapsed time: \K[0-9]+\.[0-9]+')
+ efficient_scan_time_pot+=($elapsed_time)
+
+ elapsed_time=$(echo "$result" | grep -A 1 "efficient scan, non-power-of-two" | grep -oP 'elapsed time: \K[0-9]+\.[0-9]+')
+ efficient_scan_time_npot+=($elapsed_time)
+
+ elapsed_time=$(echo "$result" | grep -A 1 "efficient compact, power-of-two" | grep -oP 'elapsed time: \K[0-9]+\.[0-9]+')
+ efficient_compact_time_pot+=($elapsed_time)
+
+ elapsed_time=$(echo "$result" | grep -A 1 "efficient compact, non-power-of-two" | grep -oP 'elapsed time: \K[0-9]+\.[0-9]+')
+ efficient_compact_time_npot+=($elapsed_time)
+
+ result=$(./bin/cis5650_stream_compaction_test thrust)
+
+ elapsed_time=$(echo "$result" | grep -A 1 "thrust scan, power-of-two" | grep -oP 'elapsed time: \K[0-9]+\.[0-9]+')
+ thrust_scan_time_pot+=($elapsed_time)
+
+ elapsed_time=$(echo "$result" | grep -A 1 "thrust scan, non-power-of-two" | grep -oP 'elapsed time: \K[0-9]+\.[0-9]+')
+ thrust_scan_time_npot+=($elapsed_time)
+
+done
+
+calculate_median() {
+ arr=($(printf '%s\n' "${@}" | sort -n))
+ len=${#arr[@]}
+ if (( $len % 2 == 0 )); then
+ echo "scale=5; (${arr[$len/2-1]} + ${arr[$len/2]}) / 2" | bc
+ else
+ echo "${arr[$len/2]}"
+ fi
+}
+
+median_cpu_scan_time_pot=$(calculate_median "${cpu_scan_time_pot[@]}")
+median_cpu_scan_time_npot=$(calculate_median "${cpu_scan_time_npot[@]}")
+median_cpu_compact_without_scan_time_pot=$(calculate_median "${cpu_compact_without_scan_time_pot[@]}")
+median_cpu_compact_without_scan_time_npot=$(calculate_median "${cpu_compact_without_scan_time_npot[@]}")
+median_cpu_compact_with_scan_time=$(calculate_median "${cpu_compact_with_scan_time[@]}")
+
+median_naive_scan_time_pot=$(calculate_median "${naive_scan_time_pot[@]}")
+median_naive_scan_time_npot=$(calculate_median "${naive_scan_time_npot[@]}")
+
+median_efficient_scan_time_pot=$(calculate_median "${efficient_scan_time_pot[@]}")
+median_efficient_scan_time_npot=$(calculate_median "${efficient_scan_time_npot[@]}")
+median_efficient_compact_time_pot=$(calculate_median "${efficient_compact_time_pot[@]}")
+median_efficient_compact_time_npot=$(calculate_median "${efficient_compact_time_npot[@]}")
+
+median_thrust_scan_time_pot=$(calculate_median "${thrust_scan_time_pot[@]}")
+median_thrust_scan_time_npot=$(calculate_median "${thrust_scan_time_npot[@]}")
+
+# Now write the results to a csv file
+echo ",CPU,Naive,Efficient,Thrust" > performance_results.csv
+echo -e "Scan Time Power of Two,$median_cpu_scan_time_pot,$median_naive_scan_time_pot,$median_efficient_scan_time_pot,$median_thrust_scan_time_pot" >> performance_results.csv
+echo -e "Scan Time Non-Power of Two,$median_cpu_scan_time_npot,$median_naive_scan_time_npot,$median_efficient_scan_time_npot,$median_thrust_scan_time_npot" >> performance_results.csv
+echo -e "Compact Time Power of Two,$median_cpu_compact_without_scan_time_pot,,$median_efficient_compact_time_pot," >> performance_results.csv
+echo -e "Compact Time Non-Power of Two,$median_cpu_compact_without_scan_time_npot,,$median_efficient_compact_time_npot," >> performance_results.csv
+echo -e "(CPU) Compact Time With Scan,$median_cpu_compact_with_scan_time,,," >> performance_results.csv
+
diff --git a/src/main.cpp b/src/main.cpp
index 896ac2b..062d011 100644
--- a/src/main.cpp
+++ b/src/main.cpp
@@ -11,9 +11,10 @@
#include
#include
#include
+#include <cstring>
+#include <algorithm>
#include "testing_helpers.hpp"
-const int SIZE = 1 << 8; // feel free to change the size of array
+const int SIZE = 1 << 20; // feel free to change the size of array
const int NPOT = SIZE - 3; // Non-Power-Of-Two
int *a = new int[SIZE];
int *b = new int[SIZE];
@@ -34,25 +35,46 @@ int main(int argc, char* argv[]) {
// initialize b using StreamCompaction::CPU::scan you implement
// We use b for further comparison. Make sure your StreamCompaction::CPU::scan is correct.
// At first all cases passed because b && c are all zeroes.
- zeroArray(SIZE, b);
- printDesc("cpu scan, power-of-two");
- StreamCompaction::CPU::scan(SIZE, b, a);
- printElapsedTime(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation(), "(std::chrono Measured)");
- printArray(SIZE, b, true);
-
- zeroArray(SIZE, c);
- printDesc("cpu scan, non-power-of-two");
- StreamCompaction::CPU::scan(NPOT, c, a);
- printElapsedTime(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation(), "(std::chrono Measured)");
- printArray(NPOT, b, true);
- printCmpResult(NPOT, b, c);
-
- zeroArray(SIZE, c);
- printDesc("naive scan, power-of-two");
- StreamCompaction::Naive::scan(SIZE, c, a);
- printElapsedTime(StreamCompaction::Naive::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
- //printArray(SIZE, c, true);
- printCmpResult(SIZE, b, c);
+
+ if (argc > 1 && strcmp(argv[1], "cpu") == 0) {
+ zeroArray(SIZE, b);
+ printDesc("cpu scan, power-of-two");
+ StreamCompaction::CPU::scan(SIZE, b, a);
+ printElapsedTime(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation(), "(std::chrono Measured)");
+ // printArray(SIZE, b, true);
+
+ zeroArray(SIZE, c);
+ printDesc("cpu scan, non-power-of-two");
+ StreamCompaction::CPU::scan(NPOT, c, a);
+ printElapsedTime(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation(), "(std::chrono Measured)");
+ // printArray(NPOT, b, true);
+ printCmpResult(NPOT, b, c);
+ }
+
+
+ if (argc > 1 && strcmp(argv[1], "naive") == 0) {
+ // We have to run the cpu scans just to get values to compare against for the naive scan
+ zeroArray(SIZE, b);
+ StreamCompaction::CPU::scan(SIZE, b, a);
+
+ zeroArray(SIZE, c);
+ StreamCompaction::CPU::scan(NPOT, c, a);
+
+ zeroArray(SIZE, c);
+ printDesc("naive scan, power-of-two");
+ StreamCompaction::Naive::scan(SIZE, c, a);
+ printElapsedTime(StreamCompaction::Naive::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
+ //printArray(SIZE, c, true);
+ printCmpResult(SIZE, b, c);
+
+ zeroArray(SIZE, c);
+ printDesc("naive scan, non-power-of-two");
+ StreamCompaction::Naive::scan(NPOT, c, a);
+ printElapsedTime(StreamCompaction::Naive::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
+ //printArray(SIZE, c, true);
+ printCmpResult(NPOT, b, c);
+ }
+
/* For bug-finding only: Array of 1s to help find bugs in stream compaction or scan
onesArray(SIZE, c);
@@ -60,40 +82,52 @@ int main(int argc, char* argv[]) {
StreamCompaction::Naive::scan(SIZE, c, a);
printArray(SIZE, c, true); */
- zeroArray(SIZE, c);
- printDesc("naive scan, non-power-of-two");
- StreamCompaction::Naive::scan(NPOT, c, a);
- printElapsedTime(StreamCompaction::Naive::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
- //printArray(SIZE, c, true);
- printCmpResult(NPOT, b, c);
-
- zeroArray(SIZE, c);
- printDesc("work-efficient scan, power-of-two");
- StreamCompaction::Efficient::scan(SIZE, c, a);
- printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
- //printArray(SIZE, c, true);
- printCmpResult(SIZE, b, c);
-
- zeroArray(SIZE, c);
- printDesc("work-efficient scan, non-power-of-two");
- StreamCompaction::Efficient::scan(NPOT, c, a);
- printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
- //printArray(NPOT, c, true);
- printCmpResult(NPOT, b, c);
-
- zeroArray(SIZE, c);
- printDesc("thrust scan, power-of-two");
- StreamCompaction::Thrust::scan(SIZE, c, a);
- printElapsedTime(StreamCompaction::Thrust::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
- //printArray(SIZE, c, true);
- printCmpResult(SIZE, b, c);
-
- zeroArray(SIZE, c);
- printDesc("thrust scan, non-power-of-two");
- StreamCompaction::Thrust::scan(NPOT, c, a);
- printElapsedTime(StreamCompaction::Thrust::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
- //printArray(NPOT, c, true);
- printCmpResult(NPOT, b, c);
+ if (argc > 1 && strcmp(argv[1], "efficient") == 0) {
+ // We have to run the cpu scans just to get values to compare against for the efficient scan
+ zeroArray(SIZE, b);
+ StreamCompaction::CPU::scan(SIZE, b, a);
+
+ zeroArray(SIZE, c);
+ StreamCompaction::CPU::scan(NPOT, c, a);
+
+ zeroArray(SIZE, c);
+ printDesc("work-efficient scan, power-of-two");
+ StreamCompaction::Efficient::scan(SIZE, c, a);
+ printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
+ // printArray(SIZE, c, true);
+ printCmpResult(SIZE, b, c);
+
+ zeroArray(SIZE, c);
+ printDesc("work-efficient scan, non-power-of-two");
+ StreamCompaction::Efficient::scan(NPOT, c, a);
+ printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
+ // printArray(NPOT, c, true);
+ printCmpResult(NPOT, b, c);
+ }
+
+ if (argc > 1 && strcmp(argv[1], "thrust") == 0) {
+ // We have to run the cpu scans just to get values to compare against for the thrust scan
+ zeroArray(SIZE, b);
+ StreamCompaction::CPU::scan(SIZE, b, a);
+
+ zeroArray(SIZE, c);
+ StreamCompaction::CPU::scan(NPOT, c, a);
+
+ zeroArray(SIZE, c);
+ printDesc("thrust scan, power-of-two");
+ StreamCompaction::Thrust::scan(SIZE, c, a);
+ printElapsedTime(StreamCompaction::Thrust::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
+ //printArray(SIZE, c, true);
+ printCmpResult(SIZE, b, c);
+
+ zeroArray(SIZE, c);
+ printDesc("thrust scan, non-power-of-two");
+ StreamCompaction::Thrust::scan(NPOT, c, a);
+ printElapsedTime(StreamCompaction::Thrust::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
+ //printArray(NPOT, c, true);
+ printCmpResult(NPOT, b, c);
+ }
+
printf("\n");
printf("*****************************\n");
@@ -108,44 +142,84 @@ int main(int argc, char* argv[]) {
int count, expectedCount, expectedNPOT;
- // initialize b using StreamCompaction::CPU::compactWithoutScan you implement
- // We use b for further comparison. Make sure your StreamCompaction::CPU::compactWithoutScan is correct.
- zeroArray(SIZE, b);
- printDesc("cpu compact without scan, power-of-two");
- count = StreamCompaction::CPU::compactWithoutScan(SIZE, b, a);
- printElapsedTime(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation(), "(std::chrono Measured)");
- expectedCount = count;
- printArray(count, b, true);
- printCmpLenResult(count, expectedCount, b, b);
-
- zeroArray(SIZE, c);
- printDesc("cpu compact without scan, non-power-of-two");
- count = StreamCompaction::CPU::compactWithoutScan(NPOT, c, a);
- printElapsedTime(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation(), "(std::chrono Measured)");
- expectedNPOT = count;
- printArray(count, c, true);
- printCmpLenResult(count, expectedNPOT, b, c);
-
- zeroArray(SIZE, c);
- printDesc("cpu compact with scan");
- count = StreamCompaction::CPU::compactWithScan(SIZE, c, a);
- printElapsedTime(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation(), "(std::chrono Measured)");
- printArray(count, c, true);
- printCmpLenResult(count, expectedCount, b, c);
-
- zeroArray(SIZE, c);
- printDesc("work-efficient compact, power-of-two");
- count = StreamCompaction::Efficient::compact(SIZE, c, a);
- printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
- //printArray(count, c, true);
- printCmpLenResult(count, expectedCount, b, c);
-
- zeroArray(SIZE, c);
- printDesc("work-efficient compact, non-power-of-two");
- count = StreamCompaction::Efficient::compact(NPOT, c, a);
- printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
- //printArray(count, c, true);
- printCmpLenResult(count, expectedNPOT, b, c);
+ if (argc > 1 && strcmp(argv[1], "cpu") == 0) {
+ // initialize b using StreamCompaction::CPU::compactWithoutScan you implement
+ // We use b for further comparison. Make sure your StreamCompaction::CPU::compactWithoutScan is correct.
+ zeroArray(SIZE, b);
+ printDesc("cpu compact without scan, power-of-two");
+ count = StreamCompaction::CPU::compactWithoutScan(SIZE, b, a);
+ printElapsedTime(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation(), "(std::chrono Measured)");
+ expectedCount = count;
+ // printArray(count, b, true);
+ printCmpLenResult(count, expectedCount, b, b);
+
+ zeroArray(SIZE, c);
+ printDesc("cpu compact without scan, non-power-of-two");
+ count = StreamCompaction::CPU::compactWithoutScan(NPOT, c, a);
+ printElapsedTime(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation(), "(std::chrono Measured)");
+ expectedNPOT = count;
+ // printArray(count, c, true);
+ printCmpLenResult(count, expectedNPOT, b, c);
+
+ zeroArray(SIZE, c);
+ printDesc("cpu compact with scan");
+ count = StreamCompaction::CPU::compactWithScan(SIZE, c, a);
+ printElapsedTime(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation(), "(std::chrono Measured)");
+ // printArray(count, c, true);
+ printCmpLenResult(count, expectedCount, b, c);
+ }
+
+ if (argc > 1 && strcmp(argv[1], "efficient") == 0) {
+ // We have to run the cpu compacts just to get values to compare against for the efficient compact
+ zeroArray(SIZE, b);
+ count = StreamCompaction::CPU::compactWithoutScan(SIZE, b, a);
+ expectedCount = count;
+
+ zeroArray(SIZE, c);
+ count = StreamCompaction::CPU::compactWithoutScan(NPOT, c, a);
+ expectedNPOT = count;
+
+ zeroArray(SIZE, c);
+ printDesc("work-efficient compact, power-of-two");
+ count = StreamCompaction::Efficient::compact(SIZE, c, a);
+ printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
+ //printArray(count, c, true);
+ printCmpLenResult(count, expectedCount, b, c);
+
+ zeroArray(SIZE, c);
+ printDesc("work-efficient compact, non-power-of-two");
+ count = StreamCompaction::Efficient::compact(NPOT, c, a);
+ printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
+ //printArray(count, c, true);
+ printCmpLenResult(count, expectedNPOT, b, c);
+ }
+
+ printf("\n");
+ printf("*****************************\n");
+ printf("** RADIX TESTS **\n");
+ printf("*****************************\n");
+
+ // Radix tests
+ genArray(SIZE - 1, a, 255); // Leave a 0 at the end to test that edge case
+ a[SIZE - 1] = 0;
+ printArray(SIZE, a, true);
+
+ if (argc > 1 && strcmp(argv[1], "radix") == 0) {
+ zeroArray(SIZE, b);
+ printDesc("radix sort, power-of-two");
+ StreamCompaction::Radix::sort(SIZE, b, a);
+ printElapsedTime(StreamCompaction::Radix::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
+ // printArray(SIZE, b, true);
+
+ // Use std::sort to sort array a and store the result in c so we can compare against it
+ std::copy(a, a + SIZE, c);
+ std::sort(c, c + SIZE);
+
+ // printArray(SIZE, c, true);
+
+ // Compare the sorted array from Radix sort with the sorted array from std::sort
+ printCmpResult(SIZE, b, c);
+ }
system("pause"); // stop Win32 console from closing on exit
delete[] a;
diff --git a/src/testing_helpers.hpp b/src/testing_helpers.hpp
index 025e94a..d76dc37 100644
--- a/src/testing_helpers.hpp
+++ b/src/testing_helpers.hpp
@@ -24,7 +24,7 @@ void printDesc(const char *desc) {
template <typename T>
void printCmpResult(int n, T *a, T *b) {
printf(" %s \n",
- cmpArrays(n, a, b) ? "FAIL VALUE" : "passed");
+ cmpArrays(n, a, b) ? "\033[31mFAIL VALUE\033[0m" : "\033[32mpassed\033[0m");
}
template <typename T>
@@ -34,7 +34,7 @@ void printCmpLenResult(int n, int expN, T *a, T *b) {
}
printf(" %s \n",
(n == -1 || n != expN) ? "FAIL COUNT" :
- cmpArrays(n, a, b) ? "FAIL VALUE" : "passed");
+ cmpArrays(n, a, b) ? "\033[31mFAIL VALUE\033[0m" : "\033[32mpassed\033[0m");
}
void zeroArray(int n, int *a) {
diff --git a/stream_compaction/CMakeLists.txt b/stream_compaction/CMakeLists.txt
index e0ec27c..81e7a15 100644
--- a/stream_compaction/CMakeLists.txt
+++ b/stream_compaction/CMakeLists.txt
@@ -4,6 +4,7 @@ set(headers
"naive.h"
"efficient.h"
"thrust.h"
+ "radix.h"
)
set(sources
@@ -12,6 +13,7 @@ set(sources
"naive.cu"
"efficient.cu"
"thrust.cu"
+ "radix.cu"
)
list(SORT headers)
diff --git a/stream_compaction/common.cu b/stream_compaction/common.cu
index 2ed6d63..cc66f76 100644
--- a/stream_compaction/common.cu
+++ b/stream_compaction/common.cu
@@ -23,7 +23,10 @@ namespace StreamCompaction {
* which map to 0 will be removed, and elements which map to 1 will be kept.
*/
__global__ void kernMapToBoolean(int n, int *bools, const int *idata) {
- // TODO
+ int index = threadIdx.x + (blockIdx.x * blockDim.x);
+ if (index >= n) return;
+
+ bools[index] = (idata[index] != 0);
}
/**
@@ -31,8 +34,23 @@ namespace StreamCompaction {
* if bools[idx] == 1, it copies idata[idx] to odata[indices[idx]].
*/
__global__ void kernScatter(int n, int *odata,
- const int *idata, const int *bools, const int *indices) {
- // TODO
+ const int *idata, const int *scannedBools) {
+ int threadId = threadIdx.x + (blockIdx.x * blockDim.x);
+ if (threadId >= n) return;
+
+ int data = idata[threadId];
+ int scan_i = scannedBools[threadId];
+
+ // Special case for last element of idata array
+ if (threadId == n - 1 && data) {
+ odata[scan_i] = data;
+ return;
+ }
+
+ int scan_iplusone = scannedBools[threadId + 1];
+ if (scan_i != scan_iplusone) {
+ odata[scan_i] = data;
+ }
}
}
diff --git a/stream_compaction/common.h b/stream_compaction/common.h
index d2c1fed..fdbf9f8 100644
--- a/stream_compaction/common.h
+++ b/stream_compaction/common.h
@@ -35,7 +35,7 @@ namespace StreamCompaction {
__global__ void kernMapToBoolean(int n, int *bools, const int *idata);
__global__ void kernScatter(int n, int *odata,
- const int *idata, const int *bools, const int *indices);
+ const int *idata, const int *scannedBools);
/**
* This class is used for timing the performance
diff --git a/stream_compaction/cpu.cu b/stream_compaction/cpu.cu
index 719fa11..fa2fe48 100644
--- a/stream_compaction/cpu.cu
+++ b/stream_compaction/cpu.cu
@@ -17,10 +17,15 @@ namespace StreamCompaction {
* For performance analysis, this is supposed to be a simple for loop.
* (Optional) For better understanding before starting moving to GPU, you can simulate your GPU scan in this function first.
*/
- void scan(int n, int *odata, const int *idata) {
- timer().startCpuTimer();
- // TODO
- timer().endCpuTimer();
+ void scan(int n, int *odata, const int *idata, bool useTimer) {
+ if (useTimer) timer().startCpuTimer();
+ odata[0] = 0;
+ for (int i = 1; i < n; ++i) {
+ int input = idata[i - 1];
+ int last_output = odata[i - 1];
+ odata[i] = input + last_output;
+ }
+ if (useTimer) timer().endCpuTimer();
}
/**
@@ -30,9 +35,15 @@ namespace StreamCompaction {
*/
int compactWithoutScan(int n, int *odata, const int *idata) {
timer().startCpuTimer();
- // TODO
+ int numOutputElements = 0;
+ for (int i = 0; i < n; ++i) {
+ int input = idata[i];
+ if (input == 0) continue;
+ odata[numOutputElements] = input;
+ ++numOutputElements;
+ }
timer().endCpuTimer();
- return -1;
+ return numOutputElements;
}
/**
@@ -41,10 +52,33 @@ namespace StreamCompaction {
* @returns the number of elements remaining after compaction.
*/
int compactWithScan(int n, int *odata, const int *idata) {
+ int* trueFalseArray = new int[n];
+ int* scannedTFArray = new int[n];
+
timer().startCpuTimer();
- // TODO
+ for (int i = 0; i < n; ++i) {
+ int input = idata[i];
+ trueFalseArray[i] = (input == 0) ? 0 : 1;
+ }
+
+ scan(n, scannedTFArray, trueFalseArray, false);
+
+ // Scatter
+ int numOutputElements = 0;
+ for (int i = 0; i < n; ++i) {
+ int input = idata[i];
+ int trueFalseValue = trueFalseArray[i];
+ if (!trueFalseValue) continue;
+
+ odata[scannedTFArray[i]] = input;
+ ++numOutputElements;
+ }
+
timer().endCpuTimer();
- return -1;
+
+ delete[] trueFalseArray;
+ delete[] scannedTFArray;
+ return numOutputElements;
}
}
}
diff --git a/stream_compaction/cpu.h b/stream_compaction/cpu.h
index 873c047..6f3a466 100644
--- a/stream_compaction/cpu.h
+++ b/stream_compaction/cpu.h
@@ -6,7 +6,7 @@ namespace StreamCompaction {
namespace CPU {
StreamCompaction::Common::PerformanceTimer& timer();
- void scan(int n, int *odata, const int *idata);
+ void scan(int n, int *odata, const int *idata, bool useTimer = true);
int compactWithoutScan(int n, int *odata, const int *idata);
diff --git a/stream_compaction/efficient.cu b/stream_compaction/efficient.cu
index 2db346e..024982b 100644
--- a/stream_compaction/efficient.cu
+++ b/stream_compaction/efficient.cu
@@ -2,6 +2,7 @@
#include
#include "common.h"
#include "efficient.h"
+#include <algorithm>
namespace StreamCompaction {
namespace Efficient {
@@ -12,13 +13,143 @@ namespace StreamCompaction {
return timer;
}
+ const int MAX_BLOCK_SIZE = 512; // keep this as a power of 2. Tested and found 512 is optimal.
+
+ #define LOG_NUM_BANKS 5
+ #define CONFLICT_FREE_OFFSET(threadIdx) ((threadIdx) >> LOG_NUM_BANKS)
+
+ void scan(int n_padded, int* dev_data, int* stored_sums, int offset) {
+ int blockSize = std::min((n_padded / 2), MAX_BLOCK_SIZE);
+ dim3 blocksPerGrid = ((n_padded / 2) + blockSize - 1) / blockSize;
+
+ int sharedMemorySize = (2 * blockSize + CONFLICT_FREE_OFFSET(2 * blockSize - 1)) * sizeof(int);
+ kernScan<<<blocksPerGrid, blockSize, sharedMemorySize>>>(n_padded, ilog2ceil(2 * blockSize), dev_data, stored_sums + offset);
+ cudaDeviceSynchronize();
+
+ // If the array didn't fit within a single block, we need to collect the individual block scan results,
+ // put them in an array, and scan that array. Then add the twice-scanned array as increments back to the original results.
+ //
+ // This needs to be done recursively to handle arbitrarily large arrays.
+ if (n_padded > 2 * blockSize) {
+ // (Recursively) scan the summed blocks array
+ // Can use sum_data as both the input and output pointers for the scan. No issue writing over it.
+ scan(blocksPerGrid.x, stored_sums + offset, stored_sums + offset, blocksPerGrid.x);
+
+ // Finally, add scanned sum values back to the original dev_data
+ // In original scan, each thread handled 2 elements. In this step, each handles one, so we need 2x the blocks.
+ dim3 kernBlocksPerGrid = 2 * blocksPerGrid.x;
+ kernIncrement<<<kernBlocksPerGrid, blockSize>>>(n_padded, dev_data, stored_sums + offset);
+ cudaDeviceSynchronize();
+ }
+
+ }
+
/**
- * Performs prefix-sum (aka scan) on idata, storing the result into odata.
+ * Wrapper around the recursive scan above: allocates device buffers, pads the input, and handles GPU timing.
*/
void scan(int n, int *odata, const int *idata) {
+ int n_padded = 1 << ilog2ceil(n); // pad array up to the next power of two
+
+ int* dev_data;
+ cudaMalloc((void**)&dev_data, n_padded * sizeof(int));
+ cudaMemcpy(dev_data, idata, n * sizeof(int), cudaMemcpyHostToDevice);
+
+ // Pad input data to be a power of two, if needed.
+ if (n < n_padded) {
+ cudaMemset(dev_data + n, 0, (n_padded - n) * sizeof(int));
+ }
+
+ int blockSize = std::min((n_padded / 2), MAX_BLOCK_SIZE);
+ dim3 blocksPerGrid = ((n_padded / 2) + blockSize - 1) / blockSize;
+
+ // Calculate the total amount of memory needed for stored_sums
+ int stored_sums_size = 0;
+ for (int i = ilog2ceil(n_padded) - ilog2ceil(MAX_BLOCK_SIZE); i >= 1; i -= ilog2ceil(MAX_BLOCK_SIZE)) {
+ stored_sums_size += 1 << i;
+ }
+
+ // temp array used to store last entry per block during upsweep. See kernScan and kernIncrement for use info.
+ int* stored_sums;
+ cudaMalloc((void**)&stored_sums, std::max(1, stored_sums_size) * sizeof(int));
+
+
timer().startGpuTimer();
- // TODO
+ scan(n_padded, dev_data, stored_sums, 0);
timer().endGpuTimer();
+
+ cudaMemcpy(odata, dev_data, n * sizeof(int), cudaMemcpyDeviceToHost); // copy back only the n requested elements; odata may not hold n_padded
+ cudaFree(dev_data);
+ cudaFree(stored_sums);
+ }
+
+ /**
+ * n is the size of dev_data
+ */
+ __global__ void kernScan(int n, int numLevels, int* dev_data, int* stored_sums) {
+ if (threadIdx.x + (blockDim.x * blockIdx.x) >= n/2) return;
+
+ extern __shared__ int s_dev_data[];
+
+ // Put the right and left children into shared memory.
+ // Index from dev_data based on *global* position, but put into shared memory as local position (w.r.t. this block)
+ s_dev_data[2 * threadIdx.x + 1 + CONFLICT_FREE_OFFSET(2 * threadIdx.x + 1)] =
+ dev_data[2 * threadIdx.x + 1 + (2 * blockDim.x * blockIdx.x)];
+
+ s_dev_data[2 * threadIdx.x + CONFLICT_FREE_OFFSET(2 * threadIdx.x)] =
+ dev_data[2 * threadIdx.x + (2 * blockDim.x * blockIdx.x)];
+
+ for (int depth = 0; depth < numLevels; ++depth) {
+ __syncthreads();
+ // Make sure the local right-child index this thread will access is in bounds of this block.
+ if ((threadIdx.x * (1 << (depth + 1))) + (1 << (depth + 1)) - 1 >= 2 * blockDim.x) continue;
+
+ s_dev_data[(threadIdx.x * (1 << (depth + 1))) + (1 << (depth + 1)) - 1 + CONFLICT_FREE_OFFSET((threadIdx.x * (1 << (depth + 1))) + (1 << (depth + 1)) - 1)] +=
+ s_dev_data[(threadIdx.x * (1 << (depth + 1))) + (1 << depth) - 1 + CONFLICT_FREE_OFFSET((threadIdx.x * (1 << (depth + 1))) + (1 << depth) - 1)];
+ }
+ __syncthreads();
+
+ // Save off the last entry of s_dev_data to a temporary stored_sums buffer. This temp buffer is used if our initial
+ // input data is too large to be scanned in a single block.
+ // Then zero out the last entry for the downsweep.
+ if (threadIdx.x == 0) {
+ stored_sums[blockIdx.x] = s_dev_data[2 * blockDim.x - 1 + CONFLICT_FREE_OFFSET(2 * blockDim.x - 1)];
+ s_dev_data[2 * blockDim.x + CONFLICT_FREE_OFFSET(2 * blockDim.x - 1) - 1] = 0;
+ }
+
+ for (int depth = numLevels - 1; depth >= 1; --depth) {
+ __syncthreads();
+ // Make sure the local right-child index this thread will access is in bounds of this block.
+ if ((threadIdx.x * (1 << (depth + 1))) + (1 << (depth + 1)) - 1 >= 2 * blockDim.x) continue;
+
+ int leftVal = s_dev_data[(threadIdx.x * (1 << (depth + 1))) + (1 << depth) - 1 + CONFLICT_FREE_OFFSET((threadIdx.x * (1 << (depth + 1))) + (1 << depth) - 1)];
+ s_dev_data[(threadIdx.x * (1 << (depth + 1))) + (1 << depth) - 1 + CONFLICT_FREE_OFFSET((threadIdx.x * (1 << (depth + 1))) + (1 << depth) - 1)] =
+ s_dev_data[(threadIdx.x * (1 << (depth + 1))) + (1 << (depth + 1)) - 1 + CONFLICT_FREE_OFFSET((threadIdx.x * (1 << (depth + 1))) + (1 << (depth + 1)) - 1)];
+
+ s_dev_data[(threadIdx.x * (1 << (depth + 1))) + (1 << (depth + 1)) - 1 + CONFLICT_FREE_OFFSET(threadIdx.x * (1 << (depth + 1)) + (1 << (depth + 1)) - 1)] += leftVal;
+ }
+
+ __syncthreads();
+ // On the last iteration, depth = 0, we write to global memory.
+ dev_data[2 * threadIdx.x + (2 * blockDim.x * blockIdx.x)] =
+ s_dev_data[2 * threadIdx.x + 1 + CONFLICT_FREE_OFFSET(2 * threadIdx.x + 1)];
+
+ dev_data[2 * threadIdx.x + 1 + (2 * blockDim.x * blockIdx.x)] =
+ (s_dev_data[2 * threadIdx.x + CONFLICT_FREE_OFFSET(2 * threadIdx.x)] + s_dev_data[2 * threadIdx.x + 1 + CONFLICT_FREE_OFFSET(2 * threadIdx.x + 1)]);
+ }
+
+ /**
+ * Kernel to add the scanned block sums back to the original array.
+ *
+ * n here is the number of elements in the original input array.
+ */
+ __global__ void kernIncrement(int n, int* dev_data, int* sum_data) {
+ int threadId = threadIdx.x + (blockIdx.x * blockDim.x);
+ if (threadId >= n) return;
+
+ // The extra factor of 2 comes from the fact that we're using twice as many blocks here as in the original scan.
+ int sum_data_idx = (long) gridDim.x * threadId / (2 * n); // note: integer division here
+
+ dev_data[threadId] += sum_data[sum_data_idx];
}
/**
@@ -31,10 +162,52 @@ namespace StreamCompaction {
* @returns The number of elements remaining after compaction.
*/
int compact(int n, int *odata, const int *idata) {
+ int n_padded = 1 << ilog2ceil(n); // pad array up to the next power of two
+ int* trueFalseArray, *dev_idata, *dev_odata;
+ cudaMalloc((void**)&trueFalseArray, n_padded * sizeof(int));
+ cudaMalloc((void**)&dev_idata, n * sizeof(int));
+ cudaMalloc((void**)&dev_odata, n * sizeof(int)); // allocate for worst-case scenario
+ cudaMemcpy(dev_idata, idata, n * sizeof(int), cudaMemcpyHostToDevice);
+
+ if (n < n_padded) {
+ // Note that if we wanted to *retain* 0s in the compacted array, we'd need to pad
+ // trueFalseArray with a different value here.
+ cudaMemset(trueFalseArray + n, 0, (n_padded - n) * sizeof(int));
+ }
+
+ // Calculate the total amount of memory needed for stored_sums
+ int stored_sums_size = 0;
+ for (int i = ilog2ceil(n_padded) - ilog2ceil(MAX_BLOCK_SIZE); i >= 1; i -= ilog2ceil(MAX_BLOCK_SIZE)) {
+ stored_sums_size += 1 << i;
+ }
+
+ // temp array used to store last entry per block during upsweep. See kernScan and kernIncrement for use info.
+ int* stored_sums;
+ cudaMalloc((void**)&stored_sums, stored_sums_size * sizeof(int));
+
timer().startGpuTimer();
- // TODO
+
+ int threadsPerBlock = 128;
+ dim3 blocksPerGrid = ((n + threadsPerBlock - 1) / threadsPerBlock);
+ StreamCompaction::Common::kernMapToBoolean<<<blocksPerGrid, threadsPerBlock>>>(n, trueFalseArray, dev_idata);
+ cudaDeviceSynchronize();
+
+ scan(n_padded, trueFalseArray, stored_sums, 0); // scan happens in-place, so trueFalseArray is now scanned
+
+ StreamCompaction::Common::kernScatter<<<blocksPerGrid, threadsPerBlock>>>(n, dev_odata, dev_idata, trueFalseArray);
+ cudaDeviceSynchronize();
+
timer().endGpuTimer();
- return -1;
+
+ int compactArraySize;
+ cudaMemcpy(&compactArraySize, trueFalseArray + n - 1, sizeof(int), cudaMemcpyDeviceToHost);
+ compactArraySize += (idata[n - 1] != 0); // necessary because scan was exclusive
+
+ cudaMemcpy(odata, dev_odata, compactArraySize * sizeof(int), cudaMemcpyDeviceToHost);
+ cudaFree(dev_odata);
+ cudaFree(dev_idata);
+ cudaFree(trueFalseArray);
+ return compactArraySize;
}
}
}
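The upsweep/downsweep structure that `kernScan` runs per block can be easier to follow without the shared-memory indexing and bank-conflict offsets. The following is a minimal host-side C++ sketch of the same work-efficient (Blelloch) exclusive scan, assuming a power-of-two input as the padding above guarantees; `blellochExclusiveScan` is a hypothetical name, not part of the repo:

```cpp
#include <cassert>
#include <vector>

// Host-side sketch of the work-efficient (Blelloch) exclusive scan:
// an upsweep builds pairwise partial sums in place, the root is zeroed,
// and a downsweep swaps-and-adds the prefixes back down to the leaves.
// Assumes data.size() is a power of two, mirroring the padding above.
std::vector<int> blellochExclusiveScan(std::vector<int> data) {
    int n = (int)data.size();
    // Upsweep (reduce): pairwise sums at strides 1, 2, 4, ...
    for (int stride = 1; stride < n; stride *= 2) {
        for (int i = 2 * stride - 1; i < n; i += 2 * stride) {
            data[i] += data[i - stride];
        }
    }
    data[n - 1] = 0; // zero the root before the downsweep
    // Downsweep: left child gets the parent's value, right child gets parent + old left
    for (int stride = n / 2; stride >= 1; stride /= 2) {
        for (int i = 2 * stride - 1; i < n; i += 2 * stride) {
            int left = data[i - stride];
            data[i - stride] = data[i];
            data[i] += left;
        }
    }
    return data;
}
```

In the CUDA version, each loop iteration over `stride` is one tree level executed by the block's threads between `__syncthreads()` barriers, and the per-block totals saved into `stored_sums` make the recursion across blocks work.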
diff --git a/stream_compaction/efficient.h b/stream_compaction/efficient.h
index 803cb4f..c76a5a5 100644
--- a/stream_compaction/efficient.h
+++ b/stream_compaction/efficient.h
@@ -7,7 +7,9 @@ namespace StreamCompaction {
StreamCompaction::Common::PerformanceTimer& timer();
void scan(int n, int *odata, const int *idata);
-
+ void scan(int n_padded, int* dev_data, int* stored_sums, int offset);
+ __global__ void kernScan(int n, int numLevels, int* dev_data, int* stored_sums);
+ __global__ void kernIncrement(int n, int* dev_data, int* sum_data);
int compact(int n, int *odata, const int *idata);
}
}
diff --git a/stream_compaction/naive.cu b/stream_compaction/naive.cu
index 4308876..d28fa12 100644
--- a/stream_compaction/naive.cu
+++ b/stream_compaction/naive.cu
@@ -2,6 +2,7 @@
#include <cuda_runtime.h>
#include "common.h"
#include "naive.h"
+#include <algorithm>
namespace StreamCompaction {
namespace Naive {
@@ -11,15 +12,73 @@ namespace StreamCompaction {
static PerformanceTimer timer;
return timer;
}
- // TODO: __global__
+
+ const int MAX_BLOCK_SIZE = 256;
/**
* Performs prefix-sum (aka scan) on idata, storing the result into odata.
*/
- void scan(int n, int *odata, const int *idata) {
+
+ void scan(int n, int *odata, const int *idata) {
+ int nearestPowerOfTwo = 1 << ilog2ceil(n); // pad array up to the next power of two
+ int* outputBuf;
+ int* inputBuf;
+ cudaMalloc((void**)&outputBuf, nearestPowerOfTwo * sizeof(int));
+ cudaMalloc((void**)&inputBuf, nearestPowerOfTwo * sizeof(int));
+ cudaMemcpy(inputBuf, idata, n * sizeof(int), cudaMemcpyHostToDevice);
+
+
+ // Pad input data to be a power of two, if needed.
+ if (n < nearestPowerOfTwo) {
+ cudaMemset(inputBuf + n, 0, (nearestPowerOfTwo - n) * sizeof(int));
+ }
+
timer().startGpuTimer();
- // TODO
+
+ for (int depth = 1; depth <= ilog2ceil(nearestPowerOfTwo); ++depth) {
+ int blockSize = std::min(nearestPowerOfTwo, MAX_BLOCK_SIZE); // cap threads per block at MAX_BLOCK_SIZE (hardware limit is 1024)
+ dim3 blocksPerGrid((nearestPowerOfTwo + blockSize - 1) / blockSize); // note integer division
+ naiveScan<<<blocksPerGrid, blockSize>>>(nearestPowerOfTwo, depth, inputBuf, outputBuf);
+
+ std::swap(outputBuf, inputBuf);
+ }
+
+ // Convert inclusive scan to exclusive scan
+ int blockSize = std::min(n, MAX_BLOCK_SIZE);
+ dim3 blocksForShift((nearestPowerOfTwo + blockSize - 1) / blockSize);
+ shiftRight<<<blocksForShift, blockSize>>>(n, inputBuf, outputBuf);
+
timer().endGpuTimer();
+
+ cudaMemcpy(odata, outputBuf, n * sizeof(int), cudaMemcpyDeviceToHost); // copy back only the n requested elements; odata may not hold nearestPowerOfTwo
+ cudaFree(outputBuf);
+ cudaFree(inputBuf);
+ }
+
+ __global__ void naiveScan(int n, int depth, const int* inputBuf, int* outputBuf) {
+ int threadId = threadIdx.x + (blockDim.x * blockIdx.x);
+ if (threadId >= n) return;
+
+ int stride = 1 << (depth - 1);
+ // The top `stride` threads mirror-copy the first `stride` elements,
+ // which are left unchanged at this depth.
+ if (threadId >= n - stride) {
+ outputBuf[n - threadId - 1] = inputBuf[n - threadId - 1];
+ }
+ else {
+ outputBuf[threadId + stride] = inputBuf[threadId] + inputBuf[threadId + stride];
+ }
+
+ }
+
+ __global__ void shiftRight(int n, const int* inputBuf, int* outputBuf) {
+ int threadId = threadIdx.x + (blockDim.x * blockIdx.x);
+ if (threadId >= n) return;
+
+ if (threadId == 0) {
+ outputBuf[threadId] = 0;
+ return;
+ }
+
+ outputBuf[threadId] = inputBuf[threadId - 1];
}
}
}
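For comparison with `naiveScan` + `shiftRight` above, here is a minimal host-side C++ sketch of the naive (Hillis–Steele) approach: each pass adds the element `stride` positions to the left with doubling strides, producing an inclusive scan, which a final right-shift converts to exclusive. `naiveExclusiveScan` is a hypothetical name used only for illustration:

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Host-side sketch of the naive (Hillis-Steele) exclusive scan.
// Each pass reads out[i] = in[i - stride] + in[i] (double-buffered, like
// the ping-ponged inputBuf/outputBuf above), then the buffers swap.
std::vector<int> naiveExclusiveScan(const std::vector<int>& idata) {
    int n = (int)idata.size();
    std::vector<int> in = idata, out(n);
    for (int stride = 1; stride < n; stride *= 2) {
        for (int i = 0; i < n; ++i) {
            out[i] = (i >= stride) ? in[i - stride] + in[i] : in[i];
        }
        std::swap(in, out);
    }
    // shiftRight: inclusive scan -> exclusive scan
    out[0] = 0;
    for (int i = 1; i < n; ++i) out[i] = in[i - 1];
    return out;
}
```

This version does O(n log n) additions versus the work-efficient scan's O(n), which is the main reason the Blelloch variant wins at large array sizes.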
diff --git a/stream_compaction/naive.h b/stream_compaction/naive.h
index 37dcb06..b3c0f79 100644
--- a/stream_compaction/naive.h
+++ b/stream_compaction/naive.h
@@ -5,7 +5,8 @@
namespace StreamCompaction {
namespace Naive {
StreamCompaction::Common::PerformanceTimer& timer();
-
void scan(int n, int *odata, const int *idata);
+ __global__ void naiveScan(int n, int depth, const int* inputBuf, int* outputBuf);
+ __global__ void shiftRight(int n, const int* inputBuf, int* outputBuf);
}
}
diff --git a/stream_compaction/radix.cu b/stream_compaction/radix.cu
new file mode 100644
index 0000000..b64ee82
--- /dev/null
+++ b/stream_compaction/radix.cu
@@ -0,0 +1,105 @@
+#include <cuda.h>
+#include <cuda_runtime.h>
+#include "common.h"
+#include "radix.h"
+#include <climits>
+#include "efficient.h"
+#include <algorithm>
+
+namespace StreamCompaction {
+ namespace Radix {
+ using StreamCompaction::Common::PerformanceTimer;
+
+ PerformanceTimer& timer()
+ {
+ static PerformanceTimer timer;
+ return timer;
+ }
+
+ const int MAX_BLOCK_SIZE = 1024; // keep this as a power of 2
+
+ void sort(int n, int *odata, const int *idata) {
+ int numBitsInInt = sizeof(int) * CHAR_BIT;
+ int n_padded = 1 << ilog2ceil(n); // pad array up to the next power of two
+
+ int* dev_idata;
+ cudaMalloc((void**)&dev_idata, n * sizeof(int));
+ cudaMemcpy(dev_idata, idata, n * sizeof(int), cudaMemcpyHostToDevice);
+
+ int* dev_odata;
+ cudaMalloc((void**)&dev_odata, n * sizeof(int));
+
+ int* bit_mapped_data;
+ cudaMalloc((void**)&bit_mapped_data, n * sizeof(int));
+
+ int* scanned_bit_mapped_data;
+ cudaMalloc((void**)&scanned_bit_mapped_data, n_padded * sizeof(int));
+ cudaMemset(scanned_bit_mapped_data, 0, n_padded * sizeof(int));
+
+
+ // Calculate the total amount of memory needed for stored_sums (for the scan step)
+ int stored_sums_size = 0;
+ for (int i = ilog2ceil(n_padded) - ilog2ceil(MAX_BLOCK_SIZE); i >= 1; i -= ilog2ceil(MAX_BLOCK_SIZE)) {
+ stored_sums_size += 1 << i;
+ }
+
+ // temp array used to store last entry per block during upsweep. See kernScan and kernIncrement for use info.
+ int* stored_sums;
+ cudaMalloc((void**)&stored_sums, std::max(stored_sums_size, 1) * sizeof(int));
+
+ int blockSize = 1024;
+ dim3 blocksPerGrid((n + blockSize - 1) / blockSize);
+
+ timer().startGpuTimer();
+
+ for (int i = 0; i < numBitsInInt; ++i) {
+ kernMapBits<<<blocksPerGrid, blockSize>>>(n, bit_mapped_data, dev_idata, i);
+ cudaDeviceSynchronize();
+
+ // Scan operates in place, so we need to first copy the bit_mapped_data to scanned_bit_mapped_data so we don't lose it
+ cudaMemcpy(scanned_bit_mapped_data, bit_mapped_data, n * sizeof(int), cudaMemcpyDeviceToDevice);
+ cudaDeviceSynchronize();
+
+ StreamCompaction::Efficient::scan(n_padded, scanned_bit_mapped_data, stored_sums, 0);
+ cudaDeviceSynchronize();
+
+ kernSort<<<blocksPerGrid, blockSize>>>(n, dev_odata, dev_idata, bit_mapped_data, scanned_bit_mapped_data);
+ cudaDeviceSynchronize();
+
+ // Swap the pointers so that the output of the current iteration becomes the input of the next iteration
+ std::swap(dev_idata, dev_odata);
+ }
+
+ timer().endGpuTimer();
+
+ cudaMemcpy(odata, dev_idata, n * sizeof(int), cudaMemcpyDeviceToHost);
+
+ cudaFree(dev_idata);
+ cudaFree(dev_odata);
+ cudaFree(bit_mapped_data);
+ cudaFree(scanned_bit_mapped_data);
+ cudaFree(stored_sums);
+ }
+
+ __global__ void kernMapBits(int n, int* bit_mapped_data, const int* dev_data, int bitshift) {
+ int index = threadIdx.x + (blockIdx.x * blockDim.x);
+ if (index >= n) return;
+
+ bit_mapped_data[index] = !((dev_data[index] >> bitshift) & 1);
+ }
+
+ __global__ void kernSort(int n, int* dev_odata, const int* dev_idata, const int* bit_mapped_data, const int* scanned_bit_mapped_data) {
+ int index = threadIdx.x + (blockIdx.x * blockDim.x);
+ if (index >= n) return;
+
+ int totalFalses = scanned_bit_mapped_data[n - 1] + bit_mapped_data[n - 1];
+ int trueIndex = index - scanned_bit_mapped_data[index] + totalFalses;
+ int finalIndex = bit_mapped_data[index] ? scanned_bit_mapped_data[index] : trueIndex;
+
+ dev_odata[finalIndex] = dev_idata[index];
+ }
+
+ }
+}
\ No newline at end of file
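The split step performed by `kernMapBits` + scan + `kernSort` is the core of the radix sort above. A minimal host-side C++ sketch of the same idea, under the assumption of non-negative inputs (as in the GPU code, which iterates bit-by-bit without special-casing the sign bit); `radixSortPasses` is a hypothetical name:

```cpp
#include <cassert>
#include <vector>

// Host-side sketch of LSD radix sort via the split primitive:
// flag elements whose current bit is 0 ("falses"), exclusive-scan the
// flags to get each false element's destination, and scatter the trues
// after totalFalses. Each pass is stable, so low bits stay ordered.
std::vector<int> radixSortPasses(std::vector<int> data, int numBits) {
    int n = (int)data.size();
    for (int bit = 0; bit < numBits; ++bit) {
        std::vector<int> flags(n), scanned(n), out(n);
        for (int i = 0; i < n; ++i) flags[i] = !((data[i] >> bit) & 1);
        scanned[0] = 0; // exclusive scan of the 0-bit flags
        for (int i = 1; i < n; ++i) scanned[i] = scanned[i - 1] + flags[i - 1];
        int totalFalses = scanned[n - 1] + flags[n - 1];
        for (int i = 0; i < n; ++i) {
            // falses compact to the front; trues go after them, in order
            int dst = flags[i] ? scanned[i] : i - scanned[i] + totalFalses;
            out[dst] = data[i];
        }
        data = out; // this pass's output feeds the next pass
    }
    return data;
}
```

The `dst` expression here is exactly the `trueIndex`/`finalIndex` computation in `kernSort`, with the exclusive scan supplying `scanned`.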
diff --git a/stream_compaction/radix.h b/stream_compaction/radix.h
new file mode 100644
index 0000000..b20f00f
--- /dev/null
+++ b/stream_compaction/radix.h
@@ -0,0 +1,13 @@
+#pragma once
+
+#include "common.h"
+
+namespace StreamCompaction {
+ namespace Radix {
+ StreamCompaction::Common::PerformanceTimer& timer();
+
+ void sort(int n, int *odata, const int *idata);
+ __global__ void kernMapBits(int n, int *odata, const int *idata, int bitNumber);
+ __global__ void kernSort(int n, int *dev_odata, const int* dev_idata, const int *bit_mapped_data, const int *scanned_bit_mapped_data);
+ }
+}
diff --git a/stream_compaction/thrust.cu b/stream_compaction/thrust.cu
index 1def45e..521ef0b 100644
--- a/stream_compaction/thrust.cu
+++ b/stream_compaction/thrust.cu
@@ -18,11 +18,16 @@ namespace StreamCompaction {
* Performs prefix-sum (aka scan) on idata, storing the result into odata.
*/
void scan(int n, int *odata, const int *idata) {
+ thrust::device_vector<int> dv_in(idata, idata + n);
+ thrust::device_vector<int> dv_out(n);
+
timer().startGpuTimer();
- // TODO use `thrust::exclusive_scan`
- // example: for device_vectors dv_in and dv_out:
- // thrust::exclusive_scan(dv_in.begin(), dv_in.end(), dv_out.begin());
+
+ thrust::exclusive_scan(dv_in.begin(), dv_in.end(), dv_out.begin());
+
timer().endGpuTimer();
+
+ thrust::copy(dv_out.begin(), dv_out.end(), odata);
}
}
}
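As a side note, the semantics of `thrust::exclusive_scan` used above match C++17's `std::exclusive_scan`, which makes a convenient CPU reference for validating the GPU results. A small sketch, assuming a C++17 toolchain (`referenceScan` is a hypothetical helper, not part of the repo):

```cpp
#include <cassert>
#include <numeric>
#include <vector>

// CPU reference with the same semantics as thrust::exclusive_scan:
// element i of the output is the sum of input elements [0, i).
std::vector<int> referenceScan(const std::vector<int>& idata) {
    std::vector<int> odata(idata.size());
    std::exclusive_scan(idata.begin(), idata.end(), odata.begin(), 0);
    return odata;
}
```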