diff --git a/README.md b/README.md
index 0e38ddb..f40bb26 100644
--- a/README.md
+++ b/README.md
@@ -3,12 +3,196 @@ CUDA Stream Compaction
**University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 2**
-* (TODO) YOUR NAME HERE
- * (TODO) [LinkedIn](), [personal website](), [twitter](), etc.
-* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab)
+* Yue Zhang
+ * [LinkedIn](https://www.linkedin.com/in/yuezhang027/), [personal website](https://yuezhanggame.com/).
+* Tested on: Windows 11, i9-13900H @ 2.60GHz 32GB, NVIDIA GeForce RTX 4070 Laptop 8GB (Personal Laptop)
+* Compute capability: 8.9
-### (TODO: Your README)
+## Description
+This project explores various stream compaction and exclusive scan algorithms capable of handling arbitrary-length input arrays. The algorithms are derived from the [slides on Parallel Algorithms](https://docs.google.com/presentation/d/1ETVONA7QDM-WqsEj4qVOGD6Kura5I6E9yqH-7krnwZ0/edit#slide=id.p126) and GPU Gems 3, Chapter 39 - [Parallel Prefix Sum (Scan) with CUDA](https://developer.nvidia.com/gpugems/GPUGems3/gpugems3_ch39.html). Beyond the basic implementations, the project enhances performance through several optimization techniques and includes a detailed analysis of the timing results.
-Include analysis, etc. (Remember, this is public, so don't put
-anything here that you don't want to share with the world.)
+Here is the list of features implemented in this project:
+1. Exclusive scan and stream compaction implemented on the CPU.
+2. Naive exclusive scan implemented on the GPU.
+3. Work-efficient exclusive scan and stream compaction implemented on the GPU.
+4. [Extra Credit] Optimization of the work-efficient exclusive scan by tuning blockSize, gridSize, and the memory access pattern.
+5. Thrust exclusive scan wrapper on the GPU.
+6. [Extra Credit] Radix sort on the GPU.
+7. [Extra Credit] Optimized work-efficient exclusive scan using shared memory.
+
+
+## Results
+Here are screenshots of the performance of all algorithms with array sizes n = 1 << 8 (base case) and n = 1 << 28 (stress test). The results include an extra test for the work-efficient scan with shared memory:
+
+
+Screenshot 1: Test result with array size = 1 << 8
+
+![](res/res_2^8.png)
+![](res/res_2^8_compact.png)
+![](res/res_2^8_radix.png)
+
+Screenshot 2: Test result with array size = 1 << 28
+
+![](res/res_2^28.png)
+![](res/res_2^28_compact.png)
+![](res/res_2^28_radix.png)
+
+There is also an Nsight analysis of the time and memory operations for each method in the two tests, since each test calls the scan() function directly:
+
+![](res/nsight.png)
+
+Note that for work-efficient compact, some device allocation and free time is still included in the measured result.
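+
+For reference, the timing pattern throughout the project keeps device allocation outside the timed region wherever possible; here is a simplified sketch of the structure used in `efficient.cu`'s scan() (`kernScan` stands in for the real kernel launches):
+
+```
+cudaMalloc((void**)&x, size);            // not timed
+cudaMemcpy(x, idata, size, cudaMemcpyHostToDevice);
+
+timer().startGpuTimer();
+kernScan<<<gridSize, blockSize>>>(n, x); // only the kernel work is timed
+timer().endGpuTimer();
+
+cudaMemcpy(odata, x, size, cudaMemcpyDeviceToHost);
+cudaFree(x);                             // not timed
+```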
+
+## Analysis
+
+* Roughly optimize the block sizes of each of your implementations for minimal
+ run time on your GPU.
+
+ I optimize the block size according to the CUDA device properties to maximize the utilization of each warp and distribute blocks uniformly across all SMs. In the work-efficient algorithm especially, the `threadCount` at each level of the tree differs, which requires dynamic block-size assignment. I rely on three device properties in the naive and work-efficient GPU exclusive scans to determine `blockSize` and `gridSize`:
+ * `warpSize`: The block size is set to a multiple of the warp size to ensure maximum utilization of each warp and avoid warp divergence.
+ * `maxThreadsPerBlock`: The block size must not exceed this limit.
+ * `multiProcessorCount`: After computing the naive gridSize as `threadCount` / `blockSize`, the grid size is rounded up to a multiple of the number of SMs to balance the workload across all SMs.
+
+ Here is a typical grid and block size setting code:
+ ```
+ cudaDeviceProp prop;
+ cudaGetDeviceProperties(&prop, 0);
+ int minBlockSize = prop.warpSize,
+     maxBlockSize = prop.maxThreadsPerBlock, sms = prop.multiProcessorCount;
+
+ int curBlockSize = std::max(minBlockSize, std::min(64/*start block size*/,
+     maxBlockSize));
+ int curGridSize = (threadCount + curBlockSize - 1) / curBlockSize;
+ curGridSize = std::ceil(curGridSize / (float)sms) * sms;
+
+ kernFunc<<<curGridSize, curBlockSize>>>(threadCount, ...);
+ ```
+
+ Besides, in the shared-memory implementation of the work-efficient algorithm, I also take `sharedMemPerBlock` into account when dividing a large array among blocks. The algorithm launches a grid with `<<<gridSize, blockSize, B * sizeof(int)>>>`, assigning `B * sizeof(int)` bytes of shared memory to each block; B must be less than `sharedMemPerBlock` / sizeof(int).
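+
+ As a minimal sketch of that bound (the helper `maxElementsPerBlock` is hypothetical; the actual code hard-codes `ELEMENTS_PER_BLOCK = 128` and scans 256 elements per block):
+ ```
+ // Largest power-of-two element count per block whose shared-memory
+ // footprint still fits under the per-block limit.
+ int maxElementsPerBlock(const cudaDeviceProp& prop) {
+     int maxInts = (int)(prop.sharedMemPerBlock / sizeof(int));
+     int b = 1;
+     while (b * 2 <= maxInts) {  // power of two keeps the sweep tree simple
+         b *= 2;
+     }
+     return b;
+ }
+ ```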
+
+* Compare all of these GPU Scan implementations (Naive, Work-Efficient, and
+ Thrust) to the serial CPU version of Scan. Plot a graph of the comparison
+ (with array size on the independent axis).
+
+ Here is a graph of scan time for each method's implementation over array sizes from 1 << 16 to 1 << 28. This chart covers only the power-of-two tests.
+
+ ![](res/time_arraysize.png)
+
+ This chart shows the overall trend and a rough comparison of each method's behavior. At array size 1 << 28, the naive GPU scan performs slightly better than the CPU scan, while the first work-efficient GPU scan is much faster than both. The optimized work-efficient GPU scan performs almost as well as the Thrust scan, thanks to its use of shared memory and avoidance of bank conflicts.
+
+ The next graph uses a log10 scale for time to show the comparison of method performance at each array size more clearly.
+
+ ![](res/time_arraysize_log10.png)
+
+ The algorithms' relative performance varies with array size. The baseline CPU scan grows linearly on the log10 plot (whose x-axis is the array-size exponent), consistent with its O(n) running time. In contrast, the GPU scans using global memory (the naive and work-efficient scans) suffer from inefficient memory access at small sizes; as the array size increases, their parallelism begins to pay off. At large sizes the work-efficient scan overtakes the naive scan, since it performs O(n) operations rather than the naive scan's O(n log n).
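+
+ For reference, the standard operation counts behind this crossover (from the GPU Gems 3 chapter, not new measurements):
+
+ $$\text{naive: } \sum_{d=1}^{\log_2 n}\left(n - 2^{d-1}\right) = O(n \log n)\ \text{adds}, \qquad \text{work-efficient: } \approx 2(n - 1) = O(n)\ \text{adds}.$$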
+
+* To guess at what might be happening inside the Thrust implementation (e.g.
+ allocation, memory copy), take a look at the Nsight timeline for its
+ execution. Your analysis here doesn't have to be detailed, since you aren't
+ even looking at the code for the implementation.
+
+ Here is the Nsight timeline screenshot of the Thrust scan:
+ ![](res/nsight_thrust.png)
+
+ It may operate similarly to the shared-memory version of the work-efficient scan:
+ 1. `__uninitialized_fill`: First, it fills the allocated device memory.
+ 2. `ScanTileState`: Then, it scans tile states (analogous to the recursive call on the sums array in the shared-memory version) to store the intermediate state of the scan operation, much like my `int** sumArr`.
+ 3. `DeviceScanPolicy`: Finally, it performs the prefix scan within each tile, as we do within each block.
+
+* Write a brief explanation of the phenomena you see here.
+
+ For the naive scan and the work-efficient scan, global memory access is likely the performance bottleneck. The work-efficient scan accesses memory more efficiently than the naive one, since each thread handles two elements at a time and threads never touch the same addresses. Also, threads take longer to execute as the grid index increases, which may be caused by an unbalanced workload across SMs.
+
+ ![](res/nsight_naive_workefficient.png)
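+
+ The "one thread, two elements" pattern is essentially the `kernUpSweep` kernel from this diff: each thread owns a disjoint pair of elements `offset` apart, so no two threads touch the same address in a pass.
+ ```
+ __global__ void kernUpSweep(int threadCount, int offset, int* x) {
+     int index = threadIdx.x + (blockIdx.x * blockDim.x);
+     if (index >= threadCount) {
+         return;
+     }
+     int k = index * offset;  // start of this thread's private pair
+     x[k + offset - 1] += x[k + (offset >> 1) - 1];
+ }
+ ```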
+
+ The optimized work-efficient scan moves the working set into shared memory, avoids bank conflicts by padding indices with an offset (which needs shared memory for 2 * n elements), and splits the array into chunks computed by separate blocks. The remaining bottlenecks are likely load balancing between threads, compaction of the active threads, and the access pattern on the sums array (`sumArr`). This yields roughly a 20% improvement in measured scan time.
+
+ ![](res/nsight_workefficient_shared.png)
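+
+ The padding trick is the one from GPU Gems 3: every shared-memory index is shifted by a small per-index offset so that the strided tree accesses land in different banks. A minimal sketch using this repo's constants (16 banks, `LOG_NUM_BANKS = 4`):
+ ```
+ #define LOG_NUM_BANKS 4
+ #define CONFLICT_FREE_OFFSET(i) ((i) >> LOG_NUM_BANKS)
+
+ // Loading one thread's pair into padded shared memory (as in kernScanShared):
+ int ai = tid, bi = tid + ELEMENTS_PER_BLOCK;
+ temp[ai + CONFLICT_FREE_OFFSET(ai)] = idata[index];
+ temp[bi + CONFLICT_FREE_OFFSET(bi)] = idata[index + ELEMENTS_PER_BLOCK];
+ ```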
+
+* Paste the output of the test program into a triple-backtick block in your
+ README.
+
+ Here is another run with array size = 1 << 28, including the extra tests for the work-efficient scan with shared memory.
+ ```
+ ****************
+ ** SCAN TESTS **
+ ****************
+ [ 27 35 35 40 20 6 13 34 36 15 47 4 41 ... 25 0 ]
+ ==== cpu scan, power-of-two ====
+ elapsed time: 400.187ms (std::chrono Measured)
+ [ 0 27 62 97 137 157 163 176 210 246 261 308 312 ... -2015767044 -2015767019 ]
+ ==== cpu scan, non-power-of-two ====
+ elapsed time: 406.071ms (std::chrono Measured)
+ [ 0 27 62 97 137 157 163 176 210 246 261 308 312 ... -2015767115 -2015767102 ]
+ passed
+ ==== naive scan, power-of-two ====
+ elapsed time: 280.021ms (CUDA Measured)
+ passed
+ ==== naive scan, non-power-of-two ====
+ elapsed time: 280.801ms (CUDA Measured)
+ passed
+ ==== work-efficient scan, power-of-two ====
+ elapsed time: 94.5782ms (CUDA Measured)
+ passed
+ ==== work-efficient scan, non-power-of-two ====
+ elapsed time: 94.4707ms (CUDA Measured)
+ passed
+ ==== thrust scan, power-of-two ====
+ elapsed time: 11.0101ms (CUDA Measured)
+ passed
+ ==== thrust scan, non-power-of-two ====
+ elapsed time: 10.5656ms (CUDA Measured)
+ passed
+ ==== work-efficient scan with shared memory, power-of-two ====
+ elapsed time: 18.8852ms (CUDA Measured)
+ passed
+ ==== work-efficient scan with shared memory, non-power-of-two ====
+ elapsed time: 19.1417ms (CUDA Measured)
+ passed
+
+ *****************************
+ ** STREAM COMPACTION TESTS **
+ *****************************
+ [ 2 3 2 2 1 0 3 2 2 1 0 2 3 ... 0 0 ]
+ ==== cpu compact without scan, power-of-two ====
+ elapsed time: 518.492ms (std::chrono Measured)
+ [ 2 3 2 2 1 3 2 2 1 2 3 3 3 ... 3 3 ]
+ passed
+ ==== cpu compact without scan, non-power-of-two ====
+ elapsed time: 527.604ms (std::chrono Measured)
+ [ 2 3 2 2 1 3 2 2 1 2 3 3 3 ... 2 3 ]
+ passed
+ ==== cpu compact with scan ====
+ elapsed time: 1249.16ms (std::chrono Measured)
+ [ 2 3 2 2 1 3 2 2 1 2 3 3 3 ... 3 3 ]
+ passed
+ ==== work-efficient compact, power-of-two ====
+ elapsed time: 141.99ms (CUDA Measured)
+ passed
+ ==== work-efficient compact, non-power-of-two ====
+ elapsed time: 117.638ms (CUDA Measured)
+ passed
+
+ **********************
+ ** RADIX SORT TESTS **
+ **********************
+ [ 45 12 33 20 35 10 39 37 19 7 11 9 15 ... 37 0 ]
+ ==== cpu sort, power-of-two ====
+ elapsed time: 4139.96ms (std::chrono)
+ passed
+ ==== radix sort, power-of-two ====
+ elapsed time: 337.48ms (CUDA Measured)
+ passed
+ ==== cpu sort, non-power-of-two ====
+ elapsed time: 4099.12ms (std::chrono)
+ [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 49 49 ]
+ passed
+ ==== radix sort, non-power-of-two ====
+ elapsed time: 337.146ms (CUDA Measured)
+ [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 49 49 ]
+ passed
+ ```
diff --git a/res/nsight.png b/res/nsight.png
new file mode 100644
index 0000000..024dc01
Binary files /dev/null and b/res/nsight.png differ
diff --git a/res/nsight_naive_workefficient.png b/res/nsight_naive_workefficient.png
new file mode 100644
index 0000000..c4ca99b
Binary files /dev/null and b/res/nsight_naive_workefficient.png differ
diff --git a/res/nsight_thrust.png b/res/nsight_thrust.png
new file mode 100644
index 0000000..eec229b
Binary files /dev/null and b/res/nsight_thrust.png differ
diff --git a/res/nsight_workefficient_shared.png b/res/nsight_workefficient_shared.png
new file mode 100644
index 0000000..c22c7ea
Binary files /dev/null and b/res/nsight_workefficient_shared.png differ
diff --git a/res/res_2^28.png b/res/res_2^28.png
new file mode 100644
index 0000000..3406525
Binary files /dev/null and b/res/res_2^28.png differ
diff --git a/res/res_2^28_compact.png b/res/res_2^28_compact.png
new file mode 100644
index 0000000..225b334
Binary files /dev/null and b/res/res_2^28_compact.png differ
diff --git a/res/res_2^28_radix.png b/res/res_2^28_radix.png
new file mode 100644
index 0000000..6c4b255
Binary files /dev/null and b/res/res_2^28_radix.png differ
diff --git a/res/res_2^8.png b/res/res_2^8.png
new file mode 100644
index 0000000..aba3492
Binary files /dev/null and b/res/res_2^8.png differ
diff --git a/res/res_2^8_compact.png b/res/res_2^8_compact.png
new file mode 100644
index 0000000..e37875f
Binary files /dev/null and b/res/res_2^8_compact.png differ
diff --git a/res/res_2^8_radix.png b/res/res_2^8_radix.png
new file mode 100644
index 0000000..f9e0b77
Binary files /dev/null and b/res/res_2^8_radix.png differ
diff --git a/res/time_arraysize.png b/res/time_arraysize.png
new file mode 100644
index 0000000..9b49788
Binary files /dev/null and b/res/time_arraysize.png differ
diff --git a/res/time_arraysize_log10.png b/res/time_arraysize_log10.png
new file mode 100644
index 0000000..e4ae443
Binary files /dev/null and b/res/time_arraysize_log10.png differ
diff --git a/src/main.cpp b/src/main.cpp
index 896ac2b..f4eea66 100644
--- a/src/main.cpp
+++ b/src/main.cpp
@@ -11,9 +11,10 @@
 #include <stream_compaction/naive.h>
 #include <stream_compaction/efficient.h>
 #include <stream_compaction/thrust.h>
+#include <stream_compaction/radixsort.h>
#include "testing_helpers.hpp"
-const int SIZE = 1 << 8; // feel free to change the size of array
+const int SIZE = 1 << 28; // feel free to change the size of array
const int NPOT = SIZE - 3; // Non-Power-Of-Two
int *a = new int[SIZE];
int *b = new int[SIZE];
@@ -95,6 +96,20 @@ int main(int argc, char* argv[]) {
//printArray(NPOT, c, true);
printCmpResult(NPOT, b, c);
+ zeroArray(SIZE, c);
+ printDesc("work-efficient scan with shared memory, power-of-two");
+ StreamCompaction::Efficient::scanShared(SIZE, c, a);
+ printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
+ //printArray(SIZE, c, true);
+ printCmpResult(SIZE, b, c);
+
+ zeroArray(SIZE, c);
+ printDesc("work-efficient scan with shared memory, non-power-of-two");
+ StreamCompaction::Efficient::scanShared(NPOT, c, a);
+ printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
+ //printArray(NPOT, c, true);
+ printCmpResult(NPOT, b, c);
+
printf("\n");
printf("*****************************\n");
printf("** STREAM COMPACTION TESTS **\n");
@@ -147,6 +162,42 @@ int main(int argc, char* argv[]) {
//printArray(count, c, true);
printCmpLenResult(count, expectedNPOT, b, c);
+ printf("\n");
+ printf("**********************\n");
+ printf("** RADIX SORT TESTS **\n");
+ printf("**********************\n");
+
+ genArray(SIZE - 1, a, 50);
+ printArray(SIZE, a, true);
+
+ zeroArray(SIZE, b);
+ printDesc("cpu sort, power-of-two");
+ StreamCompaction::CPU::sort(SIZE, b, a);
+ printElapsedTime(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation(), "(std::chrono)");
+ //printArray(SIZE, b, true);
+ printCmpResult(SIZE, b, b);
+
+ zeroArray(SIZE, c);
+ printDesc("radix sort, power-of-two");
+ StreamCompaction::RadixSort::sort(SIZE, c, a);
+ printElapsedTime(StreamCompaction::RadixSort::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
+ //printArray(SIZE, c, true);
+ printCmpResult(SIZE, b, c);
+
+ zeroArray(NPOT, b);
+ printDesc("cpu sort, non-power-of-two");
+ StreamCompaction::CPU::sort(NPOT, b, a);
+ printElapsedTime(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation(), "(std::chrono)");
+ printArray(NPOT, b, true);
+ printCmpResult(NPOT, b, b);
+
+ zeroArray(NPOT, c);
+ printDesc("radix sort, non-power-of-two");
+ StreamCompaction::RadixSort::sort(NPOT, c, a);
+ printElapsedTime(StreamCompaction::RadixSort::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
+ printArray(NPOT, c, true);
+ printCmpResult(NPOT, b, c);
+
system("pause"); // stop Win32 console from closing on exit
delete[] a;
delete[] b;
diff --git a/stream_compaction/CMakeLists.txt b/stream_compaction/CMakeLists.txt
index 567795b..d35f904 100644
--- a/stream_compaction/CMakeLists.txt
+++ b/stream_compaction/CMakeLists.txt
@@ -4,6 +4,7 @@ set(headers
"naive.h"
"efficient.h"
"thrust.h"
+ "radixsort.h"
)
set(sources
@@ -12,6 +13,7 @@ set(sources
"naive.cu"
"efficient.cu"
"thrust.cu"
+ "radixsort.cu"
)
list(SORT headers)
diff --git a/stream_compaction/common.cu b/stream_compaction/common.cu
index 2ed6d63..275a677 100644
--- a/stream_compaction/common.cu
+++ b/stream_compaction/common.cu
@@ -24,6 +24,12 @@ namespace StreamCompaction {
*/
__global__ void kernMapToBoolean(int n, int *bools, const int *idata) {
// TODO
+ int index = threadIdx.x + (blockIdx.x * blockDim.x);
+ if (index >= n) {
+ return;
+ }
+
+ bools[index] = idata[index] == 0 ? 0 : 1;
}
/**
@@ -33,6 +39,14 @@ namespace StreamCompaction {
__global__ void kernScatter(int n, int *odata,
const int *idata, const int *bools, const int *indices) {
// TODO
+ int index = threadIdx.x + (blockIdx.x * blockDim.x);
+ if (index >= n) {
+ return;
+ }
+
+ if (bools[index] == 1) {
+ odata[indices[index]] = idata[index];
+ }
}
}
diff --git a/stream_compaction/cpu.cu b/stream_compaction/cpu.cu
index 719fa11..eb3886f 100644
--- a/stream_compaction/cpu.cu
+++ b/stream_compaction/cpu.cu
@@ -12,6 +12,13 @@ namespace StreamCompaction {
return timer;
}
+ void scanImpl(int n, int* odata, const int* idata) {
+ odata[0] = 0;
+ for (int i = 0; i < n - 1; ++i) {
+ odata[i + 1] = odata[i] + idata[i];
+ }
+ }
+
/**
* CPU scan (prefix sum).
* For performance analysis, this is supposed to be a simple for loop.
@@ -19,10 +26,23 @@ namespace StreamCompaction {
*/
void scan(int n, int *odata, const int *idata) {
timer().startCpuTimer();
+
// TODO
+ scanImpl(n, odata, idata);
+
timer().endCpuTimer();
}
+ void sort(int n, int* odata, const int* idata) {
+ memcpy(odata, idata, n * sizeof(int));
+
+ timer().startCpuTimer();
+
+ std::sort(odata, odata + n);
+
+ timer().endCpuTimer();
+ }
+
/**
* CPU stream compaction without using the scan function.
*
@@ -30,9 +50,17 @@ namespace StreamCompaction {
*/
int compactWithoutScan(int n, int *odata, const int *idata) {
timer().startCpuTimer();
+
// TODO
+ int index = 0;
+ for (int i = 0; i < n; ++i) {
+ if (idata[i] != 0) {
+ odata[index++] = idata[i];
+ }
+ }
+
timer().endCpuTimer();
- return -1;
+ return index;
}
/**
@@ -41,10 +69,34 @@ namespace StreamCompaction {
* @returns the number of elements remaining after compaction.
*/
int compactWithScan(int n, int *odata, const int *idata) {
- timer().startCpuTimer();
// TODO
+ int* scanArr = new int[n];
+ int* tempArr = new int[n];
+
+ timer().startCpuTimer();
+
+ // Step 1: Compute temporary array of 0s and 1s
+ for (int i = 0; i < n; ++i) {
+ tempArr[i] = (idata[i] != 0) ? 1 : 0;
+ }
+
+ // Step2: Run exclusive scan on tempArr
+ scanImpl(n, scanArr, tempArr);
+
+ // Step 3: Scatter
+ for (int i = 0; i < n; ++i) {
+ if (tempArr[i] == 1) {
+ odata[scanArr[i]] = idata[i];
+ }
+ }
+
+ int count = scanArr[n - 1] + tempArr[n - 1];
+
+ delete[] tempArr;
+ delete[] scanArr;
+
timer().endCpuTimer();
- return -1;
+ return count;
}
}
}
diff --git a/stream_compaction/cpu.h b/stream_compaction/cpu.h
index 873c047..8a6efa9 100644
--- a/stream_compaction/cpu.h
+++ b/stream_compaction/cpu.h
@@ -8,6 +8,8 @@ namespace StreamCompaction {
void scan(int n, int *odata, const int *idata);
+ void sort(int n, int* odata, const int* idata);
+
int compactWithoutScan(int n, int *odata, const int *idata);
int compactWithScan(int n, int *odata, const int *idata);
diff --git a/stream_compaction/efficient.cu b/stream_compaction/efficient.cu
index 2db346e..501d4b0 100644
--- a/stream_compaction/efficient.cu
+++ b/stream_compaction/efficient.cu
@@ -3,6 +3,18 @@
#include "common.h"
#include "efficient.h"
+#define NUM_BANKS 16
+#define LOG_NUM_BANKS 4
+#define CONFLICT_FREE_OFFSET(n) ((n) >> LOG_NUM_BANKS) // GPU Gems 3 ch. 39 errata: shift by LOG_NUM_BANKS, not NUM_BANKS
+
+#define ELEMENTS_PER_BLOCK 128
+#define TWICE_ELEMENTS_PER_BLOCK 256
+
+#define checkCUDAErrorWithLine(msg) checkCUDAError(msg, __LINE__)
+
+// sum aux arrays
+int** sumArr;
+
namespace StreamCompaction {
namespace Efficient {
using StreamCompaction::Common::PerformanceTimer;
@@ -12,13 +24,209 @@ namespace StreamCompaction {
return timer;
}
+ __global__ void kernUpSweep(int n, int offset, int* x) {
+ int index = threadIdx.x + (blockIdx.x * blockDim.x);
+ if (index >= n) {
+ return;
+ }
+
+ int k = index * offset;
+ x[k + offset - 1] += x[k + (offset >> 1) - 1];
+ }
+
+ __global__ void kernDownSweep(int n, int offset, int* x) {
+ int index = threadIdx.x + (blockIdx.x * blockDim.x);
+ if (index >= n) {
+ return;
+ }
+
+ int left = (1 + (index << 1)) * offset - 1,
+ right = left + offset;
+
+ int t = x[left];
+ x[left] = x[right];
+ x[right] += t;
+ }
+
+ __global__ void kernScanShared(int n, int* sums, int* odata, int* idata) {
+ extern __shared__ int temp[];
+
+ int curBlockSize = n < TWICE_ELEMENTS_PER_BLOCK ? n : TWICE_ELEMENTS_PER_BLOCK;
+
+ int tid = threadIdx.x,
+ index = tid + blockIdx.x * TWICE_ELEMENTS_PER_BLOCK;
+
+ int offset = 1, ai = tid, bi = tid + ELEMENTS_PER_BLOCK,
+ bankOffsetA = CONFLICT_FREE_OFFSET(ai), bankOffsetB = CONFLICT_FREE_OFFSET(bi);
+
+ temp[ai + bankOffsetA] = index < n ? idata[index] : 0;
+ temp[bi + bankOffsetB] = (index < n - ELEMENTS_PER_BLOCK) ? idata[index + ELEMENTS_PER_BLOCK] : 0;
+
+ // up sweep
+ for (int d = curBlockSize >> 1; d > 0; d >>= 1) {
+ __syncthreads();
+
+ if (tid < d) {
+ int ai = ((tid << 1) + 1) * offset - 1;
+ int bi = ai + offset;
+
+ ai += CONFLICT_FREE_OFFSET(ai);
+ bi += CONFLICT_FREE_OFFSET(bi);
+
+ temp[bi] += temp[ai];
+ }
+
+ offset <<= 1;
+ }
+
+ if (tid == 0) {
+ sums[blockIdx.x] = temp[curBlockSize - 1 + CONFLICT_FREE_OFFSET(curBlockSize - 1)];
+ temp[curBlockSize - 1 + CONFLICT_FREE_OFFSET(curBlockSize - 1)] = 0;
+ }
+
+ // down sweep
+ for (int d = 1; d < curBlockSize; d <<= 1) {
+ offset >>= 1;
+ __syncthreads();
+
+ if (tid < d) {
+ int ai = ((tid << 1) + 1) * offset - 1;
+ int bi = ai + offset;
+
+ ai += CONFLICT_FREE_OFFSET(ai);
+ bi += CONFLICT_FREE_OFFSET(bi);
+
+ int t = temp[ai];
+ temp[ai] = temp[bi];
+ temp[bi] += t;
+ }
+ }
+
+ __syncthreads();
+
+ // Bounds-checked writes: threads past the end must not write outside odata.
+ if (index < n) {
+ odata[index] = temp[ai + bankOffsetA];
+ }
+ if (index < n - ELEMENTS_PER_BLOCK) {
+ odata[index + ELEMENTS_PER_BLOCK] = temp[bi + bankOffsetB];
+ }
+ }
+
+ __global__ void kernAdd(int n, int* odata, const int* idata) {
+ int index = blockIdx.x * TWICE_ELEMENTS_PER_BLOCK + threadIdx.x;
+ if (index >= n) {
+ return;
+ }
+
+ odata[index] += idata[blockIdx.x];
+ if (index < n - ELEMENTS_PER_BLOCK) {
+ odata[index + ELEMENTS_PER_BLOCK] += idata[blockIdx.x];
+ }
+ }
+
+ void createSumArr(int depth, int next_power_of_two) {
+ sumArr = (int**)malloc(sizeof(int*) * depth);
+ int blockCnt = (int)ceil((float)next_power_of_two / (float)TWICE_ELEMENTS_PER_BLOCK);
+ for (int i = 0; i < depth; ++i) {
+ cudaMalloc((void**)&(sumArr[i]), blockCnt * sizeof(int));
+ blockCnt = (int)ceil((float)blockCnt / (float)TWICE_ELEMENTS_PER_BLOCK);
+ }
+ }
+
+ void freeSumArr(int depth) {
+ for (int i = 0; i < depth; ++i) {
+ cudaFree(sumArr[i]);
+ }
+ free(sumArr);
+ }
+
+ void scanSharedHelper(int n, int depth, int* odata, int* idata) {
+ int blockCnt = (int)ceil((float)n / (float)TWICE_ELEMENTS_PER_BLOCK);
+
+ int sharedSize = (TWICE_ELEMENTS_PER_BLOCK + CONFLICT_FREE_OFFSET(TWICE_ELEMENTS_PER_BLOCK - 1)) * sizeof(int);
+ kernScanShared<<<blockCnt, ELEMENTS_PER_BLOCK, sharedSize>>>(n, sumArr[depth], odata, idata);
+ if (blockCnt > 1) {
+ scanSharedHelper(blockCnt, depth + 1, sumArr[depth], sumArr[depth]);
+ kernAdd<<<blockCnt, ELEMENTS_PER_BLOCK>>>(n, odata, sumArr[depth]);
+ }
+ }
+
+ void scanShared(int n, int* odata, const int* idata) {
+ int max_d = ilog2ceil(n);
+ int next_power_of_two = 1 << max_d;
+
+ int *in, *out;
+ cudaMalloc((void**)&in, next_power_of_two * sizeof(int));
+ cudaMalloc((void**)&out, next_power_of_two * sizeof(int));
+ cudaMemset(in, 0, next_power_of_two * sizeof(int)); // zero the padded tail
+ cudaMemcpy(in, idata, n * sizeof(int), cudaMemcpyHostToDevice);
+
+ createSumArr(max_d, next_power_of_two);
+
+ timer().startGpuTimer();
+
+ scanSharedHelper(next_power_of_two, 0, out, in);
+
+ timer().endGpuTimer();
+
+ cudaMemcpy(odata, out, n * sizeof(int), cudaMemcpyDeviceToHost);
+ cudaFree(in);
+ cudaFree(out);
+
+ freeSumArr(max_d);
+ }
+
/**
* Performs prefix-sum (aka scan) on idata, storing the result into odata.
*/
void scan(int n, int *odata, const int *idata) {
+ // memory operation
+ int max_d = ilog2ceil(n);
+ int next_power_of_two = 1 << max_d;
+
+ int* x;
+ cudaMalloc((void**)&x, next_power_of_two * sizeof(int));
+ cudaMemcpy(x, idata, n * sizeof(int), cudaMemcpyHostToDevice);
+
+ // optimize block size
+ // block size should be multiple of warp size and less than max threads per block
+ // grid size should be multiple of number of SMs
+ cudaDeviceProp prop;
+ cudaGetDeviceProperties(&prop, 0);
+
+ int minBlockSize = prop.warpSize, maxBlockSize = prop.maxThreadsPerBlock, sms = prop.multiProcessorCount;
+
timer().startGpuTimer();
+
// TODO
+ int step = 1, threadCount = next_power_of_two;
+
+ // up-sweep
+ for (int d = 0; d < max_d; ++d) {
+ step <<= 1;
+ threadCount >>= 1;
+
+ int curBlockSize = std::max(minBlockSize, std::min(threadCount, maxBlockSize));
+ int curGridSize = (threadCount + curBlockSize - 1) / curBlockSize;
+ curGridSize = std::ceil(curGridSize / (float)sms) * sms;
+
+ kernUpSweep<<<curGridSize, curBlockSize>>>(threadCount, step, x);
+ }
+
+ // down-sweep
+ cudaMemset(x + next_power_of_two - 1, 0, sizeof(int));
+
+ for (int d = max_d - 1; d >= 0; --d) {
+ step >>= 1;
+
+ // Round threadCount / sms up to the next power of two so the grid stays near a multiple of the SM count.
+ int smMultiple = 1 << ilog2ceil((threadCount + sms - 1) / sms);
+ int curBlockSize = std::max(minBlockSize, std::min(threadCount, std::min(smMultiple, maxBlockSize)));
+ int curGridSize = (threadCount + curBlockSize - 1) / curBlockSize;
+ curGridSize = std::ceil(curGridSize / (float)sms) * sms;
+
+ kernDownSweep<<<curGridSize, curBlockSize>>>(threadCount, step, x);
+ threadCount <<= 1;
+ }
+
timer().endGpuTimer();
+
+ cudaMemcpy(odata, x, n * sizeof(int), cudaMemcpyDeviceToHost);
+
+ cudaFree(x);
}
/**
@@ -31,10 +239,46 @@ namespace StreamCompaction {
* @returns The number of elements remaining after compaction.
*/
int compact(int n, int *odata, const int *idata) {
- timer().startGpuTimer();
+ int* bools, *scanArr, *out, *in;
+ cudaMalloc((void**)&bools, n * sizeof(int));
+ cudaMalloc((void**)&scanArr, n * sizeof(int));
+ cudaMalloc((void**)&out, n * sizeof(int));
+ cudaMalloc((void**)&in, n * sizeof(int));
+
+ cudaMemcpy(in, idata, n * sizeof(int), cudaMemcpyHostToDevice);
+
+ //timer().startGpuTimer();
+
// TODO
- timer().endGpuTimer();
- return -1;
+ cudaDeviceProp prop;
+ cudaGetDeviceProperties(&prop, 0);
+
+ int minBlockSize = prop.warpSize, maxBlockSize = prop.maxThreadsPerBlock, sms = prop.multiProcessorCount;
+ int curBlockSize = std::max(minBlockSize, std::min(n, maxBlockSize));
+ int gridSize = (n + curBlockSize - 1) / curBlockSize;
+
+ // Step 1: Compute temporary array of 0s and 1s
+ StreamCompaction::Common::kernMapToBoolean<<<gridSize, curBlockSize>>>(n, bools, in);
+
+ // Step2: Run exclusive scan on tempArr
+ scan(n, scanArr, bools);
+
+ // Step 3: Scatter
+ StreamCompaction::Common::kernScatter<<<gridSize, curBlockSize>>>(n, out, in, bools, scanArr);
+
+ //timer().endGpuTimer();
+
+ int count = 0, lastScan = 0;
+ cudaMemcpy(&count, scanArr + n - 1, sizeof(int), cudaMemcpyDeviceToHost);
+ cudaMemcpy(&lastScan, bools + n - 1, sizeof(int), cudaMemcpyDeviceToHost);
+ cudaMemcpy(odata, out, n * sizeof(int), cudaMemcpyDeviceToHost);
+
+ cudaFree(bools);
+ cudaFree(scanArr);
+ cudaFree(out);
+ cudaFree(in);
+
+ return count + lastScan;
}
}
}
diff --git a/stream_compaction/efficient.h b/stream_compaction/efficient.h
index 803cb4f..4bd9d70 100644
--- a/stream_compaction/efficient.h
+++ b/stream_compaction/efficient.h
@@ -6,8 +6,16 @@ namespace StreamCompaction {
namespace Efficient {
StreamCompaction::Common::PerformanceTimer& timer();
+ void scanSharedHelper(int n, int depth, int* odata, int* idata);
+ void scanShared(int n, int *odata, const int *idata);
+
void scan(int n, int *odata, const int *idata);
int compact(int n, int *odata, const int *idata);
+
+ // helper
+ void createSumArr(int depth, int next_power_of_two);
+
+ void freeSumArr(int depth);
}
}
diff --git a/stream_compaction/naive.cu b/stream_compaction/naive.cu
index 4308876..2617111 100644
--- a/stream_compaction/naive.cu
+++ b/stream_compaction/naive.cu
@@ -11,15 +11,63 @@ namespace StreamCompaction {
static PerformanceTimer timer;
return timer;
}
+
// TODO: __global__
+ __global__ void kernNaiveScan(int n, int offset, int* x, const int* last) {
+ int index = threadIdx.x + (blockIdx.x * blockDim.x);
+ if (index >= n) {
+ return;
+ }
+
+ x[index] = last[index];
+ if (index >= offset) {
+ x[index] += last[index - offset];
+ }
+ }
+
+ __global__ void kernShift(int n, int* x, const int* last) {
+ int index = threadIdx.x + (blockIdx.x * blockDim.x);
+ if (index >= n) {
+ return;
+ }
+
+ x[index] = index == 0 ? 0 : last[index - 1];
+ }
+
/**
* Performs prefix-sum (aka scan) on idata, storing the result into odata.
*/
void scan(int n, int *odata, const int *idata) {
+ // memory operation
+ int *x, *last;
+ cudaMalloc((void**)&x, n * sizeof(int));
+ cudaMalloc((void**)&last, n * sizeof(int));
+ cudaMemcpy(last, idata, n * sizeof(int), cudaMemcpyHostToDevice);
+
+ cudaDeviceProp prop;
+ cudaGetDeviceProperties(&prop, 0);
+
+ int minBlockSize = prop.warpSize, maxBlockSize = prop.maxThreadsPerBlock, sms = prop.multiProcessorCount;
+ int curBlockSize = std::max(minBlockSize, std::min(n, maxBlockSize));
+ int gridSize = (n + curBlockSize - 1) / curBlockSize;
+
timer().startGpuTimer();
+
+ int max_d = ilog2ceil(n);
// TODO
+ for (int d = 1; d <= max_d; ++d) {
+ kernNaiveScan<<<gridSize, curBlockSize>>>(n, 1 << (d - 1), x, last);
+ std::swap(x, last);
+ }
+
+ kernShift<<<gridSize, curBlockSize>>>(n, x, last);
+
timer().endGpuTimer();
+
+ cudaMemcpy(odata, x, n * sizeof(int), cudaMemcpyDeviceToHost);
+ cudaFree(x);
+ cudaFree(last);
}
}
}
diff --git a/stream_compaction/radixsort.cu b/stream_compaction/radixsort.cu
new file mode 100644
index 0000000..a1a6595
--- /dev/null
+++ b/stream_compaction/radixsort.cu
@@ -0,0 +1,112 @@
+#include <cuda.h>
+#include <cuda_runtime.h>
+#include "common.h"
+
+#include "efficient.h"
+
+namespace StreamCompaction {
+ namespace RadixSort {
+ using StreamCompaction::Common::PerformanceTimer;
+ PerformanceTimer& timer()
+ {
+ static PerformanceTimer timer;
+ return timer;
+ }
+
+ __global__ void kernMaxElement(int n, int* res, const int* idata) {
+ int index = threadIdx.x + (blockIdx.x * blockDim.x);
+ if (index >= n) {
+ return;
+ }
+
+ // atomicMax avoids the read-modify-write race between threads.
+ atomicMax(res, idata[index]);
+ }
+
+ __global__ void kernMapToBoolean(int n, int bit, int* e, const int* idata) {
+ int index = threadIdx.x + (blockIdx.x * blockDim.x);
+ if (index >= n) {
+ return;
+ }
+
+ e[index] = (idata[index] & (1 << bit)) == 0;
+ }
+
+ __global__ void kernScatter(int n, int *cur, int *last, int *e, int *f) {
+ int index = threadIdx.x + (blockIdx.x * blockDim.x);
+ if (index >= n) {
+ return;
+ }
+
+ int totalFalses = f[n - 1] + e[n - 1];
+
+ int t = index - f[index] + totalFalses;
+ int d = e[index] ? f[index] : t;
+ last[d] = cur[index];
+ }
+
+
+ void sort(int n, int* odata, const int* idata) {
+ int max_d = ilog2ceil(n);
+ int next_power_of_two = 1 << max_d;
+
+ int *last, *cur;
+ cudaMalloc((void**)&last, n * sizeof(int)); // output of the current pass
+ cudaMalloc((void**)&cur, n * sizeof(int)); // input of the current pass
+
+ cudaMemcpy(cur, idata, n * sizeof(int), cudaMemcpyHostToDevice);
+
+ // e/f are sized to the next power of two because the shared-memory scan
+ // runs on power-of-two lengths; e's padded tail is zeroed once and never
+ // written afterwards, so the scan over the padding stays correct.
+ int *e, *f;
+ cudaMalloc((void**)&e, next_power_of_two * sizeof(int));
+ cudaMalloc((void**)&f, next_power_of_two * sizeof(int));
+ cudaMemset(e, 0, next_power_of_two * sizeof(int));
+
+ cudaDeviceProp prop;
+ cudaGetDeviceProperties(&prop, 0);
+
+ int minBlockSize = prop.warpSize, maxBlockSize = prop.maxThreadsPerBlock, sms = prop.multiProcessorCount;
+ int curBlockSize = std::max(minBlockSize, std::min(n, maxBlockSize));
+ int curGridSize = (n + curBlockSize - 1) / curBlockSize;
+
+ int* maxElement;
+ cudaMalloc((void**)&maxElement, sizeof(int));
+ cudaMemset(maxElement, 0, sizeof(int)); // inputs are non-negative
+ kernMaxElement<<<curGridSize, curBlockSize>>>(n, maxElement, cur);
+
+ int maxVal;
+ cudaMemcpy(&maxVal, maxElement, sizeof(int), cudaMemcpyDeviceToHost);
+ int numBits = ilog2ceil(maxVal);
+
+ StreamCompaction::Efficient::createSumArr(max_d, next_power_of_two);
+
+ timer().startGpuTimer();
+ // run split on each bit
+ for (int bit = 0; bit <= numBits; ++bit) {
+ // find b
+ // get e array
+ kernMapToBoolean<<<curGridSize, curBlockSize>>>(n, bit, e, cur); // e[i] = 1 where the bit is 0
+
+ // exclusive scan e
+ StreamCompaction::Efficient::scanSharedHelper(next_power_of_two, 0, f, e);
+
+ // find scatter indices
+
+ kernScatter<<<curGridSize, curBlockSize>>>(n, cur, last, e, f);
+ std::swap(last, cur);
+ }
+
+ timer().endGpuTimer();
+
+ cudaMemcpy(odata, cur, n * sizeof(int), cudaMemcpyDeviceToHost);
+
+ cudaFree(last);
+ cudaFree(cur);
+ cudaFree(e);
+ cudaFree(f);
+ cudaFree(maxElement);
+
+ StreamCompaction::Efficient::freeSumArr(max_d);
+ }
+ }
+}
diff --git a/stream_compaction/radixsort.h b/stream_compaction/radixsort.h
new file mode 100644
index 0000000..8344b17
--- /dev/null
+++ b/stream_compaction/radixsort.h
@@ -0,0 +1,11 @@
+#pragma once
+
+#include "common.h"
+
+namespace StreamCompaction {
+ namespace RadixSort {
+ StreamCompaction::Common::PerformanceTimer& timer();
+
+ void sort(int n, int* odata, const int* idata);
+ }
+}
diff --git a/stream_compaction/thrust.cu b/stream_compaction/thrust.cu
index 1def45e..8658a10 100644
--- a/stream_compaction/thrust.cu
+++ b/stream_compaction/thrust.cu
@@ -18,11 +18,18 @@ namespace StreamCompaction {
* Performs prefix-sum (aka scan) on idata, storing the result into odata.
*/
void scan(int n, int *odata, const int *idata) {
- timer().startGpuTimer();
// TODO use `thrust::exclusive_scan`
// example: for device_vectors dv_in and dv_out:
// thrust::exclusive_scan(dv_in.begin(), dv_in.end(), dv_out.begin());
+
+ thrust::device_vector<int> dev_in(idata, idata + n);
+ thrust::device_vector<int> dev_out(n);
+
+ timer().startGpuTimer();
+ thrust::exclusive_scan(dev_in.begin(), dev_in.end(), dev_out.begin());
timer().endGpuTimer();
+
+ thrust::copy(dev_out.begin(), dev_out.end(), odata);
}
}
}