diff --git a/README.md b/README.md index 0e38ddb..6eaf976 100644 --- a/README.md +++ b/README.md @@ -3,12 +3,144 @@ CUDA Stream Compaction **University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 2** -* (TODO) YOUR NAME HERE - * (TODO) [LinkedIn](), [personal website](), [twitter](), etc. -* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab) +* Yifan Lu + * [LinkedIn](https://www.linkedin.com/in/yifan-lu-495559231/), [personal website](http://portfolio.samielouse.icu/) +* Tested on: Windows 11, AMD Ryzen 7 5800H 3.20 GHz, Nvidia GeForce RTX 3060 Laptop GPU (Personal Laptop) -### (TODO: Your README) +## Project Features +- CPU Scan & Stream Compaction +- Naive GPU Scan Algorithm +- Work-Efficient GPU Scan & Stream Compaction +- Thrust Scan +- **[Extra Credit]** Optimized work-efficient scan with bitwise kernel operations and dynamically computed block counts -Include analysis, etc. (Remember, this is public, so don't put -anything here that you don't want to share with the world.) +## Introduction + +### Scan +Scan processes the elements of an array to produce a new array in which each position holds an aggregation (a sum, product, or logical operation) of the elements up to that position in the original array. In this project, we scan for the sum. + +There are two main varieties of scan: + + - **Inclusive Scan** + +Each element in the output array includes the corresponding element of the input array in the sum. + + - **Exclusive Scan** + +Each element in the output array excludes the corresponding element of the input array, starting from an initial value (zero, for a sum). + + +### Compaction +Compaction removes or filters out unwanted elements from an array or list, producing a new array that contains only the elements satisfying a given condition.
+ +### Parallelization +Parallelizing scan and compaction can significantly improve performance, especially on large datasets. In a parallel implementation, the array is divided among multiple processors or threads, each performing the mapping, scanning, and scattering on its own segment of the array. + + +## Performance Analysis + +### Blocksize Optimization + +For my implementation and hardware, performance improves up to a block size of 256; increasing the block size further does not change performance significantly. + +This is the GPU time for the work-efficient method on an array of size $2^{15}$, which reaches its lowest point at block size 256. + +![](img/blocksize.png) + +### Compare GPU Scan Implementations + +The following charts show the running time of the CPU, GPU naive, GPU work-efficient, and Thrust scans. + +![](img/time.png) +![](img/timelog2.png) + +Naive scan needs $O(\log_2 n)$ passes, and each pass performs $O(n)$ computations. The work-efficient scan uses a binary-tree structure, so a single traversal of the tree needs only $O(n)$ computations in total. + +The work-efficient scan shows a significant advantage as the array size grows. + +### Performance Bottlenecks + +To trace GPU bottlenecks during scanning, I profiled the release build with Nsight Systems. + +![](img/b4numblock.png) + +The screenshot above is taken from Nsight Systems; the highlighted time period corresponds to the work-efficient scan. + +From the graph we can see time gaps between GPU SM executions, probably caused by non-coalesced accesses to global memory. + +To speed up each kernel's computations, I replaced operations such as modulo and multiplication/division with bitwise operations. + +For example, I changed ``` index % offset == 0 ``` into ``` (index & (offset - 1)) == 0 ```. This works because ```offset``` is always a power of 2, so the indices we are looking for are exactly the multiples of ```offset```.
By ANDing ```index``` with ```offset - 1```, which in binary is a run of 1s, we can check whether the result is 0. If it is, ```index``` is evenly divisible by ```offset```. + +Also, to avoid leaving too many threads idle, I **dynamically change the number of blocks each kernel launch uses in the work-efficient method**. + +After the optimization, the timeline looks like the graph below, with longer SM active time. However, some time gaps remain because of memory latency. + +![](img/afternumblock.png) + +### Test Output + +The test was run with an array size of 2^10 and a block size of 256. + +``` + +**************** +** SCAN TESTS ** +**************** + [ 16 47 5 34 25 42 48 11 36 49 39 23 0 ... 42 0 ] +==== cpu scan, power-of-two ==== + elapsed time: 0.0008ms (std::chrono Measured) + [ 0 16 63 68 102 127 169 217 228 264 313 352 375 ... 25401 25443 ] +==== cpu scan, non-power-of-two ==== + elapsed time: 0.0008ms (std::chrono Measured) + [ 0 16 63 68 102 127 169 217 228 264 313 352 375 ... 25372 25394 ] + passed +==== naive scan, power-of-two ==== + elapsed time: 0.342016ms (CUDA Measured) + [ 0 16 63 68 102 127 169 217 228 264 313 352 375 ... 25401 25443 ] + passed +==== naive scan, non-power-of-two ==== + elapsed time: 0.105472ms (CUDA Measured) + [ 0 16 63 68 102 127 169 217 228 264 313 352 375 ... 0 0 ] + passed +==== work-efficient scan, power-of-two ==== + elapsed time: 0.897024ms (CUDA Measured) + [ 0 16 63 68 102 127 169 217 228 264 313 352 375 ... 25401 25443 ] + passed +==== work-efficient scan, non-power-of-two ==== + elapsed time: 0.27648ms (CUDA Measured) + [ 0 16 63 68 102 127 169 217 228 264 313 352 375 ... 25372 25394 ] + passed +==== thrust scan, power-of-two ==== + elapsed time: 0.121856ms (CUDA Measured) + passed +==== thrust scan, non-power-of-two ==== + elapsed time: 0.04096ms (CUDA Measured) + passed + +***************************** +** STREAM COMPACTION TESTS ** +***************************** + [ 3 1 3 1 1 3 1 1 0 1 1 1 3 ...
0 0 ] +==== cpu compact without scan, power-of-two ==== + elapsed time: 0.0028ms (std::chrono Measured) + [ 3 1 3 1 1 3 1 1 1 1 1 3 1 ... 1 1 ] + passed +==== cpu compact without scan, non-power-of-two ==== + elapsed time: 0.0022ms (std::chrono Measured) + [ 3 1 3 1 1 3 1 1 1 1 1 3 1 ... 1 1 ] + passed +==== cpu compact with scan ==== + elapsed time: 0.0005ms (std::chrono Measured) + [ 3 1 3 1 1 3 1 1 1 1 1 3 1 ... 1 1 ] + passed +==== work-efficient compact, power-of-two ==== + elapsed time: 0.27648ms (CUDA Measured) + [ 3 1 3 1 1 3 1 1 1 1 1 3 1 ... 1 1 ] + passed +==== work-efficient compact, non-power-of-two ==== + elapsed time: 0.27648ms (CUDA Measured) + [ 3 1 3 1 1 3 1 1 1 1 1 3 1 ... 1 1 ] + passed +``` diff --git a/img/afternumblock.png b/img/afternumblock.png new file mode 100644 index 0000000..fa932eb Binary files /dev/null and b/img/afternumblock.png differ diff --git a/img/b4numblock.png b/img/b4numblock.png new file mode 100644 index 0000000..9af8923 Binary files /dev/null and b/img/b4numblock.png differ diff --git a/img/blocksize.png b/img/blocksize.png new file mode 100644 index 0000000..9dcb383 Binary files /dev/null and b/img/blocksize.png differ diff --git a/img/nsystem_trace.png b/img/nsystem_trace.png new file mode 100644 index 0000000..50f4339 Binary files /dev/null and b/img/nsystem_trace.png differ diff --git a/img/nsystem_trace2.png b/img/nsystem_trace2.png new file mode 100644 index 0000000..c35d0d7 Binary files /dev/null and b/img/nsystem_trace2.png differ diff --git a/img/time.png b/img/time.png new file mode 100644 index 0000000..713aa2e Binary files /dev/null and b/img/time.png differ diff --git a/img/timelog.png b/img/timelog.png new file mode 100644 index 0000000..719dfab Binary files /dev/null and b/img/timelog.png differ diff --git a/img/timelog2.png b/img/timelog2.png new file mode 100644 index 0000000..9fafae7 Binary files /dev/null and b/img/timelog2.png differ diff --git a/src/main.cpp b/src/main.cpp index 896ac2b..f132fc4 
100644 --- a/src/main.cpp +++ b/src/main.cpp @@ -13,7 +13,7 @@ #include #include "testing_helpers.hpp" -const int SIZE = 1 << 8; // feel free to change the size of array +const int SIZE = 1 << 10; // feel free to change the size of array const int NPOT = SIZE - 3; // Non-Power-Of-Two int *a = new int[SIZE]; int *b = new int[SIZE]; @@ -51,7 +51,7 @@ int main(int argc, char* argv[]) { printDesc("naive scan, power-of-two"); StreamCompaction::Naive::scan(SIZE, c, a); printElapsedTime(StreamCompaction::Naive::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); - //printArray(SIZE, c, true); + printArray(SIZE, c, true); printCmpResult(SIZE, b, c); /* For bug-finding only: Array of 1s to help find bugs in stream compaction or scan @@ -64,21 +64,21 @@ int main(int argc, char* argv[]) { printDesc("naive scan, non-power-of-two"); StreamCompaction::Naive::scan(NPOT, c, a); printElapsedTime(StreamCompaction::Naive::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); - //printArray(SIZE, c, true); + printArray(SIZE, c, true); printCmpResult(NPOT, b, c); zeroArray(SIZE, c); printDesc("work-efficient scan, power-of-two"); - StreamCompaction::Efficient::scan(SIZE, c, a); + StreamCompaction::Efficient::scan_opt(SIZE, c, a); printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); - //printArray(SIZE, c, true); + printArray(SIZE, c, true); printCmpResult(SIZE, b, c); zeroArray(SIZE, c); printDesc("work-efficient scan, non-power-of-two"); - StreamCompaction::Efficient::scan(NPOT, c, a); + StreamCompaction::Efficient::scan_opt(NPOT, c, a); printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); - //printArray(NPOT, c, true); + printArray(NPOT, c, true); printCmpResult(NPOT, b, c); zeroArray(SIZE, c); @@ -108,6 +108,7 @@ int main(int argc, char* argv[]) { int count, expectedCount, expectedNPOT; + // initialize b using 
StreamCompaction::CPU::compactWithoutScan you implement // We use b for further comparison. Make sure your StreamCompaction::CPU::compactWithoutScan is correct. zeroArray(SIZE, b); @@ -137,16 +138,17 @@ int main(int argc, char* argv[]) { printDesc("work-efficient compact, power-of-two"); count = StreamCompaction::Efficient::compact(SIZE, c, a); printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); - //printArray(count, c, true); + printArray(count, c, true); printCmpLenResult(count, expectedCount, b, c); zeroArray(SIZE, c); printDesc("work-efficient compact, non-power-of-two"); count = StreamCompaction::Efficient::compact(NPOT, c, a); printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); - //printArray(count, c, true); + printArray(count, c, true); printCmpLenResult(count, expectedNPOT, b, c); + system("pause"); // stop Win32 console from closing on exit delete[] a; delete[] b; diff --git a/src/testing_helpers.hpp b/src/testing_helpers.hpp index 025e94a..8ac1d3e 100644 --- a/src/testing_helpers.hpp +++ b/src/testing_helpers.hpp @@ -10,7 +10,9 @@ template int cmpArrays(int n, T *a, T *b) { for (int i = 0; i < n; i++) { if (a[i] != b[i]) { - printf(" a[%d] = %d, b[%d] = %d\n", i, a[i], i, b[i]); + printf(" a[%d] = %d, b[%d] = %d\n ", i, a[i], i, b[i]); + printf(" a[%d] = %d, b[%d] = %d\n ", i - 1, a[i - 1], i - 1, b[i - 1]); + printf(" a[%d] = %d, b[%d] = %d\n ", i +1, a[i + 1], i + 1, b[i + 1]); return 1; } } @@ -58,6 +60,7 @@ void genArray(int n, int *a, int maxval) { } void printArray(int n, int *a, bool abridged = false) { + //printf("count: %d \n", n); printf(" [ "); for (int i = 0; i < n; i++) { if (abridged && i + 2 == 15 && n > 16) { diff --git a/stream_compaction/common.cu b/stream_compaction/common.cu index 2ed6d63..bd5071c 100644 --- a/stream_compaction/common.cu +++ b/stream_compaction/common.cu @@ -24,6 +24,12 @@ namespace 
StreamCompaction { */ __global__ void kernMapToBoolean(int n, int *bools, const int *idata) { // TODO + int index = threadIdx.x + (blockIdx.x * blockDim.x); + if (index >= n) { + return; + } + bools[index] = (idata[index] != 0) ? 1 : 0; + } /** @@ -33,6 +39,14 @@ namespace StreamCompaction { __global__ void kernScatter(int n, int *odata, const int *idata, const int *bools, const int *indices) { // TODO + int index = threadIdx.x + (blockIdx.x * blockDim.x); + if (index >= n) { + return; + } + if (bools[index] == 1) { + odata[indices[index]] = idata[index]; + } + } } diff --git a/stream_compaction/cpu.cu b/stream_compaction/cpu.cu index 719fa11..8bf1ab6 100644 --- a/stream_compaction/cpu.cu +++ b/stream_compaction/cpu.cu @@ -20,6 +20,10 @@ namespace StreamCompaction { void scan(int n, int *odata, const int *idata) { timer().startCpuTimer(); // TODO + odata[0] = 0; + for (int i = 1; i < n; i++) { + odata[i] = odata[i - 1] + idata[i - 1]; + } timer().endCpuTimer(); } @@ -31,8 +35,15 @@ namespace StreamCompaction { int compactWithoutScan(int n, int *odata, const int *idata) { timer().startCpuTimer(); // TODO + int count = 0; + for (int i = 0; i < n; i++) { + if (idata[i] != 0) { + odata[count] = idata[i]; + count++; + } + } timer().endCpuTimer(); - return -1; + return count; } /** @@ -41,10 +52,25 @@ namespace StreamCompaction { * @returns the number of elements remaining after compaction. */ int compactWithScan(int n, int *odata, const int *idata) { - timer().startCpuTimer(); + //timer().startCpuTimer(); // TODO - timer().endCpuTimer(); - return -1; + int *bools = new int[n]; + for (int i = 0; i < n; i++) { + bools[i] = (idata[i] != 0) ? 1 : 0; + } + int* scanResult = new int[n]; + scan(n, scanResult, bools); + // scatter + int count = bools[n - 1] == 1 ? 
scanResult[n - 1] + 1 : scanResult[n - 1]; + for (int i = 0; i < n; i++) { + if (bools[i] == 1) { + odata[scanResult[i]] = idata[i]; + } + } + //timer().endCpuTimer(); + delete[] bools; + delete[] scanResult; + return count; } } } diff --git a/stream_compaction/efficient.cu b/stream_compaction/efficient.cu index 2db346e..c5348b6 100644 --- a/stream_compaction/efficient.cu +++ b/stream_compaction/efficient.cu @@ -3,6 +3,9 @@ #include "common.h" #include "efficient.h" +#define blockSize 256 +#define RECURSIVE_SCAN 0 + namespace StreamCompaction { namespace Efficient { using StreamCompaction::Common::PerformanceTimer; @@ -12,15 +15,249 @@ namespace StreamCompaction { return timer; } + + // up-sweep kernel + __global__ void kernUpSweep(int n, int* odata, int* idata, int t) { + int index = threadIdx.x + (blockIdx.x * blockDim.x); + if (index >= n) { + return; + } +#if RECURSIVE_SCAN + // exclusive scan + odata[index] = (index > 0) ? idata[index - 1] : 0; + __syncthreads(); + // upsweep + for (int d = 0; d <= t; ++d) { + int offset = 1 << (d + 1); + int ai = index + offset - 1; + int bi = index + (offset / 2) - 1; + if (index < n && (index % offset) == 0) { + odata[ai] += odata[bi]; + } + + __syncthreads(); + } +#else + int offset = 1 << (t + 1); // 2^(t + 1) + int ai = index + offset - 1; + int bi = index + (offset >> 1) - 1; + if ((index & (offset - 1)) == 0) { + idata[ai] += idata[bi]; + } +#endif + } + + // down-sweep kernel + __global__ void kernDownSweep(int n, int* odata, const int* idata, int t) { + int index = threadIdx.x + (blockIdx.x * blockDim.x); +#if RECURSIVE_SCAN + if (index >= 1 << (t + 1)) { + return; + } + // exclusive scan + odata[index] = (index > 0) ?
idata[index - 1] : 0; + __syncthreads(); + // downsweep + if (index == 0) { + odata[n - 1] = 0; + } + for (int d = t; d >= 0; --d) { + int offset = 1 << (d + 1); + int ai = index + offset - 1; + int bi = index + (offset / 2) - 1; + if (index < n && (index % offset) == 0) { + int temp = odata[bi]; + odata[bi] = odata[ai]; + odata[ai] += temp; + } + + __syncthreads(); + } +#else + if (index >= n) { + return; + } + + __syncthreads(); + + int offset = 1 << (t + 1); + int ai = index + offset - 1; + int bi = index + (offset >> 1) - 1; + if ((index & (offset - 1)) == 0) { + int temp = odata[bi]; + odata[bi] = odata[ai]; + odata[ai] += temp; + } + + __syncthreads(); + + +#endif + } + + __global__ void kernUpSweep_opt(int n, int* idata, int d) { + int index = threadIdx.x + (blockIdx.x * blockDim.x); + if (index >= n) { + return; + } + int offset = 1 << (d + 1); + index <<= (d + 1); + idata[index + offset - 1] += idata[index + (offset >> 1) - 1]; + } + + __global__ void kernDownSweep_opt(int n, int* idata, int d) { + int index = threadIdx.x + (blockIdx.x * blockDim.x); + if (index >= n) { + return; + } + int offset = 1 << (d + 1); + index <<= (d + 1); + int temp = idata[index + (offset >> 1) - 1]; + idata[index + (offset >> 1) - 1] = idata[index + offset - 1]; + idata[index + offset - 1] += temp; + } + + + // up sweep + down aweep + __global__ void kernScan(int n, int* odata, const int* idata, int t) { + int index = threadIdx.x + (blockIdx.x * blockDim.x); + int paddedSize = 1 << (t + 1); + if (index >= paddedSize) { + return; + } + // exclusive scan + //odata[index] = (index > 0) ? idata[index - 1] : 0; + //odata[index] = idata[index]; + odata[index] = (index >= n) ? 
0 : idata[index]; + __syncthreads(); + // upsweep + for (int d = 0; d <= t; ++d) { + int offset = 1 << (d + 1); + int ai = index + offset - 1; + int bi = index + (offset >> 1) - 1; + if (index < paddedSize && ((index & (offset - 1)) == 0)) { + odata[ai] += odata[bi]; + } + + __syncthreads(); + } + // downsweep + if (index == 0) { + odata[paddedSize - 1] = 0; + } + + for (int d = t; d >= 0; --d) { + int offset = 1 << (d + 1); + int ai = index + offset - 1; + int bi = index + (offset >> 1) - 1; + if (index < paddedSize && ((index & (offset - 1)) == 0)) { + int temp = odata[bi]; + odata[bi] = odata[ai]; + odata[ai] += temp; + } + + __syncthreads(); + } + } + + /** * Performs prefix-sum (aka scan) on idata, storing the result into odata. */ void scan(int n, int *odata, const int *idata) { - timer().startGpuTimer(); + // TODO - timer().endGpuTimer(); + int t = ilog2ceil(n) - 1; + int peddedSize = 1 << (t + 1); + //const int blockSize = 128; + int numBlocks = (peddedSize + blockSize - 1) / blockSize; + dim3 fullBlocksPerGrid(numBlocks); + + /*printf("log2_n - 1: %d\n", t); + printf("array size: %d; pedded size: %d\n", n, peddedSize); + printf("block numbers: %d\n", numBlocks);*/ + // call kernel + int* dev_idata; + int* dev_odata; + cudaMalloc((void**)&dev_idata, peddedSize * sizeof(int)); + cudaMalloc((void**)&dev_odata, peddedSize * sizeof(int)); + cudaMemset(dev_odata, 0, peddedSize * sizeof(int)); + cudaMemset(dev_idata, 0, peddedSize * sizeof(int)); + cudaMemcpy(dev_idata, idata, n * sizeof(int), cudaMemcpyHostToDevice); + cudaMemcpy(dev_odata, idata, n * sizeof(int), cudaMemcpyHostToDevice); + + timer().startGpuTimer(); + +#if RECURSIVE_SCAN + //kernUpSweep << <1, n >> > (n, dev_odata, dev_idata, t); + //kernDownSweep << <1, n >> > (n, dev_odata, dev_idata, t); + //kernScan << > > (n, dev_odata, dev_idata, t); // arbitrary block size + kernScan << <1, n >> > (n, dev_odata, dev_idata, t); +#else + + // up-sweep + for (int d = 0; d <= t; ++d) { + kernUpSweep << > > 
(peddedSize, dev_odata, dev_idata, d); + } + // down sweep + // set last element to 0 + + cudaMemset(dev_idata + peddedSize - 1, 0, sizeof(int)); + for (int d = t; d >= 0; d--) { + kernDownSweep << > > (peddedSize, dev_idata, dev_idata, d); + } + +#endif + timer().endGpuTimer(); + cudaMemcpy(odata, dev_idata, n * sizeof(int), cudaMemcpyDeviceToHost); + cudaFree(dev_idata); + cudaFree(dev_odata); } + // dynamic block number + void scan_opt(int n, int* odata, const int* idata) { + int t = ilog2ceil(n) - 1; + int peddedSize = 1 << (t + 1); + int numBlocks = (peddedSize + blockSize - 1) / blockSize; + dim3 fullBlocksPerGrid(numBlocks); + + // call kernel + int* dev_idata; + cudaMalloc((void**)&dev_idata, peddedSize * sizeof(int)); + cudaMemset(dev_idata, 0, peddedSize * sizeof(int)); + cudaMemcpy(dev_idata, idata, n * sizeof(int), cudaMemcpyHostToDevice); + + timer().startGpuTimer(); + + // up-sweep + for (int d = 0; d <= t; d++) { + int offset = 1 << (d + 1); + int activeThreads = peddedSize / offset; + numBlocks = (activeThreads + offset - 1) / offset; + dim3 fullBlocksPerGrid(numBlocks); + kernUpSweep_opt << > > (activeThreads, dev_idata, d); + cudaDeviceSynchronize(); + } + // down sweep + // set last element to 0 + cudaMemset(dev_idata + peddedSize - 1, 0, sizeof(int)); + for (int d = t; d >= 0; d--) { + int offset = 1 << (d + 1); + int activeThreads = peddedSize / offset; + numBlocks = (activeThreads + offset - 1) / offset; + dim3 fullBlocksPerGrid(numBlocks); + kernDownSweep_opt << > > (activeThreads, dev_idata, d); + cudaDeviceSynchronize(); + } + + timer().endGpuTimer(); + cudaMemcpy(odata, dev_idata, n * sizeof(int), cudaMemcpyDeviceToHost); + cudaFree(dev_idata); + + } + + + /** * Performs stream compaction on idata, storing the result into odata. * All zeroes are discarded. @@ -31,10 +268,85 @@ namespace StreamCompaction { * @returns The number of elements remaining after compaction. 
*/ int compact(int n, int *odata, const int *idata) { - timer().startGpuTimer(); + //timer().startGpuTimer(); // TODO - timer().endGpuTimer(); - return -1; + // compute bool array + int t = ilog2ceil(n) - 1; + int peddedSize = 1 << (t + 1); + int numBlocks = (peddedSize + blockSize - 1) / blockSize; + dim3 fullBlocksPerGrid(numBlocks); + int* dev_bools; + int* dev_idata; + int* dev_indices; + int* dev_odata; + int* bools = new int[peddedSize]; + int* indices = new int[peddedSize]; + cudaMalloc((void**)&dev_bools, peddedSize * sizeof(int)); + cudaMalloc((void**)&dev_idata, peddedSize * sizeof(int)); + cudaMalloc((void**)&dev_indices, peddedSize * sizeof(int)); + cudaMalloc((void**)&dev_odata, peddedSize * sizeof(int)); + cudaMemcpy(dev_idata, idata, n * sizeof(int), cudaMemcpyHostToDevice); + cudaMemset(dev_bools, 0, peddedSize * sizeof(int)); + cudaMemset(dev_indices, 0, peddedSize * sizeof(int)); + StreamCompaction::Common::kernMapToBoolean << > > (peddedSize, dev_bools, dev_idata); + // scan +#if 1 + kernScan << <1, n >> > (n, dev_indices, dev_bools, t); + // scatter + StreamCompaction::Common::kernScatter << <1, n >> > (n, dev_odata, dev_idata, dev_bools, dev_indices); + cudaMemcpy(odata, dev_odata, n * sizeof(int), cudaMemcpyDeviceToHost); + cudaMemcpy(bools, dev_bools, n * sizeof(int), cudaMemcpyDeviceToHost); + cudaMemcpy(indices, dev_indices, n * sizeof(int), cudaMemcpyDeviceToHost); + int count = bools[n - 1] ? 
indices[n - 1] + 1 : indices[n - 1]; + cudaFree(dev_bools); + cudaFree(dev_idata); + cudaFree(dev_indices); + cudaFree(dev_odata); + delete[] bools; + delete[] indices; +#else + // up-sweep + int* temp = new int[peddedSize]; + dev_odata = dev_bools; + for (int i = 0; i <= t; i++) { + //int offset = 1 << (i + 1); + //int numBlocks = (n + offset - 1) / offset; + //dim3 fullBlocksPerGrid(numBlocks); + kernUpSweep << > > (peddedSize, dev_odata, dev_odata, i); + } + + // set last element to 0 + cudaMemset(dev_odata + peddedSize - 1, 0, sizeof(int)); + // down-sweep + for (int i = t; i >= 0; i--) { + //int offset = 1 << (i + 1); + //int numBlocks = (n + offset - 1) / offset; + //dim3 fullBlocksPerGrid(numBlocks); + kernDownSweep << > > (peddedSize, dev_odata, dev_odata, i); + } + dev_indices = dev_odata; + cudaMemset(dev_odata, 0, peddedSize * sizeof(int)); + + // scatter + StreamCompaction::Common::kernScatter << > > (peddedSize, dev_odata, dev_idata, dev_bools, dev_indices); + cudaMemcpy(odata, dev_odata, n * sizeof(int), cudaMemcpyDeviceToHost); + cudaMemcpy(bools, dev_bools, n * sizeof(int), cudaMemcpyDeviceToHost); + cudaMemcpy(indices, dev_indices, n * sizeof(int), cudaMemcpyDeviceToHost); + int count = bools[n - 1] ? 
indices[n - 1] : indices[n - 1]; + + cudaFree(dev_bools); + cudaFree(dev_idata); + cudaFree(dev_indices); + cudaFree(dev_odata); + delete[] bools; + delete[] indices; + delete[] temp; + printf("work-efficient compact: %d\n", count); +#endif + + + //timer().endGpuTimer(); + return count; } } } diff --git a/stream_compaction/efficient.h b/stream_compaction/efficient.h index 803cb4f..c0b26ae 100644 --- a/stream_compaction/efficient.h +++ b/stream_compaction/efficient.h @@ -8,6 +8,8 @@ namespace StreamCompaction { void scan(int n, int *odata, const int *idata); + void scan_opt(int n, int* odata, const int* idata); + int compact(int n, int *odata, const int *idata); } } diff --git a/stream_compaction/naive.cu b/stream_compaction/naive.cu index 4308876..1b79a94 100644 --- a/stream_compaction/naive.cu +++ b/stream_compaction/naive.cu @@ -3,6 +3,8 @@ #include "common.h" #include "naive.h" +# define RECURSIVE_SCAN 0 + namespace StreamCompaction { namespace Naive { using StreamCompaction::Common::PerformanceTimer; @@ -12,14 +14,192 @@ namespace StreamCompaction { return timer; } // TODO: __global__ + int getLog2(int n) { + int log2 = 0; + while (n >>= 1) { + log2++; + } + return log2+1; + } + + __global__ void kernInclusiveScan(int n, int* odata, const int* idata, int t) { + int index = threadIdx.x + (blockIdx.x * blockDim.x); + + if (index >= n) { + return; + } + + odata[index] = idata[index]; + __syncthreads(); + + if(index >= t) { + odata[index] = idata[index - t] + odata[index]; + } + } + + __global__ void kernScan(int n, int* odata, const int* idata, int log2_n) { + int index = (blockIdx.x * blockDim.x) + threadIdx.x; + int pedding = 1 << log2_n; + if (index >= pedding) { + return; + } + // exclusive scan + odata[index] = (index > 0) ? 
idata[index - 1] : 0; + __syncthreads(); // odata with first element as 0 is ready + + for (int d = 1; d <= log2_n; ++d) { + int t = 1 << (d - 1); + int temp = 0; + if (index >= t) temp = odata[index - t]; // Load the previous step's value + __syncthreads(); // Synchronize before updating + if (index >= t) odata[index] += temp; // Update the current value + __syncthreads(); // Synchronize after updating + } + } + + __global__ void kernBlockWiseExclusiveScan(int n, int* odata, const int* idata, int blockSize) { + extern __shared__ int sdata[]; + + int idx = threadIdx.x; + int blockStartIndex = blockIdx.x * blockDim.x; + int index = blockStartIndex + idx; + + // Load data into shared memory + if (index < n) { + sdata[idx] = idata[index];//(idx > 0) ? idata[index - 1] : 0; + } + else { + sdata[idx] = 0; // Out-of-range threads + } + __syncthreads(); + + // Perform in-block scan + for (int d = 1; d < blockDim.x; d *= 2) { + int t = idx >= d ? sdata[idx - d] : 0; + __syncthreads(); + if (idx >= d) { + sdata[idx] += t; + } + __syncthreads(); + } + + // Write results to global memory + if (index < n) { + odata[index] = sdata[idx]; + } + } + + + + // kernel for write total sum of each block into a new array + __global__ void kernWriteBlockSum(int n, const int* odata, int* blockSum) { + int index = threadIdx.x + (blockIdx.x * blockDim.x); + if (index >= n) { + return; + } + + if ((index + 1) % (blockDim.x) == 0) { + int i = (index + 1) / (blockDim.x) - 1; + blockSum[i] = odata[index]; + } + } + + // kernel for add block increments to each element in the corresponding block + __global__ void kernAddBlockSum(int n, int* odata, const int* blockSum) { + int index = threadIdx.x + (blockIdx.x * blockDim.x); + if (index >= n) { + return; + } + + // exclusive scan + // set first element to 0 + + int temp = 0; + if (index == 0) { + temp = 0; + } + else { + int blockIdx = (index - 1) / blockDim.x; + int sumToAdd = blockSum[blockIdx]; + temp = odata[index - 1] + sumToAdd; + } + 
odata[index] = temp; + } + + /** * Performs prefix-sum (aka scan) on idata, storing the result into odata. */ void scan(int n, int *odata, const int *idata) { - timer().startGpuTimer(); - // TODO - timer().endGpuTimer(); + // TODO + int log2_n = getLog2(n); + int t = ilog2ceil(n); + //printf("n: %d\n", n); + //printf("log2_n: %d\n", t); + int blockSize = 256; + int numBlocks = (n + blockSize - 1) / blockSize; + dim3 fullBlocksPerGrid(numBlocks); + //printf("block size: %d\n", blockSize); + //printf("numBlocks: %d\n", numBlocks); + // get block size and block number for scan block sum + int numBlocks_scan = (numBlocks + blockSize - 1) / blockSize; + dim3 fullBlocksPerGrid_scan(numBlocks_scan); + + int* dev_idata; + int* dev_odata; + int* dev_blockSum; + int* dev_blockIncrements; + int* blockSum = new int[n]; + cudaMalloc((void**)&dev_idata, n * sizeof(int)); + cudaMalloc((void**)&dev_odata, n * sizeof(int)); + cudaMalloc((void**)&dev_blockSum, n * sizeof(int)); + cudaMalloc((void**)&dev_blockIncrements, n * sizeof(int)); + cudaMemcpy(dev_idata, idata, n * sizeof(int), cudaMemcpyHostToDevice); + cudaMemcpy(dev_blockSum, blockSum, blockSize * sizeof(int), cudaMemcpyHostToDevice); + + timer().startGpuTimer(); + // call kernel +# if RECURSIVE_SCAN + // scan on each block + //kernScan << > > (n, dev_odata, dev_idata, t); // non-shared memory one + kernBlockWiseExclusiveScan << > > (n, dev_odata, dev_idata, blockSize); // shared memory one + // write total sum of each block to blockSum + kernWriteBlockSum << > > (n, dev_odata, dev_blockSum); + // scan on blockSum + int blockSumSize = ilog2ceil(numBlocks); + kernScan << <1, numBlocks >> > (numBlocks, dev_blockIncrements, dev_blockSum, blockSumSize); + //kernScan << > > (numBlocks, dev_blockIncrements, dev_blockSum, blockSumSize); + // recursive scan on blockSum + // add block increments to each element in the corresponding block + kernAddBlockSum << > > (n, dev_odata, dev_blockIncrements); + //dev_odata = 
dev_blockIncrements; // for testing + timer().endGpuTimer(); + cudaMemcpy(odata, dev_odata, n * sizeof(int), cudaMemcpyDeviceToHost); +# else + for (int d = 1; d <= log2_n; d++) { + int pedding = 1 << (d - 1); + kernInclusiveScan <<<fullBlocksPerGrid, blockSize>>> (n, dev_odata, dev_idata, pedding); + int* temp = dev_idata; + dev_idata = dev_odata; + dev_odata = temp; + } + + + timer().endGpuTimer(); + + // right shift odata + odata[0] = 0; + cudaMemcpy(odata + 1, dev_idata, (n - 1) * sizeof(int), cudaMemcpyDeviceToHost); + +#endif + + cudaFree(dev_idata); + cudaFree(dev_odata); + cudaFree(dev_blockSum); + delete[] blockSum; + + } } } diff --git a/stream_compaction/thrust.cu b/stream_compaction/thrust.cu index 1def45e..4cf7138 100644 --- a/stream_compaction/thrust.cu +++ b/stream_compaction/thrust.cu @@ -18,11 +18,18 @@ namespace StreamCompaction { * Performs prefix-sum (aka scan) on idata, storing the result into odata. */ void scan(int n, int *odata, const int *idata) { - timer().startGpuTimer(); + // TODO use `thrust::exclusive_scan` // example: for device_vectors dv_in and dv_out: // thrust::exclusive_scan(dv_in.begin(), dv_in.end(), dv_out.begin()); + thrust::device_vector<int> dv_in(idata, idata + n); + thrust::device_vector<int> dv_out(n); + timer().startGpuTimer(); + thrust::exclusive_scan(dv_in.begin(), dv_in.end(), dv_out.begin()); timer().endGpuTimer(); + thrust::copy(dv_out.begin(), dv_out.end(), odata); + + } } }