diff --git a/README.md b/README.md
index 0e38ddb..61e8a79 100644
--- a/README.md
+++ b/README.md
@@ -1,14 +1,118 @@
 CUDA Stream Compaction
 ======================
 
-**University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 2**
+**University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 2 - Stream Compaction**
 
-* (TODO) YOUR NAME HERE
-  * (TODO) [LinkedIn](), [personal website](), [twitter](), etc.
-* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab)
+* Tong Hu
+  * [LinkedIn](https://www.linkedin.com/in/tong-hu-5819a122a/)
+* Timing tested on: Windows 11, Ryzen 7 1700X @ 3.4GHz 16GB, GTX 1080 16GB (Personal Computer)
+* Nsight Systems analysis tested on: Windows 11, i5-11600K @ 3.91GHz 32GB, RTX 2060 6GB (Personal Desktop), since the GTX 1080 does not support GPU metric collection
 
-### (TODO: Your README)
-
-Include analysis, etc. (Remember, this is public, so don't put
-anything here that you don't want to share with the world.)
+### Features
+- CPU Scan & Stream Compaction
+- Naive GPU Scan
+- Work-Efficient GPU Scan (extra-credit version) & Stream Compaction
+- Scan using Thrust's implementation
+
+### Introduction
+In this project, I implemented the Scan (exclusive prefix sum) operation that underlies stream compaction of an int array. I wrote three versions of Scan (a CPU version, a naive parallel version, and a work-efficient parallel version), used Thrust's exclusive scan as a reference implementation, and compared the performance of all of them.
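+
+To make the semantics concrete before the numbers: an exclusive scan under addition outputs, at each position, the sum of all inputs strictly before it. Below is a minimal toy sketch, not the project's benchmarked `cpu.cu` code; `exclusiveScan` is a name invented for this illustration.
+
+```
+#include <cstdio>
+
+// Toy CPU exclusive scan: odata[i] = idata[0] + ... + idata[i-1].
+void exclusiveScan(int n, int* odata, const int* idata) {
+    int sum = 0;
+    for (int i = 0; i < n; i++) {
+        odata[i] = sum;  // sum of all elements strictly before i
+        sum += idata[i];
+    }
+}
+
+int main() {
+    int in[5] = { 3, 1, 7, 0, 4 };
+    int out[5];
+    exclusiveScan(5, out, in);
+    for (int x : out) printf("%d ", x);  // prints: 0 3 4 11 11
+    return 0;
+}
+```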
+
+### Roughly optimizing the block size
+Figure 1. Time (ms) of the scan function vs. block size (array size = $2^{27}$)
+![Time(ms) vs. Block Size](/img/time_vs_blockSize.png)
+
+Table 1. Time (ms) of the scan function at different block sizes (array size = $2^{27}$)
+| Block size      | 1       | 8       | 16      | 32      | 64      | 128     | 256     | 512     |
+| --------------- | ------- | ------- | ------- | ------- | ------- | ------- | ------- | ------- |
+| naïve pow2      | 4378.43 | 588.428 | 313.288 | 164.764 | 146.405 | 144.845 | 144.111 | 144.72  |
+| naïve npow2     | 4373.24 | 571.921 | 291.82  | 164.399 | 146.764 | 144.936 | 144.916 | 144.649 |
+| efficient pow2  | 328.222 | 64.9941 | 57.1655 | 58.7879 | 62.4402 | 62.0726 | 62.0135 | 60.9551 |
+| efficient npow2 | 332.084 | 65.0913 | 56.9548 | 58.7963 | 62.0282 | 62.4519 | 61.8724 | 61.3648 |
+
+Figure 1 shows that as the block size grows, the runtime of the scan function drops rapidly, and beyond a certain point it barely changes as the block size keeps increasing. Table 1 shows that the naive parallel scan performs best at a block size of 256, while the work-efficient parallel scan performs best at 16. I therefore use 256 and 16 as the block sizes for the naive and work-efficient scans in the following performance analysis.
+
+### Performance analysis
+Figure 2. Time (ms) of the scan function at different array sizes.
+![Time(ms) vs. array size](/img/time_vs_arraySize.png)
+
+Table 2. Time (ms) of the scan function at different array sizes.
+| log2(array size) | 4        | 8        | 16       | 20       | 22      | 24      | 25      | 26      | 27      | 28      | 29      | 30      |
+| ---------------- | -------- | -------- | -------- | -------- | ------- | ------- | ------- | ------- | ------- | ------- | ------- | ------- |
+| cpu              | 0.0002   | 0.0008   | 0.153    | 2.4054   | 9.0351  | 37.6997 | 78.1179 | 153.777 | 304.135 | 592.904 | 1266.01 | 2379.02 |
+| naïve            | 0.02048  | 0.19456  | 0.318272 | 0.918432 | 3.64445 | 16.1397 | 33.5231 | 68.8091 | 144.143 | 302.976 | 639.805 | 4626.05 |
+| efficient        | 0.306176 | 0.326656 | 0.566368 | 1.32307  | 2.16864 | 7.9192  | 15.282  | 29.2795 | 57.1435 | 113.729 | 234.424 | 429.09  |
+| thrust           | 0.233504 | 0.131072 | 0.111616 | 0.929792 | 0.75264 | 1.57184 | 2.51597 | 3.45184 | 5.73133 | 10.7909 | 20.3489 | 112.254 |
+
+Figure 2 shows that the runtime of Scan grows as the array size grows.
+
+When the array size is small (below $2^{16}$), the CPU Scan outperforms the GPU Scans and Thrust's Scan. This is probably because the overhead of launching GPU kernels outweighs the benefit of parallelism at that scale. Once the array is reasonably large, the GPU Scans cost less time.
+
+Comparing the naive and work-efficient parallel scans, the figure shows that the work-efficient scan wins once the array is large (greater than $2^{20}$). The naive parallel scan performs $O(n \log n)$ floating-point additions, while the work-efficient scan performs only $O(n)$. Although both algorithms run fully in parallel in the ideal model, in practice the number of threads that can execute concurrently is bounded by the hardware, so the work-efficient scan comes out ahead because it launches far fewer threads in total (see the operation counts below).
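+
+To make the comparison concrete (a back-of-the-envelope count, assuming $n$ is a power of two): at pass $d$ the naive scan adds a partial sum into every element with index at least $2^{d-1}$, so it performs
+
+$$\sum_{d=1}^{\log_2 n} \left( n - 2^{d-1} \right) = n \log_2 n - (n - 1)$$
+
+additions in total, while the work-efficient scan performs $n-1$ additions in the up-sweep and another $n-1$ in the down-sweep, roughly $2n$ in total. At $n = 2^{27}$ that is about $27n$ adds versus $2n$.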
+
+Figure 3. Nsight Systems trace
+![Nsight Systems trace](/img/overall_2.png)
+
+Figure 3 shows that, compared with the self-implemented Scans, Thrust's scan has very low DRAM bandwidth usage, and its count of unallocated warps in active SMs is also very low. The following factors might explain the difference:
+
+1. Thrust's Scan might use shared memory and coalesced accesses to global memory, which improves memory throughput.
+2. Thrust's Scan might tune the block size and launch parameters to the workload and GPU model, while the self-implemented scans hard-code a block size that was only roughly tuned by eye (a sketch of occupancy-based tuning follows the Bottleneck section).
+
+### Bottleneck
+Also from Figure 3, the performance bottleneck of the Scan functions appears to be memory bandwidth: the DRAM bandwidth of both the naive and the work-efficient Scan is nearly saturated.
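+
+As a reference for factor 2 above, here is roughly how launch parameters could be picked programmatically rather than by eye. This is a sketch using CUDA's occupancy API, not code from this repo; `kernSomeScanStep` and `launchTuned` are hypothetical names standing in for any of the scan kernels.
+
+```
+#include <cuda_runtime.h>
+
+__global__ void kernSomeScanStep(int n, int* data) { /* ... */ }
+
+void launchTuned(int n, int* dev_data) {
+    int minGridSize = 0, blockSize = 0;
+    // Ask the runtime for the block size that maximizes occupancy for this kernel.
+    cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize, kernSomeScanStep, 0, 0);
+    int gridSize = (n + blockSize - 1) / blockSize;  // round up so all n elements are covered
+    kernSomeScanStep<<<gridSize, blockSize>>>(n, dev_data);
+}
+```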
+
+### Output of the test program with array size $2^{25}$
+```
+****************
+** SCAN TESTS **
+****************
+    [ 26 0 20 47 36 3 37 33 24 20 25 43 43 ... 6 0 ]
+==== cpu scan, power-of-two ====
+   elapsed time: 79.4345ms    (std::chrono Measured)
+    [ 0 26 26 46 93 129 132 169 202 226 246 271 314 ... 821748880 821748886 ]
+==== cpu scan, non-power-of-two ====
+   elapsed time: 78.2092ms    (std::chrono Measured)
+    [ 0 26 26 46 93 129 132 169 202 226 246 271 314 ... 821748831 821748870 ]
+    passed
+==== naive scan, power-of-two ====
+   elapsed time: 34.0556ms    (CUDA Measured)
+    passed
+==== naive scan, non-power-of-two ====
+   elapsed time: 33.1287ms    (CUDA Measured)
+    passed
+==== work-efficient scan, power-of-two ====
+   elapsed time: 14.6806ms    (CUDA Measured)
+    passed
+==== work-efficient scan, non-power-of-two ====
+   elapsed time: 14.6292ms    (CUDA Measured)
+    passed
+==== thrust scan, power-of-two ====
+   elapsed time: 2.33472ms    (CUDA Measured)
+    passed
+==== thrust scan, non-power-of-two ====
+   elapsed time: 2.2352ms    (CUDA Measured)
+    passed
+
+*****************************
+** STREAM COMPACTION TESTS **
+*****************************
+    [ 1 0 0 1 1 0 0 2 2 1 3 1 0 ... 3 0 ]
+==== cpu compact without scan, power-of-two ====
+   elapsed time: 82.3193ms    (std::chrono Measured)
+    [ 1 1 1 2 2 1 3 1 1 1 2 1 2 ... 2 3 ]
+    passed
+==== cpu compact without scan, non-power-of-two ====
+   elapsed time: 82.2665ms    (std::chrono Measured)
+    [ 1 1 1 2 2 1 3 1 1 1 2 1 2 ... 3 2 ]
+    passed
+==== cpu compact with scan ====
+   elapsed time: 329.982ms    (std::chrono Measured)
+    [ 1 1 1 2 2 1 3 1 1 1 2 1 2 ... 2 3 ]
+    passed
+==== work-efficient compact, power-of-two ====
+   elapsed time: 21.669ms    (CUDA Measured)
+    passed
+==== work-efficient compact, non-power-of-two ====
+   elapsed time: 21.466ms    (CUDA Measured)
+    passed
+```
\ No newline at end of file
diff --git a/img/overall.png b/img/overall.png
new file mode 100644
index 0000000..20374a4
Binary files /dev/null and b/img/overall.png differ
diff --git a/img/overall_2.png b/img/overall_2.png
new file mode 100644
index 0000000..1e61456
Binary files /dev/null and b/img/overall_2.png differ
diff --git a/img/thrust.png b/img/thrust.png
new file mode 100644
index 0000000..60f46af
Binary files /dev/null and b/img/thrust.png differ
diff --git a/img/time_vs_arraySize.png b/img/time_vs_arraySize.png
new file mode 100644
index 0000000..85ff02d
Binary files /dev/null and b/img/time_vs_arraySize.png differ
diff --git a/img/time_vs_blockSize.png b/img/time_vs_blockSize.png
new file mode 100644
index 0000000..a743efd
Binary files /dev/null and b/img/time_vs_blockSize.png differ
diff --git a/src/main.cpp b/src/main.cpp
index 896ac2b..d4a3150 100644
--- a/src/main.cpp
+++ b/src/main.cpp
@@ -13,7 +13,7 @@
 #include <stream_compaction/thrust.h>
 #include "testing_helpers.hpp"
 
-const int SIZE = 1 << 8; // feel free to change the size of array
+const int SIZE = 1 << 25; // feel free to change the size of array
 const int NPOT = SIZE - 3;  // Non-Power-Of-Two
 int *a = new int[SIZE];
 int *b = new int[SIZE];
diff --git a/stream_compaction/common.cu b/stream_compaction/common.cu
index 2ed6d63..16ab41c 100644
--- a/stream_compaction/common.cu
+++ b/stream_compaction/common.cu
@@ -1,4 +1,5 @@
 #include "common.h"
+#include "device_launch_parameters.h"
 
 void checkCUDAErrorFn(const char *msg, const char *file, int line) {
     cudaError_t err = cudaGetLastError();
@@ -24,6 +25,9 @@ namespace StreamCompaction {
          */
         __global__ void kernMapToBoolean(int n, int *bools, const int *idata) {
             // TODO
+            int idx = (blockIdx.x * blockDim.x) + threadIdx.x;
+            if (idx >= n) return;
+            // Mark each element 1 if it should be kept (non-zero), 0 otherwise.
+            bools[idx] = idata[idx] == 0 ? 0 : 1;
         }
 
         /**
@@ -33,6 +37,11 @@ namespace StreamCompaction {
         __global__ void kernScatter(int n, int *odata, const int *idata,
                 const int *bools, const int *indices) {
             // TODO
+            int idx = (blockIdx.x * blockDim.x) + threadIdx.x;
+            if (idx >= n) return;
+            // Write each kept element to its scanned destination index.
+            if (bools[idx] == 1) {
+                odata[indices[idx]] = idata[idx];
+            }
         }
 
     }
}
diff --git a/stream_compaction/common.h b/stream_compaction/common.h
index d2c1fed..a83f2d1 100644
--- a/stream_compaction/common.h
+++ b/stream_compaction/common.h
@@ -12,6 +12,7 @@
 #define FILENAME (strrchr(__FILE__, '/') ? strrchr(__FILE__, '/') + 1 : __FILE__)
 #define checkCUDAError(msg) checkCUDAErrorFn(msg, FILENAME, __LINE__)
 
+//#define BLOCK_SIZE 128
 
 /**
  * Check for CUDA errors; print and exit if there was a problem.
diff --git a/stream_compaction/cpu.cu b/stream_compaction/cpu.cu
index 719fa11..b0dd979 100644
--- a/stream_compaction/cpu.cu
+++ b/stream_compaction/cpu.cu
@@ -1,5 +1,6 @@
 #include <cstdio>
 #include "cpu.h"
+#include <iostream>
 
 #include "common.h"
@@ -19,7 +20,11 @@ namespace StreamCompaction {
          */
         void scan(int n, int *odata, const int *idata) {
             timer().startCpuTimer();
-            // TODO
+            // TODO: walk the array once, recording the exclusive prefix sum in odata.
+            odata[0] = 0;
+            for (int i = 1; i < n; i++) {
+                odata[i] = odata[i - 1] + idata[i - 1];
+            }
             timer().endCpuTimer();
         }
 
@@ -30,9 +35,16 @@ namespace StreamCompaction {
          */
         int compactWithoutScan(int n, int *odata, const int *idata) {
             timer().startCpuTimer();
-            // TODO
+            // TODO: copy every non-zero element to odata, preserving order.
+            int count = 0;
+            for (int i = 0; i < n; i++) {
+                if (idata[i] != 0) {
+                    odata[count] = idata[i];
+                    count++;
+                }
+            }
             timer().endCpuTimer();
-            return -1;
+            return count;
         }
 
@@ -41,10 +53,26 @@ namespace StreamCompaction {
          * @returns the number of elements remaining after compaction.
          */
         int compactWithScan(int n, int *odata, const int *idata) {
+            int* scanResult = new int[n];
+            int* bools = new int[n];
             timer().startCpuTimer();
             // TODO
+            // Map to booleans and take their exclusive scan in one pass.
+            int sum = 0;
+            for (int i = 0; i < n; i++) {
+                bools[i] = idata[i] == 0 ? 0 : 1;
+                scanResult[i] = sum;
+                sum += bools[i];
+            }
+            // Scatter: each kept element lands at its scanned index.
+            for (int i = 0; i < n; i++) {
+                if (bools[i] == 1) {
+                    odata[scanResult[i]] = idata[i];
+                }
+            }
             timer().endCpuTimer();
-            return -1;
+            int count = scanResult[n - 1] + bools[n - 1];
+            delete[] scanResult;
+            delete[] bools;
+            return count;
         }
     }
}
diff --git a/stream_compaction/efficient.cu b/stream_compaction/efficient.cu
index 2db346e..4a69475 100644
--- a/stream_compaction/efficient.cu
+++ b/stream_compaction/efficient.cu
@@ -2,6 +2,10 @@
 #include <cuda_runtime.h>
 #include "common.h"
 #include "efficient.h"
+#include "device_launch_parameters.h"
+#include <iostream>
+
+#define BLOCK_SIZE 16
 
 namespace StreamCompaction {
     namespace Efficient {
@@ -12,13 +16,65 @@ namespace StreamCompaction {
             return timer;
         }
 
+        __global__ void efficientScanUpSweep(int n, int nThread, int d, int *data) {
+            int idx = (blockIdx.x * blockDim.x) + threadIdx.x;
+            if (idx >= nThread) return;
+            // Up-sweep: add the left child's partial sum into the right child.
+            int currIdx = ((idx + 1) << (d + 1)) - 1;
+            int prevIdx = currIdx - (1 << d);
+            data[currIdx] += data[prevIdx];
+        }
+
+        __global__ void efficientScanDownSweep(int n, int nThread, int d, int *data) {
+            int idx = (blockIdx.x * blockDim.x) + threadIdx.x;
+            if (idx >= nThread) return;
+            // Down-sweep: push the parent's value left and the combined sum right.
+            int currIdx = (n >> d) * (idx + 1) - 1;
+            int prevIdx = currIdx - (n >> (d + 1));
+            int temp = data[currIdx];
+            data[currIdx] += data[prevIdx];
+            data[prevIdx] = temp;
+        }
+
+        __global__ void replaceWithZero(int n, int nThread, int* data) {
+            int idx = (blockIdx.x * blockDim.x) + threadIdx.x;
+            if (idx >= nThread) return;
+            data[n - 1] = 0;
+        }
+
+        void efficientScanUpDownSweep(int n, int newN, int* dev_idata) {
+            dim3 numBlocks;
+            int nThread = newN;
+            // up-sweep
+            for (int d = 0; d < ilog2ceil(n); d++) {
+                nThread = newN >> (d + 1);
+                numBlocks = (nThread + BLOCK_SIZE - 1) / BLOCK_SIZE;
+                efficientScanUpSweep<<<numBlocks, BLOCK_SIZE>>>(newN, nThread, d, dev_idata);
+            }
+            // replace the last element of the array with 0 before the down sweep
+            replaceWithZero<<<1, 1>>>(newN, 1, dev_idata);
+            // down-sweep
+            for (int d = 0; d < ilog2ceil(n); d++) {
+                nThread = 1 << d;
+                numBlocks = (nThread + BLOCK_SIZE - 1) / BLOCK_SIZE;
+                efficientScanDownSweep<<<numBlocks, BLOCK_SIZE>>>(newN, nThread, d, dev_idata);
+            }
+        }
+
         /**
          * Performs prefix-sum (aka scan) on idata, storing the result into odata.
          */
         void scan(int n, int *odata, const int *idata) {
+            int* dev_idata;
+            int newN = 1 << ilog2ceil(n);
+            cudaMalloc((void**)&dev_idata, sizeof(int) * newN);
+            checkCUDAError("Efficient scan: cudaMalloc failed (dev_idata)");
+            cudaMemcpy(dev_idata, idata, sizeof(int) * n, cudaMemcpyHostToDevice);
             timer().startGpuTimer();
             // TODO
+            efficientScanUpDownSweep(n, newN, dev_idata);
+
             timer().endGpuTimer();
+            cudaMemcpy(odata, dev_idata, sizeof(int) * n, cudaMemcpyDeviceToHost);
+            cudaFree(dev_idata);
         }
 
         /**
@@ -31,10 +87,38 @@ namespace StreamCompaction {
          * @returns The number of elements remaining after compaction.
          */
         int compact(int n, int *odata, const int *idata) {
+            int newN = 1 << ilog2ceil(n);
+            int* dev_idata;
+            cudaMalloc((void**)&dev_idata, sizeof(int) * newN);
+            checkCUDAError("Efficient compact: cudaMalloc failed (dev_idata)");
+            int* dev_bools;
+            cudaMalloc((void**)&dev_bools, sizeof(int) * newN);
+            checkCUDAError("Efficient compact: cudaMalloc failed (dev_bools)");
+            int* dev_indices;
+            cudaMalloc((void**)&dev_indices, sizeof(int) * newN);
+            checkCUDAError("Efficient compact: cudaMalloc failed (dev_indices)");
+            int* dev_odata;
+            cudaMalloc((void**)&dev_odata, sizeof(int) * newN);
+            checkCUDAError("Efficient compact: cudaMalloc failed (dev_odata)");
+
+            cudaMemcpy(dev_idata, idata, sizeof(int) * n, cudaMemcpyHostToDevice);
+            dim3 numBlocks = (n + BLOCK_SIZE - 1) / BLOCK_SIZE;
             timer().startGpuTimer();
             // TODO
+            Common::kernMapToBoolean<<<numBlocks, BLOCK_SIZE>>>(n, dev_bools, dev_idata);
+            cudaMemcpy(dev_indices, dev_bools, sizeof(int) * n, cudaMemcpyDeviceToDevice);
+
+            efficientScanUpDownSweep(n, newN, dev_indices);
+
+            // Scatter into a separate output buffer; scattering into dev_bools while
+            // kernScatter is still reading it would be a data race.
+            Common::kernScatter<<<numBlocks, BLOCK_SIZE>>>(n, dev_odata, dev_idata, dev_bools, dev_indices);
+
             timer().endGpuTimer();
-            return -1;
+            int lastIdx;
+            cudaMemcpy(&lastIdx, dev_indices + n - 1, sizeof(int), cudaMemcpyDeviceToHost);
+            int lastBool;
+            cudaMemcpy(&lastBool, dev_bools + n - 1, sizeof(int), cudaMemcpyDeviceToHost);
+            cudaMemcpy(odata, dev_odata, sizeof(int) * n, cudaMemcpyDeviceToHost);
+            cudaFree(dev_idata);
+            cudaFree(dev_bools);
+            cudaFree(dev_indices);
+            cudaFree(dev_odata);
+            return lastIdx + lastBool;
         }
     }
}
diff --git a/stream_compaction/naive.cu b/stream_compaction/naive.cu
index 4308876..2f48e2f 100644
--- a/stream_compaction/naive.cu
+++ b/stream_compaction/naive.cu
@@ -2,6 +2,9 @@
 #include <cuda_runtime.h>
 #include "common.h"
 #include "naive.h"
+#include "device_launch_parameters.h"
+#include <utility>
+
+#define BLOCK_SIZE 256
 
 namespace StreamCompaction {
     namespace Naive {
@@ -12,14 +15,55 @@ namespace StreamCompaction {
             return timer;
         }
         // TODO: __global__
+        __global__ void naiveParallelScanAtLevelD(int n, int sumIdx, const int *idata, int* odata) {
+            int idx = (blockIdx.x * blockDim.x) + threadIdx.x;
+            if (idx >= n) return;
+            // Each element at idx >= sumIdx pulls in the partial sum from idx - sumIdx.
+            if (idx >= sumIdx) {
+                odata[idx] = idata[idx - sumIdx] + idata[idx];
+            }
+            else {
+                odata[idx] = idata[idx];
+            }
+        }
+
+        __global__ void include2exclude(int n, const int* idata, int* odata) {
+            int idx = (blockIdx.x * blockDim.x) + threadIdx.x;
+            if (idx >= n) return;
+            // Shift the inclusive scan right by one and insert the identity (0) up front.
+            if (idx == 0) {
+                odata[idx] = 0;
+            }
+            else {
+                odata[idx] = idata[idx - 1];
+            }
+        }
 
        /**
         * Performs prefix-sum (aka scan) on idata, storing the result into odata.
        */
        void scan(int n, int *odata, const int *idata) {
+            // declare device buffers
+            int *dev_idata, *dev_odata;
+            // allocate memory on the GPU
+            cudaMalloc((void**)&dev_idata, sizeof(int) * n);
+            checkCUDAError("Naive scan: cudaMalloc failed (dev_idata)");
+            cudaMalloc((void**)&dev_odata, sizeof(int) * n);
+            checkCUDAError("Naive scan: cudaMalloc failed (dev_odata)");
+            // copy idata to the GPU
+            cudaMemcpy(dev_idata, idata, sizeof(int) * n, cudaMemcpyHostToDevice);
+            dim3 numBlocks((n + BLOCK_SIZE - 1) / BLOCK_SIZE);
             timer().startGpuTimer();
-            // TODO
+            // TODO: run the scan passes, ping-ponging between the two buffers
+            for (int d = 1; d <= ilog2ceil(n); d++) {
+                naiveParallelScanAtLevelD<<<numBlocks, BLOCK_SIZE>>>(n, 1 << (d - 1), dev_idata, dev_odata);
+                std::swap(dev_idata, dev_odata);
+            }
+            include2exclude<<<numBlocks, BLOCK_SIZE>>>(n, dev_idata, dev_odata);
             timer().endGpuTimer();
+            // copy odata back to the CPU
+            cudaMemcpy(odata, dev_odata, sizeof(int) * n, cudaMemcpyDeviceToHost);
+            // free device memory
+            cudaFree(dev_idata);
+            cudaFree(dev_odata);
        }
    }
}
diff --git a/stream_compaction/thrust.cu b/stream_compaction/thrust.cu
index 1def45e..63c3e12 100644
--- a/stream_compaction/thrust.cu
+++ b/stream_compaction/thrust.cu
@@ -17,12 +17,16 @@ namespace StreamCompaction {
        /**
         * Performs prefix-sum (aka scan) on idata, storing the result into odata.
         */
-        void scan(int n, int *odata, const int *idata) {
+        void scan(int n, int *odata, const int *idata) {
+            thrust::device_vector<int> dev_idata(idata, idata + n);
+            thrust::device_vector<int> dev_odata(n);
             timer().startGpuTimer();
             // TODO use `thrust::exclusive_scan`
             // example: for device_vectors dv_in and dv_out:
             // thrust::exclusive_scan(dv_in.begin(), dv_in.end(), dv_out.begin());
+            thrust::exclusive_scan(dev_idata.begin(), dev_idata.end(), dev_odata.begin());
             timer().endGpuTimer();
+            thrust::copy(dev_odata.begin(), dev_odata.end(), odata);
        }
    }
}