diff --git a/README.md b/README.md index b71c458..509d799 100644 --- a/README.md +++ b/README.md @@ -3,11 +3,145 @@ CUDA Stream Compaction **University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 2** -* (TODO) YOUR NAME HERE -* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab) +* Bowen Bao +* Tested on: Windows 10, i7-6700K @ 4.00GHz 32GB, GTX 1080 8192MB (Personal Computer) -### (TODO: Your README) +## Overview -Include analysis, etc. (Remember, this is public, so don't put -anything here that you don't want to share with the world.) +Here's the list of features of this project: +1. CPU Scan and Stream Compaction +2. Naive GPU Scan +3. Efficient GPU Scan and Stream Compaction +4. Thrust Scan +5. Optimize efficient GPU Scan +6. Radix Sort based on GPU Scan +7. Benchmark suite + +## Instruction to Run + +I made a few changes to the function headers to add more flexible benchmarking capabilities, such as able to return process times, change block size without re-compile, etc. The only change that is visible to the user is that they need to pass in a double parameter as reference to be able to receive the logged process time. + +I added a benchmark suite for testing the run time of each implementation under different parameter settings. Also, I inserted a few tests for radix sort into the original main function. + +## Performance Analysis +### Performance of different implementation + +![](/image/process_time.png) + +Here's the test result for each of the methods. The tests are run with the block size of 256(which is decided as near optimal after testing on numerous values). For each methods, I ran 100 independent tests, and calculated their average process time. + +We can observe indeed that the GPU version of scan has a better performance than CPU scan. + +### Performance of GPU methods under different block size + +![](/image/process_time_blocksize.png) + +The tests are run with the stream length of 2^24, each method is tested 100 times and recorded the average. Observe that the performance starts to decrease after blocksize getting over 256. + +## Extra Credits +### Improving GPU Scan +See part 3 in Questions. + +### Radix Sort +I followed the algorithm in the slides, and implemented a radix sort method based on the GPU Scan function. One interesting note is that when checking bits of the numbers, numbers with 1 on the first bit are actually smaller than those with 0, as on these occasions they turned out to be negative, which is the reverse case against situations on other bits. I tested my radix sort function with a special hand crafted case containing negative numbers, and with a random large test case. + +## Questions +* Roughly optimize the block sizes of each of your implementations for minimal + run time on your GPU. + * (You shouldn't compare unoptimized implementations to each other!) + +See Performance Analysis. + +* Compare all of these GPU Scan implementations (Naive, Work-Efficient, and + Thrust) to the serial CPU version of Scan. Plot a graph of the comparison + (with array size on the independent axis). + * You should use CUDA events for timing GPU code. Be sure **not** to include + any *initial/final* memory operations (`cudaMalloc`, `cudaMemcpy`) in your + performance measurements, for comparability. Note that CUDA events cannot + time CPU code. + * You can use the C++11 `std::chrono` API for timing CPU code. See this + [Stack Overflow answer](http://stackoverflow.com/a/23000049) for an example. + Note that `std::chrono` may not provide high-precision timing. If it does + not, you can either use it to time many iterations, or use another method. + * To guess at what might be happening inside the Thrust implementation (e.g. + allocation, memory copy), take a look at the Nsight timeline for its + execution. Your analysis here doesn't have to be detailed, since you aren't + even looking at the code for the implementation. + +See Performance Analysis. + +* Write a brief explanation of the phenomena you see here. + * Can you find the performance bottlenecks? Is it memory I/O? Computation? Is + it different for each implementation? + +One problem with "naive" efficient GPU scan is that there are too many threads wasted(after being checked that their index mod interval is not zero). One way of improving this is to assign the index as the divided result of the original index by the interval, and compute back the actual index later in that thread. With this improvement, we can save a lot of useless works done by threads, and note that waste grows exponentially with the number of elements in stream in the original implementation. + +* Paste the output of the test program into a triple-backtick block in your + README. + * If you add your own tests (e.g. for radix sort or to test additional corner + cases), be sure to mention it explicitly. + +See Output. + +## Output + + **************** + ** SCAN TESTS ** + **************** + [ 38 19 38 37 5 47 15 35 0 12 3 0 42 ... 35 0 ] + ==== cpu scan, power-of-two ==== + [ 0 38 57 95 132 137 184 199 234 234 246 249 249 ... 1604374 1604409 ] + ==== cpu scan, non-power-of-two ==== + [ 0 38 57 95 132 137 184 199 234 234 246 249 249 ... 1604305 1604316 ] + passed + ==== naive scan, power-of-two ==== + passed + ==== naive scan, non-power-of-two ==== + passed + ==== work-efficient scan, power-of-two ==== + passed + ==== work-efficient scan, non-power-of-two ==== + passed + ==== thrust scan, power-of-two ==== + passed + ==== thrust scan, non-power-of-two ==== + passed + + ***************************** + ** STREAM COMPACTION TESTS ** + ***************************** + [ 2 3 2 1 3 1 1 1 2 0 1 0 2 ... 1 0 ] + ==== cpu compact without scan, power-of-two ==== + [ 2 3 2 1 3 1 1 1 2 1 2 1 1 ... 1 1 ] + passed + ==== cpu compact without scan, non-power-of-two ==== + [ 2 3 2 1 3 1 1 1 2 1 2 1 1 ... 3 1 ] + passed + ==== cpu compact with scan ==== + [ 2 3 2 1 3 1 1 1 2 1 2 1 1 ... 1 1 ] + passed + ==== work-efficient compact, power-of-two ==== + passed + ==== work-efficient compact, non-power-of-two ==== + passed + ==== work-efficient compact, power-of-two, last non-zero ==== + passed + ==== work-efficient compact, power-of-two, last zero ==== + passed + ==== work-efficient compact, test on special case 1 ==== + passed + ==== work-efficient compact, test on special case 2 ==== + passed + ==== cpu compact without scan, test on special case 1 ==== + passed + ==== radix sort, test on special case ==== + [ 0 5 -2 6 3 7 -5 2 7 1 ] + sorted: + [ -5 -2 0 1 2 3 5 6 7 7 ] + passed + ==== radix sort, test ==== + [ 38 7719 1238 2437 8855 1797 8365 2285 450 612 5853 8100 1142 ... 5085 6505 ] + sorted: + [ 0 0 0 0 0 0 0 1 1 1 1 1 1 ... 9999 9999 ] + passed \ No newline at end of file diff --git a/image/process_time.png b/image/process_time.png new file mode 100644 index 0000000..07efe2b Binary files /dev/null and b/image/process_time.png differ diff --git a/image/process_time_blocksize.png b/image/process_time_blocksize.png new file mode 100644 index 0000000..da3975d Binary files /dev/null and b/image/process_time_blocksize.png differ diff --git a/src/main.cpp b/src/main.cpp index 675da35..7b111ec 100644 --- a/src/main.cpp +++ b/src/main.cpp @@ -12,11 +12,132 @@ #include #include #include "testing_helpers.hpp" +#include +#include + + +void bench_mark() +{ + int run_times = 100; + const int block_choice = 5; + const int input_choice = 4; + const int method_choice = 7; + + int blockSizes[] = { 32, 128, 256, 512, 1024 }; + int inputSizes[] = { 8, 16, 20, 24 }; + std::string methods[] = { + "cpu scan", + "naive scan", + "eff scan", + "thrust scan", + "cpu compact no scan", + "cpu compact", + "gpu compact" + }; + + double result[method_choice * block_choice * input_choice]; + + for (int i = 0; i < run_times; ++i) + { + printf("=========Running %d round ... \n", i+1); + for (int j = 0; j < block_choice; ++j) + { + printf("==================Running on blockSize %d ... \n", blockSizes[j]); + for (int k = 0; k < input_choice; ++k) + { + printf("=============================Running on inputSize 2^%d ... \n", inputSizes[k]); + int idx; + // generate input + int SIZE = 1 << inputSizes[k]; + int *a = new int[SIZE]; + int *b = new int[SIZE]; + + genArray(SIZE - 1, a, 1000); + int cur_method = 0; + + // cpu scan + idx = block_choice * input_choice * (cur_method++) + j * input_choice + k; + printf("==test on pos %d \n", idx); + zeroArray(SIZE, b); + result[idx] += StreamCompaction::CPU::scan(SIZE, b, a); + + // naive scan + idx = block_choice * input_choice * (cur_method++) + j * input_choice + k; + printf("==test on pos %d \n", idx); + zeroArray(SIZE, b); + result[idx] + += StreamCompaction::Naive::scan(SIZE, b, a, blockSizes[j]); + + // work-efficient scan + idx = block_choice * input_choice * (cur_method++) + j * input_choice + k; + printf("==test on pos %d \n", idx); + zeroArray(SIZE, b); + result[idx] + += StreamCompaction::Efficient::scan(SIZE, b, a, blockSizes[j]); + + // thrust scan + idx = block_choice * input_choice * (cur_method++) + j * input_choice + k; + printf("==test on pos %d \n", idx); + zeroArray(SIZE, b); + result[idx] + += StreamCompaction::Thrust::scan(SIZE, b, a); + + // cpu compact no scan + idx = block_choice * input_choice * (cur_method++) + j * input_choice + k; + printf("==test on pos %d \n", idx); + zeroArray(SIZE, b); + double time; + StreamCompaction::CPU::compactWithoutScan(SIZE, b, a, time); + result[idx] += time; + + // cpu compact + idx = block_choice * input_choice * (cur_method++) + j * input_choice + k; + printf("==test on pos %d \n", idx); + zeroArray(SIZE, b); + StreamCompaction::CPU::compactWithScan(SIZE, b, a, time); + result[idx] += time; + + // gpu compact + idx = block_choice * input_choice * (cur_method++) + j * input_choice + k; + printf("test on pos %d \n", idx); + zeroArray(SIZE, b); + StreamCompaction::Efficient::compact(SIZE, b, a, time); + result[idx] += time; + + delete[] a; + delete[] b; + } + } + } + + // print result + printf("===================== RESULTS ========================\n"); + for (int j = 0; j < block_choice; ++j) + { + printf("======= block size %d ===========\n", blockSizes[j]); + + for (int i = 0; i < method_choice; ++i) + { + printf("==== method %s ==== ", methods[i].c_str()); + for (int k = 0; k < input_choice; ++k) + { + printf(" %d input %f time ", inputSizes[k], result[block_choice * input_choice * i + j * input_choice + k] / run_times); + } + printf("\n"); + } + + printf("=====================================\n"); + } +} + int main(int argc, char* argv[]) { - const int SIZE = 1 << 8; + const int SIZE = 1 << 16; const int NPOT = SIZE - 3; - int a[SIZE], b[SIZE], c[SIZE]; + //int a[SIZE], b[SIZE], c[SIZE]; + int *a = new int[SIZE]; + int *b = new int[SIZE]; + int *c = new int[SIZE]; // Scan tests @@ -89,35 +210,97 @@ int main(int argc, char* argv[]) { int count, expectedCount, expectedNPOT; + double time; + zeroArray(SIZE, b); printDesc("cpu compact without scan, power-of-two"); - count = StreamCompaction::CPU::compactWithoutScan(SIZE, b, a); + count = StreamCompaction::CPU::compactWithoutScan(SIZE, b, a, time); expectedCount = count; printArray(count, b, true); printCmpLenResult(count, expectedCount, b, b); zeroArray(SIZE, c); printDesc("cpu compact without scan, non-power-of-two"); - count = StreamCompaction::CPU::compactWithoutScan(NPOT, c, a); + count = StreamCompaction::CPU::compactWithoutScan(NPOT, c, a, time); expectedNPOT = count; printArray(count, c, true); printCmpLenResult(count, expectedNPOT, b, c); zeroArray(SIZE, c); printDesc("cpu compact with scan"); - count = StreamCompaction::CPU::compactWithScan(SIZE, c, a); + count = StreamCompaction::CPU::compactWithScan(SIZE, c, a, time); printArray(count, c, true); printCmpLenResult(count, expectedCount, b, c); zeroArray(SIZE, c); printDesc("work-efficient compact, power-of-two"); - count = StreamCompaction::Efficient::compact(SIZE, c, a); + count = StreamCompaction::Efficient::compact(SIZE, c, a, time); //printArray(count, c, true); printCmpLenResult(count, expectedCount, b, c); zeroArray(SIZE, c); printDesc("work-efficient compact, non-power-of-two"); - count = StreamCompaction::Efficient::compact(NPOT, c, a); + count = StreamCompaction::Efficient::compact(NPOT, c, a, time); //printArray(count, c, true); printCmpLenResult(count, expectedNPOT, b, c); + + zeroArray(SIZE, c); + a[SIZE - 1] = 5; + printDesc("work-efficient compact, power-of-two, last non-zero"); + count = StreamCompaction::Efficient::compact(SIZE, c, a, time); + int *bb = new int[SIZE]; + int cpuCount = StreamCompaction::CPU::compactWithoutScan(SIZE, bb, a, time); + //printArray(count, c, true); + printCmpLenResult(count, cpuCount, bb, c); + + zeroArray(SIZE, c); + a[SIZE - 1] = 0; + printDesc("work-efficient compact, power-of-two, last zero"); + count = StreamCompaction::Efficient::compact(SIZE, c, a, time); + //printArray(count, c, true); + printCmpLenResult(count, expectedCount, b, c); + + printDesc("work-efficient compact, test on special case 1"); + int test[5] = { 1, 0, 1, 0, 1 }; + count = StreamCompaction::Efficient::compact(5, c, test, time); + printCmpLenResult(count, 3, c, c); + + printDesc("work-efficient compact, test on special case 2"); + int test1[5] = { 1, 0, 1, 0, 0 }; + count = StreamCompaction::Efficient::compact(5, c, test1, time); + printCmpLenResult(count, 2, c, c); + + printDesc("cpu compact without scan, test on special case 1"); + count = StreamCompaction::CPU::compactWithoutScan(5, c, test, time); + printCmpLenResult(count, 3, c, c); + + + + //bench_mark(); + + int testArr[] = { 0, 5, -2, 6, 3, 7, -5, 2, 7, 1 }; + int resultArr[10]; + int goalArr[] = { -5, -2, 0, 1, 2, 3, 5, 6, 7, 7 }; + + StreamCompaction::Efficient::radix_sort(10, resultArr, testArr); + printDesc("radix sort, test on special case"); + printArray(10, testArr, true); + printf(" sorted:\n"); + printArray(10, resultArr, true); + printCmpResult(10, goalArr, resultArr); + + genArray(SIZE, a, 10000); + StreamCompaction::Efficient::radix_sort(SIZE, b, a); + printDesc("radix sort, test"); + printArray(SIZE, a, true); + printf(" sorted:\n"); + printArray(SIZE, b, true); + std::sort(a, a + SIZE); + printCmpResult(SIZE, a, b); + + + delete[] a; + delete[] b; + delete[] c; + delete[] bb; } diff --git a/stat.xlsx b/stat.xlsx new file mode 100644 index 0000000..b838afa Binary files /dev/null and b/stat.xlsx differ diff --git a/stream_compaction/common.cu b/stream_compaction/common.cu index fe872d4..23174dc 100644 --- a/stream_compaction/common.cu +++ b/stream_compaction/common.cu @@ -24,6 +24,10 @@ namespace Common { */ __global__ void kernMapToBoolean(int n, int *bools, const int *idata) { // TODO + int index = (blockIdx.x * blockDim.x) + threadIdx.x; + if (index >= n) return; + if (idata[index] != 0) bools[index] = 1; + else bools[index] = 0; } /** @@ -33,6 +37,12 @@ __global__ void kernMapToBoolean(int n, int *bools, const int *idata) { __global__ void kernScatter(int n, int *odata, const int *idata, const int *bools, const int *indices) { // TODO + int index = (blockIdx.x * blockDim.x) + threadIdx.x; + if (index >= n) return; + if (bools[index] == 1) + { + odata[indices[index]] = idata[index]; + } } } diff --git a/stream_compaction/cpu.cu b/stream_compaction/cpu.cu index e600c29..e285031 100644 --- a/stream_compaction/cpu.cu +++ b/stream_compaction/cpu.cu @@ -1,5 +1,6 @@ #include #include "cpu.h" +#include "chrono" namespace StreamCompaction { namespace CPU { @@ -7,9 +8,22 @@ namespace CPU { /** * CPU scan (prefix sum). */ -void scan(int n, int *odata, const int *idata) { - // TODO - printf("TODO\n"); +double scan(int n, int *odata, const int *idata) { + // record time + auto start = std::chrono::system_clock::now(); + + // TODO : finished + if (n <= 0) return -1; + odata[0] = 0; + for (int i = 1; i < n; ++i) + { + odata[i] = odata[i - 1] + idata[i-1]; + } + + std::chrono::duration diff = (std::chrono::system_clock::now() - start); + //printf("CPU scan took %fms\n", diff.count()); + + return diff.count(); } /** @@ -17,9 +31,24 @@ void scan(int n, int *odata, const int *idata) { * * @returns the number of elements remaining after compaction. */ -int compactWithoutScan(int n, int *odata, const int *idata) { - // TODO - return -1; +int compactWithoutScan(int n, int *odata, const int *idata, double &time) { + // TODO : finished + // record time + auto start = std::chrono::system_clock::now(); + + int num_remain = 0; + for (int i = 0; i < n; ++i) + { + if (idata[i] != 0) + { + odata[num_remain++] = idata[i]; + } + } + + std::chrono::duration diff = (std::chrono::system_clock::now() - start); + //printf("CPU compact without scan took %fms\n", diff.count()); + time = diff.count(); + return num_remain; } /** @@ -27,9 +56,39 @@ int compactWithoutScan(int n, int *odata, const int *idata) { * * @returns the number of elements remaining after compaction. */ -int compactWithScan(int n, int *odata, const int *idata) { - // TODO - return -1; +int compactWithScan(int n, int *odata, const int *idata, double &time) { + // TODO : finished + // record time + auto start = std::chrono::system_clock::now(); + + // map data to 1 and 0 for non-zero and zero. + int *tmp_data = new int[n]; + for (int i = 0; i < n; ++i) + { + if (idata[i] == 0) tmp_data[i] = 0; + else tmp_data[i] = 1; + } + + // scan + scan(n, odata, tmp_data); + + // scatter + int num_remain = 0; + for (int i = 0; i < n; ++i) + { + if (tmp_data[i] == 1) + { + odata[odata[i]] = idata[i]; + num_remain++; + } + } + + delete[] tmp_data; + + std::chrono::duration diff = (std::chrono::system_clock::now() - start); + //printf("CPU compact with scan took %fms\n", diff.count()); + time = diff.count(); + return num_remain; } } diff --git a/stream_compaction/cpu.h b/stream_compaction/cpu.h index 6348bf3..1c0880b 100644 --- a/stream_compaction/cpu.h +++ b/stream_compaction/cpu.h @@ -2,10 +2,10 @@ namespace StreamCompaction { namespace CPU { - void scan(int n, int *odata, const int *idata); + double scan(int n, int *odata, const int *idata); - int compactWithoutScan(int n, int *odata, const int *idata); + int compactWithoutScan(int n, int *odata, const int *idata, double &time); - int compactWithScan(int n, int *odata, const int *idata); + int compactWithScan(int n, int *odata, const int *idata, double &time); } } diff --git a/stream_compaction/efficient.cu b/stream_compaction/efficient.cu index b2f739b..d665f0c 100644 --- a/stream_compaction/efficient.cu +++ b/stream_compaction/efficient.cu @@ -2,33 +2,266 @@ #include #include "common.h" #include "efficient.h" +#include namespace StreamCompaction { -namespace Efficient { + namespace Efficient { -// TODO: __global__ + // TODO: __global__ + __global__ void kernScanUpSweep(int N, int interval, int *data) + { + // up sweep + int index = (blockIdx.x * blockDim.x) + threadIdx.x; + int real_index = index * interval * 2; + if (real_index >= N) return; + int cur_index = real_index + 2 * interval - 1; + int last_index = real_index + interval - 1; + if (cur_index >= N) return; -/** - * Performs prefix-sum (aka scan) on idata, storing the result into odata. - */ -void scan(int n, int *odata, const int *idata) { - // TODO - printf("TODO\n"); -} + data[cur_index] = data[last_index] + data[cur_index]; + } -/** - * Performs stream compaction on idata, storing the result into odata. - * All zeroes are discarded. - * - * @param n The number of elements in idata. - * @param odata The array into which to store elements. - * @param idata The array of elements to compact. - * @returns The number of elements remaining after compaction. - */ -int compact(int n, int *odata, const int *idata) { - // TODO - return -1; -} + __global__ void kernScanDownSweep(int N, int interval, int *data) + { + // down seep + int index = (blockIdx.x * blockDim.x) + threadIdx.x; + int real_index = index * interval * 2; + if (real_index >= N) return; + int last_index = real_index + interval - 1; + int cur_index = real_index + 2 * interval - 1; + if (cur_index >= N) return; + int tmp = data[last_index]; + data[last_index] = data[cur_index]; + data[cur_index] += tmp; + } -} + __global__ void kernMapDigitToBoolean(int N, int digit, int *odata, const int *idata) + { + int index = (blockIdx.x * blockDim.x) + threadIdx.x; + if (index >= N) return; + int mask = 1 << digit; + if ((idata[index] & mask) == 0) + { + if (digit != 31) odata[index] = 0; + else odata[index] = 1; + } + else + { + if (digit != 31) odata[index] = 1; + else odata[index] = 0; + } + } + + __global__ void kernFlipBoolean(int N, int *odata, const int *idata) + { + int index = (blockIdx.x * blockDim.x) + threadIdx.x; + if (index >= N) return; + if (idata[index] == 0) + { + odata[index] = 1; + } + else + { + odata[index] = 0; + } + } + + __global__ void kernSortOneRound(int N, int *bools, int *indices_zero, int *indices_one, int maxFalse, + int *odata, const int *idata) + { + int index = (blockIdx.x * blockDim.x) + threadIdx.x; + if (index >= N) return; + if (bools[index] == 0) + { + // false; + odata[indices_zero[index]] = idata[index]; + } + else + { + odata[indices_one[index] + maxFalse] = idata[index]; + } + } + + void radix_sort(int n, int *odata, const int *idata, int blockSize) + { + dim3 fullBlocksPerGrid((n + blockSize - 1) / blockSize); + + int *cuda_idata, *cuda_bools_one, *cuda_bools_zero, + *cuda_indices_one, *cuda_indices_zero, *cuda_odata; + int *bools = new int[n]; + int *indices = new int[n]; + + cudaMalloc((void**)&cuda_idata, n * sizeof(int)); + cudaMalloc((void**)&cuda_bools_one, n * sizeof(int)); + cudaMalloc((void**)&cuda_bools_zero, n * sizeof(int)); + cudaMalloc((void**)&cuda_odata, n * sizeof(int)); + cudaMalloc((void**)&cuda_indices_one, n * sizeof(int)); + cudaMalloc((void**)&cuda_indices_zero, n * sizeof(int)); + + cudaMemcpy(cuda_idata, idata, n*sizeof(int), cudaMemcpyHostToDevice); + + for (int i = 0; i < 32; ++i) + { + kernMapDigitToBoolean << > >(n, i, cuda_bools_one, cuda_idata); + cudaMemcpy(bools, cuda_bools_one, n * sizeof(int), cudaMemcpyDeviceToHost); + scan(n, indices, bools); + cudaMemcpy(cuda_indices_one, indices, n * sizeof(int), cudaMemcpyHostToDevice); + + kernFlipBoolean << > >(n, cuda_bools_zero, cuda_bools_one); + cudaMemcpy(bools, cuda_bools_zero, n * sizeof(int), cudaMemcpyDeviceToHost); + scan(n, indices, bools); + cudaMemcpy(cuda_indices_zero, indices, n * sizeof(int), cudaMemcpyHostToDevice); + + int totalFalse; + cudaMemcpy(&totalFalse, &cuda_indices_zero[n - 1], sizeof(int), cudaMemcpyDeviceToHost); + totalFalse += bools[n - 1]; + + kernSortOneRound << > >(n, cuda_bools_one, cuda_indices_zero, cuda_indices_one, totalFalse, cuda_odata, cuda_idata); + + int *tmp = cuda_idata; + cuda_idata = cuda_odata; + cuda_odata = tmp; + } + + cudaMemcpy(odata, cuda_idata, n * sizeof(int), cudaMemcpyDeviceToHost); + + cudaFree(cuda_idata); + cudaFree(cuda_bools_one); + cudaFree(cuda_bools_zero); + cudaFree(cuda_odata); + cudaFree(cuda_indices_one); + cudaFree(cuda_indices_zero); + } + + /** + * Performs prefix-sum (aka scan) on idata, storing the result into odata. + */ + float scan(int n, int *odata, const int *idata, int blockSize) { + // record time + float diff(0); + cudaEvent_t start, end; + cudaEventCreate(&start); + cudaEventCreate(&end); + cudaEventRecord(start, 0); + + int loop_times = ilog2ceil(n); + int totalNum = 1; + for (int i = 0; i < loop_times; ++i) + { + totalNum *= 2; + } + int interval = 1; + //printf("total looptimes: %d, total num %d\n", loop_times, totalNum); + + int *tmp_data; + cudaMalloc((void**)&tmp_data, totalNum * sizeof(int)); + cudaMemset(tmp_data, 0, totalNum); + cudaMemcpy(tmp_data, idata, n * sizeof(int), cudaMemcpyHostToDevice); + + + // up sweep + for (int i = 0; i < loop_times; ++i) + { + dim3 fullBlocksPerGrid((totalNum / (interval * 2) + blockSize - 1) / blockSize); + kernScanUpSweep << > >(totalNum, interval, tmp_data); + interval *= 2; + } + + // down sweep + cudaMemset(&tmp_data[totalNum - 1], 0, sizeof(int)); + + for (int i = 0; i < loop_times; ++i) + { + dim3 fullBlocksPerGrid((totalNum / interval + blockSize - 1) / blockSize); + interval /= 2; + kernScanDownSweep << > >(totalNum, interval, tmp_data); + } + + cudaMemcpy(odata, tmp_data, n*sizeof(int), cudaMemcpyDeviceToHost); + cudaFree(tmp_data); + + cudaEventRecord(end, 0); + cudaEventSynchronize(start); + cudaEventSynchronize(end); + cudaEventElapsedTime(&diff, start, end); + + //printf("GPU scan took %fms\n", diff); + return diff; + } + + /** + * Performs stream compaction on idata, storing the result into odata. + * All zeroes are discarded. + * + * @param n The number of elements in idata. + * @param odata The array into which to store elements. + * @param idata The array of elements to compact. + * @returns The number of elements remaining after compaction. + */ + int compact(int n, int *odata, const int *idata, double &time, int blockSize) { + // TODO + // record time + float diff(0); + cudaEvent_t start, end; + cudaEventCreate(&start); + cudaEventCreate(&end); + cudaEventRecord(start, 0); + + dim3 fullBlocksPerGrid((n + blockSize - 1) / blockSize); + + int *indices_cuda; + int *bools_cuda; + int *idata_cuda; + int *odata_cuda; + + int *indices = new int[n]; + int *bools = new int[n]; + + cudaMalloc((void**)&indices_cuda, n * sizeof(int)); + cudaMalloc((void**)&bools_cuda, n * sizeof(int)); + cudaMalloc((void**)&idata_cuda, n * sizeof(int)); + cudaMalloc((void**)&odata_cuda, n * sizeof(int)); + + cudaMemcpy(idata_cuda, idata, n * sizeof(int), cudaMemcpyHostToDevice); + + Common::kernMapToBoolean<<>>(n, bools_cuda, idata_cuda); + + cudaMemcpy(bools, bools_cuda, n * sizeof(int), cudaMemcpyDeviceToHost); + + scan(n, indices, bools); + + cudaMemcpy(indices_cuda, indices, n * sizeof(int), cudaMemcpyHostToDevice); + + Common::kernScatter << > >(n, odata_cuda, idata_cuda, bools_cuda, indices_cuda); + + int remain_elem; + cudaMemcpy(&remain_elem, &indices_cuda[n - 1], sizeof(int), cudaMemcpyDeviceToHost); + remain_elem += bools[n - 1]; + //for (int i = 0; i < n; ++i) + //{ + // if (bools[i] == 1) remain_elem++; + //} + + cudaMemcpy(odata, odata_cuda, remain_elem * sizeof(int), cudaMemcpyDeviceToHost); + + delete[] bools; + delete[] indices; + + cudaFree(indices_cuda); + cudaFree(bools_cuda); + cudaFree(idata_cuda); + cudaFree(odata_cuda); + + cudaEventRecord(end, 0); + cudaEventSynchronize(start); + cudaEventSynchronize(end); + cudaEventElapsedTime(&diff, start, end); + + //printf("GPU compact took %fms\n", diff); + + time = diff; + return remain_elem; + } + + } } diff --git a/stream_compaction/efficient.h b/stream_compaction/efficient.h index 395ba10..c3ce7ae 100644 --- a/stream_compaction/efficient.h +++ b/stream_compaction/efficient.h @@ -2,8 +2,10 @@ namespace StreamCompaction { namespace Efficient { - void scan(int n, int *odata, const int *idata); + float scan(int n, int *odata, const int *idata, int blockSize = 128); - int compact(int n, int *odata, const int *idata); + int compact(int n, int *odata, const int *idata, double &time, int blockSize = 128); + + void radix_sort(int n, int *odata, const int *idata, int blockSize = 128); } } diff --git a/stream_compaction/naive.cu b/stream_compaction/naive.cu index 3d86b60..c8966dc 100644 --- a/stream_compaction/naive.cu +++ b/stream_compaction/naive.cu @@ -4,17 +4,67 @@ #include "naive.h" namespace StreamCompaction { -namespace Naive { + namespace Naive { -// TODO: __global__ + // TODO: __global__ : finished -/** - * Performs prefix-sum (aka scan) on idata, storing the result into odata. - */ -void scan(int n, int *odata, const int *idata) { - // TODO - printf("TODO\n"); -} + __global__ void kernScan(int N, int start_idx, int *odata, const int *idata) + { + int index = (blockIdx.x * blockDim.x) + threadIdx.x; + if (index >= N) return; + if (index >= start_idx) + { + odata[index] = idata[index - start_idx] + idata[index]; + } + else + { + odata[index] = idata[index]; + } + } -} + /** + * Performs prefix-sum (aka scan) on idata, storing the result into odata. + */ + float scan(int n, int *odata, const int *idata, int blockSize) { + // TODO : finished + // record time + float diff(0); + cudaEvent_t start, end; + cudaEventCreate(&start); + cudaEventCreate(&end); + cudaEventRecord(start, 0); + + dim3 fullBlocksPerGrid((n + blockSize - 1) / blockSize); + + int *tmp_data, *tmp_data2; + cudaMalloc((void**)&tmp_data, n * sizeof(int)); + cudaMalloc((void**)&tmp_data2, n * sizeof(int)); + cudaMemset(tmp_data2, 0, n * sizeof(int)); + cudaMemset(tmp_data, 0, n * sizeof(int)); + cudaMemcpy(tmp_data+1, idata, (n-1) * sizeof(int), cudaMemcpyHostToDevice); + int loop_times = ilog2ceil(n); + int start_idx = 1; + for (int i = 0; i < loop_times; ++i) + { + kernScan<<>>(n, start_idx, tmp_data2, tmp_data); + int *tmp_pt = tmp_data; + tmp_data = tmp_data2; + tmp_data2 = tmp_pt; + start_idx *= 2; + } + + cudaMemcpy(odata, tmp_data, n * sizeof(int), cudaMemcpyDeviceToHost); + cudaFree(tmp_data); + cudaFree(tmp_data2); + + cudaEventRecord(end, 0); + cudaEventSynchronize(start); + cudaEventSynchronize(end); + cudaEventElapsedTime(&diff, start, end); + + //printf("GPU naive scan took %fms\n", diff); + return diff; + } + + } } diff --git a/stream_compaction/naive.h b/stream_compaction/naive.h index 21152d6..1a4d276 100644 --- a/stream_compaction/naive.h +++ b/stream_compaction/naive.h @@ -2,6 +2,6 @@ namespace StreamCompaction { namespace Naive { - void scan(int n, int *odata, const int *idata); + float scan(int n, int *odata, const int *idata, int blockSize=128); } } diff --git a/stream_compaction/thrust.cu b/stream_compaction/thrust.cu index d8dbb32..aaff222 100644 --- a/stream_compaction/thrust.cu +++ b/stream_compaction/thrust.cu @@ -5,6 +5,7 @@ #include #include "common.h" #include "thrust.h" +#include namespace StreamCompaction { namespace Thrust { @@ -12,10 +13,30 @@ namespace Thrust { /** * Performs prefix-sum (aka scan) on idata, storing the result into odata. */ -void scan(int n, int *odata, const int *idata) { +double scan(int n, int *odata, const int *idata) { // TODO use `thrust::exclusive_scan` // example: for device_vectors dv_in and dv_out: // thrust::exclusive_scan(dv_in.begin(), dv_in.end(), dv_out.begin()); + thrust::host_vector hv_in(n); + + for (int i = 0; i < n; ++i) + { + hv_in[i] = idata[i]; + } + + thrust::device_vector dv_in(hv_in), dv_out(n); + + // record time + auto start = std::chrono::system_clock::now(); + + thrust::exclusive_scan(dv_in.begin(), dv_in.end(), dv_out.begin()); + + std::chrono::duration diff = (std::chrono::system_clock::now() - start); + //printf("Thrust scan took %fms\n", diff.count()); + + thrust::copy(dv_out.begin(), dv_out.end(), odata); + + return diff.count(); } } diff --git a/stream_compaction/thrust.h b/stream_compaction/thrust.h index 06707f3..07e50a1 100644 --- a/stream_compaction/thrust.h +++ b/stream_compaction/thrust.h @@ -2,6 +2,6 @@ namespace StreamCompaction { namespace Thrust { - void scan(int n, int *odata, const int *idata); + double scan(int n, int *odata, const int *idata); } }