diff --git a/README.md b/README.md index 0e38ddb..ee30f79 100644 --- a/README.md +++ b/README.md @@ -3,12 +3,97 @@ CUDA Stream Compaction **University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 2** -* (TODO) YOUR NAME HERE - * (TODO) [LinkedIn](), [personal website](), [twitter](), etc. -* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab) +* Yian Chen + * [LinkedIn](https://www.linkedin.com/in/yian-chen-33a31a1a8/), [personal website](https://sydianandrewchen.github.io/), etc. +* Tested on: Windows 10, AMD Ryzen 5800HS with Radeon Graphics CPU @ 3.20GHz 16GB, NVIDIA GeForce RTX 3060 Laptop 8GB -### (TODO: Your README) +### Questions -Include analysis, etc. (Remember, this is public, so don't put -anything here that you don't want to share with the world.) +* Compare all of these GPU Scan implementations (Naive, Work-Efficient, and + Thrust) to the serial CPU version of Scan. Plot a graph of the comparison + (with array size on the independent axis). + ![](img/scan-power-of-two.png) + ![](img/scan-non-power-of-two.png) + * To guess at what might be happening inside the Thrust implementation (e.g. + allocation, memory copy), take a look at the Nsight timeline for its + execution. Your analysis here doesn't have to be detailed, since you aren't + even looking at the code for the implementation. + - Timeline of `thrust::exclusive_scan`: + ![](img/image.png) + From the timeline we can see that three scan kernels are launched together and finish together. + My guess is that `thrust::exclusive_scan` implements an algorithm that runs the up-sweep, down-sweep, and shift phases together. + +* Write a brief explanation of the phenomena you see here. + - As the data size grows exponentially, the runtimes of the CPU scan and the GPU naive scan increase the fastest. + - The CPU scan can even beat the GPU naive scan when the data size is large. 
This is probably because the CPU scan fully exploits locality, whereas the GPU naive scan incurs a large number of global memory accesses. + - The runtimes of the GPU efficient scan and the `thrust` scan grow much more slowly. `thrust` performs the best. +* Can you find the performance bottlenecks? Is it memory I/O? Computation? Is it different for each implementation? + The performance bottlenecks differ with the design of each implementation. + - For the CPU method, the bottleneck is the lack of parallelism compared with the GPU methods. + - For Naive Scan, the bottlenecks appear because of: + - Double buffering. On the positive side, this trick saves a large amount of time copying memory. However, the two large buffers cause a steady stream of global memory accesses in every thread. + - Time complexity. + - For Efficient Scan: + - Redundant thread launches (settled). + - Shared memory. + + +* Paste the output of the test program into a triple-backtick block in your + README. + - Output of the scan (`SIZE = 1 << 26`) +``` +**************** +** SCAN TESTS ** +**************** + [ 6 12 22 4 21 5 23 32 32 36 46 37 34 ... 47 0 ] +==== cpu scan, power-of-two ==== + elapsed time: 48.9518ms (std::chrono Measured) + [ 0 6 18 40 44 65 70 93 125 157 193 239 276 ... 1643792958 1643793005 ] +==== cpu scan, non-power-of-two ==== + elapsed time: 45.263ms (std::chrono Measured) + [ 0 6 18 40 44 65 70 93 125 157 193 239 276 ... 1643792920 1643792939 ] + passed +==== naive scan, power-of-two ==== + elapsed time: 63.2586ms (CUDA Measured) + [ 0 6 18 40 44 65 70 93 125 157 193 239 276 ... 
1643792958 1643793005 ] + passed +==== naive scan, non-power-of-two ==== + elapsed time: 62.9893ms (CUDA Measured) + passed +==== work-efficient scan, power-of-two ==== + elapsed time: 24.2463ms (CUDA Measured) + passed +==== work-efficient scan, non-power-of-two ==== + elapsed time: 22.6304ms (CUDA Measured) + passed +==== thrust scan, power-of-two ==== + elapsed time: 2.46502ms (CUDA Measured) + passed +==== thrust scan, non-power-of-two ==== + elapsed time: 2.53338ms (CUDA Measured) + passed + +***************************** +** STREAM COMPACTION TESTS ** +***************************** + [ 1 0 1 1 3 0 1 0 1 0 0 3 1 ... 2 0 ] +==== cpu compact without scan, power-of-two ==== + elapsed time: 120.001ms (std::chrono Measured) + [ 1 1 1 3 1 1 3 1 3 3 2 3 3 ... 3 2 ] + passed +==== cpu compact without scan, non-power-of-two ==== + elapsed time: 123.921ms (std::chrono Measured) + [ 1 1 1 3 1 1 3 1 3 3 2 3 3 ... 3 3 ] + passed +==== cpu compact with scan ==== + elapsed time: 219.142ms (std::chrono Measured) + [ 1 1 1 3 1 1 3 1 3 3 2 3 3 ... 
3 2 ] + passed +==== work-efficient compact, power-of-two ==== + elapsed time: 22.6714ms (CUDA Measured) + passed +==== work-efficient compact, non-power-of-two ==== + elapsed time: 22.6632ms (CUDA Measured) + passed +``` \ No newline at end of file diff --git a/img/image.png b/img/image.png new file mode 100644 index 0000000..14b4dcd Binary files /dev/null and b/img/image.png differ diff --git a/img/scan-non-power-of-two.png b/img/scan-non-power-of-two.png new file mode 100644 index 0000000..9c4f042 Binary files /dev/null and b/img/scan-non-power-of-two.png differ diff --git a/img/scan-power-of-two.png b/img/scan-power-of-two.png new file mode 100644 index 0000000..730551a Binary files /dev/null and b/img/scan-power-of-two.png differ diff --git a/src/main.cpp b/src/main.cpp index 896ac2b..3203388 100644 --- a/src/main.cpp +++ b/src/main.cpp @@ -13,13 +13,13 @@ #include #include "testing_helpers.hpp" -const int SIZE = 1 << 8; // feel free to change the size of array +const int SIZE = 1 << 26; // feel free to change the size of array const int NPOT = SIZE - 3; // Non-Power-Of-Two int *a = new int[SIZE]; int *b = new int[SIZE]; int *c = new int[SIZE]; -int main(int argc, char* argv[]) { +void scanTest() { // Scan tests printf("\n"); @@ -51,7 +51,7 @@ int main(int argc, char* argv[]) { printDesc("naive scan, power-of-two"); StreamCompaction::Naive::scan(SIZE, c, a); printElapsedTime(StreamCompaction::Naive::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); - //printArray(SIZE, c, true); + printArray(SIZE, c, true); printCmpResult(SIZE, b, c); /* For bug-finding only: Array of 1s to help find bugs in stream compaction or scan @@ -94,6 +94,9 @@ int main(int argc, char* argv[]) { printElapsedTime(StreamCompaction::Thrust::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); //printArray(NPOT, c, true); printCmpResult(NPOT, b, c); +} + +void compactionTest() { printf("\n"); printf("*****************************\n"); @@ -146,8 +149,16 @@ 
int main(int argc, char* argv[]) { printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); //printArray(count, c, true); printCmpLenResult(count, expectedNPOT, b, c); +} +void test() { + scanTest(); + compactionTest(); system("pause"); // stop Win32 console from closing on exit +} + +int main(int argc, char* argv[]) { + test(); delete[] a; delete[] b; delete[] c; diff --git a/stream_compaction/common.cu b/stream_compaction/common.cu index 2ed6d63..1bf6708 100644 --- a/stream_compaction/common.cu +++ b/stream_compaction/common.cu @@ -24,6 +24,10 @@ namespace StreamCompaction { */ __global__ void kernMapToBoolean(int n, int *bools, const int *idata) { // TODO + int index = (blockDim.x * blockIdx.x) + threadIdx.x; + if (index < n) { + bools[index] = bool(idata[index]); + } } /** @@ -33,6 +37,12 @@ namespace StreamCompaction { __global__ void kernScatter(int n, int *odata, const int *idata, const int *bools, const int *indices) { // TODO + int index = (blockDim.x * blockIdx.x) + threadIdx.x; + if (index < n) { + if (bools[index]) { + odata[indices[index]] = idata[index]; + } + } } } diff --git a/stream_compaction/common.h b/stream_compaction/common.h index d2c1fed..85b7665 100644 --- a/stream_compaction/common.h +++ b/stream_compaction/common.h @@ -12,7 +12,7 @@ #define FILENAME (strrchr(__FILE__, '/') ? strrchr(__FILE__, '/') + 1 : __FILE__) #define checkCUDAError(msg) checkCUDAErrorFn(msg, FILENAME, __LINE__) - +#define blockSize 512 /** * Check for CUDA errors; print and exit if there was a problem. 
*/ diff --git a/stream_compaction/cpu.cu b/stream_compaction/cpu.cu index 719fa11..f6f6cb4 100644 --- a/stream_compaction/cpu.cu +++ b/stream_compaction/cpu.cu @@ -12,6 +12,17 @@ namespace StreamCompaction { return timer; } + void scanCore(int n, int* odata, const int* idata) { + //for (int i = 0; i < n; ++i) { + // for (int j = 0; j < i /* exclusive prefix sum */; ++j) { + // odata[i] += idata[j]; + // } + //} + for (int i = 1; i < n; ++i) { + odata[i] = odata[i - 1] + idata[i-1]; + } + } + /** * CPU scan (prefix sum). * For performance analysis, this is supposed to be a simple for loop. @@ -20,6 +31,7 @@ namespace StreamCompaction { void scan(int n, int *odata, const int *idata) { timer().startCpuTimer(); // TODO + scanCore(n, odata, idata); timer().endCpuTimer(); } @@ -31,8 +43,12 @@ namespace StreamCompaction { int compactWithoutScan(int n, int *odata, const int *idata) { timer().startCpuTimer(); // TODO + int oPtr = 0; + for (int i = 0; i < n; ++i) { + if (idata[i]) odata[oPtr++] = idata[i]; + } timer().endCpuTimer(); - return -1; + return oPtr; } /** @@ -41,10 +57,21 @@ namespace StreamCompaction { * @returns the number of elements remaining after compaction. */ int compactWithScan(int n, int *odata, const int *idata) { + int* odata_tmp = new int[n]; timer().startCpuTimer(); - // TODO + + for (int i = 0; i < n; ++i) { + odata_tmp[i] = !(!idata[i]); + if (i) odata_tmp[i] += odata_tmp[i - 1]; + if (idata[i]) { + odata[odata_tmp[i] - 1] = idata[i]; + } + } + int oSize = odata_tmp[n-1]; timer().endCpuTimer(); - return -1; + delete [] odata_tmp; + + return oSize; } } } diff --git a/stream_compaction/efficient.cu b/stream_compaction/efficient.cu index 2db346e..20dc735 100644 --- a/stream_compaction/efficient.cu +++ b/stream_compaction/efficient.cu @@ -15,10 +15,75 @@ namespace StreamCompaction { /** * Performs prefix-sum (aka scan) on idata, storing the result into odata. 
*/ + + void scanCore(int n, int* dev_odata) { + + dim3 fullBlocksPerGrid((n + blockSize - 1) / blockSize); + // Reduce + int offset = 1; + for (int d = 0; d < ilog2(n); d++) { + int operation_number = n / (offset * 2); + dim3 blocksPerGrid((operation_number + (blockSize - 1)) / blockSize); + //printf("%d\n", blocksPerGrid.x); + //printf("Cut off unnecessary threads\n"); + if (blocksPerGrid.x == 1) { + kernUpSweep<<<1, operation_number>>>(n, offset, dev_odata); + } + else + kernUpSweep<<<blocksPerGrid, blockSize>>>(n, offset, dev_odata); + //kernUpSweep<<<fullBlocksPerGrid, blockSize>>>(n, offset, dev_odata); // 0.31504 for power of two + checkCUDAError("kernUpSweep failed"); + offset <<= 1; + } + + // Down sweep + for (int d = ilog2(n) - 1; d >= 0; d--) { + offset = (1 << d); + int operation_number = n / (offset * 2); + dim3 blocksPerGrid((operation_number + (blockSize - 1)) / blockSize); + //printf("%d\n", blocksPerGrid.x); + //printf("Cut off unnecessary threads\n"); + if (blocksPerGrid.x == 1) { + kernDownSweep<<<1, operation_number>>>(n, offset, dev_odata); + } + else + kernDownSweep<<<blocksPerGrid, blockSize>>>(n, offset, dev_odata); + //kernDownSweep<<<fullBlocksPerGrid, blockSize>>>(n, offset, dev_odata); + checkCUDAError("kernDownSweep failed"); + } + + } void scan(int n, int *odata, const int *idata) { + int padded_n = (1 << ilog2ceil(n)); + int* dev_odata; + cudaMalloc(&dev_odata, padded_n * sizeof(int)); + cudaMemcpy(dev_odata, idata, n * sizeof(int), cudaMemcpyKind::cudaMemcpyHostToDevice); timer().startGpuTimer(); - // TODO + scanCore(padded_n, dev_odata); timer().endGpuTimer(); + cudaMemcpy(odata, dev_odata, n * sizeof(int), cudaMemcpyKind::cudaMemcpyDeviceToHost); + cudaFree(dev_odata); + } + + __global__ void kernUpSweep(int n, int offset, int* odata1) { + int index = (blockDim.x * blockIdx.x) + threadIdx.x; + //printf("%d, %d, %d, %d\n", n, index, offset, index*offset*2); + int arrIndex = index * (offset * 2); + if (arrIndex < n) { + odata1[arrIndex + offset * 2 - 1] += odata1[arrIndex + offset - 1]; + odata1[n-1] = 0; // the root is only written, never read, during the up-sweep, so zeroing it here leaves it cleared for the down-sweep + } + } + + 
__global__ void kernDownSweep(int n, int offset, int* odata1) { + int index = (blockDim.x * blockIdx.x) + threadIdx.x; + //printf("%d, %d, %d, %d\n", n, index, offset, index*offset*2); + int arrIndex = index * (offset * 2); + if (arrIndex < n) { + int t = odata1[arrIndex + offset - 1]; + odata1[arrIndex + offset - 1] = odata1[arrIndex + offset * 2 - 1]; + odata1[arrIndex + offset * 2 - 1] += t; + } } /** @@ -31,10 +96,59 @@ namespace StreamCompaction { * @returns The number of elements remaining after compaction. */ int compact(int n, int *odata, const int *idata) { - timer().startGpuTimer(); // TODO + /* + bools, indices should only be allocated on device + odata and idata need to be copied to device + */ + int padded_n = (1 << ilog2ceil(n)); + + int* dev_bools; + /* TODO: Check if remaining part is also zero OR DOESN'T MATTER? */ + cudaMalloc(&dev_bools, padded_n * sizeof(int)); + checkCUDAError("cudaMalloc dev_bools failed"); + + int* dev_idata; + cudaMalloc(&dev_idata, n * sizeof(int)); + checkCUDAError("cudaMalloc dev_idata failed"); + + cudaMemcpy(dev_idata, idata, n * sizeof(int), cudaMemcpyKind::cudaMemcpyHostToDevice); + checkCUDAError("cudaMemcpy from idata to dev_idata failed"); + + int* dev_indices; + cudaMalloc(&dev_indices, padded_n * sizeof(int)); + checkCUDAError("cudaMalloc dev_indices failed"); + + int* dev_odata; + cudaMalloc(&dev_odata, n * sizeof(int)); + checkCUDAError("cudaMalloc dev_odata failed"); + + dim3 fullBlocksPerGrid((n + blockSize - 1) / blockSize); + StreamCompaction::Common::kernMapToBoolean<<<fullBlocksPerGrid, blockSize>>>(n, dev_bools, dev_idata); + checkCUDAError("kernMapToBoolean failed"); + + cudaMemcpy(dev_indices, dev_bools, padded_n * sizeof(int), cudaMemcpyKind::cudaMemcpyDeviceToDevice); + + timer().startGpuTimer(); + scanCore(padded_n, dev_indices); timer().endGpuTimer(); - return -1; + StreamCompaction::Common::kernScatter<<<fullBlocksPerGrid, blockSize>>>(n, dev_odata, dev_idata, dev_bools, dev_indices); + checkCUDAError("kernScatter failed"); + + /* Still got problem here! 
*/ + int length, last_element; + cudaMemcpy(&length, dev_indices + n - 1, sizeof(int), cudaMemcpyKind::cudaMemcpyDeviceToHost); + cudaMemcpy(&last_element, dev_bools + n - 1, sizeof(int), cudaMemcpyKind::cudaMemcpyDeviceToHost); + checkCUDAError("cudaMemcpy from dev_indices[n-1] to length failed"); + length += last_element; + cudaMemcpy(odata, dev_odata, n * sizeof(int), cudaMemcpyKind::cudaMemcpyDeviceToHost); + + cudaFree(dev_idata); + cudaFree(dev_odata); + cudaFree(dev_indices); + cudaFree(dev_bools); + + return length; } } } diff --git a/stream_compaction/efficient.h b/stream_compaction/efficient.h index 803cb4f..b3fc851 100644 --- a/stream_compaction/efficient.h +++ b/stream_compaction/efficient.h @@ -1,13 +1,15 @@ -#pragma once - -#include "common.h" - -namespace StreamCompaction { - namespace Efficient { - StreamCompaction::Common::PerformanceTimer& timer(); - - void scan(int n, int *odata, const int *idata); - - int compact(int n, int *odata, const int *idata); - } -} +#pragma once + +#include "common.h" + +namespace StreamCompaction { + namespace Efficient { + StreamCompaction::Common::PerformanceTimer& timer(); + void scanCore(int n, int* dev_odata); + void scan(int n, int *odata, const int *idata); + + int compact(int n, int *odata, const int *idata); + __global__ void kernUpSweep(int n, int offset, int* odata1); + __global__ void kernDownSweep(int n, int offset, int* odata1); + } +} \ No newline at end of file diff --git a/stream_compaction/naive.cu b/stream_compaction/naive.cu index 4308876..88779d1 100644 --- a/stream_compaction/naive.cu +++ b/stream_compaction/naive.cu @@ -17,9 +17,53 @@ namespace StreamCompaction { * Performs prefix-sum (aka scan) on idata, storing the result into odata. 
*/ void scan(int n, int *odata, const int *idata) { - timer().startGpuTimer(); // TODO + int* dev_odata1; + cudaMalloc(&dev_odata1, n*sizeof(int)); + checkCUDAError("cudaMalloc dev_odata1 failed"); + cudaMemcpy(dev_odata1, idata, n*sizeof(int), cudaMemcpyKind::cudaMemcpyHostToDevice); + checkCUDAError("cudaMemcpy from idata to dev_odata1 failed"); + int *dev_odata2; + cudaMalloc(&dev_odata2, n*sizeof(int)); + checkCUDAError("cudaMalloc dev_odata2 failed"); + cudaMemcpy(dev_odata2, idata, n*sizeof(int), cudaMemcpyKind::cudaMemcpyHostToDevice); + checkCUDAError("cudaMemcpy from idata to dev_odata2 failed"); + + timer().startGpuTimer(); + dim3 fullBlocksPerGrid((n + blockSize - 1) / blockSize + 1); + int d = ilog2ceil(n); + int offset = 1; + for (int k = 1; k <= d; ++k) { + //offset <<= 1; + kernScan<<<fullBlocksPerGrid, blockSize>>>(n, (1 << (k-1)), dev_odata1, dev_odata2); + std::swap(dev_odata1, dev_odata2); + } + checkCUDAError("kernScan failed"); + timer().endGpuTimer(); + cudaMemcpy(odata, dev_odata1, n*sizeof(int), cudaMemcpyKind::cudaMemcpyDeviceToHost); + for (int i = n - 1; i >= 0; --i) { + if (i == 0) odata[i] = 0; + else odata[i] = odata[i - 1]; + } + + checkCUDAError("cudaMemcpy from dev_odata1 to odata failed"); + + cudaFree(dev_odata1); + cudaFree(dev_odata2); + + } + + __global__ void kernScan(int n, int offset, int *odata1, int *odata2) { + int index = (blockDim.x * blockIdx.x) + threadIdx.x; + if (index >= n) return; + if (index >= offset) { /* no need to check if (index < n) */ + //odata2[index + offset] = odata1[index] + odata1[index + offset]; + odata2[index] = odata1[index] + odata1[index - offset]; + } + else { + odata2[index] = odata1[index]; + } } } } diff --git a/stream_compaction/naive.h b/stream_compaction/naive.h index 37dcb06..de8c6d7 100644 --- a/stream_compaction/naive.h +++ b/stream_compaction/naive.h @@ -7,5 +7,7 @@ namespace StreamCompaction { StreamCompaction::Common::PerformanceTimer& timer(); void scan(int n, int *odata, const int *idata); + __global__ 
void kernScan(int n, int offset, int* odata1, int* odata2); + } } diff --git a/stream_compaction/thrust.cu b/stream_compaction/thrust.cu index 1def45e..79c94a3 100644 --- a/stream_compaction/thrust.cu +++ b/stream_compaction/thrust.cu @@ -18,11 +18,30 @@ namespace StreamCompaction { * Performs prefix-sum (aka scan) on idata, storing the result into odata. */ void scan(int n, int *odata, const int *idata) { + int* dev_idata; + int* dev_odata; + + cudaMalloc(&dev_idata, n * sizeof(int)); + checkCUDAError("cudaMalloc dev_idata failed!"); + cudaMalloc(&dev_odata, n * sizeof(int)); + checkCUDAError("cudaMalloc dev_odata failed!"); + + cudaMemcpy(dev_idata, idata, n * sizeof(int), cudaMemcpyKind::cudaMemcpyHostToDevice); + + thrust::device_ptr<int> dev_in_ptr(dev_idata); + thrust::device_ptr<int> dev_out_ptr(dev_odata); + thrust::device_vector<int> dv_in(dev_in_ptr, dev_in_ptr + n); + thrust::device_vector<int> dv_out(dev_out_ptr, dev_out_ptr + n); + timer().startGpuTimer(); // TODO use `thrust::exclusive_scan` // example: for device_vectors dv_in and dv_out: // thrust::exclusive_scan(dv_in.begin(), dv_in.end(), dv_out.begin()); + thrust::exclusive_scan(dv_in.begin(), dv_in.end(), dv_out.begin()); timer().endGpuTimer(); + cudaMemcpy(odata, dv_out.data().get(), n * sizeof(int), cudaMemcpyKind::cudaMemcpyDeviceToHost); + cudaFree(dev_idata); + cudaFree(dev_odata); } } } diff --git a/tools/data_processing.py b/tools/data_processing.py new file mode 100644 index 0000000..e3b25bd --- /dev/null +++ b/tools/data_processing.py @@ -0,0 +1,90 @@ +import matplotlib.pyplot as plt + +N = [] +for i in range(18, 28, 2): + xlabel = "$2^{" + xlabel += str(i) + xlabel += '}$' + N.append(xlabel) +# N = [r"$2^{}$".format(i) for i in range(18, 28, 2)] + +# Scan timings (power-of-two sizes) +cpu = [0.1506, 0.5339, 2.3771, 13.9042, 56.4294] +gpu_naive = [0.106496, 0.754688, 3.23482, 14.9873, 63.2361] +gpu_efficient = [0.157696, 0.443392, 1.40288, 5.29101, 20.8722] +gpu_thrust = [0.160768, 0.18832, 0.319488, 
0.975808, 2.20467] +plt.xlabel("N count") +plt.ylabel("Time (ms)") +plt.plot(N, cpu, label="CPU", color="r", marker="o") +plt.plot(N, gpu_naive, label="Naive GPU", color="g", marker="o") +plt.plot(N, gpu_efficient, label="Efficient GPU", color="b", marker="o") +plt.plot(N, gpu_thrust, label="Thrust", color="y", marker="o") +plt.legend(labels=["CPU", "Naive GPU", "Efficient GPU", "Thrust"]) +plt.title("Time of scan using different methods (power-of-two)") +plt.savefig("../img/scan-power-of-two.png") +plt.show() +plt.clf() + +cpu = [0.1685, 0.6555, 3.2271, 17.4222, 44.574] +gpu_naive = [0.1024, 0.765952, 3.24301, 15.574, 74.4376] +gpu_efficient = [0.166912, 0.477184, 1.41312, 6.2505, 20.8456] +gpu_thrust = [0.166944, 0.191584, 0.401504, 0.777952, 2.18931] +plt.xlabel("N count") +plt.ylabel("Time (ms)") +plt.plot(N, cpu, label="CPU", color="r", marker="o") +plt.plot(N, gpu_naive, label="Naive GPU", color="g", marker="o") +plt.plot(N, gpu_efficient, label="Efficient GPU", color="b", marker="o") +plt.plot(N, gpu_thrust, label="Thrust", color="y", marker="o") +plt.legend(labels=["CPU", "Naive GPU", "Efficient GPU", "Thrust"]) +plt.title("Time of scan using different methods (non-power-of-two)") +plt.savefig("../img/scan-non-power-of-two.png") +plt.show() +plt.clf() + +exit() + +# Drawing increasing boids, no visuals +fps_brutalForce = [8297.48, 1227.13, 42.882, 0.46393, 0.0] +fps_scatteredGrid = [7666.14, 7279.71, 4348.93, 380.379, 18.6696] +fps_coherentGrid = [7383.36, 7034.57, 4743.67, 683.372, 57.848] + +plt.xlabel("N count") +plt.ylabel("FPS") +plt.plot(N, fps_brutalForce, label="brutalForce", color="r", marker="o") +plt.plot(N, fps_scatteredGrid, label="scatteredGrid", color="g", marker="o") +plt.plot(N, fps_coherentGrid, label="coherentGrid", color="b", marker="o") +plt.legend(labels=["brutalForce", "scatteredGrid", "coherentGrid"]) +plt.title("FPS of kernel under different methods (No visualization)") +# plt.show() +plt.savefig("../images/boids_no_visual.png") + 
+plt.clf() + +# Drawing 27-cell vs 8-cell +plt.xlabel("N count") +plt.ylabel("FPS") +fps_scatteredGrid27Neighbors = [7285.03, 6472.57, 3955.12, 353.385, 18.667] +fps_scatteredGrid8Neighbors = [7666.14, 7279.71, 4348.93, 380.379, 18.6696] +plt.plot(N, fps_scatteredGrid27Neighbors, label="27-cell", marker="o") +plt.plot(N, fps_scatteredGrid8Neighbors, label="8-cell", marker="o") +plt.legend(labels=["27-cell", "8-cell"]) +plt.title("FPS of 27-cell vs 8-cell (No visualization)") +# plt.show() +plt.savefig("../images/27vs8_no_visual.png") + +plt.clf() + + +# Drawing increasing blockSize + +blockSize = [r"$2^{" + str(i) + r"}$" for i in range(2, 11)] +fps_coherentGridBlockSizeChanged = [11.7512, 18.7417, 28.524, 41.9919, 57.0016, 62.4268, 64.3508, 61.0085, 59.5032] +plt.xlabel("Block size") +plt.ylabel("FPS") +plt.plot(blockSize, fps_coherentGridBlockSizeChanged, marker="o") +plt.title("FPS of coherent grid over increasing block size (With visualization)") +# plt.show() +plt.savefig("../images/increasing_gridSize.png") + +if __name__ == "__main__": + pass
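
As a sanity check on the logic recorded in this patch, the exclusive-scan recurrence used in `cpu.cu` (`odata[i] = odata[i-1] + idata[i-1]`) and the map-scan-scatter compaction scheme from `efficient.cu` can be sketched as a small host-side reference model. This is an illustrative sketch only, not part of the patch; the function names here are hypothetical:

```python
# Reference model for exclusive scan and scan-based stream compaction.
# Mirrors cpu.cu's scanCore and the kernMapToBoolean -> scanCore ->
# kernScatter pipeline in efficient.cu, on plain Python lists.

def exclusive_scan(idata):
    # odata[0] = 0; odata[i] = odata[i-1] + idata[i-1]
    odata = [0] * len(idata)
    for i in range(1, len(idata)):
        odata[i] = odata[i - 1] + idata[i - 1]
    return odata

def compact_with_scan(idata):
    bools = [1 if x else 0 for x in idata]    # kernMapToBoolean
    indices = exclusive_scan(bools)           # scanCore
    # compacted length = last index + last bool (as in compact())
    odata = [0] * (indices[-1] + bools[-1])
    for i, keep in enumerate(bools):          # kernScatter
        if keep:
            odata[indices[i]] = idata[i]
    return odata

print(exclusive_scan([6, 12, 22, 4]))         # [0, 6, 18, 40]
print(compact_with_scan([1, 0, 1, 1, 3, 0]))  # [1, 1, 1, 3]
```

The first printed result matches the prefix of the test output above (`[ 0 6 18 40 ... ]` for input `[ 6 12 22 4 ... ]`), which makes this a convenient cross-check when debugging the CUDA kernels.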