CIS565-Fall-2023 · LichengCAO · Sep 16, 2023 · Sep 16, 2023 · Sep 17, 2023 · Sep 17, 2023
diff --git a/README.md b/README.md
@@ -3,12 +3,111 @@ CUDA Stream Compaction
 
 **University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 2**
 
-* (TODO) YOUR NAME HERE
-  * (TODO) [LinkedIn](), [personal website](), [twitter](), etc.
-* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab)
+* Licheng CAO
+  * [LinkedIn](https://www.linkedin.com/in/licheng-cao-6a523524b/)
+* Tested on: Windows 10, i7-10870H @ 2.20GHz 32GB, GTX 3060 6009MB
 
-### (TODO: Your README)
+Implemented Features
+======================
+  * naive GPU scan
+  * efficient GPU scan (with reduced number of threads)
+  * GPU stream compaction
+  * naive GPU radix sort
+
+Analysis
+======================
+### Blocksize selection
+* ![blocksize_select](https://github.com/LichengCAO/Project2-Stream-Compaction/assets/81556019/7402b71e-c8dd-4949-9be1-cc63e8a7cec9)
+* Figure 1
+* Figure 1 shows the running time of my GPU program under different blocksizes. Consequently, I have chosen a block size of 128 for my naive method and 64 for my efficient method based on the results.
+
+### Scan
+* ![scan_largenum](https://github.com/LichengCAO/Project2-Stream-Compaction/assets/81556019/1dfd0dfe-80d0-440d-87a1-aab498bf6f9e)
+ * Figure 2 average runtime with large array size
+* ![scan_smallnum](https://github.com/LichengCAO/Project2-Stream-Compaction/assets/81556019/eca72f02-124d-4dbf-961e-b0e8864e4550)
+ * Figure 3 average runtime with small array size
+ * Figure 2 and 3 display the runtime performance of different methods for the scan operation. Upon analyzing these figures, it becomes evident that for array sizes below 24,576, the CPU method outperforms the other approaches in terms of speed. However, as the array size increases to approximately 100,000, the GPU methods exhibit superior performance. I think the bottlenecks in both GPU methods are related to memory input/output (I/O), as the computational tasks within these methods are not particularly complex. The bottleneck of CPU method may stem from its inability to execute operations in parallel, as its runtime is roughly proportional to the array size.
+ * ![cuda_compute](https://github.com/LichengCAO/Project2-Stream-Compaction/assets/81556019/88dd91d1-cec9-45dc-873f-093b51e57935)
+ * With Nsight Compute, we can see that thrust_scan uses 3 kernel functions to scan the array. I suspect that this method may closely resemble the scan method mentioned at the end of the slide, which involves dividing arrays into several blocks for scanning and subsequently adding offsets within each block to obtain the final result.
+
+### Sort
+* ![sort_largenum](https://github.com/LichengCAO/Project2-Stream-Compaction/assets/81556019/4e27cfed-501d-42f0-8029-8729402aff04)
+ * Figure 4 average sort time with large array size
+* ![sort_smallnum](https://github.com/LichengCAO/Project2-Stream-Compaction/assets/81556019/a9a22753-c58a-4664-9200-68ba61eb2641)
+ * Figure 5 average sort time with small array size
+* Figure 4, 5 show the runtime of sort with different method. With small arrays, my implementation of radix sort runs slower than the other two methods. With large large arrays, the 2 GPU methods run much faster than the CPU method.
+* The primary computational cost in my implementation arises from the scan procedure used to rearrange numbers based on their bits. Initially, I employed two separate scans to determine the correct indices for numbers with '0' and '1' bits at a specific position. Surprisingly, this approach made my radix implementation even slower than the CPU method.
+  * After reviewing others' implementations, I came to realize that I can calculate the index for numbers with '1' based on the scan result for numbers with '0' (i.e., index1 = total_number_of_0 + (cur_id_of_num - number_of_0_before_cur_id)). This modification boosted the performance of my implementation significantly, resulting in it running approximately 40% faster than the CPU method.
+  * Furthermore, it became evident that scanning all 32 bits in each iteration was unnecessary for sorting numbers. By checking if the array is already sorted at the beginning of each loop, I could avoid unnecessary scans. As a result, the runtime for the sorting process reduced to just 1/8 of its original duration.
+
+
+Tests
+======================
+```result
+****************
+** SCAN TESTS **
+****************
+==== cpu scan, power-of-two ====
+   elapsed time: 96.4101ms    (std::chrono Measured)
+==== cpu scan, non-power-of-two ====
+   elapsed time: 92.3129ms    (std::chrono Measured)
+    passed
+==== naive scan, power-of-two ====
+   elapsed time: 50.0737ms    (CUDA Measured)
+    passed
+==== naive scan, non-power-of-two ====
+   elapsed time: 50.0838ms    (CUDA Measured)
+    passed
+==== work-efficient scan, power-of-two ====
+   elapsed time: 18.0347ms    (CUDA Measured)
+    passed
+==== work-efficient scan, non-power-of-two ====
+   elapsed time: 18.0009ms    (CUDA Measured)
+    passed
+==== thrust scan, power-of-two ====
+   elapsed time: 1.9712ms    (CUDA Measured)
+    passed
+==== thrust scan, non-power-of-two ====
+   elapsed time: 1.95203ms    (CUDA Measured)
+    passed
+
+*****************************
+** STREAM COMPACTION TESTS **
+*****************************
+==== cpu compact without scan, power-of-two ====
+   elapsed time: 138.717ms    (std::chrono Measured)
+    passed
+==== cpu compact without scan, non-power-of-two ====
+   elapsed time: 137.777ms    (std::chrono Measured)
+    passed
+==== cpu compact with scan ====
+   elapsed time: 233.015ms    (std::chrono Measured)
+    passed
+==== work-efficient compact, power-of-two ====
+   elapsed time: 18.0009ms    (CUDA Measured)
+    passed
+==== work-efficient compact, non-power-of-two ====
+   elapsed time: 18.0009ms    (CUDA Measured)
+    passed
 
-Include analysis, etc. (Remember, this is public, so don't put
-anything here that you don't want to share with the world.)
+****************
+** SORT TESTS **
+****************
+==== cpu sort, power-of-two ====
+   elapsed time: 1192.6ms    (std::chrono Measured)
+==== work-efficient sort, power-of-two ====
+   elapsed time: 161.361ms    (CUDA Measured)
+    passed
+==== thrust sort, power-of-two ====
+   elapsed time: 10.9355ms    (CUDA Measured)
+    passed
+==== cpu sort, non-power-of-two ====
+   elapsed time: 1200.21ms    (std::chrono Measured)
+==== work-efficient sort, non-power-of-two ====
+   elapsed time: 159.406ms    (CUDA Measured)
+    passed
+==== thrust sort, non-power-of-two ====
+   elapsed time: 10.8496ms    (CUDA Measured)
+    passed
+```
 
diff --git a/src/main.cpp b/src/main.cpp
@@ -13,23 +13,101 @@
 #include <stream_compaction/thrust.h>
 #include "testing_helpers.hpp"
 
-const int SIZE = 1 << 8; // feel free to change the size of array
+const int SIZE = (1 << 20); // feel free to change the size of array
 const int NPOT = SIZE - 3; // Non-Power-Of-Two
 int *a = new int[SIZE];
 int *b = new int[SIZE];
 int *c = new int[SIZE];
 
 int main(int argc, char* argv[]) {
-    // Scan tests
 
+#if 0
+	float avgCPU = 0.f;
+	float avgEff = 0.f;
+	float avgNav = 0.f;
+	float avgThrust = 0.f;
+    for (int i = 0;i < 100;++i) {
+        genArray(SIZE - 1, a, 50);  // Leave a 0 at the end to test that edge case
+        a[SIZE - 1] = 0;
+
+        zeroArray(SIZE, c);
+        StreamCompaction::CPU::scan(SIZE, c, a);
+        avgCPU += StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation();
+
+        zeroArray(SIZE, c);
+        StreamCompaction::Naive::scan(SIZE, c, a);
+        avgNav += StreamCompaction::Naive::timer().getGpuElapsedTimeForPreviousOperation();
+
+        zeroArray(SIZE, c);
+        StreamCompaction::Efficient::scan(SIZE, c, a);
+        avgEff += StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation();
+
+        zeroArray(SIZE, c);
+        StreamCompaction::Thrust::scan(SIZE, c, a);
+        avgThrust += StreamCompaction::Thrust::timer().getGpuElapsedTimeForPreviousOperation();
+
+        //zeroArray(SIZE, c);
+        //StreamCompaction::CPU::sort(SIZE, c, a);
+        //avgCPU += StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation();
+
+
+        //zeroArray(SIZE, c);
+        //StreamCompaction::Efficient::sort(SIZE, c, a);
+        //avgEff += StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation();
+
+        //zeroArray(SIZE, c);
+        //StreamCompaction::Thrust::sort(SIZE, c, a);
+        //avgThrust += StreamCompaction::Thrust::timer().getGpuElapsedTimeForPreviousOperation();
+
+    }
+
+    std::cout << "avg CPU: " << avgCPU / 100. << std::endl;
+	std::cout << "avg naive: " << avgNav / 100. << std::endl;
+	std::cout << "avg efficient: " << avgEff / 100. << std::endl;
+	std::cout << "avg thrust: " << avgThrust / 100. << std::endl;
+	std::cout << "......." << std::endl;
+#endif
+
+#if 0
+    float avgCPU = 0.f;
+    float avgEff = 0.f;
+    float avgThrust = 0.f;
+    int testCnt = 5;
+    for (int i = 0;i < testCnt;++i) {
+        genArray(SIZE - 1, a, 50);  // Leave a 0 at the end to test that edge case
+        a[SIZE - 1] = 0;
+
+        zeroArray(SIZE, c);
+        StreamCompaction::CPU::sort(SIZE, c, a);
+        avgCPU += StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation();
+
+        zeroArray(SIZE, c);
+        StreamCompaction::Efficient::sort(SIZE, c, a);
+        avgEff += StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation();
+
+        zeroArray(SIZE, c);
+        StreamCompaction::Thrust::sort(SIZE, c, a);
+        avgThrust += StreamCompaction::Thrust::timer().getGpuElapsedTimeForPreviousOperation();
+
+    }
+    float denom = testCnt;
+    std::cout << "avg CPU: " << avgCPU / denom << std::endl;
+    std::cout << "avg efficient: " << avgEff / denom << std::endl;
+    std::cout << "avg thrust: " << avgThrust / denom << std::endl;
+    std::cout << "......." << std::endl;
+#endif
+
+
+#if 1
+    // Scan tests
     printf("\n");
     printf("****************\n");
     printf("** SCAN TESTS **\n");
     printf("****************\n");
 
     genArray(SIZE - 1, a, 50);  // Leave a 0 at the end to test that edge case
     a[SIZE - 1] = 0;
-    printArray(SIZE, a, true);
+    //printArray(SIZE, a, true);
 
     // initialize b using StreamCompaction::CPU::scan you implement
     // We use b for further comparison. Make sure your StreamCompaction::CPU::scan is correct.
@@ -38,13 +116,13 @@ int main(int argc, char* argv[]) {
     printDesc("cpu scan, power-of-two");
     StreamCompaction::CPU::scan(SIZE, b, a);
     printElapsedTime(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation(), "(std::chrono Measured)");
-    printArray(SIZE, b, true);
+   // printArray(SIZE, b, true);
 
     zeroArray(SIZE, c);
     printDesc("cpu scan, non-power-of-two");
     StreamCompaction::CPU::scan(NPOT, c, a);
     printElapsedTime(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation(), "(std::chrono Measured)");
-    printArray(NPOT, b, true);
+    //printArray(NPOT, b, true);
     printCmpResult(NPOT, b, c);
 
     zeroArray(SIZE, c);
@@ -104,7 +182,7 @@ int main(int argc, char* argv[]) {
 
     genArray(SIZE - 1, a, 4);  // Leave a 0 at the end to test that edge case
     a[SIZE - 1] = 0;
-    printArray(SIZE, a, true);
+    //printArray(SIZE, a, true);
 
     int count, expectedCount, expectedNPOT;
 
@@ -115,22 +193,22 @@ int main(int argc, char* argv[]) {
     count = StreamCompaction::CPU::compactWithoutScan(SIZE, b, a);
     printElapsedTime(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation(), "(std::chrono Measured)");
     expectedCount = count;
-    printArray(count, b, true);
+    //printArray(count, b, true);
     printCmpLenResult(count, expectedCount, b, b);
 
     zeroArray(SIZE, c);
     printDesc("cpu compact without scan, non-power-of-two");
     count = StreamCompaction::CPU::compactWithoutScan(NPOT, c, a);
     printElapsedTime(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation(), "(std::chrono Measured)");
     expectedNPOT = count;
-    printArray(count, c, true);
+    //printArray(count, c, true);
     printCmpLenResult(count, expectedNPOT, b, c);
 
     zeroArray(SIZE, c);
     printDesc("cpu compact with scan");
     count = StreamCompaction::CPU::compactWithScan(SIZE, c, a);
     printElapsedTime(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation(), "(std::chrono Measured)");
-    printArray(count, c, true);
+    //printArray(count, c, true);
     printCmpLenResult(count, expectedCount, b, c);
 
     zeroArray(SIZE, c);
@@ -147,6 +225,61 @@ int main(int argc, char* argv[]) {
     //printArray(count, c, true);
     printCmpLenResult(count, expectedNPOT, b, c);
 
+    printf("\n");
+    printf("****************\n");
+    printf("** SORT TESTS **\n");
+    printf("****************\n");
+
+    genArray(SIZE - 1, a, 50);  // Leave a 0 at the end to test that edge case
+    a[SIZE - 1] = 0;
+    //printArray(SIZE, a, true);
+
+    // initialize b using StreamCompaction::CPU::scan you implement
+    // We use b for further comparison. Make sure your StreamCompaction::CPU::scan is correct.
+    // At first all cases passed because b && c are all zeroes.
+    zeroArray(SIZE, b);
+    printDesc("cpu sort, power-of-two");
+    StreamCompaction::CPU::sort(SIZE, b, a);
+    printElapsedTime(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation(), "(std::chrono Measured)");
+    //printArray(SIZE, b, true);
+
+    zeroArray(SIZE, c);
+    printDesc("work-efficient sort, power-of-two");
+    StreamCompaction::Efficient::sort(SIZE, c, a);
+    printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
+    //printArray(SIZE, c, true);
+    printCmpResult(SIZE, b, c);
+
+    zeroArray(SIZE, c);
+    printDesc("thrust sort, power-of-two");
+    StreamCompaction::Thrust::sort(SIZE, c, a);
+    printElapsedTime(StreamCompaction::Thrust::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
+    //printArray(SIZE, c, true);
+    printCmpResult(SIZE, b, c);
+
+
+    zeroArray(SIZE, b);
+    printDesc("cpu sort, non-power-of-two");
+    StreamCompaction::CPU::sort(NPOT, b, a);
+    printElapsedTime(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation(), "(std::chrono Measured)");
+    //printArray(NPOT, b, true);
+
+    zeroArray(SIZE, c);
+    printDesc("work-efficient sort, non-power-of-two");
+    StreamCompaction::Efficient::sort(NPOT, c, a);
+    printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
+    //printArray(NPOT, c, true);
+    printCmpResult(NPOT, b, c);
+
+    zeroArray(SIZE, c);
+    printDesc("thrust sort, non-power-of-two");
+    StreamCompaction::Thrust::sort(NPOT, c, a);
+    printElapsedTime(StreamCompaction::Thrust::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
+    //printArray(NPOT, c, true);
+    printCmpResult(NPOT, b, c);
+
+#endif // 1
+
     system("pause"); // stop Win32 console from closing on exit
     delete[] a;
     delete[] b;

diff --git a/stream_compaction/common.cu b/stream_compaction/common.cu
@@ -24,6 +24,9 @@ namespace StreamCompaction {
          */
         __global__ void kernMapToBoolean(int n, int *bools, const int *idata) {
             // TODO
+            int id = blockDim.x * blockIdx.x + threadIdx.x;
+            if (id >= n)return;
+            bools[id] = idata[id] == 0 ? 0 : 1;
         }
 
         /**
@@ -33,6 +36,12 @@ namespace StreamCompaction {
         __global__ void kernScatter(int n, int *odata,
                 const int *idata, const int *bools, const int *indices) {
             // TODO
+            int id = blockDim.x * blockIdx.x + threadIdx.x;
+            if (id >= n)return;
+            if (bools[id] == 1) {
+                int idx = indices[id];
+                odata[idx] = idata[id];
+            }
         }
 
     }

diff --git a/stream_compaction/common.h b/stream_compaction/common.h
@@ -9,6 +9,7 @@
 #include <algorithm>
 #include <chrono>
 #include <stdexcept>
+#include <device_launch_parameters.h>
 
 #define FILENAME (strrchr(__FILE__, '/') ? strrchr(__FILE__, '/') + 1 : __FILE__)
 #define checkCUDAError(msg) checkCUDAErrorFn(msg, FILENAME, __LINE__)