CIS565-Fall-2015 · ziyezhou-Jerry · Sep 13, 2015 · Sep 13, 2015 · Sep 13, 2015 · Sep 13, 2015
diff --git a/README.md b/README.md
@@ -3,211 +3,55 @@ CUDA Stream Compaction
 
 **University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 2**
 
-* (TODO) YOUR NAME HERE
-* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab)
+* Ziye Zhou
+* Tested on: Windows 8.1, i7-4910 @ 2.90GHz 32GB, GTX 880M 8192MB (Alienware)
 
-### (TODO: Your README)
-
-Include analysis, etc. (Remember, this is public, so don't put
-anything here that you don't want to share with the world.)
-
-Instructions (delete me)
-========================
-
-This is due Sunday, September 13 at midnight.
-
-**Summary:** In this project, you'll implement GPU stream compaction in CUDA,
-from scratch. This algorithm is widely used, and will be important for
-accelerating your path tracer project.
-
-Your stream compaction implementations in this project will simply remove `0`s
-from an array of `int`s. In the path tracer, you will remove terminated paths
-from an array of rays.
-
-In addition to being useful for your path tracer, this project is meant to
-reorient your algorithmic thinking to the way of the GPU. On GPUs, many
-algorithms can benefit from massive parallelism and, in particular, data
-parallelism: executing the same code many times simultaneously with different
-data.
-
-You'll implement a few different versions of the *Scan* (*Prefix Sum*)
-algorithm. First, you'll implement a CPU version of the algorithm to reinforce
-your understanding. Then, you'll write a few GPU implementations: "naive" and
-"work-efficient." Finally, you'll use some of these to implement GPU stream
-compaction.
-
-**Algorithm overview & details:** There are two primary references for details
-on the implementation of scan and stream compaction.
-
-* The [slides on Parallel Algorithms](https://github.com/CIS565-Fall-2015/cis565-fall-2015.github.io/raw/master/lectures/2-Parallel-Algorithms.pptx)
-  for Scan, Stream Compaction, and Work-Efficient Parallel Scan.
-* GPU Gems 3, Chapter 39 - [Parallel Prefix Sum (Scan) with CUDA](http://http.developer.nvidia.com/GPUGems3/gpugems3_ch39.html).
-
-Your GPU stream compaction implementation will live inside of the
-`stream_compaction` subproject. This way, you will be able to easily copy it
-over for use in your GPU path tracer.
-
-
-## Part 0: The Usual
-
-This project (and all other CUDA projects in this course) requires an NVIDIA
-graphics card with CUDA capability. Any card with Compute Capability 2.0
-(`sm_20`) or greater will work. Check your GPU on this
-[compatibility table](https://developer.nvidia.com/cuda-gpus).
-If you do not have a personal machine with these specs, you may use those
-computers in the Moore 100B/C which have supported GPUs.
-
-**HOWEVER**: If you need to use the lab computer for your development, you will
-not presently be able to do GPU performance profiling. This will be very
-important for debugging performance bottlenecks in your program.
-
-### Useful existing code
-
-* `stream_compaction/common.h`
-  * `checkCUDAError` macro: checks for CUDA errors and exits if there were any.
-  * `ilog2ceil(x)`: computes the ceiling of log2(x), as an integer.
-* `main.cpp`
-  * Some testing code for your implementations.
-
-
-## Part 1: CPU Scan & Stream Compaction
-
-This stream compaction method will remove `0`s from an array of `int`s.
-
-In `stream_compaction/cpu.cu`, implement:
-
-* `StreamCompaction::CPU::scan`: compute an exclusive prefix sum.
-* `StreamCompaction::CPU::compactWithoutScan`: stream compaction without using
-  the `scan` function.
-* `StreamCompaction::CPU::compactWithScan`: stream compaction using the `scan`
-  function. Map the input array to an array of 0s and 1s, scan it, and use
-  scatter to produce the output. You will need a **CPU** scatter implementation
-  for this (see slides or GPU Gems chapter for an explanation).
-
-These implementations should only be a few lines long.
-
-
-## Part 2: Naive GPU Scan Algorithm
-
-In `stream_compaction/naive.cu`, implement `StreamCompaction::Naive::scan`
-
-This uses the "Naive" algorithm from GPU Gems 3, Section 39.2.1. We haven't yet
-taught shared memory, and you **shouldn't use it yet**. Example 39-1 uses
-shared memory, but is limited to operating on very small arrays! Instead, write
-this using global memory only. As a result of this, you will have to do
-`ilog2ceil(n)` separate kernel invocations.
-
-Beware of errors in Example 39-1 in the book; both the pseudocode and the CUDA
-code in the online version of Chapter 39 are known to have a few small errors
-(in superscripting, missing braces, bad indentation, etc.)
-
-Since the parallel scan algorithm operates on a binary tree structure, it works
-best with arrays with power-of-two length. Make sure your implementation works
-on non-power-of-two sized arrays (see `ilog2ceil`). This requires extra memory
-- your intermediate array sizes will need to be rounded to the next power of
-two.
-
-
-## Part 3: Work-Efficient GPU Scan & Stream Compaction
-
-### 3.1. Scan
-
-In `stream_compaction/efficient.cu`, implement
-`StreamCompaction::Efficient::scan`
-
-All of the text in Part 2 applies.
-
-* This uses the "Work-Efficient" algorithm from GPU Gems 3, Section 39.2.2.
-* Beware of errors in Example 39-2.
-* Test non-power-of-two sized arrays.
-
-### 3.2. Stream Compaction
-
-This stream compaction method will remove `0`s from an array of `int`s.
-
-In `stream_compaction/efficient.cu`, implement
-`StreamCompaction::Efficient::compact`
-
-For compaction, you will also need to implement the scatter algorithm presented
-in the slides and the GPU Gems chapter.
-
-In `stream_compaction/common.cu`, implement these for use in `compact`:
-
-* `StreamCompaction::Common::kernMapToBoolean`
-* `StreamCompaction::Common::kernScatter`
-
-
-## Part 4: Using Thrust's Implementation
-
-In `stream_compaction/thrust.cu`, implement:
-
-* `StreamCompaction::Thrust::scan`
-
-This should be a very short function which wraps a call to the Thrust library
-function `thrust::exclusive_scan(first, last, result)`.
-
-To measure timing, be sure to exclude memory operations by passing
-`exclusive_scan` a `thrust::device_vector` (which is already allocated on the
-GPU).  You can create a `thrust::device_vector` by creating a
-`thrust::host_vector` from the given pointer, then casting it.
-
-
-## Part 5: Radix Sort (Extra Credit) (+10)
-
-Add an additional module to the `stream_compaction` subproject. Implement radix
-sort using one of your scan implementations. Add tests to check its correctness.
-
-
-## Write-up
-
-1. Update all of the TODOs at the top of this README.
-2. Add a description of this project including a list of its features.
-3. Add your performance analysis (see below).
-
-All extra credit features must be documented in your README, explaining its
-value (with performance comparison, if applicable!) and showing an example how
-it works. For radix sort, show how it is called and an example of its output.
-
-Always profile with Release mode builds and run without debugging.
+### Description
+1. Implemented CPU Scan & Stream Compaction
+2. Implemented Naive GPU Scan & Stream Compaction
+3. Implemented Work-Efficient GPU Scan & Stream Compaction
+4. Tested Thrust's Implementation
+5. Implemented Radix Sort (Extra Credit)
+6. Compared the performance of CPU & GPU on Scan Algorithm
 
 ### Questions
 
-* Roughly optimize the block sizes of each of your implementations for minimal
-  run time on your GPU.
-  * (You shouldn't compare unoptimized implementations to each other!)
-
 * Compare all of these GPU Scan implementations (Naive, Work-Efficient, and
   Thrust) to the serial CPU version of Scan. Plot a graph of the comparison
   (with array size on the independent axis).
-  * You should use CUDA events for timing. Be sure **not** to include any
-    explicit memory operations in your performance measurements, for
-    comparability.
-  * To guess at what might be happening inside the Thrust implementation, take
-    a look at the Nsight timeline for its execution.
+ ![alt tag](https://github.com/ziyezhou-Jerry/Project2-Stream-Compaction/blob/master/proj2_compare.png?raw=true)
+![alt tag](https://github.com/ziyezhou-Jerry/Project2-Stream-Compaction/blob/master/proj2_thrust_running.png?raw=true)
 
 * Write a brief explanation of the phenomena you see here.
   * Can you find the performance bottlenecks? Is it memory I/O? Computation? Is
     it different for each implementation?
+From the performance analysis given below, I think the bootleneck is the memory I/O part. As we can see from the timeline, the computation time only takes a little portion of the whole time, but the memory I/O takes a lot. This is nearly the same for all implementations.
+
+![alt tag](https://github.com/ziyezhou-Jerry/Project2-Stream-Compaction/blob/master/proj2_bottleneck_analysis.png?raw=true)
 
 * Paste the output of the test program into a triple-backtick block in your
   README.
-  * If you add your own tests (e.g. for radix sort or to test additional corner
-    cases), be sure to mention it explicitly.
-
-These questions should help guide you in performance analysis on future
-assignments, as well.
 
-## Submit
+  Apart from the given test cases, I have also added my own test case for the radix sort. I used the STL sort method to get the CPU version of sort result and compare it with the GPU radix sort.
+  <<<![alt tag](https://github.com/ziyezhou-Jerry/Project2-Stream-Compaction/blob/master/proj2_testing_output.png?raw=true) >>>
+
+### Extra Credit
+I have also Implemented the Radix Sort. Within the method, I am using the thrust::scan method to do the prefix-sum. The comparison with the CPU STL sort can be seen below:
+![alt tag](https://github.com/ziyezhou-Jerry/Project2-Stream-Compaction/blob/master/proj2_extra_output.png?raw=true)
+
+The input I used is manually generate by this for loop:
+``` C++
+    int m_array[M_SIZE];
+		int m_out[M_SIZE];
+		for (int i = 0; i < M_SIZE / 2; i++)
+		{
+			m_array[i] = M_SIZE / 2 - i;
+		}
+		for (int i = M_SIZE / 2; i < M_SIZE; i++)
+		{
+			m_array[i] = i - M_SIZE / 2;
+		}
+```
+The output is the sorted array, we can see it from the screenshot above.
 
-If you have modified any of the `CMakeLists.txt` files at all (aside from the
-list of `SOURCE_FILES`), you must test that your project can build in Moore
-100B/C. Beware of any build issues discussed on the Google Group.
 
-1. Open a GitHub pull request so that we can see that you have finished.
-   The title should be "Submission: YOUR NAME".
-2. Send an email to the TA (gmail: kainino1+cis565@) with:
-   * **Subject**: in the form of `[CIS565] Project 2: PENNKEY`
-   * Direct link to your pull request on GitHub
-   * In the form of a grade (0-100+) with comments, evaluate your own
-     performance on the project.
-   * Feedback on the project itself, if any.
diff --git a/bin_new/Activity1.nvact b/bin_new/Activity1.nvact
@@ -0,0 +1,125 @@
+<Model DefinitionId="3a3ca3f4-7ba7-4c09-a182-6ba59bf12599" DisplayName="Analysis Activity" xmlns="clr-namespace:Ark.PropertyModel;assembly=Ark">
+    <Property Name="Application" Value="C:\Users\Shun\Desktop\Project2-Stream-Compaction\bin_new\Release\cis565_stream_compaction_test.exe" />
+    <Property Name="EnableCudaApiContextAttachDetach" Value="True" />
+    <Property Name="EnableCudaApiContextManagement" Value="True" />
+    <Property Name="EnableCudaApiD3D10Interop" Value="True" />
+    <Property Name="EnableCudaApiD3D11Interop" Value="True" />
+    <Property Name="EnableCudaApiD3D9Interop" Value="True" />
+    <Property Name="EnableCudaApiDeviceManagement" Value="True" />
+    <Property Name="EnableCudaApiEventManagement" Value="True" />
+    <Property Name="EnableCudaApiExecutionControl" Value="True" />
+    <Property Name="EnableCudaApiGraphicsInterop" Value="True" />
+    <Property Name="EnableCudaApiMemoryManagement" Value="True" />
+    <Property Name="EnableCudaApiMemoryOperations" Value="True" />
+    <Property Name="EnableCudaApiModuleManagement" Value="True" />
+    <Property Name="EnableCudaApiOpenGlInterop" Value="True" />
+    <Property Name="EnableCudaApiPeerAccess" Value="True" />
+    <Property Name="EnableCudaApiProfiler" Value="True" />
+    <Property Name="EnableCudaApiStreamManagement" Value="True" />
+    <Property Name="EnableCudaApiSurfaceObjectManagement" Value="True" />
+    <Property Name="EnableCudaApiSurfaceReferenceManagement" Value="True" />
+    <Property Name="EnableCudaApiTextureObjectManagement" Value="True" />
+    <Property Name="EnableCudaApiTextureReferenceManagement" Value="True" />
+    <Property Name="EnableCudaApiTrace" Value="True" />
+    <Property Name="EnableCudaApiUnifiedAddressing" Value="True" />
+    <Property Name="EnableCudaCallbackTrace" Value="True" />
+    <Property Name="EnableCudaGpuTaskTrace" Value="True" />
+    <Property Name="EnableCudaRuntimeApiD3D10Interop" Value="True" />
+    <Property Name="EnableCudaRuntimeApiD3D11Interop" Value="True" />
+    <Property Name="EnableCudaRuntimeApiD3D9Interop" Value="True" />
+    <Property Name="EnableCudaRuntimeApiDeviceManagement" Value="True" />
+    <Property Name="EnableCudaRuntimeApiErrorHandling" Value="True" />
+    <Property Name="EnableCudaRuntimeApiEventManagement" Value="True" />
+    <Property Name="EnableCudaRuntimeApiExecutionControl" Value="True" />
+    <Property Name="EnableCudaRuntimeApiGraphicsInterop" Value="True" />
+    <Property Name="EnableCudaRuntimeApiMemoryManagement" Value="True" />
+    <Property Name="EnableCudaRuntimeApiMemoryOperations" Value="True" />
+    <Property Name="EnableCudaRuntimeApiOpenGlInterop" Value="True" />
+    <Property Name="EnableCudaRuntimeApiPeerAccess" Value="True" />
+    <Property Name="EnableCudaRuntimeApiProfiler" Value="True" />
+    <Property Name="EnableCudaRuntimeApiStreamManagement" Value="True" />
+    <Property Name="EnableCudaRuntimeApiSurfaceObjectManagement" Value="True" />
+    <Property Name="EnableCudaRuntimeApiSurfaceReferenceManagement" Value="True" />
+    <Property Name="EnableCudaRuntimeApiTextureObjectManagement" Value="True" />
+    <Property Name="EnableCudaRuntimeApiTextureReferenceManagement" Value="True" />
+    <Property Name="EnableCudaRuntimeApiTrace" Value="True" />
+    <Property Name="EnableCudaRuntimeApiUnifiedAddressing" Value="True" />
+    <Property Name="EnableCudaRuntimeApiVersionManagement" Value="True" />
+    <Property Name="EnableCudaSoftwareCounters" Value="True" />
+    <Property Name="EnableCudaTrace" Value="True" />
+    <Property Name="EnableDirectXApiBlit" Value="True" />
+    <Property Name="EnableDirectXApiClear" Value="True" />
+    <Property Name="EnableDirectXApiCommandList" Value="True" />
+    <Property Name="EnableDirectXApiDispatch" Value="True" />
+    <Property Name="EnableDirectXApiLock" Value="True" />
+    <Property Name="EnableDirectXApiPresent" Value="True" />
+    <Property Name="EnableDirectXApiRender" Value="True" />
+    <Property Name="EnableDirectXApiTrace" Value="True" />
+    <Property Name="EnableDirectXPerformanceMarkerTrace" Value="True" />
+    <Property Name="EnableDirectXPerformanceRangeTrace" Value="True" />
+    <Property Name="EnableDirectXShaderCompileTimes" Value="True" />
+    <Property Name="EnableDirectXWorkloadCopies" Value="True" />
+    <Property Name="EnableDirectXWorkloadCpuFrames" Value="True" />
+    <Property Name="EnableDirectXWorkloadDispatches" Value="True" />
+    <Property Name="EnableDirectXWorkloadDrawCalls" Value="True" />
+    <Property Name="EnableDirectXWorkloadGpuFrames" Value="True" />
+    <Property Name="EnableDirectXWorkloadPushBuffers" Value="True" />
+    <Property Name="EnableOpenClApiCommandQueue" Value="True" />
+    <Property Name="EnableOpenClApiContext" Value="True" />
+    <Property Name="EnableOpenClApiD3D10Interop" Value="True" />
+    <Property Name="EnableOpenClApiD3D11Interop" Value="True" />
+    <Property Name="EnableOpenClApiD3D9Interop" Value="True" />
+    <Property Name="EnableOpenClApiDevice" Value="True" />
+    <Property Name="EnableOpenClApiEnqueueCommands" Value="True" />
+    <Property Name="EnableOpenClApiEventObject" Value="True" />
+    <Property Name="EnableOpenClApiFlushAndFinish" Value="True" />
+    <Property Name="EnableOpenClApiKernelObject" Value="True" />
+    <Property Name="EnableOpenClApiMemoryObject" Value="True" />
+    <Property Name="EnableOpenClApiOpenGlInterop" Value="True" />
+    <Property Name="EnableOpenClApiPlatform" Value="True" />
+    <Property Name="EnableOpenClApiProgramObject" Value="True" />
+    <Property Name="EnableOpenClApiSampler" Value="True" />
+    <Property Name="EnableOpenClApiTrace" Value="True" />
+    <Property Name="EnableOpenClCommandTrace" Value="True" />
+    <Property Name="EnableOpenClResourceProgramBuildCallbackBinaryCodeTrace" Value="True" />
+    <Property Name="EnableOpenClResourceProgramBuildCallbackTrace" Value="True" />
+    <Property Name="EnableOpenClResourceProgramSourceCodeTrace" Value="True" />
+    <Property Name="EnableOpenClResourceRefCountTrace" Value="True" />
+    <Property Name="EnableOpenClResourceTrace" Value="True" />
+    <Property Name="EnableOpenGlApiBegin" Value="True" />
+    <Property Name="EnableOpenGlApiClear" Value="True" />
+    <Property Name="EnableOpenGlApiDisplayList" Value="True" />
+    <Property Name="EnableOpenGlApiErrors" Value="True" />
+    <Property Name="EnableOpenGlApiEval" Value="True" />
+    <Property Name="EnableOpenGlApiFence" Value="True" />
+    <Property Name="EnableOpenGlApiFlush" Value="True" />
+    <Property Name="EnableOpenGlApiFrameBufferObject" Value="True" />
+    <Property Name="EnableOpenGlApiGet" Value="True" />
+    <Property Name="EnableOpenGlApiLight" Value="True" />
+    <Property Name="EnableOpenGlApiPath" Value="True" />
+    <Property Name="EnableOpenGlApiPixel" Value="True" />
+    <Property Name="EnableOpenGlApiProgram" Value="True" />
+    <Property Name="EnableOpenGlApiRaster" Value="True" />
+    <Property Name="EnableOpenGlApiState" Value="True" />
+    <Property Name="EnableOpenGlApiTexture" Value="True" />
+    <Property Name="EnableOpenGlApiTrace" Value="True" />
+    <Property Name="EnableOpenGlApiTransform" Value="True" />
+    <Property Name="EnableOpenGlApiVertex" Value="True" />
+    <Property Name="EnableOpenGlApiVertexArray" Value="True" />
+    <Property Name="EnableOpenGlApiWgl" Value="True" />
+    <Property Name="EnableOpenGlTrace" Value="True" />
+    <Property Name="EnableOpenGlWorkloadCopies" Value="True" />
+    <Property Name="EnableOpenGlWorkloadCpuFrames" Value="True" />
+    <Property Name="EnableOpenGlWorkloadDispatches" Value="True" />
+    <Property Name="EnableOpenGlWorkloadDrawCalls" Value="True" />
+    <Property Name="EnableOpenGlWorkloadGpuFrames" Value="True" />
+    <Property Name="EnableOpenGlWorkloadPushBuffers" Value="True" />
+    <Property Name="EnableSystemCpuThreadTrace" Value="True" />
+    <Property Name="EnableSystemModuleTrace" Value="True" />
+    <Property Name="EnableToolsExtMarkerTrace" Value="True" />
+    <Property Name="EnableToolsExtPushPopRangeTrace" Value="True" />
+    <Property Name="EnableToolsExtResourceNaming" Value="True" />
+    <Property Name="EnableToolsExtStartEndRangeTrace" Value="True" />
+    <Property Name="ExtensionsToSynchronize" Value="*.*;" />
+    <Property Name="WorkingDirectory" Value="C:\Users\Shun\Desktop\Project2-Stream-Compaction\bin_new\" />
+</Model>
diff --git a/proj2_bottleneck_analysis.png b/proj2_bottleneck_analysis.png
diff --git a/proj2_compare.png b/proj2_compare.png
diff --git a/proj2_data.xlsx b/proj2_data.xlsx
diff --git a/proj2_extra_output.png b/proj2_extra_output.png
diff --git a/proj2_runningTime.png b/proj2_runningTime.png
diff --git a/proj2_testing_output.png b/proj2_testing_output.png
diff --git a/proj2_thrust_running.png b/proj2_thrust_running.png