Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
226 changes: 35 additions & 191 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -3,211 +3,55 @@ CUDA Stream Compaction

**University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 2**

* (TODO) YOUR NAME HERE
* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab)
* Ziye Zhou
* Tested on: Windows 8.1, i7-4910 @ 2.90GHz 32GB, GTX 880M 8192MB (Alienware)

### (TODO: Your README)

Include analysis, etc. (Remember, this is public, so don't put
anything here that you don't want to share with the world.)

Instructions (delete me)
========================

This is due Sunday, September 13 at midnight.

**Summary:** In this project, you'll implement GPU stream compaction in CUDA,
from scratch. This algorithm is widely used, and will be important for
accelerating your path tracer project.

Your stream compaction implementations in this project will simply remove `0`s
from an array of `int`s. In the path tracer, you will remove terminated paths
from an array of rays.

In addition to being useful for your path tracer, this project is meant to
reorient your algorithmic thinking to the way of the GPU. On GPUs, many
algorithms can benefit from massive parallelism and, in particular, data
parallelism: executing the same code many times simultaneously with different
data.

You'll implement a few different versions of the *Scan* (*Prefix Sum*)
algorithm. First, you'll implement a CPU version of the algorithm to reinforce
your understanding. Then, you'll write a few GPU implementations: "naive" and
"work-efficient." Finally, you'll use some of these to implement GPU stream
compaction.

**Algorithm overview & details:** There are two primary references for details
on the implementation of scan and stream compaction.

* The [slides on Parallel Algorithms](https://github.com/CIS565-Fall-2015/cis565-fall-2015.github.io/raw/master/lectures/2-Parallel-Algorithms.pptx)
for Scan, Stream Compaction, and Work-Efficient Parallel Scan.
* GPU Gems 3, Chapter 39 - [Parallel Prefix Sum (Scan) with CUDA](http://http.developer.nvidia.com/GPUGems3/gpugems3_ch39.html).

Your GPU stream compaction implementation will live inside of the
`stream_compaction` subproject. This way, you will be able to easily copy it
over for use in your GPU path tracer.


## Part 0: The Usual

This project (and all other CUDA projects in this course) requires an NVIDIA
graphics card with CUDA capability. Any card with Compute Capability 2.0
(`sm_20`) or greater will work. Check your GPU on this
[compatibility table](https://developer.nvidia.com/cuda-gpus).
If you do not have a personal machine with these specs, you may use those
computers in the Moore 100B/C which have supported GPUs.

**HOWEVER**: If you need to use the lab computer for your development, you will
not presently be able to do GPU performance profiling. This will be very
important for debugging performance bottlenecks in your program.

### Useful existing code

* `stream_compaction/common.h`
* `checkCUDAError` macro: checks for CUDA errors and exits if there were any.
* `ilog2ceil(x)`: computes the ceiling of log2(x), as an integer.
* `main.cpp`
* Some testing code for your implementations.


## Part 1: CPU Scan & Stream Compaction

This stream compaction method will remove `0`s from an array of `int`s.

In `stream_compaction/cpu.cu`, implement:

* `StreamCompaction::CPU::scan`: compute an exclusive prefix sum.
* `StreamCompaction::CPU::compactWithoutScan`: stream compaction without using
the `scan` function.
* `StreamCompaction::CPU::compactWithScan`: stream compaction using the `scan`
function. Map the input array to an array of 0s and 1s, scan it, and use
scatter to produce the output. You will need a **CPU** scatter implementation
for this (see slides or GPU Gems chapter for an explanation).

These implementations should only be a few lines long.


## Part 2: Naive GPU Scan Algorithm

In `stream_compaction/naive.cu`, implement `StreamCompaction::Naive::scan`

This uses the "Naive" algorithm from GPU Gems 3, Section 39.2.1. We haven't yet
taught shared memory, and you **shouldn't use it yet**. Example 39-1 uses
shared memory, but is limited to operating on very small arrays! Instead, write
this using global memory only. As a result of this, you will have to do
`ilog2ceil(n)` separate kernel invocations.

Beware of errors in Example 39-1 in the book; both the pseudocode and the CUDA
code in the online version of Chapter 39 are known to have a few small errors
(in superscripting, missing braces, bad indentation, etc.)

Since the parallel scan algorithm operates on a binary tree structure, it works
best with arrays with power-of-two length. Make sure your implementation works
on non-power-of-two sized arrays (see `ilog2ceil`). This requires extra memory
- your intermediate array sizes will need to be rounded to the next power of
two.


## Part 3: Work-Efficient GPU Scan & Stream Compaction

### 3.1. Scan

In `stream_compaction/efficient.cu`, implement
`StreamCompaction::Efficient::scan`

All of the text in Part 2 applies.

* This uses the "Work-Efficient" algorithm from GPU Gems 3, Section 39.2.2.
* Beware of errors in Example 39-2.
* Test non-power-of-two sized arrays.

### 3.2. Stream Compaction

This stream compaction method will remove `0`s from an array of `int`s.

In `stream_compaction/efficient.cu`, implement
`StreamCompaction::Efficient::compact`

For compaction, you will also need to implement the scatter algorithm presented
in the slides and the GPU Gems chapter.

In `stream_compaction/common.cu`, implement these for use in `compact`:

* `StreamCompaction::Common::kernMapToBoolean`
* `StreamCompaction::Common::kernScatter`


## Part 4: Using Thrust's Implementation

In `stream_compaction/thrust.cu`, implement:

* `StreamCompaction::Thrust::scan`

This should be a very short function which wraps a call to the Thrust library
function `thrust::exclusive_scan(first, last, result)`.

To measure timing, be sure to exclude memory operations by passing
`exclusive_scan` a `thrust::device_vector` (which is already allocated on the
GPU). You can create a `thrust::device_vector` by creating a
`thrust::host_vector` from the given pointer, then casting it.


## Part 5: Radix Sort (Extra Credit) (+10)

Add an additional module to the `stream_compaction` subproject. Implement radix
sort using one of your scan implementations. Add tests to check its correctness.


## Write-up

1. Update all of the TODOs at the top of this README.
2. Add a description of this project including a list of its features.
3. Add your performance analysis (see below).

All extra credit features must be documented in your README, explaining its
value (with performance comparison, if applicable!) and showing an example how
it works. For radix sort, show how it is called and an example of its output.

Always profile with Release mode builds and run without debugging.
### Description
1. Implemented CPU Scan & Stream Compaction
2. Implemented Naive GPU Scan & Stream Compaction
3. Implemented Work-Efficient GPU Scan & Stream Compaction
4. Tested Thrust's Implementation
5. Implemented Radix Sort (Extra Credit)
6. Compared the performance of CPU & GPU on Scan Algorithm

### Questions

* Roughly optimize the block sizes of each of your implementations for minimal
run time on your GPU.
* (You shouldn't compare unoptimized implementations to each other!)

* Compare all of these GPU Scan implementations (Naive, Work-Efficient, and
Thrust) to the serial CPU version of Scan. Plot a graph of the comparison
(with array size on the independent axis).
* You should use CUDA events for timing. Be sure **not** to include any
explicit memory operations in your performance measurements, for
comparability.
* To guess at what might be happening inside the Thrust implementation, take
a look at the Nsight timeline for its execution.
![alt tag](https://github.com/ziyezhou-Jerry/Project2-Stream-Compaction/blob/master/proj2_compare.png?raw=true)
![alt tag](https://github.com/ziyezhou-Jerry/Project2-Stream-Compaction/blob/master/proj2_thrust_running.png?raw=true)

* Write a brief explanation of the phenomena you see here.
* Can you find the performance bottlenecks? Is it memory I/O? Computation? Is
it different for each implementation?
From the performance analysis given below, I think the bootleneck is the memory I/O part. As we can see from the timeline, the computation time only takes a little portion of the whole time, but the memory I/O takes a lot. This is nearly the same for all implementations.

![alt tag](https://github.com/ziyezhou-Jerry/Project2-Stream-Compaction/blob/master/proj2_bottleneck_analysis.png?raw=true)

* Paste the output of the test program into a triple-backtick block in your
README.
* If you add your own tests (e.g. for radix sort or to test additional corner
cases), be sure to mention it explicitly.

These questions should help guide you in performance analysis on future
assignments, as well.

## Submit
Apart from the given test cases, I have also added my own test case for the radix sort. I used the STL sort method to get the CPU version of sort result and compare it with the GPU radix sort.
<<<![alt tag](https://github.com/ziyezhou-Jerry/Project2-Stream-Compaction/blob/master/proj2_testing_output.png?raw=true) >>>

### Extra Credit
I have also Implemented the Radix Sort. Within the method, I am using the thrust::scan method to do the prefix-sum. The comparison with the CPU STL sort can be seen below:
![alt tag](https://github.com/ziyezhou-Jerry/Project2-Stream-Compaction/blob/master/proj2_extra_output.png?raw=true)

The input I used is manually generate by this for loop:
``` C++
int m_array[M_SIZE];
int m_out[M_SIZE];
for (int i = 0; i < M_SIZE / 2; i++)
{
m_array[i] = M_SIZE / 2 - i;
}
for (int i = M_SIZE / 2; i < M_SIZE; i++)
{
m_array[i] = i - M_SIZE / 2;
}
```
The output is the sorted array, we can see it from the screenshot above.

If you have modified any of the `CMakeLists.txt` files at all (aside from the
list of `SOURCE_FILES`), you must test that your project can build in Moore
100B/C. Beware of any build issues discussed on the Google Group.

1. Open a GitHub pull request so that we can see that you have finished.
The title should be "Submission: YOUR NAME".
2. Send an email to the TA (gmail: kainino1+cis565@) with:
* **Subject**: in the form of `[CIS565] Project 2: PENNKEY`
* Direct link to your pull request on GitHub
* In the form of a grade (0-100+) with comments, evaluate your own
performance on the project.
* Feedback on the project itself, if any.
125 changes: 125 additions & 0 deletions bin_new/Activity1.nvact
Original file line number Diff line number Diff line change
@@ -0,0 +1,125 @@
<Model DefinitionId="3a3ca3f4-7ba7-4c09-a182-6ba59bf12599" DisplayName="Analysis Activity" xmlns="clr-namespace:Ark.PropertyModel;assembly=Ark">
<Property Name="Application" Value="C:\Users\Shun\Desktop\Project2-Stream-Compaction\bin_new\Release\cis565_stream_compaction_test.exe" />
<Property Name="EnableCudaApiContextAttachDetach" Value="True" />
<Property Name="EnableCudaApiContextManagement" Value="True" />
<Property Name="EnableCudaApiD3D10Interop" Value="True" />
<Property Name="EnableCudaApiD3D11Interop" Value="True" />
<Property Name="EnableCudaApiD3D9Interop" Value="True" />
<Property Name="EnableCudaApiDeviceManagement" Value="True" />
<Property Name="EnableCudaApiEventManagement" Value="True" />
<Property Name="EnableCudaApiExecutionControl" Value="True" />
<Property Name="EnableCudaApiGraphicsInterop" Value="True" />
<Property Name="EnableCudaApiMemoryManagement" Value="True" />
<Property Name="EnableCudaApiMemoryOperations" Value="True" />
<Property Name="EnableCudaApiModuleManagement" Value="True" />
<Property Name="EnableCudaApiOpenGlInterop" Value="True" />
<Property Name="EnableCudaApiPeerAccess" Value="True" />
<Property Name="EnableCudaApiProfiler" Value="True" />
<Property Name="EnableCudaApiStreamManagement" Value="True" />
<Property Name="EnableCudaApiSurfaceObjectManagement" Value="True" />
<Property Name="EnableCudaApiSurfaceReferenceManagement" Value="True" />
<Property Name="EnableCudaApiTextureObjectManagement" Value="True" />
<Property Name="EnableCudaApiTextureReferenceManagement" Value="True" />
<Property Name="EnableCudaApiTrace" Value="True" />
<Property Name="EnableCudaApiUnifiedAddressing" Value="True" />
<Property Name="EnableCudaCallbackTrace" Value="True" />
<Property Name="EnableCudaGpuTaskTrace" Value="True" />
<Property Name="EnableCudaRuntimeApiD3D10Interop" Value="True" />
<Property Name="EnableCudaRuntimeApiD3D11Interop" Value="True" />
<Property Name="EnableCudaRuntimeApiD3D9Interop" Value="True" />
<Property Name="EnableCudaRuntimeApiDeviceManagement" Value="True" />
<Property Name="EnableCudaRuntimeApiErrorHandling" Value="True" />
<Property Name="EnableCudaRuntimeApiEventManagement" Value="True" />
<Property Name="EnableCudaRuntimeApiExecutionControl" Value="True" />
<Property Name="EnableCudaRuntimeApiGraphicsInterop" Value="True" />
<Property Name="EnableCudaRuntimeApiMemoryManagement" Value="True" />
<Property Name="EnableCudaRuntimeApiMemoryOperations" Value="True" />
<Property Name="EnableCudaRuntimeApiOpenGlInterop" Value="True" />
<Property Name="EnableCudaRuntimeApiPeerAccess" Value="True" />
<Property Name="EnableCudaRuntimeApiProfiler" Value="True" />
<Property Name="EnableCudaRuntimeApiStreamManagement" Value="True" />
<Property Name="EnableCudaRuntimeApiSurfaceObjectManagement" Value="True" />
<Property Name="EnableCudaRuntimeApiSurfaceReferenceManagement" Value="True" />
<Property Name="EnableCudaRuntimeApiTextureObjectManagement" Value="True" />
<Property Name="EnableCudaRuntimeApiTextureReferenceManagement" Value="True" />
<Property Name="EnableCudaRuntimeApiTrace" Value="True" />
<Property Name="EnableCudaRuntimeApiUnifiedAddressing" Value="True" />
<Property Name="EnableCudaRuntimeApiVersionManagement" Value="True" />
<Property Name="EnableCudaSoftwareCounters" Value="True" />
<Property Name="EnableCudaTrace" Value="True" />
<Property Name="EnableDirectXApiBlit" Value="True" />
<Property Name="EnableDirectXApiClear" Value="True" />
<Property Name="EnableDirectXApiCommandList" Value="True" />
<Property Name="EnableDirectXApiDispatch" Value="True" />
<Property Name="EnableDirectXApiLock" Value="True" />
<Property Name="EnableDirectXApiPresent" Value="True" />
<Property Name="EnableDirectXApiRender" Value="True" />
<Property Name="EnableDirectXApiTrace" Value="True" />
<Property Name="EnableDirectXPerformanceMarkerTrace" Value="True" />
<Property Name="EnableDirectXPerformanceRangeTrace" Value="True" />
<Property Name="EnableDirectXShaderCompileTimes" Value="True" />
<Property Name="EnableDirectXWorkloadCopies" Value="True" />
<Property Name="EnableDirectXWorkloadCpuFrames" Value="True" />
<Property Name="EnableDirectXWorkloadDispatches" Value="True" />
<Property Name="EnableDirectXWorkloadDrawCalls" Value="True" />
<Property Name="EnableDirectXWorkloadGpuFrames" Value="True" />
<Property Name="EnableDirectXWorkloadPushBuffers" Value="True" />
<Property Name="EnableOpenClApiCommandQueue" Value="True" />
<Property Name="EnableOpenClApiContext" Value="True" />
<Property Name="EnableOpenClApiD3D10Interop" Value="True" />
<Property Name="EnableOpenClApiD3D11Interop" Value="True" />
<Property Name="EnableOpenClApiD3D9Interop" Value="True" />
<Property Name="EnableOpenClApiDevice" Value="True" />
<Property Name="EnableOpenClApiEnqueueCommands" Value="True" />
<Property Name="EnableOpenClApiEventObject" Value="True" />
<Property Name="EnableOpenClApiFlushAndFinish" Value="True" />
<Property Name="EnableOpenClApiKernelObject" Value="True" />
<Property Name="EnableOpenClApiMemoryObject" Value="True" />
<Property Name="EnableOpenClApiOpenGlInterop" Value="True" />
<Property Name="EnableOpenClApiPlatform" Value="True" />
<Property Name="EnableOpenClApiProgramObject" Value="True" />
<Property Name="EnableOpenClApiSampler" Value="True" />
<Property Name="EnableOpenClApiTrace" Value="True" />
<Property Name="EnableOpenClCommandTrace" Value="True" />
<Property Name="EnableOpenClResourceProgramBuildCallbackBinaryCodeTrace" Value="True" />
<Property Name="EnableOpenClResourceProgramBuildCallbackTrace" Value="True" />
<Property Name="EnableOpenClResourceProgramSourceCodeTrace" Value="True" />
<Property Name="EnableOpenClResourceRefCountTrace" Value="True" />
<Property Name="EnableOpenClResourceTrace" Value="True" />
<Property Name="EnableOpenGlApiBegin" Value="True" />
<Property Name="EnableOpenGlApiClear" Value="True" />
<Property Name="EnableOpenGlApiDisplayList" Value="True" />
<Property Name="EnableOpenGlApiErrors" Value="True" />
<Property Name="EnableOpenGlApiEval" Value="True" />
<Property Name="EnableOpenGlApiFence" Value="True" />
<Property Name="EnableOpenGlApiFlush" Value="True" />
<Property Name="EnableOpenGlApiFrameBufferObject" Value="True" />
<Property Name="EnableOpenGlApiGet" Value="True" />
<Property Name="EnableOpenGlApiLight" Value="True" />
<Property Name="EnableOpenGlApiPath" Value="True" />
<Property Name="EnableOpenGlApiPixel" Value="True" />
<Property Name="EnableOpenGlApiProgram" Value="True" />
<Property Name="EnableOpenGlApiRaster" Value="True" />
<Property Name="EnableOpenGlApiState" Value="True" />
<Property Name="EnableOpenGlApiTexture" Value="True" />
<Property Name="EnableOpenGlApiTrace" Value="True" />
<Property Name="EnableOpenGlApiTransform" Value="True" />
<Property Name="EnableOpenGlApiVertex" Value="True" />
<Property Name="EnableOpenGlApiVertexArray" Value="True" />
<Property Name="EnableOpenGlApiWgl" Value="True" />
<Property Name="EnableOpenGlTrace" Value="True" />
<Property Name="EnableOpenGlWorkloadCopies" Value="True" />
<Property Name="EnableOpenGlWorkloadCpuFrames" Value="True" />
<Property Name="EnableOpenGlWorkloadDispatches" Value="True" />
<Property Name="EnableOpenGlWorkloadDrawCalls" Value="True" />
<Property Name="EnableOpenGlWorkloadGpuFrames" Value="True" />
<Property Name="EnableOpenGlWorkloadPushBuffers" Value="True" />
<Property Name="EnableSystemCpuThreadTrace" Value="True" />
<Property Name="EnableSystemModuleTrace" Value="True" />
<Property Name="EnableToolsExtMarkerTrace" Value="True" />
<Property Name="EnableToolsExtPushPopRangeTrace" Value="True" />
<Property Name="EnableToolsExtResourceNaming" Value="True" />
<Property Name="EnableToolsExtStartEndRangeTrace" Value="True" />
<Property Name="ExtensionsToSynchronize" Value="*.*;" />
<Property Name="WorkingDirectory" Value="C:\Users\Shun\Desktop\Project2-Stream-Compaction\bin_new\" />
</Model>
Binary file added proj2_bottleneck_analysis.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file added proj2_compare.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file added proj2_data.xlsx
Binary file not shown.
Binary file added proj2_extra_output.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file added proj2_runningTime.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file added proj2_testing_output.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file added proj2_thrust_running.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Loading