diff --git a/README.md b/README.md
index 6e02afa..0d60152 100644
--- a/README.md
+++ b/README.md
@@ -1,133 +1,24 @@
-Project-2
-=========
-
 A Study in Parallel Algorithms : Stream Compaction
 
-# INTRODUCTION
-Many of the algorithms you have learned thus far in your career have typically
-been developed from a serial standpoint. When it comes to GPUs, we are mainly
-looking at massively parallel work. Thus, it is necessary to reorient our
-thinking. In this project, we will be implementing a couple different versions
-of prefix sum. We will start with a simple single thread serial CPU version,
-and then move to a naive GPU version. Each part of this homework is meant to
-follow the logic of the previous parts, so please do not do this homework out of
-order.
-
-This project will serve as a stream compaction library that you may use (and
-will want to use) in your
-future projects. For that reason, we suggest you create proper header and CUDA
-files so that you can reuse this code later. You may want to create a separate
-cpp file that contains your main function so that you can test the code you
-write.
-
-# OVERVIEW
-Stream compaction is broken down into two parts: (1) scan, and (2) scatter.
-
-## SCAN
-Scan or prefix sum is the summation of the elements in an array such that the
-resulting array is the summation of the terms before it. Prefix sum can either
-be inclusive, meaning the current term is a summation of all the elements before
-it and itself, or exclusive, meaning the current term is a summation of all
-elements before it excluding itself.
-
-Inclusive:
-
-In : [ 3 4 6 7 9 10 ]
-
-Out : [ 3 7 13 20 29 39 ]
-
-Exclusive
-
-In : [ 3 4 6 7 9 10 ]
-
-Out : [ 0 3 7 13 20 29 ]
-
-Note that the resulting prefix sum will always be n + 1 elements if the input
-array is of length n. Similarly, the first element of the exclusive prefix sum
-will always be 0. In the following sections, all references to prefix sum will
-be to the exclusive version of prefix sum.
-
-## SCATTER
-The scatter section of stream compaction takes the results of the previous scan
-in order to reorder the elements to form a compact array.
-
-For example, let's say we have the following array:
-[ 0 0 3 4 0 6 6 7 0 1 ]
-
-We would only like to consider the non-zero elements in this zero, so we would
-like to compact it into the following array:
-[ 3 4 6 6 7 1 ]
-
-We can perform a transform on input array to transform it into a boolean array:
-
-In : [ 0 0 3 4 0 6 6 7 0 1 ]
-
-Out : [ 0 0 1 1 0 1 1 1 0 1 ]
-
-Performing a scan on the output, we get the following array :
-
-In : [ 0 0 1 1 0 1 1 1 0 1 ]
-
-Out : [ 0 0 0 1 2 2 3 4 5 5 ]
-
-Notice that the output array produces a corresponding index array that we can
-use to create the resulting array for stream compaction.
-
-# PART 1 : REVIEW OF PREFIX SUM
-Given the definition of exclusive prefix sum, please write a serial CPU version
-of prefix sum. You may write this in the cpp file to separate this from the
-CUDA code you will be writing in your .cu file.
-
-# PART 2 : NAIVE PREFIX SUM
-We will now parallelize this the previous section's code. Recall from lecture
-that we can parallelize this using a series of kernel calls. In this portion,
-you are NOT allowed to use shared memory.
-
-### Questions
-* Compare this version to the serial version of exclusive prefix scan. Please
-  include a table of how the runtimes compare on different lengths of arrays.
-* Plot a graph of the comparison and write a short explanation of the phenomenon you
-  see here.
-
-# PART 3 : OPTIMIZING PREFIX SUM
-In the previous section we did not take into account shared memory. In the
-previous section, we kept everything in global memory, which is much slower than
-shared memory.
-
-## PART 3a : Write prefix sum for a single block
-Shared memory is accessible to threads of a block. Please write a version of
-prefix sum that works on a single block.
+There are two main components of stream compaction: scan and scatter.
+
-## PART 3b : Generalizing to arrays of any length.
-Taking the previous portion, please write a version that generalizes prefix sum
-to arbitrary length arrays, this includes arrays that will not fit on one block.
+Here is a comparison of the various methods I used to scan:
+
-### Questions
-* Compare this version to the parallel prefix sum using global memory.
-* Plot a graph of the comparison and write a short explanation of the phenomenon
-  you see here.
+![](http://i.imgur.com/AaR3gk0.png)
+
-# PART 4 : ADDING SCATTER
-First create a serial version of scatter by expanding the serial version of
-prefix sum. Then create a GPU version of scatter. Combine the function call
-such that, given an array, you can call stream compact and it will compact the
-array for you. Finally, write a version using thrust.
+As you can see, the serial version is faster for small arrays, but is quickly outmatched as the array length grows. The global
+memory version is always slightly slower than the shared memory version, which makes sense: the only difference between the two is the
+cost of the more frequent global memory fetches. The work-efficient algorithm I implemented must have a bug in it, because
+it only becomes comparable to the naive shared memory version once the array is over 10 million elements long. Further investigation is
+needed.
+
-### Questions
-* Compare your version of stream compact to your version using thrust. How do
-  they compare? How might you optimize yours more, or how might thrust's stream
-  compact be optimized.
+And here is a comparison of my scatter implementation and thrust's. I suspect I am calling a slow
+thrust version of this, because I don't think my basic CUDA version should be as fast as thrust's. But, to be honest,
+I'm not sure how else to optimize my implementation of scatter any further. It has three global memory reads that are absolutely necessary,
+and a branch.
+
-# EXTRA CREDIT (+10)
-For extra credit, please optimize your prefix sum for work parallelism and to
-deal with bank conflicts. Information on this can be found in the GPU Gems
-chapter listed in the references.
+![](http://i.imgur.com/V55kt3w.png)
+
-# SUBMISSION
-Please answer all the questions in each of the subsections above and write your
-answers in the README by overwriting the README file. In future projects, we
-expect your analysis to be similar to the one we have led you through in this
-project. Like other projects, please open a pull request and email Harmony.
 # REFERENCES
 "Parallel Prefix Sum (Scan) with CUDA." GPU Gems 3.
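For reference, the scatter step timed above boils down to a kernel of this shape. This is a minimal sketch with illustrative names, not the repo's kernel (the real one is `streamCompaction` in `src/streamCompaction.cu` below); it shows exactly the three mandatory global reads (`flags[k]`, `indices[k]`, `in[k]`) and the single branch mentioned in the README:

```cuda
// Minimal scatter sketch: flags marks survivors, indices holds their
// exclusive-scan output slots. Three global reads and one branch per thread.
__global__ void scatterSketch(const int* in, const int* flags,
                              const int* indices, int* out, int n){
    int k = (blockIdx.x * blockDim.x) + threadIdx.x;
    if (k < n && flags[k]){          // the unavoidable branch
        out[indices[k]] = in[k];     // write survivor to its compacted slot
    }
}
```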
diff --git a/src/main.cpp b/src/main.cpp
new file mode 100644
index 0000000..fcaa421
--- /dev/null
+++ b/src/main.cpp
@@ -0,0 +1,309 @@
+#include <iostream>
+
+#include "streamCompaction.h"
+
+using namespace std;
+
+void serialSum(){
+    int numElements = 256;
+
+    dataPacket * ints = new dataPacket[numElements];
+    for (int i=0; i<numElements; i+=1){
+        ints[i] = dataPacket(i);
+    }
+
+    // (the serial scan body was lost in transcription)
+
+    delete [] ints;
+}
+
+// NOTE: the test harness below is reconstructed from surviving fragments;
+// the function names are placeholders, not the original identifiers.
+void testKillAll(){
+    int numElements = 256;
+
+    dataPacket * ints = new dataPacket[numElements];
+    for (int i=0; i<numElements; i+=1){
+        ints[i] = dataPacket(i);
+    }
+
+    DataStream ds(numElements, ints);
+    while (ds.numAlive() > 0){
+        int toKill = rand() % ds.numAlive();
+        ds.kill(toKill);
+        ds.compactWorkEfficientArbitrary ();
+
+        cout<<"killing "<<toKill<<endl;
+    }
+}
+
+void testKillBounded(){
+    int numElements = 256;
+
+    dataPacket * ints = new dataPacket[numElements];
+    for (int i=0; i<numElements; i+=1){
+        ints[i] = dataPacket(i);
+    }
+
+    DataStream ds(numElements, ints);
+
+    int bound = 0;
+    while (ds.numAlive() > 0 && bound < 10){
+
+        int toKill = rand() % ds.numAlive();
+        ds.kill(toKill);
+        dataPacket cur;
+        ds.getData(toKill, cur);
+        cout<<"killed "<<cur.index<<endl;
+        bound += 1;
+    }
+}
+
+void testKillAndCompactBounded(){
+    int numElements = 256;
+
+    dataPacket * ints = new dataPacket[numElements];
+    for (int i=0; i<numElements; i+=1){
+        ints[i] = dataPacket(i);
+    }
+
+    DataStream ds(numElements, ints);
+
+    int bound = 0;
+    while (ds.numAlive() > 0 && bound < 20){
+        int toKill = rand() % ds.numAlive();
+        // toKill = 10;
+        ds.kill(toKill);
+        ds.compactWorkEfficientArbitrary ();
+
+        dataPacket cur;
+        ds.getData(toKill, cur);
+        cout<<"killing "<<cur.index<<endl;
+        bound += 1;
+    }
+}
+
+int main(int argc, char** argv){
+    testStreamCompaction();
+    return 0;
+}
diff --git a/src/streamCompaction.cu b/src/streamCompaction.cu
new file mode 100755
--- /dev/null
+++ b/src/streamCompaction.cu
+#include <stdio.h>
+#include <cuda_runtime.h>
+#include <cmath>
+#include <ctime>
+#include <iostream>
+
+#include "streamCompaction.h"
+
+using namespace std;
+#include <thrust/copy.h>
+
+#define NUM_BANKS 16
+#define LOG_NUM_BANKS 4
+// Note: + binds tighter than >>, so this expands as (n) >> (NUM_BANKS + (n)) >> ...;
+// this is a known erratum carried over from the GPU Gems 3 listing.
+#define CONFLICT_FREE_OFFSET(n) \
+    ((n) >> NUM_BANKS + (n) >> (2 * LOG_NUM_BANKS))
+
+void checkCUDAError(const char *msg) {
+    cudaError_t err = cudaGetLastError();
+    if( cudaSuccess != err) {
+        fprintf(stderr, "Cuda error: %s: %s.\n", msg, cudaGetErrorString( err) );
+    }
+}
+
+// One step of the naive scan: every element at k >= d1 adds in the element d1 slots back.
+__global__ void sum(int* in, int* out, int n, int d1){
+    int k = (blockIdx.x * blockDim.x) + threadIdx.x;
+    if (k<n){
+        int ink = in[k];
+        if (k>=d1){
+            out[k] = in[k-d1] + ink;
+        }
+        else{
+            out[k] = ink;
+        }
+    }
+}
+
+// Turns an inclusive scan into an exclusive one by shifting right and seeding a 0.
+__global__ void shift(int* in, int* out, int n){
+    int k = (blockIdx.x * blockDim.x) + threadIdx.x;
+
+    out[0] = 0;
+    if (k<n && k>0){
+        out[k] = in[k-1];
+    }
+}
+
+__global__ void naiveSumGlobal(int* in, int* out, int n){
+
+    int index = (blockIdx.x * blockDim.x) + threadIdx.x;
+
+    int logn = ceil(log(float(n))/log(2.0f));
+    for (int d=1; d<=logn; d++){
+
+        int offset = powf(2.0f, d-1);
+
+        if (index >= offset){
+            out[index] = in[index-offset] + in[index];
+        }
+        else{
+            out[index] = in[index];
+        }
+        __syncthreads();
+
+        // Caution: this pointer swap is per-thread, and __syncthreads() only
+        // synchronizes within one block, so this loop is only safe on a single block.
+        int* temp = in;
+        in = out;
+        out = temp;
+    }
+}
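Because `__syncthreads()` only synchronizes one block, the naive scan needs the host to provide the grid-wide barrier: one `sum` launch per step, ping-ponging the buffers between launches. A minimal driver sketch (my names and launch sizes; the real loops live in the `DataStream::compact*` methods further down):

```cuda
// Host-side driver for the naive scan: ceil(log2(n)) launches of sum,
// swapping in/out after each so every step reads the previous step's output,
// then one shift launch to turn the inclusive result into an exclusive scan.
void naiveScanDriver(int* dIn, int* dOut, int n){
    int blocks = (n + THREADS_PER_BLOCK - 1) / THREADS_PER_BLOCK;
    int steps = (int)ceil(log((float)n) / log(2.0f));
    for (int d = 1; d <= steps; d++){
        sum<<<blocks, THREADS_PER_BLOCK>>>(dIn, dOut, n, (int)powf(2.0f, d - 1));
        cudaDeviceSynchronize();                 // grid-wide barrier between steps
        int* tmp = dIn; dIn = dOut; dOut = tmp;  // ping-pong the buffers
    }
    shift<<<blocks, THREADS_PER_BLOCK>>>(dIn, dOut, n);
}
```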
+
+__global__ void naiveSumSharedSingleBlock(int* in, int* out, int n){
+
+    int index = threadIdx.x;
+
+    if (index >= n) return;
+
+    extern __shared__ int shared[];
+    int *tempIn = &shared[0];
+    int *tempOut = &shared[n];
+
+    // Load shifted right by one so the result is an exclusive scan.
+    tempOut[index] = (index > 0) ? in[index-1] : 0;
+
+    __syncthreads();
+
+    for (int offset = 1; offset <= n; offset *= 2){
+        int* temp = tempIn;
+        tempIn = tempOut;
+        tempOut = temp;
+
+        if (index >= offset){
+            tempOut[index] = tempIn[index-offset] + tempIn[index];
+        }
+        else{
+            tempOut[index] = tempIn[index];
+        }
+        __syncthreads();
+    }
+    out[index] = tempOut[index];
+}
+
+__global__ void naiveSumSharedArbitrary(int* in, int* out, int n, int* sums=0){
+
+    int localIndex = threadIdx.x;
+    int globalIndex = (blockIdx.x * blockDim.x) + threadIdx.x;
+
+    extern __shared__ int shared[];
+    int *tempIn = &shared[0];
+    int *tempOut = &shared[n];
+
+    tempOut[localIndex] = in[globalIndex];
+
+    __syncthreads();
+
+    for (int offset = 1; offset < n; offset *= 2){
+        int* temp = tempIn;
+        tempIn = tempOut;
+        tempOut = temp;
+
+        if (localIndex >= offset){
+            tempOut[localIndex] = tempIn[localIndex-offset] + tempIn[localIndex];
+        }
+        else{
+            tempOut[localIndex] = tempIn[localIndex];
+        }
+        __syncthreads();
+    }
+
+    // Each block records its total so the per-block scans can be stitched together.
+    if (sums) sums[blockIdx.x] = tempOut[n-1];
+    out[globalIndex] = tempOut[localIndex];
+}
+
+__global__ void workEfficientSumSingleBlock(int* in, int* out, int n){
+
+    extern __shared__ float temp[];
+    int index = (blockIdx.x * blockDim.x) + threadIdx.x;
+    int offset = 1;
+
+    if (2*index+1<=n){
+        temp[2*index] = in[2*index];
+        temp[2*index+1] = in[2*index+1];
+
+        // Up-sweep (reduce) phase.
+        for (int d = n>>1; d>0; d >>= 1){
+            __syncthreads();
+            if (index < d){
+                int ai = offset * (2*index+1) - 1;
+                int bi = offset * (2*index+2) - 1;
+
+                temp[bi] += temp[ai];
+            }
+            offset *= 2;
+        }
+
+        if (index == 0) temp[n - 1] = 0;
+
+        // Down-sweep phase.
+        for (int d = 1; d < n; d *= 2){
+            offset >>= 1;
+            __syncthreads();
+            if (index < d){
+
+                int ai = offset * (2*index+1) - 1;
+                int bi = offset * (2*index+2) - 1;
+
+                if (ai < n && bi < n){
+                    float t = temp[ai];
+                    temp[ai] = temp[bi];
+                    temp[bi] += t;
+                }
+            }
+        }
+        __syncthreads();
+
+        out[2*index] = temp[2*index];
+        out[2*index+1] = temp[2*index+1];
+    }
+
+}
+
+__global__ void workEfficientArbitrary(int* in, int* out, int n, int* sums=0){
+
+    extern __shared__ float temp[];
+
+    int offset = 1;
+    int index = threadIdx.x;
+
+    int indexA = index;
+    int indexB = index + (n/2);
+    int bankOffsetA = CONFLICT_FREE_OFFSET(indexA);
+    int bankOffsetB = CONFLICT_FREE_OFFSET(indexB);
+    // Caution: these loads (and the stores at the bottom) index in/out with the
+    // local offsets only, so every block reads and writes the first n elements;
+    // a per-block base offset (blockIdx.x * n) appears to be missing, which may
+    // be the bug mentioned in the README.
+    temp[indexA + bankOffsetA] = in[indexA];
+    temp[indexB + bankOffsetB] = in[indexB];
+
+    for (int d = n>>1; d>0; d >>= 1){
+        __syncthreads();
+        if (index < d){
+            int ai = offset * (2*index+1) - 1;
+            int bi = offset * (2*index+2) - 1;
+
+            ai += CONFLICT_FREE_OFFSET(ai);
+            bi += CONFLICT_FREE_OFFSET(bi);
+
+            temp[bi] += temp[ai];
+        }
+        offset *= 2;
+    }
+
+    if (index == 0){
+        if (sums) sums[blockIdx.x] = temp[n - 1 + CONFLICT_FREE_OFFSET(n - 1)];
+        temp[n - 1 + CONFLICT_FREE_OFFSET(n - 1)] = 0;
+    }
+
+    for (int d = 1; d < n; d *= 2){
+        offset >>= 1;
+        __syncthreads();
+        if (index < d){
+
+            int ai = offset * (2*index+1) - 1;
+            int bi = offset * (2*index+2) - 1;
+
+            ai += CONFLICT_FREE_OFFSET(ai);
+            bi += CONFLICT_FREE_OFFSET(bi);
+
+            if (ai < n && bi < n){
+                float t = temp[ai];
+                temp[ai] = temp[bi];
+                temp[bi] += t;
+            }
+        }
+    }
+    __syncthreads();
+
+    out[indexA] = temp[indexA + bankOffsetA];
+    out[indexB] = temp[indexB + bankOffsetB];
+}
+
+// Adds each block's scanned block-total onto every element of that block.
+__global__ void addIncs(int* cudaAuxIncs, int* cudaIndicesB, int n){
+    int index = (blockIdx.x * blockDim.x) + threadIdx.x;
+
+
+    // if (index < n){
+    // cudaIndicesB[index] = blockIdx.x; //cudaAuxIncs[blockIdx.x];
+    cudaIndicesB[index] += cudaAuxIncs[blockIdx.x];
+    // }
+}
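The `sums`/`addIncs` pair above is what generalizes the single-block scans to arbitrary lengths: scan each block locally, scan the per-block totals, then add each block's scanned total back onto its elements. A sketch of that sequence, assuming n is a multiple of the block size (the launch sizes here are my reconstruction, not the repo's exact configs):

```cuda
// Three-phase scan for arrays larger than one block.
void scanLargeSketch(int* dIn, int* dOut, int* dSums, int* dIncs, int n){
    int elemsPerBlock = THREADS_PER_BLOCK * 2;
    int numBlocks = n / elemsPerBlock;            // assumes n % elemsPerBlock == 0
    size_t shmem = 2 * elemsPerBlock * sizeof(int);
    // 1) per-block inclusive scans; each block's total lands in dSums
    naiveSumSharedArbitrary<<<numBlocks, elemsPerBlock, shmem>>>(dIn, dOut, elemsPerBlock, dSums);
    // 2) exclusive-scan the numBlocks totals in dSums into dIncs
    //    (any of the scan methods above works for this small array)
    // 3) add block b's scanned total to every element of block b
    addIncs<<<numBlocks, elemsPerBlock>>>(dIncs, dOut, n);
}
```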
+
+// Scatter: each surviving element writes itself to its scanned output slot.
+__global__ void streamCompaction(dataPacket* inRays, int* indices, dataPacket* outRays, int numElements){
+    int k = (blockIdx.x * blockDim.x) + threadIdx.x;
+
+    if (k<numElements && inRays[k].alive){
+        outRays[indices[k]] = inRays[k];
+    }
+}
+
+// NOTE: the two kernels below were lost in transcription and are reconstructed
+// from their call sites: kill flags one element dead, reset re-arms the flags.
+__global__ void killStream(int toKill, dataPacket* data, int* indices, int n){
+    int k = (blockIdx.x * blockDim.x) + threadIdx.x;
+    if (k<n){
+        if (k == toKill) data[k].alive = false;
+        indices[k] = data[k].alive ? 1 : 0;
+    }
+}
+
+__global__ void resetStreams(dataPacket* data, int* indices, int n){
+    int k = (blockIdx.x * blockDim.x) + threadIdx.x;
+    if (k<n){
+        indices[k] = 1;
+    }
+}
+
+void testStreamCompaction(){
+    int numElements = 256;
+
+    // (buffer setup reconstructed; the original allocation/initialization code was lost)
+    dataPacket* cudaArrayA;
+    dataPacket* cudaArrayB;
+    int* testin;
+    int* testout;
+    int* cputest = new int[numElements];
+    cudaMalloc((void**)&cudaArrayA, numElements*sizeof(dataPacket));
+    cudaMalloc((void**)&cudaArrayB, numElements*sizeof(dataPacket));
+    cudaMalloc((void**)&testin, numElements*sizeof(int));
+    cudaMalloc((void**)&testout, numElements*sizeof(int));
+
+    dim3 threadsPerBlock(THREADS_PER_BLOCK);
+    dim3 fullBlocksPerGrid(int(ceil(float(numElements)/float(THREADS_PER_BLOCK))));
+
+    //Scan
+    for (int d=1; d<=ceil(log(float(numElements))/log(2.0f)); d++){
+        sum<<<fullBlocksPerGrid, threadsPerBlock>>>(testin, testout, numElements, int(pow(2.0f,d-1)));
+        cudaThreadSynchronize();
+        cudaMemcpy(cputest, testout, numElements*sizeof(int), cudaMemcpyDeviceToHost);
+
+
+        int* temp = testin;
+        testin=testout;
+        testout=temp;
+    }
+    //Compact
+    streamCompaction<<<fullBlocksPerGrid, threadsPerBlock>>>(cudaArrayA, testin, cudaArrayB, numElements);
+    cudaArrayA = cudaArrayB;
+    cudaThreadSynchronize();
+
+    cudaMemcpy(&numElements, &testin[numElements-1], 1*sizeof(int), cudaMemcpyDeviceToHost);
+
+    std::cout<<"number of rays left: "<<numElements<<std::endl;
+}
+
+void DataStream::globalSum(int* in, int* out, int n){
+
+    dim3 threadsPerBlockL(THREADS_PER_BLOCK);
+    dim3 fullBlocksPerGridL(int(ceil(float(n)/float(THREADS_PER_BLOCK))));
+
+    for (int d=1; d<=ceil(log(float(n))/log(2.0f)); d++){
+        sum<<<fullBlocksPerGridL, threadsPerBlockL>>>(in, out, m_numElementsAlive, powf(2.0f, d-1));
+        cudaThreadSynchronize();
+        int* temp = in;
+        in = out;
+        out = temp;
+    }
+    shift<<<fullBlocksPerGridL, threadsPerBlockL>>>(in, out, m_numElementsAlive);
+}
+
+// Predicate for thrust::copy_if below (reconstructed; the original definition was lost).
+struct isOne{
+    __host__ __device__ bool operator()(const int x) const { return x == 1; }
+};
+
+void DataStream::thrustStreamCompact(){
+    clock_t t = clock ();
+    thrust::copy_if (m_data, m_data+m_numElements, m_indices, m_data, isOne());
+    t = clock() - t;
+    cout<<(float)t/CLOCKS_PER_SEC<<endl;
+}
+
+void DataStream::compactWorkEfficientArbitrary(){
+
+    // (launch setup reconstructed from the calls below)
+    int threadsPerBlock = THREADS_PER_BLOCK;
+    int procsPefBlock = threadsPerBlock*2;
+    int sumSize = m_numElements/procsPefBlock;
+
+    dim3 threadsPerBlockOld(threadsPerBlock);
+    dim3 fullBlocksPerGridOld(int(ceil(float(sumSize)/float(threadsPerBlock))));
+
+    workEfficientArbitrary<<<sumSize, threadsPerBlock, 2*procsPefBlock*sizeof(float)>>>(cudaIndicesA, cudaIndicesB, procsPefBlock, cudaAuxSums);
+
+    for (int d=1; d<=ceil(log(sumSize)/log(2)); d++){
+        sum<<<fullBlocksPerGridOld, threadsPerBlockOld>>>(cudaAuxSums, cudaAuxIncs, sumSize, powf(2.0f, d-1));
+        cudaThreadSynchronize();
+        int* temp = cudaAuxSums;
+        cudaAuxSums = cudaAuxIncs;
+        cudaAuxIncs = temp;
+    }
+    shift<<<fullBlocksPerGridOld, threadsPerBlockOld>>>(cudaAuxSums, cudaAuxIncs, m_numElements/(THREADS_PER_BLOCK*2));
+    addIncs<<<sumSize, procsPefBlock>>>(cudaAuxIncs, cudaIndicesB, m_numElements);
+
+    //Stream compaction from A into B, then save back into A
+    streamCompaction<<<sumSize, procsPefBlock>>>(cudaDataA, cudaIndicesB, cudaDataB, m_numElementsAlive);
+    dataPacket * temp = cudaDataA;
+    cudaDataA = cudaDataB;
+    cudaDataB = temp;
+
+    // update numrays
+    cudaMemcpy(&m_numElementsAlive, &cudaIndicesB[m_numElementsAlive], sizeof(int), cudaMemcpyDeviceToHost);
+
+    resetStreams<<<sumSize, procsPefBlock>>>(cudaDataA, cudaIndicesA, m_numElementsAlive);
+}
+
+void DataStream::compactNaiveSumGlobal(){
+
+    int threadsPerBlock = THREADS_PER_BLOCK;
+
+    dim3 threadsPerBlockL(threadsPerBlock);
+    dim3 fullBlocksPerGridL(m_numElements/threadsPerBlock);
+
+    clock_t t = clock();
+    for (int d=1; d<=ceil(log(m_numElementsAlive)/log(2)); d++){
+        sum<<<fullBlocksPerGridL, threadsPerBlockL>>>(cudaIndicesA, cudaIndicesB, m_numElementsAlive, powf(2.0f, d-1));
+        checkCUDAError("kernel failed 1 !");
+        cudaThreadSynchronize();
+        int* temp = cudaIndicesA;
+        cudaIndicesA = cudaIndicesB;
+        cudaIndicesB = temp;
+    }
+    shift<<<fullBlocksPerGridL, threadsPerBlockL>>>(cudaIndicesA, cudaIndicesB, m_numElementsAlive);
+    checkCUDAError("kernel failed 1 !");
+    t = clock() - t;
+
+    cudaMemcpy(m_indices, cudaIndicesB, m_numElementsAlive*sizeof(int), cudaMemcpyDeviceToHost);
+
+    cout<<(float)t/CLOCKS_PER_SEC<<endl;
+
+    //Stream compaction from A into B, then save back into A
+    streamCompaction<<<fullBlocksPerGridL, threadsPerBlockL>>>(cudaDataA, cudaIndicesB, cudaDataB, m_numElementsAlive);
+    dataPacket * temp = cudaDataA;
+    cudaDataA = cudaDataB;
+    cudaDataB = temp;
+
+    // update numrays
+    cudaMemcpy(&m_numElementsAlive, &cudaIndicesA[m_numElementsAlive-1], sizeof(int), cudaMemcpyDeviceToHost);
+    resetStreams<<<fullBlocksPerGridL, threadsPerBlockL>>>(cudaDataA, cudaIndicesA, m_numElementsAlive);
+}
+
+void DataStream::compactNaiveSumSharedSingleBlock(){
+
+    int threadsPerBlock = THREADS_PER_BLOCK;
+
+    dim3 threadsPerBlockL(threadsPerBlock);
+    dim3 fullBlocksPerGridL(int(ceil(float(m_numElementsAlive)/float(threadsPerBlock))));
+
+    naiveSumSharedSingleBlock<<<fullBlocksPerGridL, threadsPerBlockL, 2*m_numElements*sizeof(int)>>>(cudaIndicesA, cudaIndicesB, m_numElements);
+    checkCUDAError("kernel failed!");
+
+    cudaMemcpy(m_indices, cudaIndicesB, m_numElements*sizeof(int), cudaMemcpyDeviceToHost);
+}
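One caveat on the timings these methods print: kernel launches return before the GPU finishes, so bracketing them with `clock()` can under-measure unless every launch is followed by a synchronize. A CUDA event pair is the usual fix; a sketch of how any of the compaction paths above could be timed instead:

```cuda
// Times one compaction call with CUDA events instead of clock().
float timeCompactionMs(DataStream& ds){
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    ds.compactNaiveSumGlobal();        // any of the compact* variants
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);        // wait until the GPU work is done
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}
```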
+
+void DataStream::compactNaiveSumSharedArbitrary(){
+
+    ////////////////////////////////////////////////////////////////////////////////////////
+    int threadsPerBlock = THREADS_PER_BLOCK;
+
+    dim3 threadsPerBlockL(threadsPerBlock*2);
+    dim3 fullBlocksPerGridL(m_numElements/(threadsPerBlock*2));
+
+    naiveSumSharedArbitrary<<<fullBlocksPerGridL, threadsPerBlockL, 2*(threadsPerBlock*2)*sizeof(int)>>>(cudaIndicesA, cudaIndicesB, threadsPerBlock*2, cudaAuxSums);
+    checkCUDAError("kernel failed 1 !");
+    ////////////////////////////////////////////////////////////////////////////////////////
+
+    ////////////////////////////////////////////////////////////////////////////////////////
+    int sumSize = m_numElements/(THREADS_PER_BLOCK*2);
+    dim3 initialScanThreadsPerBlock2(threadsPerBlock);
+    dim3 initialScanBlocksPerGrid2(sumSize/threadsPerBlock+1);
+
+    dim3 threadsPerBlockOld(threadsPerBlock);
+    dim3 fullBlocksPerGridOld(int(ceil(float(sumSize)/float(threadsPerBlock))));
+
+    // cudaMemcpy(cudaAuxIncs, cudaAuxSums, m_numElements/(THREADS_PER_BLOCK*2)*sizeof(int), cudaMemcpyDeviceToDevice);
+
+    for (int d=1; d<=ceil(log(sumSize)/log(2)); d++){
+        sum<<<fullBlocksPerGridOld, threadsPerBlockOld>>>(cudaAuxSums, cudaAuxIncs, sumSize, powf(2.0f, d-1));
+        cudaThreadSynchronize();
+        int* temp = cudaAuxSums;
+        cudaAuxSums = cudaAuxIncs;
+        cudaAuxIncs = temp;
+    }
+    shift<<<fullBlocksPerGridOld, threadsPerBlockOld>>>(cudaAuxSums, cudaAuxIncs, m_numElements/(THREADS_PER_BLOCK*2));
+
+    addIncs<<<fullBlocksPerGridL, threadsPerBlockL>>>(cudaAuxIncs, cudaIndicesB, m_numElements);
+
+    shift<<<fullBlocksPerGridL, threadsPerBlockL>>>(cudaIndicesB, cudaIndicesA, m_numElementsAlive);
+    int * temp = cudaIndicesA;
+    cudaIndicesA = cudaIndicesB;
+    cudaIndicesB = temp;
+
+    dim3 threadsPerBlockLL(threadsPerBlock);
+    dim3 fullBlocksPerGridLL(m_numElements/threadsPerBlock);
+
+    clock_t t = clock();
+    //Stream compaction from A into B, then save back into A
+    streamCompaction<<<fullBlocksPerGridLL, threadsPerBlockLL>>>(cudaDataA, cudaIndicesB, cudaDataB, m_numElementsAlive);
+    dataPacket * tempDP = cudaDataA;
+    cudaDataA = cudaDataB;
+    cudaDataB = tempDP;
+    t = clock() - t;
+    cout<<(float)t / CLOCKS_PER_SEC<<endl;
+
+    resetStreams<<<fullBlocksPerGridLL, threadsPerBlockLL>>>(cudaDataA, cudaIndicesA, m_numElementsAlive);
+}
+
+bool DataStream::getData(int index, dataPacket& data){
+
+    if (index >= m_numElements) return false;
+
+    data = m_data[index];
+    return true;
+}
+
+int DataStream::numAlive(){
+    return m_numElementsAlive;
+}
+
+void DataStream::fetchDataFromGPU(){
+    cudaMemcpy(m_data, cudaDataA, m_numElementsAlive*sizeof(dataPacket), cudaMemcpyDeviceToHost);
+}
+
+void DataStream::kill(int index){
+    if (index >= m_numElementsAlive) return;
+
+    dim3 threadsPerBlockL(64);
+    dim3 fullBlocksPerGridL(int(ceil(float(m_numElementsAlive)/64.0f)));
+
+    killStream<<<fullBlocksPerGridL, threadsPerBlockL>>>(index, cudaDataA, cudaIndicesA, m_numElementsAlive);
+
+    cudaMemcpy(m_indices, cudaIndicesA, m_numElementsAlive*sizeof(int), cudaMemcpyDeviceToHost);
+}
\ No newline at end of file
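Putting it together, a typical round trip through the `DataStream` API (declared in the header below) looks like this sketch, modeled on the tests in `src/main.cpp`:

```cpp
// Kill random elements and compact until nothing is left alive.
dataPacket* data = new dataPacket[1024];
for (int i = 0; i < 1024; i++) data[i] = dataPacket(i);

DataStream ds(1024, data);
while (ds.numAlive() > 0){
    int toKill = rand() % ds.numAlive();
    ds.kill(toKill);                      // flag one element dead on the GPU
    ds.compactNaiveSumSharedArbitrary();  // scan the flags, scatter the survivors
    ds.fetchDataFromGPU();                // pull the compacted data back to host
}
delete [] data;
```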
diff --git a/src/streamCompaction.h b/src/streamCompaction.h
new file mode 100755
index 0000000..5585e70
--- /dev/null
+++ b/src/streamCompaction.h
@@ -0,0 +1,81 @@
+// CIS565 CUDA Raytracer: A parallel raytracer for Patrick Cozzi's CIS565: GPU Computing at the University of Pennsylvania
+// Written by Yining Karl Li, Copyright (c) 2012 University of Pennsylvania
+// This file includes code from:
+// Rob Farber for CUDA-GL interop, from CUDA Supercomputing For The Masses: http://www.drdobbs.com/architecture-and-design/cuda-supercomputing-for-the-masses-part/222600097
+// Peter Kutz and Yining Karl Li's GPU Pathtracer: http://gpupathtracer.blogspot.com/
+// Yining Karl Li's TAKUA Render, a massively parallel pathtracing renderer: http://www.yiningkarlli.com
+
+#ifndef STREAM_COMPACTION_H
+#define STREAM_COMPACTION_H
+
+// (the eight original include lines lost their header names in transcription;
+// these are the headers this project's code actually relies on)
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <cuda.h>
+#include <cuda_runtime.h>
+#include <cmath>
+#include <ctime>
+#include <iostream>
+
+#define THREADS_PER_BLOCK 256
+
+struct dataPacket{
+    int index;
+    bool alive;
+    dataPacket(){
+        index = -1;
+        alive = true;
+    }
+    dataPacket(int i){
+        index = i;
+        alive = true;
+    }
+};
+
+class DataStream{
+private:
+
+    dataPacket * m_data;
+
+    dataPacket * cudaDataA;
+    dataPacket * cudaDataB;
+
+    int * cudaIndicesA;
+    int * cudaIndicesB;
+
+    int * cudaAuxSums;
+    int * cudaAuxIncs;
+
+    void globalSum(int* in, int* out, int n);
+    void thrustStreamCompact();
+
+public:
+    int * m_indices;
+    int * m_auxSums;
+
+    int m_numElementsAlive, m_numElements;
+
+    DataStream(int numElements, dataPacket * data);
+    ~DataStream();
+
+    void serialScan();
+    void serialScatter();
+
+    void compactWorkEfficientArbitrary();
+    void compactNaiveSumGlobal();
+    void compactNaiveSumSharedSingleBlock();
+    void compactNaiveSumSharedArbitrary();
+    bool getData(int index, dataPacket& data);
+    int numAlive();
+    void kill(int index);
+    void fetchDataFromGPU();
+
+};
+
+void cudaVectorSum(int * indicesA, int * indicesB, int numElements, float k);
+
+void cudaInit (dataPacket * a, dataPacket * b, int * ia, int * ib);
+
+void testStreamCompaction();
+
+#endif
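For comparison with `thrustStreamCompact()` above, the same compaction can be done standalone with `thrust::copy_if`, which fuses the flag/scan/scatter pipeline into a single call. A sketch using `device_vector` (my names; the class itself passes raw host pointers):

```cpp
#include <thrust/device_vector.h>
#include <thrust/copy.h>

// Keep the nonzero elements of a device vector, shrinking it in place.
struct keepNonzero{
    __host__ __device__ bool operator()(const int x) const { return x != 0; }
};

int compactWithThrust(thrust::device_vector<int>& data){
    thrust::device_vector<int> out(data.size());
    thrust::device_vector<int>::iterator end =
        thrust::copy_if(data.begin(), data.end(), out.begin(), keepNonzero());
    out.resize(end - out.begin());   // trim to the survivors
    data.swap(out);
    return (int)data.size();
}
```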