diff --git a/Project2-Character-Recognition/CMakeLists.txt b/Project2-Character-Recognition/CMakeLists.txt index 09e9198..0d94e63 100644 --- a/Project2-Character-Recognition/CMakeLists.txt +++ b/Project2-Character-Recognition/CMakeLists.txt @@ -22,6 +22,7 @@ if(${CMAKE_SYSTEM_NAME} MATCHES "Darwin") endif() include_directories(.) +link_directories(${CUDA_TOOLKIT_ROOT_DIR}/lib/x64) add_subdirectory(character_recognition) cuda_add_executable(${CMAKE_PROJECT_NAME} @@ -32,4 +33,6 @@ cuda_add_executable(${CMAKE_PROJECT_NAME} target_link_libraries(${CMAKE_PROJECT_NAME} character_recognition ${CORELIBS} + cublas + curand ) diff --git a/Project2-Character-Recognition/README.md b/Project2-Character-Recognition/README.md index 4503fac..e15fe15 100644 --- a/Project2-Character-Recognition/README.md +++ b/Project2-Character-Recognition/README.md @@ -3,12 +3,31 @@ CUDA Character Recognition **University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 2** -* (TODO) YOUR NAME HERE - * (TODO) [LinkedIn](), [personal website](), [twitter](), etc. -* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab) +* John Marcao + * [LinkedIn](https://www.linkedin.com/in/jmarcao/) + * [Personal Website](https://jmarcao.github.io) +* Tested on: Windows 10, i5-4690K @ 3.50GHz, 8GB DDR3, RTX 2080 TI 3071MB (Personal) -### (TODO: Your README) +# Goals -Include analysis, etc. (Remember, this is public, so don't put -anything here that you don't want to share with the world.) +The goal of this project is to explore one of the many applications of parallel programming: Machine Learning. We start by implementing a Perceptron, a basic ML construct that uses several layers, each with varying weights and biases, to implement a character recognition machine. The Perceptron takes in data in its input layer and then utilizes parallel algorithms to perform matrix multiplication and other operations to transform the input data into a guess in the output data.
With enough "training", the Perceptron can detect what character is written in an image. +The perceptron I designed has three layers: +* Input Layer - Accepts formatted image data +* Hidden Layer - Intermediate layer that reduces the number of datapoints in the image by 80% +* Output Layer - Final layer that produces an output based on the weights learned by the Perceptron. + +Unfortunately, I was not able to get my Perceptron up and running in the allotted time. Since I cannot discuss performance characterization, I will go over some issues in my design and some lessons learned. + +## What is Working + +* Forward-Propagation... sort of. My machine is able to take an input and feed it through the perceptron to form an output decision. There were some changes I had to make in my system that deviated from the traditional models I studied. My node values were particularly high, so much so that the Softmax equation applied to the last layer would fail due to overflowing float values (e^1023043 is too much?). I remedied this by adding a step where each value in the output layer is scaled down such that the Softmax equation still works. +* Matrix Manipulation - Using the cublas library, I was able to set up several functions and calls to perform a variety of transformations on the low-level matrix values through my more high-level classes. + +## What is not Working + +* Learning/Backpropagation - Right now, the system can go through one learning epoch and then apply the deltas, based on the error, to the weights of the system. However, during the second epoch, the system diverges and my float values overflow. I am not sure why this is the case. Possible causes include inverted operations, inverted matrix indices, etc. + +# Challenges + +The most challenging part of this project was getting the complexity under control. The perceptron has a lot of moving parts and a lot of equations, and getting them confused and mixed up is easy.
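One of the numerical issues above — Softmax overflowing on large node values — has a standard fix that avoids a magic rescaling constant: subtract the maximum element before exponentiating. The result is mathematically identical, since the shared factor cancels during normalization. A minimal serial sketch (the function name is mine, not code from this project):

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Numerically stable softmax: shifting every input by the max keeps each
// exponent at or below zero, so std::exp never overflows even for huge inputs.
std::vector<float> stableSoftmax(const std::vector<float>& in) {
    float maxVal = *std::max_element(in.begin(), in.end());
    std::vector<float> out(in.size());
    float sum = 0.0f;
    for (std::size_t i = 0; i < in.size(); i++) {
        out[i] = std::exp(in[i] - maxVal);
        sum += out[i];
    }
    for (float& v : out) {
        v /= sum;  // normalize so the outputs sum to 1
    }
    return out;
}
```

On the GPU this maps to a max-reduction followed by the same exponentiate/sum/scale steps already used here, with no data-dependent prescale factor to tune.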
There is also additional complexity with the introduction of the cublas library with CUDA. The library is incredibly powerful, providing several functions for Vector and Matrix operations. Part of the challenge of this project was understanding the library as well as its API and properly using it. I found that there are a lot of math concepts that, although vital, were lost on me. \ No newline at end of file diff --git a/Project2-Character-Recognition/character_recognition/CMakeLists.txt b/Project2-Character-Recognition/character_recognition/CMakeLists.txt index 7446175..bfd956f 100644 --- a/Project2-Character-Recognition/character_recognition/CMakeLists.txt +++ b/Project2-Character-Recognition/character_recognition/CMakeLists.txt @@ -7,5 +7,5 @@ set(SOURCE_FILES cuda_add_library(character_recognition ${SOURCE_FILES} - OPTIONS -arch=sm_20 + OPTIONS -arch=sm_70 ) diff --git a/Project2-Character-Recognition/character_recognition/mlp.cu b/Project2-Character-Recognition/character_recognition/mlp.cu index 5a3ed7f..80f30b8 100644 --- a/Project2-Character-Recognition/character_recognition/mlp.cu +++ b/Project2-Character-Recognition/character_recognition/mlp.cu @@ -1,8 +1,16 @@ #include <cuda.h> #include <cuda_runtime.h> +#include <cublas_v2.h> +#include <curand.h> +#include <cassert> +#include <algorithm> +#include <iostream> +#include <ctime> #include "common.h" #include "mlp.h" +constexpr int BLOCKSIZE = 1024; + namespace CharacterRecognition { using Common::PerformanceTimer; PerformanceTimer& timer() @@ -10,18 +18,569 @@ namespace CharacterRecognition { static PerformanceTimer timer; return timer; } - - // TODO: __global__ - - /** - * Example of use case (follow how you did it in stream compaction) - */ - /*void scan(int n, int *odata, const int *idata) { - timer().startGpuTimer(); - // TODO - timer().endGpuTimer(); - } - */ - // TODO: implement required elements for MLP sections 1 and 2 here + void initCublas() + { + cublasStatus_t status; + status = cublasCreate(&ch); + if (status != CUBLAS_STATUS_SUCCESS) { + std::cout << "Failed to initialize cublas: "
<< status << std::endl; + } + } + + void deleteCublas() + { + cublasStatus_t status; + status = cublasDestroy(ch); + if (status != CUBLAS_STATUS_SUCCESS) { + std::cout << "Failed to destroy cublas: " << status << std::endl; + } + } + + void matrixMul(cublasHandle_t ch, const Matrix* A, const Matrix* B, Matrix* C) { + cublasStatus_t status; + const float alpha = 1.0f; // Factor to multiply A by + const float beta = 0.0f; // Factor to multiply C by prior to result. + + assert(A->colcnt == B->rowcnt); + assert(A->rowcnt == C->rowcnt); + assert(B->colcnt == C->colcnt); + + // Do a Matrix Multiply + status = cublasSgemm( + ch, + CUBLAS_OP_N, + CUBLAS_OP_N, + A->rowcnt, + B->colcnt, + B->rowcnt, + &alpha, + A->dev_data, + A->rowcnt, + B->dev_data, + B->rowcnt, + &beta, + C->dev_data, + A->rowcnt); + + if (status != CUBLAS_STATUS_SUCCESS) { + std::cout << "Failed to perform matrix multiply: " << status << std::endl; + } + + cudaDeviceSynchronize(); + checkCUDAError("Failed matrixMul"); + } + + void matrixMul(cublasHandle_t ch, const Matrix* A, cublasOperation_t aop, const Matrix* B, cublasOperation_t bop, Matrix* C) { + const float alpha = 1.0f; // Factor to multiply A by + const float beta = 0.0f; // Factor to multiply C by prior to result. 
+ + //assert(A->colcnt == B->rowcnt); + //assert(A->rowcnt == C->rowcnt); + //assert(B->colcnt == C->colcnt); + + // Do a Matrix Multiply + cublasSgemm( + ch, + aop, + bop, + A->rowcnt, + B->colcnt, + B->rowcnt, + &alpha, + A->dev_data, + A->rowcnt, + B->dev_data, + B->rowcnt, + &beta, + C->dev_data, + A->rowcnt); + + cudaDeviceSynchronize(); + checkCUDAError("Failed matrixMul"); + } + + void matrixSub(cublasHandle_t ch, const Matrix* A, const Matrix* B, Matrix* C) { + const float alpha = 1.0f; // Factor to multiply A by + const float beta = -1.0f; // Factor to multiply B by + + assert(A->colcnt == B->colcnt); + assert(A->rowcnt == B->rowcnt); + assert(A->colcnt == C->colcnt); + assert(A->rowcnt == C->rowcnt); + + // Do a Matrix Subtraction + cublasSgeam( + ch, + CUBLAS_OP_N, + CUBLAS_OP_N, + A->rowcnt, + B->colcnt, + &alpha, + A->dev_data, + A->rowcnt, + &beta, + B->dev_data, + B->rowcnt, + C->dev_data, + A->rowcnt + ); + + cudaDeviceSynchronize(); + checkCUDAError("Failed matrixSub"); + } + + void matrixAdd(cublasHandle_t ch, const Matrix* A, const Matrix* B, Matrix* C) { + const float alpha = 1.0f; // Factor to multiply A by + const float beta = 1.0f; // Factor to multiply B by + + assert(A->colcnt == B->colcnt); + assert(A->rowcnt == B->rowcnt); + assert(A->colcnt == C->colcnt); + assert(A->rowcnt == C->rowcnt); + + // Do a Matrix Addition + cublasSgeam( + ch, + CUBLAS_OP_N, + CUBLAS_OP_N, + A->rowcnt, + B->colcnt, + &alpha, + A->dev_data, + A->rowcnt, + &beta, + B->dev_data, + B->rowcnt, + C->dev_data, + A->rowcnt + ); + + cudaDeviceSynchronize(); + checkCUDAError("Failed matrixAdd"); + } + + Matrix::Matrix(int colcnt, int rowcnt) : colcnt(colcnt), rowcnt(rowcnt) + { + this->cpuAlloc(); + for (int i = 0; i < this->getLen(); i++) { + this->cpu_data[i] = 0; + } + + this->devAlloc(); + this->copyCpuToDev(); + } + + Matrix::~Matrix() + { + this->cpuFree(); + this->devFree(); + } + + void Matrix::cpuAlloc() + { + cpu_data = (float*)malloc(rowcnt * colcnt *
sizeof(float)); + if (cpu_data == NULL) { + throw std::runtime_error("Failed to allocate cpu_data for Matrix!"); + } + } + + void Matrix::devAlloc() + { + cudaMalloc(&dev_data, rowcnt * colcnt * sizeof(float)); + checkCUDAError("Failed to allocate dev_data for Matrix!"); + } + + void Matrix::cpuFree() + { + if (cpu_data) { + free(cpu_data); + } + } + + void Matrix::devFree() + { + if (dev_data) { + cudaFree(dev_data); + checkCUDAError("Failed to free dev_data for Matrix!"); + } + } + + void Matrix::copyCpuToDev() + { + cudaMemcpy(this->dev_data, this->cpu_data, this->getLen() * sizeof(float), ::cudaMemcpyHostToDevice); + checkCUDAError("Failed to memcpy in copyCpuToDev()"); + } + + void Matrix::copyDevToCpu() + { + cudaMemcpy(this->cpu_data, this->dev_data, this->getLen() * sizeof(float), ::cudaMemcpyDeviceToHost); + checkCUDAError("Failed to memcpy in copyDevToCpu()"); + } + + void Matrix::copyMatrix(Matrix * m) + { + assert(this->colcnt == m->colcnt); + assert(this->rowcnt == m->rowcnt); + + memcpy(this->cpu_data, m->cpu_data, m->getLen() * sizeof(float)); + cudaMemcpy(this->dev_data, m->dev_data, m->getLen() * sizeof(float), ::cudaMemcpyDeviceToDevice); + checkCUDAError("Failed to memcpy in copyMatrix()"); + } + + int Matrix::getLen() + { + return rowcnt * colcnt; + } + + ImageFile::ImageFile(std::string filepath) : fd(0) + { + fd = std::fopen(filepath.c_str(), "r"); + } + + ImageFile::~ImageFile() + { + std::fclose(fd); + } + + void ImageFile::readImage(Matrix* m) + { + // Format of Image File + // filename\r\n + // num_pixels\r\n + // pixels_0_255 ... pixels_0_255\r\n // Note leading space and list is space-delimited.
+ + int bytes_read = 0; + + bytes_read += std::fscanf(this->fd, "%i", &this->expected_number); + bytes_read += std::fscanf(this->fd, "%i", &this->pixels); + + pixels = std::min(pixels, PIXELS); + + for (int i = 0; i < pixels; i++) { + int tmp = 0; + bytes_read += std::fscanf(this->fd, "%i", &tmp); + m->cpu_data[i] = ((float)tmp / 255.0f); + } + + return; + } + + int ImageFile::getExpectedNumber() + { + return this->expected_number; + } + + Perceptron::Perceptron(int pixels, int outputs) : + inputLayer(pixels, 1), + hiddenLayer(pixels / 5, 1), + outputLayer(outputs, 1), + expectedLayer(outputs, 1), + kjWeights(pixels / 5, pixels), + jiWeights(outputs, pixels / 5), + kjWeightsDelta(pixels / 5, pixels), + jiWeightsDelta(outputs, pixels / 5), + jiOmega(outputs, 1), + jiPsi(outputs, 1), + jiTheta(outputs, 1), + kjTheta(pixels / 5, 1), + kjOmega(pixels / 5, 1), + kjPsi(pixels / 5, 1), + result(0), + tr_runs(0.0f) + { + } + + Perceptron::~Perceptron() + { + } + + void Perceptron::randomizeWeights() + { + // Create an RNG via curand and then populate the weights array with those numbers. 
+ curandGenerator_t gen; + + // Create and seed generator + curandCreateGenerator(&gen, CURAND_RNG_PSEUDO_DEFAULT); + curandSetPseudoRandomGeneratorSeed(gen, ::time(NULL)); + + // Populate weight matrices + curandGenerateUniform(gen, this->kjWeights.dev_data, this->kjWeights.getLen()); + curandGenerateUniform(gen, this->jiWeights.dev_data, this->jiWeights.getLen()); + + // Synchronize + cudaDeviceSynchronize(); + + // Cleanup + curandDestroyGenerator(gen); + } + + void Perceptron::loadBrain(std::string brainfile) + { + // readFile into Matrix + // copy into correct matrices + // TODO + } + + void Perceptron::saveBrain(std::string brainfile) + { + // Read matrices + // Output to file in a format to be defined + // TODO + } + + void Perceptron::loadTrainingDataSet(ImageFile * input) + { + // Load data + loadDataSet(input); + + // Update expected layer for training + for (int i = 0; i < expectedLayer.getLen(); i++) { + expectedLayer.cpu_data[i] = 0; + } + expectedLayer.cpu_data[input->getExpectedNumber()] = 1.0f; + } + + void Perceptron::loadDataSet(ImageFile * input) + { + // Load data and store the expected result + input->readImage(&this->inputLayer); + inputLayer.copyCpuToDev(); + } + + // Mostly for debug + void Perceptron::updateCpu() { + inputLayer.copyDevToCpu(); + hiddenLayer.copyDevToCpu(); + outputLayer.copyDevToCpu(); + kjWeights.copyDevToCpu(); + jiWeights.copyDevToCpu(); + kjWeightsDelta.copyDevToCpu(); + jiWeightsDelta.copyDevToCpu(); + } + + void Perceptron::impl_run(bool training) + { + // Run the machine on the data set. + matrixMul(ch, &inputLayer, &kjWeights, &hiddenLayer); // Step 1) Calculate values of hidden layer.
+ if (training) { + kjTheta.copyMatrix(&hiddenLayer); // Step 1.1) Save off hidden layer before sigmoids for backprop + } + + reluActivate(&hiddenLayer); // STEP 2) Apply activation function + matrixMul(ch, &hiddenLayer, &jiWeights, &outputLayer); // Step 3) Hidden layer now populated, get output layer + if (training) { + jiTheta.copyMatrix(&outputLayer); // Step 3.1) Save off output layer before sigmoids for backprop + } + + outputLayer.copyDevToCpu(); + + softmaxActivate(&outputLayer); // Step 4) Apply activation to output layers + + // Step 5) Store the result, i.e. the brightest node in the output layer + // Output layer is small, so do this on CPU. + outputLayer.copyDevToCpu(); + result = std::max_element( + outputLayer.cpu_data, + outputLayer.cpu_data + outputLayer.getLen() + ) - outputLayer.cpu_data; + + // Inc. Run Counter for Backprop + if (training) { + this->tr_runs++; + } + } + + void Perceptron::run() + { + impl_run(false); + } + + void Perceptron::train() + { + impl_run(true); + } + + int Perceptron::getLastResult() + { + // Get the result of the last run. + return result; + } + + void Perceptron::updateBackprop() + { + // The backprop algorithm runs in two phases. + // From the output, compute the deltas that should be made to the jiWeights. + // Then, from there, calculate the deltas that should be applied to the kjWeights. + + // 1.0) Calculate delta to ji Weights + // 1.1) Calculate iTheta ... Done during run() + matrixSub(ch, &expectedLayer, &outputLayer, &jiOmega); // 1.2) Calculate iOmega ... Done by subtracting expectedLayer - outputLayer + calcPsi(&jiOmega, &jiTheta, &jiPsi); // 1.3) Calculate iPsi ... a little fancier + calcDeltaChange(0.01f, &hiddenLayer, &jiPsi, &jiWeightsDelta); // 1.4) Lastly, calculate the delta to each weight.
+ + // 2.0) Now repeat for the kj Weights + calcOmega(&jiPsi, &jiWeights, &kjOmega); // This omega is done with a special function, unlike subtraction from last layer + calcPsi(&kjOmega, &kjTheta, &kjPsi); + calcDeltaChange(0.01f, &inputLayer, &kjPsi, &kjWeightsDelta); + + // Old way did not work, let's try from scratch... + } + + void Perceptron::applyBackprop() + { + // Average over the number of runs + float t = 1.0f / this->tr_runs; + this->tr_runs = 0.0f; // Reset + + cublasSaxpy(ch, kjWeights.getLen(), &t, kjWeightsDelta.dev_data, 1, kjWeights.dev_data, 1); + cublasSaxpy(ch, jiWeights.getLen(), &t, jiWeightsDelta.dev_data, 1, jiWeights.dev_data, 1); + } + + __global__ void kernCalcPsi(int n, float * omega, float * theta, float * psi) { + int idx = getGlobalIdx_3D_3D(); + if (idx < n) { + psi[idx] = omega[idx] + devInverseSigmoid(theta[idx]); + } + } + + void Perceptron::calcPsi(const Matrix * omega, const Matrix * theta, Matrix * psi) + { + // Call a kernel to handle this. + assert(omega->colcnt == theta->colcnt); + assert(omega->rowcnt == theta->rowcnt); + assert(omega->colcnt == psi->colcnt); + assert(omega->rowcnt == psi->rowcnt); + assert(omega->rowcnt == 1); + + int n = omega->colcnt; + kernCalcPsi<<<1, n>>>(n, omega->dev_data, theta->dev_data, psi->dev_data); + cudaDeviceSynchronize(); + checkCUDAError("Failed calcPsi"); + } + + __global__ void kernCalcDeltaChange(int n, float lambda, float * layer, float * psi, float * deltaOut) { + int globalId = getGlobalIdx_3D_3D(); + int blockId = blockIdx.x + blockIdx.y * gridDim.x + gridDim.x * gridDim.y * blockIdx.z; + int threadId = (threadIdx.z * (blockDim.x * blockDim.y)) + (threadIdx.y * blockDim.x) + threadIdx.x; + + if (globalId < n) { + deltaOut[globalId] += lambda * layer[threadId] * psi[blockId]; + } + } + + void Perceptron::calcDeltaChange(const float lambda, const Matrix * leftLayer, const Matrix * psi, Matrix * weightDelta) + { + // Call a kernel to handle this + int blocks = weightDelta->rowcnt; +
int threadsperblock = weightDelta->colcnt; + int totalthreads = blocks * threadsperblock; + + kernCalcDeltaChange<<<blocks, threadsperblock>>>(totalthreads, lambda, leftLayer->dev_data, psi->dev_data, weightDelta->dev_data); + cudaDeviceSynchronize(); + checkCUDAError("Failed calcDeltaChange"); + } + + void Perceptron::calcOmega(const Matrix * psi, const Matrix * weights, Matrix * omega) + { + // Transpose matrix since we are multiplying the other way + matrixMul(ch, psi, CUBLAS_OP_N, weights, CUBLAS_OP_T, omega); + checkCUDAError("Failed calcOmega"); + } + + void reluActivate(Matrix * m) + { + dim3 fullBlocksPerGrid((m->getLen() + BLOCKSIZE - 1) / BLOCKSIZE); + kernReluActivate<<<fullBlocksPerGrid, BLOCKSIZE>>>(m->getLen(), m->dev_data, m->dev_data); + checkCUDAError("Failed reluActivate"); + } + // Softmax involves exponentiating each value, summing them, and then dividing so that the sum of the vector is 1. + void softmaxActivate(Matrix * m) { + cublasStatus_t status; + dim3 fullBlocksPerGrid((m->getLen() + BLOCKSIZE - 1) / BLOCKSIZE); + float expSum = 0.0f; + float invExpSum = 0.0f; + + // Prescale down + // TODO: For some reason my values are huge at this stage (10,000~) and exponentiating them + // is impossible. I scale everything down to make the system at least WORK.
+ invExpSum = 0.0001f; + status = cublasSscal(ch, m->getLen(), &invExpSum, m->dev_data, 1); + if (status != CUBLAS_STATUS_SUCCESS) { + std::cout << "Failed cublasSscal: " << status << std::endl; + } + + // Exponentiate + kernExponentiate<<<fullBlocksPerGrid, BLOCKSIZE>>>(m->getLen(), m->dev_data, m->dev_data); + checkCUDAError("Failed kernExponentiate"); + + // Sum + status = cublasSasum(ch, m->getLen(), m->dev_data, 1, &expSum); + if (status != CUBLAS_STATUS_SUCCESS) { + std::cout << "Failed cublasSasum: " << status << std::endl; + } + + // Normalize + invExpSum = 1.0f / expSum; + status = cublasSscal(ch, m->getLen(), &invExpSum, m->dev_data, 1); + if (status != CUBLAS_STATUS_SUCCESS) { + std::cout << "Failed cublasSscal: " << status << std::endl; + } + } + + ////////////////////////////////// + ////////////////////////////////// + // KERNEL OPERATIONS + ////////////////////////////////// + ////////////////////////////////// + + __device__ int getGlobalIdx_3D_3D() { + int blockId = blockIdx.x + blockIdx.y * gridDim.x + gridDim.x * gridDim.y * blockIdx.z; + int threadId = blockId * (blockDim.x * blockDim.y * blockDim.z) + (threadIdx.z * (blockDim.x * blockDim.y)) + (threadIdx.y * blockDim.x) + threadIdx.x; + return threadId; + } + + __global__ void kernReluActivate(int n, float* in, float* out) { + int idx = getGlobalIdx_3D_3D(); + if (idx < n) { + out[idx] = fmaxf(0.0f, in[idx]); + } + } + + __device__ float devSigmoid(float n) { + return 1.0f / (1 + expf(-1.0f * n)); + } + + __device__ float devInverseSigmoid(float n) { + return 1.0f / (1 + expf(n)); + } + + __global__ void kernExponentiate(int n, float* in, float* out) { + int idx = getGlobalIdx_3D_3D(); + if (idx < n) { + out[idx] = expf(in[idx]); + } + } + + void testMatrixMul() { + Matrix m_a(4, 1); // Input Values + Matrix m_b(2, 4); // Weights + Matrix m_c(2, 1); // Output Values + + // Init matrix + for (int i = 0; i < m_a.getLen(); i++) { + m_a.cpu_data[i] = i + 1; + } + for (int i = 0; i < m_b.getLen(); i++) { + m_b.cpu_data[i] =
2; + } + + // Populate Device + m_a.copyCpuToDev(); + m_b.copyCpuToDev(); + + matrixMul(ch, &m_a, &m_b, &m_c); + + m_c.copyDevToCpu(); + } } diff --git a/Project2-Character-Recognition/character_recognition/mlp.h b/Project2-Character-Recognition/character_recognition/mlp.h index 2096228..87091f9 100644 --- a/Project2-Character-Recognition/character_recognition/mlp.h +++ b/Project2-Character-Recognition/character_recognition/mlp.h @@ -1,9 +1,128 @@ #pragma once #include "common.h" +#include <cublas_v2.h> + +constexpr int PIXEL_RES = 2; +constexpr int PIXELS = 10201 / PIXEL_RES; +constexpr int OUTPUTS = 52; namespace CharacterRecognition { Common::PerformanceTimer& timer(); + class Matrix { + public: + Matrix(int colcnt, int rowcnt); + ~Matrix(); + + int colcnt; + int rowcnt; + + float* dev_data; + float* cpu_data; + + void copyCpuToDev(); + void copyDevToCpu(); + + void copyMatrix(Matrix* m); + + int getLen(); + + private: + // Memory-Management + void cpuAlloc(); + void devAlloc(); + void cpuFree(); + void devFree(); + }; + + class ImageFile { + public: + ImageFile(std::string filepath); + ~ImageFile(); + + // Reads the data into a Matrix object + void readImage(Matrix* m); + + int getExpectedNumber(); + + private: + std::FILE* fd; + + int expected_number; // TODO: This shouldn't be here. Rethink this...
+ int pixels; + }; + + class Perceptron { + public: + Perceptron(int pixels, int outputs); + ~Perceptron(); + + void randomizeWeights(); + void loadBrain(std::string brainfile); + void saveBrain(std::string brainfile); + + void loadTrainingDataSet(ImageFile * input); + void loadDataSet(ImageFile * input); + + void run(); + void train(); + + int getLastResult(); + + void updateBackprop(); + void applyBackprop(); + + void updateCpu(); // For debugging + + private: + Matrix inputLayer; + Matrix hiddenLayer; + Matrix outputLayer; + Matrix expectedLayer; + Matrix kjWeights; // Input -> Hidden + Matrix jiWeights; // Hidden -> Output + + // Backprop data + Matrix kjWeightsDelta; + Matrix jiWeightsDelta; + + Matrix jiOmega; + Matrix jiPsi; + Matrix jiTheta; + + Matrix kjTheta; + Matrix kjOmega; + Matrix kjPsi; + + int result; + float tr_runs; + + void impl_run(bool training); + + void calcPsi(const Matrix* omega, const Matrix* theta, Matrix* psi); + void calcDeltaChange(const float lambda, const Matrix* leftLayer, const Matrix* psi, Matrix* weightDelta); + void calcOmega(const Matrix* psi, const Matrix* weights, Matrix* omega); + }; + + static cublasHandle_t ch; + + void initCublas(); + void deleteCublas(); + // TODO: implement required elements for MLP sections 1 and 2 here + void matrixMul(cublasHandle_t ch, const Matrix* A, const Matrix* B, Matrix* C); + + void sigmoid(Matrix* m); + void reluActivate(Matrix* m); + void softmaxActivate(Matrix* m); + + __device__ float devSigmoid(float n); + __device__ float devInverseSigmoid(float n); + __device__ int getGlobalIdx_3D_3D(); + __global__ void kernReluActivate(int n, float* in, float* out); + __global__ void kernExponentiate(int n, float* in, float* out); + + void testMatrixMul(); + } diff --git a/Project2-Character-Recognition/src/main.cpp b/Project2-Character-Recognition/src/main.cpp index 11dd534..b7522cd 100644 --- a/Project2-Character-Recognition/src/main.cpp +++ b/Project2-Character-Recognition/src/main.cpp @@ -1,152 
+1,110 @@ /** - * @file main.cpp - * @brief Stream compaction test program - * @authors Kai Ninomiya - * @date 2015 - * @copyright University of Pennsylvania + * Character Recognition + * John Marcao, CIS565 2019 */ #include <cstdio> +#include <vector> +#include <regex> +#include <windows.h> #include <character_recognition/mlp.h> #include "testing_helpers.hpp" -const int SIZE = 1 << 8; // feel free to change the size of array -const int NPOT = SIZE - 3; // Non-Power-Of-Two -int *a = new int[SIZE]; -int *b = new int[SIZE]; -int *c = new int[SIZE]; +using CharacterRecognition::Matrix; +using CharacterRecognition::ImageFile; +using CharacterRecognition::Perceptron; +using CharacterRecognition::initCublas; +using CharacterRecognition::deleteCublas; + +std::vector<std::string> parseDirectory(const std::string path); +void testMatrixMul(); + int main(int argc, char* argv[]) { - // Scan tests - - printf("\n"); - printf("****************\n"); - printf("** SCAN TESTS **\n"); - printf("****************\n"); - - genArray(SIZE - 1, a, 50); // Leave a 0 at the end to test that edge case - a[SIZE - 1] = 0; - printArray(SIZE, a, true); - - // initialize b using StreamCompaction::CPU::scan you implement - // We use b for further comparison. Make sure your StreamCompaction::CPU::scan is correct. - // At first all cases passed because b && c are all zeroes.
- zeroArray(SIZE, b); - printDesc("cpu scan, power-of-two"); - StreamCompaction::CPU::scan(SIZE, b, a); - printElapsedTime(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation(), "(std::chrono Measured)"); - printArray(SIZE, b, true); - - zeroArray(SIZE, c); - printDesc("cpu scan, non-power-of-two"); - StreamCompaction::CPU::scan(NPOT, c, a); - printElapsedTime(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation(), "(std::chrono Measured)"); - printArray(NPOT, b, true); - printCmpResult(NPOT, b, c); - - zeroArray(SIZE, c); - printDesc("naive scan, power-of-two"); - StreamCompaction::Naive::scan(SIZE, c, a); - printElapsedTime(StreamCompaction::Naive::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); - //printArray(SIZE, c, true); - printCmpResult(SIZE, b, c); - - /* For bug-finding only: Array of 1s to help find bugs in stream compaction or scan - onesArray(SIZE, c); - printDesc("1s array for finding bugs"); - StreamCompaction::Naive::scan(SIZE, c, a); - printArray(SIZE, c, true); */ - - zeroArray(SIZE, c); - printDesc("naive scan, non-power-of-two"); - StreamCompaction::Naive::scan(NPOT, c, a); - printElapsedTime(StreamCompaction::Naive::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); - //printArray(SIZE, c, true); - printCmpResult(NPOT, b, c); - - zeroArray(SIZE, c); - printDesc("work-efficient scan, power-of-two"); - StreamCompaction::Efficient::scan(SIZE, c, a); - printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); - //printArray(SIZE, c, true); - printCmpResult(SIZE, b, c); - - zeroArray(SIZE, c); - printDesc("work-efficient scan, non-power-of-two"); - StreamCompaction::Efficient::scan(NPOT, c, a); - printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); - //printArray(NPOT, c, true); - printCmpResult(NPOT, b, c); - - zeroArray(SIZE, c); - printDesc("thrust scan, 
power-of-two"); - StreamCompaction::Thrust::scan(SIZE, c, a); - printElapsedTime(StreamCompaction::Thrust::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); - //printArray(SIZE, c, true); - printCmpResult(SIZE, b, c); - - zeroArray(SIZE, c); - printDesc("thrust scan, non-power-of-two"); - StreamCompaction::Thrust::scan(NPOT, c, a); - printElapsedTime(StreamCompaction::Thrust::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); - //printArray(NPOT, c, true); - printCmpResult(NPOT, b, c); - - printf("\n"); - printf("*****************************\n"); - printf("** STREAM COMPACTION TESTS **\n"); - printf("*****************************\n"); - - // Compaction tests - - genArray(SIZE - 1, a, 4); // Leave a 0 at the end to test that edge case - a[SIZE - 1] = 0; - printArray(SIZE, a, true); - - int count, expectedCount, expectedNPOT; - - // initialize b using StreamCompaction::CPU::compactWithoutScan you implement - // We use b for further comparison. Make sure your StreamCompaction::CPU::compactWithoutScan is correct. 
- zeroArray(SIZE, b); - printDesc("cpu compact without scan, power-of-two"); - count = StreamCompaction::CPU::compactWithoutScan(SIZE, b, a); - printElapsedTime(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation(), "(std::chrono Measured)"); - expectedCount = count; - printArray(count, b, true); - printCmpLenResult(count, expectedCount, b, b); - - zeroArray(SIZE, c); - printDesc("cpu compact without scan, non-power-of-two"); - count = StreamCompaction::CPU::compactWithoutScan(NPOT, c, a); - printElapsedTime(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation(), "(std::chrono Measured)"); - expectedNPOT = count; - printArray(count, c, true); - printCmpLenResult(count, expectedNPOT, b, c); - - zeroArray(SIZE, c); - printDesc("cpu compact with scan"); - count = StreamCompaction::CPU::compactWithScan(SIZE, c, a); - printElapsedTime(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation(), "(std::chrono Measured)"); - printArray(count, c, true); - printCmpLenResult(count, expectedCount, b, c); - - zeroArray(SIZE, c); - printDesc("work-efficient compact, power-of-two"); - count = StreamCompaction::Efficient::compact(SIZE, c, a); - printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); - //printArray(count, c, true); - printCmpLenResult(count, expectedCount, b, c); - - zeroArray(SIZE, c); - printDesc("work-efficient compact, non-power-of-two"); - count = StreamCompaction::Efficient::compact(NPOT, c, a); - printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); - //printArray(count, c, true); - printCmpLenResult(count, expectedNPOT, b, c); - - system("pause"); // stop Win32 console from closing on exit - delete[] a; - delete[] b; - delete[] c; + /**************************** + * TODO: User Input for training/loading/saving + */ + const std::string IMAGE_PATH = "..\\data-set\\"; + const std::string 
IMAGE_SEARCH_PATH = IMAGE_PATH + "*"; + std::vector<std::string> files = parseDirectory(IMAGE_SEARCH_PATH); + initCublas(); + + CharacterRecognition::testMatrixMul(); + + Perceptron p(PIXELS, OUTPUTS); + + // Begin With Random Values + p.randomizeWeights(); + p.updateCpu(); + + // Load files and train on those files + for (int i = 0; i < 10; i++) { + for (auto &fname : files) { + ImageFile inputFile(IMAGE_PATH + fname); + + p.loadTrainingDataSet(&inputFile); + p.train(); + p.updateBackprop(); + p.updateCpu(); + } + p.applyBackprop(); + p.updateCpu(); + } + + p.updateCpu(); + + // Now Run against data set + std::vector<std::string> correct_guesses; + std::vector<std::string> wrong_guesses; + for (auto &fname : files) { + ImageFile inputFile(IMAGE_PATH + fname); + + p.loadDataSet(&inputFile); + p.run(); + + if(inputFile.getExpectedNumber() == p.getLastResult()) { + correct_guesses.push_back(fname); + } + else { + wrong_guesses.push_back(fname); + } + } + + // Report Results + int correct = correct_guesses.size(); + int wrong = wrong_guesses.size(); + float accuracy = (float)correct / (wrong + correct); + std::cout << "Run complete with accuracy " << accuracy << std::endl; + std::cout << "MLP was wrong about the following files: " << std::endl; + for (auto &f : wrong_guesses) { + std::cout << "\t" << f << std::endl; + } + + deleteCublas(); + return 0; +} + +std::vector<std::string> parseDirectory(const std::string path) { + std::vector<std::string> ret; + std::regex fileMatch(".*\\.txt$"); + + // Directory walking adapted from https://www.bfilipek.com/2019/04/dir-iterate.html + + WIN32_FIND_DATA FindFileData; + HANDLE hFind = FindFirstFile(path.c_str(), &FindFileData); + if (hFind == INVALID_HANDLE_VALUE) { + throw std::runtime_error("FindFirstFile failed!"); + } + + do { + std::string file(FindFileData.cFileName); + if (std::regex_match(file, fileMatch)) { + ret.push_back(file); + } + } while (FindNextFile(hFind, &FindFileData) != 0); + + FindClose(hFind); + + return ret; } diff --git
a/Project2-Stream-Compaction/.vscode/settings.json b/Project2-Stream-Compaction/.vscode/settings.json new file mode 100644 index 0000000..a796974 --- /dev/null +++ b/Project2-Stream-Compaction/.vscode/settings.json @@ -0,0 +1,6 @@ +{ + "files.associations": { + "*.psharp": "csharp", + "*.cu": "c" + } +} \ No newline at end of file diff --git a/Project2-Stream-Compaction/README.md b/Project2-Stream-Compaction/README.md index 0e38ddb..5b63243 100644 --- a/Project2-Stream-Compaction/README.md +++ b/Project2-Stream-Compaction/README.md @@ -1,14 +1,95 @@ -CUDA Stream Compaction +CUDA Number Algorithms ====================== **University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 2** -* (TODO) YOUR NAME HERE - * (TODO) [LinkedIn](), [personal website](), [twitter](), etc. -* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab) +* John Marcao + * [LinkedIn](https://www.linkedin.com/in/jmarcao/) + * [Personal Website](https://jmarcao.github.io) +* Tested on: Windows 10, i5-4690K @ 3.50GHz, 8GB DDR3, RTX 2080 TI 3071MB (Personal) -### (TODO: Your README) +# Goals -Include analysis, etc. (Remember, this is public, so don't put -anything here that you don't want to share with the world.) +The goal of this project was to study and analyze different common algorithms in GPU programming. We specifically look at two common operations: stream compaction and exclusive scan. The two functions are common in GPU programming and graphics design, with applications in raytracing and performance-focused improvements. The two algorithms can be combined to form powerful tools. The algorithms work when serialized, but the parallel capabilities of GPUs allow them to be (hopefully) sped up. +Scanning is a simple operation: starting at the beginning of an array, step through the array, adding the previous cell's value to the current cell. 
Stream Compaction is the process of removing unwanted values from an array while preserving the order between wanted values. One example of this is removing all zero elements from an array. + +For this project I implemented four versions of the above algorithms. +* CPU Implementation - To measure the performance of the algorithms in serial execution. +* Naive GPU - A simple GPU implementation with little thought to advanced algorithms or memory. +* Efficient GPU - A more focused implementation that takes advantage of the GPU's parallelism. +* Thrust - A library implementation of the algorithms, for comparison. + +To see how each implementation compares, I ran each with varying block sizes to see the response. The results are below. + +# Performance Data + +I first present some charts showing the performance differences between the implementations. + +![](img/scan_cmp.png) + +![](img/scan_cmp_ohne_cpu.png) + +![](img/compact_cmp.png) + +Looking at the first chart, we can see that for values below 1MB, there is very little difference in the performance between the 8 scenarios. However, as we approach 4MB, it becomes clear that the CPU is underperforming compared to the GPUs. In fact, as can be seen in figure 2, the GPU implementations have improved rates with larger data sets. This comes down to a fundamental rule on GPUs: overhead is high when working with small amounts of data. Why go through the process of offloading a scan operation with only 256 elements? An optimized CPU implementation will take advantage of pipelining, caching, and other techniques to greatly speed up the operation. When approaching large data sets that can be worked with independently, we can see a huge benefit to the GPU. + +Looking at the performance of just the GPU implementations of Scan, there isn't much difference between the Naive and Efficient implementations. 
The two are roughly the same and variation is likely due to noise in the test environment (other processes using the GPU, latencies in CPU scheduling during observation). It is obvious from the data set, however, that the Thrust Non-Base 2 implementation is the most efficient. + +Lastly, looking at figure 3, we can see the performance comparison for the Stream Compaction algorithm. The slowest implementation here is the CPU compaction with Scan. This makes sense, since the scan operation is still serialized and takes up a lot of CPU time. This goes to show that some improvements to an algorithm will only benefit systems that can take advantage of parallelism. We see that the Efficient GPU implementation, both Power of Two and Non Power of Two, perform slightly worse than the CPU implementation without Scan. + +# Example Run + +``` +**************** +** SCAN TESTS ** +**************** + [ 42 16 18 8 38 28 13 0 26 5 30 4 48 ... 39 0 ] +==== cpu scan, power-of-two ==== + elapsed time: 0.0021ms (std::chrono Measured) +==== cpu scan, non-power-of-two ==== + elapsed time: 0.002ms (std::chrono Measured) + passed +==== naive scan, power-of-two ==== + elapsed time: 0.04096ms (CUDA Measured) + passed +==== naive scan, non-power-of-two ==== + elapsed time: 0.038976ms (CUDA Measured) + passed +==== work-efficient scan, power-of-two ==== + elapsed time: 0.089536ms (CUDA Measured) + passed +==== work-efficient scan, non-power-of-two ==== + elapsed time: 0.086976ms (CUDA Measured) + passed +==== thrust scan, power-of-two ==== + elapsed time: 0.001696ms (CUDA Measured) + passed +==== thrust scan, non-power-of-two ==== + elapsed time: 0.000608ms (CUDA Measured) + passed + +***************************** +** STREAM COMPACTION TESTS ** +***************************** + [ 2 0 2 0 0 0 3 0 0 3 0 2 2 ... 1 0 ] +==== cpu compact without scan, power-of-two ==== + elapsed time: 0.0025ms (std::chrono Measured) + [ 2 2 3 3 2 2 1 3 2 2 1 3 3 ... 
2 1 ] + passed +==== cpu compact without scan, non-power-of-two ==== + elapsed time: 0.0025ms (std::chrono Measured) + [ 2 2 3 3 2 2 1 3 2 2 1 3 3 ... 1 2 ] + passed +==== cpu compact with scan ==== + elapsed time: 0.0055ms (std::chrono Measured) + [ 2 2 3 3 2 2 1 3 2 2 1 3 3 ... 2 1 ] + passed +==== work-efficient compact, power-of-two ==== + elapsed time: 0.317888ms (CUDA Measured) + passed +==== work-efficient compact, non-power-of-two ==== + elapsed time: 0.545888ms (CUDA Measured) + passed +Press any key to continue . . . +``` \ No newline at end of file diff --git a/Project2-Stream-Compaction/img/compact_cmp.png b/Project2-Stream-Compaction/img/compact_cmp.png new file mode 100644 index 0000000..beb8511 Binary files /dev/null and b/Project2-Stream-Compaction/img/compact_cmp.png differ diff --git a/Project2-Stream-Compaction/img/scan_cmp.png b/Project2-Stream-Compaction/img/scan_cmp.png new file mode 100644 index 0000000..08cc093 Binary files /dev/null and b/Project2-Stream-Compaction/img/scan_cmp.png differ diff --git a/Project2-Stream-Compaction/img/scan_cmp_ohne_cpu.png b/Project2-Stream-Compaction/img/scan_cmp_ohne_cpu.png new file mode 100644 index 0000000..3cf3dd2 Binary files /dev/null and b/Project2-Stream-Compaction/img/scan_cmp_ohne_cpu.png differ diff --git a/Project2-Stream-Compaction/proj2_compact_data.xlsx b/Project2-Stream-Compaction/proj2_compact_data.xlsx new file mode 100644 index 0000000..655b2f1 Binary files /dev/null and b/Project2-Stream-Compaction/proj2_compact_data.xlsx differ diff --git a/Project2-Stream-Compaction/src/main.cpp b/Project2-Stream-Compaction/src/main.cpp index d016553..362f4e7 100644 --- a/Project2-Stream-Compaction/src/main.cpp +++ b/Project2-Stream-Compaction/src/main.cpp @@ -13,7 +13,7 @@ #include #include "testing_helpers.hpp" -const int SIZE = 1 << 8; // feel free to change the size of array +const int SIZE = 1 << 10; // feel free to change the size of array const int NPOT = SIZE - 3; // Non-Power-Of-Two int *a = 
new int[SIZE]; int *b = new int[SIZE]; @@ -38,13 +38,13 @@ int main(int argc, char* argv[]) { printDesc("cpu scan, power-of-two"); StreamCompaction::CPU::scan(SIZE, b, a); printElapsedTime(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation(), "(std::chrono Measured)"); - printArray(SIZE, b, true); + //printArray(SIZE, b, true); zeroArray(SIZE, c); printDesc("cpu scan, non-power-of-two"); StreamCompaction::CPU::scan(NPOT, c, a); printElapsedTime(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation(), "(std::chrono Measured)"); - printArray(NPOT, b, true); + //printArray(NPOT, b, true); printCmpResult(NPOT, b, c); zeroArray(SIZE, c); diff --git a/Project2-Stream-Compaction/stream_compaction/CMakeLists.txt b/Project2-Stream-Compaction/stream_compaction/CMakeLists.txt index cdbef77..185a604 100644 --- a/Project2-Stream-Compaction/stream_compaction/CMakeLists.txt +++ b/Project2-Stream-Compaction/stream_compaction/CMakeLists.txt @@ -13,5 +13,5 @@ set(SOURCE_FILES cuda_add_library(stream_compaction ${SOURCE_FILES} - OPTIONS -arch=sm_20 + OPTIONS -arch=sm_75 ) diff --git a/Project2-Stream-Compaction/stream_compaction/common.cu b/Project2-Stream-Compaction/stream_compaction/common.cu index 2ed6d63..bc0d481 100644 --- a/Project2-Stream-Compaction/stream_compaction/common.cu +++ b/Project2-Stream-Compaction/stream_compaction/common.cu @@ -23,7 +23,12 @@ namespace StreamCompaction { * which map to 0 will be removed, and elements which map to 1 will be kept. 
*/ __global__ void kernMapToBoolean(int n, int *bools, const int *idata) { - // TODO + int k = threadIdx.x + (blockIdx.x * blockDim.x); + if (k >= n) { + return; + } + + bools[k] = (idata[k] != 0); } /** @@ -32,7 +37,14 @@ namespace StreamCompaction { */ __global__ void kernScatter(int n, int *odata, const int *idata, const int *bools, const int *indices) { - // TODO + int k = threadIdx.x + (blockIdx.x * blockDim.x); + if (k >= n) { + return; + } + + if (bools[k]) { + odata[indices[k]] = idata[k]; + } } } diff --git a/Project2-Stream-Compaction/stream_compaction/common.h b/Project2-Stream-Compaction/stream_compaction/common.h index 996997e..40212d2 100644 --- a/Project2-Stream-Compaction/stream_compaction/common.h +++ b/Project2-Stream-Compaction/stream_compaction/common.h @@ -60,23 +60,42 @@ namespace StreamCompaction { void startCpuTimer() { - if (cpu_timer_started) { throw std::runtime_error("CPU timer already started"); } - cpu_timer_started = true; - - time_start_cpu = std::chrono::high_resolution_clock::now(); + // Don't update the timer if this was called recursively. + if (timer_count == 0) { + timer_count++; + if (cpu_timer_started) { throw std::runtime_error("CPU timer already started"); } + cpu_timer_started = true; + time_start_cpu = std::chrono::high_resolution_clock::now(); + } + else { + // Timer was called while already running, do nothing. + // Some implementations may prefer to reset the timer. + } } void endCpuTimer() { - time_end_cpu = std::chrono::high_resolution_clock::now(); - - if (!cpu_timer_started) { throw std::runtime_error("CPU timer not started"); } - - std::chrono::duration<double, std::milli> duro = time_end_cpu - time_start_cpu; - prev_elapsed_time_cpu_milliseconds = - static_cast<double>(duro.count()); - - cpu_timer_started = false; + // This is the last end call; safe to stop the timer. 
+ if (timer_count == 1) { + timer_count--; + time_end_cpu = std::chrono::high_resolution_clock::now(); + + if (!cpu_timer_started) { throw std::runtime_error("CPU timer not started"); } + + std::chrono::duration<double, std::milli> duro = time_end_cpu - time_start_cpu; + prev_elapsed_time_cpu_milliseconds = + static_cast<double>(duro.count()); + + cpu_timer_started = false; + } + // Decrement and do nothing. + else if (timer_count > 1) { + timer_count--; + } + // Timer count is 0 or below; clamp it back to 0. + else { + timer_count = 0; + } } void startGpuTimer() @@ -127,6 +146,8 @@ namespace StreamCompaction { float prev_elapsed_time_cpu_milliseconds = 0.f; float prev_elapsed_time_gpu_milliseconds = 0.f; + + int timer_count = 0; }; } } diff --git a/Project2-Stream-Compaction/stream_compaction/cpu.cu b/Project2-Stream-Compaction/stream_compaction/cpu.cu index a2d3e6c..1a26920 100644 --- a/Project2-Stream-Compaction/stream_compaction/cpu.cu +++ b/Project2-Stream-Compaction/stream_compaction/cpu.cu @@ -19,7 +19,13 @@ namespace StreamCompaction { */ void scan(int n, int *odata, const int *idata) { timer().startCpuTimer(); - // TODO + + // Exclusive, naive, sequential scan. + odata[0] = 0; + for (int i = 1; i < n; i++) { + odata[i] = odata[i - 1] + idata[i - 1]; + } + timer().endCpuTimer(); } @@ -30,9 +36,19 @@ */ int compactWithoutScan(int n, int *odata, const int *idata) { timer().startCpuTimer(); - // TODO + + // Simple stream compaction w/o scan. + // Fills output array with non-null values. + int oidx = 0; + for (int i = 0; i < n; i++) { + if (idata[i]) { + odata[oidx] = idata[i]; + oidx++; + } + } + timer().endCpuTimer(); - return -1; + return oidx; } /** @@ -41,10 +57,46 @@ * @returns the number of elements remaining after compaction. */ int compactWithScan(int n, int *odata, const int *idata) { - timer().startCpuTimer(); - // TODO - timer().endCpuTimer(); - return -1; + // CPU Stream compaction with scan() function. + // Create intermediate buffer. 
+ int *tmpMap = (int*)malloc(n * sizeof(int)); + if (!tmpMap) { + throw std::runtime_error("Failed to allocate memory for tmpMap buffer!"); + } + int *tmpScan = (int*)malloc(n * sizeof(int)); + if (!tmpScan) { + throw std::runtime_error("Failed to allocate memory for tmpScan buffer!"); + } + + // Exclude the above mallocs from timing, since they can block! + // Assume everything is allocated already for us. + timer().startCpuTimer(); + + // Step 1: Map + for (int i = 0; i < n; i++) { + tmpMap[i] = (idata[i] != 0); + } + + // Step 2: Scan + tmpScan[0] = 0; + for (int i = 1; i < n; i++) { + tmpScan[i] = tmpScan[i - 1] + tmpMap[i - 1]; + } + + // Step 3: Scatter + int oidx = 0; + for (int i = 0; i < n; i++) { + if (tmpMap[i]) { + odata[tmpScan[i]] = idata[i]; + oidx++; + } + } + + timer().endCpuTimer(); + + free(tmpMap); + free(tmpScan); + return oidx; } } } diff --git a/Project2-Stream-Compaction/stream_compaction/efficient.cu b/Project2-Stream-Compaction/stream_compaction/efficient.cu index 2db346e..f64fb65 100644 --- a/Project2-Stream-Compaction/stream_compaction/efficient.cu +++ b/Project2-Stream-Compaction/stream_compaction/efficient.cu @@ -1,5 +1,7 @@ #include #include +#include +#include #include "common.h" #include "efficient.h" @@ -12,13 +14,63 @@ namespace StreamCompaction { return timer; } + int nextPowerOfTwo(int in) { + int out = 0; + float log = log2(in); + + // If this is true, the number IS a power of 2 + if (ceil(log) == floor(log)) { + out = in; + } + else { + // Not a power of two, grab the next one up. + out = 1; + do { + out = out << 1; + } while (out < in); + } + + return out; + } + /** * Performs prefix-sum (aka scan) on idata, storing the result into odata. */ void scan(int n, int *odata, const int *idata) { - timer().startGpuTimer(); - // TODO - timer().endGpuTimer(); + // The efficient algorithm uses balanced binary trees and two phases: upsweep and downsweep. + // This can be performed in place. 
+ + // 0) Correct length to be Power of 2 + const int N = nextPowerOfTwo(n); // Returns 'n' if input is already a power of 2. + + // TODO: How to best compute Blocks/BlockSize? + const int NUM_THREADS = N; // One thread per padded element. + const int NUM_BLOCKS = 1; + + // 1) Initialize Memory + int* dev_data = 0; + cudaMalloc(&dev_data, N * sizeof(int)); + cudaMemset(dev_data + n, 0, (N - n) * sizeof(int)); + cudaMemcpy(dev_data, idata, n * sizeof(int), ::cudaMemcpyHostToDevice); + + // 2) Upsweep + timer().startGpuTimer(); + for (int d = 0; d <= ilog2ceil(N) - 1; d++) { + kernWorkEffScanUpsweep<<<NUM_BLOCKS, NUM_THREADS>>>(N, d, dev_data, dev_data); + } + + // 3) Downsweep + cudaMemset(dev_data + (N-1), 0, 1*sizeof(int)); // Set last element to 0 + for (int d = ilog2ceil(N) - 1; d >= 0; d--) { + kernWorkEffScanDownsweep<<<NUM_BLOCKS, NUM_THREADS>>>(N, d, dev_data, dev_data); + } + + // 4) Cleanup + timer().endGpuTimer(); + cudaMemcpy(odata, dev_data, n * sizeof(int), ::cudaMemcpyDeviceToHost); // Only n elements belong to the caller. + cudaFree(dev_data); + + return; } /** @@ -31,10 +83,90 @@ * @returns The number of elements remaining after compaction. */ int compact(int n, int *odata, const int *idata) { + // Similar to CPU implementation, except we use CUDA kernels + // instead of for-loops. + + // 0) Correct length to be Power of 2 + const int N = nextPowerOfTwo(n); // Returns 'n' if input is already a power of 2. + + const int NUM_THREADS = N; // One thread per padded element. + const int NUM_BLOCKS = 1; 
+ + // Prepare memory + int* dev_odata; + int* dev_idata; + int* dev_map; + int* dev_indicies; + + cudaMalloc(&dev_odata, N * sizeof(int)); + cudaMalloc(&dev_idata, N * sizeof(int)); + cudaMalloc(&dev_map, N * sizeof(int)); + cudaMalloc(&dev_indicies, N * sizeof(int)); + + // Copy only the n real elements; zero the padding so it maps to 0. + cudaMemset(dev_idata, 0, N * sizeof(int)); + cudaMemcpy(dev_idata, idata, n * sizeof(int), ::cudaMemcpyHostToDevice); + timer().startGpuTimer(); - // TODO + // 1) Map + Common::kernMapToBoolean<<<NUM_BLOCKS, NUM_THREADS>>>(N, dev_map, dev_idata); + + // 2) Scan + // 2a) Upsweep + cudaMemcpy(dev_indicies, dev_map, N * sizeof(int), ::cudaMemcpyDeviceToDevice); + for (int d = 0; d <= ilog2ceil(N) - 1; d++) { + kernWorkEffScanUpsweep<<<NUM_BLOCKS, NUM_THREADS>>>(N, d, dev_indicies, dev_indicies); + } + // 2b) Downsweep + cudaMemset(dev_indicies + (N - 1), 0, 1 * sizeof(int)); // Set last element to 0 + for (int d = ilog2ceil(N) - 1; d >= 0; d--) { + kernWorkEffScanDownsweep<<<NUM_BLOCKS, NUM_THREADS>>>(N, d, dev_indicies, dev_indicies); + } + + // 3) Scatter + Common::kernScatter<<<NUM_BLOCKS, NUM_THREADS>>>(N, dev_odata, dev_idata, dev_map, dev_indicies); timer().endGpuTimer(); - return -1; + + // Copy back to host + cudaMemcpy(odata, dev_odata, n * sizeof(int), ::cudaMemcpyDeviceToHost); + + // Get number of elements: the last exclusive-scan value plus the last flag. + int num_elements = 0; + int last_bool = 0; + cudaMemcpy(&num_elements, dev_indicies + N - 1, sizeof(int), ::cudaMemcpyDeviceToHost); + cudaMemcpy(&last_bool, dev_map + N - 1, sizeof(int), ::cudaMemcpyDeviceToHost); + num_elements += last_bool; + + cudaFree(dev_odata); + cudaFree(dev_idata); + cudaFree(dev_map); + cudaFree(dev_indicies); + + return num_elements; } + + __global__ void kernWorkEffScanUpsweep(const int N, const int D, int *out, const int* in) { + int k = threadIdx.x + (blockIdx.x * blockDim.x); + if (k >= N) { + return; + } + + if (k % (1 << (D + 1)) == 0) { + out[k + (1 << (D + 1)) - 1] += in[k + (1 << D) - 1]; + } + } 
+ + __global__ void kernWorkEffScanDownsweep(const int N, const int D, int *out, const int* in) { + int k = threadIdx.x + (blockIdx.x * blockDim.x); + if (k >= N) { + return; + } + + if (k % (1 << (D + 1)) == 0) { + int tmp = in[k + (1 << D) - 1]; + out[k + (1 << D) - 1] = out[k + (1 << (D + 1)) - 1]; + out[k + (1 << (D + 1)) - 1] += tmp; + } + } } } diff --git a/Project2-Stream-Compaction/stream_compaction/efficient.h b/Project2-Stream-Compaction/stream_compaction/efficient.h index 803cb4f..acef929 100644 --- a/Project2-Stream-Compaction/stream_compaction/efficient.h +++ b/Project2-Stream-Compaction/stream_compaction/efficient.h @@ -9,5 +9,12 @@ namespace StreamCompaction { void scan(int n, int *odata, const int *idata); int compact(int n, int *odata, const int *idata); + + __global__ void kernWorkEffScanUpsweep(const int N, const int D, int *out, const int* in); + + __global__ void kernWorkEffScanDownsweep(const int N, const int D, int *out, const int* in); + + int nextPowerOfTwo(int in); + } } diff --git a/Project2-Stream-Compaction/stream_compaction/naive.cu b/Project2-Stream-Compaction/stream_compaction/naive.cu index 4308876..60a1c61 100644 --- a/Project2-Stream-Compaction/stream_compaction/naive.cu +++ b/Project2-Stream-Compaction/stream_compaction/naive.cu @@ -1,5 +1,6 @@ #include #include +#include #include "common.h" #include "naive.h" @@ -11,15 +12,69 @@ namespace StreamCompaction { static PerformanceTimer timer; return timer; } - // TODO: __global__ /** * Performs prefix-sum (aka scan) on idata, storing the result into odata. */ void scan(int n, int *odata, const int *idata) { + // Allocate buffers on GPU and move data in + const size_t ARR_LEN = n * sizeof(int); + const int NUM_THREADS = n; + const int NUM_BLOCKS = 1; // TODO: How to best compute Blocks/BlockSize? 
+ int* dev_odata; + int* dev_tmp; + + // Allocate our arrays + cudaMalloc(&dev_odata, ARR_LEN); + cudaMalloc(&dev_tmp, ARR_LEN); + + // Copy input to odata buffer. + // After each loop of the algorithm we will swap tmp and odata + // so that the final result will always be located in the dev_odata buffer. + cudaMemcpy(dev_odata, idata, ARR_LEN, ::cudaMemcpyHostToDevice); + cudaMemcpy(dev_tmp, idata, ARR_LEN, ::cudaMemcpyHostToDevice); + + // Algorithm adapted from GPU Gems 3, Section 39.2.1 timer().startGpuTimer(); - // TODO + for (int d = 1; d <= ilog2ceil(n); d++) { + std::swap(dev_tmp, dev_odata); + kernScanStep<<<NUM_BLOCKS, NUM_THREADS>>>(n, d, dev_odata, dev_tmp); + } + + // Algorithm above produced an inclusive scan; adjust to exclusive. + std::swap(dev_tmp, dev_odata); + kernInclusiveToExclusive<<<NUM_BLOCKS, NUM_THREADS>>>(n, dev_odata, dev_tmp); + cudaMemset(dev_odata, 0, sizeof(int)); // Set first element to 0 (identity) + timer().endGpuTimer(); + + // Copy back to host and free memory + cudaMemcpy(odata, dev_odata, ARR_LEN, ::cudaMemcpyDeviceToHost); + cudaFree(dev_tmp); + cudaFree(dev_odata); } + + __global__ void kernScanStep(const int N, const int D, int *out, const int* in) { + int k = threadIdx.x + (blockIdx.x * blockDim.x); + if (k >= N) { + return; + } + + if (k >= (1 << (D - 1))) { + out[k] = in[k - (1 << (D - 1))] + in[k]; + } + else { + out[k] = in[k]; + } + } + + __global__ void kernInclusiveToExclusive(const int N, int *out, const int* in) { + int k = threadIdx.x + (blockIdx.x * blockDim.x); + if (k >= N - 1) { // Modified condition, we do NOT want the last thread working. 
+ return; + } + + out[k + 1] = in[k]; + } } } diff --git a/Project2-Stream-Compaction/stream_compaction/naive.h b/Project2-Stream-Compaction/stream_compaction/naive.h index 37dcb06..d136397 100644 --- a/Project2-Stream-Compaction/stream_compaction/naive.h +++ b/Project2-Stream-Compaction/stream_compaction/naive.h @@ -7,5 +7,7 @@ namespace StreamCompaction { StreamCompaction::Common::PerformanceTimer& timer(); void scan(int n, int *odata, const int *idata); + __global__ void kernScanStep(const int N, const int D, int *odata, const int* idata); + __global__ void kernInclusiveToExclusive(const int N, int *out, const int* in); } } diff --git a/Project2-Stream-Compaction/stream_compaction/thrust.cu b/Project2-Stream-Compaction/stream_compaction/thrust.cu index 1def45e..a28ccf6 100644 --- a/Project2-Stream-Compaction/stream_compaction/thrust.cu +++ b/Project2-Stream-Compaction/stream_compaction/thrust.cu @@ -22,6 +22,7 @@ namespace StreamCompaction { // TODO use `thrust::exclusive_scan` // example: for device_vectors dv_in and dv_out: // thrust::exclusive_scan(dv_in.begin(), dv_in.end(), dv_out.begin()); + thrust::exclusive_scan(idata, idata + n, odata); timer().endGpuTimer(); } } diff --git a/README.md b/README.md index 3a0b2fe..5b63243 100644 --- a/README.md +++ b/README.md @@ -3,14 +3,93 @@ CUDA Number Algorithms **University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 2** -* (TODO) YOUR NAME HERE - * (TODO) [LinkedIn](), [personal website](), [twitter](), etc. -* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab) +* John Marcao + * [LinkedIn](https://www.linkedin.com/in/jmarcao/) + * [Personal Website](https://jmarcao.github.io) +* Tested on: Windows 10, i5-4690K @ 3.50GHz, 8GB DDR3, RTX 2080 TI 3071MB (Personal) -### (TODO: Your README) +# Goals -Link to the readmes of the other two subprojects. +The goal of this project was to study and analyze different common algorithms in GPU programming. 
We specifically look at two common operations: stream compaction and exclusive scan. The two functions are common in GPU programming and graphics design, with applications in raytracing and performance-focused improvements. The two algorithms can be combined to form powerful tools. The algorithms work when serialized, but the parallel capabilities of GPUs allow them to be (hopefully) sped up. -Add anything else you think is relevant up to this point. -(Remember, this is public, so don't put anything here that you don't want to share with the world.) +Scanning is a simple operation: starting at the beginning of an array, step through the array, adding the previous cell's value to the current cell. Stream Compaction is the process of removing unwanted values from an array while preserving the order between wanted values. One example of this is removing all zero elements from an array. +For this project I implemented four versions of the above algorithms. +* CPU Implementation - To measure the performance of the algorithms in serial execution. +* Naive GPU - A simple GPU implementation with little thought to advanced algorithms or memory. +* Efficient GPU - A more focused implementation that takes advantage of the GPU's parallelism. +* Thrust - A library implementation of the algorithms, for comparison. + +To see how each implementation compares, I ran each with varying block sizes to see the response. The results are below. + +# Performance Data + +I first present some charts showing the performance differences between the implementations. + +![](img/scan_cmp.png) + +![](img/scan_cmp_ohne_cpu.png) + +![](img/compact_cmp.png) + +Looking at the first chart, we can see that for values below 1MB, there is very little difference in the performance between the 8 scenarios. However, as we approach 4MB, it becomes clear that the CPU is underperforming compared to the GPUs. In fact, as can be seen in figure 2, the GPU implementations have improved rates with larger data sets. 
This comes down to a fundamental rule on GPUs: overhead is high when working with small amounts of data. Why go through the process of offloading a scan operation with only 256 elements? An optimized CPU implementation will take advantage of pipelining, caching, and other techniques to greatly speed up the operation. When approaching large data sets that can be worked with independently, we can see a huge benefit to the GPU. + +Looking at the performance of just the GPU implementations of Scan, there isn't much difference between the Naive and Efficient implementations. The two are roughly the same and variation is likely due to noise in the test environment (other processes using the GPU, latencies in CPU scheduling during observation). It is obvious from the data set, however, that the Thrust Non-Base 2 implementation is the most efficient. + +Lastly, looking at figure 3, we can see the performance comparison for the Stream Compaction algorithm. The slowest implementation here is the CPU compaction with Scan. This makes sense, since the scan operation is still serialized and takes up a lot of CPU time. This goes to show that some improvements to an algorithm will only benefit systems that can take advantage of parallelism. We see that the Efficient GPU implementation, both Power of Two and Non Power of Two, perform slightly worse than the CPU implementation without Scan. + +# Example Run + +``` +**************** +** SCAN TESTS ** +**************** + [ 42 16 18 8 38 28 13 0 26 5 30 4 48 ... 
39 0 ] +==== cpu scan, power-of-two ==== + elapsed time: 0.0021ms (std::chrono Measured) +==== cpu scan, non-power-of-two ==== + elapsed time: 0.002ms (std::chrono Measured) + passed +==== naive scan, power-of-two ==== + elapsed time: 0.04096ms (CUDA Measured) + passed +==== naive scan, non-power-of-two ==== + elapsed time: 0.038976ms (CUDA Measured) + passed +==== work-efficient scan, power-of-two ==== + elapsed time: 0.089536ms (CUDA Measured) + passed +==== work-efficient scan, non-power-of-two ==== + elapsed time: 0.086976ms (CUDA Measured) + passed +==== thrust scan, power-of-two ==== + elapsed time: 0.001696ms (CUDA Measured) + passed +==== thrust scan, non-power-of-two ==== + elapsed time: 0.000608ms (CUDA Measured) + passed + +***************************** +** STREAM COMPACTION TESTS ** +***************************** + [ 2 0 2 0 0 0 3 0 0 3 0 2 2 ... 1 0 ] +==== cpu compact without scan, power-of-two ==== + elapsed time: 0.0025ms (std::chrono Measured) + [ 2 2 3 3 2 2 1 3 2 2 1 3 3 ... 2 1 ] + passed +==== cpu compact without scan, non-power-of-two ==== + elapsed time: 0.0025ms (std::chrono Measured) + [ 2 2 3 3 2 2 1 3 2 2 1 3 3 ... 1 2 ] + passed +==== cpu compact with scan ==== + elapsed time: 0.0055ms (std::chrono Measured) + [ 2 2 3 3 2 2 1 3 2 2 1 3 3 ... 2 1 ] + passed +==== work-efficient compact, power-of-two ==== + elapsed time: 0.317888ms (CUDA Measured) + passed +==== work-efficient compact, non-power-of-two ==== + elapsed time: 0.545888ms (CUDA Measured) + passed +Press any key to continue . . . +``` \ No newline at end of file