2 changes: 1 addition & 1 deletion INSTRUCTION.md
@@ -310,4 +310,4 @@ The template of the comment section of your pull request is attached below, you
+ You can try multi-threading on CPU if you want (not required and not our focus)
+ for each element input[i] in original array
- if it's non-zero (given by mapped array)
- then put it at output[index], where index = scanned[i]
- then put it at output[index], where index = scanned[i]
34 changes: 30 additions & 4 deletions README.md
@@ -3,11 +3,37 @@ CUDA Stream Compaction

**University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 2**

* (TODO) YOUR NAME HERE
* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab)
* Anton Khabbaz
* pennkey: akhabbaz
* Tested on: Windows 10, Surface Book, i7-6600U @ 2.66 GHz, GTX 965M (personal computer)

### (TODO: Your README)

Include analysis, etc. (Remember, this is public, so don't put
anything here that you don't want to share with the world.)
This project explores parallel algorithms and how they are implemented on the GPU. A scan computes a running sum over an array of integers; here I implemented four versions of scan: a serial CPU version, a naive GPU version, a work-efficient GPU version, and a version that uses the Thrust library.
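
As a reference point for the GPU versions, here is a minimal sketch of a serial exclusive scan (the loop in cpu.cu may differ in details; the function name is illustrative):

```cpp
// Serial exclusive scan (prefix sum): odata[i] = sum of idata[0..i-1].
// Assumes odata and idata each hold n ints.
void cpuExclusiveScan(int n, int *odata, const int *idata) {
    int sum = 0;
    for (int i = 0; i < n; ++i) {
        odata[i] = sum;   // exclusive: element i does not include idata[i]
        sum += idata[i];
    }
}
```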

For the work-efficient scan I also implemented the extra credit: all threads are contiguous in both the up-sweep and the down-sweep. Furthermore, I trimmed the number of threads launched so that the grid was only as large as the number of threads actually needed at each level. With roughly 10^8 elements, my work-efficient scan matched the CPU scan in speed. The GPU and CPU timers measured only computation, not the time to allocate and transfer data.
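
A minimal sketch of the contiguous-thread idea for one up-sweep level, under the assumption that indices are computed in 64 bits (the kernel name and exact arithmetic in efficient.cu may differ):

```cuda
// One level of the up-sweep. Thread t (contiguous: 0, 1, 2, ...) updates
// the element at (t + 1) * 2 * stride - 1, so every launched thread has work.
// The index is computed in 64 bits to avoid the int overflow described below.
__global__ void kernUpSweepLevel(int n, int stride, int *data) {
    long long t = threadIdx.x + (long long)blockIdx.x * blockDim.x;
    long long index = (t + 1) * 2LL * stride - 1;   // right element of this thread's pair
    if (index >= n) {
        return;   // threads past the end of the array do nothing
    }
    data[index] += data[index - stride];
}
```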

All tests passed for every array size I tried.

![](img/RunTimes.png)
This figure plots the log10 of the run time per element in the array, a measure that stayed relatively constant as N increased. The work-efficient scan starts out slower than the CPU scan, but as the number of elements grows it first overtakes the naive scan and then matches the CPU scan. Thrust is the fastest scan for large inputs, by a factor of about 6.
The non-power-of-two scans were slightly faster than, but comparable in speed to, the power-of-two scans.

Stream compaction is one application of scan: it removes unwanted elements from a stream. At around 10^5 elements, the work-efficient compaction actually beat the other compactions. A compaction built on a Thrust scan would beat the work-efficient one, since the majority of the compaction time is spent computing the exclusive scan.
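
A hedged sketch of how the compaction fits together; the real version lives in efficient.cu and uses the kernMapToBoolean and kernScatter kernels shown later in this diff, while scanDevice here is a placeholder for the work-efficient scan operating on device buffers:

```cuda
// Stream compaction = map to booleans, exclusive scan, scatter.
// dev_in and dev_out are device buffers of n ints; names are illustrative.
int compactSketch(int n, int *dev_out, const int *dev_in) {
    const int blockSize = 128;
    const int numBlocks = (n + blockSize - 1) / blockSize;

    int *dev_bools, *dev_indices;
    cudaMalloc(&dev_bools,   n * sizeof(int));
    cudaMalloc(&dev_indices, n * sizeof(int));

    // 1. Map each element to 1 if it is kept (non-zero), else 0.
    StreamCompaction::Common::kernMapToBoolean<<<numBlocks, blockSize>>>(n, dev_bools, dev_in);
    // 2. Exclusive scan of the booleans gives each kept element its output slot.
    //    This step dominates the compaction time.
    scanDevice(n, dev_indices, dev_bools);   // placeholder for the device-side scan
    // 3. Scatter kept elements to their slots.
    StreamCompaction::Common::kernScatter<<<numBlocks, blockSize>>>(n, dev_out, dev_in, dev_bools, dev_indices);

    // Count of kept elements = last scanned index + last boolean.
    int lastIndex = 0, lastBool = 0;
    cudaMemcpy(&lastIndex, dev_indices + n - 1, sizeof(int), cudaMemcpyDeviceToHost);
    cudaMemcpy(&lastBool,  dev_bools   + n - 1, sizeof(int), cudaMemcpyDeviceToHost);

    cudaFree(dev_bools);
    cudaFree(dev_indices);
    return lastIndex + lastBool;
}
```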

![](img/StreamCompactionTimes.png)


I implemented the efficient scan using contiguous threads. It worked perfectly within one block, but beyond one block it failed: threads in different blocks do not synchronize.

I ran into several difficulties. At first I used __syncthreads() to synchronize, which only works when a single block holds all the threads. I then changed the code to iterate over the strides on the CPU, launching one kernel per level, which keeps all the blocks synchronized.
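
A minimal sketch of that host-side loop for the up-sweep, assuming the kernUpSweepLevel kernel sketched above and a power-of-two padded length n (the down-sweep uses the same pattern with the strides reversed):

```cuda
// Host-side up-sweep: one kernel launch per level. Each launch only starts
// after the previous one has finished on the default stream, which gives the
// grid-wide synchronization that __syncthreads() cannot.
void upSweep(int n, int *dev_data, int blockSize) {
    for (int stride = 1; stride < n; stride *= 2) {
        // Only n / (2 * stride) threads have work at this level, so size the
        // grid for exactly that many (this is the thread trimming above).
        int threadsNeeded = n / (2 * stride);
        int numBlocks = (threadsNeeded + blockSize - 1) / blockSize;
        kernUpSweepLevel<<<numBlocks, blockSize>>>(n, stride, dev_data);
    }
}
```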

Another major issue was that at around 2^15 elements I got an allocation error. I traced it to the work-efficient scan and used the CUDA memory checker to catch an out-of-bounds access. It showed that I was multiplying two integers of roughly 65K each to compute the index (some threads had high indices), which overflowed to a negative index. I fixed this by returning early based on the stride, so the computed index can never exceed the available array. I also chose the grid size so that only the minimum number of threads needed at each stride is launched; this culled the thread count and sped up the run time.
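
To make the overflow concrete, a small illustration with made-up values (not taken from the actual run):

```cpp
// The per-thread index (t + 1) * 2 * stride - 1 is a product of two values in
// the tens of thousands, which can exceed what a 32-bit signed int can hold.
#include <climits>
#include <cstdio>

int main() {
    long long stride = 1 << 16;                  // 65,536
    long long t = 20000;                         // a high thread id at this level
    long long index = (t + 1) * 2 * stride - 1;  // about 2.6 * 10^9
    printf("index = %lld, INT_MAX = %d\n", index, INT_MAX);
    // Wrapped into 32 bits and read back as a signed int, the same value
    // comes out negative -- hence the out-of-bounds write.
    printf("as 32-bit: %d\n", (int)(unsigned int)index);
    return 0;
}
```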


A final issue: in the compaction I did not return early when the thread index was beyond the array size. That caused multiple threads to write to the same location, which was hard to debug; the fix is the bounds check at the top of kernScatter in common.cu below.

The code now works for every N I tried, up to the limit of my GPU's memory (about 2 GB).

Binary file added RunTimes.xlsx
Binary file not shown.
Binary file added img/RunTimes.pdf
Binary file not shown.
Binary file added img/RunTimes.png
Binary file added img/StreamCompactionTimes.pdf
Binary file not shown.
Binary file added img/StreamCompactionTimes.png
33 changes: 21 additions & 12 deletions src/main.cpp
@@ -12,19 +12,23 @@
#include <stream_compaction/efficient.h>
#include <stream_compaction/thrust.h>
#include "testing_helpers.hpp"
#include <iostream>

const int SIZE = 1 << 8; // feel free to change the size of array
const int SIZE = 1 << 26; // feel free to change the size of array
const int NPOT = SIZE - 3; // Non-Power-Of-Two
int a[SIZE], b[SIZE], c[SIZE];
//int a[SIZE], b[SIZE], c[SIZE];

int main(int argc, char* argv[]) {
// Scan tests

// Scan tests
int* a = static_cast<int*>(malloc(SIZE * sizeof(int)));
int* b = static_cast<int*>(malloc(SIZE * sizeof(int)));
int* c = static_cast<int*>(malloc(SIZE * sizeof(int)));
printf("\n");
printf("****************\n");
printf("** SCAN TESTS **\n");
printf("****************\n");

printf("Size: %d\n", SIZE);
genArray(SIZE - 1, a, 50); // Leave a 0 at the end to test that edge case
a[SIZE - 1] = 0;
printArray(SIZE, a, true);
@@ -45,12 +49,14 @@ int main(int argc, char* argv[]) {
printArray(NPOT, b, true);
printCmpResult(NPOT, b, c);

zeroArray(SIZE, c);
printDesc("naive scan, power-of-two");


printDesc("naive scan, power-of-two");
StreamCompaction::Naive::scan(SIZE, c, a);
printElapsedTime(StreamCompaction::Naive::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
//printArray(SIZE, c, true);
printElapsedTime(StreamCompaction::Naive::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
printArray(SIZE, c, true);
printCmpResult(SIZE, b, c);


zeroArray(SIZE, c);
printDesc("naive scan, non-power-of-two");
@@ -63,14 +69,14 @@ int main(int argc, char* argv[]) {
printDesc("work-efficient scan, power-of-two");
StreamCompaction::Efficient::scan(SIZE, c, a);
printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
//printArray(SIZE, c, true);
printArray(SIZE, c, true);
printCmpResult(SIZE, b, c);

zeroArray(SIZE, c);
printDesc("work-efficient scan, non-power-of-two");
StreamCompaction::Efficient::scan(NPOT, c, a);
printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
//printArray(NPOT, c, true);
printArray(NPOT, c, true);
printCmpResult(NPOT, b, c);

zeroArray(SIZE, c);
@@ -129,15 +135,18 @@ int main(int argc, char* argv[]) {
printDesc("work-efficient compact, power-of-two");
count = StreamCompaction::Efficient::compact(SIZE, c, a);
printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
//printArray(count, c, true);
printArray(count, c, true);
printCmpLenResult(count, expectedCount, b, c);

zeroArray(SIZE, c);
printDesc("work-efficient compact, non-power-of-two");
count = StreamCompaction::Efficient::compact(NPOT, c, a);
printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
//printArray(count, c, true);
printArray(count, c, true);
printCmpLenResult(count, expectedNPOT, b, c);

system("pause"); // stop Win32 console from closing on exit
free(a);
free(b);
free(c);
}
2 changes: 1 addition & 1 deletion src/testing_helpers.hpp
@@ -44,7 +44,7 @@ void zeroArray(int n, int *a) {
}

void genArray(int n, int *a, int maxval) {
srand(time(nullptr));
// srand(time(nullptr));

for (int i = 0; i < n; i++) {
a[i] = rand() % maxval;
2 changes: 1 addition & 1 deletion stream_compaction/CMakeLists.txt
@@ -13,5 +13,5 @@ set(SOURCE_FILES

cuda_add_library(stream_compaction
${SOURCE_FILES}
OPTIONS -arch=sm_20
OPTIONS -arch=sm_52
)
14 changes: 12 additions & 2 deletions stream_compaction/common.cu
@@ -23,7 +23,11 @@ namespace StreamCompaction {
* which map to 0 will be removed, and elements which map to 1 will be kept.
*/
__global__ void kernMapToBoolean(int n, int *bools, const int *idata) {
// TODO
int index = threadIdx.x + blockIdx.x * blockDim.x;
if (index >= n) {
return;
}
bools[index] = ( idata[index] == 0) ? 0 : 1;
}

/**
@@ -32,7 +36,13 @@
*/
__global__ void kernScatter(int n, int *odata,
const int *idata, const int *bools, const int *indices) {
// TODO
int index = threadIdx.x + blockIdx.x * blockDim.x;
if (index >= n) {
return;
}
if ( bools[index] == 1) {
odata[indices[index]] = idata[index];
}
}

}
71 changes: 37 additions & 34 deletions stream_compaction/common.h
@@ -1,42 +1,45 @@
#pragma once
#pragma once

#include <cuda.h>
#include <cuda_runtime.h>

#include <cstdio>
#include <cstring>
#include <cmath>
#include <algorithm>
#include <chrono>
#include <chrono>
#include <stdexcept>

#define FILENAME (strrchr(__FILE__, '/') ? strrchr(__FILE__, '/') + 1 : __FILE__)
#define checkCUDAError(msg) checkCUDAErrorFn(msg, FILENAME, __LINE__)

/**
* Check for CUDA errors; print and exit if there was a problem.
*/
void checkCUDAErrorFn(const char *msg, const char *file = NULL, int line = -1);

inline int ilog2(int x) {
int lg = 0;
while (x >>= 1) {
++lg;
}
return lg;
}

inline int ilog2ceil(int x) {
return ilog2(x - 1) + 1;
}

namespace StreamCompaction {
namespace Common {
__global__ void kernMapToBoolean(int n, int *bools, const int *idata);

__global__ void kernScatter(int n, int *odata,
const int *idata, const int *bools, const int *indices);


#define FILENAME (strrchr(__FILE__, '/') ? strrchr(__FILE__, '/') + 1 : __FILE__)
#define checkCUDAError(msg) checkCUDAErrorFn(msg, FILENAME, __LINE__)

/**
* Check for CUDA errors; print and exit if there was a problem.
*/
void checkCUDAErrorFn(const char *msg, const char *file = NULL, int line = -1);

inline int ilog2(int x) {
if (x < 0) {
return 0;
}
int lg = 0;
while (x >>= 1) {
++lg;
}
return lg;
}

inline int ilog2ceil(int x) {
return ilog2(x - 1) + 1;
}

namespace StreamCompaction {
namespace Common {
__global__ void kernMapToBoolean(int n, int *bools, const int *idata);

__global__ void kernScatter(int n, int *odata,
const int *idata, const int *bools, const int *indices);

/**
* This class is used for timing the performance
* Uncopyable and unmovable
@@ -127,6 +130,6 @@ namespace StreamCompaction {

float prev_elapsed_time_cpu_milliseconds = 0.f;
float prev_elapsed_time_gpu_milliseconds = 0.f;
};
}
}
};
}
}