Thrust Exercise
Thrust is a C++ template library implemented on top of CUDA, so Thrust routines are included 'on the fly' during compilation with NVIDIA's nvcc compiler. By this time you should have logged into the course platform and loaded the CUDA 5.0 toolkit by loading the craype-accel-nvidia35 module (see the instructions on the main page).
The simplest compilation is
nvcc -o myprog.exe -I/apps/todi/thrust myprog.cu
where /apps/todi/thrust is the location of the Thrust include files, myprog.cu is a CUDA C++ program, and myprog.exe is the resulting executable. Probably the best documentation of Thrust is the QuickStart Guide on its home page.
Sorting is one of the most fundamental kernels and is required by a vast number of algorithms. In this exercise you are given a Thrust application that sorts, on the CPU (host), an integer vector whose length is specified at run time. CUDA events (see the CUDA introductory lecture) are used to time the sort operation.
- Copy sort_host.cu to a file in your course account; the easiest way is cut-and-paste:
cat > sort_host.cu
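If you do not have the original file at hand, a minimal sketch of such a program might look like the following (this is an illustration, not the course's exact sort_host.cu; it assumes the vector length N is passed as the first command-line argument):

```cpp
// sort_host.cu -- minimal sketch: sort a random integer vector on the
// host with Thrust and time the sort with CUDA events.
#include <thrust/host_vector.h>
#include <thrust/generate.h>
#include <thrust/sort.h>
#include <cstdio>
#include <cstdlib>

int main(int argc, char **argv)
{
    int N = (argc > 1) ? atoi(argv[1]) : 1000;

    // Host vector filled with random integers
    thrust::host_vector<int> h_vec(N);
    thrust::generate(h_vec.begin(), h_vec.end(), rand);

    // CUDA events bracket the sort (see the CUDA introductory lecture)
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    thrust::sort(h_vec.begin(), h_vec.end());   // host_vector => CPU sort
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    // After sorting, the extrema sit at the ends of the vector
    printf("sorted %d elements in %g ms; min = %d, max = %d\n",
           N, ms, h_vec.front(), h_vec.back());

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return 0;
}
```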
- Compile the code, e.g.,
nvcc -o sort_host.exe -I/apps/todi/thrust sort_host.cu
- Run the sort_host.exe program. If you have not already, allocate one GPU compute node on the course platform:
salloc -N 1
When the prompt returns you are logged onto a compute node where you can work interactively for one hour. Run the executable with the Cray program launcher aprun using a single process (only one process can access the GPU at a time); the program's argument N is the vector length, anywhere from 10^3 to 10^8 elements, e.g.,
aprun -n 1 ./sort_host.exe 1000
The sort algorithm has O(N log N) complexity, and since log N grows very slowly there should be a strong, nearly linear signal in the resulting timings. Start increasing N by orders of magnitude. When does this linear signal break down? Take a guess why it might be breaking down. After it breaks down, do you see the linear signal again?
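As a quick sanity check on the model, you can tabulate the slowdown that T ~ N log N predicts for each tenfold increase in N and compare it with your measurements (a hypothetical helper, not part of the exercise files):

```cpp
#include <cstdio>
#include <cmath>

int main(void)
{
    // Under T ~ N*log(N), going from N to 10N should slow the sort
    // down by a factor of 10*log(10N)/log(N) -- only slightly more
    // than 10x, hence the nearly linear signal.
    for (double N = 1e3; N <= 1e7; N *= 10.0) {
        double ratio = 10.0 * log(10.0 * N) / log(N);
        printf("N = %.0e -> 10N: predicted slowdown %.2fx\n", N, ratio);
    }
    return 0;
}
```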
- Implement a sort_device.cu program which performs the sort on the device. If you feel so inclined, you can keep the host sort and add a new sort on the device in order to compare timings. Alternatively, you can simply revise the current host sort to run on the GPU. First, keep the h_vec, since it needs to be initialized with random numbers on the host anyway. Next, you will need to declare a device vector with the same values as the host vector. You could achieve this with:
thrust::device_vector<int> d_vec(N);
thrust::copy(h_vec.begin(), h_vec.end(), d_vec.begin());
But there is a much better way (see the lecture materials)! Now perform the sort on the device; since Thrust is templated, you need only pass the device vector to get the appropriate implementation. Finally, don't forget to copy the result back to the host, in order to make sure that the device sort determines the correct smallest and largest elements. A sketch of these steps follows below.
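For reference, the device-side steps might look roughly like this (a sketch; presumably the hinted "better way" is constructing the device vector directly from the host vector, but confirm against the lecture materials; the timing code is elided):

```cpp
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/generate.h>
#include <thrust/sort.h>
#include <thrust/copy.h>
#include <cstdlib>

int main(void)
{
    // Initialize on the host, as before
    thrust::host_vector<int> h_vec(1000);
    thrust::generate(h_vec.begin(), h_vec.end(), rand);

    // Construct the device vector directly from the host vector;
    // Thrust performs the host-to-device copy for you.
    thrust::device_vector<int> d_vec = h_vec;

    // Same sort call as on the host; the device_vector iterators
    // select the GPU implementation through Thrust's templates.
    thrust::sort(d_vec.begin(), d_vec.end());

    // Copy the sorted data back so the host can check the extrema.
    thrust::copy(d_vec.begin(), d_vec.end(), h_vec.begin());
    return 0;
}
```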
- Now run your sorting algorithm on the GPU for N = 10^3 to 10^8. For small vectors, is it slower or faster than on the CPU? Why? Is there the same transition point as on the CPU?
A solution can be found here: