OLCF Hackathon Notes
Welcome to the OLCFHack15 wiki!
How to use the drop-in library (available starting from CUDA 6.0):
- https://developer.nvidia.com/cublasxt
- http://devblogs.nvidia.com/parallelforall/drop-in-acceleration-gnu-octave
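The drop-in library is NVBLAS, which sits on top of cuBLAS-XT and intercepts standard Level-3 BLAS calls (DGEMM and friends) so that unmodified host code runs them on the GPU. A minimal sketch of what that looks like from Fortran, assuming the standard BLAS interface; the link line and preload command in the comments are illustrative, not taken from these notes:

```fortran
! Ordinary host BLAS code -- nothing GPU-specific. To accelerate it, either
! link libnvblas ahead of the CPU BLAS (e.g. pgfortran test.f90 -lnvblas -lblas)
! or preload it at run time (LD_PRELOAD=libnvblas.so ./a.out).
program dgemm_dropin
  implicit none
  integer, parameter :: m = 1024, n = 1024, k = 1024
  real(8) :: a(m,k), b(k,n), c(m,n)
  real(8) :: alpha = 1.0d0, beta = 0.0d0

  call random_number(a)
  call random_number(b)

  ! A standard BLAS call; NVBLAS intercepts Level-3 routines such as DGEMM
  ! and runs them on the GPU via cuBLAS-XT.
  call dgemm('n', 'n', m, n, k, alpha, a, m, b, k, beta, c, m)

  print *, 'c(1,1) = ', c(1,1)
end program dgemm_dropin
```

Because the source does not change, this is a quick way to check whether an application's dense linear algebra is worth moving to the GPU before writing any OpenACC. NVBLAS reads a configuration file (pointed to by the NVBLAS_CONFIG_FILE environment variable) that names the CPU BLAS to fall back to for small problems.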
How to use batching:
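A minimal sketch of one batching approach: spread many small, independent DGEMMs across several OpenACC async queues so their transfers and kernels can overlap on the device. It reuses the dgemm_acc_openacc_async wrapper from the pinned-memory example below and assumes its first argument is the async queue id; the matrix sizes, batch count, and queue count are illustrative.

```fortran
! Hypothetical sketch (not from the original notes): overlap many small,
! independent DGEMMs by round-robining them over OpenACC async queues.
program batched_dgemm_sketch
  implicit none
  integer, parameter :: m = 128, n = 128, k = 128
  integer, parameter :: nbatch = 64, nqueues = 4
  real(8) :: a(m,k,nbatch), b(k,n,nbatch), c(m,n,nbatch)
  real(8) :: alpha = 1.0d0, beta = 0.0d0
  integer :: i, q

  call random_number(a)
  call random_number(b)

  !$acc enter data copyin(a, b) create(c)
  !$acc host_data use_device(a, b, c)
  do i = 1, nbatch
     q = mod(i - 1, nqueues) + 1   ! round-robin the batches over async queues
     call dgemm_acc_openacc_async(q, 'n', 'n', m, n, k, alpha, &
                                  a(1,1,i), m, b(1,1,i), k, beta, c(1,1,i), m)
  end do
  !$acc end host_data
  !$acc wait                       ! wait on every queue before touching c
  !$acc exit data copyout(c) delete(a, b)
end program batched_dgemm_sketch
```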
PCIe transfers can run at roughly twice the bandwidth when the host buffers live in "pinned" (page-locked) memory. Making your OpenACC copies asynchronous, and waiting on them immediately, often forces the runtime to use pinned memory, which frequently speeds up the code.
!JL Making these copies async and then waiting on them immediately often
!JL forces the runtime to use "pinned" CPU memory, which doubles the
!JL PCIe bandwidth
!$acc enter data copyin(a, b) create(c) async(1)
!$acc host_data use_device(a, b, c)
call dgemm_acc_openacc_async(1, 'n', 'n', m, n, k, alpha, a, m, b, k, beta, c, m)
!$acc end host_data
!$acc exit data copyout(c) delete(a, b) async(1)
!$acc wait(1)
If you're using the PGI compiler, you can also try -ta=tesla:pin to force all memory allocations into pinned memory.