OLCF Hackathon Notes

Jeff Larkin edited this page Oct 22, 2015 · 5 revisions

Welcome to the OLCFHack15 wiki!

BLAS: drop-in library, batching small dgemms

How to use the drop-in library (available starting from CUDA 6.0):
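The original instructions for this step are missing from the page. The following is a minimal sketch, assuming the drop-in library meant here is NVBLAS, which shipped with CUDA 6.0 and intercepts Level-3 BLAS calls such as dgemm. The CPU BLAS path is a hypothetical placeholder; point it at whatever CPU BLAS your application already links against.

```shell
# Sketch only: the CPU BLAS path below is a hypothetical placeholder.
cat > nvblas.conf <<'EOF'
# Required: CPU BLAS library that NVBLAS falls back to
NVBLAS_CPU_BLAS_LIB /path/to/your/libblas.so
# Use every GPU visible to the process
NVBLAS_GPU_LIST ALL
# File where NVBLAS writes its log
NVBLAS_LOGFILE nvblas.log
EOF

# Tell NVBLAS where to find the configuration file
export NVBLAS_CONFIG_FILE="$PWD/nvblas.conf"
```

With the configuration in place, the application can usually be run unmodified with `LD_PRELOAD=libnvblas.so ./my_app` (`my_app` is a placeholder), or relinked with `-lnvblas` placed ahead of the CPU BLAS library.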

How to use batching:

Speeding up data transfers

PCIe transfers can run at about twice the bandwidth when the host data is in "pinned" (page-locked) memory. Making your OpenACC data copies asynchronous and then waiting on them immediately often causes the runtime to use pinned memory for the transfers, which will often speed up the code.

      !JL Making these async and then waiting immediately often forces
      !JL the runtime to use "pinned" CPU memory which doubles the
      !JL PCIe bandwidth
      !$acc enter data copyin(a, b) create(c) async(1)
      !$acc host_data use_device(a, b, c)

      call dgemm_acc_openacc_async (1,'n', 'n', m, n, k, alpha, a, m, b, k, beta, c, m)

      !$acc end host_data
      !$acc exit data copyout(c) delete(a, b) async(1)
      !$acc wait(1)

If you're using the PGI compiler, you can also try -ta=tesla:pin to force all memory allocations into pinned memory.
