OLCF Hackathon Notes

Welcome to the OLCFHack15 wiki!

BLAS: drop-in library, batching small dgemms

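How to use the drop-in library (NVBLAS, available starting from CUDA 6.0): preload libnvblas.so with LD_PRELOAD (or link it ahead of your CPU BLAS) and point NVBLAS_CONFIG_FILE at an nvblas.conf file that names the CPU BLAS to fall back on, so that large Level-3 BLAS calls such as dgemm are routed to the GPU without source changes.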

How to use batching:
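
As a rough illustration of what a batched dgemm computes, here is a minimal CPU reference in plain C. It is a sketch, not the cuBLAS call itself: dgemm_batched_ref and batched_demo are hypothetical names, transposes are not handled, and everything is column-major. The point is the calling shape: one call receives arrays of per-matrix pointers plus a batch count, which is also how cublasDgemmBatched is organised (there the pointer arrays must live in device memory).

```c
#include <assert.h>

/* Hypothetical CPU reference for a batched dgemm (no transposes):
 *   C[i] = alpha * A[i] * B[i] + beta * C[i]   for i = 0 .. batch_count-1
 * All matrices are column-major; A[i] is m-by-k, B[i] is k-by-n,
 * C[i] is m-by-n. A, B and C are arrays of pointers, one per matrix. */
static void dgemm_batched_ref(int m, int n, int k, double alpha,
                              double *const A[], int lda,
                              double *const B[], int ldb,
                              double beta, double *const C[], int ldc,
                              int batch_count)
{
    /* Leading dimensions must cover the row count of each matrix. */
    assert(lda >= m && ldb >= k && ldc >= m);

    for (int i = 0; i < batch_count; ++i) {
        for (int col = 0; col < n; ++col) {
            for (int row = 0; row < m; ++row) {
                double sum = 0.0;
                for (int p = 0; p < k; ++p)
                    sum += A[i][row + p * lda] * B[i][p + col * ldb];
                C[i][row + col * ldc] =
                    alpha * sum + beta * C[i][row + col * ldc];
            }
        }
    }
}

/* Example: two independent 2x2 products handled by one batched call.
 * The small matrices are packed back to back in one buffer each, and
 * the pointer arrays simply index into those buffers.
 * The 8 results (matrix 0, then matrix 1, column-major) go to out. */
void batched_demo(double out[8])
{
    enum { M = 2, BATCH = 2, ELEMS = M * M };

    /* Matrix 0: [1 2; 3 4] * [5 6; 7 8].  Matrix 1: I * 2I. */
    double a[BATCH * ELEMS] = { 1, 3, 2, 4,   1, 0, 0, 1 };
    double b[BATCH * ELEMS] = { 5, 7, 6, 8,   2, 0, 0, 2 };
    double c[BATCH * ELEMS] = { 0 };

    double *pa[BATCH] = { a, a + ELEMS };
    double *pb[BATCH] = { b, b + ELEMS };
    double *pc[BATCH] = { c, c + ELEMS };

    dgemm_batched_ref(M, M, M, 1.0, pa, M, pb, M, 0.0, pc, M, BATCH);

    for (int i = 0; i < BATCH * ELEMS; ++i)
        out[i] = c[i];
}
```

Calling batched_demo fills out with 19, 43, 22, 50 for the first product and 2, 0, 0, 2 for the second. On the GPU the benefit comes from replacing many tiny kernel launches with a single one, so batching tends to pay off when the individual dgemms are too small to fill the device on their own.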

Speeding up data transfers

PCIe transfers can run at up to about twice the bandwidth when the host buffer is "pinned" (page-locked) memory. Making your OpenACC copies asynchronous, and then waiting on them right away, often causes the runtime to stage the transfers through pinned memory, which will often speed up the code.

      !JL Making these async and then waiting immediately often forces
      !JL the runtime to use "pinned" CPU memory, which can double the
      !JL PCIe bandwidth
      !$acc enter data copyin(a, b) create(c) async(1)
      !$acc host_data use_device(a, b, c)

      call dgemm_acc_openacc_async (1,'n', 'n', m, n, k, alpha, a, m, b, k, beta, c, m)

      !$acc end host_data
      !$acc exit data copyout(c) delete(a, b) async(1)
      !$acc wait(1)

If you're using the PGI compiler, you can also try -ta=tesla:pin to force all memory allocations into pinned memory.
