Collection of information about programming models, particularly implementation details and lower-level behavior. Knowing the lower levels of abstraction is invaluable for debugging and for reasoning about performance.
-
Khronos ICD loader: https://github.com/KhronosGroup/OpenCL-ICD-Loader
It can be useful to build this yourself to debug problems (for example, an OpenCL implementation library not showing up in the list of devices because it silently fails at the dynamic-link stage)
-
OpenCL device simulator, Oclgrind: https://github.com/jrprice/Oclgrind
-
Open source implementation, POCL: http://portablecl.org/
-
SPIR-V is used as an intermediate, device-independent distribution format
See the OpenMP_examples page.
Observations and comments on programming with OpenMP are on the OpenMP_observations page.
- Kokkos core library: https://github.com/kokkos/kokkos
Chapel is a programming language designed for productive parallel programming. It now supports GPU programming.
Machine learning has given rise to a number of models for writing kernels.
- Triton language and compiler: https://github.com/triton-lang/triton
- Not to be confused with NVIDIA's Triton Inference Server
- Apache TVM. Its TensorIR is at the semantic level closest to Triton. A nice tutorial is available from the Machine Learning Compilation course.
My own repository for samples is here: https://github.com/markdewing/qmc_kernels
The vector_add kernel has the most implementations, being the simplest example that actually does something: https://github.com/markdewing/qmc_kernels/tree/master/kernels/vector_add