PR #18 proposes per-process error histograms for errors generated by cudaMalloc()/cudaFree().
The goal of this task is to extend this to cudaLaunchKernel(). This involves slightly more work than the first case, however: our code does not currently define or attach a uretprobe for cudaLaunchKernel(), and one is required to read the call's return value and catch errors.
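
As a rough illustration of the bookkeeping such a uretprobe handler would need, here is a user-space C model. It is a sketch only, not the code from PR #18 and not an actual eBPF program: the names `on_launch_kernel_ret`, `launch_err_count`, `launch_errs` and `MAX_SLOTS` are hypothetical, and the fixed-size array stands in for whatever map the real implementation uses. In a real handler, the pid and return value would come from the probe context rather than from function arguments.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_SLOTS 64      /* hypothetical capacity, stands in for a map size */
#define CUDA_SUCCESS 0    /* cudaSuccess is 0 in the CUDA runtime API */

/* Histogram key: one bucket per (process, error code) pair. */
struct err_key {
    uint32_t pid;
    int32_t err;
};

struct err_slot {
    struct err_key key;
    uint64_t count;
    int used;
};

/* Hypothetical stand-in for the per-process error histogram. */
static struct err_slot launch_errs[MAX_SLOTS];

/*
 * Models what a cudaLaunchKernel() uretprobe handler would do once it has
 * the calling pid and the call's return value: ignore successful launches,
 * otherwise bump the counter for (pid, ret).
 * Returns 0 on success, -1 if the table is full.
 */
int on_launch_kernel_ret(uint32_t pid, int32_t ret)
{
    if (ret == CUDA_SUCCESS)
        return 0;

    /* Slots are filled in order and never freed, so the first unused
     * slot marks the end of the populated region. */
    for (size_t i = 0; i < MAX_SLOTS; i++) {
        struct err_slot *s = &launch_errs[i];

        if (!s->used) {
            s->used = 1;
            s->key.pid = pid;
            s->key.err = ret;
            s->count = 1;
            return 0;
        }
        if (s->key.pid == pid && s->key.err == ret) {
            s->count++;
            return 0;
        }
    }
    return -1;
}

/* Looks up how many times `pid` has seen error `err` (0 if never). */
uint64_t launch_err_count(uint32_t pid, int32_t err)
{
    for (size_t i = 0; i < MAX_SLOTS && launch_errs[i].used; i++) {
        if (launch_errs[i].key.pid == pid && launch_errs[i].key.err == err)
            return launch_errs[i].count;
    }
    return 0;
}
```

Keying on the (pid, error code) pair keeps the histogram per-process, matching what PR #18 proposes for cudaMalloc()/cudaFree(). Whether cudaLaunchKernel() errors should share that histogram or get their own is left open here.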