[FEATURE] error book-keeping for cudatrace prog

PR #18 proposes per-process histograms of the errors returned by `cudaMalloc()`/`cudaFree()`.

The goal of this task is to extend that bookkeeping to `cudaLaunchKernel()`. This will take slightly more work than the first case: our code does not currently define or attach any `cudaLaunchKernel()` uretprobes, and a uretprobe is required to read the call's return value and so catch errors.
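
As a starting point for discussion, here is a minimal sketch of what the BPF side of such a uretprobe could look like. It assumes a libbpf-style program built against a generated `vmlinux.h`; the map name `launch_errs`, the key struct `err_key`, and the handler name are all hypothetical and are not taken from PR #18 or the existing code. The only CUDA fact it relies on is that `cudaSuccess` is 0.

```c
// SPDX-License-Identifier: GPL-2.0
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

/* Hypothetical key: one histogram bucket per (process, CUDA error code). */
struct err_key {
	__u32 pid;
	__s32 err; /* cudaError_t value returned by cudaLaunchKernel() */
};

/* Hypothetical map; PR #18 may already define a map that should be reused. */
struct {
	__uint(type, BPF_MAP_TYPE_HASH);
	__uint(max_entries, 10240);
	__type(key, struct err_key);
	__type(value, __u64);
} launch_errs SEC(".maps");

/* Runs when cudaLaunchKernel() returns in a traced process. */
SEC("uretprobe")
int handle_cuda_launch_kernel_ret(struct pt_regs *ctx)
{
	/* cudaError_t is an enum; the value arrives in the return register. */
	int ret = (int)PT_REGS_RC(ctx);
	struct err_key key = {};
	__u64 one = 1;
	__u64 *count;

	if (ret == 0) /* cudaSuccess: nothing to record */
		return 0;

	key.pid = bpf_get_current_pid_tgid() >> 32; /* upper 32 bits hold the tgid */
	key.err = ret;

	count = bpf_map_lookup_elem(&launch_errs, &key);
	if (count) {
		__sync_fetch_and_add(count, 1);
		return 0;
	}

	/* First occurrence of this (pid, err) pair. If another CPU inserts the
	 * same key first, this update fails and one count is dropped; that is
	 * accepted here to keep the sketch short. */
	bpf_map_update_elem(&launch_errs, &key, &one, BPF_NOEXIST);
	return 0;
}

char LICENSE[] SEC("license") = "GPL";
```

User space would still need to attach this program as a return probe on the `cudaLaunchKernel` symbol, for example with libbpf's `bpf_program__attach_uprobe()` and `retprobe` set to `true`. Which binary to attach to depends on where the CUDA runtime is linked in the traced process, so that part is left open here.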