When one of the kernels triggers a bug that corrupts GPU state, all follow-on computations on that GPU are broken.
E.g.:
torch.AcceleratorError: CUDA error: an illegal memory access was encountered
After that, even a basic torch computation raises the same error, so it does not seem possible to recover within the same Python process.
Have you encountered this? @simonguozirui do you know how to overcome such an error? (I am wondering whether the eval pipeline needs to be rewritten for this case, e.g. driven by bash scripts, so that each eval runs in an independent Python process.)
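One possible shape for the process-isolation idea, as a minimal sketch: launch each kernel eval in a fresh child interpreter via `subprocess`, so a sticky CUDA error (or a hard crash/abort) only kills that child and the parent pipeline keeps going. Everything here is hypothetical scaffolding, not the actual eval pipeline — the real eval code would go where the placeholder comment is.

```python
import json
import subprocess
import sys

# Hypothetical child-process eval script. In the real pipeline this
# would import torch, build/run the candidate kernel, and print the
# measured result. An illegal memory access would poison only this
# child process; the parent's CUDA context (if any) is untouched.
EVAL_SNIPPET = """
import json, sys
kernel_id = sys.argv[1]
# placeholder: import torch; run candidate kernel; collect metrics
print(json.dumps({"kernel": kernel_id, "ok": True}))
"""

def eval_in_subprocess(kernel_id, timeout=300):
    """Run one kernel eval in a fresh Python interpreter."""
    proc = subprocess.run(
        [sys.executable, "-c", EVAL_SNIPPET, kernel_id],
        capture_output=True, text=True, timeout=timeout,
    )
    if proc.returncode != 0:
        # Nonzero exit covers ordinary exceptions as well as hard
        # crashes (aborts, CUDA-induced process deaths).
        return {"kernel": kernel_id, "ok": False,
                "error": proc.stderr.strip()}
    return json.loads(proc.stdout)

if __name__ == "__main__":
    print(eval_in_subprocess("k0"))
```

The same isolation can be had with `multiprocessing.get_context("spawn")` instead of `subprocess` (spawn, not fork — forked children inherit the parent's CUDA state, which defeats the purpose). The bash-script approach would be equivalent: one process per eval, results exchanged through files or stdout.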