When one of the kernels triggers a bug that corrupts GPU state, all follow-on computations on that GPU are broken.
E.g.:
torch.AcceleratorError: CUDA error: an illegal memory access was encountered
After that, even a basic torch computation raises the same error, so it does not seem possible to recover within the same Python process.
Have you encountered this? @simonguozirui do you know how to overcome such an error? (I am wondering whether the eval pipeline needs to be rewritten for this case, e.g. driven by bash scripts, so that each eval runs in an independent Python process.)
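One possible shape for the process-isolation idea, as a minimal sketch: launch each kernel eval in a fresh child interpreter via `subprocess`, so a sticky CUDA error (or a hard crash/abort) only kills that child and the parent pipeline keeps going. Everything here is hypothetical scaffolding, not the actual eval pipeline — the real eval code would go where the placeholder comment is.

```python
import json
import subprocess
import sys

# Hypothetical child-process eval script. In the real pipeline this
# would import torch, build/run the candidate kernel, and print the
# measured result. An illegal memory access would poison only this
# child process; the parent's CUDA context (if any) is untouched.
EVAL_SNIPPET = """
import json, sys
kernel_id = sys.argv[1]
# placeholder: import torch; run candidate kernel; collect metrics
print(json.dumps({"kernel": kernel_id, "ok": True}))
"""

def eval_in_subprocess(kernel_id, timeout=300):
    """Run one kernel eval in a fresh Python interpreter."""
    proc = subprocess.run(
        [sys.executable, "-c", EVAL_SNIPPET, kernel_id],
        capture_output=True, text=True, timeout=timeout,
    )
    if proc.returncode != 0:
        # Nonzero exit covers ordinary exceptions as well as hard
        # crashes (aborts, CUDA-induced process deaths).
        return {"kernel": kernel_id, "ok": False,
                "error": proc.stderr.strip()}
    return json.loads(proc.stdout)

if __name__ == "__main__":
    print(eval_in_subprocess("k0"))
```

The same isolation can be had with `multiprocessing.get_context("spawn")` instead of `subprocess` (spawn, not fork — forked children inherit the parent's CUDA state, which defeats the purpose). The bash-script approach would be equivalent: one process per eval, results exchanged through files or stdout.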