
[torch.AcceleratorError] causes entire benchmark to crash #76

@ai-nikolai

Description


When one of the kernels triggers a bug that corrupts GPU state, all follow-on computations on the GPU are broken.

E.g.:

torch.AcceleratorError: CUDA error: an illegal memory access was encountered

After that, even a basic torch calculation raises the same error, so it does not seem possible to recover from this within the same Python process (or so it appears, at least).

Have you encountered this? @simonguozirui, do you know how to overcome such an error? (I am wondering whether the eval pipeline needs to be rewritten for this case, e.g. driven by bash scripts so that each evaluation runs in an independent Python process. A sketch of a pure-Python alternative is below.)
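For reference, one way this could be worked around without a bash wrapper is to run each kernel evaluation in a freshly spawned subprocess: the child creates its own CUDA context, so a poisoned context dies with the child and the parent keeps a clean GPU. A minimal sketch, assuming the evaluation can be expressed as a picklable top-level callable (`run_isolated` and `_worker` are illustrative names, not part of the current eval code):

```python
import multiprocessing as mp
import traceback

def _worker(fn, args, queue):
    # Runs in a freshly spawned process with its own CUDA context,
    # so an illegal memory access only poisons this child.
    try:
        queue.put(("ok", fn(*args)))
    except Exception:
        queue.put(("error", traceback.format_exc()))

def run_isolated(fn, *args, timeout=300):
    """Run fn(*args) in a 'spawn' subprocess; return (status, payload)."""
    ctx = mp.get_context("spawn")  # 'fork' would inherit the parent's CUDA state
    queue = ctx.Queue()
    proc = ctx.Process(target=_worker, args=(fn, args, queue))
    proc.start()
    proc.join(timeout)
    if proc.is_alive():            # e.g. a hung or deadlocked kernel
        proc.kill()
        proc.join()
        return ("timeout", None)
    if not queue.empty():
        return queue.get()
    return ("crashed", f"exit code {proc.exitcode}")  # hard crash, no result
```

The parent loop would then record a `("error", ...)`, `("timeout", ...)`, or `("crashed", ...)` result for that kernel and move on to the next one; `fn`, its arguments, and its return value all need to be picklable for this to work with the `spawn` start method.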
