
Probable GPU Memory Leak In Finetuning #30

@israwal

Hi,

Thanks for making the code available. I recently ran into an error while finetuning Singularity-Temporal on my own dataset. A trial experiment on a subset of the dataset finished successfully, but the run on the full dataset failed at around epoch 6 without an informative error message (the batch size was the same in both experiments):
```
WARNING:torch.distributed.elastic.agent.server.api:Received 1 death signal, shutting down workers
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 3104856 closing signal SIGHUP
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 3104857 closing signal SIGHUP
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 3104858 closing signal SIGHUP
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 3104859 closing signal SIGHUP
ERROR:torch.distributed.elastic.multiprocessing.errors.error_handler:{ "message": { "message": "SignalException: Process 3104850 got signal: 1",
```
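
Since the workers are killed without a useful traceback, one way to check whether GPU memory is actually growing is to log PyTorch's allocator statistics at the end of every epoch. Below is a minimal sketch; the commented-out epoch loop is hypothetical and not the actual Singularity training script:

```python
import torch

def log_gpu_memory(epoch, device=0):
    # Report currently allocated, reserved, and peak allocated memory in MiB.
    allocated = torch.cuda.memory_allocated(device) / 1024**2
    reserved = torch.cuda.memory_reserved(device) / 1024**2
    peak = torch.cuda.max_memory_allocated(device) / 1024**2
    print(f"epoch {epoch}: allocated={allocated:.0f} MiB, "
          f"reserved={reserved:.0f} MiB, peak={peak:.0f} MiB")

# Hypothetical usage inside the training script:
# for epoch in range(start_epoch, max_epoch):
#     train(...)
#     evaluate(...)
#     log_gpu_memory(epoch)
```

If the allocated figure climbs steadily from epoch to epoch, something is holding on to tensors that should have been freed.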

This seems to be a GPU memory leak. Adding `del question_input, image, answer_input` at the end of the training and evaluation loops in vqa.py resolved the issue for me.
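
For reference, the change has roughly the following shape. Only the deleted variable names come from vqa.py; the surrounding loop, the model call, and the helper names are assumptions for illustration:

```python
import torch

def train_one_epoch(model, train_loader, optimizer, device):
    # Loop structure and the forward signature are assumptions, not the
    # actual vqa.py code; the `del` at the end is the relevant change.
    model.train()
    for image, question_input, answer_input in train_loader:
        image = image.to(device, non_blocking=True)

        loss = model(image, question_input, answer_input, train=True)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        # Drop the references to this batch's tensors so the CUDA caching
        # allocator can reuse their memory instead of keeping the last
        # batch alive across iterations (and across the train/eval switch).
        del question_input, image, answer_input
```

The same `del` at the end of the evaluation loop avoids carrying the last validation batch around between epochs.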

PS: I haven't tried to reproduce this on the datasets reported in the repo, only on my custom dataset. I'm posting the issue in case anyone else runs into the same problem.

Thanks!
I.
