Description
Hi,
Thanks for making the code available. I recently encountered error during finetuning Singularity-Temporal for my own dataset. While the finetuning experiment was successful for my trial experiment with a subset of the dataset, it failed at ~epoch 6 on the full-fledged dataset without an informative error message report (the batch size was same in both the experiments).
WARNING:torch.distributed.elastic.agent.server.api:Received 1 death signal, shutting down workers
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 3104856 closing signal SIGHUP
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 3104857 closing signal SIGHUP
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 3104858 closing signal SIGHUP
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 3104859 closing signal SIGHUP
ERROR:torch.distributed.elastic.multiprocessing.errors.error_handler:{ "message": { "message": "SignalException: Process 3104850 got signal: 1",
This seems to be a GPU memory leak.
Adding del question_input, image, answer_input at the end of the training and evaluation loops in vqa.py resolved the issue for me; a sketch of the workaround is below.
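A minimal sketch of where the del statement goes, assuming a typical PyTorch training loop; the loop structure, batch unpacking, and loss computation shown here are assumptions and do not reproduce the actual vqa.py code.

import torch

def train_one_epoch(model, data_loader, optimizer, device):
    """Illustrative loop only; batch contents follow the variable names from the issue."""
    model.train()
    for image, question_input, answer_input in data_loader:
        image = image.to(device, non_blocking=True)

        loss = model(image, question_input, answer_input)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        # Workaround: explicitly drop references to the batch tensors so the
        # allocator can reclaim their GPU memory before the next iteration.
        del question_input, image, answer_input

The same del is added at the end of the evaluation loop.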
PS: I haven't tried reproducing this on the reported datasets, only on my custom dataset. I'm posting the issue in case anyone else is in the same boat.
Thanks!
I.