feature request: More robust error recovery

Hi there!

The FAQ (latest version) says:
> ### Some runs failed. How can I rerun them?
> If the failed runs were never started, for example, due to grid node failures, you can simply run the “start” experiment step again. It will skip all runs that have already been started. Afterwards, run “fetch” and make reports as usual.
> Lab detects which runs have already been started by checking if the driver.log file exists. So if you have failed runs that were already started, but you want to rerun them anyway, go to their run directories, remove the driver.log files and then run the “start” experiment step again as above.

It would be nice to have an option that makes restarting an experiment idempotent.
That is, running the “start” step again would automatically rerun failed runs while protecting the integrity of the experiment, without the manual deletion of files like `driver.log`.
That would be useful when running Lab on a computing infrastructure where jobs can be preempted by tasks with higher priority. (This is typical on clusters where many of the other tasks are training jobs that are themselves idempotent.)
 
If this is not suitable as the default behaviour, perhaps it could be enabled through an additional option.

I understand that a potential issue is that some runs may just keep failing on every attempt, so reaching idempotence is perhaps more subtle, but it would be a great feature.
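
To make the request more concrete, here is a rough sketch of the kind of behaviour I have in mind, written as a standalone helper rather than as a patch to Lab. The only thing taken from the FAQ is that a run counts as started when `driver.log` exists. Everything else is an assumption for illustration: the function name `reset_unfinished_runs`, the `is_finished` predicate (I don't know which file Lab would treat as proof of a completed run, so the caller supplies it), the `restart-count` bookkeeping file, and the `max_attempts` cap that keeps permanently failing runs from being retried forever.

```python
from pathlib import Path
from typing import Callable, List


def reset_unfinished_runs(
    exp_dir: str,
    is_finished: Callable[[Path], bool],
    max_attempts: int = 3,
) -> List[Path]:
    """Delete driver.log in run dirs that were started but did not finish.

    Hypothetical helper, not part of Lab. `is_finished` is a caller-supplied
    check for a completed run. `restart-count` is an illustrative file that
    records how often a run has been reset, so that runs which keep failing
    are left alone after `max_attempts` resets.
    """
    reset = []
    # sorted() materializes the matches before any file is deleted.
    for driver_log in sorted(Path(exp_dir).rglob("driver.log")):
        run_dir = driver_log.parent
        if is_finished(run_dir):
            continue  # completed runs are never touched
        counter = run_dir / "restart-count"
        attempts = int(counter.read_text()) if counter.exists() else 0
        if attempts >= max_attempts:
            continue  # give up on runs that keep failing
        counter.write_text(str(attempts + 1))
        driver_log.unlink()  # the run now looks "never started" again
        reset.append(run_dir)
    return reset
```

After calling something like this on the experiment directory, running the “start” step again would pick up exactly the runs that were reset, since their `driver.log` files are gone.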

/cc @matgreco @alvaro-torralba

feature request: More robust error recovery #110
