Description
Hi there!
The FAQ (latest version) says:
Some runs failed. How can I rerun them?
If the failed runs were never started, for example, due to grid node failures, you can simply run the “start” experiment step again. It will skip all runs that have already been started. Afterwards, run “fetch” and make reports as usual.
Lab detects which runs have already been started by checking if the driver.log file exists. So if you have failed runs that were already started, but you want to rerun them anyway, go to their run directories, remove the driver.log files and then run the “start” experiment step again as above.
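For reference, the manual step the FAQ describes could be scripted roughly as below. This is only a sketch: `reset_started_runs` and the `should_rerun` predicate are names I made up, not part of Lab, and the only assumption taken from the FAQ is that each started run directory contains a `driver.log` file. How to decide that a run has failed is left to the caller.

```python
from pathlib import Path
from typing import Callable, List


def reset_started_runs(
    exp_dir: Path, should_rerun: Callable[[Path], bool]
) -> List[Path]:
    """Delete driver.log in every run directory selected by should_rerun.

    Hypothetical helper, not part of Lab. It relies only on the FAQ's
    statement that a started run is recognised by its driver.log file.
    """
    reset = []
    # Materialise the matches first so deleting files does not
    # interfere with the directory walk.
    for log in sorted(exp_dir.rglob("driver.log")):
        run_dir = log.parent
        if should_rerun(run_dir):
            log.unlink()
            reset.append(run_dir)
    return reset
```

After calling this on the experiment directory, running the "start" step again should pick up the selected runs, since their `driver.log` files are gone.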
It would be nice to have an option that makes restarting an experiment idempotent.
That is, restarting a failed run would automatically preserve the integrity of the experiment, without having to manually delete files like driver.log.
That would be useful when using Lab on a computing infrastructure where jobs can be preempted to make room for a higher-priority task. (This is typical where many of the other tasks are training jobs that are themselves idempotent.)
If that were not convenient as the default behaviour, perhaps this behaviour could be enabled by some additional option.
I understand a potential issue is that some runs can just keep failing, so perhaps reaching idempotence is more subtle, but it'd be a great feature.
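One way to keep runs from failing forever would be to cap the number of restarts per run. Here is a minimal sketch of that idea, assuming a per-run counter file; the `attempts` file name, `MAX_ATTEMPTS` and `should_restart` are all hypothetical and nothing Lab provides today. It is meant to be called only for runs already known to be unfinished.

```python
from pathlib import Path

MAX_ATTEMPTS = 3  # hypothetical default, not a Lab setting


def should_restart(run_dir: Path, max_attempts: int = MAX_ATTEMPTS) -> bool:
    """Return True and record the attempt if the run may be started again.

    The "attempts" counter file is an assumption made for this sketch;
    Lab does not create it.
    """
    counter = run_dir / "attempts"
    attempts = int(counter.read_text()) if counter.exists() else 0
    if attempts >= max_attempts:
        # The run keeps failing: leave it alone so it shows up as failed.
        return False
    counter.write_text(str(attempts + 1))
    return True
```

With a cap like this, a preempted run would be retried transparently, while a run that fails every time would stop being restarted after a few attempts.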