resume training with next batch of data #32

Open
aws-zhenguo wants to merge 1 commit into dataloader_seed from resume_training

Conversation

@aws-zhenguo
Collaborator

No description provided.

output = None
stop_trace_step = None

# skip previous batches
Owner

Does Axlearn not already do this?

Collaborator Author

AXLearn provides a flag, save_input_iterator, that is supposed to save the input iterator as part of the checkpoint to achieve the same functionality (otherwise, training always restarts from the beginning of the data). However, it runs into an exception, as commented in their repo here. I encountered the same exception with this flag, so I added the logic to skip batches.
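
For illustration only, the batch-skipping idea can be sketched as below. This is not the code in this PR: the helper name `skip_consumed_batches`, the `resume_step` argument, and the fake batch stream are all hypothetical. It also assumes the input pipeline is deterministic, so that replaying it yields the same batches in the same order.

```python
import itertools
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def skip_consumed_batches(batches: Iterable[T], resume_step: int) -> Iterator[T]:
    """Drop the first `resume_step` batches so training continues with the next one.

    Hypothetical helper: assumes one batch is consumed per training step and
    that the input stream replays in the same order after a restart.
    """
    if resume_step < 0:
        raise ValueError("resume_step must be non-negative")
    # islice advances the underlying iterator lazily, discarding skipped batches.
    return itertools.islice(iter(batches), resume_step, None)


# Hypothetical usage: a checkpoint restored at step 3 of a six-batch stream.
fake_batches = ({"batch_id": i} for i in range(6))
resumed = list(skip_consumed_batches(fake_batches, resume_step=3))
```

Skipping this way still reads and discards every earlier batch, so the resume cost grows with the step count. Restoring a saved iterator state would avoid that cost once the exception is resolved.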

2 participants