-
Notifications
You must be signed in to change notification settings - Fork 86
Cooldown state rework #455
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: gcs-2
Are you sure you want to change the base?
Conversation
| hub_repo | ||
| ) | ||
| let Model::LLM(LLM { checkpoint, .. }) = &self.coordinator_state.model; | ||
| if !self.skip_upload_check { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe we can invert the check and return early
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have to execute the rest of the code in that function, so I don’t think we can return early there, unless you meant something else?
| if skip_upload { | ||
| info!("Skipping checkpoint save and upload (skip_upload flag is set)"); | ||
| checkpoint_completed.store(true, Ordering::SeqCst); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you can return Ok(evals) early and you don't need to have an else branch here
shared/data-provider/src/errors.rs
Outdated
| //#[error("GCS operation failed: {0}")] | ||
| //GcsStorage(#[from] google_cloud_storage::client::Error), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we delete this?
This PR contains several changes to client arguments, dependencies, and the general cooldown state logic. Here’s a summary of the changes made in this PR:
SOLANA_MAX_NUM_CHECKPOINTERS, which is currently 16) is deterministically selected as checkpointers using a seeded shuffle algorithm based on the round’s random seed.cooldown_timebefore moving on.--hub-repoand--gcs-bucketflags were removed from the training arguments. Checkpoint destinations are now derived from the coordinator configuration, and there is a single repo or bucket that contains all the checkpoints for the run.--skip-checkpoint-upload, for testing purposes. This avoids uploading and saving checkpoints locally and should not be used outside testing environments.google-cloud-storagecrate to version 1.6.SCRATCH_DIRin order to work correctly when using GCP checkpoints.