@IAvecilla IAvecilla commented Dec 23, 2025

This PR changes client arguments, dependencies, and the general cooldown state logic. Summary of the changes:

  • All clients are now potential checkpointers and must be prepared to perform that task. At the end of an epoch, during the Cooldown state, a subset of clients (1/3 of the total clients, capped at SOLANA_MAX_NUM_CHECKPOINTERS, which is currently 16) is deterministically selected as checkpointers using a seeded shuffle algorithm based on the round’s random seed.
  • Cooldown is now an opportunistic state, just like the others. Once a client uploads the entire model to external storage, it sends a transaction that moves the run forward to the next epoch. If no transaction is received, the coordinator waits for the full cooldown_time before moving on.
  • Added cancellation support so that once one client successfully checkpoints, the others can cancel their upload tasks and move on to the next epoch.
  • The --hub-repo and --gcs-bucket flags were removed from the training arguments. Checkpoint destinations are now derived from the coordinator configuration, and there is a single repo or bucket that contains all the checkpoints for the run.
  • Added a new client flag, --skip-checkpoint-upload, for testing purposes. It skips both saving checkpoints locally and uploading them, and should not be used outside testing environments.
  • Updated the google-cloud-storage crate to version 1.6.
  • Added bucket and hub repo permission verification before joining a run to ensure that the client can eventually checkpoint.
  • The run manager now expects the GCP bucket credentials to be present inside the SCRATCH_DIR in order to work correctly when using GCP checkpoints.
  • Updated the psyche book with an explanation of the new cooldown behavior.
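
For illustration, the checkpointer selection described in the first bullet could be sketched roughly as follows. This is a minimal sketch, not the code from this PR: the names num_checkpointers, select_checkpointers, and SplitMix64 are hypothetical, the seed is assumed to be a u64, the PRNG is a stand-in for whatever the coordinator actually uses, and the rounding rule (integer division with a minimum of one checkpointer) is an assumption.

```rust
/// Cap named in the PR description (currently 16).
const SOLANA_MAX_NUM_CHECKPOINTERS: usize = 16;

/// SplitMix64, a small deterministic PRNG used here only as a stand-in
/// (assumption: the real implementation may use a different generator).
struct SplitMix64(u64);

impl SplitMix64 {
    fn next_u64(&mut self) -> u64 {
        self.0 = self.0.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.0;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }
}

/// A third of the clients, capped at SOLANA_MAX_NUM_CHECKPOINTERS.
/// Assumption: integer division, and at least one checkpointer
/// whenever there is at least one client.
fn num_checkpointers(total_clients: usize) -> usize {
    if total_clients == 0 {
        return 0;
    }
    (total_clients / 3).clamp(1, SOLANA_MAX_NUM_CHECKPOINTERS)
}

/// Returns the indices of the clients selected as checkpointers.
/// Every client that runs this with the same seed gets the same answer.
fn select_checkpointers(total_clients: usize, round_seed: u64) -> Vec<usize> {
    let mut indices: Vec<usize> = (0..total_clients).collect();
    let mut rng = SplitMix64(round_seed);

    // Fisher-Yates shuffle driven by the seeded generator.
    // The modulo introduces a tiny bias, which is acceptable for a sketch.
    for i in (1..indices.len()).rev() {
        let j = (rng.next_u64() % (i as u64 + 1)) as usize;
        indices.swap(i, j);
    }

    // Keep only the first `num_checkpointers` entries of the shuffled list.
    indices.truncate(num_checkpointers(total_clients));
    indices
}
```

With 30 clients, select_checkpointers(30, seed) returns 10 distinct client indices, and calling it again with the same seed yields the same set, which is what lets every client agree on the checkpointers without extra communication.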

@IAvecilla IAvecilla self-assigned this Dec 23, 2025
hub_repo
)
let Model::LLM(LLM { checkpoint, .. }) = &self.coordinator_state.model;
if !self.skip_upload_check {
Contributor

maybe we can invert the check and return early

Contributor Author

We have to execute the rest of the code in that function, so I don’t think we can return early there, unless you meant something else?

Comment on lines +211 to +213
if skip_upload {
info!("Skipping checkpoint save and upload (skip_upload flag is set)");
checkpoint_completed.store(true, Ordering::SeqCst);
Contributor

you can return Ok(evals) early and you don't need to have an else branch here
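
A minimal sketch of the suggested shape, assuming the surrounding function returns a Result carrying evals. The names finish_checkpoint, save_and_upload, and the Evals alias are hypothetical stand-ins, and eprintln! replaces the info! macro so the example has no dependencies:

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// Hypothetical stand-in for the evaluation results the real function returns.
type Evals = Vec<String>;

/// Hypothetical stand-in for the real save-and-upload step.
fn save_and_upload() -> Result<(), String> {
    Ok(())
}

fn finish_checkpoint(
    skip_upload: bool,
    checkpoint_completed: &AtomicBool,
    evals: Evals,
) -> Result<Evals, String> {
    if skip_upload {
        eprintln!("Skipping checkpoint save and upload (skip_upload flag is set)");
        checkpoint_completed.store(true, Ordering::SeqCst);
        // Returning here removes the need for an else branch below.
        return Ok(evals);
    }

    // Normal path: no extra nesting because the skip case already returned.
    save_and_upload()?;
    checkpoint_completed.store(true, Ordering::SeqCst);
    Ok(evals)
}
```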

Comment on lines 16 to 17
//#[error("GCS operation failed: {0}")]
//GcsStorage(#[from] google_cloud_storage::client::Error),
Contributor

should we delete this?

@IAvecilla IAvecilla marked this pull request as ready for review January 23, 2026 19:53