Cooldown state rework #455
Open: IAvecilla wants to merge 93 commits into gcs-2 from cooldown-rework
93 commits
0836df8 Make cooldown opportunistic (IAvecilla)
80550e6 Fix some things (IAvecilla)
be24bda Fix compilation (IAvecilla)
20ac8a2 End epoch after checkpointer (IAvecilla)
514c993 Make hub_repo checkpoint mandatory (IAvecilla)
3553f7d Use client as checkpointer and pick random (IAvecilla)
00b97f1 Reuse opportunistic logic (IAvecilla)
0566a4a Add anyhow to cargo lock (IAvecilla)
203d0c4 Fix opportunistic cooldown message (IAvecilla)
e8b1dd8 Add random index for client (IAvecilla)
0b6e9f5 Remove rand and get checkpointers at cooldown state (IAvecilla)
c001fa5 Rework on cooldown to select pseudo random (IAvecilla)
ef3b468 Fix clippy (IAvecilla)
8d8c3f4 Add support for multiple checkpointers (IAvecilla)
410c192 Fix lint (IAvecilla)
3ee7cb6 Update nano config (IAvecilla)
2b07c7d Remove send_checkpoint function from backend (IAvecilla)
7fb8add Reduce total amount of checkpointers (IAvecilla)
e7421ed Remove check for permissions on hf repo (IAvecilla)
a215881 Merge branch 'main' into cooldown-rework (IAvecilla)
a66f089 Fix compilation error after merge (IAvecilla)
5d99bf0 Use convert function only on python trainer (IAvecilla)
2f77f46 Add conditional import for python feature (IAvecilla)
284ae5d Fix decentralized integration test config and entrypoints (IAvecilla)
ba70ece fix bug and remove debug prints
f82e387 remove unnecesary code
10e81f3 Add GCS checkpoint variant (IAvecilla)
0cb4ba0 Merge branch 'gcs-2' into gcs-upload-model (IAvecilla)
9fcf808 polish nits on coordinator code
9209a86 Merge branch 'main' into cooldown-rework
e893c7c Fix convert call (IAvecilla)
df99802 Refactor on model download and upload (IAvecilla)
a41c457 Fix import with python feature (IAvecilla)
b19bc90 Remove vllm from nix (IAvecilla)
f02fd59 Merge branch 'gcs-upload-model' into cooldown-rework (IAvecilla)
9daad93 Add cancellation process after one client checkpoints (IAvecilla)
4511ae4 Fix cancellation for upload task (IAvecilla)
64ace6f General refactor and cleanup for new checkpointers logic (IAvecilla)
d19df41 Remove comments (IAvecilla)
86ba846 Remove hub-repo and gcs-bucket from train args (IAvecilla)
4482623 Fix prefix (IAvecilla)
5ea5564 Fix tcp example (IAvecilla)
8da9c66 Calculate checkpointers instead of adding new config (IAvecilla)
6f7aaf2 fixing centralized tests (IAvecilla)
860f50d Merge branch 'gcs-2' into cooldown-rework (IAvecilla)
3471706 Fix tcp example compilation (IAvecilla)
e8a4e50 Fix centralized tests avoiding uploading checks (IAvecilla)
1f545f9 Add test mode cli arg for training (IAvecilla)
5706ab7 Fix flag (IAvecilla)
5f0edb4 Lower cooldown time for centralized tests (IAvecilla)
8b0608e Merge branch 'cooldown-rework' into remove-checkpointers-config (IAvecilla)
d53ff7b update gcp crate to 1.5.x version (dsocolobsky)
978b63d Merge branch 'cooldown-rework' into dy/gcp-new-version (dsocolobsky)
15b2d5e Add docs with new cooldown behavior (IAvecilla)
634d926 Merge branch 'cooldown-rework' into dy/gcp-new-version (entropidelic)
172293d Fix extra docs (IAvecilla)
d2ca5cb Merge branch 'cooldown-rework' into dy/gcp-new-version (dsocolobsky)
90339fe Merge branch 'gcs-2' into cooldown-rework (IAvecilla)
6254788 fix google cloud storage code (dsocolobsky)
64f683b Merge branch 'cooldown-rework' into dy/gcp-new-version (dsocolobsky)
3bf0e42 Merge branch 'cooldown-rework' into remove-checkpoint-repo-arg (IAvecilla)
5355592 Remove hub-repo flag from test (IAvecilla)
e99a049 Merge branch 'gcs-2' into cooldown-rework (IAvecilla)
cb5b265 Add check for permissions before joining the run (IAvecilla)
1b0c1a6 Remove `--hub-repo` and `--gcs-bucket`from train args (#490) (IAvecilla)
f155690 add GCS credentials to scratch dir used in run manager
329c604 Merge branch 'cooldown-rework' into run-manager/gcs-credentials (entropidelic)
f361b93 Remove google-cloud-storage from coordinator toml (IAvecilla)
611f322 Remove hub repo arguments in test (IAvecilla)
dd55838 Merge branch 'cooldown-rework' into run-manager/gcs-credentials (entropidelic)
2809135 Update to version 1.6 (IAvecilla)
089ca65 Fix extra args in script (IAvecilla)
2402b4e Remove sanity check for permissions (IAvecilla)
ad5ba5e Merge branch 'cooldown-rework' into dy/gcp-new-version (IAvecilla)
6983523 Check bucket and repo permissions before joining run (IAvecilla)
de7e19d Fix centralized config (IAvecilla)
95c8a83 Add new NoUpload to avoid checkpointing in tests (IAvecilla)
84c02b3 Merge branch 'dy/gcp-new-version' into cooldown-rework (IAvecilla)
49669ff Fix train solana test script (IAvecilla)
e048b25 CooldownStep.checkpoint_complete (pefontana)
581d87f Fix crash after credential errors (IAvecilla)
5ff3184 Update only safetensors for cehckpointing (IAvecilla)
a8444b2 Refactor hub repo upload (IAvecilla)
b6802b5 Uncomment docker cleanup (IAvecilla)
da2fab6 Remove NoCheckpoint variant and use a client flag instead (IAvecilla)
d6a1201 Send cooldown witness on skip checkpoint (IAvecilla)
816ba5b Skip local save with skip upload flag (IAvecilla)
3b83b2c Merge branch 'cooldown-rework' into run-manager/gcs-credentials (IAvecilla)
c6016b7 Remove comment and add early return on skip upload check (IAvecilla)
98c1abe Fix centralized permission check to upload (IAvecilla)
7d8bf2c Fix cooldown checks and update comment on test flag (IAvecilla)
fb4810d Merge branch 'gcs-2' into cooldown-rework (IAvecilla)
0134d01 Merge branch 'gcs-2' into cooldown-rework (IAvecilla)
maybe we can invert the check and return early
We have to execute the rest of the code in that function, so I don’t think we can return early there, unless you meant something else?
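To make the trade-off in this exchange concrete, here is a minimal Rust sketch. The function under review is not shown on this page, so every name below (`CooldownLog`, `skip_upload`, `finish_*`, `maybe_upload`) is hypothetical and only illustrates the general pattern: a guard-clause early return is safe only when nothing after the guarded block still has to run, and one way to get the flatter shape anyway is to move the optional part into a helper that returns early on its own.

```rust
/// Hypothetical record of what a cooldown step did; not a type from the PR.
#[derive(Debug, Default, PartialEq)]
struct CooldownLog {
    uploaded: bool,
    witness_sent: bool,
}

/// Nested form: the upload is optional, the trailing step always runs.
fn finish_nested(skip_upload: bool) -> CooldownLog {
    let mut log = CooldownLog::default();
    if !skip_upload {
        log.uploaded = true; // placeholder for the real upload work
    }
    log.witness_sent = true; // must happen on both paths
    log
}

/// Inverted check with an early return in the same function:
/// flatter, but the trailing step is skipped when `skip_upload` is set.
fn finish_early_return_naive(skip_upload: bool) -> CooldownLog {
    let mut log = CooldownLog::default();
    if skip_upload {
        return log; // the trailing step below never runs on this path
    }
    log.uploaded = true;
    log.witness_sent = true;
    log
}

/// Helper that owns the optional part and can return early safely.
fn maybe_upload(skip_upload: bool, log: &mut CooldownLog) {
    if skip_upload {
        return;
    }
    log.uploaded = true;
}

/// Caller stays flat and still runs the trailing step unconditionally.
fn finish_with_helper(skip_upload: bool) -> CooldownLog {
    let mut log = CooldownLog::default();
    maybe_upload(skip_upload, &mut log);
    log.witness_sent = true;
    log
}
```

Under these assumptions, `finish_with_helper` behaves like `finish_nested` for both values of `skip_upload`, while `finish_early_return_naive(true)` leaves `witness_sent` unset, which is the kind of regression the reply is pointing at.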