Skip to content

fix: address multiple correctness and safety bugs#3

Open
Ridwannurudeen wants to merge 1 commit intoPluralisResearch:mainfrom
Ridwannurudeen:fix/correctness-and-safety-bugs
Open

fix: address multiple correctness and safety bugs#3
Ridwannurudeen wants to merge 1 commit intoPluralisResearch:mainfrom
Ridwannurudeen:fix/correctness-and-safety-bugs

Conversation

@Ridwannurudeen
Copy link

Summary

  • Remove undefined $dd variable in run.bash: The gpipe baseline command (line 30) references $dd which is never defined, causing the sync baseline to misparse arguments. Removed the stray variable to match the async command structure.
  • Fix unclosed file handles: json.load(open(...)) in both main_with_runtime.py and sync_main.py leaks file descriptors. Replaced with proper with statements.
  • Add weights_only to torch.load calls: Resolves PyTorch >=2.6 FutureWarning deprecation across main_with_runtime.py, sync_main.py, and optim/optimizer.py.
  • Use collections.deque in threadsafe_queue.py: Replaces list.pop(0) (O(n)) with deque.popleft() (O(1)) in the inter-stage communication path.
  • Add missing __init__.py files: optim/ and runtime/ lack __init__.py, relying on implicit namespace packages. Added explicit init files for robust package resolution.
  • Fix argparse description: Changed from "PyTorch ImageNet Training" (leftover from PipeDream) to match the actual language model training use case.

Motivation

Found these while reading through the codebase after the ICML 2025 paper. The $dd bug silently breaks the gpipe baseline script, and the torch.load deprecation will become an error in future PyTorch versions.

Test plan

  • Verified $dd is not defined anywhere in run.bash — confirmed stray variable
  • Confirmed the async basecmdstr (line 13) does not use $dd, so removing it from the sync basecmdstr is consistent
  • Verified deque.popleft() maintains identical FIFO semantics to list.pop(0)
  • Checked that optim/ and runtime/ are imported via sys.path.append — adding __init__.py makes this more robust without breaking existing imports

- Remove undefined $dd variable in run.bash that causes gpipe baseline
  to misparse command-line arguments
- Close file handles properly when loading JSON config (use with-statement
  instead of bare open() in both main_with_runtime.py and sync_main.py)
- Add weights_only parameter to all torch.load calls to resolve PyTorch
  >=2.6 deprecation warnings
- Replace list with collections.deque in threadsafe_queue.py for O(1)
  popleft instead of O(n) list.pop(0) in the pipeline communication path
- Add missing __init__.py files for optim/ and runtime/ to ensure proper
  Python package resolution
- Fix argparse description from "PyTorch ImageNet Training" to match the
  actual language model training use case
@Ridwannurudeen
Copy link
Author

Hi! Friendly ping on this PR — fixes multiple correctness and safety bugs identified during a code review. Happy to walk through the changes or make adjustments if needed. Let me know if this repo is still actively maintained.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant