Skip to content

fix: address multiple correctness and safety bugs#1

Open
Ridwannurudeen wants to merge 1 commit intoPluralisResearch:mainfrom
Ridwannurudeen:fix/correctness-and-safety-bugs
Open

fix: address multiple correctness and safety bugs#1
Ridwannurudeen wants to merge 1 commit intoPluralisResearch:mainfrom
Ridwannurudeen:fix/correctness-and-safety-bugs

Conversation

@Ridwannurudeen
Copy link

@Ridwannurudeen Ridwannurudeen commented Feb 7, 2026

Summary

  • Fix AttributeError in load_model(): self.models does not exist on DilocoSetup, changed to self.model which is the actual attribute
  • Fix broken argparse boolean flags: type=bool silently breaks CLI — bool("False") evaluates to True in Python. Replaced with a proper string-to-bool converter across both trainer scripts (affects --cosine_anneal, --deterministic, --adaptive_momentum, --sparta_nesterov, --sparta_adaptive_momentum, --buffer_to_cpu)
  • Guard wandb.finish() crash: _cleanup() called wandb.finish() unconditionally on rank 0, even when wandb was never initialized (no wandb_project set). Now checks self.config.wandb_project first
  • Add weights_only to torch.load calls: Resolves PyTorch >=2.6 deprecation warnings and improves checkpoint loading safety
  • Use collections.deque in threadsafe_queue.py: Replaces list.pop(0) (O(n)) with deque.popleft() (O(1)) in the pipeline communication hot path
  • Fix duplicate wandb in requirements.txt and add missing pyarrow dependency (imported in data_utils.py but not listed)
  • Add missing __init__.py files for asyncpp/optim/, asyncpp/runtime/, and examples/models/ to ensure proper Python package resolution

Motivation

Found these issues while reading through the codebase after the paper release. The load_model bug and argparse boolean bug are correctness issues that would cause runtime failures. The remaining fixes improve safety, performance, and packaging hygiene.

Test plan

  • Verified load_model references the correct attribute (self.model)
  • Verified argparse boolean parsing: --deterministic False now correctly evaluates to False
  • Verified _cleanup() no longer crashes when wandb_project is None
  • Confirmed deque.popleft() maintains the same FIFO semantics as list.pop(0)
  • Confirmed pyarrow is imported in examples/data_utils.py (lines 5-6)

- Fix AttributeError in load_model(): self.models does not exist, changed to self.model
- Fix argparse type=bool which silently breaks CLI flags (bool("False") is True);
  replaced with a proper string-to-bool converter in both trainer scripts
- Guard wandb.finish() to only run when wandb was actually initialized,
  preventing crash when wandb_project is not set
- Add weights_only parameter to torch.load calls to resolve PyTorch
  deprecation warnings and improve checkpoint loading safety
- Replace list with collections.deque in threadsafe_queue.py for O(1)
  popleft instead of O(n) list.pop(0) in the pipeline communication hot path
- Fix duplicate wandb entry in requirements.txt and add missing pyarrow dependency
- Add missing __init__.py files for asyncpp/optim/, asyncpp/runtime/,
  and examples/models/ to ensure proper Python package resolution
@Ridwannurudeen Ridwannurudeen force-pushed the fix/correctness-and-safety-bugs branch from ca6e6b2 to b83c1c5 Compare February 9, 2026 18:03
@Ridwannurudeen
Copy link
Author

Hi! Friendly ping on this PR — fixes multiple correctness and safety bugs identified during a code review. Happy to walk through the changes or make adjustments if needed. Let me know if this repo is still actively maintained.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant