HG - Hardware-specific sharding test and instructions with torchrun#1479
Draft
Conversation
…del) and changed allowed patterns to allow for the json index file
…projs are sharded in fs2)
…d references; new hg doc added
Contributor
Author
I cleaned this up a bit. Things had been moved to models/utils/ to avoid name conflicts with 'hg'; everything now lives in hg alongside the module rename. Safe serialization has been removed. Tested and working with gpt2, Qwen2.5 Omni, and gemma2.
Design for a parallel training comparison system that:
- Extracts identical batches from the fairseq2 data pipeline
- Runs fairseq2 and Unsloth training in parallel on separate GPUs
- Matches all hyperparameters (AdamW, cosine LR, bf16, grad clip)
- Compares checkpoints with np.allclose for convergence validation

Detailed step-by-step plan with:
- Data extraction script (fairseq2 pipeline to disk)
- Unsloth training script (matched hyperparameters)
- Checkpoint comparison script (np.allclose validation)
- Bash orchestration for parallel execution
- Config updates and documentation
- Extracts batches from fairseq2's SFT data pipeline
- Saves input_ids, attention_mask, labels to disk
- Uses static batching (batch_size=1) for reproducibility
- Only rank 0 saves batches in a distributed setting
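The save step above can be sketched roughly as follows; the function name, `.npz` file layout, and directory convention are illustrative assumptions, not the PR's actual code:

```python
import numpy as np

def save_batch(out_dir, step, input_ids, attention_mask, labels, rank=0):
    """Persist one extracted batch to disk.

    Only rank 0 writes in a distributed run; other ranks return None.
    """
    if rank != 0:
        return None
    path = f"{out_dir}/batch_{step:06d}.npz"
    np.savez(path, input_ids=input_ids,
             attention_mask=attention_mask, labels=labels)
    return path
```

Saving the three tensors per batch is enough for both trainers to consume byte-identical inputs later.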
- Uses pre-extracted batches from the fairseq2 pipeline
- Matches optimizer: AdamW with lr=3e-4, betas=(0.9, 0.95), weight_decay=0.1
- Matches LR schedule: CosineAnnealing with eta_min=6e-5
- Matches training: bf16, grad_clip=1.0, batch_size=1
- Supports distributed training with DDP
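The LR schedule being matched follows the standard cosine-annealing formula; a minimal sketch with the hyperparameters listed above (base lr=3e-4, eta_min=6e-5):

```python
import math

def cosine_lr(step, total_steps, base_lr=3e-4, eta_min=6e-5):
    """Cosine annealing: decays from base_lr at step 0 to eta_min at total_steps."""
    return eta_min + 0.5 * (base_lr - eta_min) * (
        1 + math.cos(math.pi * step / total_steps))
```

Both trainers must evaluate this at the same step granularity, otherwise per-step LRs (and hence checkpoints) drift apart even with identical data.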
- Loads fairseq2 and Unsloth checkpoints
- Maps parameter names (_wrapped_hf_model prefix handling)
- Uses np.allclose with bf16-appropriate tolerances
- Reports detailed mismatch information (max/mean/rel diff)
- Returns exit code 0 for success, 1 for failure
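The comparison logic can be sketched as below, assuming the two checkpoints are already loaded as name-to-array dicts; the exact tolerances and the prefix-stripping convention are illustrative, not copied from the PR:

```python
import numpy as np

def compare_state_dicts(a, b, rtol=1e-2, atol=1e-3,
                        strip_prefix="_wrapped_hf_model."):
    """Compare two checkpoints with loose, bf16-appropriate tolerances.

    Returns (ok, report): report lists per-tensor max/mean/relative diffs
    for mismatched tensors. Prefix stripping aligns wrapped vs. plain names.
    """
    a = {k.removeprefix(strip_prefix): np.asarray(v, dtype=np.float64)
         for k, v in a.items()}
    b = {k.removeprefix(strip_prefix): np.asarray(v, dtype=np.float64)
         for k, v in b.items()}
    if a.keys() != b.keys():
        return False, [f"key mismatch: {sorted(a.keys() ^ b.keys())}"]
    ok, report = True, []
    for name in sorted(a):
        if not np.allclose(a[name], b[name], rtol=rtol, atol=atol):
            ok = False
            diff = np.abs(a[name] - b[name])
            rel = diff / (np.abs(b[name]) + 1e-12)
            report.append(f"{name}: max={diff.max():.3e} "
                          f"mean={diff.mean():.3e} rel={rel.max():.3e}")
    return ok, report
```

A driver script would then `sys.exit(0 if ok else 1)` so the bash orchestration can branch on the result.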
- Sets up directories for fairseq2, Unsloth, extracted data
- Extracts dataset batches (or skips with --skip-extraction)
- Runs fairseq2 and Unsloth training in parallel on separate GPUs
- Compares checkpoints and reports results
- Returns exit code 0 for success, 1 for failure

- Set num_steps to 100 for a quick convergence test
- Checkpoint only at step 100
- Keep only the last checkpoint to save space

- Usage instructions for each script
- Architecture overview
- Troubleshooting guide
- Add enable_gradient_checkpointing parameter to HgCausalLMAdapter
- Call gradient_checkpointing_enable() on HF model when enabled
- Throw RuntimeError with diagnostics if checkpointing fails
- Add enable_gradient_checkpointing field to HuggingFaceModelConfig
- Update wrap_hg_model_if_causal_lm to pass checkpointing flag from config
Usage:

  model:
    config_overrides:
      enable_gradient_checkpointing: true
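The adapter change described above could look roughly like this; `HgCausalLMAdapter` internals are not shown in the PR, so this helper is a hypothetical sketch around the real HF `gradient_checkpointing_enable()` call:

```python
def enable_checkpointing(hf_model, enabled: bool) -> None:
    """Turn on HF gradient checkpointing, raising RuntimeError with
    diagnostics if the underlying model cannot support it."""
    if not enabled:
        return
    try:
        # Real HF models expose gradient_checkpointing_enable(); a missing
        # method or an internal failure both surface as RuntimeError below.
        hf_model.gradient_checkpointing_enable()
    except Exception as ex:
        raise RuntimeError(
            f"Failed to enable gradient checkpointing on "
            f"{type(hf_model).__name__}: {ex}") from ex
```

Failing loudly here is deliberate: silently training without checkpointing would change the memory profile the config promised.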
- Remove non-existent load_dataset import
- Add LMSFTDataSource import
- Create LMSFTDataset directly instead of using load_dataset
- Remove unused get_world_size import

- extract_batches_slurm.sh: Extract batches on a Slurm cluster
- run_convergence_slurm.sh: Run full convergence test on Slurm (8 GPUs)
- Proper output/error path configuration for Slurm
- Resource allocation: 8 GPUs, 32 CPUs, 256GB RAM

- Replace get_rank() function calls with gang.rank property access
- get_rank() doesn't exist as a standalone function in fairseq2.gang
- All fairseq2 code accesses rank via the gang.rank property

- Replace setup_default_gang() with get_default_gangs()
- setup_default_gang() doesn't exist in fairseq2.gang
- get_default_gangs() returns a Gangs object with all parallel gangs
- Update gang.rank references to gangs.root.rank
- Remove custom gangs construction as get_default_gangs() provides the proper object

- Add sys.path manipulation to find the recipes module
- Activate .venv in both Slurm batch scripts
- Scripts need the venv activated to have torch and other dependencies
- Ensures the recipes module is importable from the scripts directory

- load_tokenizer() doesn't accept a family parameter
- Family is automatically determined from the asset card
- Update run_convergence_slurm.sh to use the miniconda environment

- Replace load_tokenizer with load_hg_tokenizer_simple
- load_hg_tokenizer_simple can load tokenizers directly from HuggingFace
- Fixes TokenizerNotKnownError for google/gemma-3-1b-it
- Tokenizer not registered in the fairseq2 asset hub; load from HF instead
- DataPipelineReader returns list[Batch], not a single Batch
- Even with num_accumulate=1, it returns a list of length 1
- Extract the first element from the batches list before processing
- Fixes AttributeError: 'list' object has no attribute 'as_auto_regressive'
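The fix amounts to unwrapping the singleton list the reader yields; a minimal sketch (the helper name is made up for illustration):

```python
def unwrap_batch(batches):
    """DataPipelineReader yields list[Batch] even with num_accumulate=1;
    take the single element before calling batch-level methods."""
    if isinstance(batches, list):
        if len(batches) != 1:
            raise ValueError(f"expected a single batch, got {len(batches)}")
        return batches[0]
    return batches
```

Guarding the length makes a later change to num_accumulate fail loudly instead of silently dropping batches.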
- Add --output and --error directives to avoid permission denied
- Create slurm_logs directory for job output
- Slurm needs explicit output paths instead of system defaults
- Fixes 'cannot create directory /var/spool/slurmd/out' error

- Change relative paths (./slurm_logs) to absolute paths
- Slurm resolves relative paths from /var/spool/slurmd/, not the job directory
- Fixes 'cannot create directory /var/spool/slurmd/slurm_logs' error
- Both run_comparison.sh and extract_batches_slurm.sh now use absolute paths

- Slurm jobs start in /var/spool/slurmd/ by default
- Script was trying to create directories relative to the wrong location
- Explicit cd ensures all paths resolve correctly
- Fixes persistent 'cannot create directory /var/spool/slurmd/slurm_logs' error

- Replace dynamic BASH_SOURCE calculation with hardcoded paths
- When run through Slurm, BASH_SOURCE[0] points to /var/spool/slurmd/
- This caused PROJECT_ROOT to be calculated as /var/spool/
- Now explicitly set PROJECT_ROOT and SCRIPT_DIR to absolute paths
- Fixes persistent mkdir permission denied error
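The Slurm-safe path handling described in the commits above can be sketched as a batch-script header; the concrete paths are hypothetical placeholders, and the `$PWD` fallback exists only to keep the sketch runnable outside a cluster (the real scripts hardcode an absolute checkout path):

```shell
#!/bin/bash
# Hypothetical absolute log paths; Slurm cannot write to its spool defaults.
#SBATCH --output=/absolute/path/slurm_logs/%j.out
#SBATCH --error=/absolute/path/slurm_logs/%j.err

# Slurm starts jobs in /var/spool/slurmd/, so relative paths and
# BASH_SOURCE[0]-derived roots resolve to the wrong place. Hardcode the
# project root instead of computing it dynamically.
PROJECT_ROOT="${PROJECT_ROOT:-$PWD}"
cd "$PROJECT_ROOT"
mkdir -p "$PROJECT_ROOT/slurm_logs"
```

`%j` in the `--output`/`--error` directives expands to the Slurm job ID, giving one log pair per job.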
- Replace the 'fairseq2' command with 'python -m recipes.lm.sft'
- fairseq2 CLI not installed; run the recipe as a Python module instead
- Check whether unsloth is installed before trying to run it
- Make Unsloth training and comparison conditional
- Skip comparison if Unsloth is not available
- Fixes exit code 127 (command not found)
- Handles a missing unsloth module gracefully
…ormers in parallel. Convergence confirmed (FlashAttention disabled for determinism). Documentation provided.
What does this PR do? Please describe:
When more than one GPU is present, model sharding can be tested with:

  torchrun --nproc-per-node 8 test_hg_factory.py

Does your PR introduce any breaking changes? If yes, please list them:
None that I am aware of.
Check list: