
HG - Hardware-specific sharding test and instructions with torchrun #1479

Draft

rsyue wants to merge 68 commits into main from rsy/hg-model-hardware-test

Conversation

@rsyue
Contributor

@rsyue rsyue commented Dec 10, 2025

What does this PR do? Please describe:
When more than one GPU is present, model sharding can be tested with `torchrun --nproc-per-node 8 test_hg_factory.py`.
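
For reference, a minimal sketch of the shape such a torchrun-launched check takes (the actual test is test_hg_factory.py; the shard-verification body here is an illustrative placeholder):

```python
# Minimal sketch of a torchrun-launched multi-GPU check; the real test
# is test_hg_factory.py. Launch with:
#   torchrun --nproc-per-node 8 sharding_check.py
import os

import torch
import torch.distributed as dist


def main() -> None:
    # torchrun sets RANK, WORLD_SIZE, and LOCAL_RANK in the environment.
    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

    # Placeholder: load the model here, count locally held parameters,
    # then verify the shards add up to the full model across ranks.
    local_numel = torch.zeros(1, device="cuda")

    dist.all_reduce(local_numel)
    if dist.get_rank() == 0:
        print(f"world_size={dist.get_world_size()}, "
              f"total sharded params={int(local_numel.item())}")

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```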

Does your PR introduce any breaking changes? If yes, please list them:
None that I am aware of.

Check list:

  • Was the content of this PR discussed and approved via a GitHub issue? (no need for typos or documentation improvements)
  • Did you read the contributor guideline?
  • Did you make sure that your PR does only one thing instead of bundling different changes together?
  • Did you make sure to update the documentation with your changes? (if necessary)
  • Did you write any new necessary tests?
  • Did you verify new and existing tests pass locally with your changes?
  • Did you update the CHANGELOG? (no need for typos, documentation, or minor internal changes)

zyaoj and others added 24 commits September 20, 2025 11:48

…del) and changed allowed patterns to allow for the json index file
@meta-cla meta-cla bot added the CLA Signed label Dec 10, 2025
@rsyue
Contributor Author

rsyue commented Feb 17, 2026

I cleaned this up a bit. Things had previously been moved to models/utils/ to avoid name conflicts with 'hg'; everything now lives in hg, alongside a module rename. Safe serialization has been removed. Tested and working with gpt2, Qwen2.5 Omni, and gemma2.

rsyue added 11 commits February 18, 2026 00:01

Design for parallel training comparison system that:
- Extracts identical batches from fairseq2 data pipeline
- Runs fairseq2 and Unsloth training in parallel on separate GPUs
- Matches all hyperparameters (AdamW, cosine LR, bf16, grad clip)
- Compares checkpoints with np.allclose for convergence validation

Detailed step-by-step plan with:
- Data extraction script (fairseq2 pipeline to disk)
- Unsloth training script (matched hyperparameters)
- Checkpoint comparison script (np.allclose validation)
- Bash orchestration for parallel execution
- Config updates and documentation

- Extracts batches from fairseq2's SFT data pipeline
- Saves input_ids, attention_mask, labels to disk
- Uses static batching (batch_size=1) for reproducibility
- Only rank 0 saves batches in distributed setting
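
A hypothetical sketch of that extraction step; `reader` and the batch key names are stand-ins for the fairseq2 pipeline objects, not the script's actual code:

```python
from pathlib import Path

import torch


def extract_batches(reader, out_dir: str, rank: int, num_batches: int) -> None:
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for step, batch in enumerate(reader):
        if step >= num_batches:
            break
        if rank != 0:
            continue  # only rank 0 writes in a distributed setting
        # Persist exactly the tensors the Unsloth side needs.
        torch.save(
            {
                "input_ids": batch["input_ids"].cpu(),
                "attention_mask": batch["attention_mask"].cpu(),
                "labels": batch["labels"].cpu(),
            },
            out / f"batch_{step:05d}.pt",
        )
```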

- Uses pre-extracted batches from fairseq2 pipeline
- Matches optimizer: AdamW with lr=3e-4, betas=(0.9,0.95), weight_decay=0.1
- Matches LR schedule: CosineAnnealing with eta_min=6e-5
- Matches training: bf16, grad_clip=1.0, batch_size=1
- Supports distributed training with DDP
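
A minimal sketch of that matched setup on the Unsloth side, assuming an HF-style `model` and the pre-extracted batches; the hyperparameters are the ones listed above:

```python
import torch
from torch import nn


def train_matched(model: nn.Module, batches, num_steps: int) -> None:
    # Hyperparameters mirror the fairseq2 recipe, per the commit notes.
    optimizer = torch.optim.AdamW(
        model.parameters(), lr=3e-4, betas=(0.9, 0.95), weight_decay=0.1
    )
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
        optimizer, T_max=num_steps, eta_min=6e-5
    )
    for _, batch in zip(range(num_steps), batches):
        with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
            loss = model(**batch).loss  # HF-style causal-LM output
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()
```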

- Loads fairseq2 and Unsloth checkpoints
- Maps parameter names (_wrapped_hf_model prefix handling)
- Uses np.allclose with bf16-appropriate tolerances
- Reports detailed mismatch information (max/mean/rel diff)
- Returns exit code 0 for success, 1 for failure
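
A sketch of that comparison, with the prefix handling and exit codes from the list above; the checkpoint layout and the exact tolerance values are assumptions:

```python
import sys

import numpy as np
import torch

# bf16 carries roughly 3 significant decimal digits, hence loose
# tolerances (these exact values are an assumption, not the PR's).
RTOL, ATOL = 1e-2, 1e-3


def compare(fs2_path: str, unsloth_path: str) -> int:
    fs2 = torch.load(fs2_path, map_location="cpu")
    other = torch.load(unsloth_path, map_location="cpu")
    ok = True
    for name, param in fs2.items():
        # fairseq2 wraps the HF model, so strip the wrapper prefix.
        key = name.removeprefix("_wrapped_hf_model.")
        a = param.float().numpy()
        b = other[key].float().numpy()
        if not np.allclose(a, b, rtol=RTOL, atol=ATOL):
            diff = np.abs(a - b)
            print(f"{name}: max={diff.max():.3e} mean={diff.mean():.3e}")
            ok = False
    return 0 if ok else 1


if __name__ == "__main__":
    sys.exit(compare(sys.argv[1], sys.argv[2]))
```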

- Sets up directories for fairseq2, Unsloth, extracted data
- Extracts dataset batches (or skips with --skip-extraction)
- Runs fairseq2 and Unsloth training in parallel on separate GPUs
- Compares checkpoints and reports results
- Returns exit code 0 for success, 1 for failure

- Set num_steps to 100 for quick convergence test
- Checkpoint only at step 100
- Keep only last checkpoint to save space

- Usage instructions for each script
- Architecture overview
- Troubleshooting guide

@rsyue rsyue marked this pull request as draft February 19, 2026 05:02

rsyue added 15 commits February 19, 2026 16:40

- Add enable_gradient_checkpointing parameter to HgCausalLMAdapter
- Call gradient_checkpointing_enable() on HF model when enabled
- Throw RuntimeError with diagnostics if checkpointing fails
- Add enable_gradient_checkpointing field to HuggingFaceModelConfig
- Update wrap_hg_model_if_causal_lm to pass checkpointing flag from config

Usage:
  model:
    config_overrides:
      enable_gradient_checkpointing: true
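
A simplified sketch of the adapter-side logic; `hf_model` stands in for the wrapped transformers model, and `gradient_checkpointing_enable()` is the standard transformers method:

```python
def _maybe_enable_checkpointing(hf_model, enable: bool) -> None:
    if not enable:
        return
    try:
        # Standard transformers API on PreTrainedModel.
        hf_model.gradient_checkpointing_enable()
    except Exception as ex:
        raise RuntimeError(
            f"Failed to enable gradient checkpointing on "
            f"{type(hf_model).__name__}: {ex}"
        ) from ex
```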

- Remove non-existent load_dataset import
- Add LMSFTDataSource import
- Create LMSFTDataset directly instead of using load_dataset
- Remove unused get_world_size import

- extract_batches_slurm.sh: Extract batches on Slurm cluster
- run_convergence_slurm.sh: Run full convergence test on Slurm (8 GPUs)
- Proper output/error path configuration for Slurm
- Resource allocation: 8 GPUs, 32 CPUs, 256GB RAM

- Replace get_rank() function calls with gang.rank property access
- get_rank() doesn't exist as standalone function in fairseq2.gang
- All fairseq2 code accesses rank via gang.rank property

- Replace setup_default_gang() with get_default_gangs()
- setup_default_gang() doesn't exist in fairseq2.gang
- get_default_gangs() returns a Gangs object with all parallel gangs
- Update gang.rank references to gangs.root.rank
- Remove custom gangs construction as get_default_gangs() provides proper object
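
Putting the two gang fixes together, a sketch using the API names these commit notes cite:

```python
# Names per the commit notes above; error handling omitted.
from fairseq2.gang import get_default_gangs

gangs = get_default_gangs()  # Gangs object holding all parallel gangs
rank = gangs.root.rank       # rank is a property, not a free function

if rank == 0:
    print("root gang, rank 0")
```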

- Add sys.path manipulation to find recipes module
- Activate .venv in both Slurm batch scripts
- Scripts need venv activated to have torch and other dependencies
- Ensures recipes module is importable from scripts directory

- load_tokenizer() doesn't accept family parameter
- Family is automatically determined from asset card
- Update run_convergence_slurm.sh to use miniconda environment

- Replace load_tokenizer with load_hg_tokenizer_simple
- load_hg_tokenizer_simple can load tokenizers directly from HuggingFace
- Fixes TokenizerNotKnownError for google/gemma-3-1b-it
- Tokenizer not registered in fairseq2 asset hub, load from HF instead
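
The resulting call, roughly; the function name comes from the commit notes, but the import path is a guess:

```python
# Import path assumed; load_hg_tokenizer_simple pulls the tokenizer
# straight from the HF hub, bypassing the fairseq2 asset registry.
from fairseq2.models.hg import load_hg_tokenizer_simple  # path is a guess

tokenizer = load_hg_tokenizer_simple("google/gemma-3-1b-it")
```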

- DataPipelineReader returns list[Batch], not single Batch
- Even with num_accumulate=1, returns list of length 1
- Extract first element from batches list before processing
- Fixes AttributeError: 'list' object has no attribute 'as_auto_regressive'
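
In code, the fix is a one-line unwrap; `reader` stands for a fairseq2 DataPipelineReader:

```python
for batches in reader:          # DataPipelineReader yields list[Batch]
    batch = batches[0]          # length-1 list even with num_accumulate=1
    batch = batch.as_auto_regressive()  # now a Batch, so this is valid
    # ... rest of the extraction loop ...
```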

- Add --output and --error directives to avoid permission denied
- Create slurm_logs directory for job output
- Slurm needs explicit output paths instead of system defaults
- Fixes 'cannot create directory /var/spool/slurmd/out' error

- Change relative paths (./slurm_logs) to absolute paths
- Slurm resolves relative paths from /var/spool/slurmd/ not job directory
- Fixes 'cannot create directory /var/spool/slurmd/slurm_logs' error
- Both run_comparison.sh and extract_batches_slurm.sh now use absolute paths

- Slurm jobs start in /var/spool/slurmd/ by default
- Script was trying to create directories relative to wrong location
- Explicit cd ensures all paths resolve correctly
- Fixes persistent 'cannot create directory /var/spool/slurmd/slurm_logs' error

- Replace dynamic BASH_SOURCE calculation with hardcoded paths
- When run through Slurm, BASH_SOURCE[0] points to /var/spool/slurmd/
- This caused PROJECT_ROOT to be calculated as /var/spool/
- Now explicitly set PROJECT_ROOT and SCRIPT_DIR to absolute paths
- Fixes persistent mkdir permission denied error

- Replace 'fairseq2' command with 'python -m recipes.lm.sft'
- fairseq2 CLI not installed, run recipe as Python module instead
- Check if unsloth is installed before trying to run it
- Make Unsloth training and comparison conditional
- Skip comparison if Unsloth not available
- Fixes exit code 127 (command not found)
- Handles missing unsloth module gracefully
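
A sketch of the two checks in Python terms (the orchestration script itself is bash; recipe arguments are omitted):

```python
import importlib.util
import subprocess
import sys

# The fairseq2 CLI is not installed, so run the recipe as a module.
cmd = [sys.executable, "-m", "recipes.lm.sft"]  # recipe args omitted
subprocess.run(cmd, check=True)

# Skip Unsloth training and the comparison when unsloth is missing,
# instead of failing with exit code 127.
if importlib.util.find_spec("unsloth") is None:
    print("unsloth not installed; skipping Unsloth run and comparison")
    sys.exit(0)
```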

…ormers in parallel. Convergence confirmed (disable FA for determinism). Documentation provided.