HG - Hardware-specific sharding test and instructions with torchrun#1479
Draft
Conversation
…del) and changed allowed patterns to allow for the json index file
…projs are sharded in fs2)
…d references; new hg doc added
Contributor
Author
I cleaned this up a bit. Things had been moved to models/utils/ to avoid name conflicts with 'hg'; everything now lives in hg alongside the module rename. Safe serialization has been removed. Tested and working with gpt2, Qwen2.5 Omni, and gemma2.
Design for a parallel training comparison system that:
- Extracts identical batches from the fairseq2 data pipeline
- Runs fairseq2 and Unsloth training in parallel on separate GPUs
- Matches all hyperparameters (AdamW, cosine LR, bf16, grad clip)
- Compares checkpoints with np.allclose for convergence validation

Detailed step-by-step plan with:
- Data extraction script (fairseq2 pipeline to disk)
- Unsloth training script (matched hyperparameters)
- Checkpoint comparison script (np.allclose validation)
- Bash orchestration for parallel execution
- Config updates and documentation
- Extracts batches from fairseq2's SFT data pipeline
- Saves input_ids, attention_mask, labels to disk
- Uses static batching (batch_size=1) for reproducibility
- Only rank 0 saves batches in a distributed setting
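The save step above can be sketched roughly as follows; the function name, `.npz` file layout, and directory convention are illustrative assumptions, not the PR's actual code:

```python
import numpy as np

def save_batch(out_dir, step, input_ids, attention_mask, labels, rank=0):
    """Persist one extracted batch to disk.

    Only rank 0 writes in a distributed run; other ranks return None.
    """
    if rank != 0:
        return None
    path = f"{out_dir}/batch_{step:06d}.npz"
    np.savez(path, input_ids=input_ids,
             attention_mask=attention_mask, labels=labels)
    return path
```

Saving the three tensors per batch is enough for both trainers to consume byte-identical inputs later.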
- Uses pre-extracted batches from the fairseq2 pipeline
- Matches optimizer: AdamW with lr=3e-4, betas=(0.9, 0.95), weight_decay=0.1
- Matches LR schedule: CosineAnnealing with eta_min=6e-5
- Matches training: bf16, grad_clip=1.0, batch_size=1
- Supports distributed training with DDP
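The LR schedule being matched follows the standard cosine-annealing formula; a minimal sketch with the hyperparameters listed above (base lr=3e-4, eta_min=6e-5):

```python
import math

def cosine_lr(step, total_steps, base_lr=3e-4, eta_min=6e-5):
    """Cosine annealing: decays from base_lr at step 0 to eta_min at total_steps."""
    return eta_min + 0.5 * (base_lr - eta_min) * (
        1 + math.cos(math.pi * step / total_steps))
```

Both trainers must evaluate this at the same step granularity, otherwise per-step LRs (and hence checkpoints) drift apart even with identical data.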
- Loads fairseq2 and Unsloth checkpoints
- Maps parameter names (_wrapped_hf_model prefix handling)
- Uses np.allclose with bf16-appropriate tolerances
- Reports detailed mismatch information (max/mean/rel diff)
- Returns exit code 0 for success, 1 for failure
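The comparison logic can be sketched as below, assuming the two checkpoints are already loaded as name-to-array dicts; the exact tolerances and the prefix-stripping convention are illustrative, not copied from the PR:

```python
import numpy as np

def compare_state_dicts(a, b, rtol=1e-2, atol=1e-3,
                        strip_prefix="_wrapped_hf_model."):
    """Compare two checkpoints with loose, bf16-appropriate tolerances.

    Returns (ok, report): report lists per-tensor max/mean/relative diffs
    for mismatched tensors. Prefix stripping aligns wrapped vs. plain names.
    """
    a = {k.removeprefix(strip_prefix): np.asarray(v, dtype=np.float64)
         for k, v in a.items()}
    b = {k.removeprefix(strip_prefix): np.asarray(v, dtype=np.float64)
         for k, v in b.items()}
    if a.keys() != b.keys():
        return False, [f"key mismatch: {sorted(a.keys() ^ b.keys())}"]
    ok, report = True, []
    for name in sorted(a):
        if not np.allclose(a[name], b[name], rtol=rtol, atol=atol):
            ok = False
            diff = np.abs(a[name] - b[name])
            rel = diff / (np.abs(b[name]) + 1e-12)
            report.append(f"{name}: max={diff.max():.3e} "
                          f"mean={diff.mean():.3e} rel={rel.max():.3e}")
    return ok, report
```

A driver script would then `sys.exit(0 if ok else 1)` so the bash orchestration can branch on the result.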
- Sets up directories for fairseq2, Unsloth, extracted data
- Extracts dataset batches (or skips with --skip-extraction)
- Runs fairseq2 and Unsloth training in parallel on separate GPUs
- Compares checkpoints and reports results
- Returns exit code 0 for success, 1 for failure

- Set num_steps to 100 for a quick convergence test
- Checkpoint only at step 100
- Keep only the last checkpoint to save space

- Usage instructions for each script
- Architecture overview
- Troubleshooting guide
- Add enable_gradient_checkpointing parameter to HgCausalLMAdapter
- Call gradient_checkpointing_enable() on HF model when enabled
- Throw RuntimeError with diagnostics if checkpointing fails
- Add enable_gradient_checkpointing field to HuggingFaceModelConfig
- Update wrap_hg_model_if_causal_lm to pass checkpointing flag from config
Usage:

  model:
    config_overrides:
      enable_gradient_checkpointing: true
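The adapter change described above could look roughly like this; `HgCausalLMAdapter` internals are not shown in the PR, so this helper is a hypothetical sketch around the real HF `gradient_checkpointing_enable()` call:

```python
def enable_checkpointing(hf_model, enabled: bool) -> None:
    """Turn on HF gradient checkpointing, raising RuntimeError with
    diagnostics if the underlying model cannot support it."""
    if not enabled:
        return
    try:
        # Real HF models expose gradient_checkpointing_enable(); a missing
        # method or an internal failure both surface as RuntimeError below.
        hf_model.gradient_checkpointing_enable()
    except Exception as ex:
        raise RuntimeError(
            f"Failed to enable gradient checkpointing on "
            f"{type(hf_model).__name__}: {ex}") from ex
```

Failing loudly here is deliberate: silently training without checkpointing would change the memory profile the config promised.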
- Remove non-existent load_dataset import
- Add LMSFTDataSource import
- Create LMSFTDataset directly instead of using load_dataset
- Remove unused get_world_size import

- extract_batches_slurm.sh: Extract batches on a Slurm cluster
- run_convergence_slurm.sh: Run full convergence test on Slurm (8 GPUs)
- Proper output/error path configuration for Slurm
- Resource allocation: 8 GPUs, 32 CPUs, 256GB RAM

- Replace get_rank() function calls with gang.rank property access
- get_rank() doesn't exist as a standalone function in fairseq2.gang
- All fairseq2 code accesses rank via the gang.rank property

- Replace setup_default_gang() with get_default_gangs()
- setup_default_gang() doesn't exist in fairseq2.gang
- get_default_gangs() returns a Gangs object with all parallel gangs
- Update gang.rank references to gangs.root.rank
- Remove custom gangs construction as get_default_gangs() provides the proper object

- Add sys.path manipulation to find the recipes module
- Activate .venv in both Slurm batch scripts
- Scripts need the venv activated to have torch and other dependencies
- Ensures the recipes module is importable from the scripts directory

- load_tokenizer() doesn't accept a family parameter
- Family is automatically determined from the asset card
- Update run_convergence_slurm.sh to use the miniconda environment

- Replace load_tokenizer with load_hg_tokenizer_simple
- load_hg_tokenizer_simple can load tokenizers directly from HuggingFace
- Fixes TokenizerNotKnownError for google/gemma-3-1b-it
- Tokenizer not registered in the fairseq2 asset hub; load from HF instead
- DataPipelineReader returns list[Batch], not a single Batch
- Even with num_accumulate=1, it returns a list of length 1
- Extract the first element from the batches list before processing
- Fixes AttributeError: 'list' object has no attribute 'as_auto_regressive'
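The fix amounts to unwrapping the singleton list the reader yields; a minimal sketch (the helper name is made up for illustration):

```python
def unwrap_batch(batches):
    """DataPipelineReader yields list[Batch] even with num_accumulate=1;
    take the single element before calling batch-level methods."""
    if isinstance(batches, list):
        if len(batches) != 1:
            raise ValueError(f"expected a single batch, got {len(batches)}")
        return batches[0]
    return batches
```

Guarding the length makes a later change to num_accumulate fail loudly instead of silently dropping batches.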
- Add --output and --error directives to avoid permission denied
- Create slurm_logs directory for job output
- Slurm needs explicit output paths instead of system defaults
- Fixes 'cannot create directory /var/spool/slurmd/out' error

- Change relative paths (./slurm_logs) to absolute paths
- Slurm resolves relative paths from /var/spool/slurmd/, not the job directory
- Fixes 'cannot create directory /var/spool/slurmd/slurm_logs' error
- Both run_comparison.sh and extract_batches_slurm.sh now use absolute paths

- Slurm jobs start in /var/spool/slurmd/ by default
- Script was trying to create directories relative to the wrong location
- Explicit cd ensures all paths resolve correctly
- Fixes persistent 'cannot create directory /var/spool/slurmd/slurm_logs' error

- Replace dynamic BASH_SOURCE calculation with hardcoded paths
- When run through Slurm, BASH_SOURCE[0] points to /var/spool/slurmd/
- This caused PROJECT_ROOT to be calculated as /var/spool/
- Now explicitly set PROJECT_ROOT and SCRIPT_DIR to absolute paths
- Fixes persistent mkdir permission denied error
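The Slurm-safe path handling described in the commits above can be sketched as a batch-script header; the concrete paths are hypothetical placeholders, and the `$PWD` fallback exists only to keep the sketch runnable outside a cluster (the real scripts hardcode an absolute checkout path):

```shell
#!/bin/bash
# Hypothetical absolute log paths; Slurm cannot write to its spool defaults.
#SBATCH --output=/absolute/path/slurm_logs/%j.out
#SBATCH --error=/absolute/path/slurm_logs/%j.err

# Slurm starts jobs in /var/spool/slurmd/, so relative paths and
# BASH_SOURCE[0]-derived roots resolve to the wrong place. Hardcode the
# project root instead of computing it dynamically.
PROJECT_ROOT="${PROJECT_ROOT:-$PWD}"
cd "$PROJECT_ROOT"
mkdir -p "$PROJECT_ROOT/slurm_logs"
```

`%j` in the `--output`/`--error` directives expands to the Slurm job ID, giving one log pair per job.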
- Replace the 'fairseq2' command with 'python -m recipes.lm.sft'
- fairseq2 CLI not installed; run the recipe as a Python module instead
- Check whether unsloth is installed before trying to run it
- Make Unsloth training and comparison conditional
- Skip comparison if Unsloth is not available
- Fixes exit code 127 (command not found)
- Handles a missing unsloth module gracefully
…ormers in parallel. Convergence confirmed (FlashAttention disabled for determinism). Documentation provided.
What does this PR do? Please describe:
When more than one GPU is present, model sharding can be tested with:

  torchrun --nproc-per-node 8 test_hg_factory.py

Does your PR introduce any breaking changes? If yes, please list them:
None that I am aware of.
Check list: