This repository was archived by the owner on Mar 3, 2026. It is now read-only.
It seems sharding along the sequence length isn't supported when FSDP and DP are both enabled: the sharding logic can't be resolved once three sharding specs are in play.
After printing out the sharding produced by convert_fn, it looks correct, but the torch-xla side can't complete the sharding; the root cause is lower down, in the _xla_tensors_from_aten function.
To use context parallelism, we need to bypass the parallel_loader and directly feed in activations that are already sharded correctly.
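To illustrate the intended layout, here is a minimal sketch (plain numpy, not torch-xla code) of how activations of shape (batch, seq, hidden) would be partitioned across a 3-axis mesh, with batch split over the DP axis and sequence length split over the context-parallel axis. The function name `shard_activations` and the mesh sizes are hypothetical, chosen only for this example:

```python
import numpy as np

def shard_activations(x, dp, sp):
    """Split batch across dp ranks and sequence across sp ranks.

    Returns a dict mapping (dp_rank, sp_rank) -> local shard,
    mimicking the per-device activation layout that context
    parallelism expects (the fsdp axis would shard parameters,
    not these activations, so it is omitted here).
    """
    b, s, h = x.shape
    assert b % dp == 0 and s % sp == 0, "axes must divide evenly"
    shards = {}
    for i, batch_chunk in enumerate(np.split(x, dp, axis=0)):
        for j, seq_chunk in enumerate(np.split(batch_chunk, sp, axis=1)):
            shards[(i, j)] = seq_chunk
    return shards

# Example: batch=2, seq=4, hidden=3 on a dp=2 x sp=2 mesh.
x = np.arange(2 * 4 * 3).reshape(2, 4, 3)
shards = shard_activations(x, dp=2, sp=2)
# Each device holds a (1, 2, 3) shard of the activations.
```

If activations are pre-sharded this way before entering the model, the ambiguous three-spec case never reaches the loader, which is the point of bypassing parallel_loader.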