blogs/system/multipleprocess.md at master · LukeLIN-web/blogs

acclerate

accelerate config # 配置

FSDP

一个 step , 模型会seen多少数据? 是device batch size还是 global batch size . gpt和claude 说是global batch size.

DDP

pytorch 分布式计算你们都遇到过哪些坑/bug？ - 白露未晞me的回答 - 知乎 https://www.zhihu.com/question/351342218/answer/1977471497099550839

第一个 nccl 坑还是挺常见的.

pytorch 并行训练之DistributedDataParallel（代码样例和解释） - 格子不太方的文章 - 知乎 https://zhuanlan.zhihu.com/p/350301395

[原创][深度][PyTorch] DDP系列第一篇：入门教程 - 996黄金一代的文章 - 知乎 https://zhuanlan.zhihu.com/p/178402798 写的很好

https://pytorch.org/tutorials/beginner/ddp_series_multigpu.html

https://pytorch.org/tutorials/intermediate/ddp_tutorial.html

Pytorch 分布式训练 - 会飞的闲鱼的文章 - 知乎 https://zhuanlan.zhihu.com/p/76638962

https://www.cnblogs.com/rossixyz/p/15553670.html DDP绝世好文!

DistributedSampler

https://pytorch.org/docs/stable/data.html?highlight=distributedsampler#torch.utils.data.distributed.DistributedSampler

sampler = DistributedSampler(dataset) if is_distributed else None
loader = DataLoader(dataset, shuffle=(sampler is None),
                    sampler=sampler)
for epoch in range(start_epoch, n_epochs):
    if is_distributed:
        sampler.set_epoch(epoch)
    train(loader)

dist.new_group

如果在nccl后端每台机器上使用多个进程，则每个进程必须对其使用的每个 GPU 具有独占访问权限，因为在进程之间共享 GPU 可能会导致死锁。

broadcast

一般有个root来指定id, 把arr, 复制到其他所有GPU. 每个GPU都要调用broadcast.就是SPMD

    if dist.get_rank() == 0:
        x = torch.arange(3.) 
        scatter_list =  list(torch.tensor_split(x, 3))
    else:
        x = torch.zeros(3) #必须要准备相同大小的tensor, 不能是none
    dist.broadcast(x, src=0)

启动方法

The CUDA runtime does not support the fork start method; either the spawn or fork server start method are required to use CUDA in subprocesses.

pytorch/pytorch#2517 非常常见的错误.

1. python3 -m torch.distributed.launch --nproc_per_node=8 DDP.py
   2.  mp.spawn(run, args=(world_size, dataset, args, queue),
             nprocs=world_size, join=True)

传播object

torch.distributed.broadcast_object_list(*object_list*, *src=0*, *group=None*, *device=None*)

(https://pytorch.org/docs/stable/_modules/torch/distributed/distributed_c10d.html#broadcast_object_list)

通常不可选择的东西是，例如，套接字、文件（处理程序）、数据库连接等等。默认情况下，从基本 Python 类型（字典、列表、原语、对象、对象引用，甚至循环）构建（递归）的所有内容都可以腌制。

broadcast_object_list 很好,

更好的是scatter, 但是更难. 而且要变成list.

torch.distributed.all_gather_object(object_list, obj, group=None)

scatter object list 是torch dist 库目前不支持的, pytorch/pytorch#88685

试一下 scatter tensor行不行. nccl不行, 但是gloo可以.

问题

怎么让一个process控制两个GPU? dataparallel可以吗?

多进程的pytest好像是会hang住.

多进程的存储格式也不能是自定义的. 因为fork会pickle, 例如: RuntimeError: unsupported Storage type.

share memory

Once the tensor/storage is moved to shared_memory (see share_memory_()), it will be possible to send it to other processes without making any copies.

pipeline

https://vmlaker.github.io/mpipe/

可以用fence global barrier吗?

进程间通讯需要开销。尽量避开。不如三个进程一起sample。但是这个就是DDP。

状态机

状态机要求, 调用一个异步的函数sampler.start()

消息/事件, 同步/异步/协程, 并发/并行协程与状态机 ——从python asyncio引发的集中学习

状态机是比较容易编程的

class Sampler:
def start():
	sample node id
  copy into shared memory  idx1
  isfinished = True
    
class Gather:
def start():
  copy into GPU memory  feat1
  isfinished = True
    
class Trainer:
def start():
  train
  isfinished = True
    
S = sampling
G = gathering
T = training
F = finish
   // 要让FSM里面每个函数都是非阻塞的,
class FSM
def run(){
    switch (state)
      I:
        state <- S
        sampler.start()
      S:
        if sampler.isfinished():
          state <- G
          gather.start()
    G:
      if gather.isfinished():
          state <- T
          trainer.start()
    T:
      if trainer.isfinished():
          state <- S
}

FSM s[2]; //可以开很多,每次挑两个.为什么? 
while (true)
  for each S:
			S.run()

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

acclerate

FSDP

DDP

broadcast

启动方法

传播object

问题

share memory

pipeline

状态机

FilesExpand file tree

multipleprocess.md

Latest commit

History

multipleprocess.md

File metadata and controls

acclerate

FSDP

DDP

broadcast

启动方法

传播object

问题

share memory

pipeline

状态机