    Update DDP for `torch.distributed.run` with `gloo` backend (#3680) · fad27c00
    Glenn Jocher authored
    * Update DDP for `torch.distributed.run`
    
    * Add LOCAL_RANK
    
    * remove opt.local_rank
    
    * backend="gloo|nccl"
    
    * print
    
    * print
    
    * debug
    
    * debug
    
    * os.getenv
    
    * gloo
    
    * gloo
    
    * gloo
    
    * cleanup
    
    * fix getenv
    
    * cleanup
    
    * cleanup destroy
    
    * try nccl
    
    * return opt
    
    * add --local_rank
    
    * add timeout
    
    * add init_method
    
    * gloo
    
    * move destroy
    
    * move destroy
    
    * move print(opt) under if RANK
    
    * destroy only RANK 0
    
    * move destroy inside train()
    
    * restore destroy outside train()
    
    * update print(opt)
    
    * cleanup
    
    * nccl
    
    * gloo with 60 second timeout
    
    * update namespace printing
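
For readers unfamiliar with the pattern this commit message describes, the following is a minimal sketch, not the code from the commit itself. It assumes the workers are started by `torch.distributed.run`, which exports `LOCAL_RANK`, `RANK` and `WORLD_SIZE` as environment variables, and it uses the `gloo` backend with a 60 second timeout as mentioned in the log above. The helper names `ddp_env` and `init_ddp` are hypothetical and exist only for illustration.

```python
import os
from datetime import timedelta


def ddp_env(env=os.environ):
    """Read the rank variables that the launcher exports for each worker.

    Falls back to -1 / -1 / 1 when the script runs as a plain single process.
    """
    local_rank = int(env.get("LOCAL_RANK", -1))
    rank = int(env.get("RANK", -1))
    world_size = int(env.get("WORLD_SIZE", 1))
    return local_rank, rank, world_size


def init_ddp(backend="gloo", timeout_s=60):
    """Join the process group if launched in distributed mode (hypothetical helper).

    Returns True if a process group was initialised, False for single-process runs.
    """
    local_rank, _, _ = ddp_env()
    if local_rank == -1:
        return False  # not launched by torch.distributed.run, nothing to set up

    # Imported lazily so single-process runs do not need torch.distributed.
    import torch.distributed as dist

    # With the default env:// init method, rank and world size are taken
    # from the environment variables set by the launcher.
    dist.init_process_group(backend=backend, timeout=timedelta(seconds=timeout_s))
    return True


if __name__ == "__main__":
    started = init_ddp()
    print("ranks:", ddp_env(), "distributed:", started)
    if started:
        import torch.distributed as dist

        dist.destroy_process_group()  # clean up once the work is done
```

Saved as a script (the file name is up to you), this could be launched with something like `python -m torch.distributed.run --nproc_per_node 2 script.py`. Run directly with `python script.py`, it takes the single-process path and never touches `torch.distributed`.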