
NCCL Connection Failed Using PyTorch Distributed

I am trying to send a PyTorch tensor from one machine to another with torch.distributed. The dist.init_process_group call completes successfully, but the subsequent communication fails with an NCCL connection error.
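For reference, a minimal sketch of the setup being described. This uses the CPU-only "gloo" backend with a single rank so it runs on one machine; the address, port, and tensor shape are illustrative. Across real machines you would point MASTER_ADDR at the rank-0 host, pass backend="nccl" with CUDA tensors, and rank 0 would call dist.send(tensor, dst=1) while rank 1 calls dist.recv(tensor, src=0).

```python
import os
import torch
import torch.distributed as dist

# Rendezvous settings; on a real cluster MASTER_ADDR is the rank-0 host.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29501")

# Single-rank group with the CPU "gloo" backend; swap in "nccl" for GPUs.
dist.init_process_group(backend="gloo", rank=0, world_size=1)

t = torch.ones(3, 3)
dist.all_reduce(t)          # trivially a no-op with one rank, but exercises the group
print(int(t.sum().item()))  # 9

dist.destroy_process_group()
```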

Solution 1:

An unhandled system error means something failed on the NCCL side. You should first rerun your code with NCCL_DEBUG=INFO set (as the OP did), then work out what the error is from the debug log, paying particular attention to the warnings.
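Concretely, NCCL reads its debug level from the environment, so the rerun looks like this (the NCCL_DEBUG_SUBSYS filter is optional and the training script name is illustrative):

```shell
# Enable NCCL's diagnostic logging for the next run.
export NCCL_DEBUG=INFO
# Optionally narrow the output to the init and networking subsystems.
export NCCL_DEBUG_SUBSYS=INIT,NET
echo "NCCL_DEBUG is set to: $NCCL_DEBUG"
# Then launch the job as usual, e.g.:  python train.py
```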

In the OP's log, I think the line `iZbp11ufz31riqnssil53cZ:13530:13553 [0] include/socket.h:395 NCCL WARN Connect to 192.168.0.143<59811> failed : Connection timed out` is the cause of the unhandled system error.
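A "Connection timed out" warning like that one means plain TCP connectivity to the peer's address and port is failing, which you can check independently of NCCL. A small sketch of such a reachability probe (demonstrated against a local listener so it is self-contained; substitute the address and port from your own log):

```python
import socket

def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to (host, port) succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers refused connections and timeouts
        return False

# Demo against a local listener so the sketch runs anywhere.
server = socket.socket()
server.bind(("127.0.0.1", 0))   # port 0 = let the OS pick a free port
server.listen(1)
port = server.getsockname()[1]
print(can_connect("127.0.0.1", port))  # True: the port is reachable
server.close()
```

If the equivalent probe against the peer from the NCCL warning returns False, the problem is networking (firewall, security group, routing) rather than PyTorch.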

Solution 2:

I found that the two servers I was using were not in the same VPC, so they could never reach each other. I switched to servers in the same VPC and it worked.

Solution 3:

An unhandled system error in ProcessGroupNCCL.cpp usually means the code running on the two nodes differs, and I wonder why you use a = torch.zeros((3,3)).cuda() on node 1 with the same parameter name as on node 0?

