NCCL Connection Failed Using PyTorch Distributed
Solution 1:
"unhandled system error" means that something failed underneath, on the NCCL side. First rerun your code with NCCL_DEBUG=INFO (as the OP did), then work out what the error is from the debug log, paying particular attention to the warnings.
In the OP's log, I think this line is the cause of the unhandled system error:
iZbp11ufz31riqnssil53cZ:13530:13553 [0] include/socket.h:395 NCCL WARN Connect to 192.168.0.143<59811> failed : Connection timed out
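A minimal sketch of that workflow, assuming the job is driven from Python: set NCCL_DEBUG before the process group is created, then filter the captured log for warnings. The helper names find_nccl_warnings and parse_connect_failure and the regular expression are illustrative assumptions, not part of NCCL or PyTorch; the pattern is written only against the warning format quoted above.

```python
import os
import re

# NCCL reads this variable when it initializes, so set it before
# torch.distributed.init_process_group(...) runs (or export it in the shell).
os.environ["NCCL_DEBUG"] = "INFO"

# Assumed pattern, modeled on the "Connect to <ip><port> failed" warning in the OP's log.
CONNECT_FAIL = re.compile(
    r"NCCL WARN Connect to (?P<host>[\d.]+)<(?P<port>\d+)> failed\s*:\s*(?P<reason>.+)"
)

def find_nccl_warnings(log_text):
    """Return every log line that contains an NCCL warning."""
    return [line for line in log_text.splitlines() if "NCCL WARN" in line]

def parse_connect_failure(line):
    """Return (host, port, reason) for a failed-connect warning, or None."""
    match = CONNECT_FAIL.search(line)
    if match is None:
        return None
    return match.group("host"), int(match.group("port")), match.group("reason").strip()
```

Running parse_connect_failure on each line returned by find_nccl_warnings gives the peer address that could not be reached, which is usually the quickest pointer to a firewall or routing problem.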
Solution 2:
I found that the two servers I was using were not in the same VPC, so they could not communicate with each other. I switched to servers in the same VPC and it worked.
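A plain TCP reachability check from one node to the other can catch this kind of network isolation before training is launched. The sketch below is only an illustration: can_reach is a hypothetical helper, and the host and port you pass are whatever address you give to PyTorch as the master address and port. It needs something listening on that port on the remote node to report success.

```python
import socket

def can_reach(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers timeouts, refused connections, and unroutable hosts.
        return False
```

If this returns False in both directions between the two machines while each can reach itself, the problem is likely in the network setup (VPC, security group, firewall) rather than in the training script.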
Solution 3:
"unhandled system error" in ProcessGroupNCCL.cpp usually means that the code differs between the two nodes. Why do you use a = torch.zeros((3,3)).cuda() on node 1, with the same name as on node 0?