We're working on getting Hugging Face Accelerate 1.0.0 up and running, and decided to publish our roadmap publicly to get your thoughts and opinions, and to keep you in the loop! Check out more of what we're thinking: https://lnkd.in/er7_ZeRw We're still learning the best ways to do things; for now, the project links to the relevant Accelerate issues once we've reached a point where we can start discussing them. Please follow those to voice your thoughts! 🤗
I guess the containers are given the host network, then. PyTorch runs on the NCCL backend, and NCCL isn't able to find the eth0 network interface when an overlay network is used.
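A minimal sketch of how one node of such a job might be launched, assuming the containers share the host network (so eth0 routes between nodes) and that NCCL should be pinned to that interface. `NCCL_SOCKET_IFNAME` and `NCCL_DEBUG` are NCCL environment variables, and the flags are `accelerate launch` options; the helper name `build_multinode_launch` and the example IP, interface, and port values are hypothetical.

```python
import os
import shlex


def build_multinode_launch(
    script: str,
    iface: str,
    main_ip: str,
    machine_rank: int,
    num_machines: int,
    gpus_per_machine: int,
    port: int = 29500,
) -> tuple[list[str], dict[str, str]]:
    """Hypothetical helper: return (argv, env) for one node of a multi-node run."""
    env = dict(os.environ)
    # Assumption: `iface` is the interface that actually routes between the
    # containers (e.g. eth0 when the container runs on the host network).
    # This stops NCCL from picking an interface that isn't reachable.
    env["NCCL_SOCKET_IFNAME"] = iface
    # Make NCCL log which interfaces it selects, to debug network issues.
    env["NCCL_DEBUG"] = "INFO"
    argv = [
        "accelerate", "launch",
        "--multi_gpu",
        "--num_machines", str(num_machines),
        "--machine_rank", str(machine_rank),
        # --num_processes is the total across all machines.
        "--num_processes", str(num_machines * gpus_per_machine),
        "--main_process_ip", main_ip,
        "--main_process_port", str(port),
        script,
    ]
    return argv, env


if __name__ == "__main__":
    # Example values only: second of two nodes, 4 GPUs each.
    argv, env = build_multinode_launch("train.py", "eth0", "10.0.0.1", 1, 2, 4)
    print(shlex.join(argv))
```

The same command would be run on every node with its own `machine_rank`. The right interface name depends on the container network setup, so check the `NCCL_DEBUG=INFO` output before relying on it.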
Technical Program Manager AI | Google | Ex IAF | IIT
Does Accelerate multi-node training work from inside containers hosted on different nodes?