We're working on getting Hugging Face Accelerate 1.0.0 up and running, and we've decided to publish our roadmap to get your thoughts and opinions and to keep you in the loop! Check out more of what we're thinking: https://lnkd.in/er7_ZeRw We're still learning the best way to do this. For now, the project links to the relevant Accelerate issues once we've reached a point where we can start discussing them. Please follow those to voice your thoughts! 🤗
I guess the containers are given the host network, then. PyTorch runs on the NCCL backend, and NCCL isn't able to discover the eth0 interfaces over an overlay network.
Does Accelerate multi-node training work from inside containers hosted on different nodes?