AI Networking (AI Fabrics) - InfiniBand vs Ethernet
----------------------------------------------------
Not a single day passes without some news about AI. The industry buzz is all about the GPUs from Nvidia, AMD and Intel. While that 'AI compute' side is very important for these heavy workloads, how are we going to carry all the traffic those GPUs generate? The 'AI networking' side needs attention too, because Job Completion Time (JCT) matters enormously for AI workloads: a slower network leads to a slower JCT, and a slower JCT means these expensive GPUs sit under-utilized.
When it comes to AI fabrics, there are two major camps: InfiniBand, championed by Nvidia, and Ethernet, backed by the major Ethernet vendors. InfiniBand came from Mellanox, which Nvidia acquired in 2019. While InfiniBand gained strong early traction for AI, Ethernet is starting to win market share. If history is any indication, Ethernet will eventually prevail, just as it did against Token Ring, CDDI/FDDI, ATM and many other networking technologies. Nvidia also offers Ethernet-based solutions, but those require other Nvidia products for the whole solution to work.
The new Ultra Ethernet Consortium (UEC) was created by the Ethernet companies mainly to improve Ethernet's handling of heavy AI workloads. While the major Ethernet vendors have already joined the UEC, Nvidia has stayed away so far; it will be interesting to see whether they embrace it in the future. For now, it is easiest to divide the discussion into two camps: the Nvidia camp and the UEC camp.
Nvidia:
-------
Anyone who is even remotely interested in AI knows about Nvidia and their GPUs for AI workloads. Nvidia came up with NVLink for GPUs to communicate among themselves inside a single server, and we can expect about eight GPUs inside one server. But AI workloads need many more GPUs than that, which means multiple servers need to talk to each other, and that is where InfiniBand comes into play. InfiniBand appeared around 1999 and set out to fix the latency problems of Ethernet, a much older technology.
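To put the scale-out point in perspective, here is a tiny Python sketch. The eight-GPUs-per-server figure comes from the paragraph above; the 2,048-GPU job size is just an assumed example.

```python
# Toy calculation: how many servers (and hence how much inter-server
# traffic) a large training job implies. Numbers are illustrative only.
import math

GPUS_PER_SERVER = 8   # typical GPU count inside one server (connected via NVLink)
JOB_GPUS = 2048       # assumed job size, for illustration

servers = math.ceil(JOB_GPUS / GPUS_PER_SERVER)
print(f"{JOB_GPUS} GPUs -> {servers} servers")
# Every gradient exchange that crosses a server boundary has to traverse
# the fabric (InfiniBand or Ethernet), so the fabric's latency and
# bandwidth directly affect Job Completion Time (JCT).
```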
InfiniBand relies on Direct Memory Access (DMA), in which the Network Interface Card (NIC) accesses memory directly and relieves the CPU of that work. Remote DMA (RDMA) is the basis of InfiniBand: it transfers data from one machine's memory to another's without taking up CPU cycles, which makes the network very fast with extremely low latency (under roughly 300 nanoseconds per switch, compared to 800 nanoseconds to 1 microsecond for cut-through Ethernet, or many microseconds for store-and-forward Ethernet).
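As a rough back-of-the-envelope illustration (not a benchmark), the sketch below multiplies the per-switch latencies quoted above across an assumed five-hop Clos path; the 5-microsecond store-and-forward figure is likewise an assumption standing in for "many microseconds".

```python
# Per-switch latencies from the paragraph above; hop count is assumed.
HOPS = 5  # e.g. leaf -> spine -> super-spine -> spine -> leaf (assumed path)

per_hop_ns = {
    "InfiniBand":                    300,   # ~under 300 ns per switch
    "Ethernet (cut-through)":        900,   # ~800 ns to 1 us per switch
    "Ethernet (store-and-forward)": 5000,   # "many microseconds"; 5 us assumed
}

for fabric, ns in per_hop_ns.items():
    total_us = HOPS * ns / 1000
    print(f"{fabric:30s}: ~{total_us:.1f} us of switching latency end-to-end")
```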
Since Ethernet is the most popular protocol, widely adopted by the industry, Mellanox also came up with RoCE, which stands for RDMA over Converged Ethernet; it is essentially RDMA traffic encapsulated in Ethernet headers. The original RoCE worked only inside a single L2 segment, while RoCEv2 added IP and UDP headers so the traffic can be routed across L3. This lets RDMA tap into the Ethernet ecosystem, which is much bigger and cheaper than InfiniBand's narrower one.
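To make the encapsulation concrete, here is a minimal Scapy sketch of a RoCEv2-style packet: the InfiniBand transport header rides inside ordinary Ethernet/IP/UDP, which is what makes it routable at L3. The MAC and IP addresses, source port and payload are made up, and the 12 bytes standing in for the InfiniBand Base Transport Header are placeholders; only the UDP destination port 4791 is the real value assigned to RoCEv2.

```python
# Sketch of RoCEv2 layering: Ethernet / IP / UDP / (InfiniBand transport).
from scapy.all import Ether, IP, UDP, Raw

IB_BTH_LEN = 12                      # size of the InfiniBand Base Transport Header
bth_placeholder = bytes(IB_BTH_LEN)  # placeholder bytes, not a valid BTH

pkt = (
    Ether(src="aa:bb:cc:00:00:01", dst="aa:bb:cc:00:00:02")
    / IP(src="10.0.0.1", dst="10.0.1.1")   # IP header added in RoCEv2 (routable)
    / UDP(sport=49152, dport=4791)         # 4791 is the RoCEv2 UDP port
    / Raw(bth_placeholder + b"RDMA payload")
)
pkt.show()  # print the resulting header stack
```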
In their recent OCP presentation, Nvidia claimed to achieve 8 exaflops of peak AI performance using 256 Dell servers with 2,048 H100 GPUs, supported by 2,560 BlueField-3 Data Processing Units (DPUs) and 80+ Spectrum-4 Ethernet switches. The performance is very impressive, but this is clearly a vertically integrated solution with GPUs, DPUs and Ethernet switches all from Nvidia. They also have the Nvidia Collective Communication Library (NCCL), an I/O library that implements a set of GPU-accelerated collective operations, and they are positioning their adaptive routing techniques for load-balancing the traffic. This is very different from the Ethernet world, which is far more open and does not lock customers into any particular vendor. Still, the pace at which Nvidia is pushing the throughput race deserves credit.
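NCCL itself is a C library, but a common way to exercise its collectives from Python is through PyTorch's distributed package with the NCCL backend. The sketch below is a minimal all-reduce example under that assumption, not Nvidia's reference code; the script name and torchrun launch line are illustrative.

```python
# Minimal NCCL collective (all-reduce) via PyTorch distributed.
# Launch with, for example:
#   torchrun --nproc_per_node=8 allreduce_demo.py
import os
import torch
import torch.distributed as dist

def main():
    dist.init_process_group(backend="nccl")   # NCCL handles the GPU-to-GPU transfers
    rank = dist.get_rank()
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    torch.cuda.set_device(local_rank)

    # Each rank contributes a tensor; all_reduce sums it across all GPUs,
    # the same collective pattern gradient synchronization relies on.
    x = torch.ones(1024, device="cuda") * rank
    dist.all_reduce(x, op=dist.ReduceOp.SUM)
    print(f"rank {rank}: first element after all-reduce = {x[0].item()}")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```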
To summarize Nvidia's pitch: they clearly position InfiniBand for pure AI factories and Ethernet for regular multi-tenant clouds. For the mixed mode, where there is both AI traffic and regular traffic, they position Ethernet for the north-south traffic and RDMA-based Ethernet combined with their GPUs, DPUs and Ethernet switches for the east-west traffic.
Ultra Ethernet Consortium (UEC):
------------------------------------
Ethernet is the de facto standard for the Internet, and it has evolved over time from Spanning Tree Protocol (STP) based networks to Clos-based IP fabrics that scale to millions of nodes. STP was meant only to avoid loops and does not load-balance traffic; an IP fabric uses Equal-Cost Multi-Path (ECMP) to spread traffic across all links while also providing redundancy. But Ethernet is a lossy medium by default, and with TCP retransmissions its latency is higher than that of the RDMA used by InfiniBand. ECMP also may not balance traffic evenly, since it hashes on the flow (source/destination IP, source/destination ports and protocol). AI traffic consists mostly of a small number of elephant flows, whereas multi-tenant cloud traffic has a large number of smaller flows that the current hashing mechanisms balance far more easily.
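The elephant-flow problem is easy to see in a toy simulation: hash-based ECMP pins every packet of a flow to one link, so a handful of large flows can pile onto the same links while many small flows average out. The flow tuples, sizes and eight-link fabric below are made up purely for illustration.

```python
# Toy simulation of hash-based ECMP load balancing.
from collections import Counter

NUM_LINKS = 8

def five_tuple(i):
    # (src IP, dst IP, protocol, src port, dst port) - illustrative values
    return (f"10.0.0.{i % 250}", f"10.0.1.{i % 250}", 6, 49152 + i, 4791)

def ecmp_link(flow):
    # The hash of the 5-tuple pins every packet of the flow to one link.
    return hash(flow) % NUM_LINKS

def load_per_link(flows):
    load = Counter()
    for flow, size in flows:
        load[ecmp_link(flow)] += size
    return [load[i] for i in range(NUM_LINKS)]

elephants = [(five_tuple(i), 100) for i in range(8)]    # few large AI flows
mice      = [(five_tuple(i), 1)   for i in range(800)]  # many small cloud flows

print("8 elephant flows:", load_per_link(elephants))    # often badly skewed
print("800 mouse flows :", load_per_link(mice))         # roughly even
```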
Ultra Ethernet aims to address these drawbacks (the lossy behavior and the limitations of hash-based load balancing) at the specification level.
From the vendor perspective, Arista and Broadcom seem to be the most aggressive in positioning Ethernet for AI. One of Broadcom's presentations also points out that Arista's 7800 chassis has been deployed at Meta. Links to all of these are in the References section below for those interested in the full details. This space is evolving so fast that many companies are racing ahead, and I do think there will be multiple winners, since the pie is big enough for everyone to have a slice!
References:
6. Cisco's pitch is built around their Silicon One ASIC, which exactly matches the Broadcom Tomahawk 5's 51.2 Tbps capacity -
7. Nvidia's pitch from OCP 2023, where they position both InfiniBand and Ethernet (Spectrum-X): NVIDIA Spectrum-X Network Platform Architecture
11. Broadcom's presentation showing that Arista's 7800 chassis has been deployed in Meta's AI cluster (you will see this after about 10 minutes; Arista can build a 576-node cluster with their 7800R3 chassis): Broadcom Ethernet Fabric for AI ML at Scale Wrap Up
12. Arista's product manager on the need for UEC and Arista's involvement with it: Arista Networks and Ultra Ethernet Consortium (UEC)
13. Arista's engineers comparing Ethernet vs InfiniBand and positioning the right Arista products based on AI cluster size: AI Networking Fireside Chat