In early April, a 25-person team at Google published “Aquila: A Unified Low-latency Fabric for Datacenter Networks.” I've only started grinding through the original research paper and the very insightful article by The Next Platform, but here are some of the realizations that surprised me, and might interest others in the SmartNIC community.
- There is no Aquila NIC! Aquila instead created a ToR-in-NIC (TiN) chip. Google puts two of these TiN chips on a line card and six line cards into each Aquila switch, and since each TiN serves two servers, that works out to 24 PCIe bus connections to servers per switch. These 24 servers represent, in Aquila terminology, a pod, and there are two pods per rack, for 48 servers/rack.
- Aquila Server Connectivity is Limited to 128 Gbps. The Aquila TiN chip uses a PCIe Gen3x16 cable to connect to each server, which tops out at roughly 128 Gbps. In July, servers will begin shipping with PCIe Gen5x16 slots (512 Gbps). While you can plug a Gen3 device into a Gen5 slot, that would be like taking a 20 MPH moped onto a California freeway where 65 MPH is a suggestion. Google does call out that two PCIe cables could go to each server, but they explicitly state that their pods are 24 servers each, so there is only one cable per server.
- The real Aquila Server Interconnect is a PCIe Gen3x16 Cable. It appears Google brings a PCIe Gen3x16 cable into the back of the server and plugs it down into the PCIe bus. While this sounds trivial, PCIe signal-integrity issues grow significantly with distance, which is why PCIe Gen3x16 extension cables are rarely more than 15 inches long and PCIe Gen4x16 cables rarely more than eight inches. Sure, custom shielded cables exist that can achieve significant distances, nearly three meters for PCIe Gen3, and clearly Google has done some impressive cable engineering. But if they ever move Aquila to PCIe Gen4 or Gen5, this will become their Achilles' heel.
- Aquila Servers are 3.5 to 1 Oversubscribed. Each Aquila TiN chip has a single 100 GbE connection shared between two servers, or 50 Gbps each. In addition, each TiN provides each server 300 Gbps within the pod and 100 Gbps of uplink beyond the pod. That works out to 450 Gbps of available interconnect bandwidth funneling into a 128 Gbps PCIe Gen3x16 bus, roughly 3.5 to 1 (see the back-of-envelope sketch after this list). By contrast, a non-blocking HPC Clos network, i.e., a full fat tree, is 1:1, tapered fat trees are commonly 2:1, and some other HPC implementations run 4:1.
- Leaving Every Aquila Rack are 216 Data Cables! Each rack has 192 x 25 Gbps dragonfly interconnect cables and 24 x 100 GbE cables. Within the rack there are another 48 PCIe Gen3 cables connecting the switches to the servers, which brings the total to 264 data cables per rack, OMG (the cable math is tallied in a sketch below). Perhaps Google is acquiring a cable company.
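For anyone who wants to check my arithmetic, here is a quick back-of-envelope sketch in Python of the per-server bandwidth and oversubscription numbers above. The PCIe figures are the usual rounded per-direction numbers (about 8 Gbps per lane for Gen3, 32 Gbps per lane for Gen5), and the 50/300/100 Gbps split is taken straight from the bullets, so treat it as a sanity check rather than anything lifted from the paper itself.

```python
# Back-of-envelope check of the Aquila per-server bandwidth numbers.
# PCIe rates use the usual rounded per-direction figures (~8 Gbps per
# lane for Gen3, ~32 Gbps per lane for Gen5), not exact encoded rates.

GBPS_PER_LANE = {"gen3": 8, "gen5": 32}

def pcie_gbps(gen: str, lanes: int = 16) -> int:
    """Approximate one-direction bandwidth of a PCIe link in Gbps."""
    return GBPS_PER_LANE[gen] * lanes

host_link = pcie_gbps("gen3")   # 128 Gbps into each server today
gen5_slot = pcie_gbps("gen5")   # 512 Gbps for a Gen5 x16 slot

# Fabric bandwidth offered to each server (numbers from the post):
#   50 Gbps  = half of the TiN's single 100 GbE port
#   300 Gbps = intra-pod bandwidth
#   100 Gbps = uplink beyond the pod
fabric_per_server = 50 + 300 + 100   # 450 Gbps

oversubscription = fabric_per_server / host_link

print(f"PCIe Gen3x16 host link: {host_link} Gbps")
print(f"PCIe Gen5x16 slot     : {gen5_slot} Gbps")
print(f"Fabric per server     : {fabric_per_server} Gbps")
print(f"Oversubscription      : {oversubscription:.1f} to 1")   # ~3.5 to 1
```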
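Similarly, here is a small sketch of the switch fan-out and the per-rack cable tally. The structure (two TiNs per line card, six line cards per switch, two servers per TiN, two pods per rack) and the 192/24 cable counts are the figures quoted above; none of the constants come from anywhere else.

```python
# Tally the Aquila switch fan-out and per-rack cabling described above.

TINS_PER_LINE_CARD = 2      # two TiN chips per line card
LINE_CARDS_PER_SWITCH = 6   # six line cards per Aquila switch
SERVERS_PER_TIN = 2         # each TiN serves two servers over PCIe Gen3x16
PODS_PER_RACK = 2

tins_per_switch = TINS_PER_LINE_CARD * LINE_CARDS_PER_SWITCH    # 12
servers_per_pod = tins_per_switch * SERVERS_PER_TIN             # 24 (one pod per switch)
servers_per_rack = servers_per_pod * PODS_PER_RACK              # 48

# Cables per rack, using the counts quoted in the post:
dragonfly_cables = 192            # 25 Gbps dragonfly links leaving the rack
ethernet_cables = 24              # 100 GbE uplinks leaving the rack
pcie_cables = servers_per_rack    # 48 PCIe Gen3x16 cables inside the rack

leaving_rack = dragonfly_cables + ethernet_cables   # 216
total_per_rack = leaving_rack + pcie_cables         # 264

print(f"TiNs per switch       : {tins_per_switch}")
print(f"Servers per pod       : {servers_per_pod}")
print(f"Servers per rack      : {servers_per_rack}")
print(f"Cables leaving rack   : {leaving_rack}")
print(f"Total cables per rack : {total_per_rack}")
```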
I’m still digging into both documents, so more to come later this week.
Product Marketing Manager (2y)
WOW! What are they thinking? Is this a solution looking for a problem?
FPGA Engineer at Major defense contractor (2y)
Scott Schweitzer, thank you for posting this diagram. There is no way around it unless one reads the paper, so thank you for providing the link to the paper 😁 The key to the whole vision looks like this diagram, which shows 2 pods of 24 servers each, and in each pod there are two TiN ASICs. I have the PCIe Gen3 versus PCIe Gen5 observation as well, but let's just look at the broader topology first. I will read the paper at least 2-3 times and then we can compare notes.
Good observations, Scott. It is interesting that they used PCIe Gen3 instead of Gen4 or Gen5. Also interesting is the emphasis on latency; minimizing the round-trip between nodes seems to be the key point of the interconnect. It will also be interesting to see how the relationship with Intel works out as they sell NICs.