Lowish Latency, Google Aquila
"Aquila: A unified, low-latency fabric for datacenter networks" - Google Research Paper 2022

In Google's own words, Aquila was designed "To optimize for ultra-low latency, under load and in the tail, Aquila implements cell-based communication with shallow buffering for cells within the network, flow controlled links for near lossless cell delivery, and hardware adaptive routing to react in nanoseconds to link failures and to keep the network load balanced even at high loads." The goal of Aquila is to bring HPC performance to Google's hyperscale data centers. Had this been a decade ago, they might have come close to prevailing HPC cluster performance, but not today. Above are the latency graphs from Google's "Aquila: A unified, low-latency fabric for datacenter networks" research paper.

In High-Performance Computing (HPC), engineers consider three architectural criteria when designing a cluster: latency, bandwidth, and link over-subscription. Latency is the time required to move a packet through the network and is commonly reported as Round Trip Time (RTT), the time for a packet of data to reach its destination and return. Over-subscription is the ratio of bandwidth entering the network fabric to bandwidth available through the fabric. Google states Aquila's over-subscription is 2:1, but that figure applies only within the 24 servers connected to the same switch; InfiniBand, the leading HPC networking fabric, is 1:1 at this same level.
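As a rough illustration of how over-subscription is computed, here is a minimal Python sketch. The 24 server-facing links and the uplink count below are illustrative assumptions chosen only to reproduce the 2:1 ratio Google states; they are not port counts taken from the Aquila paper.

    # Minimal sketch: over-subscription is bandwidth into the fabric divided by
    # bandwidth through it. Port counts below are illustrative assumptions only.
    def oversubscription(ingress_gbps: float, uplink_gbps: float) -> float:
        return ingress_gbps / uplink_gbps

    server_links = 24 * 25.0   # assumed: 24 servers at 25 Gbps into the switch
    fabric_links = 12 * 25.0   # assumed: 12 x 25 Gbps uplinks into the fabric

    print(f"{oversubscription(server_links, fabric_links):.0f}:1")  # prints 2:1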

As mentioned in a prior post, Aquila NICs use a PCIe Gen3 x16 connector, which provides roughly 128 Gbps in each direction. The Aquila NIC, which is shared between two servers, provides a single server with 16 links at 25 Gbps each for an aggregate bandwidth of 400 Gbps. Google took this approach with 16 links to leverage the NIC chip's own internal switching. The benefit of Aquila may be switch latency. Within the Aquila chip, the latency is reported as 40 nanoseconds. Worst case, within the Aquila fabric, a cell will traverse two switch chips for a total switch latency of 80 nanoseconds. NVIDIA's current Quantum-2 Top of Rack switches feature 64 ports at 400 Gbps each. Configured in a similar Dragonfly architecture with 40 nodes per switch and 24 Top of Rack switches, a cluster of 960 nodes could easily be assembled in the same 24-rack data center footprint. NVIDIA's Quantum-2 switch latency is likely on par with Aquila's, but the difference is that each link carries 16X the bandwidth of Aquila's while using the same Dragonfly topology. An additional benefit of NVIDIA's Quantum-2 is that it also supports a Dragonfly+ architecture connecting up to one million nodes within a single cluster.
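The arithmetic behind these comparisons can be checked in a few lines of Python. The values below are simply the figures quoted in the paragraph above (PCIe Gen3 x16, 16 x 25 Gbps links, 40 ns per chip, 40 nodes across 24 switches), not new measurements.

    # Back-of-the-envelope check of the paragraph's numbers.
    pcie_gen3_x16_gbps = 16 * 8.0   # ~8 Gbps/lane after 128b/130b encoding, ~128 Gbps per direction
    aquila_nic_gbps    = 16 * 25.0  # 16 links x 25 Gbps = 400 Gbps aggregate
    worst_case_hops_ns = 2 * 40     # two Aquila switch chips at 40 ns each = 80 ns

    quantum2_cluster   = 40 * 24    # 40 nodes/switch x 24 ToR switches = 960 nodes
    bandwidth_ratio    = 400 / 25   # Quantum-2 port vs. Aquila link = 16x

    print(pcie_gen3_x16_gbps, aquila_nic_gbps, worst_case_hops_ns, quantum2_cluster, bandwidth_ratio)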

When measuring network latency, "tail" represents the worst case, often taken as the 99th percentile. Figures 7, 8, and 11 in the diagram above from the research paper are labeled 99p or 99.9p, while the median latency is 50p, representing the lowest 50% of packet latencies in these same figures. In the paper, Google cites the tail latency as sub-40 microseconds, with 1RMA latency (50th percentile) as sub-10 microseconds. From Figure 7 above, it appears that Ethernet/IP latency through Aquila is roughly 8 microseconds, and 1RMA latency about 3 microseconds. For the past decade and a half, InfiniBand has been the king of HPC interconnects, and today its fabric latencies are often sub-1 microsecond, while its switch latencies are measured in tens of nanoseconds. While I've never been a fan of InfiniBand, it has two decades of product maturity behind it and has resolved the Ethernet/IP bridging issues that have come up over this time. Also, InfiniBand switches easily bridge traffic between Ethernet/IP and InfiniBand, so there would be no need for the duplicate 100 GbE uplinks and switches outlined in Aquila's architecture.
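For readers unfamiliar with the percentile notation, the sketch below shows how 50p, 99p, and 99.9p latencies are read from a set of per-packet RTT samples. The samples here are synthetic placeholders, not data from the paper's figures.

    # Median and tail latency from a sample of RTTs; the samples are synthetic.
    import random
    import statistics

    random.seed(0)
    rtts_us = [random.lognormvariate(1.5, 0.6) for _ in range(100_000)]  # fake long-tailed RTTs (microseconds)

    p50  = statistics.quantiles(rtts_us, n=100)[49]     # median (50p)
    p99  = statistics.quantiles(rtts_us, n=100)[98]     # tail (99p)
    p999 = statistics.quantiles(rtts_us, n=1000)[998]   # deeper tail (99.9p)

    print(f"50p {p50:.1f} us, 99p {p99:.1f} us, 99.9p {p999:.1f} us")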

In HPC networks, latency is the primary design criterion; bandwidth and over-subscription are important, but ultra-low latency is the target. In 2006, Myrinet 10G, then the king of the HPC interconnect hill, had a measured latency through the fabric of 2.3 microseconds. InfiniBand went on to crush that over the next 16 years. Perhaps 3 microseconds for 1RMA latencies is good, depending on one's perspective, but taken in the context of history, perhaps it's just lowish latency.
