Lowish Latency, Google Aquila
"Aquila: A unified, low-latency fabric for datacenter networks" - Google Research Paper 2022

In Google's own words, Aquila was designed "To optimize for ultra-low latency, under load and in the tail, Aquila implements cell-based communication with shallow buffering for cells within the network, flow controlled links for near lossless cell delivery, and hardware adaptive routing to react in nanoseconds to link failures and to keep the network load balanced even at high loads." The goal of Aquila is to bring HPC performance to Google's hyperscale data centers. Had this been a decade ago, they might have come close to prevailing HPC cluster performance, but not today. Above are the latency graphs from Google's "Aquila: A unified, low-latency fabric for datacenter networks" research paper.

In High-Performance Computing (HPC), engineers consider three architectural criteria when designing a cluster: latency, bandwidth, and link over-subscription. Latency is the time required to move a packet through the network and is commonly reported as Round Trip Time (RTT), the time for a packet of data to reach its destination and return. Over-subscription is the ratio of bandwidth entering the network fabric to bandwidth available through the fabric. Google states Aquila's over-subscription is 2:1, but that figure applies only within the 24 servers connected to the same switch; InfiniBand, the leading HPC networking fabric, is 1:1 at this same level.
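As a rough illustration of how over-subscription is computed, here is a minimal Python sketch. The 24 server-facing links and the uplink count below are illustrative assumptions chosen only to reproduce the 2:1 ratio Google states; they are not port counts taken from the Aquila paper.

    # Minimal sketch: over-subscription is bandwidth into the fabric divided by
    # bandwidth through it. Port counts below are illustrative assumptions only.
    def oversubscription(ingress_gbps: float, uplink_gbps: float) -> float:
        return ingress_gbps / uplink_gbps

    server_links = 24 * 25.0   # assumed: 24 servers at 25 Gbps into the switch
    fabric_links = 12 * 25.0   # assumed: 12 x 25 Gbps uplinks into the fabric

    print(f"{oversubscription(server_links, fabric_links):.0f}:1")  # prints 2:1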

As mentioned in a prior post, Aquila NICs use a PCIe Gen3 x16 connector, which provides roughly 128 Gbps in each direction. The Aquila NIC, which is shared between two servers, provides a single server with 16 links at 25 Gbps each for an aggregate bandwidth of 400 Gbps. Google took this approach with 16 links to leverage the NIC chip's own internal switching. The benefit of Aquila may be switch latency. Within the Aquila chip, the latency is reported as 40 nanoseconds. Worst case, within the Aquila fabric, a cell will traverse two switch chips for a total switch latency of 80 nanoseconds. NVIDIA's current Quantum-2 Top of Rack switches feature 64 ports at 400 Gbps each. Configured in a similar Dragonfly architecture with 40 nodes per switch and 24 Top of Rack switches, a cluster of 960 nodes could easily be assembled in the same 24-rack data center footprint. NVIDIA's Quantum-2 switch latency is likely on par with Aquila's, but the difference is that each link carries 16X the bandwidth of Aquila's while using the same Dragonfly topology. An additional benefit of NVIDIA's Quantum-2 is that it also supports a Dragonfly+ architecture connecting up to one million nodes within a single cluster.
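The arithmetic behind these comparisons can be checked in a few lines of Python. The values below are simply the figures quoted in the paragraph above (PCIe Gen3 x16, 16 x 25 Gbps links, 40 ns per chip, 40 nodes across 24 switches), not new measurements.

    # Back-of-the-envelope check of the paragraph's numbers.
    pcie_gen3_x16_gbps = 16 * 8.0   # ~8 Gbps/lane after 128b/130b encoding, ~128 Gbps per direction
    aquila_nic_gbps    = 16 * 25.0  # 16 links x 25 Gbps = 400 Gbps aggregate
    worst_case_hops_ns = 2 * 40     # two Aquila switch chips at 40 ns each = 80 ns

    quantum2_cluster   = 40 * 24    # 40 nodes/switch x 24 ToR switches = 960 nodes
    bandwidth_ratio    = 400 / 25   # Quantum-2 port vs. Aquila link = 16x

    print(pcie_gen3_x16_gbps, aquila_nic_gbps, worst_case_hops_ns, quantum2_cluster, bandwidth_ratio)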

When measuring network latency, "tail" represents the worst case, often taken as the 99th percentile. Figures 7, 8, and 11 in the diagram above from the research paper are labeled 99p or 99.9p, while the median latency is 50p, representing the lowest 50% of packet latencies in these same figures. In the paper, Google cites the tail latency as sub-40 microseconds, with 1RMA latency (50th percentile) as sub-10 microseconds. From Figure 7 above, it appears that Ethernet/IP latency through Aquila is roughly 8 microseconds, and 1RMA latency about 3 microseconds. For the past decade and a half, InfiniBand has been the king of HPC interconnects, and today its fabric latencies are often sub-1 microsecond, while its switch latencies are measured in tens of nanoseconds. While I've never been a fan of InfiniBand, it has two decades of product maturity behind it and has resolved the Ethernet/IP bridging issues that have come up over this time. Also, InfiniBand switches easily bridge traffic between Ethernet/IP and InfiniBand, so there would be no need for the duplicate 100 GbE uplinks and switches outlined in Aquila's architecture.
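For readers unfamiliar with the percentile notation, the sketch below shows how 50p, 99p, and 99.9p latencies are read from a set of per-packet RTT samples. The samples here are synthetic placeholders, not data from the paper's figures.

    # Median and tail latency from a sample of RTTs; the samples are synthetic.
    import random
    import statistics

    random.seed(0)
    rtts_us = [random.lognormvariate(1.5, 0.6) for _ in range(100_000)]  # fake long-tailed RTTs (microseconds)

    p50  = statistics.quantiles(rtts_us, n=100)[49]     # median (50p)
    p99  = statistics.quantiles(rtts_us, n=100)[98]     # tail (99p)
    p999 = statistics.quantiles(rtts_us, n=1000)[998]   # deeper tail (99.9p)

    print(f"50p {p50:.1f} us, 99p {p99:.1f} us, 99.9p {p999:.1f} us")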

In HPC networks, latency is the primary design criterion; bandwidth and over-subscription are important, but ultra-low latency is the target. In 2006, Myrinet 10G, then the king of the HPC interconnect hill, had a measured latency through the fabric of 2.3 microseconds. InfiniBand went on to crush that over the next 16 years. Perhaps 3 microseconds for 1RMA latencies is good, depending on one's perspective, but taken in the context of history, perhaps it's just lowish latency.
