The Networking That Powers the Chips


AI farms are data centers filled with racks upon racks of servers, each packed with cables, chips, and disks. The cluttered data rooms of the 1990s have given way to something far more elegant: neatly routed cabling in air-conditioned halls, and modern liquid cooling, including direct-to-chip cooling, that keeps the entire system running efficiently.

The PC circuit board (better known to most of us as the motherboard) is a very complicated piece of equipment. Communication across it involves complex protocols, multiple signaling layers, and sophisticated chipsets that manage the interactions between its various components.

Binary representation (1's and 0's) is fundamental to how digital systems represent data. At the lowest level, all these components are really doing is shuttling streams of 1's and 0's among themselves.
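To make that concrete, here is a tiny C sketch (my own illustration, not anything pulled from actual hardware) that prints the individual bits of a single byte, the same kind of 1's and 0's that flow between these components:

    #include <stdio.h>

    int main(void) {
        unsigned char value = 42;        /* one byte of data */
        /* Print the 8 bits, most significant first: this is the form
           in which the byte would travel across a bus. */
        for (int bit = 7; bit >= 0; bit--)
            putchar(((value >> bit) & 1) ? '1' : '0');
        putchar('\n');                   /* prints 00101010 */
        return 0;
    }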

At that level, assembly language comes into play. And I am not joking when I say that assembly is not something even the most proficient programmers today would want to write by hand.

Why?

The answer is simple. Assembly programming involves moving data between registers (tiny storage locations inside the CPU itself), performing binary computations on them, and loading results back out again. A simple program to multiply two numbers in a language like C may take four or five lines including the boilerplate, but the same program in assembly can easily take three or four times as many lines of code, each written meticulously and painstakingly.
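To make the comparison concrete, here is the multiplication example in C, with a comment sketching the handful of x86-64 instructions a compiler typically emits for it (the exact output varies by compiler and optimization settings, so treat the listing as illustrative):

    #include <stdio.h>

    /* Multiplying two integers takes one line of C.
       A compiler typically lowers it to something like (x86-64, System V ABI):
           mov  eax, edi    ; copy the first argument from register edi into eax
           imul eax, esi    ; multiply by the second argument held in register esi
           ret              ; the result is returned in eax
       In hand-written assembly, you manage those register moves yourself,
       and anything more elaborate balloons in length very quickly. */
    static int multiply(int a, int b) {
        return a * b;
    }

    int main(void) {
        printf("%d\n", multiply(6, 7));  /* prints 42 */
        return 0;
    }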

Why did I bring this up?

When you program in assembly, you get a real sense of how data, represented as 0's and 1's, whizzes through the circuitry into registers via the communication system on the PC that we call a bus. It's truly remarkable to see how, at a very fundamental level, the digital nuts and bolts of our ecosystem interact.

Computer networks are hidden in plain sight, yet they are one of the places where data moves the fastest.

This article talks about three of the most important networking technologies today.


2 Men and a Baby

Let's define what a computer network is to start with.

In short, it's a system that connects two or more computing devices together, enabling them to communicate and share resources. When I say devices, I mean personal computers, servers, mobile phones, tablets, and even printers. Connections between these devices can be established through various means, including physical cables (like fiber optics) or even wireless technologies (such as Wi-Fi). In fact, these are the most common we see around us.

Networks, too, need a boundary. That's where the LAN and WAN come in.

Local area networks (LANs) and wide area networks (WANs) define the boundaries for grouping these systems (also sometimes referred to as nodes) together. Needless to say, a data center or wide area network without networking is like food sitting in a restaurant kitchen while customers wait at their tables: there are no waiters, no delivery mechanism!

This article discusses the delivery mechanism, specifically computer networking.

Various networking systems operate globally, working behind the scenes to transport data from one system to another across the internet. You have Ethernet, which is used almost ubiquitously now; Infiniband, which is used for high-performance computing (these are the 2 men); and the new kid on the block, NVLink by NVIDIA, developed for its own high-end infrastructure.


Ethernet


The term Ethernet transports me back to my university days, when I was pursuing my post-graduation in computer science. In my final semester, I was saddled with a thick networking book by Andrew S. Tanenbaum called "Computer Networks." Not only was it thick, it was also very difficult to understand.

We had to learn about TCP/IP (the protocol suite that drives networking across the Internet today) and its 4 layers. It was both a fascinating and a frustrating learning experience: fascinating because the diagrams let you see how communication worked, frustrating because it stayed at the level of pictures and words, without my ever getting to implement anything.
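To show where those layers sit in practice, here is a minimal C sketch of a TCP client: it resolves a host name, opens a connection, and hands a few application-layer bytes to TCP, leaving the internet and link layers below to do the actual delivery. The host (example.com) and port (80) are just placeholders for illustration:

    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/types.h>
    #include <sys/socket.h>
    #include <netdb.h>

    int main(void) {
        /* Application layer: decide who to talk to (placeholder host and port). */
        struct addrinfo hints = {0}, *res;
        hints.ai_family = AF_UNSPEC;
        hints.ai_socktype = SOCK_STREAM;          /* ask for TCP */
        if (getaddrinfo("example.com", "80", &hints, &res) != 0) {
            fprintf(stderr, "name resolution failed\n");
            return 1;
        }

        /* Transport layer: open a TCP socket and connect. IP routing and
           Ethernet framing all happen beneath this call. */
        int fd = socket(res->ai_family, res->ai_socktype, res->ai_protocol);
        if (fd < 0 || connect(fd, res->ai_addr, res->ai_addrlen) != 0) {
            perror("connect");
            return 1;
        }

        /* Hand a small request to TCP and read back the first bytes of the reply. */
        const char *msg = "HEAD / HTTP/1.0\r\nHost: example.com\r\n\r\n";
        send(fd, msg, strlen(msg), 0);

        char buf[256];
        ssize_t n = recv(fd, buf, sizeof(buf) - 1, 0);
        if (n > 0) {
            buf[n] = '\0';
            printf("%s\n", buf);
        }

        close(fd);
        freeaddrinfo(res);
        return 0;
    }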

Then we were drowning in all the different protocols and channel-access schemes, such as CDMA, TDMA, and AlohaNet, that told the story of how networking evolved.

Now let's turn our attention to Ethernet.

In the beginning, in the early 70s, long before I was born, a young engineer named Robert Metcalfe, working at Xerox's Palo Alto Research Center (PARC), invented Ethernet, taking inspiration from the AlohaNet system. This invention made it possible for personal computers to communicate with other devices at high speed, and speed is the key word here.

He later left Xerox and founded 3Com to commercialize the technology he had built. Together with Digital Equipment Corporation, Intel, and Xerox, he pushed Ethernet out as an open standard. At the time, its 10 Mbps speed was a revolution, and it was difficult to counter.

At the time, there were other major players in the market. IBM in particular was pushing its own standard, the Token Ring network (which also found mention in Tanenbaum's book). Ethernet eventually shrugged off much of the competition because it was adaptable: initially it could run over a variety of coaxial cables, and as time progressed it became compatible with thinner cabling as well. Soon the famous Network Interface Card (NIC) for Ethernet was built for the IBM PC.

By the early 80s, the now-famous Institute of Electrical and Electronics Engineers (IEEE) was working on standardizing the technology. In 1983 it approved the original 802.3 standard for thick Ethernet, and the standard was officially published in 1985.

There was no turning back now. Ethernet had finally arrived.

Over the decades since, Ethernet has grown by leaps and bounds in terms of data transmission speeds (a quick back-of-the-envelope calculation of what these numbers mean follows the list):

  • 1990 - 100 Mbps
  • 1998 - 1 Gbps
  • 2002 - 10 Gbps
  • 2010 - 40 Gbps
  • 2015 - 100 Gbps
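Here is a small back-of-the-envelope sketch of what those jumps mean in practice: a C program (the 1 TB dataset size is just an illustrative number, not a figure from this article) that computes how long moving that much data would take at each line rate:

    #include <stdio.h>

    int main(void) {
        /* Line rates in bits per second, matching the milestones above,
           plus the original 10 Mbps for comparison. */
        const double rates_bps[] = {10e6, 100e6, 1e9, 10e9, 40e9, 100e9};
        const char *labels[]     = {"10 Mbps", "100 Mbps", "1 Gbps",
                                    "10 Gbps", "40 Gbps", "100 Gbps"};
        const double dataset_bits = 1e12 * 8;   /* 1 TB expressed in bits */

        for (int i = 0; i < 6; i++) {
            double seconds = dataset_bits / rates_bps[i];
            printf("%-9s -> about %.0f seconds to move 1 TB\n",
                   labels[i], seconds);
        }
        return 0;
    }

At 10 Mbps the transfer takes roughly nine days; at 100 Gbps it takes around 80 seconds. That is the difference a few decades of Ethernet evolution makes.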

Today, Ethernet is everywhere, connecting computers, phones, and even powering devices through what is sometimes called Power over Ethernet (PoE).

All good so far? Now comes the kick! Enter High-Performance Computing (HPC).

So what on earth is High-Performance Computing?

When we talk about HPC, we normally mean those large and expensive supercomputers and computer clusters that process huge amounts of data and perform complex calculations at extraordinarily high speeds. These systems are very often made up of clusters of thousands of nodes working in parallel.

This is not the typical work we do at the office on a daily basis, unless you happen to be a scientist running research experiments, engineering simulations, or data-intensive applications in industries such as healthcare, finance, and climate modeling.

Moving data from node to node across the internet at comparatively high speeds was Ethernet's badge of honor. But HPC demanded significantly higher networking speeds and lower latency than Ethernet could comfortably deliver.

This is where the second player enters the fray.


InfiniBand


InfiniBand is a much more recent technology than Ethernet, and it could be argued that it emerged specifically to overcome Ethernet's inherent limitations. It had the early backing of major companies such as Sun Microsystems, Microsoft, HP, and Intel. The InfiniBand Trade Association was formed in 1999, and the first InfiniBand Architecture Specification (Version 1.0) followed in 2000, long after Ethernet had become entrenched.

The key point to note is that this new networking standard was aimed squarely at high-performance computing (HPC) clusters.

In 2001, a company named Mellanox entered the fray and started shipping InfiniBand adapters. Its hardware put the spotlight on one of the architecture's defining features, Remote Direct Memory Access (RDMA), which lets data move directly between the memory of two systems without involving their CPUs in the transfer. That capability reduced latency considerably and increased throughput.
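As a small, hedged illustration of the software side: on a Linux machine with RDMA-capable hardware and the libibverbs library installed, the verbs API can at least enumerate those adapters, as sketched below. A real RDMA transfer involves much more (registering memory regions, creating queue pairs, and exchanging connection details out of band), all of which is omitted here:

    #include <stdio.h>
    #include <infiniband/verbs.h>

    int main(void) {
        int num = 0;
        /* Ask the verbs library for all RDMA-capable devices on this host. */
        struct ibv_device **devices = ibv_get_device_list(&num);
        if (!devices) {
            perror("ibv_get_device_list");
            return 1;
        }
        printf("found %d RDMA-capable device(s)\n", num);
        for (int i = 0; i < num; i++)
            printf("  %s\n", ibv_get_device_name(devices[i]));
        ibv_free_device_list(devices);
        return 0;
    }

On a typical Linux setup this compiles with something like gcc list_rdma.c -libverbs, assuming the libibverbs development package is present.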

InfiniBand soon began to gain traction in the HPC community, where the speed requirements were of a very different order, and adoption steadily grew.

There was one niche that InfiniBand carved out for itself: supercomputers, the machines that academic institutions and research organizations use for high-performance computing applications and scientific experiments.

Here are some of InfiniBand's notable deployments:

  • 2003: Virginia Tech built an InfiniBand cluster that ranked third among the world's top supercomputers at the time.
  • 2016: Wuxi Supercomputing Center (China). This supercomputer is used for a variety of scientific applications such as weather prediction, materials science, and energy research.
  • 2018: Sierra Supercomputer (Lawrence Livermore National Laboratory, USA). Designed for nuclear weapons simulations and advanced scientific research, Sierra ranked as the second-fastest supercomputer in the world.
  • 2018: Summit Supercomputer (Oak Ridge National Laboratory, USA). Designed for a wide variety of scientific applications, Summit debuted as the fastest supercomputer in the world.

It's clear from these examples that InfiniBand is what the industry tends to rely upon to power HPC, even though Ethernet can also be used.

Then something big happened: in 2020, NVIDIA completed its acquisition of Mellanox Technologies for $6.9 billion.

NVIDIA now owns the dominant InfiniBand vendor, and its Mellanox division inside NVIDIA continues to work on improving the technology.

So now we have a networking technology that connects the nodes inside supercomputers, enabling HPC at very high speeds.

Today, however, we have a different problem. When you have clusters of GPUs, they need to be able to pass data among themselves and sometimes even back to the CPU. This chip-to-chip data transfer within the same node is what we turn our attention to next.


NVLink


Ethernet and InfiniBand are not in the picture here.

Their role is just to deliver data to the right nodes in the system.

In AI factories, the data inside each system needs to be crunched. The nature of AI training demands frequent, high-speed iterations, which means that rapidly moving bits and bytes between chips (GPUs and CPUs) is crucial to the success of these AI data factories.

NVLink, by NVIDIA, is currently the leader in this space.

So what does NVLink do, anyway?

NVIDIA developed this high-speed interconnect to allow GPUs to communicate directly with one another, and with CPUs, without going through slower intermediaries. Combined with NVIDIA's NVSwitch technology, NVLink can connect up to 256 GPUs into a single scalable fabric, optimizing performance for complex computational tasks. The fourth-generation NVLink delivers up to 900 GB/s of bandwidth per GPU, making it a vital component of modern supercomputing environments.

Competing interconnects, such as PCIe and AMD's Infinity Fabric, are well behind in raw bandwidth.

This technology clearly gives NVIDIA the edge in high-performance computing (HPC), artificial intelligence, and deep learning applications.
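As a rough sketch of how software touches this layer (not NVIDIA's own sample code), the CUDA runtime's C API exposes GPU-to-GPU communication as "peer access"; when two GPUs are joined by NVLink, a peer copy like the one below travels over that link rather than over PCIe. The assumption here is a machine with at least two peer-capable GPUs visible as devices 0 and 1; compile with nvcc:

    #include <stdio.h>
    #include <cuda_runtime.h>

    int main(void) {
        int count = 0;
        cudaGetDeviceCount(&count);
        if (count < 2) { printf("need at least two GPUs\n"); return 0; }

        /* Check whether the two GPUs can address each other's memory directly. */
        int can01 = 0, can10 = 0;
        cudaDeviceCanAccessPeer(&can01, 0, 1);
        cudaDeviceCanAccessPeer(&can10, 1, 0);
        if (!can01 || !can10) { printf("peer access not supported\n"); return 0; }

        /* Enable peer access in both directions. */
        cudaSetDevice(0);
        cudaDeviceEnablePeerAccess(1, 0);
        cudaSetDevice(1);
        cudaDeviceEnablePeerAccess(0, 0);

        /* Allocate a buffer on each GPU and copy directly between them.
           If the GPUs are connected by NVLink, this copy rides on that fabric. */
        const size_t bytes = 1 << 20;   /* 1 MiB, an arbitrary size */
        void *src = NULL, *dst = NULL;
        cudaSetDevice(0);
        cudaMalloc(&src, bytes);
        cudaSetDevice(1);
        cudaMalloc(&dst, bytes);

        cudaMemcpyPeer(dst, 1, src, 0, bytes);
        cudaDeviceSynchronize();
        printf("copied %zu bytes from GPU 0 to GPU 1\n", bytes);

        cudaFree(src);   /* with unified addressing, cudaFree works from any device context */
        cudaFree(dst);
        return 0;
    }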


Networking in today's world is complex, spanning everything from the bits and bytes buzzing across fiber-optic undersea cables at high speed right down to the chips sitting on the same node.

Like the waiters in a restaurant, networking is the unsung hero who tirelessly works behind the scenes to provide us with the computing experience we are so accustomed to.
