Networking among virtual machines in Microsoft Azure is going to get a whole lot faster thanks to some new hardware that Microsoft has rolled out across its fleet of data centers.
The company announced Monday that it has deployed hundreds of thousands of FPGAs (field-programmable gate arrays) across servers in 15 countries on five continents. The chips have been put to use in a variety of first-party Microsoft services, and they’re now starting to accelerate networking on the company’s Azure cloud platform.
In addition to improving networking speeds, the FPGAs (which sit on custom, Microsoft-designed boards connected to Azure servers) can also be used to speed up machine-learning tasks and other key cloud functionality. Microsoft hasn’t said exactly what else is on the boards, beyond revealing that they hold an FPGA, static RAM chips and hardened digital signal processors.
Microsoft’s deployment of the programmable hardware matters because the once-reliable generational increases in CPU speed continue to slow. FPGAs can provide an extra boost in processing power for the particular tasks they’ve been configured to handle, cutting the time it takes to do things like manage the flow of network traffic or translate text.
With Microsoft trying to squeeze every ounce of performance out of the hardware and data-center footprint it already has as it competes with the other big players in the cloud market, this hardware could give the company an edge.
Accelerated Networking, a new feature that entered beta on Monday, is one example of what the FPGA deployment makes possible. When it’s enabled on both ends of a connection between two VMs, users get speeds as high as 25Gbps and latency between 25 and 50 microseconds, at no extra charge.
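For a sense of scale, latency in that range is small enough that ordinary tools struggle to measure it. The sketch below shows one way to estimate VM-to-VM latency with a simple UDP round-trip probe in Python; the peer address and port are hypothetical placeholders, and at these speeds the interpreter and socket overhead would dominate, so a purpose-built benchmark is what you’d actually reach for.

```python
# Minimal UDP round-trip latency probe between two VMs (a sketch only).
# The peer address is a hypothetical placeholder, and at 25-50 microseconds
# Python's own overhead dominates; dedicated tools are needed for real numbers.
import socket
import time

PEER = ("10.0.0.5", 9000)   # hypothetical private IP of the second VM
TRIALS = 1000

# The other VM runs a trivial echo loop on the same port:
#   data, addr = sock.recvfrom(64); sock.sendto(data, addr)

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.settimeout(1.0)

samples = []
for _ in range(TRIALS):
    start = time.perf_counter()
    sock.sendto(b"ping", PEER)
    sock.recvfrom(64)
    samples.append((time.perf_counter() - start) / 2)  # halve RTT for one-way

samples.sort()
print(f"median one-way latency: {samples[len(samples) // 2] * 1e6:.1f} microseconds")
```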
The Accelerated Networking announcement comes just a week after Oracle unveiled its second-generation infrastructure-as-a-service offering at OpenWorld, which also features off-server, software-defined networking to drive improved performance.
Azure CTO Mark Russinovich said the FPGAs were key to letting Azure take full advantage of the networking hardware it put into its data centers. While that hardware could support 40Gbps speeds, actually moving all that network traffic, while applying the software-defined networking rules attached to it, took a massive amount of CPU power.
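A rough back-of-envelope calculation shows why. The packet sizes and core clock below are illustrative assumptions rather than Microsoft’s figures, but they make the per-packet CPU budget at 40Gbps concrete:

```python
# Back-of-envelope: per-packet CPU budget at 40Gbps line rate.
# Packet sizes and clock speed are illustrative assumptions, not Microsoft figures.
LINK_BPS = 40e9    # 40Gbps link
CORE_HZ = 2.4e9    # one server core at an assumed 2.4GHz

for packet_bytes in (1500, 64):
    packets_per_sec = LINK_BPS / (packet_bytes * 8)
    cycles_per_packet = CORE_HZ / packets_per_sec
    print(f"{packet_bytes}B packets: {packets_per_sec / 1e6:6.1f} Mpps, "
          f"~{cycles_per_packet:5.0f} cycles/packet per core")
```

Under those assumptions, a single core gets only about 700 cycles per full-size packet to parse headers, match SDN rules and rewrite the packet, and with minimum-size packets the budget drops to a few dozen cycles.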
“That’s just not economically viable,” he said in an interview. “Why take those CPUs away from what we can sell to customers in virtual machines, when we could potentially have that off-loaded into FPGA? They could serve that purpose as well as future purposes, and get us familiar with FPGAs in our data center. It became a pretty clear win for us.”
The project is the brainchild of Doug Burger, a distinguished engineer in Microsoft Research’s New Experiences and Technologies (NExT) group. Burger started the FPGA project, codenamed Catapult, in 2010. The team first started working with Bing, and then expanded to Azure. That work led to the second, current design of Microsoft’s FPGA hardware layout.
One FPGA card goes into every new Azure server, connected to its NIC (network interface card), the PCIe bus and the top-of-rack network switch. Because each FPGA sits directly on the network, it can talk to other FPGAs, and Microsoft can pool many of them across its data centers for big jobs, with low latency. That’s especially important for massive machine-learning applications.
“If we want to allocate 1,000 FPGAs to a single [deep neural net] we can,” Burger said. “We get that kind of scale.”
That scale can provide massive amounts of computing power. If Microsoft used Azure’s entire FPGA deployment to translate the English-language Wikipedia, it would take only a tenth of a second, Burger said on stage at Ignite.
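The article doesn’t give the figures behind that claim, but the arithmetic is easy to reconstruct. In the sketch below, the corpus size and per-word compute cost are placeholder assumptions, not numbers from Microsoft; they’re only there to show what a tenth of a second implies:

```python
# Implied throughput behind the "English Wikipedia in 0.1s" claim.
# WORDS and OPS_PER_WORD are placeholder assumptions, not Microsoft figures.
WORDS = 3e9          # assume roughly 3 billion words of English Wikipedia text
SECONDS = 0.1        # the claimed translation time
OPS_PER_WORD = 1e7   # assume ~10 million ops per word for a neural translator

words_per_sec = WORDS / SECONDS
print(f"{words_per_sec:.1e} words translated per second")
print(f"{words_per_sec * OPS_PER_WORD:.1e} ops/sec implied across the FPGA fleet")
```

With those placeholder numbers, the claim works out to tens of billions of words and hundreds of petaops per second, the kind of aggregate throughput only a fleet-wide deployment can offer.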
Microsoft isn’t the only company turning to custom silicon for this sort of work. Google unveiled its Tensor Processing Unit earlier this year, which is supposed to accelerate some machine-learning tasks in its cloud. The TPU is an application-specific integrated circuit (ASIC), a chip purpose-built for a particular workload.
Google used ASICs instead of FPGAs because of speed and efficiency. So, why did Microsoft choose FPGAs?
The industry moves far too quickly for him to be confident that a particular ASIC will still do what’s needed over time, Burger said. While using only the reprogrammable logic in an FPGA wouldn’t be great for performance, the hardened SRAM and DSP chips on the FPGA board can speed up certain applications, shrinking the performance gap with ASICs.
“I’m not comfortable locking the control path down for three years and saying, ‘I know what to do now,’” Burger said.
Right now, Accelerated Networking is only available for DS15v2 instances in Azure’s West Central US and West Europe regions. It’s only compatible with Windows Server 2012 R2 and Windows Server 2016 Technical Preview 5, though Microsoft plans to make it work with Linux instances soon.
In the future, the Accelerated Networking service will expand to Azure’s other virtual machine types and operating systems. It will go from being an opt-in enhancement to being a free, opt-out benefit, raising the networking speeds customers get by default.
Looking to the future, Microsoft has said that the FPGAs will be put to use in machine-learning applications. Burger said the company has set up code for its Cognitive Services to run in an FPGA-accelerated mode, so those may be next.
“This is going to be a journey for how we expose this capability to customers,” Russinovich said. “I think the first thing that we’re talking about doing is [deep learning] where we train the models and then let customers run them on CPUs or GPUs in our data center. In the future, they’ll be able to run the scoring on FPGAs, and potentially also train models themselves if they want to on the FPGAs. But we’re a ways off.”
For Burger, one of the biggest questions will be the right mix of FPGAs and CPUs inside an Azure data center. Even though Microsoft has hundreds of thousands of FPGAs already deployed, they aren’t enough to meet the company’s needs as more teams start using them.
“The CPUs are important and will continue to be important for all the software and all these products and services we have,” he said. “But I think for applications, the big breakthrough at scale is going to come from non-CPU technologies.”
This story has been corrected to reflect accurate latency times for Accelerated Networking.