Convergence of HPC + AI Use-Case Strategies to Gain Hardware Infrastructure Efficiency

By Jean S. Bozman and Srini Chari, Ph.D.

Two very different workloads, HPC and AI, are converging on the same types of underlying hardware infrastructure – leading some customers in enterprises, the public sector, and academia to build use-case scenarios that gain efficiency by leveraging both types of IT approaches together.

The key here is that HPC and AI workloads both run on rack-integrated hardware infrastructure – and customers may benefit when both workloads share that infrastructure.

For example, HPC systems support the high-volume data flows needed to feed scalable AI training models, such as large language models (LLMs), as well as traditional simulation algorithms. In another example, AI software supports efficient high-speed operations in semiconductor manufacturing fabs, such as tracking the placement of materials throughout the factory.

These converged HPC and AI use-cases are rapidly emerging across every industry, as was clear in the keynotes and breakout sessions at Supercomputing 2024 (SC24) in Atlanta.

We expect that customers will take advantage of both styles of computing (AI and HPC) as they update their data-center hardware infrastructure for faster processing. To do so, they must overcome whatever organizational barriers exist before adopting a converged HPC + AI strategy for their rack-based hardware infrastructure. In some cases, they may ask their cloud providers to use blended HPC and AI infrastructure to gain efficiencies and reduce duplication in their HPC and AI data-storage resources. Here's why:

  • Diversity of compute engines: We’re seeing a mix of many GPUs and CPUs in the same overall hardware infrastructure. This includes CPUs from AMD, Intel, and Arm; GPUs from NVIDIA and AMD; chiplets from multiple vendors; and solid-state storage devices built with 2D and 3D memory technologies.
  • Density of populating racks: Provided that there is adequate cooling – whether air-based or liquid-based – the density of CPUs and GPUs housed in a single rack is going up. We have seen 72 to 128 GPUs packed into a single frame.
  • Fast interconnects: Interconnects are evolving to include high-speed Ethernet, optical interconnects, and fast CXL links for chiplets and CPUs placed on the same system board. Without fast interconnects, a converged deployment supporting both HPC and AI workloads would quickly become overloaded – and throughput would visibly slow down.
  • Better fabrics: New types of data fabrics can reduce end-to-end processing time for both HPC and AI workloads. This will require new-and-improved fabrics, interconnects, and on-board accelerators; new or updated software; and the ability to leverage on-prem and off-prem cloud-based resources for data transfers.

Visions of a Converged HPC + AI Hardware Infrastructure

At SC24, we noticed the session called "How the Convergence of HPC and AI is Accelerating Innovation." Speakers from AMD, Ansys, Microsoft Azure, and Purdue University shared their use-case scenarios for combined HPC + AI approaches.

Panelists said they are looking at a range of use-cases in which AI assists with the large data flows generated by HPC applications. They noted that rack-integrated infrastructure builds compute clusters – including CPUs, GPUs, and TPUs – that can be connected to distributed storage clusters in the same, or adjoining, racks.

A Range of Use-Cases for HPC + AI

Customers are evaluating a range of use-cases in which AI assists with the large data flows generated by HPC applications. But these use-cases must be selected carefully, because HPC and AI systems have evolved separately over time.

For example, HPC systems at large research labs have used highly parallel computing to sift through the extremely large data flows generated by scientific research, such as simulating rocket launches and “folding” large proteins. HPC systems are very good at moving through repetitive tasks quickly, relying on checkpointing of intermediate results to preserve HPC data if these fast-moving systems stop – even momentarily.
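To make that checkpoint-and-resume pattern concrete, here is a minimal Python sketch. The file name, step counts, and stand-in “simulation” arithmetic are hypothetical illustrations, not any particular HPC scheduler’s mechanism:

    import os
    import pickle

    CHECKPOINT = "sim_state.pkl"  # hypothetical checkpoint file

    def load_state():
        """Resume from the last checkpoint if one exists; else start fresh."""
        if os.path.exists(CHECKPOINT):
            with open(CHECKPOINT, "rb") as f:
                return pickle.load(f)
        return {"step": 0, "value": 0.0}

    def save_state(state):
        """Write to a temp file, then rename, so a crash mid-write
        cannot corrupt the last good checkpoint."""
        tmp = CHECKPOINT + ".tmp"
        with open(tmp, "wb") as f:
            pickle.dump(state, f)
        os.replace(tmp, CHECKPOINT)  # atomic rename

    state = load_state()
    for step in range(state["step"], 1_000_000):
        state["value"] += 1e-6            # stand-in for one simulation timestep
        state["step"] = step + 1
        if state["step"] % 10_000 == 0:   # checkpoint every 10,000 steps
            save_state(state)

If the job is killed and restarted, it resumes from the last saved step instead of recomputing from zero – the same idea, at toy scale, that lets long-running HPC jobs survive interruptions.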

But the software stacks for HPC and AI have developed separately, over many years, to serve different customer bases. As discussed at SC24, much thought is being given to blending those computing, storage, and networking technologies to achieve “scale” for AI workloads and parallelization for repetitive tasks such as data de-duplication. This is where customers’ experience can inform which future data-center implementations an HPC + AI convergence strategy would help most.

In AI, building large data resources to train scalable AI models (LLMs) demands extremely large “scale-up” resources and high-speed interconnects. By contrast, AI “inferencing” workloads often leverage hundreds or thousands of “scale-out” edge devices to feed the fast-growing data models with newly generated data. These deployments may use highly distributed scale-out networks to optimize data flows. Customers speaking at SC24 said that AI’s support for both scale-up and scale-out hardware infrastructure could benefit HPC research projects and HPC platforms used in business and academic environments.
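As a simple illustration of the scale-out pattern, the Python sketch below fans independent inference requests across a pool of edge endpoints. The endpoint names and the run_inference stub are hypothetical placeholders, not a real inference API:

    from concurrent.futures import ThreadPoolExecutor

    # Hypothetical edge endpoints; in practice, device or service addresses.
    EDGE_NODES = [f"edge-{i}.example.internal" for i in range(8)]

    def run_inference(node, payload):
        """Stub for a remote inference call to one edge device."""
        return {"node": node, "score": len(payload) % 7}  # stand-in result

    payloads = [f"sensor-reading-{i}" for i in range(100)]

    # Scale-out: spread many independent requests across many small nodes,
    # rather than scaling up one large machine.
    with ThreadPoolExecutor(max_workers=len(EDGE_NODES)) as pool:
        futures = [
            pool.submit(run_inference, EDGE_NODES[i % len(EDGE_NODES)], p)
            for i, p in enumerate(payloads)
        ]
        results = [f.result() for f in futures]

Because each request is independent, capacity grows by adding nodes – the opposite of the tightly coupled, scale-up clusters used for training.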

Convergence of HPC and AI to Optimize Use Cases 

As speakers in the “Convergence” panel discussed, HPC and AI approaches can be put together to optimize many types of use-cases that call for high-speed analysis of rapidly changing data. Examples in the panel discussion included:

  • Simulating current road conditions for self-driving (autonomous) vehicles. Rapid updates are key to avoiding accidents involving pedestrians and bicyclists.
  • Optimizing semiconductor manufacturing to sort out the use of materials during fabrication processes in real time. The exact placement of those materials often changes, making fast updates about their location very important.
  • “Feeding” data to an AI training model (LLM), including de-duplication of repeated data inputs and “data-cleaning” for the sake of data integrity and reliability (see the sketch after this list).
  • Developing services that apply AI to run HPC systems more effectively. One way is to let AI make recommendations about outfitting the hardware infrastructure with power and cooling capabilities designed to prevent overheating caused by densely packed GPUs.
  • Putting AI to work on predictive maintenance for workloads, using the new – and popular – open-source GenAI models to speed application development.
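On the de-duplication point above, here is a minimal Python sketch of hash-based exact de-duplication with light “data-cleaning.” The sample records are made up, and production LLM-data pipelines typically add fuzzy matching (for example, MinHash), which is not shown:

    import hashlib

    def clean(record: str) -> str:
        """Minimal "data-cleaning": collapse whitespace and normalize case."""
        return " ".join(record.split()).lower()

    def dedupe(records):
        """Keep the first copy of each record, comparing hashes of cleaned text."""
        seen = set()
        for record in records:
            digest = hashlib.sha256(clean(record).encode("utf-8")).hexdigest()
            if digest not in seen:
                seen.add(digest)
                yield record

    raw = ["The cat sat.", "the  cat sat.", "A new sentence."]
    unique = list(dedupe(raw))  # -> ["The cat sat.", "A new sentence."]

Hashing the cleaned text keeps memory use proportional to the number of unique records, which matters as training inputs grow toward the petabyte scale discussed below.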

Acceleration and Scalability

We note here that blending HPC and AI was a theme for other sessions at the SC24 conference. And, in a theCUBE interview that took place at SC24, Jason Schroedl of NVIDIA spoke about using a blended infrastructure to bring processing advantages to scale-up workloads.

Many SC24 sessions looked at AI’s capabilities, such as GenAI software development; AI-assisted troubleshooting of IT issues; and AI-based approaches to data de-duplication, data quality, and compliance. Among these SC24 sessions were: "High-Performance and Smart Networking Technologies for HPC and AI"; "Optimizing HPC for AI: Essential Architectural Insights"; "New Optical-Based Scale-Up Fabrics to Meet the Performance and TCO Requirements of Next-Generation AI/HPC Architectures"; and "From AI to HPC: Bridging Gaps in Domain-Specific Compilation."

Leveraging the Hardware Infrastructure

Why is there a focus on hardware efficiencies right now?

Rack-level integration is becoming commonplace in enterprise and HPC data centers. The compute, storage, and network pieces of that overall rack-level infrastructure can be connected more easily than was possible several decades ago.

Multiple vendors, including Dell Technologies, HPE, Lenovo, Supermicro, and others, are showing how rack-level integration works at shows like SC24 in Atlanta, the OCP Global Summit in San Jose, and the AI Hardware Summit in the San Francisco Bay Area.

However, we note here that the software stacks supporting HPC and AI have important differences, and these must be the focus of an HPC + AI convergence strategy.

In HPC, the emphasis is on scaling up to support large data resources – in the petabyte (PB) range and beyond – and on high levels of parallelization of compute tasks to support those HPC workloads. In AI, scaling up is a pressing need when working with ever-larger LLMs and the data to “feed” them. But other parts of the technology stack – software, networking, and optical interconnects – are morphing to support specific AI tasks.

Reducing Total Cost of Ownership (TCO), Improving ROI

How would a converged infrastructure strategy help enterprises? Given historical differences in HPC and AI applications, there is a certain amount of hardware duplication for CPUs, GPUs, TPUs, NPUs, storage, networking, and interconnects.

Now, customers have an opportunity to update their data centers with new and improved power and cooling capabilities for all workloads, and to assign workloads across this highly scalable hardware environment. This supports better utilization of data-center hardware – avoiding times when CPUs and GPUs sit idle or below capacity. That, in turn, is driven by the expectation that consolidating workloads running on data-center racks will improve the customer’s TCO (total cost of ownership) and ROI (return on investment).
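As a back-of-envelope illustration – with made-up numbers, not measured data – the short calculation below shows how consolidating two siloed fleets onto shared racks could roughly halve the node count needed to deliver the same work:

    # Illustrative utilization math; all figures are hypothetical.
    nodes = 100                 # separate HPC and AI fleets, combined
    avg_utilization = 0.35      # typical when fleets sit in silos
    target_utilization = 0.70   # plausible after consolidating workloads

    work_delivered = nodes * avg_utilization            # 35 node-equivalents
    nodes_needed = work_delivered / target_utilization  # ~50 nodes

    print(f"Nodes required after consolidation: {nodes_needed:.0f}")

Fewer nodes doing the same work translates directly into lower capital, power, and cooling costs – the TCO argument in miniature.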

In our view, there is an analogy here: We saw this consolidation-and-efficiency strategy play out in the virtualization evolution of the early 2000s, when multiple applications were consolidated onto multi-core servers for better, faster processing – reducing TCO and improving ROI.

However, that converged hardware infrastructure won’t work smoothly unless the software stack is updated and optimized to make workload assignment and data transfers efficient. Just as importantly, customers must ensure that, in a converged environment, applications and data remain isolated and protected, for security and compliance reasons. That is why sophisticated software must be developed to keep data centers running smoothly and efficiently. This should be a customer priority for mission-critical applications running on a blended HPC + AI infrastructure.

Caution: Plan Carefully and Thoroughly for HPC + AI Infrastructure Use

Customers planning to move to a more efficient hardware infrastructure in their data centers must take note:

This is the time when careful planning and caution must be top priorities for HPC + AI convergence. The benefits are clear: better utilization of hardware resources; AI-assisted efficiency initiatives in the data center; and reduction of incorrect or duplicated data that slows processing and produces inaccurate analyses.

For customers, the payoffs could be big: greater efficiency in IT systems; accelerated HPC processing; faster, more accurate data analyses for enterprises; and better support of customers’ HPC scientific research and AI business analysis.

As we’ve seen in previous eras of IT infrastructure convergence and efficiency, customers will benefit most by sharing their lived experience with scale-up and scale-out use-cases leveraging HPC and AI – helping them avoid costly mistakes in real-life HPC + AI convergence implementations.
