4 Critical Insights for Ordering Data Center Liquid Cooling Systems
1. Specific Hardware: Understanding Chip Temperature and Facility Water Temperature
Requesting a quote for cooling a specific set of hardware is essential to optimizing your total cost of ownership (TCO). Lately, we've seen liquid cooling quote requests that vaguely specify kW per rack or ask for a standalone CDU quote. Without details of the servers and CPUs/GPUs involved, the amount of power a CDU can cool varies significantly. Standards work at ASHRAE and OCP will eventually simplify this process, but those guidelines are still taking shape. Request a quote tailored to your identified hardware to avoid inaccurate estimates. Keep in mind that cooling CPUs and GPUs below their maximum temperature prolongs their life, improves efficiency, and may enhance performance. Bargain shopping for a CDU, fluid connectors, manifolds, and cold plates, then connecting them yourself won't necessarily minimize your total cost of ownership. We recommend proceeding cautiously because of the complications that can arise from these makeshift systems.
2. Cooling and Power: Similar Terms, Different Meanings
Providing an accurate quote in response to incomplete RFPs is challenging as the required cooling water flow rate varies according to maximum chip temperature, chip power, and facility water temperature. This is akin to requesting a 10 kW PDU without specifying the voltage.
Technical Explanation: In liquid cooling, the flow rate is analogous to the current in power calculations, while the delta T between the CPU/GPU's maximum case temperature and the facility water temperature is analogous to the voltage. CDUs, comprising pumps and heat exchangers, can cool a certain number of servers at a given flow rate. A higher delta T (temperature rise at the servers) allows more servers to be cooled, just as a PDU supplying a certain current can power more servers at a higher voltage. The heat exchanger provides a certain amount of heat transfer while reducing the available delta T, much like an electrical cable run in which a higher current causes a larger voltage drop, limiting the voltage available to the servers.
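To make the analogy concrete, here is a minimal sketch (illustrative only, not the Chilldyne calculator) that treats coolant flow like current and the temperature difference like voltage. The water properties and example numbers are assumptions for illustration:

```python
# Illustrative sketch of the cooling/power analogy (not the Chilldyne calculator).
# Water properties are approximate; all inputs are example values.

WATER_DENSITY_KG_PER_L = 0.997   # at roughly 25 C
WATER_CP_J_PER_KG_K = 4180       # specific heat of water

def required_flow_lpm(chip_power_w: float, delta_t_c: float) -> float:
    """Coolant flow (L/min) needed to absorb chip_power_w with a delta_t_c rise.
    Analogous to I = P / V: power divided by the 'driving' temperature difference."""
    kg_per_s = chip_power_w / (WATER_CP_J_PER_KG_K * delta_t_c)
    return kg_per_s / WATER_DENSITY_KG_PER_L * 60

def required_current_a(load_power_w: float, voltage_v: float) -> float:
    """Electrical analogue: current needed to deliver load_power_w at voltage_v."""
    return load_power_w / voltage_v

# A 700 W chip with a 10 C coolant temperature rise vs. a 700 W load at 208 V:
print(f"Flow:    {required_flow_lpm(700, 10):.2f} L/min")
print(f"Current: {required_current_a(700, 208):.2f} A")
```

With these example numbers, a 700 W chip and a 10 C coolant rise work out to roughly 1 L/min, which is why per-server flow budgets are typically quoted in litres per minute.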
3. Practical Redundancy: Address Real-world Failure Modes
Redundancy requirements are essential, but they should tackle real-world problems. For example, including two centrifugal pumps in a CDU may not significantly increase uptime because these pumps are reliable and last five years or more. More frequent issues like leaks, contamination, and corrosion need attention. Our systems offer N+1 CDU redundancy to maximize data center uptime, considering the significant financial implications of downtime in a multi-million-dollar data center.
4. How to Analyze Your System Requirements
Determining the liquid cooling requirements of a system hinges on two key factors: 1) power consumption and 2) maximum case temperature. Higher power consumption and lower case temperature limits demand a greater volume of coolant per server. Until recently, the upper limit for case temperatures has typically been 80-95 C; however, some current models require case temperatures as low as 57 C. This shift is like different servers requiring different voltages.
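As a rough illustration of why the case-temperature limit matters so much, here is a minimal sketch (assuming 30 C facility water and ignoring cold-plate and fluid thermal resistances) in which flow per watt scales inversely with the temperature headroom between the case limit and the water:

```python
# Rough comparison of coolant demand per watt at two case-temperature limits,
# assuming 30 C facility water and ignoring cold-plate thermal resistance.
FACILITY_WATER_C = 30

def relative_flow_per_watt(case_temp_c: float) -> float:
    """Flow per watt scales inversely with the temperature headroom above the water."""
    return 1.0 / (case_temp_c - FACILITY_WATER_C)

legacy = relative_flow_per_watt(85)   # typical 80-95 C class limit
modern = relative_flow_per_watt(57)   # low-case-temperature parts
print(f"A 57 C chip needs about {modern / legacy:.1f}x the flow per watt of an 85 C chip.")
```

Under these simplified assumptions, dropping the case limit from 85 C to 57 C roughly doubles the coolant flow needed per watt.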
Detailed Examination of Specific Systems
To illustrate this, let's consider a scenario in which the liquid cooling system primarily absorbs heat from the CPUs/GPUs, with the cooling water maintained at 30 C. The rack power stated here pertains solely to CPU/GPU power; the overall power will be 20-50% higher. Assume we are assembling a cluster of Nvidia H100 GPUs, each drawing 700 W and featuring a maximum case temperature of 85 C. Using the Chilldyne calculator available at www.Chilldyne.com/calculator, we can determine the number of GPUs we can cool efficiently. Inputting the following specifications:
· Facility water temperature: 30 C
· Node maximum power: 700 W
· Target node temperature: 85 C
yields a cooling capacity of 572 GPUs. (A margin of error should be incorporated in real-world scenarios.) These GPUs can be housed in Dell XE9680 6U servers, each capable of handling 6 kW. With seven servers (56 GPUs) per rack, this gives us the cooling capacity for ten racks drawing 42 kW each, totaling 420 kW. (This example assumes that the servers' CPUs are under low load and contribute minimally to the overall power consumption.)
If the facility water temperature is increased to 40 C, the cooling capacity drops to 334 GPUs, restricting us to six racks (250 kW). A 10 C increase in the facility water temperature reduces rack cooling capacity by roughly 40%.
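For readers who want to reproduce the rack arithmetic, here is a minimal sketch that takes the calculator outputs quoted above (572 GPUs at 30 C water, 334 at 40 C) as given inputs; the per-server and per-rack figures are the example's assumptions (8 GPUs per XE9680, 7 servers per rack, 6 kW per server), and small rounding differences from the text are expected:

```python
# Rack arithmetic for the H100 example, using the calculator outputs quoted in
# the text (572 GPUs coolable at 30 C water, 334 at 40 C) as given inputs.
GPUS_PER_SERVER = 8     # Dell XE9680
SERVERS_PER_RACK = 7
KW_PER_SERVER = 6       # CPU/GPU power only, per the example

def rack_summary(coolable_gpus: int):
    """Return (racks supported, approximate CPU/GPU power in kW)."""
    gpus_per_rack = GPUS_PER_SERVER * SERVERS_PER_RACK   # 56 GPUs per rack
    racks = round(coolable_gpus / gpus_per_rack)
    return racks, racks * SERVERS_PER_RACK * KW_PER_SERVER

for water_c, gpus in [(30, 572), (40, 334)]:
    racks, kw = rack_summary(gpus)
    print(f"{water_c} C facility water: ~{racks} racks, ~{kw} kW")
```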
Alternate Scenario: Intel 8470Q Sapphire Rapids (Xeon 4th Gen)
Consider a situation where we are using 30 C water to cool Intel 8470Q Sapphire Rapids (Xeon 4th Gen) servers like the Dell C6620, equipped with eight CPUs per 2U server. Each CPU consumes 350 W and is rated for a maximum case temperature of 57 C. The smaller temperature differential between the CPU and the coolant necessitates a substantially higher flow rate per watt. Employing the Chilldyne calculator once again, we find we can cool 477 CPUs. This equates to cooling 60 2U servers, or three racks, each drawing 59 kW (180 kW total). However, if the water temperature increases to 40 C, we might not have an adequate cooling margin.
Alternate Scenario: Supermicro AS-1125HS-TNR AMD Genoa
Lastly, let's evaluate the cooling capability for Supermicro AS-1125HS-TNR AMD Genoa servers. This Genoa configuration has six memory channels and does not support a half-width form factor. Each 1U server accommodates two 400 W CPU sockets, rated for a maximum case temperature of 70 C. According to the calculator, we can cool 877 CPUs at 17 kW per rack, totaling 42 racks (714 kW). However, if the facility water temperature increases to 40 C, we are limited to cooling 459 CPUs across 22 racks (374 kW).
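Collecting the three examples in one place makes the spread in per-rack and total load easy to see. The sketch below simply restates the figures quoted above (CPU/GPU power only, 30 C facility water):

```python
# The three example systems side by side, using the figures quoted above
# (CPU/GPU power only, 30 C facility water).
scenarios = [
    # (system,                                   kW/rack, racks, total kW)
    ("Nvidia H100 / Dell XE9680",                     42,    10,      420),
    ("Intel 8470Q Sapphire Rapids / Dell C6620",      59,     3,      180),
    ("AMD Genoa / Supermicro AS-1125HS-TNR",          17,    42,      714),
]

for name, kw_rack, racks, total_kw in scenarios:
    print(f"{name:<45} {kw_rack:>3} kW/rack x {racks:>2} racks = {total_kw:>3} kW")
```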
Therefore, the power per rack and the total cooling load can vary widely, ranging from 180 kW to 714 kW in these examples. To design a future-proof liquid-cooled data center, ensure an ample supply of facility cooling water, and add or upgrade CDUs as needed. Direct-to-chip cooling systems offer flexibility; you can switch from one brand to another during a server refresh with relative ease. In some cases, if you install a positive pressure system and it leaks, you can retrofit it with Chilldyne negative pressure CDUs while keeping the existing server hardware.
Here are some requests we have received and our reasons for caution:
· We've been asked to design a liquid cooling system for 3,500 watts per node when the current server uses 500 W. This leads to unnecessary expense and inefficiency. It's wiser to design for present requirements and scale up only if necessary.
· System must be designed for 40+ psi delta P. This exceeds typical requirements and increases both cost and the chance of failure; Chilldyne's systems have demonstrated efficient operation at less than 6 psi delta P over years of service.
· 90% heat capture. Aiming for a 90% heat capture ratio might seem like a worthwhile efficiency goal, but it can lead to unnecessary costs. It's often more cost-effective to target a lower heat capture ratio, like the 80% we achieve with Sandia's Manzano system, and air-cool the remaining percentage, as the sketch below illustrates.
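As a quick sense check on the heat-capture point, here is a sketch using the 42 kW CPU/GPU rack from the H100 example purely as an illustrative load. The difference in residual heat left to air cooling at 80% versus 90% capture is only a few kW per rack, which is often cheaper to remove with air than with additional cold plates and plumbing:

```python
# Residual air-cooling load at two heat-capture ratios, using the 42 kW rack
# from the H100 example purely as an illustrative load.
RACK_KW = 42

for capture in (0.90, 0.80):
    to_air_kw = RACK_KW * (1 - capture)
    print(f"{capture:.0%} liquid capture -> {to_air_kw:.1f} kW per rack handled by air")
```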