Vertiv Holdings Co.


A roadmap for the future chip coolant temperature

The adoption of liquid cooling has increased dramatically in recent years due to the rapid growth of graphics processing unit/application-specific integrated circuit (GPU/ASIC) power consumption for AI/ML workloads. Adoption by cloud service providers (CSPs) and OEMs has also risen rapidly as CPU power consumption climbed steadily to an estimated 100W over a decade and then to over 400W in the last five years. GPU/ASIC power has increased far faster than CPU power, rising by hundreds of watts annually, and silicon vendors such as NVIDIA, AMD, and Intel have published GPU powers greater than 1kW. In addition, CSPs such as Microsoft, Google, Amazon, and Meta are designing their own silicon for AI/ML workloads, and their future silicon will require liquid cooling.

In collaboration with some members of the Open Compute Project (OCP), such as AMD, Intel, Meta, NVIDIA, Samsung, and Vertiv™, we published a white paper titled "30°C Coolant-A Durable Roadmap for the Future." The research provides business planning and efficiency insights on maintaining optimal temperatures for equipment longevity, supply chain stability, and environmental responsibility.

GPU architecture: The thermal stack in the modern silicon

Modern GPUs have undergone a dramatic shift in silicon package construction, using 2.5-dimensional (2.5D) multi-chiplet stacking (Chip-on-Wafer-on-Substrate, or CoWoS) to enhance computing performance. This approach combines a system-on-chip (SOC) with high bandwidth memory (HBM), allowing different types and numbers of chiplets to be combined into more performant GPU packages. Improvements in process technology and package assembly techniques, rising processing performance requirements, and the co-location of memory have largely driven the large power increases seen in GPUs.

By comparison, GPU package construction is far more complex than that of CPUs, and this complexity creates unique thermal challenges within the package, namely:

  • The different chiplets or components require different maximum junction temperatures.
  • The different chiplets have different stack heights.

For example, the SOC chiplet typically operates with a maximum junction temperature of 105°C, whereas early generations of HBM require a lower junction temperature of 85°C for single-refresh and 95°C for double-refresh operation. Later generations of HBM have raised these limits to 95°C and 105°C, respectively, to better match the SOC junction temperature requirements. Each HBM manufacturer is addressing this trend based on its own expertise.
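To make this concrete, a simple steady-state thermal-resistance model shows why the chiplet with the lowest junction temperature limit, rather than the package average, can dictate the coolant requirement. The short sketch below is illustrative only; the power and junction-to-coolant resistance values are assumptions for the example, not figures from the white paper.

```python
# A minimal sketch (not from the white paper): which chiplet sets the coolant
# temperature limit of a package under a steady-state thermal-resistance model.
# Power and junction-to-coolant resistance values are illustrative assumptions.

from dataclasses import dataclass

@dataclass
class Chiplet:
    name: str
    t_junction_max_c: float   # maximum junction temperature, degC
    power_w: float            # assumed heat dissipation, W
    r_jc_c_per_w: float       # assumed junction-to-coolant resistance, degC/W

    def max_coolant_temp_c(self) -> float:
        # Tj = T_coolant + P * R_th  ->  T_coolant_max = Tj_max - P * R_th
        return self.t_junction_max_c - self.power_w * self.r_jc_c_per_w

# Hypothetical package: one SOC chiplet plus an early-generation HBM stack
package = [
    Chiplet("SOC", t_junction_max_c=105.0, power_w=700.0, r_jc_c_per_w=0.08),
    Chiplet("HBM", t_junction_max_c=85.0,  power_w=60.0,  r_jc_c_per_w=0.70),
]

for chip in package:
    print(f"{chip.name}: coolant must stay below {chip.max_coolant_temp_c():.1f} degC")

# The tightest constraint wins. With these illustrative numbers the HBM, not the
# SOC, limits how warm the coolant can be, which is one reason later HBM
# generations raised their junction temperature limits.
print(f"Package limit: {min(c.max_coolant_temp_c() for c in package):.1f} degC")
```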

Durability from a manufacturer's and operator's perspective

It is in the best interest of silicon manufacturers to have a durable data center specification, so that coolant temperature requirements do not change with each silicon generation. Next-generation AI products have thermal designs that require significantly cooler fluid temperatures than some data center designers are currently planning for. By agreeing on a standard coolant temperature, silicon manufacturers can design their products with the assurance that they can be sufficiently cooled. A set temperature also gives silicon manufacturers a clear design direction.

Meanwhile, data center operators benefit from aligning on a durable coolant temperature for several reasons. While it takes many years to plan, design, build, and commission a data center, AI silicon is rapidly evolving, and the industry benefits from quick silicon iterations. But if the lowest coolant temperature needed to deploy AI silicon were to change rapidly, new data centers could be obsolete before they are even built. It is also relatively easy and efficient to operate a technology cooling system (TCS) loop at or above its design temperature, but very expensive and time-consuming to modify a TCS loop to operate below its existing design range once built. Defining the lower limit of a data center's TCS loop temperature is therefore critical to the long-term viability of, and investment in, the data center infrastructure.

Determining the right temperature for the modern data center

Data centers must be designed for a specific coolant operating temperature and flow rate. If the temperature requirement for a particular generation of IT hardware is higher than the design point, the data center operational set points can be adjusted to raise the coolant temperature, which offers an opportunity for improved efficiency. If the temperature requirement is lower than the data center design temperature, expensive and time-consuming modifications to the physical design are required; for example, lower temperatures may call for additional chiller capacity or a different type of chiller. Given the consequences of such a change, it is important to set a coolant temperature requirement that will not change through multiple generations of IT hardware.
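Coolant temperature and flow rate are tied together by a simple energy balance, Q = ṁ·c_p·ΔT. As a rough illustration of how the two interact, the sketch below estimates the coolant temperature rise across the IT load at the 1.5 lpm per kW flow allocation cited in Figure 2, assuming water-like coolant properties; a real TCS coolant such as a glycol mixture has a lower specific heat and would run slightly hotter at the same flow.

```python
# A minimal sketch of the energy balance Q = m_dot * c_p * dT for the TCS loop.
# Water-like coolant properties are assumed for illustration.

RHO_KG_PER_L = 0.997      # density of water near 25 degC, kg/L
CP_J_PER_KG_K = 4180.0    # specific heat of water, J/(kg*K)

def coolant_temp_rise_c(power_kw: float, flow_lpm: float) -> float:
    """Temperature rise of the coolant across the IT load, in degC."""
    m_dot_kg_s = flow_lpm / 60.0 * RHO_KG_PER_L
    return power_kw * 1000.0 / (m_dot_kg_s * CP_J_PER_KG_K)

# At the 1.5 lpm per kW allocation cited in Figure 2, a 30 degC supply returns
# roughly 10 degC warmer; halving the flow roughly doubles the rise.
for lpm_per_kw in (1.5, 1.0, 0.75):
    dt = coolant_temp_rise_c(power_kw=1.0, flow_lpm=lpm_per_kw)
    print(f"{lpm_per_kw:.2f} lpm/kW -> dT = {dt:.1f} degC, return = {30.0 + dt:.1f} degC")
```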

This is especially challenging for AI hardware, given the rapid changes in silicon power and thermal requirements. Figure 2 illustrates the GPU and CPU power trends and the associated technical fluid temperature requirements (Figure 6 in the white paper). AI hardware is driving the need for liquid cooling, and GPU power is increasing much faster than CPU power. The chart shows the associated fluid temperature requirements for GPUs over time: the temperature requirement trends asymptotically toward 30°C.

Figure 2. GPU and CPU power trends and associated technical fluid temperature requirements at 1.5 lpm per kW. Source: Open Compute Project®, "Coolant Temperatures for Next Generation IT and Durable Data Center Designs" presentation, OCP Regional Summit, April 2023

We agreed on 30°C as the minimum fluid temperature for hardware and data center design. The line thickness in the chart represents prediction uncertainty. Investments in advanced silicon packaging and liquid cooling performance are needed to maintain 30°C as a long-term interface specification. The choice of 30°C matches a common minimum air temperature specification for large-scale data center design; choosing the same value for air- and liquid-cooled IT hardware allows the industry to maintain the significant power usage effectiveness (PUE) improvements achieved over the past 15 years. The white paper examines the consequences of lowering the fluid temperature from 30°C to 20°C, including the reduction in annual free-cooling hours in zones classified as having hotter climates.

While liquid cooling within computing systems has existed for many decades, the demand for cooling higher-density workloads is driving liquid cooling from boutique to hyperscale. Because AI silicon evolves much faster than data centers can be planned and built, data center operators and silicon providers need to agree on a durable coolant temperature that supports the silicon.

A minimum coolant temperature of 30°C for the TCS loop gives both parties a common operating point. In a simplified view of a data center's cooling architecture, a liquid-cooling system comprises at least two cooling loops, from the TCS to the facility water system (FWS), as shown in Figure 3 (Figure 8 in the white paper).

Figure 3. Cooling system architecture and components
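Each loop in Figure 3 adds an approach temperature between the outdoor environment and the chips, and that stack of approaches is what links the TCS setpoint to free-cooling hours. The sketch below walks through the temperature stack using assumed approach values for the cooling tower and the coolant distribution unit (CDU) heat exchanger; these specific numbers are illustrative and do not come from the white paper.

```python
# A minimal sketch of the temperature stack between outdoor air and the TCS
# loop in Figure 3. Approach temperatures are illustrative assumptions only.

COOLING_TOWER_APPROACH_C = 4.0   # outdoor wet-bulb to FWS supply (assumed)
CDU_HX_APPROACH_C = 3.0          # FWS supply to TCS supply across the CDU (assumed)

def max_wet_bulb_for_free_cooling_c(tcs_supply_c: float) -> float:
    """Warmest outdoor wet-bulb temperature at which the FWS can still feed
    the TCS loop at tcs_supply_c without mechanical (chiller) cooling."""
    return tcs_supply_c - CDU_HX_APPROACH_C - COOLING_TOWER_APPROACH_C

for tcs_supply_c in (30.0, 20.0):
    wb = max_wet_bulb_for_free_cooling_c(tcs_supply_c)
    print(f"TCS supply {tcs_supply_c:.0f} degC -> free cooling up to ~{wb:.0f} degC wet-bulb")

# Dropping the TCS setpoint from 30 degC to 20 degC lowers the usable wet-bulb
# ceiling by 10 degC, which is why free-cooling hours shrink fastest in hot climates.
```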

Cooling AI with confidence

The TCS 30°C coolant temperature limit is not intended to discourage the development of a wide range of future solutions. While there is significant value in aligning on a coolant temperature of 30°C or greater, there is no desire to limit the types of technology used to deliver cooling capability to the silicon. Freedom to innovate will be needed across many different cooling technologies, including immersion, cold plates, thermal interface materials, and many others, to provide ideal solutions across the industry. There will also be demand for silicon solutions across a range of coolant temperatures, including some opportunities below the 30°C target. However, the goal of this initiative is to target silicon with the best price-per-performance ratio at or above the 30°C coolant limit.

Researchers across the industry continue to find solutions and enhancements for critical digital infrastructure, including technologies that can further business and environmental responsibility efforts into the future. For instance, OCP has been instrumental in promoting cooling solutions that enhance and enable data center efficiency and reliability, and will continue to do so. For the technical insights and business details on maintaining optimal temperatures in the modern data center, and to identify paths for new technology investments toward increased performance in the future, download the OCP's latest white paper, "30°C Coolant-A Durable Roadmap for the Future." To learn about the deployments and implementations that facilitate this, visit Vertiv.com.