Hewlett Packard Enterprise has a long history in supercomputing, with efforts like its Apollo family of systems aimed at data-intensive workloads like HPC, data analytics, and storage, and its $275 million acquisition of SGI in 2016 to expand its presence in HPC.
But it was HPE’s $1.3 billion deal three years later to buy supercomputing pioneer Cray that propelled it past competitors like IBM and into its current dominant position. And it was HPE’s much deeper pockets and global reach that gave Cray the financial staying power it needed to get there.
With Cray in hand, HPE has now built the world’s three fastest supercomputers, all of them exascale systems based on the Cray EX architecture and now humming along at three US Department of Energy (DOE) laboratories – “El Capitan” at Lawrence Livermore National Laboratory in California, “Frontier” at Oak Ridge National Laboratory in Tennessee, and “Aurora” at the Argonne Leadership Computing Facility in Illinois.
As with every part of the IT industry, supercomputing is undergoing rapid changes in needs and demands with the accelerating rise of generative AI workloads and models. Those shifts are forcing HPE and other system makers to assess how their architectures and infrastructure will have to evolve over the next few system generations to meet the new AI-driven demands.
A “Crazy Time” In AI And HPC
“It’s been a crazy time to be in the HPC and AI business because of the explosion of AI and ChatGPT and everything else,” Trish Damkroger, senior vice president and general manager of HPE’s HPC and AI infrastructure solutions, told journalists during a video briefing on the new GX5000 iron. “We need to build a converged system that really can focus on both of these workload needs. It has to do modeling and simulation, which is core to so many of our customers, but it also has to fit the AI world. What this means is, it’s not only about having the traditional needs and what our traditional customers are looking for. For silicon, it’s a broader silicon need that we’re going to have to house within this infrastructure. We’re also converging with AI. Simulation is now not done alone. It’s truly part of a workflow, and AI is enhancing the modeling and simulation work, speeding it up, making it easier, etc.”
Damkroger added that “we have to make sure that our infrastructure supports both. It’s really about extreme scaling. With the growth of AI and the TDP [thermal design power] of the silicon, we’ve had to rethink what is possible and what we can do so that we can make sure that we maximize both the datacenter space and the energy usage.”
And with cloud providers and enterprises increasingly consuming such systems through infrastructure-as-a-service (IaaS), the list of factors that HPE and others have to think about keeps growing.
To address this, HPE is introducing the GX5000 exascale system – the successor to the Cray “Shasta” EX3000 line that debuted ahead of SC2018 and started shipping a year later. The Shasta line was expanded with the EX4000 systems at SC2022, while we were still dealing with the aftermath of the coronavirus pandemic. In addition to the new GX5000 designs, HPE is rolling out a new distributed storage cluster based on the open source Distributed Asynchronous Object Storage (DAOS) software that was pioneered by Intel for the Aurora system. The GX5000s also get the next generation of HPE’s direct liquid cooling. All of these changes are aimed at the ongoing convergence of AI and HPC workloads.
The GX5000 will be the architecture for one of two new systems coming to Oak Ridge: “Discovery,” an exascale computer that will succeed Frontier and is expected to be installed in 2028 at a cost of around $500 million, with operations starting in 2029.
The other will be an AI cluster called “Lux,” based on the direct liquid-cooled HPE ProLiant Compute XD685 and powered by AMD silicon, including Instinct MI355X GPUs, Epyc CPUs, and Pensando DPUs.
Discovery is aimed at AI, HPC, and quantum computing and is expected to increase the productivity of some applications by 10X, with use cases in areas such as precision medicine, cancer research, nuclear energy, and aerospace. It will be powered by AMD’s upcoming next-generation “Venice” Epyc processors and Instinct MI430X GPUs, and will come with HPE’s Slingshot 400 interconnect, which promises twice the bandwidth of the current 200 Gb/sec Slingshot fabric. Slingshot 400 was introduced last year alongside other HPC upgrades, with plans to make it available in “Shasta” Cray EX systems this fall.
Meanwhile, Lux will operate as a multi-tenant cloud platform focused on AI and machine learning operations.
Pivots With The GX5000
Discovery will mark the debut of the GX5000.
“It is purpose-built, like our previous generation, for supercomputing,” Damkroger said. “It includes CPUs, GPUs, networking, software, storage, and cooling. It’s a completely new architecture engineered to deliver this unprecedented performance with higher density and truly supports these growing workloads and workflows. The GX5000 has been in the works for years, but, honestly, we’ve made some pivots over the last year and a half as we’ve seen the growth of TDPs, the growth of different silicon coming out from all the vendors, and the need to be able to support all of these different workloads.”
There is still more information to come about the GX5000 – which will be on display at SC25 next month in St. Louis – but Damkroger said the architecture will offer 127 percent more compute power and support up to 25 kilowatts per compute slot, in a footprint 42 percent smaller than its predecessor.
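Taken at face value, those two numbers compound into a large jump in compute density. Here is a rough back-of-the-envelope sketch, assuming the 127 percent figure applies to compute per cabinet and the 42 percent figure to cabinet footprint (HPE has not broken the comparison down that precisely):

```python
# Back-of-the-envelope density math from HPE's stated GX5000 figures.
# Assumption: "127 percent more compute" is per cabinet and "42 percent
# smaller" refers to cabinet footprint; HPE has not specified the baseline.

compute_gain = 1.0 + 1.27   # 2.27x the compute of the prior generation
footprint = 1.0 - 0.42      # 0.58x the floor space

density_gain = compute_gain / footprint
print(f"Compute per unit of floor space: ~{density_gain:.1f}x")  # ~3.9x
```

That roughly 4X density figure is consistent with the cabinet shrinking from 95 inches to 53 inches, as described below, while per-slot power budgets climb to 25 kilowatts.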
Each GX5000 tray will accommodate different TDP parts, though Damkroger said she wasn’t ready to talk in detail about how hot they could get (our guess is several kilowatts each) or about which CPUs, GPUs, and XPUs would be included as options. Those details will emerge when HPE starts to announce customer wins for the system, with system deliveries beginning in early 2027, she said.
The EX system is two cabinets wide, but the GX5000 will be smaller, dropping from 95 inches across to 53 inches. Its design will also allow organizations to mix and match processors.
“Where [with] the EX, you had to have the same, so you made sure you had the same load across each one of the blades,” she said. “With the new pump design, we’re going to be able to mix and match, which will mean we could have mixed cabinets, which is definitely something that our customers have been interested in.”
Augmenting the GX5000 will be the Cray Supercomputing Storage System K3000, which comes with embedded DAOS software. Intel transferred its DAOS development team to HPE late last year. DAOS is also part of the Aurora system, and with the all-flash K3000, HPE has a factory-built storage system with DAOS embedded that will deliver as much as 75 million I/O operations per second (IOPS) per storage rack, 39 percent higher than competing systems, according to Damkroger.
It complements HPE’s Lustre-based Cray Supercomputing Storage System E2000, which also will be used in Discovery.
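Unlike Lustre’s POSIX file model, DAOS exposes native key-value, array, and object interfaces on top of NVMe flash, which is where the small-I/O headroom comes from. As a rough illustration only, here is a minimal key-value sketch using the open source pydaos Python bindings; the pool and container labels are placeholders, and the exact binding API can vary by DAOS release, so treat this as a sketch rather than gospel:

```python
# Minimal DAOS key-value sketch using the open source pydaos bindings.
# Assumes an administrator has already created a DAOS pool ("tank") and
# a python-type container ("kvstore"); both labels are illustrative.
import pydaos

dcont = pydaos.DCont("tank", "kvstore")     # attach to the container
ckpt = dcont.dict("checkpoints")            # create a named key-value store

ckpt["step-000042"] = b"serialized state"   # small writes land on NVMe
restored = ckpt["step-000042"]              # reads skip the kernel block stack

# Batched puts amortize round trips -- the access pattern behind the
# tens-of-millions-of-IOPS-per-rack numbers HPE is quoting.
ckpt.bput({f"step-{i:06d}": b"state" for i in range(43, 48)})
```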
The GX5000 also will feature HPE’s latest direct liquid cooling.
“We’re bringing liquid coolant to every component of the supercomputer that transmits heat,” she said. “This is important. It’s not just CPUs and GPUs and memory, but it’s also switches, which is unique in our current EX design. The next generation of Cray supercomputing will allow for more efficiency and density. … Basically, the cooling pump is designed to be more compact and can be placed on the side of the system, called a side pump, instead of in the middle, and each pump is going to have redundancy to ensure that there’s always-on operation.”
Users will be able to control the water flow rate, so rather than every blade having the same flow rate, it can be optimized based on the needs of the blade and what it is running. The water also will be warmer, at 40 degrees Celsius, or just over 100 degrees Fahrenheit, compared with the current 25 degrees Celsius (77 degrees Fahrenheit).
“This new thermal capacity meets the new energy requirements for a lot of our customers in Europe and, actually, in other parts of the world,” Damkroger said. “This will ensure that we don’t need additional chillers and refrigerators, which just costs additional power. It’s really going to be a much more energy-efficient system.”
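Both changes fall out of the same heat-balance relation, Q = ṁ·c_p·ΔT: the flow rate a blade needs scales with its power draw, and warmer supply water only works if the facility can still reject the heat without chillers. Here is a simple sketch of that arithmetic with illustrative numbers; the blade powers and the 10 degree coolant temperature rise are assumptions, not HPE specifications:

```python
# Per-blade coolant flow from the heat balance Q = mdot * c_p * dT.
# Blade powers and the 10 C coolant temperature rise are illustrative
# assumptions, not HPE specifications.

CP_WATER = 4186.0   # J/(kg*K), specific heat of water
RHO_WATER = 992.0   # kg/m^3 at roughly 40 C

def flow_lpm(blade_kw: float, delta_t_c: float = 10.0) -> float:
    """Liters per minute needed to carry away blade_kw at a delta_t_c rise."""
    mdot = blade_kw * 1000.0 / (CP_WATER * delta_t_c)  # kg/s
    return mdot / RHO_WATER * 1000.0 * 60.0            # L/min

for kw in (5, 15, 25):
    print(f"{kw:>2} kW blade -> ~{flow_lpm(kw):4.1f} L/min")
# 5 kW -> ~7 L/min, 25 kW -> ~36 L/min: a lightly loaded blade needs a
# fraction of the flow, which is the case for per-blade flow control.
```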