What medium and large enterprises can learn from the supercomputing revolution


IT Pro created this content as part of a paid partnership with AMD. The contents of this article are entirely independent and solely reflect the editorial opinion of IT Pro

Supercomputing entered a new era in 2022, when the first exascale cluster came online. Oak Ridge National Laboratory’s Frontier was the first system able to deliver over 1 exaFLOP of compute performance, or one quintillion floating-point operations per second. That opens up a new world for research, including the application of AI/ML at a much larger scale, but it also has significant implications for medium and large enterprises, particularly as they address digital transformation.

Frontier is an immense system: its HPE Cray EX cluster comprises 9,408 3rd Gen AMD EPYC 7A53 processors with 64 cores apiece, for a total of 602,112 CPU cores. Each of these single-processor nodes is equipped with four AMD Instinct MI250X GPUs, making a total of 37,632 accelerators. While the CPU compute power is already enormous, the GPUs deliver the lion’s share of the FLOPS, so workloads need to be enabled for GPU acceleration, including some of the accelerators’ unique new features, to take full advantage of the compute power Frontier offers.
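Those headline totals follow directly from the per-node configuration, as this simple arithmetic check illustrates (the snippet is purely illustrative, not code that runs on Frontier):

```cpp
// Simple arithmetic check of the Frontier totals quoted above:
// 9,408 single-CPU nodes, 64 cores per CPU, four GPUs per node.
#include <cstdio>

int main() {
    const long nodes = 9408;
    const long cores_per_node = 64;
    const long gpus_per_node = 4;
    std::printf("CPU cores: %ld\n", nodes * cores_per_node); // 602,112
    std::printf("GPUs:      %ld\n", nodes * gpus_per_node);  // 37,632
    return 0;
}
```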

The AMD Instinct MI250X GPUs deployed in Frontier offer a range of capabilities, but two stand out. The first is memory coherence, where accelerator memory and system memory can be treated as a continuum rather than requiring two copies of the data, as in conventional systems. It was already possible for multiple GPUs in a node to share their memory, facilitating the handling of very large datasets; adding system memory to the pool expands the possibilities further, particularly when each node can support up to 4TB of RAM.

The result is a saving in memory space that allows larger and more complex datasets to be processed. It also simplifies the programming: with no explicit copies to manage there is less code to write and maintain, and fewer data transfers to slow execution down.
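To illustrate what that looks like for a programmer, the sketch below uses HIP’s managed-memory allocation so a single buffer is written by the CPU, updated by a GPU kernel, and read back by the CPU with no explicit staging copies. It is a minimal sketch with illustrative sizes and no error checking, assuming a HIP-capable system, not code taken from Frontier:

```cpp
// Minimal sketch of coherent (managed) memory using AMD's HIP runtime.
// One allocation is visible to both the CPU and the GPU, so there is no
// explicit hipMemcpy staging. Error checking is omitted for brevity and
// the kernel is purely illustrative.
#include <hip/hip_runtime.h>
#include <cstdio>

__global__ void scale(double* data, double factor, size_t n) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main() {
    const size_t n = 1 << 20;
    double* data = nullptr;

    // A single shared allocation instead of separate host and device copies.
    hipMallocManaged(reinterpret_cast<void**>(&data), n * sizeof(double));

    for (size_t i = 0; i < n; ++i) data[i] = 1.0;   // written by the CPU

    scale<<<(n + 255) / 256, 256>>>(data, 2.0, n);  // updated by the GPU
    hipDeviceSynchronize();

    std::printf("data[0] = %f\n", data[0]);         // read back by the CPU
    hipFree(data);
    return 0;
}
```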

The AMD Instinct MI250X is also the first GPU to offer built-in networking, which enables distributed processing across nodes. With GPUs plugged directly into the interconnect, communication between compute nodes and between GPUs becomes faster and more efficient, with lower latency overhead. That makes it practical to operate on even larger datasets, which for some workloads is a qualitative change, not just a quantitative one: kinds of insight that were previously out of reach because of the scale involved become possible. Scientific communities researching cosmology, or modelling extremely complex environmental systems with computational fluid dynamics, will be the first to benefit.
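To give a flavour of how software can exploit that tighter coupling, the sketch below hands GPU-resident buffers directly to MPI_Send and MPI_Recv, as GPU-aware MPI implementations allow, so data can move between nodes without first being staged in host memory. It assumes a GPU-aware MPI build and HIP-capable GPUs; the ranks and buffer size are illustrative and error handling is omitted:

```cpp
// Minimal sketch of GPU-aware communication: device buffers are handed
// straight to MPI_Send/MPI_Recv so a GPU-aware interconnect can move data
// between nodes without staging it in host memory first.
#include <hip/hip_runtime.h>
#include <mpi.h>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int n = 1 << 20;
    double* gpu_buf = nullptr;
    hipMalloc(reinterpret_cast<void**>(&gpu_buf), n * sizeof(double)); // lives in GPU memory
    hipMemset(gpu_buf, 0, n * sizeof(double));

    if (rank == 0) {
        // Rank 0 sends its GPU-resident buffer directly to rank 1.
        MPI_Send(gpu_buf, n, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        // Rank 1 receives straight into its own GPU-resident buffer.
        MPI_Recv(gpu_buf, n, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    hipFree(gpu_buf);
    MPI_Finalize();
    return 0;
}
```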

While Frontier is so far the first supercomputer to take advantage of the AMD Instinct MI250X’s built-in networking, the capability is not exclusive to this system. It is likely to become available more widely in general-purpose HPC data centres, and GPU-enabled code will take advantage of it to improve performance scaling. For now, the huge datasets used in certain scientific domains will be the main beneficiaries, but any area of data analytics can use greater data volumes to deliver more effective insights. A company needing to analyse the commercial behaviour of its customers on a global basis will have a huge amount of information to process, and the ability to work on larger sets covering longer periods can deliver better actionable analysis.

Power consumption is another area where supercomputers like Frontier and Europe’s recently commissioned Lumi lead the way. Their hardware delivers unprecedented levels of performance and consumes a lot of power overall, but in terms of compute per watt they are the most frugal systems yet. This is going to be an increasingly important attribute for medium and large businesses, not just out of concern for the environment but also to keep running costs under control.

There are now over 7 million data centres around the world, and energy company Engie has estimated that they account for 4% of world energy consumption and 1% of greenhouse gas emissions. The hunger for compute power shows no sign of diminishing, particularly as Internet of Things devices proliferate and AI/ML workloads are deployed in more and more areas. That compute must be delivered in the most environmentally friendly way possible, so the increase in demand doesn’t come with a prohibitive environmental impact or cost.

The density of supercomputers such as Frontier and Lumi enables them to consume less power for their performance capabilities, thanks to efficient processor and accelerator design that provides the most compute per watt. However, HPC data centres don’t just consume energy to power their CPUs and GPUs; they also need to cool those components to keep them at optimal operating temperature. This is another area where the technology used in the latest supercomputers offers plenty to emulate.

Traditionally, data centres have required sophisticated, power-hungry air conditioning to keep them at optimal temperatures. Not only does this draw a lot of electricity, but it can also eject a lot of heat into the surrounding environment. Cooling that relies on water and natural airflow rather than actively refrigerated air conditioning can dramatically reduce power consumption and environmental impact. Data centres in hot, dry climates such as the western and southwestern United States have also been deploying “swamp cooling”, which relies on evaporation to provide a cooling effect. Not only are these systems much cheaper to install than air conditioning (around half the price), but they also consume less than 40% of the electricity.

The efficiency afforded by density is set to increase, too. The processor side of supercomputing has just had a huge injection of extra power with the release of AMD’s 4th Gen EPYC processors, which increase the number of cores per socket by 50%. The top CPUs now offer 96 cores, enabling a dual-socket server to deliver 192 cores. Although this increases the thermal design power by nearly 30% over the equivalent 64-core 3rd Gen AMD EPYC processor, the 50% increase in cores means the compute delivered per watt goes up, making this an even more environmentally friendly HPC platform.
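A rough worked example shows why. Assuming representative TDP values of 280W for a 64-core 3rd Gen part and 360W for a 96-core 4th Gen part (assumed figures for illustration; exact SKUs vary), cores per watt improves by around 17% even though the socket draws more power overall:

```cpp
// Rough cores-per-watt comparison implied by the figures above. The 280W
// and 360W TDPs are assumed, representative values for a 64-core 3rd Gen
// and a 96-core 4th Gen EPYC part; check the exact SKUs you are sizing.
#include <cstdio>

int main() {
    const double gen3_cores = 64.0, gen3_tdp_w = 280.0;  // assumed TDP
    const double gen4_cores = 96.0, gen4_tdp_w = 360.0;  // assumed TDP

    const double gen3_cpw = gen3_cores / gen3_tdp_w;     // ~0.23 cores per watt
    const double gen4_cpw = gen4_cores / gen4_tdp_w;     // ~0.27 cores per watt

    std::printf("Cores per watt: Gen3 %.3f, Gen4 %.3f (%+.0f%%)\n",
                gen3_cpw, gen4_cpw, (gen4_cpw / gen3_cpw - 1.0) * 100.0);
    return 0;
}
```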

Medium and large enterprises have been undergoing a concerted period of digital transformation, which shows no sign of abating. The increasing use of AI/ML and data analytics in business practice brings a parallel escalation in demand for HPC compute infrastructure. The supercomputing revolution, with its focus on GPU acceleration, density, and power consumption, shows the way forward. By borrowing ideas from the fastest computers in the world, medium and large enterprises can address digital transformation in the most environmentally friendly and cost-effective way.

