As the renowned author, Douglas Adams, once wrote: “Space is big. Really big. You just won't believe how vastly hugely mind-bogglingly big it is.” It’s no real surprise, then, that getting to the bottom of the origins of the universe takes a massive – almost unbelievable – amount of computing power.
Probing some of the universe's deepest mysteries has meant setting up a massive computer, with Durham University among a handful of UK institutions jointly working on the project. Admittedly, it isn’t nearly as big as the Deep Thought system that features in The Hitchhiker's Guide to the Galaxy, and the answers are far less vague than “42”.
Before even beginning to tackle these questions, researchers sought a device powerful enough to peer into the very beginnings of space and time. This is where the James Webb Space Telescope (JWST) came in. Successor to the great Hubble Telescope, the JWST conducts infrared spectroscopy, with its high resolution and sensitivity allowing us to examine objects deep in space with detail its predecessor never managed. The JWST was launched into space last December and is already producing phenomenal results. Clusters of breathtaking images published by NASA speak for themselves.
This is only part of the journey, though, and institutions such as Durham University are hoping to use the tiny clues revealed by such images to help decipher the origins of the universe. They should, over time, begin to give answers to such questions as what the universe is, what it’s made from, what dark matter might be, and so on. But first, data needs to be crunched on a massive scale while, at the same time, researchers run in-depth simulations over and over.
The origins of everything
Durham set up a Cosmology Machine (COSMA) last year to begin answering these questions. This isn’t the first COSMA – that came online in 2001 – with the latest version, COSMA8, becoming operational in October 2021. It’s part of the Distributed Research utilizing Advanced Computing (DiRAC) facility, which has five deployments at the Universities of Cambridge, Durham, Edinburgh, Leicester and UCL. The DiRAC-3 memory intensive (MI) service at Durham runs workloads that require massive amounts of memory.
COSMA’s role is to run a simulation of the big bang. By tuning different input parameters and different models, taking into consideration things like dark matter and dark energy, the team attempts to match a simulation with what’s observed by astronomers using the JWST. Through extensive statistical analysis, the team can fine-tune the input parameters of these models to get a better idea of how the universe might have formed.
To do that, COSMA8 has a large amount of memory per core, with the current system boasting 1TB of random access memory (RAM) per node to power the cosmology simulation. This simulation, which starts with the big bang before tracking how this might propagate through spacetime, can take months to run. The additional memory capacity helps to shrink that time span.
The compute cluster is based on Dell EMC PowerEdge C6525s. With four servers arranged in a 2U formation, each chassis contains 512 processing cores, with up to 3,200 MT/s (megatransfers per second) memory speed to decrease latency and PCIe Gen4 to transfer data quicker. The first installation had 32 compute nodes to iron out any problems with coding as well as testing and benchmarking. Shortly after, this was expanded to 360 nodes with plans to increase this even further to 600 nodes next year.
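As a rough check on these figures, the short sketch below works through the arithmetic using the numbers quoted in this article: four servers per chassis, two 64-core processors per node, and 1TB of RAM per node. Treating 1TB as 1,024GB is an assumption made here for round numbers, and the function name `cluster_totals` is purely illustrative.

```python
# Figures quoted in the article.
NODES_PER_CHASSIS = 4
CPUS_PER_NODE = 2
CORES_PER_CPU = 64
RAM_PER_NODE_GB = 1024  # assumption: "1TB" read as 1,024GB


def cluster_totals(node_count):
    """Return core and memory totals for a cluster of `node_count` nodes."""
    cores_per_node = CPUS_PER_NODE * CORES_PER_CPU
    return {
        "cores_per_node": cores_per_node,
        "cores_per_chassis": cores_per_node * NODES_PER_CHASSIS,
        "total_cores": cores_per_node * node_count,
        "total_ram_tb": node_count * RAM_PER_NODE_GB / 1024,
        "ram_per_core_gb": RAM_PER_NODE_GB / cores_per_node,
    }


# Initial installation, current size, and planned size from the article.
for nodes in (32, 360, 600):
    print(nodes, cluster_totals(nodes))
```

On these assumptions, four nodes of 128 cores each account for the 512 cores per chassis, and each core has roughly 8GB of memory to work with.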
COSMA8 also features AMD EPYC CPUs, with 64 cores per processor. Each node contains dual 280-watt AMD EPYC 7H12 processors. According to Alastair Basden, HPC technical manager at Durham University, having so many cores per node decreases the amount of internode communication, speeding things up along the way. Basden’s team worked with Dell to move half of COSMA7 to a switchless interconnect, meaning all the nodes are connected in a 6D torus rather than to a central hub.
One of the reasons for building this style of interconnect was that some of the code running the simulations didn’t parallelise 100%.
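Why incomplete parallelisation matters can be illustrated with Amdahl's law, a standard rule of thumb that caps the speed-up of a program by the fraction of it that must run serially. The sketch below is a generic illustration only: the article does not say what fraction of the Durham code runs in parallel, so the fractions used here are hypothetical.

```python
def amdahl_speedup(parallel_fraction, n_cores):
    """Theoretical speed-up under Amdahl's law.

    parallel_fraction: share of the runtime that can be spread across cores (0 to 1).
    n_cores: number of cores the parallel part is spread over.
    """
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if n_cores < 1:
        raise ValueError("n_cores must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_cores)


# Hypothetical fractions, evaluated for one 128-core node.
for fraction in (1.0, 0.99, 0.95):
    print(fraction, round(amdahl_speedup(fraction, 128), 1))
```

Even a small serial share limits the gain sharply: with 5% of the work running serially, 128 cores deliver a speed-up of less than 18 times in this model.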
In order to scale well with the new technology deployed at Durham, Basden says the team there developed SWIFT, a simulation code that uses task-based parallelism to dynamically allocate tasks to nodes with available processing power. This reduces the load imbalance often seen in HPC codes. The code automatically rebalances itself periodically to account for parts of the universe that may be denser or sparser. These simulations can take several months to carry out and produce petabytes of data, which the team then analyses.
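The benefit of handing out tasks dynamically can be seen in a toy model. The sketch below is not the Durham team's scheduler; it is a simplified illustration that compares splitting a list of tasks into equal-sized fixed blocks against giving each task to whichever worker becomes free first. The task costs are invented, with expensive tasks standing in for dense regions of the universe and cheap ones for sparse regions.

```python
import heapq


def static_makespan(costs, n_workers):
    """Finish time when tasks are split into fixed, equal-sized contiguous blocks."""
    chunk = -(-len(costs) // n_workers)  # ceiling division
    loads = [sum(costs[i:i + chunk]) for i in range(0, len(costs), chunk)]
    return max(loads)


def dynamic_makespan(costs, n_workers):
    """Finish time when each task goes to the worker that becomes free first."""
    finish_times = [0] * n_workers
    heapq.heapify(finish_times)
    for cost in costs:
        earliest = heapq.heappop(finish_times)  # next worker to become idle
        heapq.heappush(finish_times, earliest + cost)
    return max(finish_times)


# Hypothetical costs: four dense regions (cost 8) and eight sparse ones (cost 1).
region_costs = [8, 8, 8, 8, 1, 1, 1, 1, 1, 1, 1, 1]

print("static :", static_makespan(region_costs, 4))
print("dynamic:", dynamic_makespan(region_costs, 4))
```

In this example the fixed split leaves one worker with three dense regions while others sit idle, whereas dynamic allocation spreads the expensive work evenly and finishes in well under half the time.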
It was also important for the system to have a very fast scratch storage partition because, while the simulation is running, it needs to dump a lot of data to disk very rapidly.
With the system in place, Basden says it’s running as expected. The real challenge has not been with the hardware but with the software. For Basden, this means getting the software environment “working in a way in which maximises utilisation of the hardware and makes it easier for the researchers to use”.
Another challenge was being one of the first to use direct water cooling to take heat away from the system. He says that “was a bit of a learning experience in how to manage these systems”. He adds that Dell helped the team through this challenge, and the system is now very reliable, so much so that Dell will be offering direct liquid cooling across its next-generation server portfolio when it launches next year.
As for the future, there is talk of using composable systems from a company called Liqid, whose technology lets infrastructure virtually pool GPUs, CPUs, memory and network switches, among other hardware components. While Basden may not use GPUs any time soon – as he says CPUs are better for the memory-intensive simulations Durham is running – the approach could be useful for allocating extra memory to a server when running those simulations.
This novel technology has led Basden to start discussions with code developers, lab scientists and researchers on how they could redesign their code to take advantage of a much larger bank of cheaper memory.
Rene Millman is a freelance writer and broadcaster who covers cybersecurity, AI, IoT, and the cloud. He also works as a contributing analyst at GigaOm and has previously worked as an analyst for Gartner covering the infrastructure market. He has made numerous television appearances to give his views and expertise on technology trends and companies that affect and shape our lives. You can follow Rene Millman on Twitter.