Is the Top500 meaningless? Not so, says US national laboratory CTO

LINPACK may measure only one process, but there are real and meaningful use cases for exascale systems

Lab technician working on Frontier compute node
(Image credit: U.S. Department of Energy, Oak Ridge National Laboratory)

While every CTO’s job varies, some stand out more than others, and being CTO of one of the USA’s national laboratories is one such case.

In eastern Tennessee, south of the city of Oak Ridge and close to the Clinch River, sits the aptly named Oak Ridge National Laboratory – or ORNL for brevity – and its CTO, Scott Atchley.

ORNL is one of two facilities – the other being Argonne National Laboratory in Illinois – that make up the Leadership Computing Facilities. These two locations, both of which are funded by the US Department of Energy (DoE), are home to some of the fastest and most powerful supercomputers not just in the US but in the world: Aurora at Argonne and, at ORNL, Frontier.

Atchley, a 16-year veteran of ORNL, tells ITPro that there have only been two weeks in his tenure as CTO that he’s regretted taking the job.

“That’s when we had deployed Frontier,” he says.

Full power failure?

Yes, even in one of the most advanced computing facilities in the world, it’s still entirely possible to have a relatively small fault cause chaos.

“[Frontier is] our current supercomputer, it's an HPE Cray EX system and one of the benchmarks – not the goal, but one of the benchmarks – for the system is that it would reach exascale, or 10 to the 18 flops,” says Atchley.

Supercomputers’ power is only announced twice a year – at ISC in Germany, which takes place in June, and SC Conference in the USA, which takes place in November. It’s from the announcements at these conferences that the Top500 list of most powerful supercomputers is drawn.

With the deadline for ISC 2022 fast approaching, Frontier was failing to perform.

“We were only about halfway [to 10^18 flops], and we could not figure out why,” says Atchley. “We were having daily calls with HPE and AMD, and the deadline was coming.

“We were in a full panic, and this was the first time I thought, ‘my gosh, we might have bought the wrong machine’.”

Fortunately for Atchley and everyone else involved, he was wrong, and once the problem was identified, it was a “simple fix”.

“It was a one-line software change, and all of a sudden the performance shot up to right at just under an exaflop,” he says. “When they're new, [supercomputers] are very fragile. So, we rebuilt the machine during the day, and the next night we did the runs, and we finished it at about six in the morning.

“There were probably 50 people in line watching … and when the job looked like it was going to finish, all the chatter went quiet, nobody wanted to jinx it – we had been trying all night long, and it was really exciting when we saw the result that we actually exceeded an exaflop.”

Putting power to use

Top500 does have its limitations – only systems whose operators run the LINPACK benchmark and submit the results are ranked, for example. There are also claims that it’s not particularly scientific, and Jack Dongarra, one of the creators of LINPACK, has himself admitted that its usefulness is limited.

“This benchmark reports the performance of this one application with its floating point operations and message passing. So there is one application and one number that represents the performance for the computer system. The weakness is that this is just one application and one number,” he told HPCWire in 2002.

Limitations and weaknesses don’t mean useless, however, and Atchley pushes back at the idea that Top500 is entirely meaningless.

“A lot of people will knock the Top500 … but we've actually had jobs on the system doing real science exceed an exaflop in performance, which is really cool, and so we're glad,” Atchley tells ITPro. “Our users love the machine, it's doing some incredible science.”

From aerospace to outer space – and back

Speaking of users, who is making use of Frontier? One case study Atchley gives is from the engineering firm GE. In partnership with French firm Safran, they have been developing a new type of jet engine.

“If you look at engines when commercial airliners came out in the 60s and early 70s, the engines were really narrow, really long,” says Atchley. “If you flew [today] and looked out the window, the engines now are really short, but really wide.

“Making them wider makes them more energy fuel efficient, but they're to the point now that if they make them any bigger, the gains they get in the efficiency they lose because of the drag on the nacelle.”

One option is to get rid of the nacelle – the housing around the engine – altogether. This isn’t a new idea, Atchley explains: “NASA has known about this for 50 years.”

The only issue is that such engines are – or would be – so loud that using them on a commercial aircraft would deafen the passengers and crew.

It was only once Frontier came online with its exascale capabilities that GE and Safran were able to measure in high resolution the swirls of turbulence coming from the engine.

“Now they're tweaking the design to minimize those, and so that is the huge breakthrough, is understanding turbulence at a really fine scale. They can't do that in a wind tunnel. This engine is so big, there's no wind tunnel on earth that would fit it, so the only way they can do it is in a computer,” says Atchley.

As powerful as Frontier is, however, it can only model a 14-blade engine. In 2028, though, Frontier is expected to be superseded by the Discovery supercomputer, which may open up new opportunities for the likes of GE.

Once it’s online, ORNL will be home to three supercomputers: Frontier, Lux (which is expected to come online in 2026), and Discovery. At least for a time. Frontier will, ultimately, be decommissioned just as its predecessor, Summit, was.

“The challenge you get on these machines is even if the hardware doesn't run out or doesn't fail, the software, the vendors don't want to support something that's now 6, 7, 8 years old,” says Atchley.

“You think you want to have the latest software, you want to have the latest security patches, and so eventually you have to turn the machine off, but we had universities [and] industrial partners asking us, ‘can you run Summit longer?’ – we'd love to, but we can't.”

Sometimes there is a second life for these apparently beloved behemoths that isn’t just the recycling route.

“IBM was able to sell a portion of [Summit] to an oil and gas company, and so it is still doing work for them somewhere else.”

Jane McCallion
Managing Editor

Jane McCallion is Managing Editor of ITPro and ChannelPro, specializing in data centers, enterprise IT infrastructure, and cybersecurity. Before becoming Managing Editor, she held the role of Deputy Editor and, prior to that, Features Editor, managing a pool of freelance and internal writers, while continuing to specialize in enterprise IT infrastructure and business strategy.

Prior to joining ITPro, Jane was a freelance business journalist writing as both Jane McCallion and Jane Bordenave for titles such as European CEO, World Finance, and Business Excellence Magazine.