Ending the age of data centre ‘five nines’
High 99.999% availability pledges play well with cloud and data centre customers, but some downtime is always necessary
Service level agreements (SLAs) and marketing material that emphasise extremely high 'five nines' availability – 99.999% uptime – can sometimes undermine the very reliability they promise.
In fact, almost always-on availability can nudge operators away from taking systems or infrastructure down for maintenance, according to Simon Brady, EMEA services channel business development head at Vertiv.
"Everyone talks about 'five nines'; 99.999% availability," Brady says. "Ask any data centre when was the last time they had a three-minute failure; a one-second failure equals a whole day. ‘Five nines’ is a misnomer, a marketing gimmick."
Always-on data centres are unrealistic
The emphasis on 'five nines' in data centres might have inadvertently contributed to the data centre cooling issues reported during 2022's July heatwave, when outdoor temperatures peaked at 40.3ºC in Coningsby, Lincs. Rather than carving out the downtime needed to upgrade equipment or maintain systems so they could cope with extreme temperatures, operators paid the price later, when cooling systems failed.
While hot weather can correlate with failures, well-looked-after equipment doesn’t typically falter. Instead of demanding constant uptime, stated metrics and service levels should focus more on contingency plans in the case of failure, resilience strategies, or redundancies built in at the software level, for example, Brady says.
Some data centres, meanwhile, typically operate beyond their SLAs and contracts, putting their own organisation at risk, legally if not technically. The promise of 99.999% availability, therefore, could be de-emphasised in some cases, Brady suggests.
"No matter what you buy, whether it's a Bugatti Veyron or a Ford Focus, if it can break, it will break at some point," he says. "So you have to plan for that."
Sascha Giese, head geek at SolarWinds, agrees that 99.999% uptime often falls "somewhere between a marketing statement and a tick-box exercise for business requirements", with organisations sometimes seeing 'five nines' as standard even if they don't need it.
"'Five nines’ allows just over five minutes of downtime in a year. Even if nothing happens, that's not easy to achieve, as OS and security updates will take longer than that. This means data needs to be moved around between redundant systems," Giese points out.
Smaller or regional companies might be "perfectly fine" with two nines, considering that maintenance windows can be scheduled outside business hours. Yet "we live in a world where humans have lost their patience". Either way, "unexpected details" should be avoided in the fine print, he cautions.
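To put the figures Giese cites in perspective, the arithmetic behind an availability percentage can be sketched in a few lines of Python. This is only an illustration: the function name `downtime_budget_seconds` is made up for this example, and the calculation assumes a 365-day year with no exclusions for scheduled maintenance, which real SLAs may define differently.

```python
# Illustrative sketch: convert an availability percentage into the
# downtime it permits per year. Assumes a 365-day year and that all
# downtime counts against the budget (real SLAs may exclude maintenance).
SECONDS_PER_YEAR = 365 * 24 * 60 * 60


def downtime_budget_seconds(availability_percent: float) -> float:
    """Return the seconds of downtime per year allowed at this availability."""
    return SECONDS_PER_YEAR * (1 - availability_percent / 100)


for label, pct in [("two nines", 99.0), ("three nines", 99.9), ("five nines", 99.999)]:
    seconds = downtime_budget_seconds(pct)
    print(f"{label} ({pct}%): {seconds / 60:.1f} minutes per year")
```

Under those assumptions, 'five nines' works out to roughly 5.3 minutes a year, while two nines allows about 87.6 hours, or a little over three and a half days.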
When disaster strikes, ‘five nines’ mean nothing
Giese agrees that more tolerance for some downtime may be needed, suggesting that a staging site with "proper messaging" could be prepared, "ready for the inevitable" disruption. Other than that, customers should perhaps simply be prepared to pay higher prices if they want that level of availability.
Neil Clark, director of cloud services at provider QuoStar, notes that typically the need for downtime should be – and is – outlined in the fine print. The onus is at least partly on customers, who need to make sure they know exactly what they're being sold in the first instance.
"Downtime from a maintenance perspective shouldn't ever be included in your 'five nines'," Clark says. "If they can't have an application that ever goes down, even for maintenance, you then build the solution right for that application. But, from our perspective, we're going to have maintenance, our platform will have maintenance, the provider will have maintenance."
That might suggest providers could make service levels and maintenance requirements clearer to customers, perhaps by explaining them in the main body of an agreement rather than in the fine print, and by emphasising the importance of engaging with all the terms and conditions.
"Obviously, honesty is the best policy," says Clark. "And looked at another way, 'five nines' means absolutely nothing if there is a natural disaster – like a flood. I mean, how much diesel have you got to run your data centre if something like that happens?"
Reforming the ‘five nines’ mindset
Nick Archer, senior consultant at datacentre authority Uptime Institute, says 'five nines' can still be useful as a metric, partly depending on whether customers are working with colocation or cloud, and what services are being hosted or supported.
He warns that if critical business functions are in a third-party facility or the cloud, five minutes of downtime may not even be considered enough anymore. If purchasing a single supply to a rack, however, with single corded devices, it's probably "the best you're going to get".
If there's "a concurrently maintainable or fault-tolerant infrastructure", downtime can, of course, happen for maintenance.
"I think 'five nines' remains a useful metric but it is probably a little bit outdated," Archer says. "And Sod's Law states that that five minutes [downtime] is going to be happening in whatever is critical to your business, whether that be key trading hours or key business hours in the middle of a key application."
Customers aren't necessarily realistic when estimating their own tolerance levels. If you ask different stakeholders within an organisation, they'll all have different views. Providers should first ensure stakeholders understand and agree on the potential impacts of downtime, then work on reducing risk, says Archer.
What's key is understanding the importance of the application or service that's being supported and placed elsewhere, relative to the organisation's tolerance for potential risks, outages or downtime at particular times, he says.
Fleur Doidge is a journalist with more than twenty years of experience, mainly writing features and news for B2B technology or business magazines and websites. She writes on a shifting assortment of topics, including the IT reseller channel, manufacturing, datacentre, cloud computing and communications.