Why major data center outages may soon be a thing of the past

Data center outages are becoming less frequent and less severe, according to new research, highlighting increased resiliency across the industry. 

As the number of data centers continues to expand, you might expect the overall number of data center-related outages to increase. But new research from Uptime Institute shows there has been a consistent downward trend in both the frequency and severity of outages for the last few years.

According to the institute’s 2023 data center survey, just over half (55%) of data center operators said they had suffered an outage in the past three years, down from 60% in 2022 and 69% in 2021.

Similarly, only one-in-ten outages in 2023 were considered either serious or severe – marking a decrease compared to 2022 and 2021.

Uptime said that one key factor behind the improved uptime is that, year-on-year, most organizations are investing more in physical infrastructure redundancy.

“While the industry may indeed move further toward distributed and software-based resiliency models, maintaining and increasing site-level redundancy remains a high priority for most operators,” it said.

According to its data, only a tiny proportion of enterprise, colocation, or cloud providers were reducing redundancy; across all those groups about a third were upping their cooling and power redundancy levels while the rest were keeping them steady.

Still, the report found that outages, while rarer, are becoming costlier. More than half of the respondents said their most recent significant outage cost more than $100,000, while an unlucky 16% reported costs of more than $1 million.

The industry advisory firm calculated there are, on average, 10 to 20 high-profile IT outages or data center events globally every year that cause serious or severe financial loss along with disruption to both businesses and consumers.

Power issues a key factor in data center outages

Uptime’s research found that disruptions to on-site power distribution are the most common cause behind impactful outages.

“This is unsurprising given the intolerance of IT hardware to any significant power disturbances, such as voltage fluctuations or complete loss of power, that last more than fractions of a second,” it said.

In contrast, cooling system failures can go on for (slightly) longer without problems. The study noted that while IT-based failures may occur more frequently, these are often isolated in their impact on specific applications or datasets.

Problems with third-party providers are also creeping up as a factor, which Uptime said reflects the growing reliance on SaaS and colocation providers. Other less common issues included problems with networks and fire suppression systems.

Human error still a pervasive issue

But however the problem manifests itself, humans are likely to be responsible for it.

Uptime said human error can be the result of a number of factors, such as poor training, the quality of the procedures in place, staff fatigue, and the sheer complexity of equipment operation involved.

Drawing on 25 years of data, Uptime estimated that human error, whether directly or indirectly, contributed to between two-thirds and four-fifths of all incidents.

Such outages are mostly caused either by staff failing to follow procedures (listed by 48% of respondents who had a serious outage in the last three years) or by the procedures themselves simply being inadequate (43%).

Four-in-five respondents to the survey said that their most recent serious outage could have been prevented with better management, processes and configuration.

“This suggests that, as in previous years, there is an opportunity to reduce outages through training and process review,” the report said.

The report added that the aftershocks of the COVID-19 pandemic continued to have an impact on the data center industry. For example, supply chain disruptions continue to slow capital projects, which has led many organizations to delay maintenance and infrastructure upgrades.

These projects can lead to outages, so it’s possible that outage numbers will rebound at some point in the future.

Uptime also warned that the global shift toward more transactive, dynamic, and renewable power grids could reduce grid reliability. This could mean more outages in future, since they often occur when an uninterruptible power supply or generator fails to respond to a grid disruption.

Extreme weather events made worse by climate change have also been associated with data center outages over the past few years.

“This trend is likely to intensify and will increase the outage risks until pre-emptive action is taken,” it warned.

Steve Ranger

Steve Ranger is an award-winning reporter and editor who writes about technology and business. Previously he was the editorial director at ZDNET and the editor of silicon.com.