The benefits of a robust system redundancy strategy

A businessman's hand pushed in between wooden dominoes, stopping falling dominoes from knocking over others. The first domino that fell is red, while the rest are plain wood. — (Image credit: Getty Images)

System redundancy protects against individual system failures by deliberately duplicating parts of a system so that a copy can take over if something fails. While resiliency is a system’s ability to bounce back when a problem occurs, redundancy is about reducing downtime at the very least and, at best, preventing it entirely. Simply put, it’s the first line of defense against unplanned outages, data loss and operational disruption.

“With redundancy, you have components of a system that essentially have a copy of themselves so if one fails then the second can come in,” explains Chris Astley, head of cloud at KPMG UK. “That’s important because you could have various components which could fail at any given point in time, but with redundancy the whole system should keep working.”
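The pattern Astley describes can be sketched in a few lines of Python. This is a minimal illustration rather than a production design: the `call_with_failover` helper and the `primary` and `secondary` functions are hypothetical names invented for the example, and the failure is simulated.

```python
from typing import Callable, Optional, Sequence


def call_with_failover(replicas: Sequence[Callable[[], str]]) -> str:
    """Try each replica in order and return the first successful result."""
    last_error: Optional[Exception] = None
    for replica in replicas:
        try:
            return replica()
        except Exception as exc:  # a real system would catch narrower errors
            last_error = exc
    # Every copy failed, so the redundancy has been exhausted.
    raise RuntimeError("all replicas failed") from last_error


def primary() -> str:
    # Hypothetical primary component with a simulated failure.
    raise ConnectionError("primary is down")


def secondary() -> str:
    # Hypothetical duplicate that takes over when the primary fails.
    return "served by secondary"


result = call_with_failover([primary, secondary])
print(result)
```

The caller never sees the primary's failure, which is the point: the system as a whole keeps working as long as at least one copy responds.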

While attention often centers on server redundancy, effective redundancy shouldn’t just be about hardware. As Gray Reinhard, CTO at Energea, puts it: “It’s a risk-weighted safety net that duplicates every business-critical dependency across availability zones, clouds and even the edge.”

Modernizing redundancy strategy

Indeed, the concept of redundancy has shifted in response to the growth of distributed infrastructure, cloud services and hybrid models. A business that focuses only on servers has a slightly old-fashioned view of how redundancy should work, notes Astley.

“If you’re in a cloud environment for instance, you’re probably managing things on a service or product basis which is a different approach because there’s a lot of supporting systems that allow you to run those services. Those supporting services, such as how software is stored in your source code repositories, and then deployed into live environments, are often overlooked because they’re not serving an end customer in and of itself. But if they break then it has a material impact, therefore you need the same level of redundancy.”

IT leaders must change their approach to redundancy as they shift to hybrid/cloud infrastructure, as well as virtualization and containerization, he continues. “It’s quite different as instead of looking at a system or a set of systems that are like each other and doing redundancy on those, it’s now an environment that supports many applications and services. Therefore, the approach is a lot more specific to an individual service and people find that quite difficult to get past.”

Those using edge computing may need a local redundancy requirement for offline tasks, or may be able to utilize their edge computing network to provide the redundancy for a certain service, Astley adds. “It’s important that the systems are designed based on particular needs and what the impact is of them failing so you can build in the appropriate level of redundancy.”

The benefits of system redundancy

Robust system redundancy offers a clear benefit: it saves the time and money that an outage would otherwise cost. These are costs that are often overlooked, explains Eric Hanselman, chief analyst at 451 Research, S&P Global Market Intelligence. But as with any technology, system redundancy isn’t a silver bullet. IT leaders must do their due diligence and test regularly to check that their solutions are fit for purpose.

One government agency neglected to do this, and when its primary systems failed so did the backup system, which left public services unavailable for several days, notes Tom Richards, systems and storage practice lead at IT consultancy Northdoor.

A common blind spot when designing for redundancy is an overreliance on cloud-native failovers without validating how they behave, says Nic Adams, founder and CEO of cybersecurity firm 0rcus. One example comes from a manufacturing firm that relied on single-provider cloud redundancy. When that provider experienced an extended 48-hour outage, the firm’s production was halted. “The company lost millions in revenue,” notes Richards.

“Cloud platforms give you zones, but not immunity. A regional control-plane outage or a long-haul fibre cut can strand ‘redundant’ resources in the same failure domain,” Reinhard adds. “Mitigate with cross-region replication, limited multi-cloud escape hatches and rigorous offline restore tests,” he advises.

One industry leader in system redundancy is Netflix, which introduced its Chaos Monkey tool to force systems to prove their redundancy in real time. This example of chaos engineering randomly disables production instances to make sure other systems take over without any impact on the customer.
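The idea behind this kind of test can be illustrated with a toy simulation. The sketch below is not Netflix's tool or its API: `Instance`, `route` and `chaos_round` are hypothetical names, and the "instances" are plain Python objects rather than real servers.

```python
import random


class Instance:
    """A stand-in for one production instance in a redundant pool."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.healthy = True

    def handle(self, request: str) -> str:
        if not self.healthy:
            raise RuntimeError(f"{self.name} is down")
        return f"{self.name} handled {request}"


def route(pool: list[Instance], request: str) -> str:
    """Send the request to the first healthy instance in the pool."""
    for instance in pool:
        if instance.healthy:
            return instance.handle(request)
    raise RuntimeError("no healthy instances left")


def chaos_round(pool: list[Instance], rng: random.Random) -> Instance:
    """Disable one randomly chosen healthy instance and return it."""
    victim = rng.choice([i for i in pool if i.healthy])
    victim.healthy = False
    return victim


pool = [Instance("a"), Instance("b"), Instance("c")]
rng = random.Random(0)  # seeded so the experiment is repeatable
victim = chaos_round(pool, rng)
response = route(pool, "GET /")  # should still succeed via a surviving instance
print(f"disabled {victim.name}; {response}")
```

If `route` raised an error after a single instance was disabled, the experiment would have exposed a gap in the redundancy before a real failure did.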

Then there’s the cautionary tale of Facebook’s October 2021 outage, which was caused by a BGP misconfiguration that cut off its internal DNS. “This showed that even the biggest hyperscalers can fail without local routing backups or internal DNS redundancy,” notes Adams.

Calculating acceptable risk

Hanselman adds that there are situations where redundant technology approaches aren’t worth the expense, as some business processes can be covered by human interventions at lower cost.

“There may also be tolerable delays or risks – for example retail store connectivity might be able to tolerate a short network outage if it can still conduct business. There might be acceptable risks to not fully resolving credit card transactions if fraudulent activity risk is low enough.

“Another example would be shifting users to a mobile application if a website experiences an outage. That could work well as long as the two don’t share a backend that’s the source of the problem.”

So how should businesses assess the right level of redundancy for different aspects of their tech stack, without overengineering or overspending? First, outline how critical each system is to the organization’s operations, quantifying where possible how much a failure would affect customers and revenue. This has the added advantage of helping to sell the benefits of system redundancy to the board, by showcasing the cost of unplanned outages, potential insurance increases and customer churn.

Adams recommends the use of attack surface heat maps and downtime cost modelling, noting that not all systems require an active-active redundancy model. “Tier by consequence: mission-critical apps get live mirroring, non-critical assets can defer to slower restore models,” he advises.
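Tiering by consequence could look something like the sketch below. The dollar thresholds, the intermediate "warm standby" tier and the example systems are all assumptions made for illustration; they do not come from Adams or from any standard.

```python
def redundancy_tier(downtime_cost_per_hour: float) -> str:
    """Pick a redundancy model from an estimated hourly cost of downtime."""
    # Thresholds are illustrative assumptions, not industry standards.
    if downtime_cost_per_hour >= 100_000:
        return "live mirroring (active-active)"
    if downtime_cost_per_hour >= 10_000:
        return "warm standby"  # assumed middle tier, not named in the article
    return "slower restore from backup"


# Hypothetical systems with made-up hourly downtime costs in dollars.
systems = {"payments": 250_000, "reporting": 20_000, "internal wiki": 500}
tiers = {name: redundancy_tier(cost) for name, cost in systems.items()}

for name, tier in tiers.items():
    print(f"{name}: {tier}")
```

In practice the inputs would come from the downtime cost modelling Adams mentions rather than from fixed numbers.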

Map every plausible failure to a dollar value, then invest up to, but never beyond, that exposure, concludes Reinhard. “Redundancy that outstrips risk is waste: redundancy that falls short is negligence.”
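One simple way to put a dollar value on a failure is to multiply its likelihood by its cost. The sketch below assumes that model; the `FailureScenario` class, the `worth_funding` helper and every figure in it are hypothetical, chosen only to show the arithmetic.

```python
from dataclasses import dataclass


@dataclass
class FailureScenario:
    name: str
    annual_probability: float  # chance of the failure in a given year, 0 to 1
    outage_hours: float        # expected length of the outage
    cost_per_hour: float       # lost revenue and recovery cost per hour

    @property
    def annual_exposure(self) -> float:
        """Expected yearly loss: probability x duration x hourly cost."""
        return self.annual_probability * self.outage_hours * self.cost_per_hour


def worth_funding(scenario: FailureScenario, annual_redundancy_cost: float) -> bool:
    """Invest up to, but never beyond, the exposure."""
    return annual_redundancy_cost <= scenario.annual_exposure


# Made-up numbers: a 48-hour provider outage expected once every four years.
provider_outage = FailureScenario("provider outage", 0.25, 48, 50_000)

print(provider_outage.annual_exposure)          # 0.25 * 48 * 50,000
print(worth_funding(provider_outage, 400_000))  # cheaper than the exposure
print(worth_funding(provider_outage, 700_000))  # costs more than the risk
```

Real exposure estimates would also need to account for the costs the article mentions that are harder to quantify, such as insurance increases and customer churn.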

Keri Allan is a freelancer with 20 years of experience writing about technology and has written for publications including the Guardian, the Sunday Times, CIO, E&T and Arabian Computer News. She specialises in areas including the cloud, IoT, AI, machine learning and digital transformation.