How to minimise the impact of cloud outages on your business

More organisations than ever are moving applications and data to the cloud. While this undoubtedly offers many benefits, outages are a fact of life, and 100% uptime is far from guaranteed. A key part of any business's cloud strategy should therefore be to reduce the impact of these disruptions when they occur.

In March, it was found that human error was responsible for an Amazon Web Services (AWS) outage that affected millions of customers earlier in the year. Major cloud provider outages like this one should teach us that IT infrastructure, if not monitored correctly, can suffer complete shutdowns and drastically reduced performance. That is true of both physical infrastructure and the cloud, according to Virtual Instruments EMEA marketing director Chris James.

"Remember: the term 'cloud' simply refers to an outsourced data centre. You can either manage your own data centre (on premises), have someone manage your data centre (private cloud), or use someone else's data centre (public cloud)," he says.

Robert Castley, senior performance engineer for EMEA at Catchpoint Systems, says the incident highlights that 100% uptime is unrealistic, even for infrastructure as impressive as AWS. "The fact that these major websites and services were completely unavailable wasn't Amazon's fault. We shouldn't forget that the cloud is simply a patchwork of servers, switches and code, which means that it's vulnerable to potential outages and performance issues," he adds.

Lessons learnt from cloud failure

Perhaps the most important lesson to be learnt is that businesses should take precautions for when situations like this occur.

"In the case of the AWS outage, we have to be clear that it isn't Amazon's responsibility to create a redundancy plan for its customers, but it is responsible to mitigate outages of its systems and get them back online as fast as possible," says Castley.

Castley says organisations can learn three key lessons from this. First, it's important to monitor your own services, alongside third-party apps, on a regular basis, as this helps you catch performance issues in a timely manner. Second, it's crucial to learn about the technology used by third-party apps, as they are vital for your customers' experience. "For example, who they rely on for hosting their technology and who their DNS provider is are critical things to be aware of," he says.

Third, users need to be given accurate and timely information at all times. It's important to be transparent about the issue and provide your customers with regular updates on the social media platforms they use, says Castley.
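The first of those lessons, regular monitoring of your own and third-party services, can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a description of any vendor's product: the names `check_services`, `http_probe` and `slow_after_s`, and the two-second slowness threshold, are all hypothetical, and the probe is passed in as a function so the check can be rehearsed without touching the network.

```python
import time
import urllib.request
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class CheckResult:
    name: str
    ok: bool
    latency_s: Optional[float]  # None when the probe failed outright
    detail: str


def http_probe(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status code for url.

    urlopen raises on network errors and on 4xx/5xx responses.
    """
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.status


def check_services(
    services: Dict[str, str],
    probe: Callable[[str], int] = http_probe,
    slow_after_s: float = 2.0,  # assumed threshold, tune per service
) -> List[CheckResult]:
    """Probe each named endpoint once and flag failures or slow responses."""
    results = []
    for name, url in services.items():
        start = time.monotonic()
        try:
            status = probe(url)
        except Exception as exc:
            # Any exception from the probe counts as a failed check.
            results.append(CheckResult(name, False, None, f"failed: {exc}"))
            continue
        latency = time.monotonic() - start
        # Healthy means a non-error status returned within the threshold.
        ok = 200 <= status < 400 and latency <= slow_after_s
        results.append(
            CheckResult(name, ok, latency, f"HTTP {status} in {latency:.2f}s")
        )
    return results
```

Run on a schedule against both your own health endpoints and those of the third-party apps you depend on, a check like this gives an early signal that can feed the customer updates described above.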

Contingency plans

What contingency plans and procedures should organisations put in place to get through an outage and minimise the disruption to business when the cloud fails? When organisations move critical services and data like email off premises, they must plan for the inevitability that the service will go down, just as they would with business continuity solutions on their own infrastructure, according to Dan Sloshberg, cyber resilience expert at Mimecast.

"Rather than maintain LAN tethers, this should be done through a secondary cloud service that can work seamlessly with primary providers to ensure business continuity and maintain data access," he says.
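In code, falling back to a secondary cloud service might look something like the sketch below. It is a simplified illustration rather than how any particular provider works: `call_with_failover`, `AllProvidersFailed` and the two stand-in provider functions are hypothetical names, and a real deployment would also need to keep data in sync between the providers.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


class AllProvidersFailed(Exception):
    """Raised when neither the primary nor any secondary provider responds."""


def call_with_failover(
    providers: Sequence[Tuple[str, Callable[[], T]]]
) -> Tuple[str, T]:
    """Try each provider in order and return (name, result) from the first
    one that succeeds. Providers are listed primary first."""
    errors: List[str] = []
    for name, operation in providers:
        try:
            return name, operation()
        except Exception as exc:
            # Record the failure and move on to the next provider.
            errors.append(f"{name}: {exc}")
    raise AllProvidersFailed("; ".join(errors))


# Hypothetical stand-ins for two email services; the primary simulates an outage.
def send_via_primary() -> str:
    raise ConnectionError("primary cloud unavailable")


def send_via_secondary() -> str:
    return "message accepted"


used, outcome = call_with_failover(
    [("primary", send_via_primary), ("secondary", send_via_secondary)]
)
print(used, outcome)
```

Deliberately making the primary fail, as the stand-in function does here, is also a cheap way to rehearse an outage and confirm that the secondary path really works.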

As some outages are down to human error, vendors should limit the ability of humans to cause them, according to Oliver Pinson-Roxburgh, EMEA director at Alert Logic. "In addition, due to the speed and agility of the cloud, organisations can easily cause their own outages. Also, we have seen examples where hackers have put organisations out of business by deleting whole workloads in the cloud that weren't backed up," he says.

He adds that contingency plans should be in place before moving workloads to the cloud.

"The more subscribers there are to a cloud service, the bigger the impact of an outage becomes to individual customers and the broader community. Those that put contingencies in place will ultimately have an advantage over those that simply hope for the best."

Mitigating outages

Sloshberg says it's important to put a risk mitigation strategy in place for key systems that have been moved to the cloud, and to test that plan regularly. "By staging a cloud outage, organisations can fully understand how the business will cope if this does occur. Taking a cyber resilience approach, IT teams should ensure critical infrastructure, like email, can continue to operate during a primary cloud system outage, key data is backed up and remains available to search and access and importantly, security layers remain active in order to protect the organisation."

He adds that early detection and visibility of any issues, together with the right technology and well-tested processes, can help mitigate the impact of an outage and get the business back up and running more quickly when primary systems come back online.

It's possible to design complete, robust business continuity, disaster avoidance and recovery solutions so that the effects of an outage can be mitigated. "It's important to have the right processes and procedures in place so people know what to do -- or ensure it's managed for businesses by the right managed service provider, who can take control [of] the situation," says Roy Wood, managing director of IT services at Advanced.

Dealing with future outages

Although outages are likely to be a problem for a while yet, as no one solution can promise 100% uptime, a growing number of organisations are using containers to deploy their production workloads. Can this emerging technology help reduce the risk of cloud outages?

"Containers abstract the virtualisation layer of traditional hypervisors by running lightweight operating systems that only include the libraries and binaries required for the application. This means that they are very easily portable and entire environments can be quickly and easily deployed," says James Hooper, senior manager for Virtual Data Centre Solutions at Interoute.

He adds that container management platforms enhance this further by adding orchestration capabilities, allowing customers to provision their container infrastructure on any cloud irrespective of the location.

"There are now a number of providers on the market offering container platforms and managed services for these, allowing customers to focus on the applications themselves whilst the provider manages the underlying infrastructure and container platforms," he says.

Rene Millman

Rene Millman is a freelance writer and broadcaster who covers cybersecurity, AI, IoT, and the cloud. He also works as a contributing analyst at GigaOm and has previously worked as an analyst for Gartner covering the infrastructure market. He has made numerous television appearances to give his views and expertise on technology trends and companies that affect and shape our lives.