Inside a cloud outage

Businesses must adopt proactive planning for cloud outages – but what does that look like?


The end of October was punctuated by a series of major cloud outages, first at AWS and then at Microsoft, which took a wide range of websites and business applications offline.

In the previous episode, we spoke about this in a reactive sense – the customers immediately affected and the likely causes.

But it's also important to break the problem down at a strategic and technical level. Just how do outages at this scale occur – and what’s it like as an insider, fighting to bring services back online?

In this episode, Rory speaks to James Kretchmar, SVP & CTO of the cloud technology group at Akamai Technologies, to get an insider’s perspective on cloud outages and how businesses can navigate these incidents.

Highlights

"The worst feeling in the world is to be in the middle of an incident and realize that it would be a great thing that you could do to resolve that incident, if only a tool had been built before, right? So it'd be great if you figure that out before you get into that incident, and then you have the tool ready to go. "

"[O]ne of the things that's actually very satisfying in an incident is we've had circumstances where one system does start to fail, but we had built a safety system and it kicks in, and you see that it works. You know, it's immensely satisfying."

"The big difference between a short outage and a long outage is, 'do we know immediately how to remediate a problem of this nature?', versus 'we're not sure and/or we have to be careful not to cause a bigger problem."

"Well, one way or another, the 'what if' scenario planning has to happen. I think the extent to which you do that in real world circumstances, that'll be case by case, dependent of what makes the most sense to do. But the most important thing is to make sure that you do have that proactive and consistently, this is not something to do just after a big outage, as part of your normal work process, to be thinking, 'okay, what are the ways that things can go wrong. And then, yes, actually stepping through 'okay, if this happened, what would we do? Do we have the facilities available to be able to debug this problem, to pull this mitigator into place?'"


Rory Bathgate
Features and Multimedia Editor

Rory Bathgate is Features and Multimedia Editor at ITPro, overseeing all in-depth content and case studies. He can also be found co-hosting the ITPro Podcast with Jane McCallion, swapping a keyboard for a microphone to discuss the latest learnings with thought leaders from across the tech sector.

In his free time, Rory enjoys photography, video editing, and good science fiction. After graduating from the University of Kent with a BA in English and American Literature, Rory undertook an MA in Eighteenth-Century Studies at King’s College London. He joined ITPro in 2022 as a graduate, following four years in student journalism. You can contact Rory at rory.bathgate@futurenet.com or on LinkedIn.