Inside a cloud outage
Businesses must adopt proactive planning for cloud outages – but what does that look like?
The end of October was punctuated by a series of major cloud outages, first at AWS and then at Microsoft, which took a wide range of websites and business applications offline.
In the previous episode, we spoke about this in a reactive sense – the immediate customers impacted and the likely causes.
But it's also important to break the problem down at a strategic and technical level. Just how do outages at this scale occur – and what’s it like as an insider, fighting to bring services back online?
In this episode, Rory speaks to James Kretchmar, SVP & CTO of the cloud technology group at Akamai Technologies, to get an insider’s perspective on cloud outages and how businesses can navigate these incidents.
Highlights
"The worst feeling in the world is to be in the middle of an incident and realize that it would be a great thing that you could do to resolve that incident, if only a tool had been built before, right? So it'd be great if you figure that out before you get into that incident, and then you have the tool ready to go."
"[O]ne of the things that's actually very satisfying in an incident is we've had circumstances where one system does start to fail, but we had built a safety system and it kicks in, and you see that it works. You know, it's immensely satisfying."
"The big difference between a short outage and a long outage is, 'do we know immediately how to remediate a problem of this nature?', versus 'we're not sure and/or we have to be careful not to cause a bigger problem.'"
"Well, one way or another, the 'what if' scenario planning has to happen. I think the extent to which you do that in real world circumstances, that'll be case by case, dependent of what makes the most sense to do. But the most important thing is to make sure that you do have that proactive and consistently, this is not something to do just after a big outage, as part of your normal work process, to be thinking, 'okay, what are the ways that things can go wrong.' And then, yes, actually stepping through 'okay, if this happened, what would we do? Do we have the facilities available to be able to debug this problem, to pull this mitigator into place?'"
Footnotes
- Amazon Web Services outage live: Hundreds of apps including Slack, mobile carriers, banking services down
- The AWS outage brought much of the web to its knees: Here's how it happened, who it affected, and how much it might cost
- The Microsoft Azure outage explained: What happened, who was impacted, and what can we learn from it?
- Australia internet banking outage blamed on DDoS mitigation service
- Why the CrowdStrike outage was a wakeup call for developer teams

Rory Bathgate is Features and Multimedia Editor at ITPro, overseeing all in-depth content and case studies. He can also be found co-hosting the ITPro Podcast with Jane McCallion, swapping a keyboard for a microphone to discuss the latest learnings with thought leaders from across the tech sector.
In his free time, Rory enjoys photography, video editing, and good science fiction. After graduating from the University of Kent with a BA in English and American Literature, Rory undertook an MA in Eighteenth-Century Studies at King’s College London. He joined ITPro in 2022 as a graduate, following four years in student journalism. You can contact Rory at rory.bathgate@futurenet.com or on LinkedIn.