Inside a cloud outage
Businesses must adopt proactive planning for cloud outages – but what does that look like?
The end of October was punctuated with a series of major cloud outages, first at AWS and then at Microsoft, bringing a wide range of websites and business applications offline.
In the previous episode, we spoke about this in a reactive sense – the immediate customers impacted and the likely causes.
But it's also important to break the problem down at a strategic and technical level. Just how do outages at this scale occur – and what’s it like as an insider, fighting to bring services back online?
In this episode, Rory speaks to James Kretchmar, SVP & CTO of the cloud technology group at Akamai Technologies, to get an insider’s perspective on cloud outages and how businesses can navigate these incidents.
Highlights
"The worst feeling in the world is to be in the middle of an incident and realize that it would be a great thing that you could do to resolve that incident, if only a tool had been built before, right? So it'd be great if you figure that out before you get into that incident, and then you have the tool ready to go."
"[O]ne of the things that's actually very satisfying in an incident is we've had circumstances where one system does start to fail, but we had built a safety system and it kicks in, and you see that it works. You know, it's immensely satisfying."
"The big difference between a short outage and a long outage is, 'do we know immediately how to remediate a problem of this nature?', versus 'we're not sure and/or we have to be careful not to cause a bigger problem.'"
"Well, one way or another, the 'what if' scenario planning has to happen. I think the extent to which you do that in real-world circumstances, that'll be case by case, dependent on what makes the most sense to do. But the most important thing is to make sure that you do have that proactive and consistently, this is not something to do just after a big outage, as part of your normal work process, to be thinking, 'okay, what are the ways that things can go wrong?' And then, yes, actually stepping through, 'okay, if this happened, what would we do? Do we have the facilities available to be able to debug this problem, to pull this mitigator into place?'"
Footnotes
- Amazon Web Services outage live: Hundreds of apps including Slack, mobile carriers, banking services down
- The AWS outage brought much of the web to its knees: Here's how it happened, who it affected, and how much it might cost
- The Microsoft Azure outage explained: What happened, who was impacted, and what can we learn from it?
- Australia internet banking outage blamed on DDoS mitigation service
- Why the CrowdStrike outage was a wakeup call for developer teams
Subscribe
- Subscribe to The IT Pro Podcast on Apple Podcasts
- Subscribe to The IT Pro Podcast on Spotify
- Subscribe to the IT Pro newsletter
- Join us on LinkedIn

Rory Bathgate is Features and Multimedia Editor at ITPro, overseeing all in-depth content and case studies. He can also be found co-hosting the ITPro Podcast with Jane McCallion, swapping a keyboard for a microphone to discuss the latest learnings with thought leaders from across the tech sector.
In his free time, Rory enjoys photography, video editing, and good science fiction. After graduating from the University of Kent with a BA in English and American Literature, Rory undertook an MA in Eighteenth-Century Studies at King’s College London. He joined ITPro in 2022 as a graduate, following four years in student journalism. You can contact Rory at rory.bathgate@futurenet.com or on LinkedIn.