Inside a cloud outage
Businesses must adopt proactive planning for cloud outages – but what does that look like?
The end of October was punctuated by a series of major cloud outages, first at AWS and then at Microsoft, which took a wide range of websites and business applications offline.
In the previous episode, we spoke about this in a reactive sense – the customers immediately impacted and the likely causes.
But it's also important to break the problem down at a strategic and technical level. Just how do outages at this scale occur – and what’s it like as an insider, fighting to bring services back online?
In this episode, Rory speaks to James Kretchmar, SVP & CTO of the cloud technology group at Akamai Technologies, to get an insider’s perspective on cloud outages and how businesses can navigate these incidents.
Highlights
"The worst feeling in the world is to be in the middle of an incident and realize that it would be a great thing that you could do to resolve that incident, if only a tool had been built before, right? So it'd be great if you figure that out before you get into that incident, and then you have the tool ready to go. "
"[O]ne of the things that's actually very satisfying in an incident is we've had circumstances where one system does start to fail, but we had built a safety system and it kicks in, and you see that it works. You know, it's immensely satisfying."
"The big difference between a short outage and a long outage is, 'do we know immediately how to remediate a problem of this nature?', versus 'we're not sure and/or we have to be careful not to cause a bigger problem."
"Well, one way or another, the 'what if' scenario planning has to happen. I think the extent to which you do that in real world circumstances, that'll be case by case, dependent of what makes the most sense to do. But the most important thing is to make sure that you do have that proactive and consistently, this is not something to do just after a big outage, as part of your normal work process, to be thinking, 'okay, what are the ways that things can go wrong. And then, yes, actually stepping through 'okay, if this happened, what would we do? Do we have the facilities available to be able to debug this problem, to pull this mitigator into place?'"
Footnotes
- Amazon Web Services outage live: Hundreds of apps including Slack, mobile carriers, banking services down
- The AWS outage brought much of the web to its knees: Here's how it happened, who it affected, and how much it might cost
- The Microsoft Azure outage explained: What happened, who was impacted, and what can we learn from it?
- Australia internet banking outage blamed on DDoS mitigation service
- Why the CrowdStrike outage was a wakeup call for developer teams

Rory Bathgate is Features and Multimedia Editor at ITPro, overseeing all in-depth content and case studies. He can also be found co-hosting the ITPro Podcast with Jane McCallion, swapping a keyboard for a microphone to discuss the latest learnings with thought leaders from across the tech sector.
In his free time, Rory enjoys photography, video editing, and good science fiction. After graduating from the University of Kent with a BA in English and American Literature, Rory undertook an MA in Eighteenth-Century Studies at King’s College London. He joined ITPro in 2022 as a graduate, following four years in student journalism. You can contact Rory at rory.bathgate@futurenet.com or on LinkedIn.