The AWS outage brought much of the web to its knees: Here's how it happened, who it affected, and how much it might cost

Apps and websites impacted by the AWS outage have recovered after a highly disruptive start to the week

Hundreds of applications and websites impacted by the AWS outage are now back online and operating as normal. However, questions still remain over the cost of the incident and a growing reliance on a select few cloud providers underpinning the digital economy.

From Monday morning through to the afternoon, services running on AWS suffered outages that hit websites as well as payment services.

AWS said the incident was triggered by a DNS issue and that services were largely up and running by the afternoon, though a backlog of messages would take time to work through.

The incident follows last year's CrowdStrike outage and a similar outage by AWS in 2021, highlighting our overreliance on digital systems and networks run by a few companies – a fact noted by many industry experts over the last day or so.

"AWS powers millions of websites and applications, elevating a technical glitch from an inconvenience to a global disruption," noted Forrester principal analyst Brent Ellis.

Ellis added that it was no surprise that DNS was at the heart of the issue.

"This particular outage exposes core issues with cloud resilience that stem from overreliance on services such as DNS, which were not architected for cloud-era technology demands," he said.

A timeline of the AWS outage

The outage began late Sunday night, AWS said in a statement, with increased error rates in what it calls the US-EAST-1 Region, which is northern Virginia – home to a huge swathe of data centers.

By just after midnight PDT, Amazon had traced the problem to DNS resolution issues for regional DynamoDB service endpoints. It mitigated the issue by 2:24am PDT, or about 10:24am in the UK. However, the nature of the problem meant full recovery took much longer.

"After resolving the DynamoDB DNS issue, AWS services began recovering, but a small subset of internal subsystems continued to be impaired," AWS noted.

"To facilitate full recovery, we temporarily throttled some impaired operations such as EC2 instance launches."

Alongside the EC2 impairments, AWS said it then spotted a problem in Network Load Balancer health checks, leading to network connectivity issues in multiple services, including Lambda, DynamoDB, and CloudWatch. Those were fixed by 9:38am.

"This kind of configuration-level failure, while not unusual in large-scale distributed systems, can cascade rapidly because so many online services rely on shared cloud infrastructure," said Dr Soohyun Jeon, Assistant professor/lecturer in Operation and Information Systems Management, Brunel University of London.

"What begins as a localized fault can therefore manifest globally within minutes, impacting platforms ranging from gaming and social media to banking, telecommunications, and government portals."

As part of the recovery effort, AWS revealed it “temporarily throttled” some operations, including EC2 instance launches and processing of SQS queues through Lambda Event Source Mappings.

"Over time we reduced throttling of operations and worked in parallel to resolve network connectivity issues until the services fully recovered."

The cloud computing giant said services were seeing "significant recovery" by lunchtime on Monday and that all services were back to normal by 3pm PDT, though many users still reported issues throughout the afternoon.

Who was impacted by the outage?

The widespread reliance on cloud computing providers such as AWS meant the impact was felt online and offline.

An array of major players saw outages on their services and platforms, including Canva, Coinbase, Duolingo, Perplexity, Reddit, Signal, Slack, Zoom and even Amazon itself.

This included issues with its Ring doorbells, Alexa smart assistants, and Kindle books.

Customers of banks including Lloyds and Halifax also reported issues, as did payment apps Venmo and Square. Media sites such as the New York Times and Wall Street Journal also saw errors and outages, as did UK government digital services.

"Even a single failure in a cloud region or streaming backbone can ripple across the stack, impacting everything from data movement to the models and applications that rely on it," said Douglas Wadkins, CTO at Opengear.

The financial cost of the AWS outage

The exact cost of the outage to businesses has yet to be determined. However, previous research shows that both website and IT downtime can be very costly.

Douglas Wadkins, CTO at Opengear, said the financial losses are often matched by reputational damage.

"As seen with the widespread disruption today, the consequences of downtime are severe and immediate – lost revenue and customer trust, with potential knock-on effects in today’s fragile macro-economic and geopolitical environment," he said.

Nicky Stewart, senior advisor at the Open Cloud Coalition, suggested the costs could be eye-watering once fully counted.

"It’s too soon to gauge the economic fallout, but for context, last year’s global CrowdStrike outage was estimated to have cost the UK economy between £1.7 and £2.3 billion," Stewart said.

Overreliance on AWS?

AWS commands a 30% share of the lucrative cloud computing market, with about $29.7bn in quarterly revenue, followed by Microsoft Azure and Google Cloud.

Combined, those three companies control nearly two-thirds of the entire global market. Is that too much? Stewart suggested the incident highlights a growing overreliance on a select few providers.

"Today’s massive AWS outage is a visceral reminder of the risks of over-reliance on two dominant cloud providers, an outage most of us will have felt in some way," Stewart said.

Of course, it's no surprise that companies rely on big players. They’ve been the go-to for enterprise for at least a decade at this stage, underpinning much of the modern digital economy.

But that’s not to say they’re immune to issues or failures, despite impressive uptime records.

"There’s great appeal to using tech giants, but assuming they are too big to fail or inherently resilient is a mistake, with the evidence being the current outage and past ones," noted Forrester's Ellis.

"The entrenchment of cloud, especially AWS, in modern enterprises, coupled with an interwoven ecosystem of SaaS services, outsourced software development, and virtually no visibility into dependencies, is not a bug – it’s a feature of a highly concentrated risk where even small service outages can ripple through the global economy."

Freelance journalist Nicole Kobie first started writing for ITPro in 2007, with bylines in New Scientist, Wired, PC Pro and many more.

Nicole is the author of a book about the history of technology, The Long History of the Future.