The Microsoft Azure outage explained: What happened, who was impacted, and what can we learn from it?

Microsoft has confirmed its Azure cloud services are back online after a major outage impacted services across multiple regions.

It's a tough time to be a cloud provider, with the Azure incident coming hot on the heels of an AWS outage that knocked a wide range of companies and services offline last week.

According to Microsoft, the Azure outage was caused by a CDN configuration error, lasted eight hours, and affected a wide range of Microsoft customers, including Alaska Airlines and the websites of Heathrow and NatWest, among others.

Even Microsoft's own 365 suite was affected by the outage, while proceedings at the Scottish Parliament in Edinburgh were paused due to technical difficulties.

In a service update, Microsoft said error rates and latency are “back to pre-incident levels”, but warned that a “smaller number of customers may still be seeing issues”, echoing the turbulent recovery period experienced by AWS customers last week.

The root cause was a configuration fault in a content delivery network (CDN) system. As with the AWS incident, the interlinked nature of these systems meant that once the outage started, the disruption quickly cascaded into a wider problem.

What happened with the Azure outage?

Microsoft has detailed its initial take on the outage on the Azure status page, noting that the incident started at 3.45pm UTC and was spotted within 20 minutes, with manual recovery beginning at 6.45pm. The outage was resolved by midnight.

The outage affected customers and Microsoft services using Azure Front Door (AFD), which is Microsoft's content delivery network service, causing latency issues, timeouts and errors.

"An inadvertent tenant configuration change within Azure Front Door (AFD) triggered a widespread service disruption affecting both Microsoft services and customer applications dependent on AFD for global content delivery," the update said.

Microsoft added that the faulty change wasn't caught because protection mechanisms failed due to a software bug.

"The change introduced an invalid or inconsistent configuration state that caused a significant number of AFD nodes to fail to load properly, leading to increased latencies, timeouts, and connection errors for downstream services."

As the nodes failed, they dropped out of the global pool, meaning traffic distribution across the nodes that remained became unbalanced. Microsoft noted that this exacerbated the outage and expanded the impact, meaning some regions that weren't affected by the AFD issues also saw service degradation.
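
The arithmetic behind that knock-on effect can be sketched in a few lines of Python. This is an illustration only: the request rate, pool size, and per-node capacity below are hypothetical figures, not numbers from Microsoft, and the sketch assumes traffic is spread evenly across whichever nodes remain healthy.

```python
def load_per_node(total_rps: float, healthy_nodes: int) -> float:
    """Even share of traffic each healthy node must absorb."""
    if healthy_nodes <= 0:
        raise ValueError("no healthy nodes left in the pool")
    return total_rps / healthy_nodes


# Hypothetical figures, chosen only to illustrate the effect.
TOTAL_RPS = 1_000_000   # steady global request rate
POOL_SIZE = 100         # nodes in the pool before the incident
NODE_CAPACITY = 15_000  # requests per second one node can serve

for failed in (0, 20, 40):
    healthy = POOL_SIZE - failed
    load = load_per_node(TOTAL_RPS, healthy)
    status = "ok" if load <= NODE_CAPACITY else "overloaded"
    print(f"{failed} failed -> {load:,.0f} rps per node ({status})")
```

With these made-up numbers, the surviving nodes cope when a fifth of the pool is lost but tip over capacity once 40 nodes have dropped out, which is the kind of spillover that can degrade service for users who were never routed to a faulty node.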

To recover, Microsoft blocked all configuration changes and rolled back to the "last known good" version, which required reloading configurations across a large number of nodes. The recovery was deliberately slow to avoid overloading systems as they came back online.
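
A phased rollback of this kind might look something like the sketch below. It is not Microsoft's tooling: `phased_rollback`, `reload_node`, and the batch size are hypothetical names and values, and the sketch simply assumes that each node can be asked to reload the last known good configuration and report whether it succeeded.

```python
from typing import Callable, Iterable, List


def batches(items: List[str], size: int) -> Iterable[List[str]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def phased_rollback(
    nodes: List[str],
    reload_node: Callable[[str], bool],
    batch_size: int,
) -> List[str]:
    """Reload nodes batch by batch; stop after the first batch with a failure.

    Returns the nodes that were reloaded successfully.
    """
    restored: List[str] = []
    for batch in batches(nodes, batch_size):
        results = [(node, reload_node(node)) for node in batch]
        restored.extend(node for node, ok in results if ok)
        if not all(ok for _, ok in results):
            break  # pause the rollout rather than push a bad state further
    return restored


# Hypothetical usage: every node reloads cleanly.
nodes = [f"node-{i}" for i in range(10)]
restored = phased_rollback(nodes, reload_node=lambda n: True, batch_size=3)
print(len(restored))
```

Working in small batches trades speed for safety: each batch limits how many nodes come back at once, and a failed reload halts the rollout before it spreads.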

"This deliberate, phased recovery was necessary to stabilize the system while restoring scale and ensuring no recurrence of the issue," Microsoft said, adding it would publish a more detailed review in the next two weeks.

Time to diversify?

With two major outages at US hyperscalers impacting organizations around the world, it's no surprise that questions are being raised about concentrating too much in the hands of a few American tech providers.

AWS holds about 30% of the market, for example, while Azure commands a 20% share, underlining the tight grip both these companies have on the broader cloud computing landscape.

Mark Boost, CEO of Civo, said this latest incident should serve as a warning about diversification within the industry.

"Two of the world’s biggest cloud providers have suffered major outages in the space of a week," he said.

"It’s a wake-up call for governments and enterprises alike, why are so many critical UK institutions, from HMRC to major banks and airports, reliant on infrastructure hosted thousands of miles away?"

Nicky Stewart, Senior Advisor to the Open Cloud Coalition, echoed Boost’s comments on diversification, adding that interoperability and resilience are two key issues that need to be addressed.

"Successive outages on this scale show how a single technical fault can ripple through essential services, public infrastructure, and the wider economy," said Stewart.

"The pattern of repeated disruption underlines the urgent need for diversification, and the CMA must move at pace to implement remedies that foster a more open, competitive, and interoperable cloud market, one where resilience comes from choice, not dependence on the two dominant providers."

Freelance journalist Nicole Kobie first started writing for ITPro in 2007, with bylines in New Scientist, Wired, PC Pro and many more.

Nicole is the author of a book about the history of technology, The Long History of the Future.