The Cloudflare outage explained: What happened, who was impacted, and what was the root cause?


Web users globally were met with error messages and site crashes yesterday after an outage at Cloudflare brought much of the web to a standstill.

The incident marks the latest in a string of major outages that have sent the web into meltdown this year.

Amazon Web Services (AWS) and Microsoft Azure, for example, both encountered technical faults that caused widespread disruption for consumers and enterprises alike.

Cloudflare services are now back to normal and websites are up and running again. As with previous incidents, however, the outage will have had a serious impact on operations for a wide range of enterprises.

How the Cloudflare outage unfolded

Disruption began at around 11.20am UK time (6.20am EST) and was initially described as an “internal service degradation” which impacted limited services.

Around that time, web users began reporting serious difficulties accessing a host of popular websites, services, and platforms.

OpenAI’s popular chatbot, ChatGPT, was among the services affected by the incident, along with social media site X and creative design platform Canva.

Online games such as League of Legends and Valorant were also impacted.

To add insult to injury, Downdetector, a site used to track web outages, was also temporarily taken down by the outage, preventing users from even checking the source of the disruption.

The cause of the outage

First and foremost, it’s important to note that the outage was not caused by a cyber attack of any kind.

In a blog post detailing the incident, Cloudflare CEO Matthew Prince said the company initially suspected that the outage was caused by a “hyper-scale” distributed denial of service (DDoS) attack. However, this was quickly dismissed.

Instead, Prince revealed the incident was “triggered by a change to one of our data systems’ permissions”.

While this appears rather vague, the root cause here lay with the company’s Bot Management system, which is used to protect against threats such as DDoS.

Customers use this system to “score” bot activity and to control which bots are allowed to access an individual site and which are blocked.

As Prince noted, this system includes a “machine learning model that we use to generate bot scores for every request traversing our network”.

The model takes a “feature file” as input, and this file refreshes every five minutes to keep pace with the behavior of bots attempting to access any given site.

The file is regenerated by a query running on a ClickHouse database cluster, according to Prince. Following the permissions change, the database began to “output multiple data entries”, which in turn meant the file doubled in size.
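To illustrate how a permissions change alone can double a query's output, here is a minimal Python sketch. It is not Cloudflare's code: the database names `db_a` and `db_b`, the `features` table, and the helper functions are all hypothetical. It assumes the file is built from column metadata and that the query does not filter by database, so when a second database holding the same table becomes visible, every entry appears twice.

```python
# Hypothetical column metadata: each row is (database, table, column_name).
# The names are illustrative only and do not reflect Cloudflare's real schema.
METADATA = [
    ("db_a", "features", "f1"),
    ("db_a", "features", "f2"),
    ("db_a", "features", "f3"),
    ("db_b", "features", "f1"),
    ("db_b", "features", "f2"),
    ("db_b", "features", "f3"),
]


def visible_columns(rows, visible_databases):
    """Return only the rows the querying user has permission to see."""
    return [row for row in rows if row[0] in visible_databases]


def feature_names(rows):
    """Build the feature list with no database filter and no de-duplication."""
    return [column for _database, _table, column in rows]


def feature_names_deduped(rows):
    """Same as feature_names, but keep only the first occurrence of each name."""
    seen = set()
    unique = []
    for _database, _table, column in rows:
        if column not in seen:
            seen.add(column)
            unique.append(column)
    return unique


# Before the permissions change only db_a is visible; afterwards both are.
before = feature_names(visible_columns(METADATA, {"db_a"}))
after = feature_names(visible_columns(METADATA, {"db_a", "db_b"}))
```

In this sketch `after` contains twice as many entries as `before`, even though the query itself never changed. De-duplicating the names, or filtering on a single database, would keep the output stable.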

“The larger-than-expected feature file was then propagated to all the machines that make up our network,” Prince explained. “The software had a limit on the size of the feature file that was below its doubled size. That caused the software to fail.”
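The failure mode Prince describes, a hard limit that is exceeded by an oversized input, can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than a description of Cloudflare's software: the limit value, the `FeatureStore` class, and the one-feature-per-line file format are hypothetical, and the limit is modelled as a count of entries. The sketch shows one defensive option, which is to reject the oversized file and keep serving the last known good version instead of failing outright.

```python
MAX_FEATURES = 4  # hypothetical limit, chosen only for this example


class FeatureFileTooLarge(ValueError):
    """Raised when a feature file holds more entries than the limit allows."""


def parse_feature_file(text, max_features=MAX_FEATURES):
    """Parse one feature name per line and enforce the entry limit."""
    features = [line.strip() for line in text.splitlines() if line.strip()]
    if len(features) > max_features:
        raise FeatureFileTooLarge(
            f"{len(features)} features exceeds limit of {max_features}"
        )
    return features


class FeatureStore:
    """Holds the active feature list and survives a bad refresh."""

    def __init__(self, initial_text):
        # The initial file must be valid; an error here is allowed to propagate.
        self.active = parse_feature_file(initial_text)

    def refresh(self, new_text):
        """Try to load a new file; on failure keep the last known good list."""
        try:
            self.active = parse_feature_file(new_text)
            return True
        except FeatureFileTooLarge:
            return False


store = FeatureStore("f1\nf2\nf3")
doubled_ok = store.refresh("f1\nf2\nf3\nf1\nf2\nf3")  # six entries, over the limit
```

Here the doubled file is rejected and `store.active` still holds the original three features. Whether to fail open like this or to stop serving traffic is a design decision that depends on what the file protects.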

Services are back up and running

Recovery efforts by Cloudflare meant that core traffic flows were back to normal by 2.30pm (UK time), with full recovery by 5.06pm.

Some users continued to experience difficulties in the hours that followed, but this was largely due to a surge in network traffic as services came back online.

As with previous outages in recent months, some industry stakeholders have used the incident to highlight the delicate infrastructural balancing act that keeps the modern economy running.

When one system or feature fails, it can have a devastating global impact, according to Brent Ellis, principal analyst at Forrester.

“In this case, the 3 hour, 20 minute outage could have direct and indirect losses of around $250 million to $300 million when you consider the cost of down-time and the downstream effects of services like Shopify or Etsy that host the stores for tens to hundreds of thousands of businesses.”


Ross Kelly is ITPro's News & Analysis Editor, responsible for leading the brand's news output and in-depth reporting on the latest stories from across the business technology landscape. Ross was previously a Staff Writer, during which time he developed a keen interest in cyber security, business leadership, and emerging technologies.

He graduated from Edinburgh Napier University in 2016 with a BA (Hons) in Journalism, and joined ITPro in 2022 after four years working in technology conference research.
