Google pins weekend outage on "unexercised" feature
Google Cloud outage impacted a wide range of customers last week after new feature wreaked havoc


Google has apologised for a Cloud outage last week that knocked offline many of its own customers, saying the fault was down to a new feature that wasn't properly tested – and promising to do better at communicating such incidents in the future.
On Thursday last week, outages on status trackers like Downdetector began to suggest an issue with tens of thousands of reports of outages for Google Cloud, Spotify, Discord, and more. After initial media reports, Google's Cloud status page posted a short update about the outage.
The company has now released a full incident report, pinning the blame for the outage on a policy check system called System Control that was given a new feature last month to allow it to make extra quota policy checks for API requests.
"This code change and binary release went through our region by region rollout, but the code path that failed was never exercised during this rollout due to needing a policy change that would trigger the code," Google explained. "As a safety precaution, this code change came with a red-button to turn off that particular policy serving path."
"The issue with this change was that it did not have appropriate error handling nor was it feature-flag protected," it added. "Without the appropriate error handling, the null pointer caused the binary to crash. Feature flags are used to gradually enable the feature region by region per project, starting with internal projects, to enable us to catch issues. If this had been flag-protected, the issue would have been caught in staging."
The policy change was added to Service Control on the day of the incident. "This policy data contained unintended blank fields," Google said. "Service Control, then regionally exercised quota checks on policies in each regional datastore. This pulled in blank fields for this respective policy change and exercised the code path that hit the null pointer causing the binaries to go into a crash loop."
Wide impact
Sign up today and you will receive a free copy of our Future Focus 2025 report - the leading guidance on AI, cybersecurity and other IT challenges as per 700+ senior executives
The outage hit a wide range of third-party companies, including OpenAI, Shopify, and MailChimp, as well as Cloudflare, which in turn impacted the customers of that web and cloud services provider.
Cloudflare said the Google troubles didn't touch its core services, but did cause outages in a large set of critical services including Workers KV, Gateway, Stream, and its Dashboard.
"We're deeply sorry for this outage: this was a failure on our part, and while the proximate cause (or trigger) for this outage was a third-party vendor failure, we are ultimately responsible for our chosen dependencies and how we choose to architect around them," Cloudflare said in a blog post.
"This was not the result of an attack or other security event," the post added. "No data was lost as a result of this incident. Cloudflare Magic Transit and Magic WAN, DNS, Cache, proxy, WAF and related services were not directly impacted by this incident."
Seven-hour outage explained
Google said the incident was spotted immediately with engineers on the case within two minutes, and the cause identified within ten minutes, and a fix rollout after 40 minutes. However, the knock-on effects of the incident lasted several hours, starting just before 11 am US Pacific time and ending after 6pm on Thursday.
As the systems restarted, it created a "herd effect on the underlying infrastructure it depends on" – and that overloaded the infrastructure, Google said. "Service Control did not have the appropriate randomized exponential backoff implemented to avoid this."
That took another two hours and 40 minutes to resolve – throttling the fix to avoid overloading the infrastructure and rerouting traffic elsewhere to reduce the load – but recovery time was longer for some Google and Cloud products depending on their architecture, the company said.
Going forward
Google admitted the incident would hurt its users and said it would take steps to avoid a similar outage.
"We deeply apologize for the impact this outage has had. Google Cloud customers and their users trust their businesses to Google, and we will do better," Google said in a statement. "We apologize for the impact this has had not only on our customers' businesses and their users but also on the trust of our systems. We are committed to making improvements to help avoid outages like this moving forward."
Those changes include rejigging the architecture of Service Control so it's modular, meaning any failures won't take out entire systems, as well as improving its communications to help customers react more quickly – which may be a response to the fact that customers spotted the outage and tweeted about it well before Google made any public announcement.
Freelance journalist Nicole Kobie first started writing for ITPro in 2007, with bylines in New Scientist, Wired, PC Pro and many more.
Nicole the author of a book about the history of technology, The Long History of the Future.
-
Crayon targets mid-market gains with expanded Google Cloud partnership
News The collaboration will enable mid-market channel partners to deliver Google Cloud’s AI technologies and cloud solutions
-
Google is getting serious on cloud sovereignty
News Google has joined Microsoft in bolstering its sovereign cloud services as tensions grow over US influence on big tech providers.
-
Google Cloud wants to tackle cyber complexity – here's how it plans to do it
News Google Unified Security will combine all the security services under Google’s umbrella in one combined cloud platform
-
Google Cloud Next 2025: All the live updates as they happened
Live Blog Google Cloud Next 2025 is officially over – here's everything that was announced and shown off in Las Vegas
-
Google Cloud Next 2025 is the hyperscaler’s chance to sell itself as the all-in-one AI platform for enterprises
Analysis With a focus on the benefits of a unified approach to AI in the cloud, the ‘AI first’ cloud giant can build on last year’s successes
-
The Wiz acquisition stakes Google's claim as the go-to hyperscaler for cloud security – now it’s up to AWS and industry vendors to react
Analysis The Wiz acquisition could have monumental implications for the cloud security sector, with Google raising the stakes for competitors and industry vendors.
-
Google confirms Wiz acquisition in record-breaking $32 billion deal
News Google has confirmed plans to acquire cloud security firm Wiz in a deal worth $32 billion.
-
Microsoft hit with £1 billion lawsuit over claims it’s “punishing UK businesses” for using competitor cloud services
News Customers using rival cloud services are paying too much for Windows Server, the complaint alleges