Quick takes on the GCP public incident write-up

On Thursday (2025-06-12), Google Cloud Platform (GCP) had an incident that impacted dozens of their services, in all of their regions. They’ve already released an incident report (go read it!), and here are my thoughts and questions as I read it.

Note that my questions shouldn’t be read as a critique of the write-up, since the answers generally aren’t publicly shareable. They’re more in the “I wish I could be a fly on the wall inside of Google” category of questions.

Quick write-up

First, a meta-point: this is a very quick turnaround for a public incident write-up. As a consumer of these, I of course appreciate getting it faster, and I’m sure there was enormous pressure inside of the company to get a public write-up published as soon as possible. But I also think there are hard limits on how much you can actually learn about an incident when you’re on the clock like this. I assume that Google is continuing to investigate internally how the incident happened, and I hope that they publish another report several weeks from now with any additional details that they are able to share publicly.

Staging land mines across regions

Note that impact (June 12) happened two weeks after deployment (May 29).

This code change and binary release went through our region by region rollout, but the code path that failed was never exercised during this rollout due to needing a policy change that would trigger the code.

The system involved is called Service Control. Google stages their deploys of Service Control by region, which is a good thing: staging your changes is a way of reducing the blast radius if there’s a problem with the code. However, in this case, the problematic code path was not exercised during the regional rollout. Everything looked good in the first region, and so they deployed to the next region, and so on.

This is the land mine risk: the code you are rolling out contains a land mine that is not tripped during the rollout.
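
To make the shape of the failure concrete, here’s a minimal sketch of how a land mine like this can sit dormant. Everything here (the Policy type, the “quota” kind, the field names) is my invention, not Google’s actual code; the point is just that a branch no production data reaches during rollout will pass every regional stage untouched.

```go
package main

import "fmt"

// QuotaRule and Policy are hypothetical stand-ins for a quota policy record
// read from the regional Spanner tables; the field names are my invention.
type QuotaRule struct {
	Limit int
}

type Policy struct {
	Name      string
	Kind      string
	QuotaRule *QuotaRule
}

// checkPolicy stands in for the new Service Control code path. The quota
// branch only runs for policies of kind "quota", so a rollout during which
// no such policy exists never executes it; the land mine sits untripped.
func checkPolicy(p Policy) bool {
	if p.Kind != "quota" {
		return true // the only branch exercised during the two-week rollout
	}
	// Land mine: if the policy row arrives with a blank quota rule, this line
	// dereferences a nil pointer and crashes the binary.
	return p.QuotaRule.Limit > 0
}

func main() {
	// Every policy seen during the regional rollout looks like this, so the
	// deploy looks healthy in region after region.
	fmt.Println(checkPolicy(Policy{Name: "existing-policy", Kind: "iam"}))

	// Two weeks later, the first policy that actually reaches the new path
	// trips the land mine.
	defer func() {
		if r := recover(); r != nil {
			fmt.Println("binary crashed:", r)
		}
	}()
	fmt.Println(checkPolicy(Policy{Name: "new-quota-policy", Kind: "quota"}))
}
```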

How did the decisions make sense at the time?

The issue with this change was that it did not have appropriate error handling nor was it feature flag protected. Without the appropriate error handling, the null pointer caused the binary to crash.

This is the typical “we didn’t do X in this case and had we done X, this incident wouldn’t have happened, or wouldn’t have been as bad” sort of analysis that is very common in these write-ups. The problem with this is that it implies sloppiness on the part of the engineers, that important work was simply overlooked. We don’t have any sense of how the development decisions made sense at the time.

If this scenario was atypical (i.e., usually error handling and feature flags are added), what was different about this development case? We don’t have the context about what was going on during development, which means we (as external readers) can’t understand how this incident actually was enabled.

Feature flags are used to gradually enable the feature region by region per project, starting with internal projects, to enable us to catch issues. If this had been flag protected, the issue would have been caught in staging.

How do they know it would have been caught in staging, if it didn’t manifest in production until two weeks after roll-out? Are they saying that adding a feature flag would have led to manual testing of the problematic code path in staging? Here I just don’t know enough about Google’s development processes to make sense of this observation.
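
For what it’s worth, here’s a sketch of what “appropriate error handling plus a feature flag” could look like for the same hypothetical code path. Again, the flag name, project names, and types are mine, not Google’s; this is just to illustrate the two guards the write-up says were missing.

```go
package main

import "fmt"

// Hypothetical types, continuing the sketch above.
type QuotaRule struct{ Limit int }

type Policy struct {
	Name      string
	Kind      string
	QuotaRule *QuotaRule
}

// flagEnabled stands in for a dynamic per-project feature flag check; in a
// staged rollout it would return true first for internal/staging projects only.
func flagEnabled(project, flag string) bool {
	return project == "internal-staging" // everyone else stays on the old path
}

// checkPolicySafely guards the new code path with a flag and degrades
// gracefully instead of crashing when a policy row has blank fields.
func checkPolicySafely(project string, p Policy) (bool, error) {
	if !flagEnabled(project, "quota-policy-checks") {
		return true, nil // behave exactly like the pre-feature binary
	}
	if p.Kind != "quota" {
		return true, nil
	}
	if p.QuotaRule == nil {
		// Surface the bad data instead of dereferencing a nil pointer; whether
		// to fail open or closed here is its own policy decision.
		return true, fmt.Errorf("policy %q is missing its quota rule", p.Name)
	}
	return p.QuotaRule.Limit > 0, nil
}

func main() {
	bad := Policy{Name: "new-quota-policy", Kind: "quota"} // blank quota rule

	// The flag is off for production projects, so the bad row is harmless there.
	fmt.Println(checkPolicySafely("customer-prod", bad))

	// Where the flag is on, the bad row produces an error, not a crashed task.
	fmt.Println(checkPolicySafely("internal-staging", bad))
}
```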

Service Control did not have the appropriate randomized exponential backoff implemented to avoid [overloading the infrastructure].

As I discuss later, I’d wager it’s difficult to test for this in general, because the system generally doesn’t run in the mode that would exercise this. But I don’t have the context, so it’s just a guess. What’s the history behind Service Control’s backoff behavior? Without knowing its history, we can’t really understand how its backoff implementation came to be this way.
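
The pattern the write-up is referring to is at least easy to state, even if it’s hard to retrofit: retries back off exponentially and add jitter, so a fleet of tasks that all failed at the same moment doesn’t all retry at the same moment too. A generic sketch (not Google’s implementation, obviously):

```go
package main

import (
	"errors"
	"fmt"
	"math/rand"
	"time"
)

// retryWithJitter retries op with capped exponential backoff plus full jitter,
// so tasks that all failed at the same moment don't also retry at the same moment.
func retryWithJitter(op func() error, maxAttempts int) error {
	const base = 100 * time.Millisecond
	const maxBackoff = 30 * time.Second
	var err error
	for attempt := 0; attempt < maxAttempts; attempt++ {
		if err = op(); err == nil {
			return nil
		}
		backoff := base << attempt // exponential: base * 2^attempt
		if backoff > maxBackoff || backoff <= 0 {
			backoff = maxBackoff
		}
		// Full jitter: sleep a uniformly random duration in [0, backoff).
		time.Sleep(time.Duration(rand.Int63n(int64(backoff))))
	}
	return fmt.Errorf("giving up after %d attempts: %w", maxAttempts, err)
}

func main() {
	attempts := 0
	err := retryWithJitter(func() error {
		attempts++
		if attempts < 4 {
			return errors.New("policy table overloaded") // simulated overload
		}
		return nil
	}, 10)
	fmt.Println("result:", err, "after", attempts, "attempts")
}
```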

Red buttons and feature flags

As a safety precaution, this code change came with a red-button to turn off that particular policy serving path. The issue with this change was that it did not have appropriate error handling nor was it feature flag protected. (emphasis added)

Because I’m unfamiliar with Google’s internals, I don’t understand how their “red button” system works. In my experience, the “red button” type functionality is built on top of feature flag functionality, but that does not seem to be the case at Google, since here there was no feature flag, but there was a big red button.

It’s also interesting to me that, while this feature wasn’t feature-flagged, it was big-red-buttoned. There’s a story here! But I don’t know what it is.
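
For comparison, here’s roughly how I’ve seen the red-button-on-top-of-feature-flags pattern work elsewhere: the kill switch is just a dynamically served flag that defaults to on and can be slammed off without a deploy. I have no idea whether Google’s red button mechanism looks anything like this; it’s purely illustrative.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// Flag is a dynamically updatable boolean: the same primitive serves both
// gradual rollout (default off, ramp up) and a red button (default on,
// slam off) without a binary release.
type Flag struct{ enabled atomic.Bool }

func (f *Flag) Set(v bool)    { f.enabled.Store(v) }
func (f *Flag) Enabled() bool { return f.enabled.Load() }

var quotaPolicyChecks = &Flag{} // hypothetical flag guarding the new path

func servePolicy() string {
	if !quotaPolicyChecks.Enabled() {
		return "legacy policy path"
	}
	return "new quota policy path"
}

func main() {
	quotaPolicyChecks.Set(true) // rolled out
	fmt.Println(servePolicy())

	quotaPolicyChecks.Set(false) // "red button": an operator flips the flag off
	fmt.Println(servePolicy())
}
```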

New feature: additional policy quota checks

On May 29, 2025, a new feature was added to Service Control for additional quota policy checks… On June 12, 2025 at ~10:45am PDT, a policy change was inserted into the regional Spanner tables that Service Control uses for policies.

I have so many questions. What were these additional quota policy checks? What was the motivation for adding them (i.e., what problem are the new checks addressing)? Is this customer-facing functionality (e.g., GCP Cloud Quotas), or is it internal-only? What was the purpose of the policy change that was inserted on June 12 (or was it submitted by a customer)? Did that policy change take advantage of the new Service Control features that were added on May 29? Was it the first policy change since the new feature was deployed, or had there been others? How frequently do policy changes happen?

Global data changes

Given the global nature of quota management, this metadata was replicated globally within seconds.

While code and feature flag changes are staged across regions, apparently quota management metadata is designed to replicate globally.

Regardless of the business need for near instantaneous consistency of the data globally (i.e. quota management settings are global), data replication needs to be propagated incrementally with sufficient time to validate and detect issues. (emphasis mine)

The implication I take from the text is that there was a business requirement for quota management data changes to happen globally rather than being staged, and that they are now going to push back on that.

What was the rationale for this business requirement? What are the tradeoffs involved in staging these changes versus having them happen globally? What new problems might arise when data changes are staged like this?
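
As a way of making that tradeoff concrete, here’s a sketch of what “propagated incrementally with sufficient time to validate and detect issues” might look like in the abstract: apply the change to one region, wait out a bake period while health signals come in, and halt if something looks wrong. All of the names and the bake time are illustrative.

```go
package main

import (
	"fmt"
	"time"
)

// propagatePolicyChange pushes a global data change region by region, with a
// bake period and a validation hook between waves, instead of replicating it
// everywhere within seconds.
func propagatePolicyChange(change string, regions []string,
	apply func(region, change string) error,
	healthy func(region string) bool,
	bake time.Duration) error {

	for _, region := range regions {
		if err := apply(region, change); err != nil {
			return fmt.Errorf("applying %q in %s: %w", change, region, err)
		}
		time.Sleep(bake) // give any land mine time to go off in one region
		if !healthy(region) {
			// Stop the rollout; one region is affected instead of all of them.
			return fmt.Errorf("halting rollout: %s unhealthy after %q", region, change)
		}
	}
	return nil
}

func main() {
	regions := []string{"staging", "region-a", "region-b", "region-c"}
	err := propagatePolicyChange("quota-policy-update", regions,
		func(region, change string) error {
			fmt.Println("applied", change, "to", region)
			return nil
		},
		func(region string) bool { return true }, // plug in real health signals here
		10*time.Millisecond,                      // illustrative; a real bake time is much longer
	)
	fmt.Println("rollout result:", err)
}
```

The obvious cost is the one behind my questions above: during the bake period, different regions disagree about what the quota policy is.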

Are we going to be reading a GCP incident report in a few years that resulted from inconsistency of this data across regions due to this change?

Saturation!

Within some of our larger regions, such as us-central-1, as Service Control tasks restarted, it created a herd effect on the underlying infrastructure it depends on (i.e. that Spanner table), overloading the infrastructure.

Here we have a classic example of saturation, where a database got overloaded. Note that saturation wasn’t the trigger here, but it made recovery more difficult. A system is in a different mode during incident recovery than it is during normal operation, and it’s generally very difficult to test for how it will behave in recovery mode.
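
The usual mitigation for that kind of restart herd, separate from the per-request backoff above, is to stagger the startup work so restarting tasks don’t all hit the shared table at once. A toy sketch (again, mine, not Google’s):

```go
package main

import (
	"fmt"
	"math/rand"
	"sync"
	"time"
)

// staggeredStart delays a restarting task by a random amount before it touches
// the shared dependency, spreading post-crash load over a window instead of
// landing it on the database all at once.
func staggeredStart(taskID int, window time.Duration, loadPolicies func(int)) {
	time.Sleep(time.Duration(rand.Int63n(int64(window))))
	loadPolicies(taskID)
}

func main() {
	var wg sync.WaitGroup
	for id := 0; id < 5; id++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			staggeredStart(id, 500*time.Millisecond, func(taskID int) {
				// Stand-in for the expensive read of the shared policy table.
				fmt.Println("task", taskID, "reading policy table")
			})
		}(id)
	}
	wg.Wait()
}
```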

Does this incident match my conjecture?

I have a long-standing conjecture that once a system reaches a certain level of reliability, most major incidents will involve:

  • A manual intervention that was intended to mitigate a minor incident, or
  • Unexpected behavior of a subsystem whose primary purpose was to improve reliability

I don’t have enough information in this write-up to be able to make a judgment in this case: it depends on whether or not the quota management system’s purpose is to improve reliability. I can imagine it going either way. If it’s a public-facing system to help customers limit their costs, then that’s more of a traditional feature. On the other hand, if it’s to limit the blast radius of individual user activity, then that feels like a reliability improvement system.

What are the tradeoffs of the corrective actions?

The write-up lists seven bullets of corrective actions. The questions I always ask of corrective actions are:

  • What are the tradeoffs involved in implementing these corrective actions?
  • How might they enable new failure modes or make future incidents more difficult to deal with?