The same incident never happens twice, but the patterns recur over and over


“No man ever steps in the same river twice. For it’s not the same river and he’s not the same man” – attributed to Heraclitus

After an incident happens, many people within the organization worry about the same incident happening again. In one sense, the same incident can never really happen again, because the organization has changed since the incident happened. Incident responders will almost certainly be more effective at dealing with a failure mode they’ve encountered recently than one they’re hitting for the first time.

In fairness, if the database falls over again, saying, “well, actually, it’s not the same incident as last time, because we now have experience with the database falling over and so we were able to recover more quickly” isn’t very reassuring to the organization. People are worried that there’s an imminent risk that remains unaddressed, and saying “it’s not the same incident as last time” doesn’t alleviate that concern.

But I think that people tend to look at the wrong level of abstraction when they talk about addressing risks that were revealed by the last incident. They suffer from what I’ll call no-more-snow-goon-ism:

[Comic: Calvin is focused on ensuring the last incident doesn’t happen again]

Saturation is an example of a higher-level pattern that I never hear people talk about when they focus on eliminating incident recurrence. I will assert that saturation is an extremely common pattern in incidents: I’ve brought it up when writing about public incident writeups at Canva, Slack, OpenAI, Cloudflare, Uber, and Rogers. The reason you won’t hear people discuss saturation is that they are generally too focused on the specific saturation details of the last incident. But because there are so many resources you can run out of, there are many different possible saturation failure modes. You can exhaust CPU, memory, disk, threadpools, or bandwidth; you can hit rate limits; you can even breach limits that you didn’t know existed and that aren’t exposed as metrics. It’s amazing how much different stuff there is that you can run out of.
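To make the pattern concrete, here is a minimal sketch (my own illustration, not from any of the incidents above) of one instance of saturation: a bounded work queue that sheds load when full. The `BoundedWorker` class and `SaturatedError` exception are hypothetical names; the same shape applies whether the bounded resource is queue slots, connections, memory, or a vendor’s rate limit.

```python
import queue


class SaturatedError(Exception):
    """Raised when a bounded resource is exhausted."""


class BoundedWorker:
    """Hypothetical service with a fixed-capacity work queue."""

    def __init__(self, capacity: int):
        # The bounded resource: only `capacity` jobs may be queued at once.
        self.jobs = queue.Queue(maxsize=capacity)

    def submit(self, job):
        try:
            # put_nowait fails fast instead of blocking when the queue is
            # full: explicit load shedding rather than unbounded queueing.
            self.jobs.put_nowait(job)
        except queue.Full:
            raise SaturatedError("work queue full: shedding load")


w = BoundedWorker(capacity=2)
w.submit("a")
w.submit("b")
try:
    w.submit("c")  # third submission exceeds capacity
except SaturatedError as e:
    print(e)  # prints "work queue full: shedding load"
```

The design choice here (reject new work rather than queue it indefinitely) is itself a reliability trade-off: without the explicit bound, the saturation still happens, but it shows up later and more mysteriously as memory exhaustion.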

My personal favorite pattern is unexpected behavior of a subsystem whose primary purpose was to improve reliability, and it’s one of the reasons I’m so bearish about the emphasis on corrective actions in incident reviews. But there are many other patterns you can identify. If you hit an expired certificate, you may think of “expired certificate” as the problem, but time-based behavior change is a more general pattern for that failure mode. And, of course, there’s the ever-present production pressure.

If you focus too narrowly on preventing the specific details of the last incident, you’ll fail to identify the more general patterns that will enable your future incidents. Under this narrow lens, every incident will look like either a recurrence of a previous incident (“the database fell over again!”) or a completely novel and unrelated failure mode (“we hit an invisible rate limit with a vendor service!”). Without seeing the higher-level patterns, you won’t understand how those very different looking incidents are actually more similar than you think.
