Reducing the time it takes to recover from an incident should be a priority for any organization. And, of course, any investment made toward that goal should be measured to ensure it is delivering value.
Conventional wisdom suggests you should create a metric that tracks the recovery duration of each incident, then aggregate those values over weeks, months or quarters, and correlate them to improvement initiatives.
You’ll probably call this rolled-up metric your Mean Time to Recovery (MTTR), something that provides a clear signal of how well your organizational efficiency efforts are improving the handling of failures.
It sounds logical. It probably feels intuitive.
Which is precisely the kind of bullshit logic MTTR surprises you with.
By reducing recovery to a single average, you’re implicitly assuming that all incidents are comparable — that their causes, resolution paths, and the human and technical actions involved are similar enough to be meaningfully aggregated.
But in real-world systems, that assumption rarely holds.
We work with complex, distributed systems: messy environments filled with emergent behaviors, hidden dependencies, and constant change. These systems evolve rapidly. Adapting to them is hard. Explaining why they aren’t behaving the way people think they should is challenging when you’re not the one in the trenches, coping with that complexity every day.
So, we reach for simplification. We condense complex, high-context work into something clear, digestible, and presentable on a dashboard or monthly report.
And MTTR, with its seductive intuitiveness, seems like the perfect solution. It’s a single metric. A clean number. A TL;DR for how your team handles failure.
But all MTTR really does is flatten complexity and obscure what truly matters: context, variance, and the profoundly human work of adaptation.
Apply even basic statistical reasoning and MTTR does not hold up to scrutiny in complex systems.
It’s a mean of highly variable time-series data
Incident durations range from seconds to hours — or even days. That kind of high variance means the mean (average) can be wildly misleading. It may not represent any real incident. Of course, you could remove the outliers, such as that 18-hour outage that ruined your team’s weekend, but that’s just fudging the numbers even further.
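To make that concrete, here’s a tiny sketch (with made-up durations) of how one long outage drags the average away from anything your responders actually experienced:

```python
from statistics import mean, median

# Hypothetical recovery durations for one quarter, in minutes.
# Nine routine incidents plus one 18-hour outage.
durations = [12, 8, 25, 15, 9, 31, 22, 14, 18, 1080]

print(f"MTTR (mean): {mean(durations):.1f} min")   # 123.4 min, longer than 9 of the 10 incidents
print(f"Median:      {median(durations):.1f} min") # 16.5 min, much closer to a "typical" incident
```

The mean lands at roughly two hours even though nine out of ten incidents were resolved in about half an hour or less. Drop the outlier and the number looks nicer, but the incident most worth studying just disappeared from the metric.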
MTTR lacks statistical control
MTTR often fails to meet the criteria of a stable metric. It’s influenced by outliers, incident types, team dynamics, and evolving systems. Trends can’t be trusted without deep context. Lorin Hochstein has written a great blog post that explains statistical control using MTTR.
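One common way to test for statistical control is a process behavior (XmR) chart. Here’s a rough sketch of that calculation with invented durations; the 2.66 constant is the standard factor for individuals charts:

```python
# Rough sketch of XmR-style natural process limits for incident durations (minutes).
durations = [42, 18, 95, 30, 12, 240, 55, 20, 480, 35]

mean_x = sum(durations) / len(durations)
moving_ranges = [abs(b - a) for a, b in zip(durations, durations[1:])]
mean_mr = sum(moving_ranges) / len(moving_ranges)

# Individuals chart limits: mean +/- 2.66 * average moving range
upper = mean_x + 2.66 * mean_mr
lower = max(0.0, mean_x - 2.66 * mean_mr)  # a duration can't be negative

print(f"mean={mean_x:.0f} min, natural process limits=({lower:.0f}, {upper:.0f})")
print("signals outside the limits:", [d for d in durations if not lower <= d <= upper])
```

With data this noisy, the limits span everything from zero to over nine hours, so even the eight-hour incident doesn’t register as a statistical signal, and neither does the month-over-month wobble in your average.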
Averages hide distribution shape
MTTR compresses an entire distribution into a single number. It hides skew, masks clusters of extreme values, and dulls the signal of outliers — the very events most worth studying. It encourages oversimplified decision-making and can wallpaper over real trends.
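Here’s another made-up illustration: two services with identical MTTR and completely different realities.

```python
from statistics import mean, median, quantiles

# All numbers are invented, in minutes.
steady = [28, 31, 29, 30, 32, 30, 29, 31, 30, 30]   # every incident takes about half an hour
spiky  = [5, 4, 6, 5, 5, 4, 6, 5, 5, 255]           # usually five minutes, one 4-hour outage

for name, data in [("steady", steady), ("spiky", spiky)]:
    cuts = quantiles(data, n=10)  # deciles: cuts[8] is (approximately) the 90th percentile
    print(f"{name}: mean={mean(data):.0f}  median={median(data):.0f}  "
          f"p90={cuts[8]:.0f}  max={max(data)}")
```

Both report an MTTR of 30 minutes. One is boringly predictable; the other is mostly fine until it ruins somebody’s evening. The average can’t tell you which of the two you’re running, but the median, p90, and max can.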
Think of MTTR as a Rube Goldberg machine stuck in reverse. Instead of watching a complex system perform a simple task, you’re staring at a deceptively simple number that hides a tangled web of complex human and technical actions.
You might see a Time to Recovery metric and assume it reflects a simple, linear process:
- It happened
- It was detected
- It was resolved
This is not incorrect, but there’s a lot more nuance to this number. Breaking it down into its constituent parts is essential to understanding where the value lies and where meaningful improvement is possible.
Incident Start: The moment something begins to go wrong. This should (mostly) be obvious. If you’re one to root-cause your way out of a crisis, you’ll 5-Why your way down to this timestamp pretty quickly. But if you lean toward systems thinking, you may recognize that “start time” isn’t always precise in complex systems; symptoms often lag causes.
Time to Detection (TTD): The time between the incident starting and it being detected — by either a machine or a human. Ideally, your monitoring and alerting systems catch this before your customers do. The faster this happens, the earlier your team can begin responding.
Time to Acknowledge (TTA): A human-centric metric: the time between when a notification system (like PagerDuty) alerts a responder and when a human acknowledges it. This is a window into on-call behavior and team discipline — and, in some organizations, a source of anxiety. A slow TTA can indicate burnout, unclear ownership, or misconfigured alerting.
Time to Resolve (TTR): The time it takes to bring the system back to a steady state — or at least a functional one. This includes everything that happens between acknowledgment and resolution, such as debugging, escalations, restarts, rollbacks, incident chats, workarounds, and patching. It’s a mix of human effort and automation — and that mix is never consistent.
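A minimal sketch of how those intervals fall out of an incident’s timeline. The field names and timestamps are illustrative, not taken from any particular tool, and where exactly "started_at" sits is often a judgment call:

```python
from datetime import datetime

# Illustrative incident timeline (field names are hypothetical).
incident = {
    "started_at":      datetime(2024, 3, 8, 14, 2),   # something begins to go wrong
    "detected_at":     datetime(2024, 3, 8, 14, 9),   # an alert fires, or a customer complains
    "acknowledged_at": datetime(2024, 3, 8, 14, 15),  # a human picks it up
    "resolved_at":     datetime(2024, 3, 8, 15, 47),  # steady state restored
}

ttd = incident["detected_at"] - incident["started_at"]       # Time to Detection
tta = incident["acknowledged_at"] - incident["detected_at"]  # Time to Acknowledge
ttr = incident["resolved_at"] - incident["acknowledged_at"]  # Time to Resolve

print(f"TTD={ttd}, TTA={tta}, TTR={ttr}, total={incident['resolved_at'] - incident['started_at']}")
```

It’s usually some version of the total that gets rolled up into MTTR, which is exactly the problem: three very different kinds of work, owned by different parts of the organization, get collapsed into a single duration.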
Why Time to Resolve Is Unreliable
Time to Resolve is an amorphous value. It absorbs all the messy human and technical work required to restore service, and it’s highly variable, influenced by factors like:
- Skill and experience of responders
- Tooling and automation
- System complexity
- Team fatigue
- Time of day/week
- The nature of the incident itself
As systems become more complex, this variability in TTR expands. Averaging it out (as MTTR does) gives you a false sense of control — and makes profoundly different scenarios look deceptively similar.
Fact: Incidents vary wildly; MTTR pretends they don’t.
Why Time to Detection Matters
This is where you can apply automation, tighten feedback loops, and reduce latency in incident response. The faster you detect an issue — ideally before customers notice — the sooner recovery can begin.
Good detection doesn’t just trigger alerts. It provides context: what’s broken, what it’s affecting, and how it’s evolving. The earlier you capture that context, the more effective and less painful the resolution effort will be.
And you don’t need to build detection from scratch. There’s a wide range of tools available, from monitoring and alerting platforms to centralized logging, distributed tracing, observability stacks, and AI-powered summarization. Whether open source or commercial, there’s a solution for nearly every scale and maturity level.
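As a rough illustration of “detection with context”, compare a bare alert to one that carries the information responders actually need. The payload shape, field names, and URL below are hypothetical, not any particular vendor’s schema:

```python
# A bare alert tells you that something fired.
bare_alert = {"name": "HighErrorRate", "severity": "critical"}

# An alert with context tells you where to start.
contextual_alert = {
    "name": "HighErrorRate",
    "severity": "critical",
    "service": "checkout-api",                                  # what's broken
    "symptom": "5xx rate 4.2% against a 1% threshold",          # how it's broken
    "blast_radius": ["payments", "order-history"],              # what it's affecting
    "trend": "error rate has doubled in the last 10 minutes",   # how it's evolving
    "recent_changes": ["checkout-api v342 deployed at 14:02"],
    "runbook": "https://runbooks.example.internal/checkout-5xx",  # hypothetical URL
}
```

The second alert doesn’t resolve anything by itself, but it removes the first twenty minutes of “what are we even looking at?” from the response.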
Systems emerging from total chaos
Imagine a mid-sized SaaS company that has experienced rapid growth over the past two years. They’ve scaled from a monolith to a loosely coupled microservices architecture — but without maturing their operational practices.
Until recently, they had:
- No centralized logging
- No alerting — outages were detected via customer complaints
- No incident response processes
- Zero observability into dependencies or request flows
As a result, every incident followed the same brutal pattern: something broke, and the team would scramble in Slack or on Zoom, trying to figure out what had happened, where, and why. MTTR, if measured at all, was inconsistent and mostly meaningless — detection times varied wildly, and resolution relied heavily on tribal knowledge.
But then the company invested in:
- A real observability stack (e.g., OpenTelemetry + Grafana + Loki + Prometheus)
- An alerting pipeline: proper thresholds, notifications, on-call rotations, and alert notes
- Centralized structured logging and correlation IDs
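To put the last item on that list in concrete terms, here is a minimal, illustrative sketch of structured logging with a correlation ID, so every log line produced while handling one request can be stitched back together across services. Service and field names are made up:

```python
import json
import uuid

def log(event, correlation_id, **fields):
    """Emit one structured (JSON) log line; in real life this ships to a central store."""
    print(json.dumps({"event": event, "correlation_id": correlation_id, **fields}))

def handle_checkout(order_id, correlation_id=None):
    # Reuse the caller's ID if one was propagated, otherwise start a new one.
    correlation_id = correlation_id or str(uuid.uuid4())
    log("request_received", correlation_id, service="checkout-api", order_id=order_id)
    # ... do the work, passing correlation_id to every downstream call and log line ...
    log("payment_requested", correlation_id, service="checkout-api", downstream="payments")

handle_checkout(order_id=1234)
```

Once every service does this, “what happened to order 1234?” becomes a single query instead of an archaeology exercise.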
And amazing things happened:
- Time to Detection dropped dramatically: Engineers no longer learn about issues from customers — alerts trigger within seconds or minutes.
- Acknowledgment is trackable: They use PagerDuty, so human response time is measured and improvable.
- Resolution patterns start to emerge: With logs, metrics, and traces aligned, engineers resolve issues quickly and consistently.
For a brief period, MTTR becomes a useful metric for this company. Not perfect, but good enough to:
- Measure the impact of improved observability
- Identify teams or services that need better instrumentation
- Track whether new alerting and on-call practices are working
It’s not that incidents are now uniform — but the chaos has been tamed enough that MTTR is showing the value of investing in incident detection and response.
However, the MTTR party will be short-lived as the rate of change continues to increase. Systems will grow in complexity, and incident response variability will increase accordingly. The early value of MTTR will fade, becoming less a reflection of operational health and more a reflection of statistical noise.
Static systems
Imagine a large-scale manufacturing company running a legacy, on-premises ERP system that processes payroll batches every Friday at 6 PM. The environment is tightly controlled — static infrastructure, minimal change, no new deployments, no scale-out events, and the same batch workload every week.
The system has one known failure mode: Every few months, a memory leak in the batch processor causes a job to crash halfway through. The fix is well-documented — restart the service and clear the temporary directory. It’s always the same incident, always resolved in the same way.
In this context, MTTR is meaningful. Why?
- Low variance: The failure mode is consistent, predictable, and repeatable.
- Resolution steps are identical: There’s no variance in how the fix is applied — no human creativity, no novel debugging.
- The environment doesn’t evolve: The software and infrastructure aren’t changing, so historical averages are actually representative of future performance.
- Control over inputs: There are few to no unknowns entering the system — no surprise interactions or emergent behaviors.
In this scenario, a reduction in MTTR reflects a real improvement, such as a faster restart script or an automated fix. If MTTR increases, it likely means something has degraded, such as knowledge gaps, slower escalations, or process drift.
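For a failure mode that stable, “an automated fix” can be almost embarrassingly small. A sketch, assuming a systemd-managed host and using hypothetical service and path names:

```python
import shutil
import subprocess
from pathlib import Path

SERVICE = "erp-batch-processor"          # hypothetical unit name
TEMP_DIR = Path("/var/tmp/erp-batch")    # hypothetical scratch directory

def remediate_batch_crash():
    """The one documented fix: clear the temp directory, then restart the service."""
    if TEMP_DIR.exists():
        shutil.rmtree(TEMP_DIR)
        TEMP_DIR.mkdir()
    subprocess.run(["systemctl", "restart", SERVICE], check=True)

if __name__ == "__main__":
    remediate_batch_crash()
```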
MTTR has value in only two extremes: Chaotic Systems, where any improvement to observability is a win, and Static Systems, which rarely change and behave predictably.
Everywhere in between, it’s borderline meaningless.
Modern technology systems are not linear. They are complex, distributed and constantly evolving. No one is fully in control of the environment. The only constant is change. Features are deployed continuously, infrastructure is updated, data relationships shift, and all these actions occur asynchronously across loosely coupled teams.
Imagine retrofitting a plane with new cabin features and performing engine maintenance, all in mid-flight, while wing repairs are outsourced to a third-party vendor and half the crew still thinks the plane hasn’t left the terminal. That’s what complexity in modern systems looks like.
Complex systems defy cause-and-effect predictability. Causal, deterministic metrics like MTTR can’t, and won’t, capture how these systems actually behave. They oversimplify, obscure variability, and create a false sense of progress.
We’d all love to believe we can measure progress with a single, steadily improving number that convinces leadership that “everything is under control.” But as our systems grow more complex, that belief in MTTR isn’t just unrealistic; it’s delusional.
Focus on what actually matters: building adaptive capacity in the teams doing the work, learning from incidents, and improving detection and response automation. That’s what drives operational excellence and a culture of responsiveness and accountability. Not a bullshit, dashboard-friendly metric like MTTR.