Years ago, someone in tech support at Telewest, then the cable supplier for southwest London, told me that if my broadband went out I should hope its television service went down too: the volume of complaints would get it fixed much faster. You could see this in action some years later, in 2017, when Amazon Web Services went down, taking with it Netflix. Until that moment few had realized that Netflix built its streaming service on Amazon’s cloud computing platform to take advantage of its flexibility in up- and down-sizing infrastructure. The source – an engineer’s typing error – was quickly traced and fixed, and later I was told the incident led Netflix to diversify its suppliers. You would think!
Even so, Netflix was one of the companies hit on Monday, when a DNS error took out a chunk of AWS, and people from gamers on Roblox to governments with mission-critical dependencies were affected. The list of casualties includes both the expected (Alexa and Ring) and the unexpected (Apple TV, Snapchat, Hulu, Google, Fortnite, Lyft, T-Mobile, Verizon, Venmo, Zoom, and the New York Times). To that add the UK government. At the Guardian, Simon Goodley says the UK government has awarded AWS £1.7 billion in contracts across 35 public sector authorities, despite warnings from the Treasury, the Financial Conduct Authority, and the Prudential Regulation Authority. Among the AWS-dependent: the Home Office, the Department for Work and Pensions, HM Revenue and Customs, and the Cabinet Office.
First, to explain the mistake – so common that experts said “It’s always DNS” and so old that early Internet pioneers said “We shouldn’t be having DNS errors any more”. The Domain Name System, conceived in 1983 by Paul Mockapetris, is a core piece of how the Internet routes traffic. When you type or click on a domain name such as “pelicancrossing.net”, behind the scenes a computer translates that name into a series of dotted numbers that identify the request’s destination. An error in those numbers, no matter how small, means the message – data, search request, email, whatever – can’t reach its destination, just as you can’t reach the recipient you want if you get a telephone number wrong. The upshot of all that is that DNS errors snarl traffic. In the AWS case, the error affected just one of its 30 regions, which is why Monday’s outages were patchy.
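For readers who like to see the idea in code, here is a toy sketch in Python of the lookup step described above. It is not how AWS or any real resolver works: the record table, the addresses (taken from ranges reserved for documentation), and the names `resolve`, `deliver`, `RECORDS`, and `SERVERS` are all invented for illustration. It shows only that a one-digit error in the stored number is enough to send a request nowhere.

```python
from typing import Optional

# Hypothetical name-to-address table standing in for DNS records.
# The addresses come from ranges reserved for documentation, not real hosts.
RECORDS = {
    "pelicancrossing.net": "192.0.2.10",
    "example.org": "198.51.100.7",
}

# Hypothetical map of which machine answers at which address.
SERVERS = {
    "192.0.2.10": "pelicancrossing web server",
    "198.51.100.7": "example.org web server",
}


def resolve(name: str, records: dict) -> Optional[str]:
    """Translate a domain name into its dotted-number address, if one is on file."""
    return records.get(name.lower().rstrip("."))


def deliver(name: str, records: dict, servers: dict) -> str:
    """Look the name up, then try to reach whatever sits at that address."""
    address = resolve(name, records)
    if address is None:
        return "lookup failed"
    return servers.get(address, "no server at that address")


# With a correct record, the request arrives.
print(deliver("pelicancrossing.net", RECORDS, SERVERS))

# Change one digit in the record and the same request goes astray.
broken = dict(RECORDS)
broken["pelicancrossing.net"] = "192.0.2.11"
print(deliver("pelicancrossing.net", broken, SERVERS))
```

On a real machine the translation is done by the operating system's resolver, which Python exposes through the standard library's `socket.getaddrinfo`; the dictionary here simply stands in for that network lookup.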
As Dan Milmo and Graham Wearden write at the Guardian, the outage has focused many minds on the need to diversify cloud computing. Amazon (30%), Microsoft Azure (20%), and Google (13%) jointly control 63% of the market worldwide. There have been many such warnings.
At The Register, Carly Page reports on the individual level: smart homes turned dumb. Eightsleep beds stuck in an upright position and lost their temperature controls. App-controlled litter boxes stopped communicating. “Smart” light bulbs stayed dark. The Internet of Other People’s Things at its finest.
Also at The Register, Corey Quinn suggests the DNS error was ultimately attributable to an ongoing exodus of senior AWS engineers who took with them essential institutional memory. Once you’ve reached a certain level of scale, Quinn writes, every problem is complex and being able to remember that a similar issue on a previous occasion was traced improbably to a different system in a corner somewhere can be crucial. As departures continue, Quinn believes failures like these will become more common.
If that global picture is dispiriting, consider also the question of dependence within organizations: if your country depends on a single company’s infrastructure to power mission-critical systems, the diversity in the rest of the world won’t help you if that single company goes down. In the UK, Sam Trendall reports at Public Technology, the government activated incident-response mechanisms. Notable among the failures as prime minister Keir Starmer pushes for a mandatory digital ID: the government’s new One Login, as well as some UK banks. This outage strengthens the case for the digital sovereignty many have been advocating.
I admit to mixed feelings. I agree with the many who believe the public sector should embrace digital sovereignty…but I also know that the UK government has a terrible record of failed IT projects, no matter who builds them. In 2010, fixing that was part of the motivation for setting up the Government Digital Service, as first GDS leader Mike Bracken writes at Public Digital. Yet the failures keep coming; see also the Post Office Horizon scandal. Bracken believes the solution is to invest in public sector capacity and digital expertise in order to end this litany of expensive failures.
At TechRadar, Benedict Collins rounds up further expert commentary, largely in agreement about the lessons we should learn. But will we? We should have learned in 2017.
Still, it would be a mistake to focus solely on Amazon. It is just one of many centralized points of failure. The Internet Archive is dangerously important as a unique resource for archived web pages. And the UK is not the only government flying at high risk. Consider South Korea, where a few weeks ago a data center fire may have consumed 85TB of government data – with no backups. It seems we never really learn.
Illustrations: Traffic jam in New York’s Herald Square, 1973 (via Wikimedia).
Wendy M. Grossman is an award-winning journalist. Her Web site has an extensive archive of her books, articles, and music, and an archive of earlier columns in this series. She is a contributing editor for the Plutopia News Network podcast. Follow on Mastodon or Bluesky.
