Summary of Heroku June 10 Outage


Beginning at 06:00 UTC on Tuesday, June 10, 2025, Heroku customers began experiencing a service disruption that resulted in up to 12 hours of downtime for many customers. This issue was caused by an unintended system update across our production infrastructure; it was not the result of a security incident, and no customer data was lost. Now that we have fully restored our services, we would like to share more details.

The entire Heroku team offers our deepest apology for this service disruption. We understand that many of you rely on our platform as a foundation for your business. Communication during the incident did not meet our standards, leaving many of you unable to access accurate status updates and uncertain about your applications. Incidents like this can affect trust, our number one value, and nothing is more important to us than the security, availability, and performance of our services.

What happened

A detailed preliminary RCA is available here. Let’s go over some of the more important points. Our investigation revealed three critical issues in our systems that combined to cause this outage:

  1. A Control Issue: An automated operating system update ran on our production infrastructure when it should have been disabled. This process restarted the host’s networking services.
  2. A Resilience Issue: The networking service had a critical flaw: it relied on a legacy script that only applied the correct routing rules on initial boot. When the service restarted, the routes were not reapplied, severing outbound network connectivity for all dynos on the host (see the illustrative sketch after this list).
  3. A Design Issue: Our internal tools and the Heroku Status Page were running on this same affected infrastructure. This meant that as your applications failed, our ability to respond and communicate with you was also severely impaired.
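
To make the resilience issue concrete, here is a minimal, hypothetical sketch in Python (not Heroku’s actual code; the class, function, and route names are invented for illustration). It contrasts route setup that runs only on first boot with route setup that is reapplied idempotently on every service start:

```python
# Hypothetical model of the failure mode: routing rules applied only on the
# first boot are silently lost when the networking service restarts, unless
# setup is re-run idempotently on every start.

REQUIRED_ROUTES = {"10.0.0.0/8 via eth0", "0.0.0.0/0 via nat-gateway"}  # illustrative

class Host:
    def __init__(self):
        self.routes = set()
        self.booted = False

    def legacy_network_start(self):
        """Legacy behavior: the route script only runs on the very first boot."""
        if not self.booted:
            self.routes |= REQUIRED_ROUTES
            self.booted = True
        # On a restart (e.g. triggered by an automated OS update) nothing
        # reapplies the routes, so outbound connectivity stays broken.

    def resilient_network_start(self):
        """Hardened behavior: reapply the full route set on every (re)start."""
        self.routes |= REQUIRED_ROUTES
        self.booted = True

    def restart_networking(self, start_fn):
        self.routes.clear()  # a service restart flushes the routing rules
        start_fn()

host = Host()
host.legacy_network_start()
host.restart_networking(host.legacy_network_start)
assert host.routes == set()  # outbound traffic from dynos is now severed

host = Host()
host.resilient_network_start()
host.restart_networking(host.resilient_network_start)
assert host.routes == REQUIRED_ROUTES  # routes survive the restart
```

The hardened variant is simply idempotent: because it reasserts the desired routing state on every start, an unexpected restart of the networking service is harmless.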

These issues caused a chain reaction that led to widespread impact, including intermittent logins, application failures, and delayed communications from the Heroku team.

Timeline of Events

(All times are in Coordinated Universal Time, UTC)

Phase 1: Initial Impact and Investigation (06:00 – 08:26)

At 06:00, Heroku services began to experience significant performance degradation. Customers reported issues including intermittent logins, and our monitoring detected widespread errors with dyno networking. Critically, our own tools and the Heroku Status Page were also impacted, which severely delayed our ability to communicate with you. By 08:26, the investigation confirmed the core issue: the majority of dynos in Private Spaces were unable to make outbound HTTP requests.

Phase 2: Root Cause Discovery (08:27 – 13:42)

With the impact isolated to dyno networking, the team began analyzing affected hosts. They determined it was not an upstream provider issue, but a failure within our own infrastructure. Comparing healthy and unhealthy hosts, engineers identified missing network routes at 11:54. The key discovery came at 13:11, when the team learned of an unexpected network service restart. This led them to pinpoint the trigger at 13:42: an automated upgrade of a system package.

Phase 3: Mitigation and Service Restoration (12:56 – 22:01)

While the root cause investigation was ongoing, this became an all-hands-on-deck situation with teams working through the night to restore service.

  • Communication & Relief: At 12:56, the team began rotating restarts of internal instances, which provided some relief. At 13:58, a workaround was found to post updates to the @herokustatus account on X.
  • Stopping the Trigger: The team engaged an upstream vendor to invalidate the token used for the automated updates. This was confirmed at 17:30 and completed at 19:18, preventing any further hosts from being impacted.
  • Restoring Services: The Heroku Dashboard was fully restored by 20:59. With the situation contained, the team initiated a fleetwide dyno recycle at 22:01 to stabilize all remaining services.

Phase 4: Long-Tail Cleanup (June 10, 22:01 – June 11, 05:50)

This began a long phase of space recovery and downstream fixes, as many systems had to catch up once service was restored. For example, status emails queued earlier in the incident started being delivered, Heroku Connect syncing resumed and worked through its backlog, and the Heroku release phase took a few hours to clear its own backlog. After extensive monitoring to ensure platform stability, all impacted services were fully restored, and the incident was declared resolved at 05:50 on June 11.

Identified Issues

Our post-mortem identified three core areas for improvement.

First, the incident was triggered by unexpected weaknesses in our infrastructure. A lack of sufficient immutability controls allowed an automated process to make unplanned changes to our production environment.

Second, our communication cadence missed the mark during a critical outage; customers needed more timely updates, an issue made worse by the status page itself being impacted by the incident.

Finally, our recovery process took longer than it should have. Tooling and process gaps hampered our engineers’ ability to quickly diagnose and resolve the issue.

Resolution and Concrete Actions

Understanding what went wrong is only half the battle. We are taking concrete steps to prevent a recurrence and be better prepared to handle any future incidents.

  • Ensuring Immutable Infrastructure: The root cause of this outage was an unexpected change to our running environment. We disabled the automated upgrade service during the incident (June 10), with permanent controls coming early next week. Going forward, no system changes will occur outside our controlled deployment process (a rough illustration follows this list). Additionally, we’re auditing all base images for similar risks and improving our network routing so that it handles service restarts gracefully.
  • Guaranteeing Communication Channels: Our status page failed you when you needed it most because our primary communication tools were affected by the outage. We are building backup communication channels that are fully independent to ensure we can always provide timely and transparent updates, even in a worst-case scenario.
  • Accelerating Investigation and Recovery: The time it took to diagnose and resolve this incident was unacceptable. To address this, we are overhauling our incident response tooling and processes. This includes building new tools and improving existing ones to help engineers diagnose issues faster and run queries across our entire fleet at scale. We are also streamlining our “break-glass” procedures to ensure teams have rapid access to critical systems during an emergency and enhancing our monitoring to detect complex issues much earlier.
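
As a rough illustration of the kind of fleet audit described above, here is a minimal, hypothetical Python sketch. It assumes systemd-managed Linux hosts and common auto-update units such as Ubuntu’s unattended-upgrades; Heroku’s actual controls, tooling, and unit names are not described in this post:

```python
# Hypothetical audit: flag any host where an automated upgrade service is
# still enabled, so that changes can only arrive through the controlled
# deployment process. Unit names below are illustrative, not Heroku's.

import subprocess

AUTO_UPDATE_UNITS = ["unattended-upgrades.service", "apt-daily-upgrade.timer"]

def unit_is_enabled(unit: str) -> bool:
    """Return True if systemd reports the unit as enabled on this host."""
    result = subprocess.run(
        ["systemctl", "is-enabled", unit],
        capture_output=True, text=True,
    )
    return result.stdout.strip() == "enabled"

def audit_local_host() -> list[str]:
    """Return the auto-update units still enabled (an empty list means compliant)."""
    return [unit for unit in AUTO_UPDATE_UNITS if unit_is_enabled(unit)]

if __name__ == "__main__":
    violations = audit_local_host()
    if violations:
        print(f"non-compliant: auto-update units still enabled: {violations}")
    else:
        print("compliant: no automated update services enabled")
```

In practice, a check like this would run continuously across the fleet and feed an alerting or enforcement pipeline rather than printing locally.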

Thank you for depending on us to build and run your apps and services. We take this outage very seriously and are determined to continuously improve the resiliency of our service and our team’s ability to respond, diagnose, and remediate issues. The work continues and we will provide updates in an upcoming blog post.
