Migrating Uber's Compute Platform to Kubernetes


In 2024, the Uber stateless container orchestration platform completed a migration from Apache Mesos® to Kubernetes®.

This blog is the first in a multi-part series about Kubernetes migration use cases. Uber is gradually converging to Kubernetes for batch, stateless, and storage use cases. The goal is to leverage industry-standard technology, rich built-in functionality, and stability. This blog describes how we migrated all shared stateless workloads to Kubernetes. It describes our migration journey, including the motivation behind the migration, the challenges we faced, and the solutions we implemented to ensure a seamless transition. We discuss our infrastructure, key principles guiding the migration, customizations, scalability achievements, unexpected issues, and our plans for collaboration with the open-source community. The next blog will explore how Uber migrated DSW batch job workloads from Peloton to Kubernetes. Future blogs will explore additional workload and architecture migrations.

About the Container Platform Team

The Container Platform team manages more than 50 compute clusters across multiple regions and zones, both in on-prem data centers and on cloud providers like Oracle® Cloud and Google Cloud. Each cluster has roughly 5,000-7,500 hosts, 250,000 cores, and 50,000 pods. Together, they power Uber’s 4,000 services, which use around 3 million cores. These services are deployed 100,000 times a day, resulting in 1.5 million pod launches a day, at a rate of 120-130 pods/second in a single cluster. These clusters power Uber’s service federation layer, Up, which developers use to manage their service lifecycle.

Figure 1: Uber cluster federation, Up.

Why Kubernetes?

The stateless container orchestration platform was stable on the Mesos stack for over 3 years. So why did we need to migrate to Kubernetes?

The main motivation was that Mesos was deprecated at Uber in 2021, and its code base has seen few commits since then. This was challenging because we weren’t receiving any bug fixes or new features from the community.

Conversely, Kubernetes has emerged as the industry standard for container orchestration, and has been widely adopted by Fortune 100 companies. It’s also natively supported by all major cloud providers. Other benefits include its active open-source community, a rich ecosystem of off-the-shelf operators to streamline development and reduce redundancies, and continuous updates and patches to improve security and compliance. 

These benefits make Kubernetes a forward-compatible and secure choice, ensuring Uber’s infrastructure remains resilient and adaptable to future technological advancements. But moving Uber from Mesos to Kubernetes posed significant challenges around scale, reliability, and integration.

Migration Principles

Based on past experiences with open-source technologies, we established several guiding principles for our migration. 

  1. Seamless Upgrades: We aim to run the same Kubernetes versions as cloud providers, relying on Kubernetes’ native extensibility for any customizations.
  2. Reliable Upgrades: To avoid incidents during upgrades, we’ll build extensive release validation systems, including integration and performance tests.
  3. Transparent and Automated Migrations: The migration process should be invisible to developers, requiring no effort or changes in their workflow. It should be a centralized effort with no complexity passed along to the end user. Developers shouldn’t even need to know whether their service is running on a Mesos cluster or a Kubernetes cluster, and a service should be able to safely co-exist in both.

Migration Challenges

The following sections share the challenges we faced during the migration.

Scale

Running Kubernetes at Uber scale involved overcoming several significant challenges. The common industry-wide practice is to manage many small clusters (1,500-2,000 nodes). However, this approach results in a lot of stranded capacity due to fragmentation and control plane overhead. It also increases operational toil. At Uber, we do the exact opposite of this and operate large clusters (5,000-7,500 nodes). 

Operating large clusters on Kubernetes posed challenges with managing the API server load to prevent bottlenecks, ensuring that the scheduler can handle the scale even with high pod churn, and dealing with fragmentation issues that arise from running large clusters with varied workloads. 

We created a state-of-the-art benchmarking suite for Kubernetes at Uber and were able to benchmark our clusters at 7,500 nodes, 200,000 Pods, and 150 Pods scheduled per second. 

We had to make some optimizations and tune parameters to achieve these numbers. We tuned the QPS settings and parallelism in the controller manager and scheduler to handle high loads. We used API priority and fairness to limit expensive API calls like list and get. We switched from JSON to Proto encoding for better performance. Lastly, we modified the pod topology spread scheduler plugin for faster processing. 
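
To make the QPS tuning concrete: Kubernetes clients built on client-go throttle their own API requests with a token-bucket limiter governed by a QPS and a burst value. The sketch below is a simplified illustration of that mechanism, not Uber's configuration. The class name, the helper, and the numbers are assumptions chosen for the example, and the post doesn't say which values were used.

```python
from dataclasses import dataclass


@dataclass
class TokenBucket:
    """Illustrative client-side limiter: refills `qps` tokens per second, capped at `burst`."""
    qps: float
    burst: int
    tokens: float = 0.0
    last: float = 0.0

    def __post_init__(self) -> None:
        # Start with a full bucket so an initial burst is allowed.
        self.tokens = float(self.burst)

    def try_acquire(self, now: float) -> bool:
        """Return True if a request may be sent at time `now` (in seconds)."""
        elapsed = max(0.0, now - self.last)
        self.tokens = min(float(self.burst), self.tokens + elapsed * self.qps)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


def admitted(bucket: TokenBucket, timestamps: list[float]) -> int:
    """Count how many requests arriving at `timestamps` get through the limiter."""
    return sum(bucket.try_acquire(t) for t in timestamps)
```

With a low QPS, a controller that needs to create hundreds of pods per second spends most of its time waiting for tokens, which is why these limits have to be raised for high-churn clusters.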


Integration with Uber Ecosystem

Migrating to Kubernetes required rebuilding integrations to ensure a consistent developer experience. These integrations included our CI/CD systems, service discovery, security, host lifecycle and observability stacks. Since Mesos and Kubernetes are fundamentally different in their architecture, we had to rebuild all of these from scratch.

Figure 2: Uber developer platform components.


Automation

At Uber’s scale, driving the migration manually or on a case-by-case basis wasn’t possible. To automate the mechanics of migrating over 3 million cores, we leveraged Up, our global stateless federation layer. It abstracted away cluster technology from developers, allowing us to move capacity from Mesos to Kubernetes. We used Up’s cluster selection feature to add a Kubernetes cluster with some additional swing capacity to every zone, alongside the Mesos cluster. Cluster selection rebalanced services away from highly allocated clusters (Mesos) to low-allocation clusters (Kubernetes), automating the migration process without needing any service owner interaction. In fact, service owners didn’t even realize that this was happening under the hood.
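
The rebalancing idea can be sketched in a few lines. This is a minimal illustration of "prefer the least-allocated cluster that still has room," not Up's actual algorithm. The `Cluster` type, its fields, and `pick_cluster` are hypothetical names introduced for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Cluster:
    """Hypothetical view of one cluster in a zone."""
    name: str
    capacity_cores: int
    allocated_cores: int

    @property
    def allocation(self) -> float:
        # Fraction of the cluster's cores that are already allocated.
        return self.allocated_cores / self.capacity_cores

    @property
    def free_cores(self) -> int:
        return self.capacity_cores - self.allocated_cores


def pick_cluster(clusters: list[Cluster], request_cores: int) -> Optional[Cluster]:
    """Return the least-allocated cluster that can fit the request, or None."""
    candidates = [c for c in clusters if c.free_cores >= request_cores]
    return min(candidates, key=lambda c: c.allocation, default=None)
```

In a zone with a nearly full Mesos cluster and a fresh Kubernetes cluster holding swing capacity, a selector like this keeps choosing Kubernetes, so capacity drains away from Mesos without any action from service owners.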


Feature Parity

Achieving feature parity between Kubernetes and Mesos was crucial for facilitating a painless migration. We needed to provide service owners with the same developer experience, the same levels of deployment safety, and the same operational velocity.

Uber-Specific Customizations

To support Uber use cases, we built and adopted custom components in our Kubernetes environment.

Container Artifacts Retrieval

On Mesos, developers could access their container artifacts like core dumps, heap profiles, and logs even after their container exited. This feature helped developers debug their container crashes, out-of-memory kills, and more. These artifacts were written to local volumes and exposed to the Up UI via a Mesos endpoint. In Kubernetes, however, local volumes get cleaned up during pod deletion by default. The Kubernetes UI doesn’t give visibility into container artifacts.

To address this, we introduced a sidecar container and an artifact uploader daemon that uploads these artifacts to a blob store upon container exit. Each pod includes a sidecar container that remains running even after the primary container exits. Upon the primary container’s exit, the artifact uploader daemon on the host is triggered. It compresses the necessary artifacts and uploads them to a blob store. 
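
A rough sketch of the uploader step, assuming artifacts live in a per-container directory on the host and that the blob store is reachable through a simple `upload(key, data)` callable. Both are assumptions for illustration, as are the function names and the key layout. The post doesn't describe the daemon's interface.

```python
import io
import tarfile
from pathlib import Path
from typing import Callable


def bundle_artifacts(artifact_dir: Path) -> bytes:
    """Compress every regular file under `artifact_dir` into an in-memory .tar.gz."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        for path in sorted(artifact_dir.rglob("*")):
            if path.is_file():
                # Store paths relative to the artifact directory.
                tar.add(str(path), arcname=path.relative_to(artifact_dir).as_posix())
    return buf.getvalue()


def on_container_exit(
    pod: str,
    container: str,
    artifact_dir: Path,
    upload: Callable[[str, bytes], None],
) -> str:
    """Bundle the container's artifacts and hand them to the (hypothetical) blob store client."""
    key = f"{pod}/{container}/artifacts.tar.gz"  # hypothetical key layout
    upload(key, bundle_artifacts(artifact_dir))
    return key
```

Because the sidecar keeps the pod alive after the primary container exits, the local volume is still present when a hook like `on_container_exit` runs.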

Figure 3: Artifact uploader architecture diagram.

Gradual Scaling

At Uber, controlled scaling is a critical feature designed to ensure deployment safety for services that are sensitive to rapid scaling operations. 

Services could be sensitive to rapid scaling changes due to several factors. Rapid scaling can cause large-scale resharding, leading to instability in some services. Apache Helix–based services can experience spikes in convergence times due to high instance churn in the ring. Lastly, some services can suffer temporary loss of all workers for specific shards during rapid scale-down operations.

Kubernetes deployments don’t natively support configurations to handle these issues. The rolling update specification in Kubernetes only controls the pace of deployment upgrades, not scaling operations. To bridge this gap, we implemented a feature in our custom resource controller that controls scaling operations based on service intent configuration.

It works by scaling in small batches. Our custom resource controller breaks down a single scale operation into multiple small batches. The custom controller proceeds with one batch only after the previous batch has completed scaling successfully.
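
The batching logic can be sketched as follows. This is a simplified model rather than the controller's real code: `scale_steps`, `gradual_scale`, and the `apply_and_wait` callback (which stands in for "set the replica count and wait for the batch to become healthy") are hypothetical names.

```python
from typing import Callable, Iterator


def scale_steps(current: int, target: int, batch_size: int) -> Iterator[int]:
    """Yield the intermediate replica counts from `current` to `target`, one batch at a time."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    step = batch_size if target > current else -batch_size
    while current != target:
        if abs(target - current) <= batch_size:
            current = target  # final, possibly smaller, batch
        else:
            current += step
        yield current


def gradual_scale(
    current: int,
    target: int,
    batch_size: int,
    apply_and_wait: Callable[[int], bool],
) -> int:
    """Apply batches in order and stop at the first one that fails.

    Returns the last replica count that was applied successfully.
    """
    for replicas in scale_steps(current, target, batch_size):
        if not apply_and_wait(replicas):
            break
        current = replicas
    return current
```

For example, scaling from 10 to 25 replicas with a batch size of 5 passes through 15 and 20 before reaching 25, and the same helper handles scale-downs by stepping in the other direction.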

Figure 4: Gradual scaling control flow.

Faster Deployments

To improve deployment speeds, especially for large containers, we adopted CloneSet for in-place updates and introduced an image prefetch daemon. This daemon downloads images ahead of time, reducing cold start times.

Figure 5: Image prefetching.

CloneSet is an open-source custom Kubernetes resource that enables in-place updates by patching pods rather than replacing them. During an update, the image prefetch daemon pre-downloads the new images to nodes in other zones, ensuring minimal delay when the update rolls out.


Kubernetes UI Optimizations

The native Kubernetes UI didn’t work well for Uber’s scale. It would crash or become non-responsive when we pointed it to a large cluster. We optimized the Kubernetes UI for better performance and usability, introducing caching in multiple places within the UI code. These improvements keep the UI responsive and functional even with large cluster sizes.

Unexpected Snags

So far, we’ve discussed the challenges we anticipated and resolved before embarking on this migration. Over the course of our migration journey, we also had some interesting learnings around unexpected system behaviors and Kubernetes quirks. 

Lack of Holistic Cluster Health Monitoring

We discovered a gap in holistic monitoring tools for large Kubernetes clusters. Specifically, we needed better insight into resource fragmentation to identify why pods weren’t being placed efficiently, a way to detect performance issues caused by other pods on the same node, and an understanding of the impact of frequent pod updates and restarts.

To address this, we built our own deployment observability tool, providing detailed metrics and insights into cluster health and performance.


Informer Reconciliation

The default Kubernetes informer reconciliation process, which replays events every 8-10 hours, was inadequate for our needs. We encountered issues where deployment events were missed during controller leadership changes, delaying the deployment process by up to 10 hours.

To solve this, we created a custom reconciliation mechanism that forces reconciliation of high-level objects every 15 minutes, ensuring timely processing of deployment events.
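
One way to picture the forced reconciliation is as a periodic sweep that re-enqueues any object not reconciled within the last 15 minutes. The sketch below assumes the controller tracks a last-reconciled timestamp per object and exposes an `enqueue` callback. Both are illustrative assumptions, and the function names are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Callable

RESYNC_INTERVAL = timedelta(minutes=15)  # interval stated in the post


def due_for_reconcile(
    last_reconciled: dict[str, datetime],
    now: datetime,
    interval: timedelta = RESYNC_INTERVAL,
) -> list[str]:
    """Return keys of objects not reconciled within `interval`, oldest first."""
    stale = [(ts, key) for key, ts in last_reconciled.items() if now - ts >= interval]
    return [key for _, key in sorted(stale)]


def force_reconcile(
    last_reconciled: dict[str, datetime],
    now: datetime,
    enqueue: Callable[[str], None],
) -> list[str]:
    """Re-enqueue every stale object and record the sweep time for it."""
    due = due_for_reconcile(last_reconciled, now)
    for key in due:
        enqueue(key)
        last_reconciled[key] = now
    return due
```

With a sweep like this, an event missed during a leadership change delays a deployment by at most one interval instead of waiting for the next informer replay.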


Faster Rollbacks

We needed a more reliable method for detecting and rolling back bad deployments. Initially, we relied on the progress deadline seconds timer, but a deployment timing out with “progress deadline exceeded” proved to be a slow and implicit signal for triggering rollbacks. We required a fast and explicit one. Some services also had disabled health checks, causing pods to appear ready even when they were crash-looping. Lastly, services with long initial delays in health checks led to misleading progress indicators.

To solve this, we implemented heuristics based on container restarts. For example, if more than 10% of pods restart 5 or more times during a rollout, we signal Up to trigger a rollback automatically.
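
The example threshold above translates into a small predicate. This sketch assumes the per-pod restart counts for the rollout have already been collected (for instance, from each pod's container status). The function name and parameter names are hypothetical, and the defaults mirror the numbers in the example rather than any confirmed production setting.

```python
def should_rollback(
    restart_counts: list[int],
    restart_threshold: int = 5,
    pod_fraction: float = 0.10,
) -> bool:
    """Return True when more than `pod_fraction` of pods restarted `restart_threshold` or more times."""
    if not restart_counts:
        return False
    crashing = sum(1 for restarts in restart_counts if restarts >= restart_threshold)
    return crashing / len(restart_counts) > pod_fraction
```

Unlike a progress deadline, this check doesn't depend on health checks being configured, so it still fires for crash-looping pods that appear ready.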

Migration Journey

Figure 6: Kubernetes migration timeline.

As of July 2024, we’re 100% done with the migration of shared stateless services to Kubernetes. Figure 6 shows the year-and-a-half migration progress.


Strategic Pauses

We took a couple of strategic pauses during this journey, which are the flat lines shown in Figure 6. During these pauses, we identified and fixed several reliability issues that surfaced as we scaled our Kubernetes clusters. For example, we discovered edge cases in our custom controllers and operators that required fine-tuning to ensure consistent performance and stability. By addressing these issues promptly, we prevented potential disruptions and ensured a smooth developer experience.


Operational Challenges

Operational challenges, like managing API server loads and optimizing scheduler performance, also prompted us to pause and refine our approach. We implemented several optimizations, including tuning QPS settings, restricting API calls, and switching to Proto encoding for better performance. These changes were crucial in maintaining the operational efficiency of our clusters as we scaled up. 


Accelerated Pace Post-Consolidation

The periods of consolidation and issue resolution significantly accelerated the pace of our migrations afterward. With a more stable and reliable foundation, we could resume our migration with increased confidence in the platform. At our peak, we migrated about 300,000 cores in a single week.

Next Steps

Convergence and open-source collaboration are next on our roadmap. 

We aim to converge all our cluster management technologies onto Kubernetes, including Apache Hadoop YARN for batch workloads and Odin for stateful workloads. This unified platform will simplify operations and improve efficiency across the board.

We also hope to contribute back to the open-source community and discuss whether any of the features mentioned here would be beneficial to others outside Uber.

Our migration to Kubernetes was a significant undertaking that required careful planning, extensive customizations, and continuous learning. The open-source community’s contributions were invaluable in working through our roadblocks. As we continue on this journey, we aim to share our experiences and solutions to enrich the Kubernetes community further.

Apache®, Apache Mesos, Apache Helix, and the star logo are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries. No endorsement by The Apache Software Foundation is implied by the use of these marks.

Google Cloud is a trademark of Google LLC and this blog post is not endorsed by or affiliated with Google in any way.

Kubernetes® and its logo are registered trademarks of The Linux Foundation® in the United States and other countries. No endorsement by The Linux Foundation is implied by the use of these marks.

Oracle® is a registered trademark of Oracle and/or its affiliates. No endorsement by Oracle is implied by the use of these marks.

Header Image Attribution: The “Container Ship MSC Texas” image is covered by a CC BY-SA 2.0 license and is credited to Daniel Ramirez.
