We saved 30% on Kubernetes using 70% more expensive VMs



And that’s not a typo.

Omio is one of the earliest companies in Europe to go 100% Kubernetes Native. We switched all our workloads (including databases, logs, metrics, everything) in 2017, and gave a talk about it at KubeCon Europe 2017.

We jokingly call ourselves Google Kubernetes Engine users, not Cloud users 😏 We don’t use the Cloud as a shopping mall. Our usage is portable, controlled and reproducible. You can read more in our other blog post, DevOps as a Contract.

Synthetic Metrics: For this effort, we primarily track the Provisioning ratio (sum of CPU/memory requested by workloads / sum of node allocatable) and the Utilization ratio (sum of actual utilization / sum of node allocatable). For example, if workloads request 600 vCPU on nodes exposing 800 allocatable vCPU, the Provisioning ratio is 75%.

Field Metrics: From our experience, we’ve seen our fair share of misguidance from chasing synthetic metrics, which easily become decoupled from ground reality. When one goes down, the other can go up, down or sideways.

Daily billed cost of the cluster is what truly matters, guarded closely by SLOs.

As a well-known Kubernetes cost-saving practice, we used a mix of Spot VMs and Standard VMs in our GKE clusters. Spot VMs are ephemeral, short-lived & highly discounted (up to 70%) machines. They represent spare capacity at the Cloud provider. The only caveat: they can disappear at any time, at short notice.

Given our head start, scale & maturity of Kubernetes operations, we’ve had lots of time to build global policies. The two we’d like to highlight for Spot:

  • Mandatory Anti-Affinity: Pods that belong to the same workload are never colocated on the same node, so that if any one node goes down, no single workload becomes unavailable or has its SLOs significantly affected (a minimal sketch follows this list). We unfortunately can’t rely on Topology Spread due to the frequency of full-node disruptions explained below.
  • Mix of Spot Nodepools: We have multiple nodepools to accept whatever Spot capacity is present across zones, e.g. across instance families (N1, N2, N2D, C2, etc.) and different CPU/memory combinations (standard-4, standard-8, highcpu-4, highcpu-8, etc.). Since Spot VMs are not guaranteed capacity, accepting everything the region offers keeps pods from getting starved and keeps us from waking up to unassigned-pod alerts.
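
To make the first policy concrete, here is a minimal sketch of what a hard anti-affinity rule of this kind looks like in a pod template. The label key/value (app: my-service) and image are hypothetical placeholders, not our actual conventions:

```yaml
# Minimal sketch: pods carrying the same "app" label are never scheduled
# onto the same node. Label value and image are placeholders.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-service
spec:
  replicas: 3
  selector:
    matchLabels:
      app: my-service
  template:
    metadata:
      labels:
        app: my-service
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchLabels:
                  app: my-service
              topologyKey: kubernetes.io/hostname
      containers:
        - name: my-service
          image: my-service:latest
```

With a required anti-affinity term keyed on kubernetes.io/hostname, the scheduler refuses to put two replicas on the same node, which is exactly what drives the bin-packing penalty in Observation 1 below.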

Observation 1: Anti-affinity causes less bin-packing

Kubernetes optimizes pod placement for efficient capacity only when the pod is created; after that, the pod stays where it is.

Kubernetes Scheduler does not perform Active Rebalancing.

When we spread out pods of the same workload, we observed poor provisioning and utilization across our clusters. There is a lot of wasted CPU/memory. If a workload requires 50 replicas, anti-affinity means 50 machines.

Observation 2: Mix of Nodepools causes less bin-packing

Kubernetes (especially GKE) is quite machine-heavy. On every machine:

  • Kubernetes already reserves a certain percentage of CPU/memory for system stability.
  • A lot of system services and daemonsets run on top of that (kube-proxy, DNS caching, log shipping, metrics scraping, etc.). Our node-level optimizations could be a separate article by themselves, but we won’t digress for now.

In Kubernetes, it’s more efficient to run a few large machines than many small machines, to avoid death by a thousand system services. For example, if per-node overhead adds up to roughly half a core, a fleet of 4-core nodes loses about 12% of its capacity to overhead, while 16-core nodes lose about 3%.

GKE’s autoscaling profiles come close, but they are not a true substitute for proper bin-packing.

But with Spot machines, we have to accept a mix of nodepools for Spot availability. This means many 8-core or smaller machines (inefficient utilization), the odd one or two 32-core machines (which either create a lot of chaos when they go down, or conflict with spreading pods out), and a mix of machine families that creates unpredictable performance.

From our experience, 16 cores with a 1:4 CPU:memory ratio (64GB RAM) is an optimal price-performance-packing tradeoff for Dense, Stateless Kubernetes Nodes with a mix of Java, Go & Node container runtimes. Your mileage may vary based on what you run.
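
For reference, the machine shape is visible on every GKE node through the standard instance-type label, so a homogeneous fleet is easy to verify or target. A minimal sketch, assuming n2-standard-16 (16 vCPUs, 64GB, i.e. the 1:4 ratio above) as the chosen shape:

```yaml
# Minimal sketch: pin a pod to the standardized 16-core, 1:4 machine shape.
# "n2-standard-16" is an assumption for illustration; any 16-vCPU / 64GB
# machine type fits the same ratio.
spec:
  nodeSelector:
    node.kubernetes.io/instance-type: n2-standard-16
```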

Observation 3: Spot is a Chaos Monkey

Spot VMs are an automatic Chaos Monkey let loose in production, but not in a good way. We were running between 100 and 300 Spot machines at any point in time, which meant we saw full-node disruptions every 5–15 minutes. A large number of pods had to be rescheduled, images pulled, containers warmed up, etc., increasing kernel pressure across the board.

Observation 4: Networking Overhead

There is a lot of hard-to-estimate impact from full-node Spot disruptions. Kubernetes networking triggers complete routing table reloads, Ingress reloads, load balancer health check recoveries, etc. We had a lot of unexplained issues that were generally lumped together as “Networking Noise”.

Since Spot availability varies by zone, we used all 4 zones for our Stateless clusters. This also means cross-zone network traffic that we have to pay for, which is not insignificant at our scale.

Observation 5: Dubious Vendors

At a certain point, we used a vendor to manage the Spot machines. The statistics were not reliable. They provided figures of “what it might have cost” with Standard VMs, and charged us a percentage of what they hypothetically saved. This always felt dubious to us, and we had no way to verify it unless we had something to compare against.

At one point, we decided we’d had enough of the “what it might have cost without Spot VMs” stuff, the Chaos Monkey, the Networking Noise.

We decided to focus on the field, and get to the bottom of “What it actually costs” to just run plain standard machines.

We already have 2 geographically distributed regions. Why not just switch one region and compare the bills? Even if it failed, it would at least give us a reference threshold for what the vendor was billing us for Spot capacity.

Blindly doing that would just mean 70% higher cost. It wouldn’t be a fair comparison. So we decided to operationalize it with a few basic policies tuned for Standard machines.

Standard Machine Policies

  1. Homogeneous single-family machines with 16 CPUs and a 1:4 CPU:RAM ratio. We’ve tried GKE automatic nodepools before on our stateful clusters, but the Field Data was either equal or worse, so we might as well standardize.
  2. With Standard machines, it becomes feasible to use Kubernetes Topology Spread instead of Anti-affinity, since we no longer have full-node disruptions every 5–15 mins as with Spot (a minimal sketch follows this list).
  3. Purchase committed use discounts, since we now have a homogeneous nodepool.
  4. Restrict all compute to two zones per region, since we’re not constrained by spot availability.
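
A minimal sketch of policy 2, reusing the same hypothetical app: my-service label as before. Topology spread keeps replicas evenly distributed across nodes and zones but, unlike hard anti-affinity, allows several replicas to share a node when needed:

```yaml
# Minimal sketch: spread replicas across nodes (soft) and zones (hard)
# instead of forbidding co-location outright. Label value is a placeholder.
spec:
  topologySpreadConstraints:
    - maxSkew: 1
      topologyKey: kubernetes.io/hostname
      whenUnsatisfiable: ScheduleAnyway
      labelSelector:
        matchLabels:
          app: my-service
    - maxSkew: 1
      topologyKey: topology.kubernetes.io/zone
      whenUnsatisfiable: DoNotSchedule
      labelSelector:
        matchLabels:
          app: my-service
```

This is part of what lets the homogeneous 16-core nodes bin-pack well: the scheduler can stack multiple replicas of the same workload onto one node as long as the spread stays within maxSkew.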

And a set of other minor policies which all add up.

So, we switched one region to this new format, hoping that it would at least be cost neutral. But the results blew us away.

The cluster costs were 30% lower than the Spot cluster!

And double surprise!

A large chunk of “Networking Noise” just went away.

The full-node Chaos Monkey was no longer there. The resulting machines gave us better Kubernetes networking and load balancer stability. An entire class of “Networking Noise” dropped below our Tier 1 5x9s thresholds.

We switched from Spot VMs to 70% more expensive Standard VMs, saved 30% of our Cluster Costs, and got higher quality in the bargain.

It sounds counter-intuitive, but it can be done with due diligence. Your mileage may vary. We can imagine this does not work if you have low-density clusters, like cluster-per-team or (shudder) this madness.

Subhas Dandapani, Mark Parkinson and Fayiz Musthafa

Interested in solving these kind of problems at a vast scale? Join Omio. We bring together more than 1,000 transportation providers across trains, buses, flights, ferries, cars, and airport transfers to make it easier for people to focus on what really matters: the journey.
