15 Nov, 2025

A story of how NFTables drives Kubernetes nodes to memory exhaustion at 100k scale.
It's a privilege to experience systems that operate at extreme scales. Consider the following temperatures:
- 329.85 K — July 10, 1913; the hottest day recorded on Earth[1]
- 800 K — Temperature of a hot stove
- 5.5 × 10³ K — The Sun's surface temperature
- 15 × 10⁶ K — The Sun's core temperature
- 10¹² K — Temperatures achieved in heavy ion collisions at the Large Hadron Collider
- 10³² K+ — Temperature of the universe 10⁻⁴³ seconds after the Big Bang[2]
Readers of this blog weren't around in 1913 to witness the hottest day on Earth, but they can still imagine what that day would have been like based on experience. Similarly, most of us can intuit 800 K as "Ouch! I burnt my hand on a hot stove." It is natural to extrapolate your experience of temperatures from scales 1-2 to scales 3-6. However, at extreme temperatures, matter behaves in strange and unintuitive ways. Your everyday experience cannot prepare you for these regimes.
In fact, it's humbling to think that countless civilizations rose and fell over millions of years before science progressed enough to reach the temperatures required for atomic fission and fusion. And we still don't fully understand how matter behaved at the Planck Epoch (the earliest moment after the Big Bang) or maybe that's just me.
High-energy physicists are obsessed with smashing particles together at ever-higher temperatures not only to confirm predictions made by various models of physics, but also because they observe matter behaving in states that no model had predicted before. Extreme scales reveal emergent phenomena that cannot be extrapolated from smaller scales.
After being in tech for a while, I've felt that the same principle applies to distributed systems.
A Kubernetes cluster with 10 nodes? Intuitive. 100 nodes? Still manageable. But 100,000 nodes? With the rise of AI hyperscalers, we are entering a regime where intuitions break down; where seemingly innocent operations become existential threats to cluster stability.
With this, I have two main points:
- Systems that operate at extreme scales are very uncommon. Most engineers will never work on them, and most software is never tested at these scales.
- It is embarrassingly hard to predict how systems behave at such scales. The assumptions baked into software at normal scales become the very source of failure at extreme scales.
This is a story about what happens when Kubernetes, designed to orchestrate containers at scale, meets the kind of scale where its own optimizations become catastrophic.
If you're already familiar with Kubernetes internals—nodes, EndpointSlices, and kube-proxy—feel free to jump to Section III. Otherwise, let's establish the foundation.
Anatomy of a Node
Before we dive into the chaos of extreme scale, let's establish what "normal" looks like. Understanding what happens when a node joins a Kubernetes cluster and how the node behaves throughout its lifecycle is essential to understanding how things break at hyper-scales.
Say you've just provisioned a new virtual machine. It has an operating system, a network interface, some CPU, and some memory. But it's not yet part of your Kubernetes cluster. What transforms this generic compute resource into a functioning Kubernetes node?
Well, first you need to have an agent called kubelet running on this node. When the kubelet starts, it:
- Registers itself with the API server, announcing "I exist, and here are my capabilities"
- Reports its status continuously: CPU capacity, memory, disk, network, and operating system details
- Establishes a heartbeat mechanism by periodically renewing a Lease object in the Kubernetes API server. If a node fails to renew its lease, it is marked as unhealthy (a minimal Lease is sketched right after this list)
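That Lease is an ordinary object in the kube-node-lease namespace that the kubelet keeps renewing. A minimal sketch of what it looks like (the node name and timestamp are illustrative; the 40-second duration and roughly 10-second renew interval are the defaults):

```yaml
apiVersion: coordination.k8s.io/v1
kind: Lease
metadata:
  name: node-1                  # matches the Node object's name
  namespace: kube-node-lease    # one Lease per node lives here
spec:
  holderIdentity: node-1
  leaseDurationSeconds: 40                    # how long the lease is considered valid
  renewTime: "2025-11-15T10:00:00.000000Z"    # the kubelet bumps this every ~10 seconds
```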
The API server receives this registration and creates a Node object (a fundamental resource that represents your machine in the cluster's data model). From this moment on, your node exists in Kubernetes' view of the world.
Apart from the kubelet, there are a few more critical pieces of software running on the node. To name a few:
kube-proxy sits at the heart of Kubernetes networking. Its job is deceptively simple: ensure that when you create a Service in Kubernetes, network traffic actually reaches the Pods backing that Service. It does this by programming network rules on the node—rules that intercept packets destined for Service IPs and redirect them to actual Pod IPs. We'll return to kube-proxy later. This will be important soon!
Container runtime (like containerd or CRI-O) does the actual work of pulling images and running containers. The kubelet tells it what to run; the runtime makes it happen.
CNI plugins configure the network interfaces for Pods, ensuring each Pod gets its own IP address and can communicate with other Pods across the cluster.
At this point, your node is ready. It's registered, healthy, and waiting for work.
Now let's see what happens when you create a Deployment on the cluster; say, a simple web application with 3 replicas:
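Something like the following is all it takes; this is a minimal sketch, and the name, labels, and image are placeholders:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: webapp
spec:
  replicas: 3                # three identical Pods
  selector:
    matchLabels:
      app: webapp
  template:
    metadata:
      labels:
        app: webapp
    spec:
      containers:
        - name: webapp
          image: nginx:1.27  # placeholder image
          ports:
            - containerPort: 80
```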
When you run kubectl apply, this is what roughly happens:
- API server receives the Deployment manifest and stores it in a database called etcd
- Deployment controller (running in the control plane) sees the new Deployment and creates a ReplicaSet
- ReplicaSet controller sees it needs 3 Pods and creates 3 Pod objects in the API server
- Scheduler watches for unscheduled Pods, evaluates which nodes have sufficient resources, and assigns each Pod to a node
- Kubelet on each selected node sees that it has been assigned a Pod (receives a SYNCLOOP ADD event), tells the container runtime to pull the image and start the container
- Pod networking is configured by the CNI plugin, giving each Pod its own IP address
Within seconds, you have 3 Pods running across your cluster. But they're just containers with IP addresses. How do you actually send traffic to them?
What is a Service object and why do we care?
Services provide stable endpoints for your Pods. Even though Pods can be created and destroyed constantly (changing their IP addresses), a Service gives you a single, stable IP that load-balances across all matching Pods:
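For our webapp, a minimal ClusterIP Service is enough. A sketch, with names mirroring the Deployment above:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: webapp-service
spec:
  selector:
    app: webapp        # matches the Pods created by the Deployment
  ports:
    - port: 80         # the stable port clients use
      targetPort: 80   # the Pods' container port
```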
When this Service is created:
- API server stores the Service object
- API server allocates a cluster IP for the Service (e.g., 10.96.100.50)
- EndpointSlice controller finds all Pods matching the Service's selector and creates an EndpointSlice object
EndpointSlices are particularly interesting here.
An EndpointSlice is a collection of network endpoints (IP addresses and ports) that back a Service. For our webapp Service with 3 Pods, the EndpointSlice might look like:
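Roughly like this, expressed in the discovery.k8s.io/v1 API (abridged; the generated name suffix is made up):

```yaml
apiVersion: discovery.k8s.io/v1
kind: EndpointSlice
metadata:
  name: webapp-service-abc12                      # name is generated by the controller
  labels:
    kubernetes.io/service-name: webapp-service    # ties the slice back to the Service
addressType: IPv4
ports:
  - protocol: TCP
    port: 80
endpoints:
  - addresses: ["10.244.1.5"]
    conditions: { ready: true }
  - addresses: ["10.244.2.8"]
    conditions: { ready: true }
  - addresses: ["10.244.3.12"]
    conditions: { ready: true }
```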
When this EndpointSlice is created or updated, something critical happens: every kube-proxy instance on every node in the cluster watches for this change.
Why? Because kube-proxy needs to know where to send traffic. When someone on Node A tries to access webapp-service:80, kube-proxy on Node A must know that it can redirect that traffic to 10.244.1.5:80, 10.244.2.8:80, or 10.244.3.12:80.
This means that every time an EndpointSlice changes, kube-proxy on every node reacts.
At normal scale (example: 10 nodes, 50 Services, 200 Pods) this is trivial. EndpointSlice updates are infrequent. The network rules kube-proxy needs to maintain are manageable. A single node might have a few hundred rules total. The impact of an EndpointSlice update is measured in milliseconds and a few megabytes of memory. It's invisible. It just works.
This is the steady state that most engineers experience. Kubernetes feels elegant and seamless. Nodes join, Pods are scheduled, Services route traffic. The complexity is beautifully abstracted away.
But what happens when you have 100,000 nodes? What happens when a single Service suddenly has endpoints on every single node in the cluster?
Let's return to our EndpointSlice example. You've created a Service with 3 Pod endpoints. The EndpointSlice gets created, kube-proxy instances across your cluster each receive the update, they each update their local network rules, and everything is fine.
The cost of this operation is a few milliseconds per node and a few kilobytes of memory; in other words, unnoticeable.
Now let's scale this up and watch what happens. Consider a simple scenario: a DaemonSet-backed Service that runs on every node in your cluster. Here's how the resource footprint grows:
| Cluster size | Endpoints | Ruleset size | Memory per node |
| --- | --- | --- | --- |
| 10 nodes | 10 | ~10 KB | ~1 MB |
| 100 nodes | 100 | ~100 KB | ~10 MB |
| 1,000 nodes | 1,000 | ~1 MB | ~100 MB |
| 10,000 nodes | 10,000 | ~10 MB | ~300-500 MB |
| 100,000 nodes | 100,000 | ~16-25 MB | 700+ MB |
Everything works perfectly up to several thousand nodes. The cluster is healthy. Services work. Traffic flows. The assumptions built into Kubernetes hold.
But at 100,000 nodes? Something fundamentally changes.
100,000 Nodes: Where Assumptions Collapse
Let's introduce a single, innocent Service X that runs on every node. Nothing special. The kind of thing you deploy without a second thought.
This Service will have 100,000 endpoints (one for each node).
When this Service's EndpointSlice is created:
- 100,000 kube-proxy instances receive notification of the new EndpointSlice
- Each kube-proxy must process this update and regenerate its network rules
- Each kube-proxy now needs to maintain rules for routing to 100,000 backend endpoints
But how does kube-proxy actually apply these rules?
Enter NFTables
Kubernetes introduced NFTables mode for kube-proxy as an alpha feature in version 1.29; it reached beta in 1.31 and GA in 1.33. iptables remains the default mode, but NFTables is available as an opt-in alternative that promises significantly better performance at scale.
The traditional iptables mode has severe performance problems:
- iptables: O(n) sequential rule processing—every packet checks every rule until it finds a match
- At 100,000 endpoints, this becomes catastrophically slow
- Packet processing latency increases linearly with the number of endpoints
NFTables was designed to solve exactly this problem:
- nftables: O(1) map lookups using kernel hash maps
- Constant-time performance regardless of the number of endpoints
- At 100,000 endpoints, packet latency remains in the microsecond range
The performance improvement is real and dramatic. From a latency perspective, NFTables absolutely delivers. Packet processing time remains essentially constant whether you have 10 endpoints or 100,000 endpoints.
The Hidden Cost
But there's a cost that nobody anticipated at extreme scale.
When kube-proxy needs to update NFTables rules, it doesn't send incremental changes. It generates a complete ruleset—a file that describes every chain, every rule, every endpoint mapping—and hands it to the nft command as a single nft -f - invocation.
For our 100,000-endpoint Service, this ruleset file might look like this (simplified):
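Here is the general shape, shown with only 3 endpoints; the chain names are illustrative rather than the exact ones kube-proxy generates, and the numgen-plus-verdict-map construct is the general idea rather than the literal rules:

```
table ip kube-proxy {
    chain service-webapp-tcp-80 {
        # pick one backend at random; with 100,000 endpoints this map has
        # 100,000 entries, and there are ~100,000 endpoint chains below it
        numgen random mod 3 vmap {
            0 : goto endpoint-webapp-1,
            1 : goto endpoint-webapp-2,
            2 : goto endpoint-webapp-3
        }
    }
    chain endpoint-webapp-1 {
        dnat to 10.244.1.5:80
    }
    chain endpoint-webapp-2 {
        dnat to 10.244.2.8:80
    }
    chain endpoint-webapp-3 {
        dnat to 10.244.3.12:80
    }
}
```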
For a 100,000-node cluster with this DaemonSet Service:
- Ruleset file size: ~16-25 MB
- Number of chains: ~100,005
- Number of rules: ~200,000+
And here's the critical part: every single node in your 100,000-node cluster must load and process this ruleset file.
This isn't a one-time operation either. Every time any endpoint in that Service changes—a Pod restarts, a node becomes unhealthy, a new node joins—kube-proxy on every node regenerates and reloads the entire ruleset.
You deploy your DaemonSet Service. Within minutes, you start seeing nodes transition to NotReady. Not just a few nodes but thousands of them. Specifically, the nodes with the least memory.
The kubelet has stopped renewing its lease. Why? The node ran out of memory completely. The Out-Of-Memory (OOM) killer terminated processes. Eventually, it killed the kubelet itself.
When you examine a node just before it dies, you see something unexpected in the process list:
A process called nft -f - is consuming 689 MB of memory. On a node with only 2 GB of total memory, this single process is taking over a third of available RAM.
And it's spawned by kube-proxy. An abstraction broke.
Nothing in the software changed. The kube-proxy code is identical whether you have 10 nodes or 100,000 nodes. The NFTables tool is the same. The kernel's networking stack is the same.
But at 100,000 nodes, the system exhibits behavior that was never modeled, never tested, never anticipated:
- At 100 nodes: nft consumes ~3 MB, runs for <100ms, invisible
- At 1,000 nodes: nft consumes ~30 MB, runs for ~1 second, barely noticeable
- At 10,000 nodes: nft consumes ~300 MB, runs for ~10 seconds, starts to be noticeable
- At 100,000 nodes: nft consumes 700+ MB, runs for 30+ seconds, kills nodes
Per-node memory grows roughly in step with the number of endpoints, and because every node pays this cost at the same time, the cluster-wide memory burned on rule loading grows quadratically with cluster size.
At normal scale, EndpointSlice creation is an abstraction that works perfectly. You never think about what kube-proxy is doing. You never monitor nft's memory usage. You never consider that loading a ruleset could be expensive.
At 100,000 nodes, the abstraction breaks. The invisible becomes visible. The negligible becomes catastrophic. And most critically: nothing in the system warned you this would happen. There's no alert at 10,000 nodes saying "WARNING: Approaching memory limits." The system doesn't gradually degrade. It works perfectly until suddenly, catastrophically, it doesn't.
This is what happens at extreme scale. The software didn't break. The logic didn't fail. The algorithms are correct. But the assumption that ruleset files would remain small, that memory consumption would be negligible, that these operations would complete quickly—these assumptions, invisible and unexamined at normal scale, become catastrophic at 100,000 nodes.
NFTables solved the packet processing bottleneck (O(1) lookups) but revealed a new one: memory consumption during rule loading. An optimization designed for scale became the mechanism of failure.
This is the reality of extreme scale: discovering failure modes that no amount of reasoning at smaller scales could have predicted.
In case you are looking for more reasons not to use nftables, [see this](https://kubernetes.io/blog/2025/02/28/nftables-kube-proxy/#why-not-nftables).
Understanding the NFTables Memory Footprint
"Everyone should just read the source code" - Nick Baker
Who is Nick Baker? He's a legendary 10x programmer. Go check out his work: Nick Baker
Now that we understand what breaks at scale, let's understand why it breaks. I read through the source code because I had way too much time on my hands. Why does a 16 MB ruleset file consume 700+ MB of memory?
The Code Path
The critical code path lives in the kubernetes-sigs/knftables library that kube-proxy uses.
Every time kube-proxy needs to update networking rules, it:
- Generates a complete ruleset file in memory
- Spawns /usr/sbin/nft --check -f - to validate it
- If validation succeeds, spawns /usr/sbin/nft -f - to apply it
The --check flag is a safety mechanism—you want to validate the ruleset before loading it into the kernel. But at 100,000 nodes, this safety check becomes the failure point.
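A minimal Go sketch of that check-then-apply flow; this illustrates the pattern, not the actual knftables source:

```go
package main

import (
	"bytes"
	"fmt"
	"os/exec"
	"strings"
)

// runNft pipes the complete ruleset into a single nft invocation.
// Both the --check pass and the apply pass receive the entire ruleset,
// so nft has to parse and materialize every chain both times.
func runNft(ruleset string, args ...string) error {
	cmd := exec.Command("/usr/sbin/nft", args...)
	cmd.Stdin = strings.NewReader(ruleset)
	var stderr bytes.Buffer
	cmd.Stderr = &stderr
	if err := cmd.Run(); err != nil {
		return fmt.Errorf("nft %v failed: %w (%s)", args, err, stderr.String())
	}
	return nil
}

func applyRuleset(ruleset string) error {
	// 1. Validate the whole ruleset first.
	if err := runNft(ruleset, "--check", "-f", "-"); err != nil {
		return fmt.Errorf("ruleset failed validation: %w", err)
	}
	// 2. Apply it for real.
	return runNft(ruleset, "-f", "-")
}

func main() {
	// Tiny stand-in ruleset; kube-proxy generates one with ~100,000 chains.
	if err := applyRuleset("add table ip kube-proxy-demo\n"); err != nil {
		fmt.Println(err)
	}
}
```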
The NFTables Pipeline
When nft -f ruleset.nft executes, it's not simply reading a file. It's compiling it through several stages:
Stage 1: Lexing and Parsing (AST Generation)
The ruleset is tokenized and parsed into an Abstract Syntax Tree. This tree representation is significantly larger than the input text.
- Input: 16 MB text file
- AST in memory: ~10× file size = ~160 MB
You can observe this yourself: the AST representation is roughly 10× the size of the input file.
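If you want to see the cost on a test machine, one way (as root, assuming GNU time and a generated ruleset.nft file) is to check the peak memory of a validation-only run; this measures the whole check pass rather than just parsing, but it makes the multiplier obvious:

```
/usr/bin/time -v nft --check -f ruleset.nft 2>&1 | grep "Maximum resident set size"
```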
Stage 2: C Structure Construction
NFTables constructs internal C objects for every table, chain, rule, set, and map; the key data structures are the chain and rule objects.
Each chain and rule requires memory for:
- The structure itself (~500 bytes)
- String copies of names and expressions
- Linked list overhead
- Expression trees
- Jump targets and references
For our 100,000-endpoint Service:
- ~100,000 chains (one per endpoint)
- ~100,000+ rules (load balancing + DNAT rules)
- Memory per chain+rule: ~2-5 KB
- Total: ~250-300 MB
Stage 3: Netlink Batch Serialization
NFTables needs to send all these rules to the kernel atomically. It serializes every object into a netlink message buffer in TLV (Type-Length-Value) format.
This is another complete copy of the ruleset, now in binary netlink format:
- Netlink buffer: ~250 MB
Total Memory Footprint:
- AST: ~160 MB
- C structures: ~250-300 MB
- Netlink batch buffer: ~250 MB
- Combined: roughly 700 MB to load a 16 MB ruleset file
All of this happens in userspace, in the nft process, every time the ruleset is loaded or validated.
Why This Can't Be Easily Fixed
The NFTables tool needs to:
- Parse and validate the entire ruleset for correctness
- Build a complete in-memory representation
- Serialize it for atomic kernel submission
These operations are inherently O(n) in both time and space with respect to the number of rules. You can't validate syntax without parsing. You can't ensure atomicity without buffering the entire transaction.
Could the multiplier be reduced? Perhaps. More efficient data structures, streaming parsers, or compression might help. But the fundamental O(n) memory consumption remains—it's architectural.
The Kubernetes Relationship
Here's how Kubernetes resources map to memory consumption:
| Kubernetes resource | NFTables objects added | Approximate memory |
| --- | --- | --- |
| Service | +1-3 chains | ~5-15 KB |
| EndpointSlice | +1 element per (IP, port, protocol) | ~2-5 KB per endpoint |
| DaemonSet (100k nodes) | +100,000 endpoints | ~500-700 MB |
For our 100,000-node cluster with a single DaemonSet Service:
- 1 Service → 3 base chains
- 100,000 endpoints → 100,000 endpoint chains + 100,000 load-balancing rules
- Result: 16 MB ruleset file, 700 MB memory consumption
And remember: this happens on every node in the cluster, every time the EndpointSlice changes.
The Hyperscaler Future
The race to AGI isn't being won by better algorithms (hot-take?). It's being won by whoever can train faster.
Consider the training timeline of frontier models:
| Model | Parameters | Hardware | Training time | Iteration speed |
| --- | --- | --- | --- | --- |
| GPT-3 (2020) | 175B | ~2,000 A100s | ~30 days | 1 model/month |
| GPT-3 on 100k nodes* | 175B | 100,000 A100s | ~15 hours | ~60 models/month |
| GPT-4 (2023) | 1.8T | ~25,000 A100s | 90-100 days | 1 model/3 months |
| GPT-4 on 100k nodes* | 1.8T | 100,000 A100s | ~22 days | ~1.3 models/month |
| LLaMA 3.1 (2024) | 405B | ~16,000 H100s | ~60 days | 1 model/2 months |
| Gemini Ultra (2024) | Unknown | Large TPU fleet (multi-DC) | Weeks-months | Faster |
| Gemini 2.0 (2024) | Unknown | 100,000+ TPU v6 | Unknown | Much faster |
| Future (2026)** | 10T+ | 200,000+ GPUs/TPUs | ~10 days | ~3 models/month |
*Hypothetical: What if these models were trained on 100k node hyperscale clusters?
**Projected based on current scaling trends
GPT-3 took 30 days on 2,000 GPUs. On 100,000 nodes? 15 hours. GPT-4 took 100 days on 25,000 GPUs. On 100,000 nodes? 22 days.
When you can train GPT-4 in under a month instead of three months, you don't just iterate faster. You fundamentally change how you do research. You can afford to try risky ideas (more experimentation = better results). You can run massive ablation studies. You can discover what works through empirical observation rather than theoretical guesswork.
This is why hyperscale matters. The company that can iterate fastest wins. Not because bigger models are automatically better, but because faster iteration means you find better models sooner. You discover what works. You ship while competitors are still training.
And this creates a new class of engineering challenges. At 100,000 nodes:
- Your control plane becomes a distributed system problem
- Your observability stack drowns in metrics
- Your network topology hits physics limits
- Your memory assumptions—like we saw with NFTables—collapse catastrophically
My guess would be that Kubernetes is where this battle will be fought. Not because it's perfect, but because it's the only orchestration platform that's even attempting to operate at this scale. The infrastructure that wins the AGI race will be Kubernetes infrastructure.
Which means the engineering challenges we discussed—NFTables memory exhaustion, endpoint scaling, service mesh overhead—aren't edge cases. They're the new normal. And there will be more. In fact, AWS recently announced support for 100k-node EKS clusters to power AI/ML workloads, along with a blog post that digs into the interesting scaling challenges they solved [4].
Working on hyperscale infrastructure means operating in a regime where your intuitions fail daily. Where every "obvious" optimization reveals a new bottleneck. Where you get to rewrite the assumptions that everyone else takes for granted.
Just as particle physicists discovered quarks by smashing atoms together at energies no one had reached before, hyperscale engineers will discover fundamental truths about distributed systems by pushing clusters to scales no one has attempted before.
Critique
The blog is open for critique. Drop me an email at [email protected] with subject: "blog critique" and I will add your points in this section. Also please point out any inaccuracies / mistakes in the essay/post.
Footnotes:
1. Hottest day recorded
2. Planck Epoch
3. Standard Model of Physics
4. Amazon announces support for ultra-scale AI/ML workloads
What's next?
The things I'm gonna be writing about in the coming few weeks:
- How does distributed training work at hyperscale?
- Evolution of GPUs over time (differences in NVIDIA, AMD and Inf GPUs)
#ai-infrastructure #container-networking #distributed-systems #endpointslices #hyperscale #infrastructure-engineering #iptables #kube-proxy #kubernetes #kubernetes-at-scale #memory-exhaustion #nftables