15 Nov, 2025

A story of how NFTables drives Kubernetes nodes to memory exhaustion at 100k scale.
It's a privilege to experience systems that operate at extreme scales. Consider the following temperatures:
- 329.85 K — July 10, 1913; the hottest day recorded on Earth[1]
- 800 K — Temperature of a hot stove
- 5.5 × 10³ K — The Sun's surface temperature
- 15 × 10⁶ K — The Sun's core temperature
- 10¹² K — Temperatures achieved in heavy ion collisions at the Large Hadron Collider
- 10³² K+ — Temperature of the universe 10⁻⁴³ seconds after the Big Bang[2]
Readers of this blog weren't around in 1913 to witness the hottest day on Earth, but they can still imagine what that day would have been like based on experience. Similarly, most of us can intuit 800 K as "Ouch! I burnt my hand on a hot stove." It is natural to extrapolate your experience of temperatures from scales 1-2 to scales 3-6. However, at extreme temperatures, matter behaves in strange and unintuitive ways. Your everyday experience cannot prepare you for these regimes.
In fact, it's humbling to think that countless civilizations rose and fell over millions of years before science progressed enough to reach the temperatures required for atomic fission and fusion. And we still don't fully understand how matter behaved at the Planck Epoch (the earliest moment after the Big Bang) or maybe that's just me.
High-energy physicists are obsessed with smashing particles together at ever-higher temperatures not only to confirm predictions made by various models of physics, but also because they observe matter behaving in states that no model had predicted before. Extreme scales reveal emergent phenomena that cannot be extrapolated from smaller scales.
After being in tech for a while, I've felt that the same principle applies to distributed systems.
A Kubernetes cluster with 10 nodes? Intuitive. 100 nodes? Still manageable. But 100,000 nodes? With the rise of AI hyperscalers, we are entering a regime where intuitions break down; where seemingly innocent operations become existential threats to cluster stability.
With this, I have two main points:
- Systems that operate at extreme scales are very uncommon. Most engineers will never work on them, and most software is never tested at these scales.
- It is embarrassingly hard to predict how systems behave at such scales. The assumptions baked into software at normal scales become the very source of failure at extreme scales.
This is a story about what happens when Kubernetes, designed to orchestrate containers at scale, meets the kind of scale where its own optimizations become catastrophic.
If you're already familiar with Kubernetes internals—nodes, EndpointSlices, and kube-proxy—feel free to jump to Section III. Otherwise, let's establish the foundation.
Anatomy of a Node
Before we dive into the chaos of extreme scale, let's establish what "normal" looks like. Understanding what happens when a node joins a Kubernetes cluster and how the node behaves throughout its lifecycle is essential to understanding how things break at hyper-scales.
Say you've just provisioned a new virtual machine. It has an operating system, a network interface, some CPU, and some memory. But it's not yet part of your Kubernetes cluster. What transforms this generic compute resource into a functioning Kubernetes node?
Well, first you need to have an agent called kubelet running on this node. When the kubelet starts, it:
- Registers itself with the API server, announcing "I exist, and here are my capabilities"
- Reports its status continuously: CPU capacity, memory, disk, network, and operating system details
- Establishes a heartbeat mechanism by periodically renewing a Lease object in the Kubernetes API server. If a node fails to renew its lease, it is marked as unhealthy (a minimal Lease is sketched right after this list)
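That Lease is an ordinary object in the kube-node-lease namespace that the kubelet keeps renewing. A minimal sketch of what it looks like (the node name and timestamp are illustrative; the 40-second duration and roughly 10-second renew interval are the defaults):

```yaml
apiVersion: coordination.k8s.io/v1
kind: Lease
metadata:
  name: node-1                  # matches the Node object's name
  namespace: kube-node-lease    # one Lease per node lives here
spec:
  holderIdentity: node-1
  leaseDurationSeconds: 40                    # how long the lease is considered valid
  renewTime: "2025-11-15T10:00:00.000000Z"    # the kubelet bumps this every ~10 seconds
```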
The API server receives this registration and creates a Node object (a fundamental resource that represents your machine in the cluster's data model). From this moment on, your node exists in Kubernetes' view of the world.
Apart from the kubelet, there are a few more critical pieces of software running on the node. To name a few:
kube-proxy sits at the heart of Kubernetes networking. Its job is deceptively simple: ensure that when you create a Service in Kubernetes, network traffic actually reaches the Pods backing that Service. It does this by programming network rules on the node—rules that intercept packets destined for Service IPs and redirect them to actual Pod IPs. We'll return to kube-proxy later. This will be important soon!
Container runtime (like containerd or CRI-O) does the actual work of pulling images and running containers. The kubelet tells it what to run; the runtime makes it happen.
CNI plugins configure the network interfaces for Pods, ensuring each Pod gets its own IP address and can communicate with other Pods across the cluster.
At this point, your node is ready. It's registered, healthy, and waiting for work.
Now let's see what happens when you create a Deployment on the cluster; say, a simple web application with 3 replicas:
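Something like the following is all it takes; this is a minimal sketch, and the name, labels, and image are placeholders:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: webapp
spec:
  replicas: 3                # three identical Pods
  selector:
    matchLabels:
      app: webapp
  template:
    metadata:
      labels:
        app: webapp
    spec:
      containers:
        - name: webapp
          image: nginx:1.27  # placeholder image
          ports:
            - containerPort: 80
```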
When you run kubectl apply, this is what roughly happens:
- API server receives the Deployment manifest and stores it in a database called etcd
- Deployment controller (running in the control plane) sees the new Deployment and creates a ReplicaSet
- ReplicaSet controller sees it needs 3 Pods and creates 3 Pod objects in the API server
- Scheduler watches for unscheduled Pods, evaluates which nodes have sufficient resources, and assigns each Pod to a node
- Kubelet on each selected node sees that it has been assigned a Pod (receives a SYNCLOOP ADD event), tells the container runtime to pull the image and start the container
- Pod networking is configured by the CNI plugin, giving each Pod its own IP address
Within seconds, you have 3 Pods running across your cluster. But they're just containers with IP addresses. How do you actually send traffic to them?
What is a Service object and why do we care?
Services provide stable endpoints for your Pods. Even though Pods can be created and destroyed constantly (changing their IP addresses), a Service gives you a single, stable IP that load-balances across all matching Pods:
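For our webapp, a minimal ClusterIP Service is enough. A sketch, with names mirroring the Deployment above:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: webapp-service
spec:
  selector:
    app: webapp        # matches the Pods created by the Deployment
  ports:
    - port: 80         # the stable port clients use
      targetPort: 80   # the Pods' container port
```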
When this Service is created:
- API server stores the Service object
- API server allocates a cluster IP for the Service (e.g., 10.96.100.50)
- EndpointSlice controller finds all Pods matching the Service's selector and creates an EndpointSlice object
EndpointSlices are particularly interesting here.
An EndpointSlice is a collection of network endpoints (IP addresses and ports) that back a Service. For our webapp Service with 3 Pods, the EndpointSlice might look like:
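Roughly like this, expressed in the discovery.k8s.io/v1 API (abridged; the generated name suffix is made up):

```yaml
apiVersion: discovery.k8s.io/v1
kind: EndpointSlice
metadata:
  name: webapp-service-abc12                      # name is generated by the controller
  labels:
    kubernetes.io/service-name: webapp-service    # ties the slice back to the Service
addressType: IPv4
ports:
  - protocol: TCP
    port: 80
endpoints:
  - addresses: ["10.244.1.5"]
    conditions: { ready: true }
  - addresses: ["10.244.2.8"]
    conditions: { ready: true }
  - addresses: ["10.244.3.12"]
    conditions: { ready: true }
```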
When this EndpointSlice is created or updated, something critical happens: every kube-proxy instance on every node in the cluster watches for this change.
Why? Because kube-proxy needs to know where to send traffic. When someone on Node A tries to access webapp-service:80, kube-proxy on Node A must know that it can redirect that traffic to 10.244.1.5:80, 10.244.2.8:80, or 10.244.3.12:80.
This means that every time an EndpointSlice changes, kube-proxy on every node reacts.
At normal scale (example: 10 nodes, 50 Services, 200 Pods) this is trivial. EndpointSlice updates are infrequent. The network rules kube-proxy needs to maintain are manageable. A single node might have a few hundred rules total. The impact of an EndpointSlice update is measured in milliseconds and a few megabytes of memory. It's invisible. It just works.
This is the steady state that most engineers experience. Kubernetes feels elegant and seamless. Nodes join, Pods are scheduled, Services route traffic. The complexity is beautifully abstracted away.
But what happens when you have 100,000 nodes? What happens when a single Service suddenly has endpoints on every single node in the cluster?
Let's return to our EndpointSlice example. You've created a Service with 3 Pod endpoints. The EndpointSlice gets created, kube-proxy instances across your cluster each receive the update, they each update their local network rules, and everything is fine.
The cost of this operation is a few milliseconds per node and a few kilobytes of memory; in other words, unnoticeable.
Now let's scale this up and watch what happens. Consider a simple scenario: a DaemonSet-backed Service that runs on every node in your cluster. Here's how the resource footprint grows:
| Cluster size | Endpoints | Ruleset size | Memory per node |
| --- | --- | --- | --- |
| 10 nodes | 10 | ~10 KB | ~1 MB |
| 100 nodes | 100 | ~100 KB | ~10 MB |
| 1,000 nodes | 1,000 | ~1 MB | ~100 MB |
| 10,000 nodes | 10,000 | ~10 MB | ~300-500 MB |
| 100,000 nodes | 100,000 | ~16-25 MB | 700+ MB |
Everything works perfectly up to several thousand nodes. The cluster is healthy. Services work. Traffic flows. The assumptions built into Kubernetes hold.
But at 100,000 nodes? Something fundamentally changes.
100,000 Nodes: Where Assumptions Collapse
Let's introduce a single, innocent Service X that runs on every node. Nothing special. The kind of thing you deploy without a second thought.
This Service will have 100,000 endpoints (one for each node).
When this Service's EndpointSlice is created:
- 100,000 kube-proxy instances receive notification of the new EndpointSlice
- Each kube-proxy must process this update and regenerate its network rules
- Each kube-proxy now needs to maintain rules for routing to 100,000 backend endpoints
But how does kube-proxy actually apply these rules?
Enter NFTables
Kubernetes introduced NFTables mode for kube-proxy as an alpha feature in version 1.29; it reached beta in 1.31 and GA in 1.33. iptables remains the default mode, but NFTables is available as an opt-in alternative that promises significantly better performance at scale.
The traditional iptables mode has severe performance problems:
- iptables: O(n) sequential rule processing—every packet checks every rule until it finds a match
- At 100,000 endpoints, this becomes catastrophically slow
- Packet processing latency increases linearly with the number of endpoints
NFTables was designed to solve exactly this problem:
- nftables: O(1) map lookups using kernel hash maps
- Constant-time performance regardless of the number of endpoints
- At 100,000 endpoints, packet latency remains in the microsecond range
The performance improvement is real and dramatic. From a latency perspective, NFTables absolutely delivers. Packet processing time remains essentially constant whether you have 10 endpoints or 100,000 endpoints.
The Hidden Cost
But there's a cost that nobody anticipated at extreme scale.
When kube-proxy needs to update NFTables rules, it doesn't send incremental changes. It generates a complete ruleset—a file that describes every chain, every rule, every endpoint mapping—and hands it to the nft command as a single nft -f - invocation.
For our 100,000-endpoint Service, this ruleset file might look like this (simplified):
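Here is the general shape, shown with only 3 endpoints; the chain names are illustrative rather than the exact ones kube-proxy generates, and the numgen-plus-verdict-map construct is the general idea rather than the literal rules:

```
table ip kube-proxy {
    chain service-webapp-tcp-80 {
        # pick one backend at random; with 100,000 endpoints this map has
        # 100,000 entries, and there are ~100,000 endpoint chains below it
        numgen random mod 3 vmap {
            0 : goto endpoint-webapp-1,
            1 : goto endpoint-webapp-2,
            2 : goto endpoint-webapp-3
        }
    }
    chain endpoint-webapp-1 {
        dnat to 10.244.1.5:80
    }
    chain endpoint-webapp-2 {
        dnat to 10.244.2.8:80
    }
    chain endpoint-webapp-3 {
        dnat to 10.244.3.12:80
    }
}
```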
For a 100,000-node cluster with this DaemonSet Service:
- Ruleset file size: ~16-25 MB
- Number of chains: ~100,005
- Number of rules: ~200,000+
And here's the critical part: every single node in your 100,000-node cluster must load and process this ruleset file.
This isn't a one-time operation either. Every time any endpoint in that Service changes—a Pod restarts, a node becomes unhealthy, a new node joins—kube-proxy on every node regenerates and reloads the entire ruleset.
You deploy your DaemonSet Service. Within minutes, you start seeing nodes transition to NotReady. Not just a few nodes but thousands of them. Specifically, the nodes with the least memory.
The kubelet has stopped renewing its lease. Why? The node ran out of memory completely. The Out-Of-Memory (OOM) killer terminated processes. Eventually, it killed the kubelet itself.
When you examine a node just before it dies, you see something unexpected in the process list:
A process called nft -f - is consuming 689 MB of memory. On a node with only 2 GB of total memory, this single process is taking over a third of available RAM.
And it's spawned by kube-proxy. An abstraction broke.
Nothing in the software changed. The kube-proxy code is identical whether you have 10 nodes or 100,000 nodes. The NFTables tool is the same. The kernel's networking stack is the same.
But at 100,000 nodes, the system exhibits behavior that was never modeled, never tested, never anticipated:
- At 100 nodes: nft consumes ~3 MB, runs for <100ms, invisible
- At 1,000 nodes: nft consumes ~30 MB, runs for ~1 second, barely noticeable
- At 10,000 nodes: nft consumes ~300 MB, runs for ~10 seconds, starts to be noticeable
- At 100,000 nodes: nft consumes 700+ MB, runs for 30+ seconds, kills nodes
Per-node memory grows roughly in step with the number of endpoints, and because every node pays this cost at the same time, the cluster-wide memory burned on rule loading grows quadratically with cluster size.
At normal scale, EndpointSlice creation is an abstraction that works perfectly. You never think about what kube-proxy is doing. You never monitor nft's memory usage. You never consider that loading a ruleset could be expensive.
At 100,000 nodes, the abstraction breaks. The invisible becomes visible. The negligible becomes catastrophic. And most critically: nothing in the system warned you this would happen. There's no alert at 10,000 nodes saying "WARNING: Approaching memory limits." The system doesn't gradually degrade. It works perfectly until suddenly, catastrophically, it doesn't.
This is what happens at extreme scale. The software didn't break. The logic didn't fail. The algorithms are correct. But the assumption that ruleset files would remain small, that memory consumption would be negligible, that these operations would complete quickly—these assumptions, invisible and unexamined at normal scale, become catastrophic at 100,000 nodes.
NFTables solved the packet processing bottleneck (O(1) lookups) but revealed a new one: memory consumption during rule loading. An optimization designed for scale became the mechanism of failure.
This is the reality of extreme scale: discovering failure modes that no amount of reasoning at smaller scales could have predicted.
In case you are looking for more reasons not to use nftables, [see this](https://kubernetes.io/blog/2025/02/28/nftables-kube-proxy/#why-not-nftables).
Understanding the NFTables Memory Footprint
"Everyone should just read the source code" - Nick Baker
Who is Nick Baker? He's a legendary 10x programmer. Go check out his work: Nick Baker
Now that we understand what breaks at scale, let's understand why it breaks. I read through the source code because I had way too much time on my hands. Why does a 16 MB ruleset file consume 700+ MB of memory?
The Code Path
The critical code path lives in the kubernetes-sigs/knftables library that kube-proxy uses.
Every time kube-proxy needs to update networking rules, it:
- Generates a complete ruleset file in memory
- Spawns /usr/sbin/nft --check -f - to validate it
- If validation succeeds, spawns /usr/sbin/nft -f - to apply it
The --check flag is a safety mechanism—you want to validate the ruleset before loading it into the kernel. But at 100,000 nodes, this safety check becomes the failure point.
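A minimal Go sketch of that check-then-apply flow; this illustrates the pattern, not the actual knftables source:

```go
package main

import (
	"bytes"
	"fmt"
	"os/exec"
	"strings"
)

// runNft pipes the complete ruleset into a single nft invocation.
// Both the --check pass and the apply pass receive the entire ruleset,
// so nft has to parse and materialize every chain both times.
func runNft(ruleset string, args ...string) error {
	cmd := exec.Command("/usr/sbin/nft", args...)
	cmd.Stdin = strings.NewReader(ruleset)
	var stderr bytes.Buffer
	cmd.Stderr = &stderr
	if err := cmd.Run(); err != nil {
		return fmt.Errorf("nft %v failed: %w (%s)", args, err, stderr.String())
	}
	return nil
}

func applyRuleset(ruleset string) error {
	// 1. Validate the whole ruleset first.
	if err := runNft(ruleset, "--check", "-f", "-"); err != nil {
		return fmt.Errorf("ruleset failed validation: %w", err)
	}
	// 2. Apply it for real.
	return runNft(ruleset, "-f", "-")
}

func main() {
	// Tiny stand-in ruleset; kube-proxy generates one with ~100,000 chains.
	if err := applyRuleset("add table ip kube-proxy-demo\n"); err != nil {
		fmt.Println(err)
	}
}
```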
The NFTables Pipeline
When nft -f ruleset.nft executes, it's not simply reading a file. It's compiling it through several stages:
Stage 1: Lexing and Parsing (AST Generation)
The ruleset is tokenized and parsed into an Abstract Syntax Tree. This tree representation is significantly larger than the input text.
- Input: 16 MB text file
- AST in memory: ~10× file size = ~160 MB
You can observe this yourself: the AST representation is roughly 10× the size of the input file.
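If you want to see the cost on a test machine, one way (as root, assuming GNU time and a generated ruleset.nft file) is to check the peak memory of a validation-only run; this measures the whole check pass rather than just parsing, but it makes the multiplier obvious:

```
/usr/bin/time -v nft --check -f ruleset.nft 2>&1 | grep "Maximum resident set size"
```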
Stage 2: C Structure Construction
NFTables constructs internal C objects for every table, chain, rule, set, and map; the key data structures are the chain and rule objects.
Each chain and rule requires memory for:
- The structure itself (~500 bytes)
- String copies of names and expressions
- Linked list overhead
- Expression trees
- Jump targets and references
For our 100,000-endpoint Service:
- ~100,000 chains (one per endpoint)
- ~100,000+ rules (load balancing + DNAT rules)
- Memory per chain+rule: ~2-5 KB
- Total: ~250-300 MB
Stage 3: Netlink Batch Serialization
NFTables needs to send all these rules to the kernel atomically. It serializes every object into a netlink message buffer in TLV (Type-Length-Value) format.
This is another complete copy of the ruleset, now in binary netlink format:
- Netlink buffer: ~250 MB
Total Memory Footprint:
- AST: ~160 MB
- C structures: ~250-300 MB
- Netlink batch buffer: ~250 MB
- Combined: roughly 700 MB to load a 16 MB ruleset file
All of this happens in userspace, in the nft process, every time the ruleset is loaded or validated.
Why This Can't Be Easily Fixed
The NFTables tool needs to:
- Parse and validate the entire ruleset for correctness
- Build a complete in-memory representation
- Serialize it for atomic kernel submission
These operations are inherently O(n) in both time and space with respect to the number of rules. You can't validate syntax without parsing. You can't ensure atomicity without buffering the entire transaction.
Could the multiplier be reduced? Perhaps. More efficient data structures, streaming parsers, or compression might help. But the fundamental O(n) memory consumption remains—it's architectural.
The Kubernetes Relationship
Here's how Kubernetes resources map to memory consumption:
| Kubernetes resource | NFTables objects added | Approximate memory |
| --- | --- | --- |
| Service | +1-3 chains | ~5-15 KB |
| EndpointSlice | +1 element per (IP, port, protocol) | ~2-5 KB per endpoint |
| DaemonSet (100k nodes) | +100,000 endpoints | ~500-700 MB |
For our 100,000-node cluster with a single DaemonSet Service:
- 1 Service → 3 base chains
- 100,000 endpoints → 100,000 endpoint chains + 100,000 load-balancing rules
- Result: 16 MB ruleset file, 700 MB memory consumption
And remember: this happens on every node in the cluster, every time the EndpointSlice changes.
The Hyperscaler Future
The race to AGI isn't being won by better algorithms (hot-take?). It's being won by whoever can train faster.
Consider the training timeline of frontier models:
| Model | Parameters | Hardware | Training time | Iteration speed |
| --- | --- | --- | --- | --- |
| GPT-3 (2020) | 175B | ~2,000 A100s | ~30 days | 1 model/month |
| GPT-3 on 100k nodes* | 175B | 100,000 A100s | ~15 hours | ~60 models/month |
| GPT-4 (2023) | 1.8T | ~25,000 A100s | 90-100 days | 1 model/3 months |
| GPT-4 on 100k nodes* | 1.8T | 100,000 A100s | ~22 days | ~1.3 models/month |
| LLaMA 3.1 (2024) | 405B | ~16,000 H100s | ~60 days | 1 model/2 months |
| Gemini Ultra (2024) | Unknown | Large TPU fleet (multi-DC) | Weeks-months | Faster |
| Gemini 2.0 (2024) | Unknown | 100,000+ TPU v6 | Unknown | Much faster |
| Future (2026)** | 10T+ | 200,000+ GPUs/TPUs | ~10 days | ~3 models/month |
*Hypothetical: What if these models were trained on 100k node hyperscale clusters?
**Projected based on current scaling trends
GPT-3 took 30 days on 2,000 GPUs. On 100,000 nodes? 15 hours. GPT-4 took 100 days on 25,000 GPUs. On 100,000 nodes? 22 days.
When you can train GPT-4 in under a month instead of three months, you don't just iterate faster. You fundamentally change how you do research. You can afford to try risky ideas (more experimentation = better results). You can run massive ablation studies. You can discover what works through empirical observation rather than theoretical guesswork.
This is why hyperscale matters. The company that can iterate fastest wins. Not because bigger models are automatically better, but because faster iteration means you find better models sooner. You discover what works. You ship while competitors are still training.
And this creates a new class of engineering challenges. At 100,000 nodes:
- Your control plane becomes a distributed system problem
- Your observability stack drowns in metrics
- Your network topology hits physics limits
- Your memory assumptions—like we saw with NFTables—collapse catastrophically
My guess would be that Kubernetes is where this battle will be fought. Not because it's perfect, but because it's the only orchestration platform that's even attempting to operate at this scale. The infrastructure that wins the AGI race will be Kubernetes infrastructure.
Which means the engineering challenges we discussed—NFTables memory exhaustion, endpoint scaling, service mesh overhead—aren't edge cases. They're the new normal. And there will be more. In fact, AWS recently announced support for 100k-node EKS clusters to power AI/ML workloads, along with a blog post that digs into the interesting scaling challenges they solved [4].
Working on hyperscale infrastructure means operating in a regime where your intuitions fail daily. Where every "obvious" optimization reveals a new bottleneck. Where you get to rewrite the assumptions that everyone else takes for granted.
Just as particle physicists discovered quarks by smashing atoms together at energies no one had reached before, hyperscale engineers will discover fundamental truths about distributed systems by pushing clusters to scales no one has attempted before.
Critique
The blog is open for critique. Drop me an email at [email protected] with subject: "blog critique" and I will add your points in this section. Also please point out any inaccuracies / mistakes in the essay/post.
Footnotes:
1. Hottest day recorded
2. Planck Epoch
3. Standard Model of Physics
4. Amazon announces support for ultra-scale AI/ML workloads
What's next?
The things I'm gonna be writing about in the coming few weeks:
- How does distributed training work at hyperscale?
- Evolution of GPUs over time (differences in NVIDIA, AMD and Inf GPUs)
#ai-infrastructure #container-networking #distributed-systems #endpointslices #hyperscale #infrastructure-engineering #iptables #kube-proxy #kubernetes #kubernetes-at-scale #memory-exhaustion #nftables