Not too long ago, Jensen Huang remarked that “AI is having its iPhone moment.” That was a huge statement; we can all agree that the iPhone was a true revolution and a leap forward for mankind.
In my humble opinion, while Jensen’s claim was visionary and built on a solid foundation — CUDA and the NVIDIA ecosystem — it didn’t address how to scale inference usage to billions of people. This challenge has only intensified with the rise of Agentic AI. A few weeks ago, Dr. Ian Cutress, on his YouTube channel TechTechPotato, made a very strong point that today’s systems won’t scale.
Generative AI is getting more powerful by the week, but almost no one’s asking the real question: How do we actually afford this, not just in dollars but in infrastructure, energy, talent, and time? Because the deeper this technology goes into education, healthcare, and research, the more demanding it will become. Simply put, today’s systems won’t scale, either in what they require or in what they cost. And this isn’t some abstract future; the early signs are already here.
Over the course of ten years, Google Cloud has developed an incredibly strong recipe for enabling access to AI services for everyone on the planet. It is an effort that, to some extent, has been validated by OpenAI’s recent announcement that it will use Google Cloud GPU resources. And, oh boy, this was a massive undertaking involving thousands of engineers, hundreds of projects, and about a decade in the making.
In this article, I want to discuss how Google Cloud is democratizing access to AI services for billions of agents — and, ultimately, users.
You don’t build a cathedral with a single stone. Scaling AI to a global audience is the same — it’s not about one magic bullet, but about having all the right materials and architectural plans. It’s about meticulously assembling a set of technologies where each piece solves a critical part of the puzzle. At Google Cloud, this has been a decade-long construction project, laying a foundation of global networking and compute, and now, adding the specialized tools designed for the unique demands of AI. Let’s look at the core building blocks of this modern ‘Cathedral of Compute’:
- GKE Inference Gateway: The entry point for GenAI workloads, providing smart routing, security, and load balancing tailored for LLMs and Agentic AI.
- Custom Metrics for Application Load Balancers: Move the backend service selection beyond CPU and memory — route traffic based on what actually matters for inference, like queue depth, latency, and model-specific metrics.
- Hyperscale Networking: A global backbone that ensures requests are always routed to the best available resources, no matter where users are.
- GKE Custom Compute Classes: Fine-grained control over accelerator selection and cost, so you gain flexibility without breaking the bank.
- World-Class Observability: Out-of-the-box dashboards and metrics for both GPUs and TPUs, so you can spot issues and optimize performance in real time.
- Cloud TPUs: Google’s custom silicon, designed from the ground up for large-scale ML workloads, with unique interconnects for massive parallelism.
- vLLM and llm-d: Open source inference engines that tie everything together, supporting both GPU and TPU, and enabling new architectures for distributed, high-throughput serving.
Each of these building blocks is impressive on its own, but the real breakthrough comes when you put them together. That’s when you get an inference platform that’s not just fast, but also reliable, affordable, and ready for whatever the next wave of AI brings.
It all starts here: GKE Inference Gateway is an extension of the GKE Gateway that provides optimized routing and load balancing for serving GenAI workloads. It is based on the open-source Gateway API Inference Extension project.
Traditional load balancing does not optimize serving performance and GPU/TPU utilization.
GKE Inference Gateway routes client requests from the initial request to a model instance using the following inference extensions:
- Body-based routing extension: Extracts the model identifier from the client request body and sends it to GKE Inference Gateway. GKE Inference Gateway then uses this identifier to route the request based on rules defined in the Gateway API HTTPRoute object.
- Security extension: Uses Model Armor (or supported 3rd-party) to enforce model-specific security policies that include content filtering, threat detection, sanitization, and logging.
- Endpoint picker extension: Monitors key metrics from model servers within the InferencePool. It tracks the KV cache utilization, queue length of pending requests, and active LoRA adapters on each model server. It then routes the request to the optimal model replica based on these metrics to minimize latency and maximize throughput for AI inference.
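To make the endpoint picker concrete, here is a minimal, hypothetical sketch of what metrics-aware replica selection looks like. This is not the extension’s actual code: the replica names, metric fields, and weights are illustrative assumptions. The intuition, though, is the same: prefer a replica that already has the requested LoRA adapter loaded, a cool KV cache, and a short queue.

```python
from dataclasses import dataclass, field

@dataclass
class ReplicaMetrics:
    name: str
    kv_cache_utilization: float       # 0.0 .. 1.0, scraped from the model server
    queue_length: int                 # pending requests waiting for a slot
    active_lora_adapters: set[str] = field(default_factory=set)

def pick_endpoint(replicas: list[ReplicaMetrics], lora_adapter: str | None) -> ReplicaMetrics:
    """Pick the replica that should serve the next request (lower score wins)."""
    def score(r: ReplicaMetrics) -> float:
        s = 0.7 * r.kv_cache_utilization + 0.3 * min(r.queue_length / 10.0, 1.0)
        if lora_adapter and lora_adapter in r.active_lora_adapters:
            s -= 0.25                 # adapter already resident: no swap cost
        return s
    return min(replicas, key=score)

replicas = [
    ReplicaMetrics("vllm-0", 0.82, 7, {"es-giftwrap"}),
    ReplicaMetrics("vllm-1", 0.35, 1),
]
print(pick_endpoint(replicas, "es-giftwrap").name)  # vllm-1: the adapter bonus does not offset its load
```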
The benchmark below highlights how the Inference Gateway achieves superior performance consistency. The chart on the left shows that by ensuring uniform KV cache utilization, the Gateway effectively prevents requests from queuing up.
The powerful result of this is shown on the right: the TTFT (Time-to-First-Token) becomes exceptionally consistent, eliminating the latency spikes common with traditional load balancing. This creates a level of predictability analogous to a Real-Time Operating System. In an RTOS, while absolute speed is a factor, the paramount goal is ensuring that latency, even if not zero, remains consistent and deterministic throughout execution. This is precisely the stable, high-quality user experience the Inference Gateway delivers.
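TTFT is also easy to measure yourself against any streaming endpoint sitting behind the gateway. A tiny sketch follows; the request function is a placeholder for whatever streaming client you actually use.

```python
import time
from typing import Callable, Iterable

def time_to_first_token(send_request: Callable[[], Iterable[str]]) -> float:
    """Seconds from sending the request until the first streamed token arrives."""
    start = time.perf_counter()
    stream = send_request()          # e.g. an OpenAI-compatible streaming call to your gateway
    for _first_token in stream:      # the first yielded chunk marks the TTFT
        return time.perf_counter() - start
    raise RuntimeError("the stream produced no tokens")

# Hypothetical usage with any streaming client:
# ttft = time_to_first_token(lambda: client.stream("Tell me a story"))
```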
Fun fact: GKE just celebrated its 10th anniversary 🎉 🍾.
Custom metrics for Application Load Balancers
Custom metrics for Application Load Balancers are the underlying piece of technology enabling GKE Inference Gateway. Custom metrics allow users to define their own application-specific criteria for traffic distribution, instead of relying on standard signals like CPU usage or request rate. Traffic is routed on the metrics that actually matter for inference — like queue depth or processing latency — to maximize compute capacity and enable smarter, application-aware autoscaling.
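As a rough sketch of what “application-aware” means on the backend side, the snippet below shows a model server attaching a load report to its responses so the load balancer can weigh it by queue depth instead of CPU. The handler, metric name, and header format are assumptions for illustration (an ORCA-style report); check the Cloud Load Balancing custom metrics documentation for the exact contract.

```python
from flask import Flask, Response

app = Flask(__name__)

def current_queue_depth() -> int:
    # Placeholder: in a real model server this would come from the engine,
    # e.g. the number of requests waiting for a decode slot.
    return 3

@app.post("/v1/completions")
def completions() -> Response:
    resp = Response('{"text": "..."}', mimetype="application/json")
    # Report an application-level load signal alongside the response so the
    # load balancer can route on queue depth rather than CPU utilization.
    # Header name and text format are illustrative assumptions.
    resp.headers["endpoint-load-metrics"] = (
        f"TEXT named_metrics.queue_depth={min(current_queue_depth() / 10, 1.0):.2f}"
    )
    return resp
```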
Low-Rank Adaptation and Key-Value Cache utilization
The GKE Inference Gateway endpoint picker extension is specifically designed to perform LoRA and KV cache utilization routing. But why does it matter?
Low-Rank Adaptation
Let’s start with LoRA using an analogy: your main warehouse is staffed by a world-class, general-purpose expert (the Base Model). This expert can package anything: books, electronics, clothing, etc. They are incredibly knowledgeable but very “general.” What happens when a customer has a highly specific, custom request? For example: “Gift-wrap this book in blue paper with a handwritten note in Spanish.”
- The Old Way (Full Fine-Tuning): You would have to hire and train a brand new, full-time expert who only does Spanish gift-wrapping. This is expensive, slow, and you need a separate expert for every single custom task (one for Italian, one for poetry, etc.)
- The LoRA Way: You keep your single, world-class expert. But next to their main workstation, you place a small “Finishing Touches” Kiosk (the LoRA Adapter). This kiosk contains a small set of instructions and a few special tools — a roll of blue paper, a specific pen, and a card with a few key Spanish phrases.
When a custom order comes in, the main expert does 99% of the work, then, they simply take the package to the LoRA kiosk, apply the small, specific adjustments, and send it out. This is LoRA: lightweight, fast to “train”, and swappable.
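For the more technically curious, here is a tiny numpy sketch of the same idea with made-up dimensions: the frozen base weight W does 99% of the work, while two small matrices A and B act as the kiosk, adding a low-rank correction that is cheap to train and swap.

```python
import numpy as np

d, r = 4096, 8                 # hidden size vs. LoRA rank
alpha = 16                     # LoRA scaling factor

rng = np.random.default_rng(0)
W = rng.standard_normal((d, d))          # frozen base weight (the world-class expert)
A = rng.standard_normal((r, d)) * 0.01   # trainable, tiny
B = np.zeros((d, r))                     # trainable, zero-initialized so the adapter starts as a no-op

x = rng.standard_normal(d)

y = W @ x + (alpha / r) * (B @ (A @ x))  # y = Wx + (alpha/r) * B A x

# The adapter adds ~65K trainable parameters next to a frozen ~16.8M-parameter matrix (<0.4%).
print(W.size, A.size + B.size)
```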
Key-Value Cache
The KV cache in LLMs is a brilliant optimization technique and I would argue a key pillar in making the transformer architecture viable. It speeds up text generation by storing and reusing previously computed key and value tensors during the decoding process. Instead of recalculating these tensors for each new token, the model retrieves them from the cache, significantly reducing computation time.
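A stripped-down sketch of the mechanism, with the usual single-head, no-batching simplifications: on each decoding step, only the keys and values for the newest token are computed; everything earlier comes straight from the cache.

```python
import numpy as np

d = 64                                  # head dimension (toy value)
rng = np.random.default_rng(0)
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))

k_cache, v_cache = [], []               # grows by one entry per generated token

def decode_step(x_new: np.ndarray) -> np.ndarray:
    """Attention for the newest token only, reusing cached K/V for everything before it."""
    q = Wq @ x_new
    k_cache.append(Wk @ x_new)          # K and V for this token are computed once...
    v_cache.append(Wv @ x_new)          # ...and never recomputed on later steps
    K, V = np.stack(k_cache), np.stack(v_cache)   # (seq_len, d)
    scores = K @ q / np.sqrt(d)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V                  # context vector for the new token

for _ in range(5):                      # each step only projects the newest token,
    out = decode_step(rng.standard_normal(d))     # instead of re-projecting the whole prefix
```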
We therefore get uniform KV cache utilization across the inference engines, ensuring the model servers do not get saturated. The queue of incoming requests is minimized, directly leading to lower TTFT (Time-to-First-Token) latency.
Serving AI to billions requires a global nervous system. Google Cloud provides this with two foundational pillars. First, its anycast network offers a single, global IP address. This means a user’s request from Tokyo and another from Berlin hit the same IP but are instantly routed to the closest network edge, slashing latency.
Second, this traffic is then intelligently directed across a vast footprint of 42 cloud regions. With GPU and TPU resources distributed globally, GKE Inference Gateway can route inference requests not just to the nearest region, but to the one with the most optimal capacity. This combination of a universal front door (anycast) and a distributed backend (regions) is the core architecture that enables massive, low-latency AI inference at a planetary scale.
To run inference at scale, you must master the art of acquiring capacity. Google’s answer is GKE Custom Compute Classes — CCC for friends. Think of it as a smart, prioritized shopping list for your AI infrastructure. As a Kubernetes custom resource, a CCC lets you define a fallback hierarchy of different accelerators (GPUs, TPUs) and pricing models (Reservations, DWS Flex, On-demand, and even Spot).
In the example above, we can see a CCC targeting A3 machine families (A3 features NVIDIA H100 GPUs). It starts by consuming A3 through a reservation, then moves to DWS (Dynamic Workload Scheduler) Flex, and ultimately, if nothing else is available, it will spin up some A3 in Spot mode.
This gives you nuanced control over your risk profile, prioritizing acquired reservations for the business-as-usual workloads and opportunistic Spot VMs for dire times. The GKE autoscaler then acts as your expert capacity hunter, following these declarative rules to find the best available hardware. It’s a resilient, cost-aware foundation for ensuring your inference workloads always have a place to run, optimally and efficiently.
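The real thing is declared as a ComputeClass custom resource in your cluster, and the GKE autoscaler does the hunting for you. Purely as a mental model, and with hypothetical names, the fallback walk looks roughly like this:

```python
# Conceptual sketch only: in reality this is a GKE ComputeClass custom resource,
# and the GKE cluster autoscaler performs this walk for you. The rule fields and
# the provision() stub below are hypothetical.
PRIORITIES = [
    {"machine_family": "a3", "provisioning": "reservation", "reservation": "my-h100-reservation"},
    {"machine_family": "a3", "provisioning": "dws-flex"},
    {"machine_family": "a3", "provisioning": "spot"},
]

def provision(rule: dict) -> bool:
    """Stand-in for asking the cloud for capacity; returns True if nodes came up."""
    return False  # pretend nothing is available so the walk visits every rule

def acquire_capacity() -> dict | None:
    for rule in PRIORITIES:   # walk the shopping list from safest to most opportunistic
        if provision(rule):
            return rule
    return None               # nothing available anywhere: surface this to autoscaling/alerting

print(acquire_capacity())     # None with the stub above; the first satisfiable rule otherwise
```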
From my first day at Google Cloud, I fell in love with our hands-off approach for observability. You get pretty much everything out-of-the-box, it works like a charm, and it’s mostly free or at a very low cost.
NVIDIA DCGM and inference engine metrics are immediately available.
I already covered the NVIDIA DCGM and inference engine dashboards in my previous work “Optimize Gemma 3 Inference: vLLM on GKE 🏎️💨”, in the GKE Goodies chapter.
And more is coming, especially after the announcement of the TPU Monitoring Library:
$ tpu-info
TPU Chips
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━┓
┃ Chip         ┃ Type         ┃ Devices ┃ PID    ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━┩
│ /dev/vfio/0  │ TPU v6e chip │ 1       │ 1052   │
│ /dev/vfio/1  │ TPU v6e chip │ 1       │ 1052   │
│ /dev/vfio/2  │ TPU v6e chip │ 1       │ 1052   │
│ /dev/vfio/3  │ TPU v6e chip │ 1       │ 1052   │
└──────────────┴──────────────┴─────────┴────────┘

TPU Runtime Utilization
┏━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Device ┃ HBM usage                ┃ Duty cycle ┃
┡━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ 8      │ 18.45 GiB / 31.25 GiB    │ 100.00%    │
│ 9      │ 10.40 GiB / 31.25 GiB    │ 100.00%    │
│ 12     │ 10.40 GiB / 31.25 GiB    │ 100.00%    │
│ 13     │ 10.40 GiB / 31.25 GiB    │ 100.00%    │
└────────┴──────────────────────────┴────────────┘

TensorCore Utilization
┏━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Chip ID ┃ TensorCore Utilization ┃
┡━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━┩
│ 0       │ 13.60%                 │
│ 1       │ 14.81%                 │
│ 2       │ 14.36%                 │
│ 3       │ 13.60%                 │
└─────────┴────────────────────────┘

TPU Buffer Transfer Latency
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┓
┃ Buffer Size  ┃ P50          ┃ P90          ┃ P95          ┃ P999         ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━┩
│ 8MB+         │ 108978.82 us │ 164849.81 us │ 177366.42 us │ 212419.07 us │
│ 4MB+         │ 21739.38 us  │ 38126.84 us  │ 42110.12 us  │ 55474.21 us  │
└──────────────┴──────────────┴──────────────┴──────────────┴──────────────┘
Google Cloud’s observability stack is SoTA (State-of-the-Art). Fun fact: in Japanese, the kanji 壮太 are read Sōta, which roughly translates to “big and strong.”
While this article isn’t just about TPUs, it’s worth noting that TPUs have been Google’s answer to large-scale machine learning long before AI was the coolest kid on the block.
When first approaching the TPU, I couldn’t help but find so many similarities with the traditional “Thinking 10x” Google approach. The TPU project is probably Google’s secret weapon in the AI race — something that has been in development and continuously refined for over 10 years now.
Apple is training its foundational models on thousands of TPU v4 and v5p chips. While a single TPU can’t match the performance of even older NVIDIA GPUs (see TheNextPlatform: Stacking Up Google’s “Ironwood” TPU Pod To Other AI Supercomputers), that comparison misses the point. Google isn’t playing the same single-chip optimization game as its rivals. Instead, its true power — and strategic endgame — lies in the groundbreaking interconnect bandwidth between its chips, powered by a deeply tuned system of Inter-chip Interconnect (ICI) and optical circuit switches (OCS).
- Within the Cube (aka TPU Rack), ICI links the TPU chips directly.
- Between Cubes, the ICI links are connected through an OCS.
The ICI enables each TPU v7 chip to be interconnected with four links, each running at roughly 1.3 Tbps, for a total of about 5.4 Tbps of bidirectional bandwidth. Again, that’s per TPU chip 🤯.
The role that the interconnect plays was also evident in the DeepSeek-V3 paper, where the DeepSeek team wrote custom PTX instructions to keep the CUDA cores fed and work around interconnect inefficiencies:
Specifically, we employ customized PTX (Parallel Thread Execution) instructions and auto-tune the communication chunk size, which significantly reduces the use of the L2 cache and the interference to other SMs.
And in the most recent Gemini 2.5 paper, the importance of the interconnect is highlighted again, this time with efficiency numbers:
Overall during the run, 93.4% of the time was spent performing TPU computations; the remainder was approximately spent half in elastic reconfigurations, and half in rare tail cases where elasticity failed. Around 4.5% of the computed steps were replays or rollbacks for model debugging interventions.
Ultimately, in the quest to serve billions of users and agents, having the Google Cloud TPU at your disposal is a great advantage.
For an informative — and visually appealing — Cloud TPU overview, head over to the CNBC YouTube channel, where Amin Vahdat makes a great point about TPUs and the rationale behind building them, now over ten years ago:
A number of leads at the company asked the question, what would happen if Google users wanted to interact with Google via voice for just 30s a day? And how much compute power would we need to support our users? We realized that we could build custom hardware, not general purpose hardware, but custom hardware, Tensor Processing Units, in this case to support that much, much more efficiently. In fact, a factor of 100 more efficiently than it would have been otherwise.
To get started with TPUs, check out the official Google Cloud documentation. Trillium (TPU v6e) is available to all users, and there is a great document to experiment with serving Meta’s Llama-3.1-70B on vLLM.
vLLM does not need much of an introduction. It’s the open-source inference engine that not only delivers top performance (check out my Medium work where I generated over 22,000 tokens/s) but is also growing very fast, with clear, open roadmaps and, most critically, without being yet another vendor lock-in product. It allows you to seamlessly run the same LLM across a wide variety of accelerators: CUDA, ROCm for AMD, and XLA for Google Cloud TPU. At NEXT ‘25, we demoed a GKE cluster running vLLM where the same model was served by both TPU Trillium (v6e) and a mix of Spot and on-demand NVIDIA L4 GPUs.
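If you have never touched vLLM, the API surface is refreshingly small. A minimal offline-inference example looks like the following; the model name is only an illustration, so swap in whatever you are serving (on TPU, the same code runs through the XLA backend):

```python
from vllm import LLM, SamplingParams

# The model name is illustrative; use whatever you are actually serving.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")

params = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["Explain the KV cache in one paragraph."], params)

for out in outputs:
    print(out.outputs[0].text)
```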
For a practical, hands-on guide on how to deploy all these pieces together, head over to the official Google Cloud Blog, where we show you how it can be done.
In my now more than 15 years in tech, one mantra has been constant throughout my career: component decoupling is almost always the best path forward. While the Twelve-Factor App methodology is by now a bit of ancient history, it served as the foundation for most modern applications built on, or ported to, the cloud. Its principles come back to decoupling again and again:
- Decoupling the App from its Operating Environment
  Factor II: Dependencies — Explicitly declare and isolate dependencies.
- Decoupling Code from Configuration
  Factor III: Config — Store config in the environment.
- Decoupling the App from Backing Services
  Factor IV: Backing Services — Treat backing services as attached resources.
- Decoupling State from Processes
  Factor VI: Processes — Execute the app as one or more stateless processes.
- Decoupling Services from Each Other
  Factor VII: Port Binding — Export services via port binding.
- Decoupling the App from Log Management
  Factor XI: Logs — Treat logs as event streams.
llm-d takes the decoupling concept and applies it to the Inference Engines, leveraging vLLM’s support for disaggregated serving to run prefill and decode on independent instances:
llm-d builds upon vLLM’s highly efficient inference engine, adding Google’s proven technology and extensive experience in securely and cost-effectively serving AI at billion-user scale. llm-d includes three major innovations:
First, instead of traditional round-robin load balancing, llm-d includes a vLLM-aware inference scheduler, which enables routing requests to instances with prefix-cache hits and low load, achieving latency SLOs with fewer hardware resources.
Second, to serve longer requests with higher throughput and lower latency, llm-d supports disaggregated serving, which handles the prefill and decode stages of LLM inference with independent instances.
Third, llm-d introduces a multi-tier KV cache for intermediate values (prefixes) to improve response time across different storage tiers and reduce storage costs. llm-d works across frameworks (PyTorch today, JAX later this year), and both GPU and TPU accelerators, to provide choice and flexibility.
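To give a feel for that first innovation, here is a toy sketch of prefix-cache-aware and load-aware scheduling. It is not llm-d’s actual scheduler, and the names and weights are made up, but it captures the idea: steer a request to the replica that already holds the longest matching prefix in its KV cache, unless that replica is saturated.

```python
from dataclasses import dataclass, field

@dataclass
class Replica:
    name: str
    cached_prefixes: list[str] = field(default_factory=list)  # prompts (or prefix hashes) resident in KV cache
    load: float = 0.0                                         # 0.0 idle .. 1.0 saturated

def longest_cached_prefix(replica: Replica, prompt: str) -> int:
    return max((len(p) for p in replica.cached_prefixes if prompt.startswith(p)), default=0)

def schedule(replicas: list[Replica], prompt: str) -> Replica:
    """Prefer prefix-cache hits, but never send work to a saturated replica."""
    def score(r: Replica) -> float:
        affinity = longest_cached_prefix(r, prompt) / max(len(prompt), 1)   # 0..1 cache affinity
        return 0.6 * affinity - 0.4 * r.load                                # higher is better
    candidates = [r for r in replicas if r.load < 0.9] or replicas
    return max(candidates, key=score)

replicas = [
    Replica("decode-0", cached_prefixes=["You are a helpful assistant."], load=0.4),
    Replica("decode-1", load=0.1),
]
print(schedule(replicas, "You are a helpful assistant. Summarize this document.").name)  # decode-0
```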
Yet another groundbreaking Google-Red Hat collab 😭 (remember how Kubernetes changed the tech world forever?). While vLLM alone is more than capable for small and medium-scale deployments, llm-d is gearing up to be the reference for billion-user scale.
If you’re planning to run LLMs at scale today, vLLM, the Ray framework, and the Kubernetes LeaderWorkerSet CRD are the right place to start. For those planning for even larger deployments, keep an eye on llm-d as it matures — early experimentation and community involvement can help shape its direction.
The ‘iPhone moment’ for AI has arrived, but a revolutionary device is useless without the global network to support it. The real, unspoken challenge of this new era isn’t just creating smarter models, but delivering them to billions of people without the system collapsing under its own weight. This is an incredible AI infrastructure challenge, not just an algorithm problem.
As we’ve seen, Google’s answer wasn’t born overnight. It’s the result of a decade of work, combining GKE’s orchestration with intelligent networking, custom silicon, and a new generation of open-source tools like vLLM and llm-d. This is what it looks like to treat AI infrastructure not as an afterthought, but as the foundational product. The cathedral is built, the doors are open, and it’s ready for planetary scale.
If you’re building for the future, now is the time to get hands-on. Experiment with GKE Inference Gateway, try out vLLM and llm-d, and see how these building blocks can help you deliver AI to everyone, everywhere. If you need a bit of inspiration, check out my previous work around optimizing inference on Gemma 3 and GKE Autopilot.
I highly recommend checking out the Google Cloud Next ’25 GKE Gen AI Inference session:
Also, check out Connect, secure, and simplify with new Cross-Cloud Network innovations, where we dive deeper into the network requirements for the AI era: