When you’re training large language models or running distributed ML workloads, network performance can make or break your job. A poorly configured network can turn your multi-thousand-dollar GPU cluster into an expensive space heater. But configuring high-performance networking across different cloud providers and managed infrastructure platforms? That’s where things get really messy.

At SkyPilot, we’ve spent countless hours wrestling with the networking peculiarities of cloud providers offering VMs (Virtual Machines - individual server instances you can SSH into) and managed Kubernetes services (GKE, EKS, Nebius Managed Kubernetes, …). Each has its own way of doing things, its own gotchas, and its own performance characteristics. Whether you’re provisioning VMs or using managed Kubernetes clusters, the networking complexity is overwhelming.
Here’s what we learned along the way, and why you should care.
The Problem: Every Platform Does Networking Differently
Picture this: You have a distributed training job that works beautifully on cloud VMs with InfiniBand. You decide to move it to a managed Kubernetes service for better orchestration. Sounds simple. The performance shouldn’t change that much, right?
Wrong.
Each platform has its own high-performance networking story:
Cloud VMs:
- GCP: GPUDirect-TCPX for A3 High instances (NVIDIA H100), GPUDirect-TCPXO for A3 Mega instances (NVIDIA H100), GPUDirect-RDMA for A3 Ultra/A4 instances (NVIDIA H200/B200)
- Nebius: InfiniBand with MLX5 adapters and UCX optimizations
Managed Kubernetes Services:
- GKE: All of the above GCP networking, plus Kubernetes complexity, pod networking layers, and GPU device plugin configurations
- Nebius Managed Kubernetes: InfiniBand setup plus Kubernetes networking and container orchestration overhead
The Manual Setup Nightmare
Before we built our network tier abstraction, setting up a cluster manually was extremely tedious and error-prone. You need to dig through the cloud provider’s guide for your specific instance type (say, A4 instances for B200s), figure out which machine type best suits your workload, and then work through a complex setup process.
This setup process includes ensuring the NIC is properly connected to the GPU, installing GPU drivers, and configuring dozens of environment variables. You have to run each command by hand to get the network set up correctly, and if you miss a command or make a typo, you’ll need to debug or restart the entire setup process, which can take hours or days. Remember, you’re doing all of this debugging on expensive high-end GPU instances: a 2-node H200:8 cluster runs around $100 per hour, so just 3 days of debugging (8 hours per day) costs roughly $2,400 in compute charges alone.
After all that technical setup, you still need to run NCCL tests to verify the network performs as expected and to make sure every workload you launch picks up the right environment. Then you have to repeat this entire process for every other instance type or cloud provider you want to use.
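As a rough illustration, that verification step usually means running the standard nccl-tests all-reduce benchmark across the nodes. The sketch below assumes an MPI-based launch and a pre-built nccl-tests binary; hostnames, paths, and flags are illustrative:

```bash
# Build the standard NCCL benchmark suite (assumes CUDA and MPI are installed).
git clone https://github.com/NVIDIA/nccl-tests.git
cd nccl-tests && make MPI=1

# Run an all-reduce bandwidth test across 2 nodes x 8 GPUs.
# Hostnames are examples; the real run must also export the provider-specific
# NCCL environment variables shown later in this post.
mpirun -np 16 -H node1:8,node2:8 \
  -x LD_LIBRARY_PATH -x NCCL_DEBUG=INFO \
  ./build/all_reduce_perf -b 8 -e 8G -f 2 -g 1
```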
For managed Kubernetes deployments, you add yet another layer of complexity with pod networking, GPU device plugins, and container orchestration overhead.
Here’s what the manual configuration actually looks like across different platforms:
For GCP H100 A3 High instances (GPUDirect-TCPX): the full setup spans 25+ commands and environment variables.
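We won’t reproduce the whole thing here, but to give a flavor, here is a small, illustrative subset of the configuration involved, based on GCP’s GPUDirect-TCPX guide. Interface names and values vary by machine image, and the plugin install, CPU bindings, and receive-datapath-manager steps are omitted:

```bash
# Illustrative subset only; the full guide involves 25+ commands, including
# installing the TCPX NCCL plugin and running a receive-datapath-manager service.
export NCCL_SOCKET_IFNAME=eth0                                # control-plane NIC
export NCCL_GPUDIRECTTCPX_CTRL_DEV=eth0                       # TCPX control device
export NCCL_GPUDIRECTTCPX_SOCKET_IFNAME=eth1,eth2,eth3,eth4   # data-plane NICs
export NCCL_CROSS_NIC=0
export NCCL_NET_GDR_LEVEL=PIX
export NCCL_DYNAMIC_CHUNK_SIZE=524288
# ...plus per-NIC TX/RX CPU bindings, flow-steering settings, and
# LD_LIBRARY_PATH for the TCPX plugin.
```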
For GCP H100 A3 Mega instances (GPUDirect-TCPXO):
Similar complexity, but with a somewhat different setup procedure: different environment variables and different network interface configurations, including another dozen-plus variables specific to TCPXO.
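Again illustrative rather than exhaustive, the TCPXO path swaps in its own "FasTrak" plugin variable family; the values below are placeholders, and GCP’s TCPXO guide has the authoritative list:

```bash
# Illustrative subset; TCPXO uses the FasTrak NCCL plugin and its own env vars.
export NCCL_FASTRAK_CTRL_DEV=eth0
export NCCL_FASTRAK_IFNAME=eth1,eth2,eth3,eth4,eth5,eth6,eth7,eth8
export NCCL_FASTRAK_USE_SNAP=1
export NCCL_FASTRAK_NUM_FLOWS=2
# ...plus LD_LIBRARY_PATH for the TCPXO plugin, buffer/timeout tuning, and more.
```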
For GCP B200/H200 A4/A3 Ultra instances (GPUDirect-RDMA):
Yet another approach for even higher-end instances: follow the GPUDirect-RDMA guide, which requires setting up RDMA network parameters and managing specific NCCL environment variables for optimal performance.
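Here the configuration shifts toward NCCL’s InfiniBand/RoCE variables. The sketch below is a hedged, incomplete example; the actual HCA names, GID index, and interface names depend on the machine image:

```bash
# Illustrative subset; HCA names, GID index, and interface names vary by image.
export NCCL_SOCKET_IFNAME=enp0s12        # host networking interface (example name)
export NCCL_IB_HCA=mlx5_0,mlx5_1,mlx5_2,mlx5_3,mlx5_4,mlx5_5,mlx5_6,mlx5_7
export NCCL_IB_GID_INDEX=3               # RoCE v2 GID (example)
export NCCL_IB_QPS_PER_CONNECTION=4
# ...plus the RDMA driver/plugin install and per-NIC routing configuration.
```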
And this is just for one cloud provider.
Imagine doing this for every cloud provider, every instance type, and keeping it all up to date as providers may change their networking stacks.
But with SkyPilot, all of this complexity is abstracted away with a simple network_tier configuration.
Performance Difference
NCCL Performance Tests
We conducted NCCL performance tests on a GCP cluster with 2x a3-highgpu-8g (2x H100:8) instances to compare GPUDirect-TCPX vs standard networking:

Key insight: Performance benefits scale with message size, achieving up to 3.8x speedup for large messages that are common in distributed ML training.
SGLang Serving Performance
LLM serving performance (DeepSeek-R1-Distill-Llama-8B) on the same cluster configuration:

For LLM serving workloads, proper networking configuration delivers measurable performance improvements - 11.3% higher throughput and 8% lower latency.
How SkyPilot Simplifies This
Instead of manually configuring networking for each cloud provider and infrastructure type, SkyPilot abstracts this complexity for both VMs and managed Kubernetes:
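For example, a minimal launch might look like the following sketch; the accelerator type, node count, cluster name, and run command are illustrative:

```bash
# Minimal sketch: a SkyPilot task YAML with network_tier enabled, then a launch.
# Accelerator type, node count, cluster name, and run command are examples.
cat > task.yaml <<'EOF'
resources:
  accelerators: H100:8
  network_tier: best   # ask SkyPilot to configure the platform's fastest networking
num_nodes: 2
run: |
  python train.py      # your distributed training / NCCL workload
EOF

sky launch -c train-cluster task.yaml
```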
Adding network_tier: best is the one-liner change that does the setup for you. That’s it!
For more details, check out our Nebius InfiniBand documentation and see a complete example configuration.
Future Work
We’re continuing to improve the network tier system:
- Additional cloud providers: Adding support for other clouds with custom networking stacks (AWS EFA, AWS HyperPod, Azure InfiniBand, Oracle Cloud Infrastructure RoCE, Lambda Labs InfiniBand, CoreWeave, RunPod)
P.S. If You Need High-Performance Distributed Training, Try SkyPilot
We built this network tier system because we got tired of seeing users struggle with cloud networking configuration across both VMs and managed Kubernetes services. With SkyPilot, you can focus on your ML code instead of becoming a cloud configuration expert or Kubernetes networking specialist.
SkyPilot automatically finds the best instances across clouds and infrastructure types, configures high-performance networking (whether it’s VMs or managed Kubernetes), and scales your jobs efficiently. Whether you’re training LLMs on GCP VMs, running distributed inference on GKE, or using Nebius managed Kubernetes for large-scale data processing, we handle the infrastructure complexity so you can focus on what matters.
Give it a try and let us know how it goes in our Slack community!
