Modern AI workloads are fundamentally different from traditional enterprise applications. They are bound by GPU performance, sensitive to I/O latency, and often require direct, unmediated access to hardware for distributed training. For infrastructure designed to serve these workloads, the choice of primitive upon which all services are built is the foundational architectural decision of the platform. The right answer is to start with bare metal. This post explains why.
For AI practitioners, this isn't just a theoretical design choice. Whether it's cutting days from the training time of a large language model, reducing inference jitter for real-time applications, keeping recommendation engines responsive at scale, or compressing the cycle time for model retraining, the foundation directly determines the outcomes customers see.
Most cloud platforms are built on a primitive of virtualization, which places a hypervisor between an application and the physical hardware. This creates a performance tax in exchange for management flexibility.
The Ori platform, on the other hand, is architected on a more fundamental and powerful primitive: bare metal. By treating raw, physical hardware as the base unit of the system, Ori provides a foundation that is architecturally superior for the unique demands of AI at scale.
Deconstructing the Hypervisor Tax
The conventional cloud model begins with a hypervisor (like KVM or ESXi) layered directly on top of the hardware. This approach, while proven for general-purpose computing, introduces a persistent and multi-faceted "tax" on performance, especially for AI. In very well-tuned systems you are looking at a 5-10% overhead, but it often runs higher, into the 15-30%+ range, depending on workload and tuning.
- CPU Overhead: The hypervisor itself consumes CPU cycles to manage memory mapping, schedule vCPUs, and handle device emulation.
- I/O Path Virtualization: The most significant impact is on the I/O path. Network and storage traffic must traverse the hypervisor's virtual switch and storage controllers, adding latency and limiting throughput. This is particularly detrimental for technologies like RDMA (Remote Direct Memory Access), which are essential for ultra-low-latency InfiniBand and RoCE fabrics used in distributed training.
- Operational Complexity: This architecture creates a hard boundary between the physical and virtual worlds. Managing the underlying bare-metal servers often requires a separate infrastructure management stack (like OpenStack, NVIDIA BCM, or MaaS), which must then be integrated with a virtualization management layer - and then a container orchestration layer.
AI workloads demand the opposite: a direct, simple, and high-performance path to silicon - without virtualization's performance tax and operational complexity.
The Ori Architecture: A Bare-Metal-First Approach
The Ori platform reverses this model. We built a distributed “cloud OS” designed to provision, manage and monitor bare-metal hardware as a cohesive, global resource pool.
Automating the Physical Layer
Ori works at the layer that interfaces directly with the hardware, using protocols like IPMI and PXE to automate the entire node lifecycle without manual intervention.
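As a rough illustration, here is a minimal sketch of the kind of out-of-band control this layer performs, using the standard ipmitool CLI. The host and credentials are placeholders, and a real pipeline would also stage the target image on the PXE/DHCP server before triggering the reboot:

```python
import subprocess

def reprovision_node(bmc_host: str, user: str, password: str) -> None:
    """Force a node to PXE-boot on its next restart so it picks up a fresh image."""
    base = ["ipmitool", "-I", "lanplus", "-H", bmc_host, "-U", user, "-P", password]
    # Set the one-time boot device to PXE (network boot).
    subprocess.run(base + ["chassis", "bootdev", "pxe"], check=True)
    # Power-cycle the node; it reboots from the network and reimages itself.
    subprocess.run(base + ["chassis", "power", "cycle"], check=True)

reprovision_node("10.0.0.42", "admin", "<bmc-password>")  # placeholder values
```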
Dynamic Node Roles
Ori can flexibly assign roles to nodes within a cluster based on workload requirements.
- GPU nodes can operate as bare metal for high-performance supercomputing, or as bare-metal Kubernetes worker nodes for containerized or VM workloads.
- CPU nodes can serve as bare-metal resources for CPU-only pods within Kubernetes, or host virtual machines for control plane and system services.
The key capability is that node roles are not fixed: nodes can be rebuilt and reassigned as needed, allowing the cluster to adapt dynamically to changing demands.
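To make that concrete, here is a minimal sketch of a role reassignment using the Kubernetes Python client. The `ori.example/role` label key is purely illustrative (not Ori's actual schema), and a real rebuild would reimage the node, as in the IPMI sketch above, before relabeling it:

```python
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

def reassign_role(node_name: str, new_role: str) -> None:
    """Relabel a node so schedulers treat it as a different role."""
    # Illustrative label key; a real platform would use its own schema.
    patch = {"metadata": {"labels": {"ori.example/role": new_role}}}
    v1.patch_node(node_name, patch)

reassign_role("gpu-node-07", "k8s-worker")  # e.g. was "bare-metal-hpc"
```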
Automatic Node Management
The system continuously monitors the health of cluster hardware components. If a fault is detected on a node - whether in a GPU, HCA, transceiver, or other component - the affected node is automatically cordoned off and removed from the cluster, and where applicable its workloads are rescheduled by the control plane. A replacement node is then provisioned and reintegrated into the cluster within minutes, maintaining redundancy and ensuring high availability.
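The cordon-and-drain half of that reaction can be sketched with the Kubernetes Python client. This is a simplified illustration, not Ori's actual remediation code; fault detection and replacement provisioning are out of scope here:

```python
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

def cordon_and_drain(node_name: str) -> None:
    """Cordon a faulty node and evict its pods so they reschedule elsewhere."""
    # Mark the node unschedulable so no new pods land on it.
    v1.patch_node(node_name, {"spec": {"unschedulable": True}})
    # Evict every pod on the node (the Eviction API respects PodDisruptionBudgets).
    pods = v1.list_pod_for_all_namespaces(
        field_selector=f"spec.nodeName={node_name}")
    for pod in pods.items:
        # On older client versions this class is V1beta1Eviction.
        body = client.V1Eviction(metadata=client.V1ObjectMeta(
            name=pod.metadata.name, namespace=pod.metadata.namespace))
        v1.create_namespaced_pod_eviction(
            name=pod.metadata.name, namespace=pod.metadata.namespace, body=body)
```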
Layering Services Without Compromise
With a fully automated bare-metal layer in place, our higher-level services can be provisioned with maximum performance and flexibility.
- Virtual Machines: When a VM is requested, the control plane provisions it directly on a bare-metal node using KVM (see the sketch after this list). This avoids the complexity and overhead of a legacy monolithic hypervisor management stack, while still providing secure, isolated workloads.
- Kubernetes & Virtual Supercomputers: Kubernetes nodes and multi-node clusters are provisioned as bare-metal instances. This gives containers and HPC jobs direct access to GPUs, high-speed interconnects, and local NVMe storage. This eliminates most I/O and network virtualization performance bottlenecks.
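For the VM path, here is a minimal sketch of provisioning a KVM guest directly on a node with libvirt-python. The domain XML is heavily abbreviated; a real definition would also attach boot storage, passthrough GPUs, SR-IOV virtual functions, and NVMe devices:

```python
import libvirt

# Abbreviated domain definition with placeholder sizing.
DOMAIN_XML = """
<domain type='kvm'>
  <name>tenant-vm-01</name>
  <memory unit='GiB'>64</memory>
  <vcpu>16</vcpu>
  <os><type arch='x86_64'>hvm</type></os>
</domain>
"""

conn = libvirt.open("qemu:///system")  # connect to KVM on the local node
dom = conn.defineXML(DOMAIN_XML)       # register the VM with libvirt
dom.create()                           # boot it
```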
This entire stack, from the silicon to the user-facing API, is managed by a single, distributed control plane. This unifies provisioning, scheduling, and lifecycle management across VMs, containers, and HPC jobs, while maintaining elasticity and high availability.
The Technical Advantages of a Bare-Metal Foundation
For organizations running at AI scale, these architectural choices translate into real gains: faster training cycles, lower cost per experiment, and the ability to serve models reliably under heavy load. In practice, that means a research team can iterate more quickly, or an enterprise can deploy AI services with confidence that performance won't collapse at peak demand. Further, these gains apply whether you run on Ori's cloud infrastructure (powered by our platform) or on your own infrastructure (running our licensed platform software).
1. Hardware-Enforced Security and Multi-Tenancy
A bare-metal foundation allows for the strongest form of workload isolation. We build upon this foundation by combining multiple hardware-level technologies to create secure segments across the entire stack.
- Compute: NVIDIA MIG (Multi-Instance GPU) partitions GPUs into hardware-isolated instances for fractional sharing (see the sketch after this list).
- Network & I/O: SR-IOV (Single Root I/O Virtualization) allows a single physical NIC or NVMe drive to appear as multiple separate physical devices, giving VMs and containers direct, protected hardware access.
- Networking: We create high-performance virtual network fabrics using EVPN/VXLAN overlays and InfiniBand partitioning, all enforced at the infrastructure level.
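As an illustration of the compute side, here is a minimal sketch of MIG partitioning with the stock nvidia-smi CLI. Profile IDs vary by GPU model (9 corresponds to 3g.20gb on an A100-40GB, for example), and a GPU reset may be required after enabling MIG mode:

```python
import subprocess

def partition_gpu(gpu_index: int = 0) -> None:
    """Split one GPU into hardware-isolated MIG instances."""
    # Enable MIG mode on the target GPU.
    subprocess.run(["nvidia-smi", "-i", str(gpu_index), "-mig", "1"], check=True)
    # Create two GPU instances from profile 9, each with a default
    # compute instance (-C). Valid profile IDs depend on the GPU model.
    subprocess.run(
        ["nvidia-smi", "mig", "-i", str(gpu_index), "-cgi", "9,9", "-C"],
        check=True)
```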
This architecture lets us deliver flexible tenancy on a single cluster:
- Soft – shared GPU nodes for maximum efficiency.
- Strict – dedicated GPU nodes for guaranteed performance.
- Private – dedicated GPU nodes and a fully isolated control plane for the highest level of security.
With these options, customers can choose the right balance of cost, performance, and compliance, including the physical-level isolation required in regulated industries.
Beyond isolation, many customers require auditability and compliance. A bare-metal architecture makes it easier to align with standards such as ISO 27001, SOC 2, and NIST 800-53, and to enforce data residency mandates under GDPR or country-specific sovereignty rules. By applying security policies directly at the hardware and network layers, Ori reduces reliance on opaque virtualization stacks and can generate transparent, verifiable logs for regulators and auditors. This lowers compliance overhead while giving customers in finance, healthcare, and government stronger assurance that their AI infrastructure meets strict regulatory requirements.
2. Unlocking True Hardware Performance
By removing traditional virtualization layers from the data path, we enable workloads to directly leverage advanced hardware features. Distributed training frameworks like PyTorch FSDP or JAX can use RDMA over our InfiniBand and RoCE fabrics to communicate with near-zero latency, dramatically accelerating training times for large models. This is simply not achievable with the same efficiency through a traditional virtualized network stack.
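For a sense of how little glue this requires from the framework side, here is a minimal PyTorch FSDP setup. NCCL selects InfiniBand/RoCE (RDMA) transports automatically when the fabric is exposed directly to the process; the model and environment knobs below are stand-ins:

```python
import os
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

# Typical (optional) knob to pin NCCL to the InfiniBand HCAs; adapt to your fabric.
os.environ.setdefault("NCCL_IB_HCA", "mlx5")

# Assumes a torchrun-style launcher has set RANK/WORLD_SIZE/MASTER_ADDR/MASTER_PORT.
dist.init_process_group(backend="nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

model = torch.nn.Linear(4096, 4096).cuda()  # stand-in for a real model
model = FSDP(model)  # shards parameters and gradients across ranks over the fabric
```

Launched with, for example, `torchrun --nproc-per-node=8 train.py` on each node, the collective traffic between ranks rides the RDMA fabric directly.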
3. A Unified Control Plane from Silicon to Service
Perhaps the most significant advantage is operational simplicity. There is no need to stitch together multiple disparate systems. The Ori Global Control Plane provides a single API and a single pane of glass to manage a geographically distributed, multi-vendor fleet of GPUs, storage, and networking hardware.
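Purely as an illustration of the single-API idea (the endpoints and fields below are hypothetical, not Ori's actual API surface):

```python
import requests

# Hypothetical endpoints: one API surface for VMs, clusters, and nodes.
API = "https://control-plane.example.com/v1"
HEADERS = {"Authorization": "Bearer <token>"}  # placeholder credential

requests.post(f"{API}/vms", json={"region": "eu-west", "gpus": 8}, headers=HEADERS)
requests.post(f"{API}/clusters", json={"nodes": 16, "fabric": "infiniband"}, headers=HEADERS)
requests.get(f"{API}/nodes", params={"status": "faulty"}, headers=HEADERS)
```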
4. The Economics of Efficiency
A bare-metal foundation doesn't just deliver performance; it also drives better economics. Eliminating the hypervisor layer reduces idle GPU time and lifts overall utilization. In our internal benchmarks we consistently see 10–15% higher effective throughput for distributed training jobs compared to virtualized environments, where hypervisor scheduling and virtual I/O paths introduce waste.
That delta compounds quickly at scale: on a 1,000-GPU cluster it can mean the equivalent of adding more than 100 GPUs without spending another dollar on hardware.
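The back-of-the-envelope arithmetic behind that claim, using the figures quoted above:

```python
cluster_gpus = 1000
throughput_gain = 0.10  # lower bound of the 10-15% range above

# Extra effective capacity from removing the hypervisor overhead:
equivalent_gpus = cluster_gpus * throughput_gain
print(f"~{equivalent_gpus:.0f} extra GPUs' worth of throughput")  # 100 at 10%, 150 at 15%
```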
Fewer management layers also reduce operational overhead. In practice customers can consolidate toolchains and manage global fleets with leaner engineering teams, avoiding the cost of maintaining multiple stacks for bare metal, virtualization and container orchestration. This combination of higher utilization and reduced overhead directly lowers the total cost of ownership.
5. The Foundation for Change
AI hardware is evolving at a breathtaking speed — from NVIDIA’s Blackwell platform and its 5th-generation NVLink to upcoming CXL-enabled memory architectures and new classes of accelerators.
A bare-metal foundation ensures the platform can immediately expose these advances to workloads without waiting for hypervisor or virtualization layers to catch up. The same holds for alternative processors like AMD’s Instinct MI300 / MI300X / MI355X family, ARM-based GPUs or future domain-specific accelerators. By working directly at the hardware layer, Ori can integrate emerging technologies as soon as they ship, protecting customers from lock-in to a single vendor roadmap and ensuring long-term adaptability.
Summary
For organizations building private AI clouds or sovereign capabilities, this architectural choice is critical. A bare-metal-first approach is not just about raw performance; it's about providing a more resilient, secure, and operationally simple foundation for the future of AI.