Red Hat's AI Platform Now Has an AI Inference Server

BOSTON — So you want to run a generative AI (GenAI) model. Or make that models. Or, OK, let’s admit it, you want to run multiple models on the platforms you want, when you want them. That’s not easy. To address this need, at Red Hat Summit 2025, Red Hat rolled out the Red Hat AI Inference Server (RHAI).

RHAI is a high-performance, open source platform that serves as the execution engine for AI workloads. As the name suggests, RHAI is all about inference: the stage where pre-trained models generate predictions or responses based on new data. Inference is AI’s critical execution step, the point where trained models turn data into user-facing interactions.

This platform is built on the widely adopted, open source vLLM project. vLLM is a high-throughput, memory-efficient inference and serving engine for large language models (LLMs). The difference between vLLM and older inference engines is that the earlier engines are bottlenecked by memory I/O. vLLM divides memory, wherever it may be, into manageable chunks and only accesses what’s needed, when it’s needed. If that sounds a lot like how computers handle virtual memory and paging, you’re right, it does, and it works just as well for LLMs as it does for your PC.
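To see what building on vLLM looks like in practice, here is a minimal sketch of vLLM’s offline Python API. The model name, prompt, and sampling settings are placeholder choices for illustration, not anything RHAI ships by default:

```python
# Minimal vLLM offline inference sketch (placeholder model and settings).
from vllm import LLM, SamplingParams

# Load a small model; vLLM manages the KV cache in paged blocks behind the scenes.
llm = LLM(model="facebook/opt-125m")

sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

prompts = ["What does an AI inference server do?"]
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(output.prompt)
    print(output.outputs[0].text)
```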

Neural Magic Technology

To vLLM, Red Hat added technologies from its Neural Magic acquisition. Neural Magic brings to the table software and algorithms that accelerate GenAI inference workloads. The result is an AI inference platform that is fast and cost-efficient enough to deploy scalable AI inference engines across any cloud.

RHAI’s key features include:

  • Support for Any GenAI Model: The server is model-agnostic, supporting leading open source and third-party validated models such as Llama, Gemma, DeepSeek, Mistral, and Phi, among others.
  • Hardware and Cloud Flexibility: Users can run AI inference on any AI accelerator (GPUs, CPUs, specialized chips) and in any environment — on-premises, public cloud, or hybrid cloud — including seamless integration with Red Hat OpenShift AI and Red Hat Enterprise Linux AI (RHEL AI).
  • Performance and Efficiency: Leveraging vLLM’s high-throughput inference engine, the server supports features like large input contexts, multi-GPU acceleration, and continuous batching, delivering, Red Hat claims, two to four times more token production with optimized models.
  • Model Compression and Optimization: Built-in tools reduce the size of foundational and fine-tuned models, minimizing compute requirements while maintaining or even improving accuracy (a conceptual sketch of the idea follows this list).
  • Enterprise-Grade Support: Red Hat provides hardened, supported distributions and third-party support, enabling deployment even on non-Red Hat Linux and Kubernetes platforms.
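The compression point is easier to picture with a toy example. The sketch below is not Red Hat’s compression tooling; it is a generic, plain-PyTorch illustration of how int8 weight quantization trades a little precision for roughly a 4x cut in memory:

```python
# Conceptual illustration of weight compression via int8 quantization.
# This is NOT Red Hat's tooling; it only shows the basic idea of shrinking
# model weights while approximately preserving their values.
import torch

def quantize_int8(weight: torch.Tensor):
    """Symmetric per-tensor int8 quantization: returns int8 weights plus a scale."""
    scale = weight.abs().max() / 127.0
    q = torch.clamp(torch.round(weight / scale), -128, 127).to(torch.int8)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.to(torch.float32) * scale

w = torch.randn(4096, 4096)              # stand-in for one weight matrix
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)

print("fp32 bytes:", w.numel() * 4)      # four bytes per weight
print("int8 bytes:", q.numel())          # one byte per weight
print("max abs error:", (w - w_hat).abs().max().item())
```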

Red Hat’s AI Inference Server is available as a standalone containerized solution or as an integrated component of Red Hat OpenShift AI. This is what lets you use RHAI to deploy and scale pretty much anywhere. As Brian Stevens, Red Hat’s AI CTO and former Neural Magic CEO, explained in his keynote, you can deploy it “anywhere on anything,” or, more specifically, on Red Hat OpenShift or any third-party Linux or Kubernetes environment. I don’t know about you, but I like that flexibility.
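Because the server is built on vLLM, which exposes an OpenAI-compatible HTTP API, a deployed instance can typically be queried with a standard OpenAI client. The endpoint URL, API key, and model name below are placeholder assumptions; the real values depend on how a given RHAI deployment is configured:

```python
# Querying an OpenAI-compatible inference endpoint, as exposed by vLLM-based
# servers. The base_url, api_key, and model name are placeholder assumptions.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # hypothetical local deployment
    api_key="not-needed-locally",         # placeholder credential
)

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",  # whichever model the server is serving
    messages=[{"role": "user", "content": "In one sentence, what does an inference server do?"}],
    max_tokens=64,
)

print(response.choices[0].message.content)
```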

From a business perspective, Joe Fernandes, Red Hat’s VP and general manager of the AI Business Unit, said, “Inference is where the real promise of GenAI is delivered, where user interactions are met with fast, accurate responses delivered by a given model, but it must be delivered in an effective and cost-efficient way. RHAI Server is intended to meet the demand for high-performing, responsive inference at scale while keeping resource demands low, providing a common inference layer that supports any model, running on any accelerator in any environment.”

Red Hat has big ambitions for RHAI. The company is aiming to do for AI what it did for Linux: make it accessible, reliable, and ubiquitous across enterprise environments.

Distributed GenAI Inference at Scale

Of course, for that to happen, you need a solid open source foundation. For that, Red Hat, in partnership with CoreWeave, Google Cloud, IBM Research, NVIDIA, and numerous other companies and groups, has launched llm-d. The llm-d project is an open source effort that marries Kubernetes, vLLM-based distributed inference, and intelligent AI-aware network routing to create robust large language model (LLM) inference clouds.

Besides Kubernetes and vLLM, llm-d also incorporates:

  • Prefill and Decode Disaggregation to separate the input-context and token-generation phases of AI into discrete operations, which can then be distributed across multiple servers.
  • KV (key-value) Cache Offloading, based on LMCache, shifts the memory burden of the KV cache from GPU memory to more cost-efficient and abundant standard storage, like CPU memory or network storage.
  • AI-Aware Network Routing for scheduling incoming requests to the servers and accelerators that are most likely to have hot caches of past inference calculations (a toy sketch of this idea follows the list).
  • High-performance communication APIs for faster and more efficient data transfer between servers, with support for NVIDIA Inference Xfer Library (NIXL).
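To make the cache-aware routing idea concrete, here is a deliberately simplified sketch, not llm-d’s actual scheduler, of a router that prefers the backend holding the longest cached prefix of an incoming prompt and falls back to the least-loaded node:

```python
# Toy illustration of cache-aware request routing (NOT llm-d's real scheduler).
# Each backend advertises the prompt prefixes it has KV-cache entries for; the
# router prefers the backend with the longest cached prefix, breaking ties by
# choosing the least-loaded node.
from dataclasses import dataclass, field

@dataclass
class Backend:
    name: str
    cached_prefixes: list[str] = field(default_factory=list)
    active_requests: int = 0

def cached_prefix_len(backend: Backend, prompt: str) -> int:
    return max((len(p) for p in backend.cached_prefixes if prompt.startswith(p)), default=0)

def route(backends: list[Backend], prompt: str) -> Backend:
    # Prefer hot caches first, then lower load.
    return max(backends, key=lambda b: (cached_prefix_len(b, prompt), -b.active_requests))

backends = [
    Backend("gpu-node-a", cached_prefixes=["You are a helpful assistant."], active_requests=3),
    Backend("gpu-node-b", cached_prefixes=[], active_requests=1),
]

prompt = "You are a helpful assistant. Summarize the llm-d announcement."
print(route(backends, prompt).name)  # gpu-node-a: it already holds the shared prefix
```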

Put it all together, Stevens explained, and “The launch of the llm-d community … marks a pivotal moment in addressing the need for scalable GenAI inference, a crucial obstacle that must be overcome to enable broader enterprise AI adoption. By tapping the innovation of vLLM and the proven capabilities of Kubernetes, llm-d paves the way for distributed, scalable and high-performing AI inference across the expanded hybrid cloud, supporting any model, any accelerator, on any cloud environment and helping realize a vision of limitless AI potential.”
