Enfabrica recently unveiled its Elastic Memory Fabric System (EMFASYS) for optimizing AI memory management. The AI-targeted memory system is a hardware and software solution built around Enfabrica’s proprietary SuperNIC silicon network interconnect technology.

Enfabrica Elastic Memory Fabric System (EMFASYS).
The system aims to greatly improve memory efficiency in AI inference operations with a transparently scalable architecture. EMFASYS integrates remote direct memory access (RDMA) Ethernet networking with parallel Compute Express Link (CXL)-based DDR5 memory channels.
We last wrote about Enfabrica nearly two years ago when it had closed its Series B funding round. All About Circuits recently revisited the company and spoke with Enfabrica CEO Rochan Sankar. The new EMFASYS product meets the company’s performance improvement goals and, according to Sankar, is ready for the next wave of accelerators that will depend upon highly parallel compute, thread-to-thread, and thread-to-memory communication over a switch.
The Problem: AI-Inference Bottlenecks
AI large language model (LLM) inference processing requires the movement of extraordinary amounts of data to the processing units and back. The latest workloads may require 10 to 100 times more compute time per query than early LLM activity. Without new ways to access memory, CPUs, GPUs, and TPUs can be left idling while data is being fetched.
The rack-mount EMFASYS solution mitigates this issue by creating an ultra-high-performance virtual memory system that is scalable, fast, and transparent to the rest of the system. It enables shared memory access by multiple processors across multiple racks.
Enfabrica's Solution Mirrors a Hub-and-Spoke Model
EMFASYS is powered by Enfabrica’s 3.2 Terabits/second (Tbps) Accelerated Compute Fabric SuperNIC (ACF-S). It elastically connects up to 18 memory channels (144 CXL lanes) to 800/400 Gigabit Ethernet (GbE) ports. As shown below, GPU rack units are connected to Enfabrica units with PCIe cabling.

EMFASYS operational diagram.
The Enfabrica units are all interconnected using 800/400 GbE RDMA Ethernet cabling. The result is faster access for complex multi-GPU AI-inference processing tasks. This comes with a cost savings of up to 50% per token per user, more efficient infrastructure scaling of batched LLM inference, and "brownfield" deployability (deploying new hardware into existing systems).
EMFASYS enables shared memory targets of up to 18 Terabytes (TB) of CXL DDR5 DRAM per node to offload GPU HBM consumption. This performance increase comes with a power consumption of about one kilowatt, added to a rack system that is already drawing tens of kilowatts. The system also aggregates CXL memory bandwidth, enabling transaction striping across multiple memory channels and Ethernet ports. The cost of fabric-attached memory is less than $20 per gigabyte.
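To make the striping idea concrete, the sketch below shows a simple round-robin stripe of a payload across several channels and its reassembly. This is purely illustrative, with arbitrary chunk and channel counts; it is not Enfabrica's implementation, which operates in hardware across CXL channels and Ethernet ports.

```python
# Illustrative sketch only: round-robin striping of a payload across
# hypothetical parallel channels, in the spirit of EMFASYS's described
# transaction striping. CHUNK and NUM_CHANNELS are arbitrary choices.

CHUNK = 4096          # bytes per stripe unit (assumed for the sketch)
NUM_CHANNELS = 4      # hypothetical channel count

def stripe(payload: bytes, channels: int = NUM_CHANNELS, chunk: int = CHUNK):
    """Split a payload into per-channel queues, round-robin by chunk."""
    queues = [[] for _ in range(channels)]
    for i in range(0, len(payload), chunk):
        queues[(i // chunk) % channels].append(payload[i:i + chunk])
    return queues

def gather(queues):
    """Reassemble the original payload from the per-channel queues."""
    out = []
    depth = max(len(q) for q in queues)
    for r in range(depth):          # read one chunk per channel per round
        for q in queues:
            if r < len(q):
                out.append(q[r])
    return b"".join(out)

data = bytes(range(256)) * 100      # example 25,600-byte payload
assert gather(stripe(data)) == data # striping is lossless and ordered
```

Because consecutive chunks land on different channels, sequential reads and writes draw bandwidth from all channels at once, which is the aggregation effect the striping is meant to capture.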
Sankar presented an airline hub-and-spoke system as a metaphor for EMFASYS in action. Different-sized data payloads can come and go at the same time without risking congestion. Large payloads can be offloaded and distributed to different processors, just as jumbo jet passengers are offloaded to regional jets going to small cities. The end result is highly efficient routing of complex data streams without congestion or collisions.
Architecture-Agnostic Memory Interaction
According to Sankar, one of the industry's challenges is that there is very little differentiation between the few LLM architecture options. GPUs and TPUs all have integrated high-bandwidth memory (HBM) with near-identical operation and similar memory limitations. Enfabrica aims to break those limitations by allowing high-performance memory interaction regardless of the underlying architecture.
Sankar emphasized the system's flexibility and performance.
“With this setup, we can actually deliver a peak of 3.2 Terabits/second (Tbps) to each accelerator, absorb congestion, and create a highly composable system side,” Sankar said. “This means you can put whatever combination of devices you want on this side. It could be Nvidia GPUs, AMD GPUs, memory, or storage. It's the same data flow motion.”
EMFASYS Sampling Now
Enfabrica is already sampling EMFASYS to a number of customers. It has delivered both complete systems and evaluation platforms. In addition, the company hosts several proof-of-concept systems at the Enfabrica data center. The units are modular, scalable, and customer-expandable.
All images used courtesy of Enfabrica.
