GPUs dominate the conversation when it comes to AI infrastructure. But while they're an essential piece of the puzzle, it's the interconnect fabrics that allow us to harness them to train and run multi-trillion-parameter models at scale.
These interconnects span multiple domains: die-to-die links on the package itself, chip-to-chip connections within a system, and the system-to-system networks that allow us to scale to hundreds of thousands of accelerators.
Developing and integrating these interconnects is no small feat. It's arguably the reason Nvidia is the powerhouse it is today. However, over the past few years, Broadcom has been quietly developing technologies that span the gamut from scale-out Ethernet fabrics all the way down to the package itself.
And, unlike Nvidia, Broadcom deals in merchant silicon. It'll sell its chips and intellectual property to anyone, and in many cases, you may never know that Broadcom was involved. In fact, it's fairly well established at this point that Google's TPUs made extensive use of Broadcom IP. Apple is also rumored to be developing server chips for AI using Broadcom designs.
For hyperscalers in particular, this model makes a lot of sense, as it means they can focus their efforts on developing differentiated logic rather than reinventing the wheel to figure out how to stitch all those chips together.
Rooted in switching
Your first thought of Broadcom may be the massive pricing headache caused by its acquisition of VMware. But if not, you probably associate the company with Ethernet switching.
While the sheer number of GPUs being deployed by the likes of Meta, xAI, Oracle, and others may grab headlines, you'd be surprised just how many switches you need to stitch them together. A cluster of 128,000 accelerators might need 5,000 or more switches just for the compute fabric, and yet more may be required for storage, management, or API access.
To address this demand, Broadcom is pushing out some seriously high-radix switches, initially with its 51.2Tbps Tomahawk 5 chips in 2022, and more recently, the 102.4Tbps Tomahawk 6 (TH6), which can be had with your choice of 1,024 100Gbps SerDes or 512 200Gbps SerDes.
The more ports you can pack into a switch, the higher its radix, and the fewer switches you need for a given number of endpoints. By our calculations, connecting the 128,000 GPUs from our earlier example at 200Gbps would require just 750 TH6 switches in a two-tier topology.
Tomahawk 6's higher radix means it can support up to 128,000 GPUs with just 750 switches in a two-tier architecture
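To see how that figure pencils out, here's a rough back-of-the-envelope sketch. It assumes a non-blocking two-tier leaf/spine fabric built from 512-port 200Gbps TH6 switches, with one 200Gbps link per GPU and each leaf splitting its ports evenly between downlinks and uplinks; Broadcom hasn't published its exact methodology, so treat this as an illustration rather than a bill of materials.

```python
import math

# Rough switch count for a non-blocking two-tier leaf/spine fabric.
# Assumptions (ours): one 200Gbps link per GPU, and each leaf switch
# splits its ports evenly between GPU-facing downlinks and spine-facing uplinks.
def two_tier_switch_count(gpus: int, ports_per_switch: int) -> tuple[int, int]:
    downlinks = ports_per_switch // 2               # ports facing the GPUs
    leaves = math.ceil(gpus / downlinks)            # leaf switches needed
    uplinks = leaves * (ports_per_switch - downlinks)
    spines = math.ceil(uplinks / ports_per_switch)  # spine switches needed
    return leaves, spines

leaves, spines = two_tier_switch_count(gpus=128_000, ports_per_switch=512)
print(leaves, spines, leaves + spines)  # 500 leaves, 250 spines, 750 total
```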
Of course, this being Ethernet, customers aren't locked into one vendor. At GTC earlier this year, Nvidia announced a 102.4Tbps Ethernet switch of its own, and we imagine Marvell and Cisco will have equivalent switches before long.
Scale-up Ethernet
Ethernet is most commonly associated with the scale-out fabrics that form the backbone of modern data centers. However, Broadcom is also positioning switches like the Tomahawk 6 as a sort of shortcut to rack-scale architectures.
If you're not familiar, these scale-up fabrics provide high-speed chip-to-chip connectivity to anywhere from eight to 72 GPUs, with designs of as many as 576 expected by 2027. While configurations of up to around eight accelerators can get by with simple point-to-point meshes, larger ones like Nvidia's NVL72 or AMD's Helios reference design require switches.
Nvidia already has its NVLink Switches, and while much of the industry has aligned around Ultra Accelerator Link (UALink), an open alternative, the spec is still in its infancy. The first release just hit in April, and dedicated UALink switching hardware has yet to materialize.
Broadcom was an early proponent of the tech, but in the past few months, its name has disappeared from the UALink Consortium website, and it's begun talking up its own Scale-Up Ethernet (SUE) stack, which is designed to work with existing switches.
Here's a quick breakdown of how Broadcom intends to support rack-scale networks using Ethernet
While there are benefits to having a stripped-down, built-for-purpose protocol like UALink for these kinds of scale-up networks, Ethernet not only gets the job done, it also has the benefit of being available today.
In fact, Intel is already using Ethernet for both scale-up and scale-out networks on its Gaudi system. AMD, meanwhile, plans to tunnel UALink over Ethernet for its first generation of rack-scale systems starting next year.
Lighting the way to bigger, more efficient networks
Alongside conventional Ethernet switching, Broadcom has been investing in co-packaged optics (CPO), going back to the introduction of Humboldt in 2021.
In a nutshell, CPO takes the lasers, digital signal processors, and retimers normally found in pluggable transceivers and moves them onto the same package as the switch ASIC.
Broadcom's latest generation of CPO switches offers up to 200Gbps per lane of optical connectivity directly to the ASIC, with no pluggable optics required
While networking vendors have resisted going down the CPO route for a while, the technology does offer a number of benefits. In particular, fewer pluggables mean substantially lower power consumption.
According to Broadcom, its CPO tech is more than 3.5x more efficient than pluggables.
The chip merchant teased the third generation of its CPO tech back at Computex, and we've since learned it will be paired with its Tomahawk 6 switch ASICs and provide up to 512 200Gbps fiber ports out the front of the switch. By 2028, the networking vendor expects to have CPO capable of 400Gbps lanes.
Broadcom isn't the only one embracing CPO. At GTC this spring, Nvidia showed off photonic versions of its Spectrum Ethernet and Quantum InfiniBand switches.
But while Nvidia is embracing photonics for its scale-out networks, it's sticking with copper for its NVLink scale-up networks for now.
Copper is lower power, but it can only stretch so far. At the speeds modern scale-up interconnects operate, those cables can only reach a few meters at most and often require additional retimers, which add latency and power consumption.
But what if you wanted to extend your scale-up network from one rack to several? For that, you're going to need optics, which is why Broadcom is also looking at ways to strap them directly to the accelerators themselves.
To test the viability of optically interconnected accelerators, Broadcom co-packaged the optics with a test chip designed to emulate a GPU
At Hot Chips last summer, the tech giant demoed a 6.4Tb/s optical Ethernet chiplet, which can be co-packaged alongside a GPU. That works out to 1.6TB/s of bidirectional bandwidth per accelerator.
At the time, Broadcom estimated this level of connectivity could support 512 GPUs, all acting as a single scale-up system with just 64 51.2Tbps switches. With Tomahawk 6, you could either cut that figure in half or add another CPO chiplet to the accelerator and double its bandwidth to 3.2TB/s.
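The arithmetic behind those numbers is straightforward, assuming each accelerator's optical bandwidth is striped evenly across a single flat tier of switches; that topology is our simplification for illustration, not a disclosed Broadcom design. A quick sketch:

```python
# How the scale-up figures pencil out, assuming each accelerator's optical
# bandwidth is striped evenly across one flat tier of switches (our
# simplification, not a disclosed Broadcom topology).
def flat_tier_switches(gpus: int, gbps_per_gpu: int, switch_tbps: float) -> float:
    total_tbps = gpus * gbps_per_gpu / 1_000  # aggregate accelerator bandwidth
    return total_tbps / switch_tbps           # switches needed to carry it

print(2 * 6_400 / 8 / 1_000)                   # 1.6 TB/s bidirectional per accelerator
print(flat_tier_switches(512, 6_400, 51.2))    # 64.0 Tomahawk 5-class switches
print(flat_tier_switches(512, 6_400, 102.4))   # 32.0 with Tomahawk 6
print(flat_tier_switches(512, 12_800, 102.4))  # 64.0 again if per-GPU bandwidth doubles
```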
Everything but the logic
While we're on the topic of chiplets, Broadcom's IP stack also extends to chip-to-chip communications and packaging.
As Moore's Law slows to a crawl, there's only so much compute you can pack into a reticle-sized die. This has driven many in the industry toward multi-die architectures. Nvidia's Blackwell accelerators, for example, are really two GPU dies that have been fused together by a high-speed chip-to-chip interconnect.
AMD's MI300-series took this to an even greater extreme, using TSMC's SoIC 3D stacking and chip-on-wafer-on-substrate (CoWoS) packaging to form a silicon sandwich with eight GPU dies stacked on top of four I/O dies.
Multi-die architectures allow you to get away with using smaller dies, which improves yields. The compute and I/O dies can also be fabbed on different process nodes to optimize for cost and efficiency. For example, AMD used TSMC's 5nm process tech for the GPU dies and the fab's older 6nm node for the I/O dies.
Designing a chiplet architecture like this is not easy. So, Broadcom has developed what is essentially a blueprint for building multi-die processors with its 3.5D eXtreme Dimension System in Package tech (3.5D XDSiP).
On the left, you see a typical accelerator built using 2.5D packaging, and on the right, Broadcom's XDSiP 3D-packaging tech
Broadcom's initial designs look a lot like AMD's MI300X, but the tech is available for anyone to license.
Despite the similarities, Broadcom's approach to interfacing compute dies with the rest of the system logic is a little different. We're told that previous 3.5D packaging technologies, like the one used on the MI300X, relied on face-to-back interfaces, which require more work to route the through-silicon vias (TSVs) that shuttle data and power between the stacked dies.
By stacking the silicon face-to-face, Broadcom says it achieves higher die-to-die interconnect speeds and shorter signal routing
Broadcom's XDSiP designs have instead been optimized for face-to-face communications using a technique called hybrid copper bonding (HCB), which allows for denser electrical interfaces between the chiplets. We're told this translates to substantially higher die-to-die interconnect speeds and shorter signal routing.
The first parts based on these designs are expected to enter production in 2026. But because chip designers are not in the habit of disclosing what IP they've built in house and which they've licensed, we may never know which AI chips or systems are using Broadcom's tech. ®