GPU accelerators used in AI processing are costly items, so getting the best use out of them ought to be a priority, yet the industry lacks an effective way of measuring this, says the Uptime Institute.
According to some sources, an Nvidia H100 can cost anywhere from $27,000 to $40,000, while renting GPUs via a cloud provider is priced at, for example, $6.98 per hour for an H100 instance on Microsoft's Azure platform. That's just for a single GPU, and naturally AI training will often require more.
Users want to keep those units working as efficiently as possible; however, research literature, disclosures by AI cluster operators, and model benchmarks all suggest that GPU resources are often wasted, Uptime says in a new report, "GPU utilization is a confusing metric."
Many AI development teams are also unaware of their actual GPU utilization, often assuming higher levels than those achieved in practice.
Uptime, which created the Tier classification levels for datacenters, says GPU servers engaged in training are only operational about 80 percent of the time, and, while running, even well-optimized models are likely to use only 35 to 45 percent of the compute performance that the silicon can deliver.
Having a simple usage metric for GPUs would be a boon for the industry, writes the report author, research analyst (and former Reg staffer) Max Smolaks. But, he says, GPUs are not comparable with other server components and require fresh ways of accounting for performance.
Current ways of tracking accelerator utilization include monitoring the average operational time for the entire server node, or tracking individual GPU load via tools supplied by the hardware provider itself, typically Nvidia or AMD.
The first method is of limited use to datacenter operators, although it may give a picture of a cluster's overall power consumption over time. The second is the most commonly used, the report says, but not always the best metric for understanding GPU efficiency, as the tools typically measure what proportion of processing elements on the chip are executing at a given time and do not account for the actual work being done.
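As an illustration of the second approach, below is a minimal sketch assuming Nvidia's NVML Python bindings (the nvidia-ml-py package, imported as pynvml) are installed; the number it reports is a coarse busy-ness percentage over a sampling window, not a measure of how much useful arithmetic the chip got through.

```python
# Minimal sketch of vendor-supplied utilization monitoring, assuming Nvidia's
# NVML Python bindings (pip install nvidia-ml-py). The "gpu" figure is a coarse
# busy-ness percentage over the sample period -- it says nothing about how
# much of the chip's FLOPS were actually put to work.
import pynvml

pynvml.nvmlInit()
try:
    for i in range(pynvml.nvmlDeviceGetCount()):
        handle = pynvml.nvmlDeviceGetHandleByIndex(i)
        util = pynvml.nvmlDeviceGetUtilizationRates(handle)
        print(f"GPU {i}: compute {util.gpu}%, memory {util.memory}%")
finally:
    pynvml.nvmlShutdown()
```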
A better method, according to Uptime, is model FLOPS (floating point operations per second) utilization, or MFU. This tracks the ratio of the observed performance of the model (measured in tokens per second) to the theoretical maximum performance of the underlying hardware, with a higher MFU equating to higher efficiency, which means shorter (and therefore less costly) training runs.
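For a rough sense of how MFU is derived, here is a minimal sketch using the widely cited approximation of six FLOPs per parameter per token for dense transformer training; the model size, throughput, and peak-FLOPS figures plugged in are illustrative assumptions, not numbers from the Uptime report.

```python
# Hedged sketch: approximating model FLOPS utilization (MFU) for a training run.
# Assumes the common "6 * N FLOPs per token" rule of thumb for a dense
# transformer (forward plus backward pass); real calculations also need to
# account for attention FLOPs, parallelism overheads, and the precision-specific
# peak throughput of the hardware.

def estimate_mfu(params: float,
                 tokens_per_second: float,
                 num_gpus: int,
                 peak_flops_per_gpu: float) -> float:
    """Return observed training FLOPS as a fraction of theoretical peak FLOPS."""
    achieved_flops = 6 * params * tokens_per_second   # rough training FLOPs per second
    peak_flops = num_gpus * peak_flops_per_gpu        # cluster-wide ceiling
    return achieved_flops / peak_flops

# Illustrative numbers only: a 7e9-parameter model pushing 600,000 tokens/s
# across 64 GPUs, each rated at roughly 1e15 FLOPS.
print(f"MFU: {estimate_mfu(7e9, 600_000, 64, 1e15):.1%}")   # ~39%
```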
The downside is that this metric, introduced by Google Research, is difficult to calculate and the resultant figures may appear puzzlingly low, with even well-optimized models only delivering between 35 and 45 percent MFU.
This is because performance is affected by factors such as network latency and storage throughput, which mean a 100 percent score is unachievable in practice; results above 50 percent represent the current pinnacle.
Uptime concludes there is currently no entirely satisfactory metric to gauge whether GPU resources are being used effectively, but that MFU shows promise, particularly as it has a more-or-less direct relationship with power consumption.
More data gathered from real-world deployments is needed to establish what "good" looks like for an efficient AI cluster, the report states, but many organizations treat this information as proprietary and therefore keep it to themselves. ®