MLPerf Inference v5.1 Performance Benchmarks
Offline Scenario, Closed Division
| Network | Throughput | GPU | Server | GPU Version | Target Accuracy | Dataset |
|---|---|---|---|---|---|---|
| DeepSeek R1 | 420,659 tokens/sec | 72x GB300 | 72x GB300-288GB_aarch64, TensorRT | NVIDIA GB300 | 99% of FP16 (exact match 81.9132%) | mlperf_deepseek_r1 |
| DeepSeek R1 | 289,712 tokens/sec | 72x GB200 | 72x GB200-186GB_aarch64, TensorRT | NVIDIA GB200 | 99% of FP16 (exact match 81.9132%) | mlperf_deepseek_r1 |
| DeepSeek R1 | 33,379 tokens/sec | 8x B200 | NVIDIA DGX B200 | NVIDIA B200 | 99% of FP16 (exact match 81.9132%) | mlperf_deepseek_r1 |
| Llama3.1 405B | 16,104 tokens/sec | 72x GB300 | 72x GB300-288GB_aarch64, TensorRT | NVIDIA GB300 | 99% of FP16 ((GovReport + LongDataCollections + 65 Sample from LongBench) rougeL=21.6666, (Remaining samples of the dataset) exact_match=90.1335) | Subset of LongBench, LongDataCollections, Ruler, GovReport |
| Llama3.1 405B | 14,774 tokens/sec | 72x GB200 | 72x GB200-186GB_aarch64, TensorRT | NVIDIA GB200 | 99% of FP16 ((GovReport + LongDataCollections + 65 Sample from LongBench) rougeL=21.6666, (Remaining samples of the dataset) exact_match=90.1335) | Subset of LongBench, LongDataCollections, Ruler, GovReport |
| Llama3.1 405B | 1,660 tokens/sec | 8x B200 | Dell PowerEdge XE9685L | NVIDIA B200 | 99% of FP16 ((GovReport + LongDataCollections + 65 Sample from LongBench) rougeL=21.6666, (Remaining samples of the dataset) exact_match=90.1335) | Subset of LongBench, LongDataCollections, Ruler, GovReport |
| Llama3.1 405B | 553 tokens/sec | 8x H200 | Nebius H200 | NVIDIA H200 | 99% of FP16 ((GovReport + LongDataCollections + 65 Sample from LongBench) rougeL=21.6666, (Remaining samples of the dataset) exact_match=90.1335) | Subset of LongBench, LongDataCollections, Ruler, GovReport |
| Llama2 70B | 51,737 tokens/sec | 4x GB200 | 4x GB200-186GB_aarch64, TensorRT | NVIDIA GB200 | 99.9% of FP32 (rouge1=44.4312, rouge2=22.0352, rougeL=28.6162) | OpenOrca (max_seq_len=1024) |
| Llama2 70B | 102,909 tokens/sec | 8x B200 | ThinkSystem SR680a V3 | NVIDIA B200 | 99.9% of FP32 (rouge1=44.4312, rouge2=22.0352, rougeL=28.6162) | OpenOrca (max_seq_len=1024) |
| Llama2 70B | 35,317 tokens/sec | 8x H200 | Dell PowerEdge XE9680 | NVIDIA H200 | 99.9% of FP32 (rouge1=44.4312, rouge2=22.0352, rougeL=28.6162) | OpenOrca (max_seq_len=1024) |
| Llama3.1 8B | 146,960 tokens/sec | 8x B200 | ThinkSystem SR780a V3 | NVIDIA B200 | 99% of FP32 and 99.9% of FP32 (rouge1=42.9865, rouge2=20.1235, rougeL=29.9881) | CNN Dailymail (v3.0.0, max_seq_len=2048) |
| Llama3.1 8B | 66,037 tokens/sec | 8x H200 | HPE Cray XD670 | NVIDIA H200 | 99% of FP32 and 99.9% of FP32 (rouge1=42.9865, rouge2=20.1235, rougeL=29.9881) | CNN Dailymail (v3.0.0, max_seq_len=2048) |
| Whisper | 22,273 samples/sec | 4x GB200 | BM.GPU.GB200.4 | NVIDIA GB200 | 99% of FP32 and 99.9% of FP32 (WER=2.0671%) | LibriSpeech |
| Whisper | 45,333 samples/sec | 8x B200 | NVIDIA DGX B200 | NVIDIA B200 | 99% of FP32 and 99.9% of FP32 (WER=2.0671%) | LibriSpeech |
| Whisper | 34,451 samples/sec | 8x H200 | HPE Cray XD670 | NVIDIA H200 | 99% of FP32 and 99.9% of FP32 (WER=2.0671%) | LibriSpeech |
| Stable Diffusion XL | 33 samples/sec | 8x B200 | NVIDIA DGX B200 | NVIDIA B200 | FID range: [23.01085758, 23.95007626] and CLIP range: [31.68631873, 31.81331801] | Subset of coco-2014 val |
| Stable Diffusion XL | 19 samples/sec | 8x H200 | QuantaGrid D74H-7U | NVIDIA H200 | FID range: [23.01085758, 23.95007626] and CLIP range: [31.68631873, 31.81331801] | Subset of coco-2014 val |
| RGAT | 651,230 samples/sec | 8x B200 | NVIDIA DGX B200 | NVIDIA B200 | 99% of FP32 (72.86%) | IGBH |
| RetinaNet | 14,997 samples/sec | 8x H200 | HPE Cray XD670 | NVIDIA H200 | 99% of FP32 (0.3755 mAP) | OpenImages (800x800) |
| DLRMv2 | 647,861 samples/sec | 8x H200 | QuantaGrid D74H-7U | NVIDIA H200 | 99% of FP32 and 99.9% of FP32 (AUC=80.31%) | Synthetic Multihot Criteo Dataset |
Server Scenario, Closed Division

| Network | Throughput | GPU | Server | GPU Version | Target Accuracy | MLPerf Server Latency Constraints (ms) | Dataset |
|---|---|---|---|---|---|---|---|
| DeepSeek R1 | 209,328 tokens/sec | 72x GB300 | 72x GB300-288GB_aarch64, TensorRT | NVIDIA GB300 | 99% of FP16 (exact match 81.9132%) | TTFT/TPOT: 2000 ms/80 ms | mlperf_deepseek_r1 |
| DeepSeek R1 | 167,578 tokens/sec | 72x GB200 | 72x GB200-186GB_aarch64, TensorRT | NVIDIA GB200 | 99% of FP16 (exact match 81.9132%) | TTFT/TPOT: 2000 ms/80 ms | mlperf_deepseek_r1 |
| DeepSeek R1 | 18,592 tokens/sec | 8x B200 | NVIDIA DGX B200 | NVIDIA B200 | 99% of FP16 (exact match 81.9132%) | TTFT/TPOT: 2000 ms/80 ms | mlperf_deepseek_r1 |
| Llama3.1 405B | 12,248 tokens/sec | 72x GB300 | 72x GB300-288GB_aarch64, TensorRT | NVIDIA GB300 | 99% of FP16 ((GovReport + LongDataCollections + 65 Sample from LongBench) rougeL=21.6666, (Remaining samples of the dataset) exact_match=90.1335) | TTFT/TPOT: 6000 ms/175 ms | Subset of LongBench, LongDataCollections, Ruler, GovReport |
| Llama3.1 405B | 11,614 tokens/sec | 72x GB200 | 72x GB200-186GB_aarch64, TensorRT | NVIDIA GB200 | 99% of FP16 ((GovReport + LongDataCollections + 65 Sample from LongBench) rougeL=21.6666, (Remaining samples of the dataset) exact_match=90.1335) | TTFT/TPOT: 6000 ms/175 ms | Subset of LongBench, LongDataCollections, Ruler, GovReport |
| Llama3.1 405B | 1,280 tokens/sec | 8x B200 | Nebius B200 | NVIDIA B200 | 99% of FP16 ((GovReport + LongDataCollections + 65 Sample from LongBench) rougeL=21.6666, (Remaining samples of the dataset) exact_match=90.1335) | TTFT/TPOT: 6000 ms/175 ms | Subset of LongBench, LongDataCollections, Ruler, GovReport |
| Llama3.1 405B | 296 tokens/sec | 8x H200 | QuantaGrid D74H-7U | NVIDIA H200 | 99% of FP16 ((GovReport + LongDataCollections + 65 Sample from LongBench) rougeL=21.6666, (Remaining samples of the dataset) exact_match=90.1335) | TTFT/TPOT: 6000 ms/175 ms | Subset of LongBench, LongDataCollections, Ruler, GovReport |
| Llama3.1 405B Interactive | 9,921 tokens/sec | 72x GB200 | 72x GB200-186GB_aarch64, TensorRT | NVIDIA GB200 | 99% of FP16 ((GovReport + LongDataCollections + 65 Sample from LongBench) rougeL=21.6666, (Remaining samples of the dataset) exact_match=90.1335) | TTFT/TPOT: 4500 ms/80 ms | Subset of LongBench, LongDataCollections, Ruler, GovReport |
| Llama3.1 405B Interactive | 771 tokens/sec | 8x B200 | Nebius B200 | NVIDIA B200 | 99% of FP16 ((GovReport + LongDataCollections + 65 Sample from LongBench) rougeL=21.6666, (Remaining samples of the dataset) exact_match=90.1335) | TTFT/TPOT: 4500 ms/80 ms | Subset of LongBench, LongDataCollections, Ruler, GovReport |
| Llama3.1 405B Interactive | 203 tokens/sec | 8x H200 | Nebius H200 | NVIDIA H200 | 99% of FP16 ((GovReport + LongDataCollections + 65 Sample from LongBench) rougeL=21.6666, (Remaining samples of the dataset) exact_match=90.1335) | TTFT/TPOT: 4500 ms/80 ms | Subset of LongBench, LongDataCollections, Ruler, GovReport |
| Llama2 70B | 49,360 tokens/sec | 4x GB200 | 4x GB200-186GB_aarch64, TensorRT | NVIDIA GB200 | 99.9% of FP32 (rouge1=44.4312, rouge2=22.0352, rougeL=28.6162) | TTFT/TPOT: 2000 ms/200 ms | OpenOrca (max_seq_len=1024) |
| Llama2 70B | 101,611 tokens/sec | 8x B200 | Nebius B200 | NVIDIA B200 | 99.9% of FP32 (rouge1=44.4312, rouge2=22.0352, rougeL=28.6162) | TTFT/TPOT: 2000 ms/200 ms | OpenOrca (max_seq_len=1024) |
| Llama2 70B | 34,194 tokens/sec | 8x H200 | ASUSTeK ESC N8 H200 | NVIDIA H200 | 99.9% of FP32 (rouge1=44.4312, rouge2=22.0352, rougeL=28.6162) | TTFT/TPOT: 2000 ms/200 ms | OpenOrca (max_seq_len=1024) |
| Llama2 70B Interactive | 29,746 tokens/sec | 4x GB200 | 4x GB200-186GB_aarch64, TensorRT | NVIDIA GB200 | 99.9% of FP32 (rouge1=44.4312, rouge2=22.0352, rougeL=28.6162) | TTFT/TPOT: 450 ms/40 ms | OpenOrca (max_seq_len=1024) |
| Llama2 70B Interactive | 62,851 tokens/sec | 8x B200 | G894-SD1 | NVIDIA B200 | 99.9% of FP32 (rouge1=44.4312, rouge2=22.0352, rougeL=28.6162) | TTFT/TPOT: 450 ms/40 ms | OpenOrca (max_seq_len=1024) |
| Llama2 70B Interactive | 23,080 tokens/sec | 8x H200 | Nebius H200 | NVIDIA H200 | 99.9% of FP32 (rouge1=44.4312, rouge2=22.0352, rougeL=28.6162) | TTFT/TPOT: 450 ms/40 ms | OpenOrca (max_seq_len=1024) |
| Llama3.1 8B | 128,794 tokens/sec | 8x B200 | Dell PowerEdge XE9685L | NVIDIA B200 | 99.9% of FP32 (rouge1=44.4312, rouge2=22.0352, rougeL=28.6162) | TTFT/TPOT: 2000 ms/100 ms | OpenOrca (max_seq_len=1024) |
| Llama3.1 8B | 64,915 tokens/sec | 8x H200 | HPE Cray XD670 | NVIDIA H200 | 99.9% of FP32 (rouge1=44.4312, rouge2=22.0352, rougeL=28.6162) | TTFT/TPOT: 2000 ms/100 ms | OpenOrca (max_seq_len=1024) |
| Llama3.1 8B Interactive | 122,269 tokens/sec | 8x B200 | AS-4126GS-NBR-LCC | NVIDIA B200 | 99% of FP32 and 99.9% of FP32 (rouge1=42.9865, rouge2=20.1235, rougeL=29.9881) | TTFT/TPOT: 500 ms/30 ms | CNN Dailymail (v3.0.0, max_seq_len=2048) |
| Llama3.1 8B Interactive | 54,118 tokens/sec | 8x H200 | QuantaGrid D74H-7U | NVIDIA H200 | 99% of FP32 and 99.9% of FP32 (rouge1=42.9865, rouge2=20.1235, rougeL=29.9881) | TTFT/TPOT: 500 ms/30 ms | CNN Dailymail (v3.0.0, max_seq_len=2048) |
| Stable Diffusion XL | 29 queries/sec | 8x B200 | Supermicro SYS-422GA-NBRT-LCC | NVIDIA B200 | FID range: [23.01085758, 23.95007626] and CLIP range: [31.68631873, 31.81331801] | 20 s | Subset of coco-2014 val |
| Stable Diffusion XL | 18 queries/sec | 8x H200 | QuantaGrid D74H-7U | NVIDIA H200 | FID range: [23.01085758, 23.95007626] and CLIP range: [31.68631873, 31.81331801] | 20 s | Subset of coco-2014 val |
| RetinaNet | 14,406 queries/sec | 8x H200 | ASUSTeK ESC N8 H200 | NVIDIA H200 | 99% of FP32 (0.3755 mAP) | 100 ms | OpenImages (800x800) |
| DLRMv2 | 591,162 queries/sec | 8x H200 | ASUSTeK ESC N8 H200 | NVIDIA H200 | 99% of FP32 (AUC=80.31%) | 60 ms | Synthetic Multihot Criteo Dataset |
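The server-scenario latency constraints bound two per-request percentile metrics: time to first token (TTFT) and time per output token (TPOT). A common back-of-the-envelope relation (not MLPerf's official checker; the function names here are illustrative) is that end-to-end response time is roughly TTFT plus TPOT for each remaining token:

```python
# Sketch, assuming the usual TTFT/TPOT decomposition of LLM serving latency.
# Function names are illustrative, not from any MLPerf tool.

def e2e_latency_ms(ttft_ms: float, tpot_ms: float, output_tokens: int) -> float:
    """Approximate end-to-end latency: time-to-first-token plus
    time-per-output-token for each of the remaining tokens."""
    return ttft_ms + tpot_ms * (output_tokens - 1)

def meets_constraints(ttft_ms: float, tpot_ms: float,
                      limit_ttft_ms: float, limit_tpot_ms: float) -> bool:
    """A submission satisfies the server-scenario limits only if both
    the TTFT and the TPOT measurements are within their bounds."""
    return ttft_ms <= limit_ttft_ms and tpot_ms <= limit_tpot_ms

# Llama2 70B Interactive limits from the table above: TTFT 450 ms, TPOT 40 ms.
print(meets_constraints(400, 35, 450, 40))   # within both limits
print(e2e_latency_ms(400, 35, 500))          # implied budget for a 500-token answer
```

This makes the "Interactive" rows easier to read: tightening TPOT from 200 ms to 40 ms caps the tolerable decode time per token, which is why the Interactive throughput figures sit well below their non-interactive counterparts.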
LLM Inference Performance of NVIDIA Data Center Products
B200 Inference Performance - Max Throughput
| Model | PP | TP | Input Length | Output Length | Throughput | GPU | Server | Precision | Framework | GPU Version |
|---|---|---|---|---|---|---|---|---|---|---|
| Qwen3 235B A22B | 1 | 8 | 128 | 2048 | 66,057 output tokens/sec | 8x B200 | DGX B200 | FP4 | TensorRT-LLM 1.0 | NVIDIA B200 |
| Qwen3 235B A22B | 1 | 8 | 128 | 4096 | 39,496 output tokens/sec | 8x B200 | DGX B200 | FP4 | TensorRT-LLM 1.0 | NVIDIA B200 |
| Qwen3 235B A22B | 1 | 8 | 2048 | 128 | 7,329 output tokens/sec | 8x B200 | DGX B200 | FP4 | TensorRT-LLM 1.0 | NVIDIA B200 |
| Qwen3 235B A22B | 1 | 8 | 5000 | 500 | 8,190 output tokens/sec | 8x B200 | DGX B200 | FP4 | TensorRT-LLM 1.0 | NVIDIA B200 |
| Qwen3 235B A22B | 1 | 8 | 500 | 2000 | 57,117 output tokens/sec | 8x B200 | DGX B200 | FP4 | TensorRT-LLM 1.0 | NVIDIA B200 |
| Qwen3 235B A22B | 1 | 8 | 1000 | 1000 | 42,391 output tokens/sec | 8x B200 | DGX B200 | FP4 | TensorRT-LLM 1.0 | NVIDIA B200 |
| Qwen3 235B A22B | 1 | 8 | 1000 | 2000 | 34,105 output tokens/sec | 8x B200 | DGX B200 | FP4 | TensorRT-LLM 1.0 | NVIDIA B200 |
| Qwen3 235B A22B | 1 | 8 | 2048 | 2048 | 26,854 output tokens/sec | 8x B200 | DGX B200 | FP4 | TensorRT-LLM 1.0 | NVIDIA B200 |
| Qwen3 235B A22B | 1 | 8 | 20000 | 2000 | 4,453 output tokens/sec | 8x B200 | DGX B200 | FP4 | TensorRT-LLM 1.0 | NVIDIA B200 |
| Qwen3 30B A3B | 1 | 1 | 128 | 2048 | 37,844 output tokens/sec | 1x B200 | DGX B200 | FP4 | TensorRT-LLM 1.0 | NVIDIA B200 |
| Qwen3 30B A3B | 1 | 1 | 128 | 4096 | 24,953 output tokens/sec | 1x B200 | DGX B200 | FP4 | TensorRT-LLM 1.0 | NVIDIA B200 |
| Qwen3 30B A3B | 1 | 1 | 2048 | 128 | 6,251 output tokens/sec | 1x B200 | DGX B200 | FP4 | TensorRT-LLM 1.0 | NVIDIA B200 |
| Qwen3 30B A3B | 1 | 1 | 5000 | 500 | 6,142 output tokens/sec | 1x B200 | DGX B200 | FP4 | TensorRT-LLM 1.0 | NVIDIA B200 |
| Qwen3 30B A3B | 1 | 1 | 500 | 2000 | 27,817 output tokens/sec | 1x B200 | DGX B200 | FP4 | TensorRT-LLM 1.0 | NVIDIA B200 |
| Qwen3 30B A3B | 1 | 1 | 1000 | 1000 | 25,828 output tokens/sec | 1x B200 | DGX B200 | FP4 | TensorRT-LLM 1.0 | NVIDIA B200 |
| Qwen3 30B A3B | 1 | 1 | 1000 | 2000 | 22,051 output tokens/sec | 1x B200 | DGX B200 | FP4 | TensorRT-LLM 1.0 | NVIDIA B200 |
| Qwen3 30B A3B | 1 | 1 | 2048 | 2048 | 17,554 output tokens/sec | 1x B200 | DGX B200 | FP4 | TensorRT-LLM 1.0 | NVIDIA B200 |
| Qwen3 30B A3B | 1 | 1 | 20000 | 2000 | 2,944 output tokens/sec | 1x B200 | DGX B200 | FP4 | TensorRT-LLM 1.0 | NVIDIA B200 |
| Llama v4 Maverick | 1 | 8 | 128 | 2048 | 112,676 output tokens/sec | 8x B200 | DGX B200 | FP4 | TensorRT-LLM 1.0 | NVIDIA B200 |
| Llama v4 Maverick | 1 | 8 | 128 | 4096 | 68,170 output tokens/sec | 8x B200 | DGX B200 | FP4 | TensorRT-LLM 1.0 | NVIDIA B200 |
| Llama v4 Maverick | 1 | 8 | 2048 | 128 | 18,088 output tokens/sec | 8x B200 | DGX B200 | FP4 | TensorRT-LLM 1.0 | NVIDIA B200 |
| Llama v4 Maverick | 1 | 8 | 1000 | 1000 | 79,617 output tokens/sec | 8x B200 | DGX B200 | FP4 | TensorRT-LLM 1.0 | NVIDIA B200 |
| Llama v4 Maverick | 1 | 8 | 1000 | 2000 | 63,766 output tokens/sec | 8x B200 | DGX B200 | FP4 | TensorRT-LLM 1.0 | NVIDIA B200 |
| Llama v4 Maverick | 1 | 8 | 2048 | 2048 | 52,195 output tokens/sec | 8x B200 | DGX B200 | FP4 | TensorRT-LLM 1.0 | NVIDIA B200 |
| Llama v4 Maverick | 1 | 8 | 20000 | 2000 | 12,678 output tokens/sec | 8x B200 | DGX B200 | FP4 | TensorRT-LLM 1.0 | NVIDIA B200 |
| Llama v4 Scout | 1 | 1 | 128 | 2048 | 4,481 output tokens/sec | 1x B200 | DGX B200 | FP4 | TensorRT-LLM 1.0 | NVIDIA B200 |
| Llama v4 Scout | 1 | 1 | 128 | 4096 | 8,932 output tokens/sec | 1x B200 | DGX B200 | FP4 | TensorRT-LLM 1.0 | NVIDIA B200 |
| Llama v4 Scout | 1 | 1 | 2048 | 128 | 3,137 output tokens/sec | 1x B200 | DGX B200 | FP4 | TensorRT-LLM 1.0 | NVIDIA B200 |
| Llama v4 Scout | 1 | 1 | 5000 | 500 | 2,937 output tokens/sec | 1x B200 | DGX B200 | FP4 | TensorRT-LLM 1.0 | NVIDIA B200 |
| Llama v4 Scout | 1 | 1 | 500 | 2000 | 11,977 output tokens/sec | 1x B200 | DGX B200 | FP4 | TensorRT-LLM 1.0 | NVIDIA B200 |
| Llama v4 Scout | 1 | 1 | 1000 | 1000 | 10,591 output tokens/sec | 1x B200 | DGX B200 | FP4 | TensorRT-LLM 1.0 | NVIDIA B200 |
| Llama v4 Scout | 1 | 1 | 1000 | 2000 | 9,356 output tokens/sec | 1x B200 | DGX B200 | FP4 | TensorRT-LLM 1.0 | NVIDIA B200 |
| Llama v4 Scout | 1 | 1 | 2048 | 2048 | 7,152 output tokens/sec | 1x B200 | DGX B200 | FP4 | TensorRT-LLM 1.0 | NVIDIA B200 |
| Llama v4 Scout | 1 | 1 | 20000 | 2000 | 1,644 output tokens/sec | 1x B200 | DGX B200 | FP4 | TensorRT-LLM 1.0 | NVIDIA B200 |
| Llama v3.3 70B | 1 | 1 | 128 | 2048 | 9,922 output tokens/sec | 1x B200 | DGX B200 | FP4 | TensorRT-LLM 1.0 | NVIDIA B200 |
| Llama v3.3 70B | 1 | 1 | 128 | 4096 | 6,831 output tokens/sec | 1x B200 | DGX B200 | FP4 | TensorRT-LLM 1.0 | NVIDIA B200 |
| Llama v3.3 70B | 1 | 1 | 2048 | 128 | 1,339 output tokens/sec | 1x B200 | DGX B200 | FP4 | TensorRT-LLM 1.0 | NVIDIA B200 |
| Llama v3.3 70B | 1 | 1 | 5000 | 500 | 1,459 output tokens/sec | 1x B200 | DGX B200 | FP4 | TensorRT-LLM 1.0 | NVIDIA B200 |
| Llama v3.3 70B | 1 | 1 | 500 | 2000 | 7,762 output tokens/sec | 1x B200 | DGX B200 | FP4 | TensorRT-LLM 1.0 | NVIDIA B200 |
| Llama v3.3 70B | 1 | 1 | 1000 | 1000 | 7,007 output tokens/sec | 1x B200 | DGX B200 | FP4 | TensorRT-LLM 1.0 | NVIDIA B200 |
| Llama v3.3 70B | 1 | 1 | 1000 | 2000 | 6,737 output tokens/sec | 1x B200 | DGX B200 | FP4 | TensorRT-LLM 0.19.0 | NVIDIA B200 |
| Llama v3.3 70B | 1 | 1 | 2048 | 2048 | 4,783 output tokens/sec | 1x B200 | DGX B200 | FP4 | TensorRT-LLM 1.0 | NVIDIA B200 |
| Llama v3.3 70B | 1 | 1 | 20000 | 2000 | 665 output tokens/sec | 1x B200 | DGX B200 | FP4 | TensorRT-LLM 1.0 | NVIDIA B200 |
| Llama v3.1 405B | 1 | 4 | 128 | 2048 | 8,020 output tokens/sec | 4x B200 | DGX B200 | FP4 | TensorRT-LLM 1.0 | NVIDIA B200 |
| Llama v3.1 405B | 1 | 4 | 128 | 4096 | 6,345 output tokens/sec | 4x B200 | DGX B200 | FP4 | TensorRT-LLM 1.0 | NVIDIA B200 |
| Llama v3.1 405B | 1 | 4 | 2048 | 128 | 749 output tokens/sec | 4x B200 | DGX B200 | FP4 | TensorRT-LLM 1.0 | NVIDIA B200 |
| Llama v3.1 405B | 1 | 4 | 5000 | 500 | 1,048 output tokens/sec | 4x B200 | DGX B200 | FP4 | TensorRT-LLM 1.0 | NVIDIA B200 |
| Llama v3.1 405B | 1 | 4 | 500 | 2000 | 6,244 output tokens/sec | 4x B200 | DGX B200 | FP4 | TensorRT-LLM 1.0 | NVIDIA B200 |
| Llama v3.1 405B | 1 | 4 | 1000 | 1000 | 5,209 output tokens/sec | 4x B200 | DGX B200 | FP4 | TensorRT-LLM 1.0 | NVIDIA B200 |
| Llama v3.1 405B | 1 | 4 | 1000 | 2000 | 4,933 output tokens/sec | 4x B200 | DGX B200 | FP4 | TensorRT-LLM 1.0 | NVIDIA B200 |
| Llama v3.1 405B | 1 | 4 | 2048 | 2048 | 4,212 output tokens/sec | 4x B200 | DGX B200 | FP4 | TensorRT-LLM 1.0 | NVIDIA B200 |
| Llama v3.1 405B | 1 | 4 | 20000 | 2000 | 672 output tokens/sec | 4x B200 | DGX B200 | FP4 | TensorRT-LLM 1.0 | NVIDIA B200 |
RTX PRO 6000 Blackwell Server Edition Inference Performance - Max Throughput
| Model | PP | TP | Input Length | Output Length | Throughput | GPU | Server | Precision | Framework | GPU Version |
|---|---|---|---|---|---|---|---|---|---|---|
| Llama v4 Scout | 4 | 1 | 128 | 128 | 17,857 output tokens/sec | 4x RTX PRO 6000 | Supermicro SYS-521GE-TNRT | FP4 | TensorRT-LLM 0.21.0 | NVIDIA RTX PRO 6000 Blackwell Server Edition |
| Llama v4 Scout | 4 | 1 | 128 | 2048 | 9,491 output tokens/sec | 4x RTX PRO 6000 | Supermicro SYS-521GE-TNRT | FP4 | TensorRT-LLM 0.21.0 | NVIDIA RTX PRO 6000 Blackwell Server Edition |
| Llama v4 Scout | 2 | 2 | 128 | 4096 | 6,281 output tokens/sec | 4x RTX PRO 6000 | Supermicro SYS-521GE-TNRT | FP4 | TensorRT-LLM 0.21.0 | NVIDIA RTX PRO 6000 Blackwell Server Edition |
| Llama v4 Scout | 4 | 1 | 2048 | 128 | 3,391 output tokens/sec | 4x RTX PRO 6000 | Supermicro SYS-521GE-TNRT | FP4 | TensorRT-LLM 0.21.0 | NVIDIA RTX PRO 6000 Blackwell Server Edition |
| Llama v4 Scout | 4 | 1 | 5000 | 500 | 2,496 output tokens/sec | 4x RTX PRO 6000 | Supermicro SYS-521GE-TNRT | FP4 | TensorRT-LLM 0.21.0 | NVIDIA RTX PRO 6000 Blackwell Server Edition |
| Llama v4 Scout | 4 | 1 | 500 | 2000 | 9,253 output tokens/sec | 4x RTX PRO 6000 | Supermicro SYS-521GE-TNRT | FP4 | TensorRT-LLM 0.21.0 | NVIDIA RTX PRO 6000 Blackwell Server Edition |
| Llama v4 Scout | 4 | 1 | 1000 | 1000 | 8,121 output tokens/sec | 4x RTX PRO 6000 | Supermicro SYS-521GE-TNRT | FP4 | TensorRT-LLM 0.21.0 | NVIDIA RTX PRO 6000 Blackwell Server Edition |
| Llama v4 Scout | 4 | 1 | 1000 | 2000 | 6,980 output tokens/sec | 4x RTX PRO 6000 | Supermicro SYS-521GE-TNRT | FP4 | TensorRT-LLM 0.21.0 | NVIDIA RTX PRO 6000 Blackwell Server Edition |
| Llama v4 Scout | 4 | 1 | 2048 | 2048 | 4,939 output tokens/sec | 4x RTX PRO 6000 | Supermicro SYS-521GE-TNRT | FP4 | TensorRT-LLM 0.21.0 | NVIDIA RTX PRO 6000 Blackwell Server Edition |
| Llama v3.3 70B | 2 | 1 | 128 | 2048 | 4,776 output tokens/sec | 2x RTX PRO 6000 | Supermicro SYS-521GE-TNRT | FP4 | TensorRT-LLM 0.21.0 | NVIDIA RTX PRO 6000 Blackwell Server Edition |
| Llama v3.3 70B | 2 | 1 | 128 | 4096 | 2,960 output tokens/sec | 2x RTX PRO 6000 | Supermicro SYS-521GE-TNRT | FP4 | TensorRT-LLM 0.21.0 | NVIDIA RTX PRO 6000 Blackwell Server Edition |
| Llama v3.3 70B | 2 | 1 | 500 | 2000 | 4,026 output tokens/sec | 2x RTX PRO 6000 | Supermicro SYS-521GE-TNRT | FP4 | TensorRT-LLM 0.21.0 | NVIDIA RTX PRO 6000 Blackwell Server Edition |
| Llama v3.3 70B | 2 | 1 | 1000 | 1000 | 3,658 output tokens/sec | 2x RTX PRO 6000 | Supermicro SYS-521GE-TNRT | FP4 | TensorRT-LLM 0.21.0 | NVIDIA RTX PRO 6000 Blackwell Server Edition |
| Llama v3.3 70B | 2 | 1 | 1000 | 2000 | 3,106 output tokens/sec | 2x RTX PRO 6000 | Supermicro SYS-521GE-TNRT | FP4 | TensorRT-LLM 0.21.0 | NVIDIA RTX PRO 6000 Blackwell Server Edition |
| Llama v3.3 70B | 2 | 1 | 2048 | 2048 | 2,243 output tokens/sec | 2x RTX PRO 6000 | Supermicro SYS-521GE-TNRT | FP4 | TensorRT-LLM 0.21.0 | NVIDIA RTX PRO 6000 Blackwell Server Edition |
| Llama v3.3 70B | 2 | 1 | 20000 | 2000 | 312 output tokens/sec | 2x RTX PRO 6000 | Supermicro SYS-521GE-TNRT | FP4 | TensorRT-LLM 0.21.0 | NVIDIA RTX PRO 6000 Blackwell Server Edition |
| Llama v3.1 405B | 8 | 1 | 128 | 128 | 4,866 output tokens/sec | 8x RTX PRO 6000 | Supermicro SYS-521GE-TNRT | FP4 | TensorRT-LLM 0.21.0 | NVIDIA RTX PRO 6000 Blackwell Server Edition |
| Llama v3.1 405B | 8 | 1 | 128 | 2048 | 3,132 output tokens/sec | 8x RTX PRO 6000 | Supermicro SYS-521GE-TNRT | FP4 | TensorRT-LLM 0.21.0 | NVIDIA RTX PRO 6000 Blackwell Server Edition |
| Llama v3.1 405B | 8 | 1 | 2048 | 128 | 588 output tokens/sec | 8x RTX PRO 6000 | Supermicro SYS-521GE-TNRT | FP4 | TensorRT-LLM 0.21.0 | NVIDIA RTX PRO 6000 Blackwell Server Edition |
| Llama v3.1 405B | 8 | 1 | 5000 | 500 | 616 output tokens/sec | 8x RTX PRO 6000 | Supermicro SYS-521GE-TNRT | FP4 | TensorRT-LLM 0.21.0 | NVIDIA RTX PRO 6000 Blackwell Server Edition |
| Llama v3.1 405B | 8 | 1 | 500 | 2000 | 2,468 output tokens/sec | 8x RTX PRO 6000 | Supermicro SYS-521GE-TNRT | FP4 | TensorRT-LLM 0.21.0 | NVIDIA RTX PRO 6000 Blackwell Server Edition |
| Llama v3.1 405B | 8 | 1 | 1000 | 1000 | 2,460 output tokens/sec | 8x RTX PRO 6000 | Supermicro SYS-521GE-TNRT | FP4 | TensorRT-LLM 0.21.0 | NVIDIA RTX PRO 6000 Blackwell Server Edition |
| Llama v3.1 405B | 8 | 1 | 1000 | 2000 | 2,009 output tokens/sec | 8x RTX PRO 6000 | Supermicro SYS-521GE-TNRT | FP4 | TensorRT-LLM 0.21.0 | NVIDIA RTX PRO 6000 Blackwell Server Edition |
| Llama v3.1 405B | 8 | 1 | 2048 | 2048 | 1,485 output tokens/sec | 8x RTX PRO 6000 | Supermicro SYS-521GE-TNRT | FP4 | TensorRT-LLM 0.21.0 | NVIDIA RTX PRO 6000 Blackwell Server Edition |
| Llama v3.1 8B | 1 | 1 | 128 | 128 | 22,757 output tokens/sec | 1x RTX PRO 6000 | Supermicro SYS-521GE-TNRT | FP4 | TensorRT-LLM 0.21.0 | NVIDIA RTX PRO 6000 Blackwell Server Edition |
| Llama v3.1 8B | 1 | 1 | 128 | 4096 | 7,585 output tokens/sec | 1x RTX PRO 6000 | Supermicro SYS-521GE-TNRT | FP4 | TensorRT-LLM 0.21.0 | NVIDIA RTX PRO 6000 Blackwell Server Edition |
| Llama v3.1 8B | 1 | 1 | 2048 | 128 | 2,653 output tokens/sec | 1x RTX PRO 6000 | Supermicro SYS-521GE-TNRT | FP4 | TensorRT-LLM 0.21.0 | NVIDIA RTX PRO 6000 Blackwell Server Edition |
| Llama v3.1 8B | 1 | 1 | 5000 | 500 | 2,283 output tokens/sec | 1x RTX PRO 6000 | Supermicro SYS-521GE-TNRT | FP4 | TensorRT-LLM 0.21.0 | NVIDIA RTX PRO 6000 Blackwell Server Edition |
| Llama v3.1 8B | 1 | 1 | 500 | 2000 | 10,612 output tokens/sec | 1x RTX PRO 6000 | Supermicro SYS-521GE-TNRT | FP4 | TensorRT-LLM 0.21.0 | NVIDIA RTX PRO 6000 Blackwell Server Edition |
| Llama v3.1 8B | 1 | 1 | 1000 | 2000 | 8,000 output tokens/sec | 1x RTX PRO 6000 | Supermicro SYS-521GE-TNRT | FP4 | TensorRT-LLM 0.21.0 | NVIDIA RTX PRO 6000 Blackwell Server Edition |
| Llama v3.1 8B | 1 | 1 | 2048 | 2048 | 5,423 output tokens/sec | 1x RTX PRO 6000 | Supermicro SYS-521GE-TNRT | FP4 | TensorRT-LLM 0.21.0 | NVIDIA RTX PRO 6000 Blackwell Server Edition |
| Llama v3.1 8B | 1 | 1 | 20000 | 2000 | 756 output tokens/sec | 1x RTX PRO 6000 | Supermicro SYS-521GE-TNRT | FP4 | TensorRT-LLM 0.21.0 | NVIDIA RTX PRO 6000 Blackwell Server Edition |
H200 Inference Performance - Max Throughput
| Model | PP | TP | Input Length | Output Length | Throughput | GPU | Server | Precision | Framework | GPU Version |
|---|---|---|---|---|---|---|---|---|---|---|
| Qwen3 235B A22B | 1 | 8 | 128 | 2048 | 42,821 output tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 1.0 | NVIDIA H200 |
| Qwen3 235B A22B | 1 | 8 | 128 | 4096 | 26,852 output tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 1.0 | NVIDIA H200 |
| Qwen3 235B A22B | 1 | 8 | 2048 | 128 | 3,331 output tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 1.0 | NVIDIA H200 |
| Qwen3 235B A22B | 1 | 8 | 5000 | 500 | 3,623 output tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 1.0 | NVIDIA H200 |
| Qwen3 235B A22B | 1 | 8 | 500 | 2000 | 28,026 output tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 1.0 | NVIDIA H200 |
| Qwen3 235B A22B | 1 | 8 | 1000 | 1000 | 23,789 output tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 1.0 | NVIDIA H200 |
| Qwen3 235B A22B | 1 | 8 | 1000 | 2000 | 22,061 output tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 1.0 | NVIDIA H200 |
| Qwen3 235B A22B | 1 | 8 | 2048 | 2048 | 16,672 output tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 1.0 | NVIDIA H200 |
| Qwen3 235B A22B | 1 | 8 | 20000 | 2000 | 1,876 output tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 1.0 | NVIDIA H200 |
| Llama v4 Maverick | 1 | 8 | 128 | 2048 | 40,572 output tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 1.0 | NVIDIA H200 |
| Llama v4 Maverick | 1 | 8 | 128 | 4096 | 24,616 output tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 1.0 | NVIDIA H200 |
| Llama v4 Maverick | 1 | 8 | 2048 | 128 | 7,307 output tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 1.0 | NVIDIA H200 |
| Llama v4 Maverick | 1 | 8 | 5000 | 500 | 8,456 output tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 1.0 | NVIDIA H200 |
| Llama v4 Maverick | 1 | 8 | 500 | 2000 | 37,835 output tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 1.0 | NVIDIA H200 |
| Llama v4 Maverick | 1 | 8 | 1000 | 1000 | 31,782 output tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 1.0 | NVIDIA H200 |
| Llama v4 Maverick | 1 | 8 | 1000 | 2000 | 34,734 output tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 1.0 | NVIDIA H200 |
| Llama v4 Maverick | 1 | 8 | 2048 | 2048 | 20,957 output tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 1.0 | NVIDIA H200 |
| Llama v4 Maverick | 1 | 8 | 20000 | 2000 | 4,106 output tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 1.0 | NVIDIA H200 |
| Llama v4 Scout | 1 | 4 | 128 | 2048 | 34,316 output tokens/sec | 4x H200 | DGX H200 | FP8 | TensorRT-LLM 1.0 | NVIDIA H200 |
| Llama v4 Scout | 1 | 4 | 128 | 4096 | 21,332 output tokens/sec | 4x H200 | DGX H200 | FP8 | TensorRT-LLM 1.0 | NVIDIA H200 |
| Llama v4 Scout | 1 | 4 | 2048 | 128 | 3,699 output tokens/sec | 4x H200 | DGX H200 | FP8 | TensorRT-LLM 1.0 | NVIDIA H200 |
| Llama v4 Scout | 1 | 4 | 5000 | 500 | 4,605 output tokens/sec | 4x H200 | DGX H200 | FP8 | TensorRT-LLM 1.0 | NVIDIA H200 |
| Llama v4 Scout | 1 | 4 | 500 | 2000 | 24,630 output tokens/sec | 4x H200 | DGX H200 | FP8 | TensorRT-LLM 1.0 | NVIDIA H200 |
| Llama v4 Scout | 1 | 4 | 1000 | 1000 | 21,636 output tokens/sec | 4x H200 | DGX H200 | FP8 | TensorRT-LLM 1.0 | NVIDIA H200 |
| Llama v4 Scout | 1 | 4 | 1000 | 2000 | 18,499 output tokens/sec | 4x H200 | DGX H200 | FP8 | TensorRT-LLM 1.0 | NVIDIA H200 |
| Llama v4 Scout | 1 | 4 | 2048 | 2048 | 14,949 output tokens/sec | 4x H200 | DGX H200 | FP8 | TensorRT-LLM 1.0 | NVIDIA H200 |
| Llama v4 Scout | 1 | 4 | 20000 | 2000 | 2,105 output tokens/sec | 4x H200 | DGX H200 | FP8 | TensorRT-LLM 1.0 | NVIDIA H200 |
| Llama v3.3 70B | 1 | 1 | 128 | 2048 | 4,336 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 1.0 | NVIDIA H200 |
| Llama v3.3 70B | 1 | 1 | 128 | 4096 | 2,872 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 1.0 | NVIDIA H200 |
| Llama v3.3 70B | 1 | 1 | 2048 | 128 | 442 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 1.0 | NVIDIA H200 |
| Llama v3.3 70B | 1 | 1 | 5000 | 500 | 566 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 1.0 | NVIDIA H200 |
| Llama v3.3 70B | 1 | 1 | 500 | 2000 | 3,666 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 1.0 | NVIDIA H200 |
| Llama v3.3 70B | 1 | 1 | 1000 | 1000 | 2,909 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 1.0 | NVIDIA H200 |
| Llama v3.3 70B | 1 | 1 | 1000 | 2000 | 2,994 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 1.0 | NVIDIA H200 |
| Llama v3.3 70B | 1 | 1 | 2048 | 2048 | 2,003 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 1.0 | NVIDIA H200 |
| Llama v3.3 70B | 1 | 1 | 20000 | 2000 | 283 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 1.0 | NVIDIA H200 |
| Llama v3.1 405B | 1 | 8 | 128 | 2048 | 5,661 output tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 0.19.0 | NVIDIA H200 |
| Llama v3.1 405B | 1 | 8 | 128 | 4096 | 5,167 output tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 0.19.0 | NVIDIA H200 |
| Llama v3.1 405B | 1 | 8 | 2048 | 128 | 456 output tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 1.0 | NVIDIA H200 |
| Llama v3.1 405B | 1 | 8 | 5000 | 500 | 650 output tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 1.0 | NVIDIA H200 |
| Llama v3.1 405B | 1 | 8 | 500 | 2000 | 4,724 output tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 1.0 | NVIDIA H200 |
| Llama v3.1 405B | 1 | 8 | 1000 | 1000 | 3,330 output tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 1.0 | NVIDIA H200 |
| Llama v3.1 405B | 1 | 8 | 1000 | 2000 | 3,722 output tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 1.0 | NVIDIA H200 |
| Llama v3.1 405B | 1 | 8 | 2048 | 2048 | 2,948 output tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 1.0 | NVIDIA H200 |
| Llama v3.1 405B | 1 | 8 | 20000 | 2000 | 505 output tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 1.0 | NVIDIA H200 |
| Llama v3.1 8B | 1 | 1 | 128 | 2048 | 26,221 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 1.0 | NVIDIA H200 |
| Llama v3.1 8B | 1 | 1 | 128 | 4096 | 18,027 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 1.0 | NVIDIA H200 |
| Llama v3.1 8B | 1 | 1 | 2048 | 128 | 3,538 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 1.0 | NVIDIA H200 |
| Llama v3.1 8B | 1 | 1 | 5000 | 500 | 3,902 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 1.0 | NVIDIA H200 |
| Llama v3.1 8B | 1 | 1 | 500 | 2000 | 20,770 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 1.0 | NVIDIA H200 |
| Llama v3.1 8B | 1 | 1 | 1000 | 1000 | 17,744 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 1.0 | NVIDIA H200 |
| Llama v3.1 8B | 1 | 1 | 1000 | 2000 | 16,828 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 1.0 | NVIDIA H200 |
| Llama v3.1 8B | 1 | 1 | 2048 | 2048 | 12,194 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 1.0 | NVIDIA H200 |
| Llama v3.1 8B | 1 | 1 | 20000 | 2000 | 1,804 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 1.0 | NVIDIA H200 |
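The LLM tables report aggregate throughput at different GPU counts, so dividing by the GPU count gives a rough per-GPU figure when comparing rows. A minimal sketch, using the Llama v3.1 405B 128/2048 rows from the B200 (4x, FP4) and H200 (8x, FP8) tables; note the precisions differ, so this compares delivered configurations rather than silicon at equal precision:

```python
# Normalize aggregate throughput by GPU count so rows measured on
# different system sizes can be compared side by side.

def per_gpu(tokens_per_sec: float, num_gpus: int) -> float:
    """Aggregate output tokens/sec divided by the number of GPUs."""
    return tokens_per_sec / num_gpus

# Llama v3.1 405B, ISL/OSL 128/2048 (figures from the tables above):
b200_fp4 = per_gpu(8020, 4)   # 4x B200, FP4, TensorRT-LLM 1.0
h200_fp8 = per_gpu(5661, 8)   # 8x H200, FP8, TensorRT-LLM 0.19.0
print(f"B200 FP4: {b200_fp4:.0f} tok/s/GPU, H200 FP8: {h200_fp8:.0f} tok/s/GPU")
```

Per-GPU normalization is only a first-order comparison: tensor-parallel and pipeline-parallel configurations scale sublinearly, so an 8-GPU row divided by eight understates what a smaller partition of the same hardware might achieve.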
H100 Inference Performance - Max Throughput
| Model | PP | TP | Input Length | Output Length | Throughput | GPU | Server | Precision | Framework | GPU Version |
|---|---|---|---|---|---|---|---|---|---|---|
| Llama v3.3 70B | 1 | 2 | 128 | 2048 | 6,651 output tokens/sec | 2x H100 | DGX H100 | FP8 | TensorRT-LLM 1.0 | H100-SXM5-80GB |
| Llama v3.3 70B | 1 | 2 | 128 | 4096 | 4,199 output tokens/sec | 2x H100 | DGX H100 | FP8 | TensorRT-LLM 1.0 | H100-SXM5-80GB |
| Llama v3.3 70B | 1 | 2 | 2048 | 128 | 762 output tokens/sec | 2x H100 | DGX H100 | FP8 | TensorRT-LLM 1.0 | H100-SXM5-80GB |
| Llama v3.3 70B | 1 | 2 | 5000 | 500 | 898 output tokens/sec | 2x H100 | DGX H100 | FP8 | TensorRT-LLM 1.0 | H100-SXM5-80GB |
| Llama v3.3 70B | 1 | 2 | 500 | 2000 | 5,222 output tokens/sec | 2x H100 | DGX H100 | FP8 | TensorRT-LLM 1.0 | H100-SXM5-80GB |
| Llama v3.3 70B | 1 | 2 | 1000 | 1000 | 4,205 output tokens/sec | 2x H100 | DGX H100 | FP8 | TensorRT-LLM 1.0 | H100-SXM5-80GB |
| Llama v3.3 70B | 1 | 2 | 1000 | 2000 | 4,146 output tokens/sec | 2x H100 | DGX H100 | FP8 | TensorRT-LLM 1.0 | H100-SXM5-80GB |
| Llama v3.3 70B | 1 | 2 | 2048 | 2048 | 3,082 output tokens/sec | 2x H100 | DGX H100 | FP8 | TensorRT-LLM 1.0 | H100-SXM5-80GB |
| Llama v3.3 70B | 1 | 2 | 20000 | 2000 | 437 output tokens/sec | 2x H100 | DGX H100 | FP8 | TensorRT-LLM 1.0 | H100-SXM5-80GB |
| Llama v3.1 405B | 1 | 8 | 128 | 2048 | 4,340 output tokens/sec | 8x H100 | DGX H100 | FP8 | TensorRT-LLM 1.0 | H100-SXM5-80GB |
| Llama v3.1 405B | 1 | 8 | 128 | 4096 | 3,116 output tokens/sec | 8x H100 | DGX H100 | FP8 | TensorRT-LLM 1.0 | H100-SXM5-80GB |
| Llama v3.1 405B | 1 | 8 | 2048 | 128 | 453 output tokens/sec | 8x H100 | DGX H100 | FP8 | TensorRT-LLM 1.0 | H100-SXM5-80GB |
| Llama v3.1 405B | 1 | 8 | 5000 | 500 | 610 output tokens/sec | 8x H100 | DGX H100 | FP8 | TensorRT-LLM 1.0 | H100-SXM5-80GB |
| Llama v3.1 405B | 1 | 8 | 500 | 2000 | 3,994 output tokens/sec | 8x H100 | DGX H100 | FP8 | TensorRT-LLM 1.0 | H100-SXM5-80GB |
| Llama v3.1 405B | 1 | 8 | 1000 | 1000 | 2,919 output tokens/sec | 8x H100 | DGX H100 | FP8 | TensorRT-LLM 1.0 | H100-SXM5-80GB |
| Llama v3.1 405B | 1 | 8 | 1000 | 2000 | 2,895 output tokens/sec | 8x H100 | DGX H100 | FP8 | TensorRT-LLM 1.0 | H100-SXM5-80GB |
| Llama v3.1 405B | 1 | 8 | 2048 | 2048 | 2,296 output tokens/sec | 8x H100 | DGX H100 | FP8 | TensorRT-LLM 1.0 | H100-SXM5-80GB |
| Llama v3.1 405B | 1 | 8 | 20000 | 2000 | 345 output tokens/sec | 8x H100 | DGX H100 | FP8 | TensorRT-LLM 1.0 | H100-SXM5-80GB |
| Llama v3.1 8B | 1 | 1 | 128 | 2048 | 22,714 output tokens/sec | 1x H100 | DGX H100 | FP8 | TensorRT-LLM 1.0 | H100-SXM5-80GB |
| Llama v3.1 8B | 1 | 1 | 128 | 4096 | 14,325 output tokens/sec | 1x H100 | DGX H100 | FP8 | TensorRT-LLM 1.0 | H100-SXM5-80GB |
| Llama v3.1 8B | 1 | 1 | 2048 | 128 | 3,450 output tokens/sec | 1x H100 | DGX H100 | FP8 | TensorRT-LLM 1.0 | H100-SXM5-80GB |
| Llama v3.1 8B | 1 | 1 | 5000 | 500 | 3,459 output tokens/sec | 1x H100 | DGX H100 | FP8 | TensorRT-LLM 1.0 | H100-SXM5-80GB |
| Llama v3.1 8B | 1 | 1 | 500 | 2000 | 17,660 output tokens/sec | 1x H100 | DGX H100 | FP8 | TensorRT-LLM 1.0 | H100-SXM5-80GB |
| Llama v3.1 8B | 1 | 1 | 1000 | 1000 | 15,220 output tokens/sec | 1x H100 | DGX H100 | FP8 | TensorRT-LLM 1.0 | H100-SXM5-80GB |
| Llama v3.1 8B | 1 | 1 | 1000 | 2000 | 13,899 output tokens/sec | 1x H100 | DGX H100 | FP8 | TensorRT-LLM 1.0 | H100-SXM5-80GB |
| Llama v3.1 8B | 1 | 1 | 2048 | 2048 | 9,305 output tokens/sec | 1x H100 | DGX H100 | FP8 | TensorRT-LLM 1.0 | H100-SXM5-80GB |
| Llama v3.1 8B | 1 | 1 | 20000 | 2000 | 1,351 output tokens/sec | 1x H100 | DGX H100 | FP8 | TensorRT-LLM 1.0 | H100-SXM5-80GB |
L40S Inference Performance - Max Throughput
| Model | PP | TP | Input Length | Output Length | Throughput | GPU | Server | Precision | Framework | GPU Version |
|---|---|---|---|---|---|---|---|---|---|---|
| Llama v4 Scout | 2 | 2 | 128 | 2048 | 1,105 output tokens/sec | 4x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.21.0 | NVIDIA L40S |
| Llama v4 Scout | 2 | 2 | 128 | 4096 | 707 output tokens/sec | 4x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.21.0 | NVIDIA L40S |
| Llama v4 Scout | 4 | 1 | 2048 | 128 | 561 output tokens/sec | 4x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.21.0 | NVIDIA L40S |
| Llama v4 Scout | 4 | 1 | 5000 | 500 | 307 output tokens/sec | 4x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.21.0 | NVIDIA L40S |
| Llama v4 Scout | 2 | 2 | 500 | 2000 | 1,093 output tokens/sec | 4x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.21.0 | NVIDIA L40S |
| Llama v4 Scout | 2 | 2 | 1000 | 1000 | 920 output tokens/sec | 4x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.21.0 | NVIDIA L40S |
| Llama v4 Scout | 2 | 2 | 1000 | 2000 | 884 output tokens/sec | 4x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.21.0 | NVIDIA L40S |
| Llama v4 Scout | 2 | 2 | 2048 | 2048 | 615 output tokens/sec | 4x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.21.0 | NVIDIA L40S |
| Llama v3.3 70B | 4 | 1 | 128 | 2048 | 1,694 output tokens/sec | 4x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.21.0 | NVIDIA L40S |
| Llama v3.3 70B | 2 | 2 | 128 | 4096 | 972 output tokens/sec | 4x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.21.0 | NVIDIA L40S |
| Llama v3.3 70B | 4 | 1 | 500 | 2000 | 1,413 output tokens/sec | 4x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.21.0 | NVIDIA L40S |
| Llama v3.3 70B | 4 | 1 | 1000 | 1000 | 1,498 output tokens/sec | 4x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.21.0 | NVIDIA L40S |
| Llama v3.3 70B | 4 | 1 | 1000 | 2000 | 1,084 output tokens/sec | 4x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.21.0 | NVIDIA L40S |
| Llama v3.3 70B | 4 | 1 | 2048 | 2048 | 773 output tokens/sec | 4x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.21.0 | NVIDIA L40S |
| Llama v3.1 8B | 1 | 1 | 128 | 128 | 8,471 output tokens/sec | 1x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.21.0 | NVIDIA L40S |
| Llama v3.1 8B | 1 | 1 | 128 | 4096 | 2,888 output tokens/sec | 1x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.21.0 | NVIDIA L40S |
| Llama v3.1 8B | 1 | 1 | 2048 | 128 | 1,017 output tokens/sec | 1x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.21.0 | NVIDIA L40S |
| Llama v3.1 8B | 1 | 1 | 5000 | 500 | 863 output tokens/sec | 1x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.21.0 | NVIDIA L40S |
| Llama v3.1 8B | 1 | 1 | 500 | 2000 | 4,032 output tokens/sec | 1x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.21.0 | NVIDIA L40S |
| Llama v3.1 8B | 1 | 1 | 1000 | 2000 | 3,134 output tokens/sec | 1x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.21.0 | NVIDIA L40S |
| Llama v3.1 8B | 1 | 1 | 2048 | 2048 | 2,148 output tokens/sec | 1x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.21.0 | NVIDIA L40S |
| Llama v3.1 8B | 1 | 1 | 20000 | 2000 | 280 output tokens/sec | 1x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.21.0 | NVIDIA L40S |
Inference Performance of NVIDIA Data Center Products
B200 Inference Performance
| Network | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Stable Diffusion v2.1 (512x512) | 1 | 6.8 images/sec | - | 225.55 | 1x B200 | DGX B200 | 25.08-py3 | FP8 | Synthetic | TensorRT 10.13.2 | NVIDIA B200 |
| Stable Diffusion XL | 1 | 2.85 images/sec | - | 522.86 | 1x B200 | DGX B200 | 25.08-py3 | FP8 | Synthetic | TensorRT 10.13.2 | NVIDIA B200 |
| ResNet-50v1.5 | 2048 | 118,265 images/sec | 121 images/sec/watt | 17.32 | 1x B200 | DGX B200 | 25.08-py3 | INT8 | Synthetic | TensorRT 10.13.2 | NVIDIA B200 |
| BEVFusion Head | 1 | 2869.15 sequences/sec | 6 sequences/sec/watt | 0.35 | 1x B200 | DGX B200 | 25.08-py3 | INT8 | Synthetic | TensorRT 10.13.2 | NVIDIA B200 |
| Flux Image Generator | 1 | 0.48 images/sec | - | 2079.78 | 1x B200 | DGX B200 | 25.08-py3 | FP4 | Synthetic | TensorRT 10.13.2 | NVIDIA B200 |
| HF Swin Base | 128 | 4,572 samples/sec | 5 samples/sec/watt | 28 | 1x B200 | DGX B200 | 25.08-py3 | FP8 | Synthetic | TensorRT 10.13.2 | NVIDIA B200 |
| HF Swin Large | 128 | 2,820 samples/sec | 3 samples/sec/watt | 45.4 | 1x B200 | DGX B200 | 25.08-py3 | FP8 | Synthetic | TensorRT 10.13.2 | NVIDIA B200 |
| HF ViT Base | 1024 | 8,839 samples/sec | 9 samples/sec/watt | 115.85 | 1x B200 | DGX B200 | 25.08-py3 | FP8 | Synthetic | TensorRT 10.13.2 | NVIDIA B200 |
| HF ViT Large | 2048 | 3,127 samples/sec | 3 samples/sec/watt | 655.02 | 1x B200 | DGX B200 | 25.08-py3 | FP8 | Synthetic | TensorRT 10.13.2 | NVIDIA B200 |
| Yolo v10 M | 1 | 849.29 images/sec | 1 images/sec/watt | 1.18 | 1x B200 | DGX B200 | 25.08-py3 | INT8 | Synthetic | TensorRT 10.13.2 | NVIDIA B200 |
| Yolo v11 M | 1 | 1043.32 images/sec | 1 images/sec/watt | 0.96 | 1x B200 | DGX B200 | 25.08-py3 | INT8 | Synthetic | TensorRT 10.13.2 | NVIDIA B200 |
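For these single-stream batched measurements, throughput and latency are tied together by the batch size: throughput should come out close to batch size divided by latency. A small sanity-check sketch (the figures are the ResNet-50v1.5 row on 1x B200 from the table above; the function name is illustrative):

```python
# Sanity check: for a batched measurement, reported throughput should be
# approximately batch_size / latency.

def implied_throughput(batch_size: int, latency_ms: float) -> float:
    """Throughput implied by processing one batch per `latency_ms`."""
    return batch_size / (latency_ms / 1000.0)

# ResNet-50v1.5, 1x B200: batch 2048 at 17.32 ms reported 118,265 images/sec.
est = implied_throughput(2048, 17.32)
print(f"implied: {est:.0f} images/sec vs reported: 118,265 images/sec")
```

The two figures agree to within a fraction of a percent, which is a quick way to spot transcription errors in rows like these.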
H200 Inference Performance
| Network | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Stable Diffusion v2.1 (512x512) | 1 | 3.92 images/sec | - | 330 | 1x H200 | DGX H200 | 25.08-py3 | FP8 | Synthetic | TensorRT 10.13.2 | NVIDIA H200 |
| Stable Diffusion XL | 1 | 1.6 images/sec | - | 750.22 | 1x H200 | DGX H200 | 25.08-py3 | FP8 | Synthetic | TensorRT 10.13.2 | NVIDIA H200 |
| ResNet-50v1.5 | 2048 | 81,317 images/sec | 117 images/sec/watt | 25.19 | 1x H200 | DGX H200 | 25.08-py3 | INT8 | Synthetic | TensorRT 10.13.2 | NVIDIA H200 |
| BEVFusion Head | 1 | 2005.18 sequences/sec | 6 sequences/sec/watt | 0.5 | 1x H200 | DGX H200 | 25.08-py3 | INT8 | Synthetic | TensorRT 10.13.2 | NVIDIA H200 |
| Flux Image Generator | 1 | 0.21 images/sec | - | 4813.58 | 1x H200 | DGX H200 | 25.08-py3 | FP8 | Synthetic | TensorRT 10.13.2 | NVIDIA H200 |
| HF Swin Base | 128 | 2,976 samples/sec | 4 samples/sec/watt | 43 | 1x H200 | DGX H200 | 25.08-py3 | FP8 | Synthetic | TensorRT 10.13.2 | NVIDIA H200 |
| HF Swin Large | 128 | 1,803 samples/sec | 3 samples/sec/watt | 70.98 | 1x H200 | DGX H200 | 25.08-py3 | FP8 | Synthetic | TensorRT 10.13.2 | NVIDIA H200 |
| HF ViT Base | 2048 | 4,930 samples/sec | 7 samples/sec/watt | 415.4 | 1x H200 | DGX H200 | 25.08-py3 | FP8 | Synthetic | TensorRT 10.13.2 | NVIDIA H200 |
| HF ViT Large | 2048 | 1,684 samples/sec | 2 samples/sec/watt | 1215.82 | 1x H200 | DGX H200 | 25.08-py3 | FP8 | Synthetic | TensorRT 10.13.2 | NVIDIA H200 |
| Yolo v10 M | 1 | 432.01 images/sec | 1 images/sec/watt | 2.31 | 1x H200 | DGX H200 | 25.08-py3 | FP8 | Synthetic | TensorRT 10.13.2 | NVIDIA H200 |
| Yolo v11 M | 8 | 509.23 images/sec | 1 images/sec/watt | 1.96 | 1x H200 | DGX H200 | 25.08-py3 | FP8 | Synthetic | TensorRT 10.13.2 | NVIDIA H200 |
GH200 Inference Performance
| Network | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version |
|---|---|---|---|---|---|---|---|---|---|---|---|
| ResNet-50v1.5 | 2048 | 78,875 images/sec | 119 images/sec/watt | 25.97 | 1x GH200 | NVIDIA P3880 | 25.08-py3 | INT8 | Synthetic | TensorRT 10.13.2 | NVIDIA GH200 |
| BEVFusion Head | 1 | 2013.77 images/sec | 6 images/sec/watt | 0.5 | 1x GH200 | NVIDIA P3880 | 25.08-py3 | INT8 | Synthetic | TensorRT 10.13.2 | NVIDIA GH200 |
| HF Swin Base | 128 | 2,886 samples/sec | 4 samples/sec/watt | 44.35 | 1x GH200 | NVIDIA P3880 | 25.08-py3 | FP8 | Synthetic | TensorRT 10.13.2 | NVIDIA GH200 |
| HF Swin Large | 128 | 1,733 samples/sec | 3 samples/sec/watt | 73.87 | 1x GH200 | NVIDIA P3880 | 25.08-py3 | FP8 | Synthetic | TensorRT 10.13.2 | NVIDIA GH200 |
| HF ViT Base | 2048 | 4,710 samples/sec | 7 samples/sec/watt | 434.79 | 1x GH200 | NVIDIA P3880 | 25.08-py3 | FP8 | Synthetic | TensorRT 10.13.2 | NVIDIA GH200 |
| HF ViT Large | 2048 | 1,626 samples/sec | 2 samples/sec/watt | 1259.68 | 1x GH200 | NVIDIA P3880 | 25.08-py3 | FP8 | Synthetic | TensorRT 10.13.2 | NVIDIA GH200 |
| Yolo v10 M | 1 | 433.57 images/sec | 1 images/sec/watt | 2.31 | 1x GH200 | NVIDIA P3880 | 25.08-py3 | FP8 | Synthetic | TensorRT 10.13.2 | NVIDIA GH200 |
| Yolo v11 M | 1 | 504.17 images/sec | 1 images/sec/watt | 1.98 | 1x GH200 | NVIDIA P3880 | 25.08-py3 | FP8 | Synthetic | TensorRT 10.13.2 | NVIDIA GH200 |
H100 Inference Performance
| Network | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Stable Diffusion v2.1 (512x512) | 1 | 3.83 images/sec | - | 340.56 | 1x H100 | DGX H100 | 25.08-py3 | FP8 | Synthetic | TensorRT 10.13.2 | H100 SXM5-80GB |
| Stable Diffusion XL | 1 | 1.6 images/sec | - | 774.71 | 1x H100 | DGX H100 | 25.08-py3 | FP8 | Synthetic | TensorRT 10.13.2 | H100 SXM5-80GB |
| ResNet-50v1.5 | 2048 | 75,476 images/sec | 110 images/sec/watt | 27.13 | 1x H100 | DGX H100 | 25.08-py3 | INT8 | Synthetic | TensorRT 10.13.2 | H100 SXM5-80GB |
| BEVFusion Head | 1 | 1998.95 images/sec | 6 images/sec/watt | 0.5 | 1x H100 | DGX H100 | 25.08-py3 | INT8 | Synthetic | TensorRT 10.13.2 | H100 SXM5-80GB |
| Flux Image Generator | 1 | 0.21 images/sec | - | 4747.1 | 1x H100 | DGX H100 | 25.08-py3 | FP8 | Synthetic | TensorRT 10.13.2 | H100 SXM5-80GB |
| HF Swin Base | 128 | 2,852 samples/sec | 4 samples/sec/watt | 44.88 | 1x H100 | DGX H100 | 25.08-py3 | FP8 | Synthetic | TensorRT 10.13.2 | H100 SXM5-80GB |
| HF Swin Large | 128 | 1,792 samples/sec | 3 samples/sec/watt | 71.44 | 1x H100 | DGX H100 | 25.08-py3 | FP8 | Synthetic | TensorRT 10.13.2 | H100 SXM5-80GB |
| HF ViT Base | 2048 | 4,988 samples/sec | 7 samples/sec/watt | 410.58 | 1x H100 | DGX H100 | 25.08-py3 | FP8 | Synthetic | TensorRT 10.13.2 | H100 SXM5-80GB |
| HF ViT Large | 2048 | 5,418 samples/sec | 8 samples/sec/watt | 377.97 | 1x H100 | DGX H100 | 25.08-py3 | FP8 | Synthetic | TensorRT 10.13.2 | H100 SXM5-80GB |
| Yolo v10 M | 1 | 407.43 images/sec | 1 images/sec/watt | 2.45 | 1x H100 | DGX H100 | 25.08-py3 | FP8 | Synthetic | TensorRT 10.13.2 | H100 SXM5-80GB |
| Yolo v11 M | 1 | 476 images/sec | 1 images/sec/watt | 2.1 | 1x H100 | DGX H100 | 25.08-py3 | FP8 | Synthetic | TensorRT 10.13.2 | H100 SXM5-80GB |
L40S Inference Performance
| Network | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Stable Diffusion v2.1 (512x512) | 1 | 1.65 images/sec | - | 607.21 | 1x L40S | Supermicro SYS-521GE-TNRT | 25.08-py3 | FP8 | Synthetic | TensorRT 10.13.2 | NVIDIA L40S |
| Stable Diffusion XL | 1 | 0.6 images/sec | - | 1676.69 | 1x L40S | Supermicro SYS-521GE-TNRT | 25.08-py3 | FP8 | Synthetic | TensorRT 10.13.2 | NVIDIA L40S |
| ResNet-50v1.5 | 2048 | 23,555 images/sec | 68 images/sec/watt | 86.94 | 1x L40S | Supermicro SYS-521GE-TNRT | 25.08-py3 | INT8 | Synthetic | TensorRT 10.13.2 | NVIDIA L40S |
| BEVFusion Head | 1 | 1944.21 images/sec | 7 images/sec/watt | 0.51 | 1x L40S | Supermicro SYS-521GE-TNRT | 25.08-py3 | INT8 | Synthetic | TensorRT 10.13.2 | NVIDIA L40S |
| HF Swin Base | 32 | 1,376 samples/sec | 4 samples/sec/watt | 23.26 | 1x L40S | Supermicro SYS-521GE-TNRT | 25.08-py3 | FP8 | Synthetic | TensorRT 10.13.2 | NVIDIA L40S |
| HF Swin Large | 32 | 705 samples/sec | 2 samples/sec/watt | 45.42 | 1x L40S | Supermicro SYS-521GE-TNRT | 25.08-py3 | FP8 | Synthetic | TensorRT 10.13.2 | NVIDIA L40S |
| HF ViT Base | 1024 | 1,655 samples/sec | 5 samples/sec/watt | 618.88 | 1x L40S | Supermicro SYS-521GE-TNRT | 25.08-py3 | FP8 | Synthetic | TensorRT 10.13.2 | NVIDIA L40S |
| HF ViT Large | 2048 | 570 samples/sec | 2 samples/sec/watt | 3591.09 | 1x L40S | Supermicro SYS-521GE-TNRT | 25.08-py3 | FP8 | Synthetic | TensorRT 10.13.2 | NVIDIA L40S |
| Yolo v10 M | 1 | 273.25 samples/sec | 1 samples/sec/watt | 3.66 | 1x L40S | Supermicro SYS-521GE-TNRT | 25.08-py3 | INT8 | Synthetic | TensorRT 10.13.2 | NVIDIA L40S |
| Yolo v11 M | 1 | 308 images/sec | 1 images/sec/watt | 3.25 | 1x L40S | Supermicro SYS-521GE-TNRT | 25.08-py3 | INT8 | Synthetic | TensorRT 10.13.2 | NVIDIA L40S |
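Since the Efficiency column is throughput per watt, dividing throughput by efficiency yields a rough implied board power during the run. This is only a ballpark sketch, because the published efficiency figures are rounded to whole units; the values below are copied from the ResNet-50v1.5 rows of the tables above:

```python
# Rough implied power draw: watts ~= throughput / (throughput per watt).
# ResNet-50v1.5 rows copied from the tables above; efficiency is rounded
# to integers in the source, so treat the results as approximate.
resnet_rows = {
    "B200": (118_265, 121),
    "H200": (81_317, 117),
    "H100": (75_476, 110),
    "L40S": (23_555, 68),
}

for gpu, (imgs_per_sec, imgs_per_sec_per_watt) in resnet_rows.items():
    watts = imgs_per_sec / imgs_per_sec_per_watt
    print(f"{gpu}: ~{watts:.0f} W implied during ResNet-50 inference")
```

The implied figures land near each board's published power envelope (roughly 350 W for L40S, around 700 W for H100/H200 SXM, higher for B200), which is a useful plausibility check on the efficiency column.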