MLPerf Inference v5.1 Performance Benchmarks
Offline Scenario, Closed Division
| Network | Throughput | GPU | Server | GPU Version | Target Accuracy | Dataset |
|---|---|---|---|---|---|---|
| DeepSeek R1 | 420,659 tokens/sec | 72x GB300 | 72x GB300-288GB_aarch64, TensorRT | NVIDIA GB300 | 99% of FP16 (exact match 81.9132%) | mlperf_deepseek_r1 |
| DeepSeek R1 | 289,712 tokens/sec | 72x GB200 | 72x GB200-186GB_aarch64, TensorRT | NVIDIA GB200 | 99% of FP16 (exact match 81.9132%) | mlperf_deepseek_r1 |
| DeepSeek R1 | 33,379 tokens/sec | 8x B200 | NVIDIA DGX B200 | NVIDIA B200 | 99% of FP16 (exact match 81.9132%) | mlperf_deepseek_r1 |
| Llama3.1 405B | 16,104 tokens/sec | 72x GB300 | 72x GB300-288GB_aarch64, TensorRT | NVIDIA GB300 | 99% of FP16 ((GovReport + LongDataCollections + 65 Sample from LongBench) rougeL=21.6666, (Remaining samples of the dataset) exact_match=90.1335) | Subset of LongBench, LongDataCollections, Ruler, GovReport |
| Llama3.1 405B | 14,774 tokens/sec | 72x GB200 | 72x GB200-186GB_aarch64, TensorRT | NVIDIA GB200 | 99% of FP16 ((GovReport + LongDataCollections + 65 Sample from LongBench) rougeL=21.6666, (Remaining samples of the dataset) exact_match=90.1335) | Subset of LongBench, LongDataCollections, Ruler, GovReport |
| Llama3.1 405B | 1,660 tokens/sec | 8x B200 | Dell PowerEdge XE9685L | NVIDIA B200 | 99% of FP16 ((GovReport + LongDataCollections + 65 Sample from LongBench) rougeL=21.6666, (Remaining samples of the dataset) exact_match=90.1335) | Subset of LongBench, LongDataCollections, Ruler, GovReport |
| Llama3.1 405B | 553 tokens/sec | 8x H200 | Nebius H200 | NVIDIA H200 | 99% of FP16 ((GovReport + LongDataCollections + 65 Sample from LongBench) rougeL=21.6666, (Remaining samples of the dataset) exact_match=90.1335) | Subset of LongBench, LongDataCollections, Ruler, GovReport |
| Llama2 70B | 51,737 tokens/sec | 4x GB200 | 4x GB200-186GB_aarch64, TensorRT | NVIDIA GB200 | 99.9% of FP32 (rouge1=44.4312, rouge2=22.0352, rougeL=28.6162) | OpenOrca (max_seq_len=1024) |
| Llama2 70B | 102,909 tokens/sec | 8x B200 | ThinkSystem SR680a V3 | NVIDIA B200 | 99.9% of FP32 (rouge1=44.4312, rouge2=22.0352, rougeL=28.6162) | OpenOrca (max_seq_len=1024) |
| Llama2 70B | 35,317 tokens/sec | 8x H200 | Dell PowerEdge XE9680 | NVIDIA H200 | 99.9% of FP32 (rouge1=44.4312, rouge2=22.0352, rougeL=28.6162) | OpenOrca (max_seq_len=1024) |
| Llama3.1 8B | 146,960 tokens/sec | 8x B200 | ThinkSystem SR780a V3 | NVIDIA B200 | 99% of FP32 and 99.9% of FP32 (rouge1=42.9865, rouge2=20.1235, rougeL=29.9881) | CNN Dailymail (v3.0.0, max_seq_len=2048) |
| Llama3.1 8B | 66,037 tokens/sec | 8x H200 | HPE Cray XD670 | NVIDIA H200 | 99% of FP32 and 99.9% of FP32 (rouge1=42.9865, rouge2=20.1235, rougeL=29.9881) | CNN Dailymail (v3.0.0, max_seq_len=2048) |
| Whisper | 22,273 samples/sec | 4x GB200 | BM.GPU.GB200.4 | NVIDIA GB200 | 99% of FP32 and 99.9% of FP32 (WER=2.0671%) | LibriSpeech |
| Whisper | 45,333 samples/sec | 8x B200 | NVIDIA DGX B200 | NVIDIA B200 | 99% of FP32 and 99.9% of FP32 (WER=2.0671%) | LibriSpeech |
| Whisper | 34,451 samples/sec | 8x H200 | HPE Cray XD670 | NVIDIA H200 | 99% of FP32 and 99.9% of FP32 (WER=2.0671%) | LibriSpeech |
| Stable Diffusion XL | 33 samples/sec | 8x B200 | NVIDIA DGX B200 | NVIDIA B200 | FID range: [23.01085758, 23.95007626] and CLIP range: [31.68631873, 31.81331801] | Subset of coco-2014 val |
| Stable Diffusion XL | 19 samples/sec | 8x H200 | QuantaGrid D74H-7U | NVIDIA H200 | FID range: [23.01085758, 23.95007626] and CLIP range: [31.68631873, 31.81331801] | Subset of coco-2014 val |
| RGAT | 651,230 samples/sec | 8x B200 | NVIDIA DGX B200 | NVIDIA B200 | 99% of FP32 (72.86%) | IGBH |
| RetinaNet | 14,997 samples/sec | 8x H200 | HPE Cray XD670 | NVIDIA H200 | 99% of FP32 (0.3755 mAP) | OpenImages (800x800) |
| DLRMv2 | 647,861 samples/sec | 8x H200 | QuantaGrid D74H-7U | NVIDIA H200 | 99% of FP32 and 99.9% of FP32 (AUC=80.31%) | Synthetic Multihot Criteo Dataset |
Server Scenario, Closed Division

| Network | Throughput | GPU | Server | GPU Version | Target Accuracy | MLPerf Server Latency Constraints (ms) | Dataset |
|---|---|---|---|---|---|---|---|
| DeepSeek R1 | 209,328 tokens/sec | 72x GB300 | 72x GB300-288GB_aarch64, TensorRT | NVIDIA GB300 | 99% of FP16 (exact match 81.9132%) | TTFT/TPOT: 2000 ms/80 ms | mlperf_deepseek_r1 |
| DeepSeek R1 | 167,578 tokens/sec | 72x GB200 | 72x GB200-186GB_aarch64, TensorRT | NVIDIA GB200 | 99% of FP16 (exact match 81.9132%) | TTFT/TPOT: 2000 ms/80 ms | mlperf_deepseek_r1 |
| DeepSeek R1 | 18,592 tokens/sec | 8x B200 | NVIDIA DGX B200 | NVIDIA B200 | 99% of FP16 (exact match 81.9132%) | TTFT/TPOT: 2000 ms/80 ms | mlperf_deepseek_r1 |
| Llama3.1 405B | 12,248 tokens/sec | 72x GB300 | 72x GB300-288GB_aarch64, TensorRT | NVIDIA GB300 | 99% of FP16 ((GovReport + LongDataCollections + 65 Sample from LongBench) rougeL=21.6666, (Remaining samples of the dataset) exact_match=90.1335) | TTFT/TPOT: 6000 ms/175 ms | Subset of LongBench, LongDataCollections, Ruler, GovReport |
| Llama3.1 405B | 11,614 tokens/sec | 72x GB200 | 72x GB200-186GB_aarch64, TensorRT | NVIDIA GB200 | 99% of FP16 ((GovReport + LongDataCollections + 65 Sample from LongBench) rougeL=21.6666, (Remaining samples of the dataset) exact_match=90.1335) | TTFT/TPOT: 6000 ms/175 ms | Subset of LongBench, LongDataCollections, Ruler, GovReport |
| Llama3.1 405B | 1,280 tokens/sec | 8x B200 | Nebius B200 | NVIDIA B200 | 99% of FP16 ((GovReport + LongDataCollections + 65 Sample from LongBench) rougeL=21.6666, (Remaining samples of the dataset) exact_match=90.1335) | TTFT/TPOT: 6000 ms/175 ms | Subset of LongBench, LongDataCollections, Ruler, GovReport |
| Llama3.1 405B | 296 tokens/sec | 8x H200 | QuantaGrid D74H-7U | NVIDIA H200 | 99% of FP16 ((GovReport + LongDataCollections + 65 Sample from LongBench) rougeL=21.6666, (Remaining samples of the dataset) exact_match=90.1335) | TTFT/TPOT: 6000 ms/175 ms | Subset of LongBench, LongDataCollections, Ruler, GovReport |
| Llama3.1 405B Interactive | 9,921 tokens/sec | 72x GB200 | 72x GB200-186GB_aarch64, TensorRT | NVIDIA GB200 | 99% of FP16 ((GovReport + LongDataCollections + 65 Sample from LongBench) rougeL=21.6666, (Remaining samples of the dataset) exact_match=90.1335) | TTFT/TPOT: 4500 ms/80 ms | Subset of LongBench, LongDataCollections, Ruler, GovReport |
| Llama3.1 405B Interactive | 771 tokens/sec | 8x B200 | Nebius B200 | NVIDIA B200 | 99% of FP16 ((GovReport + LongDataCollections + 65 Sample from LongBench) rougeL=21.6666, (Remaining samples of the dataset) exact_match=90.1335) | TTFT/TPOT: 4500 ms/80 ms | Subset of LongBench, LongDataCollections, Ruler, GovReport |
| Llama3.1 405B Interactive | 203 tokens/sec | 8x H200 | Nebius H200 | NVIDIA H200 | 99% of FP16 ((GovReport + LongDataCollections + 65 Sample from LongBench) rougeL=21.6666, (Remaining samples of the dataset) exact_match=90.1335) | TTFT/TPOT: 4500 ms/80 ms | Subset of LongBench, LongDataCollections, Ruler, GovReport |
| Llama2 70B | 49,360 tokens/sec | 4x GB200 | 4x GB200-186GB_aarch64, TensorRT | NVIDIA GB200 | 99.9% of FP32 (rouge1=44.4312, rouge2=22.0352, rougeL=28.6162) | TTFT/TPOT: 2000 ms/200 ms | OpenOrca (max_seq_len=1024) |
| Llama2 70B | 101,611 tokens/sec | 8x B200 | Nebius B200 | NVIDIA B200 | 99.9% of FP32 (rouge1=44.4312, rouge2=22.0352, rougeL=28.6162) | TTFT/TPOT: 2000 ms/200 ms | OpenOrca (max_seq_len=1024) |
| Llama2 70B | 34,194 tokens/sec | 8x H200 | ASUSTeK ESC N8 H200 | NVIDIA H200 | 99.9% of FP32 (rouge1=44.4312, rouge2=22.0352, rougeL=28.6162) | TTFT/TPOT: 2000 ms/200 ms | OpenOrca (max_seq_len=1024) |
| Llama2 70B Interactive | 29,746 tokens/sec | 4x GB200 | 4x GB200-186GB_aarch64, TensorRT | NVIDIA GB200 | 99.9% of FP32 (rouge1=44.4312, rouge2=22.0352, rougeL=28.6162) | TTFT/TPOT: 450 ms/40 ms | OpenOrca (max_seq_len=1024) |
| Llama2 70B Interactive | 62,851 tokens/sec | 8x B200 | G894-SD1 | NVIDIA B200 | 99.9% of FP32 (rouge1=44.4312, rouge2=22.0352, rougeL=28.6162) | TTFT/TPOT: 450 ms/40 ms | OpenOrca (max_seq_len=1024) |
| Llama2 70B Interactive | 23,080 tokens/sec | 8x H200 | Nebius H200 | NVIDIA H200 | 99.9% of FP32 (rouge1=44.4312, rouge2=22.0352, rougeL=28.6162) | TTFT/TPOT: 450 ms/40 ms | OpenOrca (max_seq_len=1024) |
| Llama3.1 8B | 128,794 tokens/sec | 8x B200 | Dell PowerEdge XE9685L | NVIDIA B200 | 99.9% of FP32 (rouge1=44.4312, rouge2=22.0352, rougeL=28.6162) | TTFT/TPOT: 2000 ms/100 ms | OpenOrca (max_seq_len=1024) |
| Llama3.1 8B | 64,915 tokens/sec | 8x H200 | HPE Cray XD670 | NVIDIA H200 | 99.9% of FP32 (rouge1=44.4312, rouge2=22.0352, rougeL=28.6162) | TTFT/TPOT: 2000 ms/100 ms | OpenOrca (max_seq_len=1024) |
| Llama3.1 8B Interactive | 122,269 tokens/sec | 8x B200 | AS-4126GS-NBR-LCC | NVIDIA B200 | 99% of FP32 and 99.9% of FP32 (rouge1=42.9865, rouge2=20.1235, rougeL=29.9881) | TTFT/TPOT: 500 ms/30 ms | CNN Dailymail (v3.0.0, max_seq_len=2048) |
| Llama3.1 8B Interactive | 54,118 tokens/sec | 8x H200 | QuantaGrid D74H-7U | NVIDIA H200 | 99% of FP32 and 99.9% of FP32 (rouge1=42.9865, rouge2=20.1235, rougeL=29.9881) | TTFT/TPOT: 500 ms/30 ms | CNN Dailymail (v3.0.0, max_seq_len=2048) |
| Stable Diffusion XL | 29 queries/sec | 8x B200 | Supermicro SYS-422GA-NBRT-LCC | NVIDIA B200 | FID range: [23.01085758, 23.95007626] and CLIP range: [31.68631873, 31.81331801] | 20 s | Subset of coco-2014 val |
| Stable Diffusion XL | 18 queries/sec | 8x H200 | QuantaGrid D74H-7U | NVIDIA H200 | FID range: [23.01085758, 23.95007626] and CLIP range: [31.68631873, 31.81331801] | 20 s | Subset of coco-2014 val |
| RetinaNet | 14,406 queries/sec | 8x H200 | ASUSTeK ESC N8 H200 | NVIDIA H200 | 99% of FP32 (0.3755 mAP) | 100 ms | OpenImages (800x800) |
| DLRMv2 | 591,162 queries/sec | 8x H200 | ASUSTeK ESC N8 H200 | NVIDIA H200 | 99% of FP32 (AUC=80.31%) | 60 ms | Synthetic Multihot Criteo Dataset |
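The server-scenario latency constraints bound two per-request percentile metrics: time to first token (TTFT) and time per output token (TPOT). A common back-of-the-envelope relation (not MLPerf's official checker; the function names here are illustrative) is that end-to-end response time is roughly TTFT plus TPOT for each remaining token:

```python
# Sketch, assuming the usual TTFT/TPOT decomposition of LLM serving latency.
# Function names are illustrative, not from any MLPerf tool.

def e2e_latency_ms(ttft_ms: float, tpot_ms: float, output_tokens: int) -> float:
    """Approximate end-to-end latency: time-to-first-token plus
    time-per-output-token for each of the remaining tokens."""
    return ttft_ms + tpot_ms * (output_tokens - 1)

def meets_constraints(ttft_ms: float, tpot_ms: float,
                      limit_ttft_ms: float, limit_tpot_ms: float) -> bool:
    """A submission satisfies the server-scenario limits only if both
    the TTFT and the TPOT measurements are within their bounds."""
    return ttft_ms <= limit_ttft_ms and tpot_ms <= limit_tpot_ms

# Llama2 70B Interactive limits from the table above: TTFT 450 ms, TPOT 40 ms.
print(meets_constraints(400, 35, 450, 40))   # within both limits
print(e2e_latency_ms(400, 35, 500))          # implied budget for a 500-token answer
```

This makes the "Interactive" rows easier to read: tightening TPOT from 200 ms to 40 ms caps the tolerable decode time per token, which is why the Interactive throughput figures sit well below their non-interactive counterparts.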
LLM Inference Performance of NVIDIA Data Center Products
B200 Inference Performance - Max Throughput
| Model | PP | TP | Input Length | Output Length | Throughput | GPU | Server | Precision | Framework | GPU Version |
|---|---|---|---|---|---|---|---|---|---|---|
| Qwen3 235B A22B | 1 | 8 | 128 | 2048 | 66,057 output tokens/sec | 8x B200 | DGX B200 | FP4 | TensorRT-LLM 1.0 | NVIDIA B200 |
| Qwen3 235B A22B | 1 | 8 | 128 | 4096 | 39,496 output tokens/sec | 8x B200 | DGX B200 | FP4 | TensorRT-LLM 1.0 | NVIDIA B200 |
| Qwen3 235B A22B | 1 | 8 | 2048 | 128 | 7,329 output tokens/sec | 8x B200 | DGX B200 | FP4 | TensorRT-LLM 1.0 | NVIDIA B200 |
| Qwen3 235B A22B | 1 | 8 | 5000 | 500 | 8,190 output tokens/sec | 8x B200 | DGX B200 | FP4 | TensorRT-LLM 1.0 | NVIDIA B200 |
| Qwen3 235B A22B | 1 | 8 | 500 | 2000 | 57,117 output tokens/sec | 8x B200 | DGX B200 | FP4 | TensorRT-LLM 1.0 | NVIDIA B200 |
| Qwen3 235B A22B | 1 | 8 | 1000 | 1000 | 42,391 output tokens/sec | 8x B200 | DGX B200 | FP4 | TensorRT-LLM 1.0 | NVIDIA B200 |
| Qwen3 235B A22B | 1 | 8 | 1000 | 2000 | 34,105 output tokens/sec | 8x B200 | DGX B200 | FP4 | TensorRT-LLM 1.0 | NVIDIA B200 |
| Qwen3 235B A22B | 1 | 8 | 2048 | 2048 | 26,854 output tokens/sec | 8x B200 | DGX B200 | FP4 | TensorRT-LLM 1.0 | NVIDIA B200 |
| Qwen3 235B A22B | 1 | 8 | 20000 | 2000 | 4,453 output tokens/sec | 8x B200 | DGX B200 | FP4 | TensorRT-LLM 1.0 | NVIDIA B200 |
| Qwen3 30B A3B | 1 | 1 | 128 | 2048 | 37,844 output tokens/sec | 1x B200 | DGX B200 | FP4 | TensorRT-LLM 1.0 | NVIDIA B200 |
| Qwen3 30B A3B | 1 | 1 | 128 | 4096 | 24,953 output tokens/sec | 1x B200 | DGX B200 | FP4 | TensorRT-LLM 1.0 | NVIDIA B200 |
| Qwen3 30B A3B | 1 | 1 | 2048 | 128 | 6,251 output tokens/sec | 1x B200 | DGX B200 | FP4 | TensorRT-LLM 1.0 | NVIDIA B200 |
| Qwen3 30B A3B | 1 | 1 | 5000 | 500 | 6,142 output tokens/sec | 1x B200 | DGX B200 | FP4 | TensorRT-LLM 1.0 | NVIDIA B200 |
| Qwen3 30B A3B | 1 | 1 | 500 | 2000 | 27,817 output tokens/sec | 1x B200 | DGX B200 | FP4 | TensorRT-LLM 1.0 | NVIDIA B200 |
| Qwen3 30B A3B | 1 | 1 | 1000 | 1000 | 25,828 output tokens/sec | 1x B200 | DGX B200 | FP4 | TensorRT-LLM 1.0 | NVIDIA B200 |
| Qwen3 30B A3B | 1 | 1 | 1000 | 2000 | 22,051 output tokens/sec | 1x B200 | DGX B200 | FP4 | TensorRT-LLM 1.0 | NVIDIA B200 |
| Qwen3 30B A3B | 1 | 1 | 2048 | 2048 | 17,554 output tokens/sec | 1x B200 | DGX B200 | FP4 | TensorRT-LLM 1.0 | NVIDIA B200 |
| Qwen3 30B A3B | 1 | 1 | 20000 | 2000 | 2,944 output tokens/sec | 1x B200 | DGX B200 | FP4 | TensorRT-LLM 1.0 | NVIDIA B200 |
| Llama v4 Maverick | 1 | 8 | 128 | 2048 | 112,676 output tokens/sec | 8x B200 | DGX B200 | FP4 | TensorRT-LLM 1.0 | NVIDIA B200 |
| Llama v4 Maverick | 1 | 8 | 128 | 4096 | 68,170 output tokens/sec | 8x B200 | DGX B200 | FP4 | TensorRT-LLM 1.0 | NVIDIA B200 |
| Llama v4 Maverick | 1 | 8 | 2048 | 128 | 18,088 output tokens/sec | 8x B200 | DGX B200 | FP4 | TensorRT-LLM 1.0 | NVIDIA B200 |
| Llama v4 Maverick | 1 | 8 | 1000 | 1000 | 79,617 output tokens/sec | 8x B200 | DGX B200 | FP4 | TensorRT-LLM 1.0 | NVIDIA B200 |
| Llama v4 Maverick | 1 | 8 | 1000 | 2000 | 63,766 output tokens/sec | 8x B200 | DGX B200 | FP4 | TensorRT-LLM 1.0 | NVIDIA B200 |
| Llama v4 Maverick | 1 | 8 | 2048 | 2048 | 52,195 output tokens/sec | 8x B200 | DGX B200 | FP4 | TensorRT-LLM 1.0 | NVIDIA B200 |
| Llama v4 Maverick | 1 | 8 | 20000 | 2000 | 12,678 output tokens/sec | 8x B200 | DGX B200 | FP4 | TensorRT-LLM 1.0 | NVIDIA B200 |
| Llama v4 Scout | 1 | 1 | 128 | 2048 | 4,481 output tokens/sec | 1x B200 | DGX B200 | FP4 | TensorRT-LLM 1.0 | NVIDIA B200 |
| Llama v4 Scout | 1 | 1 | 128 | 4096 | 8,932 output tokens/sec | 1x B200 | DGX B200 | FP4 | TensorRT-LLM 1.0 | NVIDIA B200 |
| Llama v4 Scout | 1 | 1 | 2048 | 128 | 3,137 output tokens/sec | 1x B200 | DGX B200 | FP4 | TensorRT-LLM 1.0 | NVIDIA B200 |
| Llama v4 Scout | 1 | 1 | 5000 | 500 | 2,937 output tokens/sec | 1x B200 | DGX B200 | FP4 | TensorRT-LLM 1.0 | NVIDIA B200 |
| Llama v4 Scout | 1 | 1 | 500 | 2000 | 11,977 output tokens/sec | 1x B200 | DGX B200 | FP4 | TensorRT-LLM 1.0 | NVIDIA B200 |
| Llama v4 Scout | 1 | 1 | 1000 | 1000 | 10,591 output tokens/sec | 1x B200 | DGX B200 | FP4 | TensorRT-LLM 1.0 | NVIDIA B200 |
| Llama v4 Scout | 1 | 1 | 1000 | 2000 | 9,356 output tokens/sec | 1x B200 | DGX B200 | FP4 | TensorRT-LLM 1.0 | NVIDIA B200 |
| Llama v4 Scout | 1 | 1 | 2048 | 2048 | 7,152 output tokens/sec | 1x B200 | DGX B200 | FP4 | TensorRT-LLM 1.0 | NVIDIA B200 |
| Llama v4 Scout | 1 | 1 | 20000 | 2000 | 1,644 output tokens/sec | 1x B200 | DGX B200 | FP4 | TensorRT-LLM 1.0 | NVIDIA B200 |
| Llama v3.3 70B | 1 | 1 | 128 | 2048 | 9,922 output tokens/sec | 1x B200 | DGX B200 | FP4 | TensorRT-LLM 1.0 | NVIDIA B200 |
| Llama v3.3 70B | 1 | 1 | 128 | 4096 | 6,831 output tokens/sec | 1x B200 | DGX B200 | FP4 | TensorRT-LLM 1.0 | NVIDIA B200 |
| Llama v3.3 70B | 1 | 1 | 2048 | 128 | 1,339 output tokens/sec | 1x B200 | DGX B200 | FP4 | TensorRT-LLM 1.0 | NVIDIA B200 |
| Llama v3.3 70B | 1 | 1 | 5000 | 500 | 1,459 output tokens/sec | 1x B200 | DGX B200 | FP4 | TensorRT-LLM 1.0 | NVIDIA B200 |
| Llama v3.3 70B | 1 | 1 | 500 | 2000 | 7,762 output tokens/sec | 1x B200 | DGX B200 | FP4 | TensorRT-LLM 1.0 | NVIDIA B200 |
| Llama v3.3 70B | 1 | 1 | 1000 | 1000 | 7,007 output tokens/sec | 1x B200 | DGX B200 | FP4 | TensorRT-LLM 1.0 | NVIDIA B200 |
| Llama v3.3 70B | 1 | 1 | 1000 | 2000 | 6,737 output tokens/sec | 1x B200 | DGX B200 | FP4 | TensorRT-LLM 0.19.0 | NVIDIA B200 |
| Llama v3.3 70B | 1 | 1 | 2048 | 2048 | 4,783 output tokens/sec | 1x B200 | DGX B200 | FP4 | TensorRT-LLM 1.0 | NVIDIA B200 |
| Llama v3.3 70B | 1 | 1 | 20000 | 2000 | 665 output tokens/sec | 1x B200 | DGX B200 | FP4 | TensorRT-LLM 1.0 | NVIDIA B200 |
| Llama v3.1 405B | 1 | 4 | 128 | 2048 | 8,020 output tokens/sec | 4x B200 | DGX B200 | FP4 | TensorRT-LLM 1.0 | NVIDIA B200 |
| Llama v3.1 405B | 1 | 4 | 128 | 4096 | 6,345 output tokens/sec | 4x B200 | DGX B200 | FP4 | TensorRT-LLM 1.0 | NVIDIA B200 |
| Llama v3.1 405B | 1 | 4 | 2048 | 128 | 749 output tokens/sec | 4x B200 | DGX B200 | FP4 | TensorRT-LLM 1.0 | NVIDIA B200 |
| Llama v3.1 405B | 1 | 4 | 5000 | 500 | 1,048 output tokens/sec | 4x B200 | DGX B200 | FP4 | TensorRT-LLM 1.0 | NVIDIA B200 |
| Llama v3.1 405B | 1 | 4 | 500 | 2000 | 6,244 output tokens/sec | 4x B200 | DGX B200 | FP4 | TensorRT-LLM 1.0 | NVIDIA B200 |
| Llama v3.1 405B | 1 | 4 | 1000 | 1000 | 5,209 output tokens/sec | 4x B200 | DGX B200 | FP4 | TensorRT-LLM 1.0 | NVIDIA B200 |
| Llama v3.1 405B | 1 | 4 | 1000 | 2000 | 4,933 output tokens/sec | 4x B200 | DGX B200 | FP4 | TensorRT-LLM 1.0 | NVIDIA B200 |
| Llama v3.1 405B | 1 | 4 | 2048 | 2048 | 4,212 output tokens/sec | 4x B200 | DGX B200 | FP4 | TensorRT-LLM 1.0 | NVIDIA B200 |
| Llama v3.1 405B | 1 | 4 | 20000 | 2000 | 672 output tokens/sec | 4x B200 | DGX B200 | FP4 | TensorRT-LLM 1.0 | NVIDIA B200 |
RTX PRO 6000 Blackwell Server Edition Inference Performance - Max Throughput
| Model | PP | TP | Input Length | Output Length | Throughput | GPU | Server | Precision | Framework | GPU Version |
|---|---|---|---|---|---|---|---|---|---|---|
| Llama v4 Scout | 4 | 1 | 128 | 128 | 17,857 output tokens/sec | 4x RTX PRO 6000 | Supermicro SYS-521GE-TNRT | FP4 | TensorRT-LLM 0.21.0 | NVIDIA RTX PRO 6000 Blackwell Server Edition |
| Llama v4 Scout | 4 | 1 | 128 | 2048 | 9,491 output tokens/sec | 4x RTX PRO 6000 | Supermicro SYS-521GE-TNRT | FP4 | TensorRT-LLM 0.21.0 | NVIDIA RTX PRO 6000 Blackwell Server Edition |
| Llama v4 Scout | 2 | 2 | 128 | 4096 | 6,281 output tokens/sec | 4x RTX PRO 6000 | Supermicro SYS-521GE-TNRT | FP4 | TensorRT-LLM 0.21.0 | NVIDIA RTX PRO 6000 Blackwell Server Edition |
| Llama v4 Scout | 4 | 1 | 2048 | 128 | 3,391 output tokens/sec | 4x RTX PRO 6000 | Supermicro SYS-521GE-TNRT | FP4 | TensorRT-LLM 0.21.0 | NVIDIA RTX PRO 6000 Blackwell Server Edition |
| Llama v4 Scout | 4 | 1 | 5000 | 500 | 2,496 output tokens/sec | 4x RTX PRO 6000 | Supermicro SYS-521GE-TNRT | FP4 | TensorRT-LLM 0.21.0 | NVIDIA RTX PRO 6000 Blackwell Server Edition |
| Llama v4 Scout | 4 | 1 | 500 | 2000 | 9,253 output tokens/sec | 4x RTX PRO 6000 | Supermicro SYS-521GE-TNRT | FP4 | TensorRT-LLM 0.21.0 | NVIDIA RTX PRO 6000 Blackwell Server Edition |
| Llama v4 Scout | 4 | 1 | 1000 | 1000 | 8,121 output tokens/sec | 4x RTX PRO 6000 | Supermicro SYS-521GE-TNRT | FP4 | TensorRT-LLM 0.21.0 | NVIDIA RTX PRO 6000 Blackwell Server Edition |
| Llama v4 Scout | 4 | 1 | 1000 | 2000 | 6,980 output tokens/sec | 4x RTX PRO 6000 | Supermicro SYS-521GE-TNRT | FP4 | TensorRT-LLM 0.21.0 | NVIDIA RTX PRO 6000 Blackwell Server Edition |
| Llama v4 Scout | 4 | 1 | 2048 | 2048 | 4,939 output tokens/sec | 4x RTX PRO 6000 | Supermicro SYS-521GE-TNRT | FP4 | TensorRT-LLM 0.21.0 | NVIDIA RTX PRO 6000 Blackwell Server Edition |
| Llama v3.3 70B | 2 | 1 | 128 | 2048 | 4,776 output tokens/sec | 2x RTX PRO 6000 | Supermicro SYS-521GE-TNRT | FP4 | TensorRT-LLM 0.21.0 | NVIDIA RTX PRO 6000 Blackwell Server Edition |
| Llama v3.3 70B | 2 | 1 | 128 | 4096 | 2,960 output tokens/sec | 2x RTX PRO 6000 | Supermicro SYS-521GE-TNRT | FP4 | TensorRT-LLM 0.21.0 | NVIDIA RTX PRO 6000 Blackwell Server Edition |
| Llama v3.3 70B | 2 | 1 | 500 | 2000 | 4,026 output tokens/sec | 2x RTX PRO 6000 | Supermicro SYS-521GE-TNRT | FP4 | TensorRT-LLM 0.21.0 | NVIDIA RTX PRO 6000 Blackwell Server Edition |
| Llama v3.3 70B | 2 | 1 | 1000 | 1000 | 3,658 output tokens/sec | 2x RTX PRO 6000 | Supermicro SYS-521GE-TNRT | FP4 | TensorRT-LLM 0.21.0 | NVIDIA RTX PRO 6000 Blackwell Server Edition |
| Llama v3.3 70B | 2 | 1 | 1000 | 2000 | 3,106 output tokens/sec | 2x RTX PRO 6000 | Supermicro SYS-521GE-TNRT | FP4 | TensorRT-LLM 0.21.0 | NVIDIA RTX PRO 6000 Blackwell Server Edition |
| Llama v3.3 70B | 2 | 1 | 2048 | 2048 | 2,243 output tokens/sec | 2x RTX PRO 6000 | Supermicro SYS-521GE-TNRT | FP4 | TensorRT-LLM 0.21.0 | NVIDIA RTX PRO 6000 Blackwell Server Edition |
| Llama v3.3 70B | 2 | 1 | 20000 | 2000 | 312 output tokens/sec | 2x RTX PRO 6000 | Supermicro SYS-521GE-TNRT | FP4 | TensorRT-LLM 0.21.0 | NVIDIA RTX PRO 6000 Blackwell Server Edition |
| Llama v3.1 405B | 8 | 1 | 128 | 128 | 4,866 output tokens/sec | 8x RTX PRO 6000 | Supermicro SYS-521GE-TNRT | FP4 | TensorRT-LLM 0.21.0 | NVIDIA RTX PRO 6000 Blackwell Server Edition |
| Llama v3.1 405B | 8 | 1 | 128 | 2048 | 3,132 output tokens/sec | 8x RTX PRO 6000 | Supermicro SYS-521GE-TNRT | FP4 | TensorRT-LLM 0.21.0 | NVIDIA RTX PRO 6000 Blackwell Server Edition |
| Llama v3.1 405B | 8 | 1 | 2048 | 128 | 588 output tokens/sec | 8x RTX PRO 6000 | Supermicro SYS-521GE-TNRT | FP4 | TensorRT-LLM 0.21.0 | NVIDIA RTX PRO 6000 Blackwell Server Edition |
| Llama v3.1 405B | 8 | 1 | 5000 | 500 | 616 output tokens/sec | 8x RTX PRO 6000 | Supermicro SYS-521GE-TNRT | FP4 | TensorRT-LLM 0.21.0 | NVIDIA RTX PRO 6000 Blackwell Server Edition |
| Llama v3.1 405B | 8 | 1 | 500 | 2000 | 2,468 output tokens/sec | 8x RTX PRO 6000 | Supermicro SYS-521GE-TNRT | FP4 | TensorRT-LLM 0.21.0 | NVIDIA RTX PRO 6000 Blackwell Server Edition |
| Llama v3.1 405B | 8 | 1 | 1000 | 1000 | 2,460 output tokens/sec | 8x RTX PRO 6000 | Supermicro SYS-521GE-TNRT | FP4 | TensorRT-LLM 0.21.0 | NVIDIA RTX PRO 6000 Blackwell Server Edition |
| Llama v3.1 405B | 8 | 1 | 1000 | 2000 | 2,009 output tokens/sec | 8x RTX PRO 6000 | Supermicro SYS-521GE-TNRT | FP4 | TensorRT-LLM 0.21.0 | NVIDIA RTX PRO 6000 Blackwell Server Edition |
| Llama v3.1 405B | 8 | 1 | 2048 | 2048 | 1,485 output tokens/sec | 8x RTX PRO 6000 | Supermicro SYS-521GE-TNRT | FP4 | TensorRT-LLM 0.21.0 | NVIDIA RTX PRO 6000 Blackwell Server Edition |
| Llama v3.1 8B | 1 | 1 | 128 | 128 | 22,757 output tokens/sec | 1x RTX PRO 6000 | Supermicro SYS-521GE-TNRT | FP4 | TensorRT-LLM 0.21.0 | NVIDIA RTX PRO 6000 Blackwell Server Edition |
| Llama v3.1 8B | 1 | 1 | 128 | 4096 | 7,585 output tokens/sec | 1x RTX PRO 6000 | Supermicro SYS-521GE-TNRT | FP4 | TensorRT-LLM 0.21.0 | NVIDIA RTX PRO 6000 Blackwell Server Edition |
| Llama v3.1 8B | 1 | 1 | 2048 | 128 | 2,653 output tokens/sec | 1x RTX PRO 6000 | Supermicro SYS-521GE-TNRT | FP4 | TensorRT-LLM 0.21.0 | NVIDIA RTX PRO 6000 Blackwell Server Edition |
| Llama v3.1 8B | 1 | 1 | 5000 | 500 | 2,283 output tokens/sec | 1x RTX PRO 6000 | Supermicro SYS-521GE-TNRT | FP4 | TensorRT-LLM 0.21.0 | NVIDIA RTX PRO 6000 Blackwell Server Edition |
| Llama v3.1 8B | 1 | 1 | 500 | 2000 | 10,612 output tokens/sec | 1x RTX PRO 6000 | Supermicro SYS-521GE-TNRT | FP4 | TensorRT-LLM 0.21.0 | NVIDIA RTX PRO 6000 Blackwell Server Edition |
| Llama v3.1 8B | 1 | 1 | 1000 | 2000 | 8,000 output tokens/sec | 1x RTX PRO 6000 | Supermicro SYS-521GE-TNRT | FP4 | TensorRT-LLM 0.21.0 | NVIDIA RTX PRO 6000 Blackwell Server Edition |
| Llama v3.1 8B | 1 | 1 | 2048 | 2048 | 5,423 output tokens/sec | 1x RTX PRO 6000 | Supermicro SYS-521GE-TNRT | FP4 | TensorRT-LLM 0.21.0 | NVIDIA RTX PRO 6000 Blackwell Server Edition |
| Llama v3.1 8B | 1 | 1 | 20000 | 2000 | 756 output tokens/sec | 1x RTX PRO 6000 | Supermicro SYS-521GE-TNRT | FP4 | TensorRT-LLM 0.21.0 | NVIDIA RTX PRO 6000 Blackwell Server Edition |
H200 Inference Performance - Max Throughput
| Model | PP | TP | Input Length | Output Length | Throughput | GPU | Server | Precision | Framework | GPU Version |
|---|---|---|---|---|---|---|---|---|---|---|
| Qwen3 235B A22B | 1 | 8 | 128 | 2048 | 42,821 output tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 1.0 | NVIDIA H200 |
| Qwen3 235B A22B | 1 | 8 | 128 | 4096 | 26,852 output tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 1.0 | NVIDIA H200 |
| Qwen3 235B A22B | 1 | 8 | 2048 | 128 | 3,331 output tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 1.0 | NVIDIA H200 |
| Qwen3 235B A22B | 1 | 8 | 5000 | 500 | 3,623 output tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 1.0 | NVIDIA H200 |
| Qwen3 235B A22B | 1 | 8 | 500 | 2000 | 28,026 output tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 1.0 | NVIDIA H200 |
| Qwen3 235B A22B | 1 | 8 | 1000 | 1000 | 23,789 output tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 1.0 | NVIDIA H200 |
| Qwen3 235B A22B | 1 | 8 | 1000 | 2000 | 22,061 output tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 1.0 | NVIDIA H200 |
| Qwen3 235B A22B | 1 | 8 | 2048 | 2048 | 16,672 output tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 1.0 | NVIDIA H200 |
| Qwen3 235B A22B | 1 | 8 | 20000 | 2000 | 1,876 output tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 1.0 | NVIDIA H200 |
| Llama v4 Maverick | 1 | 8 | 128 | 2048 | 40,572 output tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 1.0 | NVIDIA H200 |
| Llama v4 Maverick | 1 | 8 | 128 | 4096 | 24,616 output tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 1.0 | NVIDIA H200 |
| Llama v4 Maverick | 1 | 8 | 2048 | 128 | 7,307 output tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 1.0 | NVIDIA H200 |
| Llama v4 Maverick | 1 | 8 | 5000 | 500 | 8,456 output tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 1.0 | NVIDIA H200 |
| Llama v4 Maverick | 1 | 8 | 500 | 2000 | 37,835 output tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 1.0 | NVIDIA H200 |
| Llama v4 Maverick | 1 | 8 | 1000 | 1000 | 31,782 output tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 1.0 | NVIDIA H200 |
| Llama v4 Maverick | 1 | 8 | 1000 | 2000 | 34,734 output tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 1.0 | NVIDIA H200 |
| Llama v4 Maverick | 1 | 8 | 2048 | 2048 | 20,957 output tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 1.0 | NVIDIA H200 |
| Llama v4 Maverick | 1 | 8 | 20000 | 2000 | 4,106 output tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 1.0 | NVIDIA H200 |
| Llama v4 Scout | 1 | 4 | 128 | 2048 | 34,316 output tokens/sec | 4x H200 | DGX H200 | FP8 | TensorRT-LLM 1.0 | NVIDIA H200 |
| Llama v4 Scout | 1 | 4 | 128 | 4096 | 21,332 output tokens/sec | 4x H200 | DGX H200 | FP8 | TensorRT-LLM 1.0 | NVIDIA H200 |
| Llama v4 Scout | 1 | 4 | 2048 | 128 | 3,699 output tokens/sec | 4x H200 | DGX H200 | FP8 | TensorRT-LLM 1.0 | NVIDIA H200 |
| Llama v4 Scout | 1 | 4 | 5000 | 500 | 4,605 output tokens/sec | 4x H200 | DGX H200 | FP8 | TensorRT-LLM 1.0 | NVIDIA H200 |
| Llama v4 Scout | 1 | 4 | 500 | 2000 | 24,630 output tokens/sec | 4x H200 | DGX H200 | FP8 | TensorRT-LLM 1.0 | NVIDIA H200 |
| Llama v4 Scout | 1 | 4 | 1000 | 1000 | 21,636 output tokens/sec | 4x H200 | DGX H200 | FP8 | TensorRT-LLM 1.0 | NVIDIA H200 |
| Llama v4 Scout | 1 | 4 | 1000 | 2000 | 18,499 output tokens/sec | 4x H200 | DGX H200 | FP8 | TensorRT-LLM 1.0 | NVIDIA H200 |
| Llama v4 Scout | 1 | 4 | 2048 | 2048 | 14,949 output tokens/sec | 4x H200 | DGX H200 | FP8 | TensorRT-LLM 1.0 | NVIDIA H200 |
| Llama v4 Scout | 1 | 4 | 20000 | 2000 | 2,105 output tokens/sec | 4x H200 | DGX H200 | FP8 | TensorRT-LLM 1.0 | NVIDIA H200 |
| Llama v3.3 70B | 1 | 1 | 128 | 2048 | 4,336 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 1.0 | NVIDIA H200 |
| Llama v3.3 70B | 1 | 1 | 128 | 4096 | 2,872 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 1.0 | NVIDIA H200 |
| Llama v3.3 70B | 1 | 1 | 2048 | 128 | 442 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 1.0 | NVIDIA H200 |
| Llama v3.3 70B | 1 | 1 | 5000 | 500 | 566 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 1.0 | NVIDIA H200 |
| Llama v3.3 70B | 1 | 1 | 500 | 2000 | 3,666 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 1.0 | NVIDIA H200 |
| Llama v3.3 70B | 1 | 1 | 1000 | 1000 | 2,909 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 1.0 | NVIDIA H200 |
| Llama v3.3 70B | 1 | 1 | 1000 | 2000 | 2,994 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 1.0 | NVIDIA H200 |
| Llama v3.3 70B | 1 | 1 | 2048 | 2048 | 2,003 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 1.0 | NVIDIA H200 |
| Llama v3.3 70B | 1 | 1 | 20000 | 2000 | 283 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 1.0 | NVIDIA H200 |
| Llama v3.1 405B | 1 | 8 | 128 | 2048 | 5,661 output tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 0.19.0 | NVIDIA H200 |
| Llama v3.1 405B | 1 | 8 | 128 | 4096 | 5,167 output tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 0.19.0 | NVIDIA H200 |
| Llama v3.1 405B | 1 | 8 | 2048 | 128 | 456 output tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 1.0 | NVIDIA H200 |
| Llama v3.1 405B | 1 | 8 | 5000 | 500 | 650 output tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 1.0 | NVIDIA H200 |
| Llama v3.1 405B | 1 | 8 | 500 | 2000 | 4,724 output tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 1.0 | NVIDIA H200 |
| Llama v3.1 405B | 1 | 8 | 1000 | 1000 | 3,330 output tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 1.0 | NVIDIA H200 |
| Llama v3.1 405B | 1 | 8 | 1000 | 2000 | 3,722 output tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 1.0 | NVIDIA H200 |
| Llama v3.1 405B | 1 | 8 | 2048 | 2048 | 2,948 output tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 1.0 | NVIDIA H200 |
| Llama v3.1 405B | 1 | 8 | 20000 | 2000 | 505 output tokens/sec | 8x H200 | DGX H200 | FP8 | TensorRT-LLM 1.0 | NVIDIA H200 |
| Llama v3.1 8B | 1 | 1 | 128 | 2048 | 26,221 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 1.0 | NVIDIA H200 |
| Llama v3.1 8B | 1 | 1 | 128 | 4096 | 18,027 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 1.0 | NVIDIA H200 |
| Llama v3.1 8B | 1 | 1 | 2048 | 128 | 3,538 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 1.0 | NVIDIA H200 |
| Llama v3.1 8B | 1 | 1 | 5000 | 500 | 3,902 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 1.0 | NVIDIA H200 |
| Llama v3.1 8B | 1 | 1 | 500 | 2000 | 20,770 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 1.0 | NVIDIA H200 |
| Llama v3.1 8B | 1 | 1 | 1000 | 1000 | 17,744 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 1.0 | NVIDIA H200 |
| Llama v3.1 8B | 1 | 1 | 1000 | 2000 | 16,828 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 1.0 | NVIDIA H200 |
| Llama v3.1 8B | 1 | 1 | 2048 | 2048 | 12,194 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 1.0 | NVIDIA H200 |
| Llama v3.1 8B | 1 | 1 | 20000 | 2000 | 1,804 output tokens/sec | 1x H200 | DGX H200 | FP8 | TensorRT-LLM 1.0 | NVIDIA H200 |
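The LLM tables report aggregate throughput at different GPU counts, so dividing by the GPU count gives a rough per-GPU figure when comparing rows. A minimal sketch, using the Llama v3.1 405B 128/2048 rows from the B200 (4x, FP4) and H200 (8x, FP8) tables; note the precisions differ, so this compares delivered configurations rather than silicon at equal precision:

```python
# Normalize aggregate throughput by GPU count so rows measured on
# different system sizes can be compared side by side.

def per_gpu(tokens_per_sec: float, num_gpus: int) -> float:
    """Aggregate output tokens/sec divided by the number of GPUs."""
    return tokens_per_sec / num_gpus

# Llama v3.1 405B, ISL/OSL 128/2048 (figures from the tables above):
b200_fp4 = per_gpu(8020, 4)   # 4x B200, FP4, TensorRT-LLM 1.0
h200_fp8 = per_gpu(5661, 8)   # 8x H200, FP8, TensorRT-LLM 0.19.0
print(f"B200 FP4: {b200_fp4:.0f} tok/s/GPU, H200 FP8: {h200_fp8:.0f} tok/s/GPU")
```

Per-GPU normalization is only a first-order comparison: tensor-parallel and pipeline-parallel configurations scale sublinearly, so an 8-GPU row divided by eight understates what a smaller partition of the same hardware might achieve.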
H100 Inference Performance - Max Throughput
| Model | PP | TP | Input Length | Output Length | Throughput | GPU | Server | Precision | Framework | GPU Version |
|---|---|---|---|---|---|---|---|---|---|---|
| Llama v3.3 70B | 1 | 2 | 128 | 2048 | 6,651 output tokens/sec | 2x H100 | DGX H100 | FP8 | TensorRT-LLM 1.0 | H100-SXM5-80GB |
| Llama v3.3 70B | 1 | 2 | 128 | 4096 | 4,199 output tokens/sec | 2x H100 | DGX H100 | FP8 | TensorRT-LLM 1.0 | H100-SXM5-80GB |
| Llama v3.3 70B | 1 | 2 | 2048 | 128 | 762 output tokens/sec | 2x H100 | DGX H100 | FP8 | TensorRT-LLM 1.0 | H100-SXM5-80GB |
| Llama v3.3 70B | 1 | 2 | 5000 | 500 | 898 output tokens/sec | 2x H100 | DGX H100 | FP8 | TensorRT-LLM 1.0 | H100-SXM5-80GB |
| Llama v3.3 70B | 1 | 2 | 500 | 2000 | 5,222 output tokens/sec | 2x H100 | DGX H100 | FP8 | TensorRT-LLM 1.0 | H100-SXM5-80GB |
| Llama v3.3 70B | 1 | 2 | 1000 | 1000 | 4,205 output tokens/sec | 2x H100 | DGX H100 | FP8 | TensorRT-LLM 1.0 | H100-SXM5-80GB |
| Llama v3.3 70B | 1 | 2 | 1000 | 2000 | 4,146 output tokens/sec | 2x H100 | DGX H100 | FP8 | TensorRT-LLM 1.0 | H100-SXM5-80GB |
| Llama v3.3 70B | 1 | 2 | 2048 | 2048 | 3,082 output tokens/sec | 2x H100 | DGX H100 | FP8 | TensorRT-LLM 1.0 | H100-SXM5-80GB |
| Llama v3.3 70B | 1 | 2 | 20000 | 2000 | 437 output tokens/sec | 2x H100 | DGX H100 | FP8 | TensorRT-LLM 1.0 | H100-SXM5-80GB |
| Llama v3.1 405B | 1 | 8 | 128 | 2048 | 4,340 output tokens/sec | 8x H100 | DGX H100 | FP8 | TensorRT-LLM 1.0 | H100-SXM5-80GB |
| Llama v3.1 405B | 1 | 8 | 128 | 4096 | 3,116 output tokens/sec | 8x H100 | DGX H100 | FP8 | TensorRT-LLM 1.0 | H100-SXM5-80GB |
| Llama v3.1 405B | 1 | 8 | 2048 | 128 | 453 output tokens/sec | 8x H100 | DGX H100 | FP8 | TensorRT-LLM 1.0 | H100-SXM5-80GB |
| Llama v3.1 405B | 1 | 8 | 5000 | 500 | 610 output tokens/sec | 8x H100 | DGX H100 | FP8 | TensorRT-LLM 1.0 | H100-SXM5-80GB |
| Llama v3.1 405B | 1 | 8 | 500 | 2000 | 3,994 output tokens/sec | 8x H100 | DGX H100 | FP8 | TensorRT-LLM 1.0 | H100-SXM5-80GB |
| Llama v3.1 405B | 1 | 8 | 1000 | 1000 | 2,919 output tokens/sec | 8x H100 | DGX H100 | FP8 | TensorRT-LLM 1.0 | H100-SXM5-80GB |
| Llama v3.1 405B | 1 | 8 | 1000 | 2000 | 2,895 output tokens/sec | 8x H100 | DGX H100 | FP8 | TensorRT-LLM 1.0 | H100-SXM5-80GB |
| Llama v3.1 405B | 1 | 8 | 2048 | 2048 | 2,296 output tokens/sec | 8x H100 | DGX H100 | FP8 | TensorRT-LLM 1.0 | H100-SXM5-80GB |
| Llama v3.1 405B | 1 | 8 | 20000 | 2000 | 345 output tokens/sec | 8x H100 | DGX H100 | FP8 | TensorRT-LLM 1.0 | H100-SXM5-80GB |
| Llama v3.1 8B | 1 | 1 | 128 | 2048 | 22,714 output tokens/sec | 1x H100 | DGX H100 | FP8 | TensorRT-LLM 1.0 | H100-SXM5-80GB |
| Llama v3.1 8B | 1 | 1 | 128 | 4096 | 14,325 output tokens/sec | 1x H100 | DGX H100 | FP8 | TensorRT-LLM 1.0 | H100-SXM5-80GB |
| Llama v3.1 8B | 1 | 1 | 2048 | 128 | 3,450 output tokens/sec | 1x H100 | DGX H100 | FP8 | TensorRT-LLM 1.0 | H100-SXM5-80GB |
| Llama v3.1 8B | 1 | 1 | 5000 | 500 | 3,459 output tokens/sec | 1x H100 | DGX H100 | FP8 | TensorRT-LLM 1.0 | H100-SXM5-80GB |
| Llama v3.1 8B | 1 | 1 | 500 | 2000 | 17,660 output tokens/sec | 1x H100 | DGX H100 | FP8 | TensorRT-LLM 1.0 | H100-SXM5-80GB |
| Llama v3.1 8B | 1 | 1 | 1000 | 1000 | 15,220 output tokens/sec | 1x H100 | DGX H100 | FP8 | TensorRT-LLM 1.0 | H100-SXM5-80GB |
| Llama v3.1 8B | 1 | 1 | 1000 | 2000 | 13,899 output tokens/sec | 1x H100 | DGX H100 | FP8 | TensorRT-LLM 1.0 | H100-SXM5-80GB |
| Llama v3.1 8B | 1 | 1 | 2048 | 2048 | 9,305 output tokens/sec | 1x H100 | DGX H100 | FP8 | TensorRT-LLM 1.0 | H100-SXM5-80GB |
| Llama v3.1 8B | 1 | 1 | 20000 | 2000 | 1,351 output tokens/sec | 1x H100 | DGX H100 | FP8 | TensorRT-LLM 1.0 | H100-SXM5-80GB |
L40S Inference Performance - Max Throughput
| Model | PP | TP | Input Length | Output Length | Throughput | GPU | Server | Precision | Framework | GPU Version |
|---|---|---|---|---|---|---|---|---|---|---|
| Llama v4 Scout | 2 | 2 | 128 | 2048 | 1,105 output tokens/sec | 4x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.21.0 | NVIDIA L40S |
| Llama v4 Scout | 2 | 2 | 128 | 4096 | 707 output tokens/sec | 4x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.21.0 | NVIDIA L40S |
| Llama v4 Scout | 4 | 1 | 2048 | 128 | 561 output tokens/sec | 4x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.21.0 | NVIDIA L40S |
| Llama v4 Scout | 4 | 1 | 5000 | 500 | 307 output tokens/sec | 4x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.21.0 | NVIDIA L40S |
| Llama v4 Scout | 2 | 2 | 500 | 2000 | 1,093 output tokens/sec | 4x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.21.0 | NVIDIA L40S |
| Llama v4 Scout | 2 | 2 | 1000 | 1000 | 920 output tokens/sec | 4x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.21.0 | NVIDIA L40S |
| Llama v4 Scout | 2 | 2 | 1000 | 2000 | 884 output tokens/sec | 4x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.21.0 | NVIDIA L40S |
| Llama v4 Scout | 2 | 2 | 2048 | 2048 | 615 output tokens/sec | 4x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.21.0 | NVIDIA L40S |
| Llama v3.3 70B | 4 | 1 | 128 | 2048 | 1,694 output tokens/sec | 4x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.21.0 | NVIDIA L40S |
| Llama v3.3 70B | 2 | 2 | 128 | 4096 | 972 output tokens/sec | 4x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.21.0 | NVIDIA L40S |
| Llama v3.3 70B | 4 | 1 | 500 | 2000 | 1,413 output tokens/sec | 4x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.21.0 | NVIDIA L40S |
| Llama v3.3 70B | 4 | 1 | 1000 | 1000 | 1,498 output tokens/sec | 4x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.21.0 | NVIDIA L40S |
| Llama v3.3 70B | 4 | 1 | 1000 | 2000 | 1,084 output tokens/sec | 4x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.21.0 | NVIDIA L40S |
| Llama v3.3 70B | 4 | 1 | 2048 | 2048 | 773 output tokens/sec | 4x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.21.0 | NVIDIA L40S |
| Llama v3.1 8B | 1 | 1 | 128 | 128 | 8,471 output tokens/sec | 1x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.21.0 | NVIDIA L40S |
| Llama v3.1 8B | 1 | 1 | 128 | 4096 | 2,888 output tokens/sec | 1x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.21.0 | NVIDIA L40S |
| Llama v3.1 8B | 1 | 1 | 2048 | 128 | 1,017 output tokens/sec | 1x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.21.0 | NVIDIA L40S |
| Llama v3.1 8B | 1 | 1 | 5000 | 500 | 863 output tokens/sec | 1x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.21.0 | NVIDIA L40S |
| Llama v3.1 8B | 1 | 1 | 500 | 2000 | 4,032 output tokens/sec | 1x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.21.0 | NVIDIA L40S |
| Llama v3.1 8B | 1 | 1 | 1000 | 2000 | 3,134 output tokens/sec | 1x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.21.0 | NVIDIA L40S |
| Llama v3.1 8B | 1 | 1 | 2048 | 2048 | 2,148 output tokens/sec | 1x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.21.0 | NVIDIA L40S |
| Llama v3.1 8B | 1 | 1 | 20000 | 2000 | 280 output tokens/sec | 1x L40S | Supermicro SYS-521GE-TNRT | FP8 | TensorRT-LLM 0.21.0 | NVIDIA L40S |
Inference Performance of NVIDIA Data Center Products
B200 Inference Performance
| Network | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Stable Diffusion v2.1 (512x512) | 1 | 6.8 images/sec | - | 225.55 | 1x B200 | DGX B200 | 25.08-py3 | FP8 | Synthetic | TensorRT 10.13.2 | NVIDIA B200 |
| Stable Diffusion XL | 1 | 2.85 images/sec | - | 522.86 | 1x B200 | DGX B200 | 25.08-py3 | FP8 | Synthetic | TensorRT 10.13.2 | NVIDIA B200 |
| ResNet-50v1.5 | 2048 | 118,265 images/sec | 121 images/sec/watt | 17.32 | 1x B200 | DGX B200 | 25.08-py3 | INT8 | Synthetic | TensorRT 10.13.2 | NVIDIA B200 |
| BEVFusion Head | 1 | 2869.15 sequences/sec | 6 sequences/sec/watt | 0.35 | 1x B200 | DGX B200 | 25.08-py3 | INT8 | Synthetic | TensorRT 10.13.2 | NVIDIA B200 |
| Flux Image Generator | 1 | 0.48 images/sec | - | 2079.78 | 1x B200 | DGX B200 | 25.08-py3 | FP4 | Synthetic | TensorRT 10.13.2 | NVIDIA B200 |
| HF Swin Base | 128 | 4,572 samples/sec | 5 samples/sec/watt | 28 | 1x B200 | DGX B200 | 25.08-py3 | FP8 | Synthetic | TensorRT 10.13.2 | NVIDIA B200 |
| HF Swin Large | 128 | 2,820 samples/sec | 3 samples/sec/watt | 45.4 | 1x B200 | DGX B200 | 25.08-py3 | FP8 | Synthetic | TensorRT 10.13.2 | NVIDIA B200 |
| HF ViT Base | 1024 | 8,839 samples/sec | 9 samples/sec/watt | 115.85 | 1x B200 | DGX B200 | 25.08-py3 | FP8 | Synthetic | TensorRT 10.13.2 | NVIDIA B200 |
| HF ViT Large | 2048 | 3,127 samples/sec | 3 samples/sec/watt | 655.02 | 1x B200 | DGX B200 | 25.08-py3 | FP8 | Synthetic | TensorRT 10.13.2 | NVIDIA B200 |
| Yolo v10 M | 1 | 849.29 images/sec | 1 images/sec/watt | 1.18 | 1x B200 | DGX B200 | 25.08-py3 | INT8 | Synthetic | TensorRT 10.13.2 | NVIDIA B200 |
| Yolo v11 M | 1 | 1043.32 images/sec | 1 images/sec/watt | 0.96 | 1x B200 | DGX B200 | 25.08-py3 | INT8 | Synthetic | TensorRT 10.13.2 | NVIDIA B200 |
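For these single-stream batched measurements, throughput and latency are tied together by the batch size: throughput should come out close to batch size divided by latency. A small sanity-check sketch (the figures are the ResNet-50v1.5 row on 1x B200 from the table above; the function name is illustrative):

```python
# Sanity check: for a batched measurement, reported throughput should be
# approximately batch_size / latency.

def implied_throughput(batch_size: int, latency_ms: float) -> float:
    """Throughput implied by processing one batch per `latency_ms`."""
    return batch_size / (latency_ms / 1000.0)

# ResNet-50v1.5, 1x B200: batch 2048 at 17.32 ms reported 118,265 images/sec.
est = implied_throughput(2048, 17.32)
print(f"implied: {est:.0f} images/sec vs reported: 118,265 images/sec")
```

The two figures agree to within a fraction of a percent, which is a quick way to spot transcription errors in rows like these.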
H200 Inference Performance
| Network | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Stable Diffusion v2.1 (512x512) | 1 | 3.92 images/sec | - | 330 | 1x H200 | DGX H200 | 25.08-py3 | FP8 | Synthetic | TensorRT 10.13.2 | NVIDIA H200 |
| Stable Diffusion XL | 1 | 1.6 images/sec | - | 750.22 | 1x H200 | DGX H200 | 25.08-py3 | FP8 | Synthetic | TensorRT 10.13.2 | NVIDIA H200 |
| ResNet-50v1.5 | 2048 | 81,317 images/sec | 117 images/sec/watt | 25.19 | 1x H200 | DGX H200 | 25.08-py3 | INT8 | Synthetic | TensorRT 10.13.2 | NVIDIA H200 |
| BEVFusion Head | 1 | 2005.18 sequences/sec | 6 sequences/sec/watt | 0.5 | 1x H200 | DGX H200 | 25.08-py3 | INT8 | Synthetic | TensorRT 10.13.2 | NVIDIA H200 |
| Flux Image Generator | 1 | 0.21 images/sec | - | 4813.58 | 1x H200 | DGX H200 | 25.08-py3 | FP8 | Synthetic | TensorRT 10.13.2 | NVIDIA H200 |
| HF Swin Base | 128 | 2,976 samples/sec | 4 samples/sec/watt | 43 | 1x H200 | DGX H200 | 25.08-py3 | FP8 | Synthetic | TensorRT 10.13.2 | NVIDIA H200 |
| HF Swin Large | 128 | 1,803 samples/sec | 3 samples/sec/watt | 70.98 | 1x H200 | DGX H200 | 25.08-py3 | FP8 | Synthetic | TensorRT 10.13.2 | NVIDIA H200 |
| HF ViT Base | 2048 | 4,930 samples/sec | 7 samples/sec/watt | 415.4 | 1x H200 | DGX H200 | 25.08-py3 | FP8 | Synthetic | TensorRT 10.13.2 | NVIDIA H200 |
| HF ViT Large | 2048 | 1,684 samples/sec | 2 samples/sec/watt | 1215.82 | 1x H200 | DGX H200 | 25.08-py3 | FP8 | Synthetic | TensorRT 10.13.2 | NVIDIA H200 |
| Yolo v10 M | 1 | 432.01 images/sec | 1 images/sec/watt | 2.31 | 1x H200 | DGX H200 | 25.08-py3 | FP8 | Synthetic | TensorRT 10.13.2 | NVIDIA H200 |
| Yolo v11 M | 8 | 509.23 images/sec | 1 images/sec/watt | 1.96 | 1x H200 | DGX H200 | 25.08-py3 | FP8 | Synthetic | TensorRT 10.13.2 | NVIDIA H200 |
GH200 Inference Performance
| Network | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version |
|---|---|---|---|---|---|---|---|---|---|---|---|
| ResNet-50v1.5 | 2048 | 78,875 images/sec | 119 images/sec/watt | 25.97 | 1x GH200 | NVIDIA P3880 | 25.08-py3 | INT8 | Synthetic | TensorRT 10.13.2 | NVIDIA GH200 |
| BEVFusion Head | 1 | 2013.77 images/sec | 6 images/sec/watt | 0.5 | 1x GH200 | NVIDIA P3880 | 25.08-py3 | INT8 | Synthetic | TensorRT 10.13.2 | NVIDIA GH200 |
| HF Swin Base | 128 | 2,886 samples/sec | 4 samples/sec/watt | 44.35 | 1x GH200 | NVIDIA P3880 | 25.08-py3 | FP8 | Synthetic | TensorRT 10.13.2 | NVIDIA GH200 |
| HF Swin Large | 128 | 1,733 samples/sec | 3 samples/sec/watt | 73.87 | 1x GH200 | NVIDIA P3880 | 25.08-py3 | FP8 | Synthetic | TensorRT 10.13.2 | NVIDIA GH200 |
| HF ViT Base | 2048 | 4,710 samples/sec | 7 samples/sec/watt | 434.79 | 1x GH200 | NVIDIA P3880 | 25.08-py3 | FP8 | Synthetic | TensorRT 10.13.2 | NVIDIA GH200 |
| HF ViT Large | 2048 | 1,626 samples/sec | 2 samples/sec/watt | 1259.68 | 1x GH200 | NVIDIA P3880 | 25.08-py3 | FP8 | Synthetic | TensorRT 10.13.2 | NVIDIA GH200 |
| Yolo v10 M | 1 | 433.57 images/sec | 1 images/sec/watt | 2.31 | 1x GH200 | NVIDIA P3880 | 25.08-py3 | FP8 | Synthetic | TensorRT 10.13.2 | NVIDIA GH200 |
| Yolo v11 M | 1 | 504.17 images/sec | 1 images/sec/watt | 1.98 | 1x GH200 | NVIDIA P3880 | 25.08-py3 | FP8 | Synthetic | TensorRT 10.13.2 | NVIDIA GH200 |
H100 Inference Performance
| Network | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Stable Diffusion v2.1 (512x512) | 1 | 3.83 images/sec | - | 340.56 | 1x H100 | DGX H100 | 25.08-py3 | FP8 | Synthetic | TensorRT 10.13.2 | H100 SXM5-80GB |
| Stable Diffusion XL | 1 | 1.6 images/sec | - | 774.71 | 1x H100 | DGX H100 | 25.08-py3 | FP8 | Synthetic | TensorRT 10.13.2 | H100 SXM5-80GB |
| ResNet-50v1.5 | 2048 | 75,476 images/sec | 110 images/sec/watt | 27.13 | 1x H100 | DGX H100 | 25.08-py3 | INT8 | Synthetic | TensorRT 10.13.2 | H100 SXM5-80GB |
| BEVFusion Head | 1 | 1998.95 images/sec | 6 images/sec/watt | 0.5 | 1x H100 | DGX H100 | 25.08-py3 | INT8 | Synthetic | TensorRT 10.13.2 | H100 SXM5-80GB |
| Flux Image Generator | 1 | 0.21 images/sec | - | 4747.1 | 1x H100 | DGX H100 | 25.08-py3 | FP8 | Synthetic | TensorRT 10.13.2 | H100 SXM5-80GB |
| HF Swin Base | 128 | 2,852 samples/sec | 4 samples/sec/watt | 44.88 | 1x H100 | DGX H100 | 25.08-py3 | FP8 | Synthetic | TensorRT 10.13.2 | H100 SXM5-80GB |
| HF Swin Large | 128 | 1,792 samples/sec | 3 samples/sec/watt | 71.44 | 1x H100 | DGX H100 | 25.08-py3 | FP8 | Synthetic | TensorRT 10.13.2 | H100 SXM5-80GB |
| HF ViT Base | 2048 | 4,988 samples/sec | 7 samples/sec/watt | 410.58 | 1x H100 | DGX H100 | 25.08-py3 | FP8 | Synthetic | TensorRT 10.13.2 | H100 SXM5-80GB |
| HF ViT Large | 2048 | 5,418 samples/sec | 8 samples/sec/watt | 377.97 | 1x H100 | DGX H100 | 25.08-py3 | FP8 | Synthetic | TensorRT 10.13.2 | H100 SXM5-80GB |
| Yolo v10 M | 1 | 407.43 images/sec | 1 images/sec/watt | 2.45 | 1x H100 | DGX H100 | 25.08-py3 | FP8 | Synthetic | TensorRT 10.13.2 | H100 SXM5-80GB |
| Yolo v11 M | 1 | 476 images/sec | 1 images/sec/watt | 2.1 | 1x H100 | DGX H100 | 25.08-py3 | FP8 | Synthetic | TensorRT 10.13.2 | H100 SXM5-80GB |
L40S Inference Performance
| Network | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Stable Diffusion v2.1 (512x512) | 1 | 1.65 images/sec | - | 607.21 | 1x L40S | Supermicro SYS-521GE-TNRT | 25.08-py3 | FP8 | Synthetic | TensorRT 10.13.2 | NVIDIA L40S |
| Stable Diffusion XL | 1 | 0.6 images/sec | - | 1676.69 | 1x L40S | Supermicro SYS-521GE-TNRT | 25.08-py3 | FP8 | Synthetic | TensorRT 10.13.2 | NVIDIA L40S |
| ResNet-50v1.5 | 2048 | 23,555 images/sec | 68 images/sec/watt | 86.94 | 1x L40S | Supermicro SYS-521GE-TNRT | 25.08-py3 | INT8 | Synthetic | TensorRT 10.13.2 | NVIDIA L40S |
| BEVFusion Head | 1 | 1944.21 images/sec | 7 images/sec/watt | 0.51 | 1x L40S | Supermicro SYS-521GE-TNRT | 25.08-py3 | INT8 | Synthetic | TensorRT 10.13.2 | NVIDIA L40S |
| HF Swin Base | 32 | 1,376 samples/sec | 4 samples/sec/watt | 23.26 | 1x L40S | Supermicro SYS-521GE-TNRT | 25.08-py3 | FP8 | Synthetic | TensorRT 10.13.2 | NVIDIA L40S |
| HF Swin Large | 32 | 705 samples/sec | 2 samples/sec/watt | 45.42 | 1x L40S | Supermicro SYS-521GE-TNRT | 25.08-py3 | FP8 | Synthetic | TensorRT 10.13.2 | NVIDIA L40S |
| HF ViT Base | 1024 | 1,655 samples/sec | 5 samples/sec/watt | 618.88 | 1x L40S | Supermicro SYS-521GE-TNRT | 25.08-py3 | FP8 | Synthetic | TensorRT 10.13.2 | NVIDIA L40S |
| HF ViT Large | 2048 | 570 samples/sec | 2 samples/sec/watt | 3591.09 | 1x L40S | Supermicro SYS-521GE-TNRT | 25.08-py3 | FP8 | Synthetic | TensorRT 10.13.2 | NVIDIA L40S |
| Yolo v10 M | 1 | 273.25 samples/sec | 1 samples/sec/watt | 3.66 | 1x L40S | Supermicro SYS-521GE-TNRT | 25.08-py3 | INT8 | Synthetic | TensorRT 10.13.2 | NVIDIA L40S |
| Yolo v11 M | 1 | 308 images/sec | 1 images/sec/watt | 3.25 | 1x L40S | Supermicro SYS-521GE-TNRT | 25.08-py3 | INT8 | Synthetic | TensorRT 10.13.2 | NVIDIA L40S |
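Since the Efficiency column is throughput per watt, dividing throughput by efficiency yields a rough implied board power during the run. This is only a ballpark sketch, because the published efficiency figures are rounded to whole units; the values below are copied from the ResNet-50v1.5 rows of the tables above:

```python
# Rough implied power draw: watts ~= throughput / (throughput per watt).
# ResNet-50v1.5 rows copied from the tables above; efficiency is rounded
# to integers in the source, so treat the results as approximate.
resnet_rows = {
    "B200": (118_265, 121),
    "H200": (81_317, 117),
    "H100": (75_476, 110),
    "L40S": (23_555, 68),
}

for gpu, (imgs_per_sec, imgs_per_sec_per_watt) in resnet_rows.items():
    watts = imgs_per_sec / imgs_per_sec_per_watt
    print(f"{gpu}: ~{watts:.0f} W implied during ResNet-50 inference")
```

The implied figures land near each board's published power envelope (roughly 350 W for L40S, around 700 W for H100/H200 SXM, higher for B200), which is a useful plausibility check on the efficiency column.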