96GB VRAM on a Single GPU – Benchmarking Nvidia's RTX Pro 6000


TL;DR: The RTX PRO 6000 Workstation Edition (96GB) delivers impressive performance for local LLMs, but the real story is its power efficiency curve. At 450W (75% power), most models retain 85-95% performance. Blackwell driver support on Linux needs work. Full benchmarks below.

96GB VRAM is pretty great

Image: RTX PRO 6000 Workstation Edition

I recently got an RTX PRO 6000 Workstation Edition for my workstation (Wattzilla) after experimenting with a multi-GPU setup of five A4000 GPUs and, at one point, even considering multiple China-modded 48GB 4090s.

With 96GB of VRAM, this card offers a clean one-card solution for anyone interested in new model training, finetuning, or just fast inference, eliminating PCIe bottlenecks common in distributed training and the complexity of model parallelism.

Image: Wattzilla

Hardware Overview

  • VRAM: 96GB GDDR7
  • Memory Bandwidth: 1792 GB/sec
  • CUDA Cores: 24,064
  • TDP: 600W (configurable down to 150W)
  • Architecture: Blackwell
  • Interface: PCIe 5.0 x16
  • Power Connector: 12V-2x6

The card is surprisingly dense; it feels like holding a brick. Unlike the consumer 5090 FE, it was delivered in minimal packaging (bubble-wrapped in a generic box from an enterprise reseller).

If you get this card, be sure to triple-check that the 12V-2x6 connector is fully seated on the GPU; there are many stories online about melted 12VHPWR (ATX 3.0) power connectors on the 5090 series GPUs. My 1600W PSU is ATX 3.1 compatible, and I verified the connection on both sides of the cable a couple of times before powering it on.

Image: RTX PRO 6000 in action

The Blackwell Driver Situation

First boot on Ubuntu 22.04 was predictably problematic:

"The NVIDIA GPU at PCI:65:0:0 is not supported by the 570.133.02 NVIDIA driver."

Solution:

Remove old driver and install version 575 or later
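
On Ubuntu, clearing out the old packages typically looks something like this (a sketch; exact package names may differ on your install):

# Purge every installed nvidia-* package, then remove orphaned dependencies
sudo apt-get remove --purge '^nvidia-.*'
sudo apt-get autoremove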

Nvidia’s driver page recommended 570.153.02, but I had to install version 575.51.03 to get Blackwell support; the open (GPL) kernel module flavor, nvidia-open, is the recommended option for now:

sudo apt-get install cuda-toolkit-12-9 nvidia-open cuda-driver

After reboot:

Voilà!

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 575.51.03              Driver Version: 575.51.03      CUDA Version: 12.9     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|=========================================+========================+======================|
|   2  NVIDIA RTX PRO 6000 Blac...    Off |   00000000:41:00.0 Off |                  Off |
| 30%   30C    P8             17W /  600W |      15MiB /  97887MiB |      0%      Default |
+-----------------------------------------+------------------------+----------------------+

I then set up a few recent container images with CUDA 12.9 support to run the benchmarks (example commands below):

  • Nvidia’s NGC Catalog image for PyTorch

  • llama.cpp, built from source with CUDA enabled

Note: Many ML/LLM libraries still lack Blackwell support. Expect compatibility issues.
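
For reference, the setup amounted to roughly the following; the NGC image tag is a placeholder, so pick a release that ships CUDA 12.9:

# Run the NGC PyTorch container with GPU access (tag is a placeholder)
docker run --gpus all -it --rm nvcr.io/nvidia/pytorch:<tag>

# Build llama.cpp from source with CUDA enabled
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j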

Power Scaling Analysis

This card scales remarkably well with lower power limits. I tested with different power limits to understand the performance/watt curve.
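
Setting a limit is a one-liner with nvidia-smi, and the sweep is easy to script; run_resnet50.sh here is just a hypothetical stand-in for the actual training invocation:

# Keep the driver loaded so power limits stick between runs
sudo nvidia-smi -pm 1

# Sweep the power limits tested below (values in watts)
for w in 600 450 300 200 150; do
    sudo nvidia-smi -pl "$w"
    ./run_resnet50.sh "$w"   # hypothetical wrapper around the ResNet-50 training run
done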

ResNet-50 Training Throughput

| Power Limit | Throughput (img/s) | Relative Power | Relative Performance | Efficiency | Perf/Watt |
|-------------|--------------------|----------------|----------------------|------------|-----------|
| 600W*       | 1279.48            | 100%           | 100%                 | 1.00x      | 2.41      |
| 450W        | 1213.15            | 75%            | 94.8%                | 1.26x      | 2.70      |
| 300W        | 964.51             | 50%            | 75.4%                | 1.51x      | 3.22      |
| 200W        | 595.14             | 33%            | 46.5%                | 1.41x      | 2.98      |
| 150W        | 359.36             | 25%            | 28.1%                | 1.12x      | 2.40      |

*The card peaked at 530W during this training run and never hit 600W

Key insight: Peak efficiency occurs at 300W, not at maximum power. For sustained workloads, 450W offers an excellent performance/thermal balance.

LLM Inference Benchmarks

I then tested some popular models at three power limits: 600W, 450W, and 300W.

Unsloth is my go-to for quantized models; I used their dynamically quantized GGUF models. All models were tested with at least a 20k-token context window and a 4k-token prompt.
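
For anyone reproducing these numbers, llama.cpp's bundled llama-bench tool gives comparable prompt-processing and generation figures; something like the following, where the model path is illustrative:

# 4k-token prompt, 256 generated tokens, all layers offloaded to the GPU
./build/bin/llama-bench \
    -m models/Qwen3-14B-Q4_K_XL.gguf \
    -p 4096 \
    -n 256 \
    -ngl 99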

Small Models (<10B parameters)

| Model         | Size  | 600W        | 450W               | 300W              | Notes             |
|---------------|-------|-------------|--------------------|-------------------|-------------------|
| Qwen3-0.6B-Q4 | 0.4GB | 168.1 tok/s | 129.4 tok/s (-23%) | 159.5 tok/s (-5%) | Max draw: 80W     |
| gemma-3-1b-Q4 | 0.8GB | 212.9 tok/s | 166.0 tok/s (-22%) | 227.1 tok/s (+7%) | Memory-bound      |
| Qwen3-4B-Q4   | 2.4GB | 127.7 tok/s | 91.6 tok/s (-28%)  | 86.3 tok/s (-32%) |                   |
| Qwen3-8B-Q4   | 4.8GB | 91.3 tok/s  | 81.4 tok/s (-11%)  | 79.5 tok/s (-13%) | Excellent scaling |

Medium Models (10-35B parameters)

| Model          | Size   | 600W       | 450W              | 300W              | Efficiency @ 450W    |
|----------------|--------|------------|-------------------|-------------------|----------------------|
| Qwen3-14B-Q4   | 8.5GB  | 76.5 tok/s | 65.9 tok/s (-14%) | 54.9 tok/s (-28%) | 86% perf @ 75% power |
| gemma-3-27b-Q4 | 15.6GB | 53.3 tok/s | 47.4 tok/s (-11%) | 43.2 tok/s (-19%) | 89% perf @ 75% power |
| Qwen3-32B-Q4   | 18.6GB | 34.3 tok/s | 31.1 tok/s (-9%)  | 24.9 tok/s (-27%) | 91% perf @ 75% power |
| QwQ-32B-Q4     | 18.7GB | 41.1 tok/s | 35.7 tok/s (-13%) | 28.7 tok/s (-30%) | 87% perf @ 75% power |

These models hit the efficiency sweet spot at 450W.

Large Models (>35B parameters) - Where 96GB Shines

| Model                    | Size   | 600W       | 450W              | 300W              | Notes                                   |
|--------------------------|--------|------------|-------------------|-------------------|-----------------------------------------|
| Qwen3-235B-Q2            | 46.4GB | 32.4 tok/s | 28.9 tok/s (-11%) | 22.3 tok/s (-31%) | Fits in VRAM with usable context window |
| Llama-4-Scout-17B-16E-Q4 | 46.2GB | 64.2 tok/s | 60.9 tok/s (-5%)  | 52.6 tok/s (-18%) | MoE benefits from bandwidth             |

  • Size in the table denotes the actual size of the model file
  • I didn’t include massive models that required layer offloading in the benchmarks

Deepseek R1-0528 Distills

| Model                        | Size   | 600W       | Notes                                  |
|------------------------------|--------|------------|----------------------------------------|
| DeepSeek-R1-0528-Qwen3-8B-Q8 | 10.1GB | ~115 tok/s | 131k context window: 36.48GB VRAM used |

Image/Video Generation

| Model            | Type  | Resolution | Performance       | Notes                       |
|------------------|-------|------------|-------------------|-----------------------------|
| Flux.1-dev       | Image | 1024x1024  | 3.10 it/s         | 50 iterations               |
| Bagel-7B-MoT     | Image | 1024x1024  | 0.98 it/s         | 50 iterations               |
| Wan2.1-VACE-1.3B | Video | 832x480    | 190s for 5s video | BF16, LoRA should be faster |

Image: Generated by Flux.1-dev

These benchmarks are for reference only; exact performance will depend on your system hardware, quantization, libraries used, prompt length, etc.

GPU Noise for Desktop Use

| Load State | Power | Noise Level | Distance | Notes     |
|------------|-------|-------------|----------|-----------|
| Idle       | 18W   | 41dB        | 3ft      | open case |
| Full load  | 596W  | 54dB        | 3ft      |           |
| Full load  | 596W  | 66dB        | 5cm      |           |

There is some coil whine at full load but it doesn’t bother me from 6ft away.

GPU Temperature

| Condition                 | Max Temperature | GPU Fan Speed |
|---------------------------|-----------------|---------------|
| Idle                      | 28°C            | 30%           |
| Open case, sustained load | 66°C            | 100%          |
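
If you want to log the same power, temperature, and fan numbers on your own system, nvidia-smi's query mode is the simplest option:

# Poll power draw, temperature, fan speed, and utilization once per second
nvidia-smi --query-gpu=power.draw,temperature.gpu,fan.speed,utilization.gpu \
    --format=csv -l 1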

Key Insights

  1. Power efficiency is non-linear: Small models (<2B parameters) can sometimes run faster at lower power settings; in any case, they rarely drew more than 80W during my runs.

  2. The 450W sweet spot: For most workloads, running at 450W (75% power) provides 85-95% of maximum performance while cutting power consumption by 25%. This is where I’d recommend most users operate; a sketch for making the cap persistent follows this list.

  3. 96GB enables new workflows: Running a model like Qwen3-235B locally (even quantized) fundamentally changes the development experience. No external API calls, no rate limits, and complete privacy.

  4. Idle efficiency matters: At 17-20W idle (compared to 30W+ for a 5090), this difference adds up if the system runs 24/7. A headless setup further helps here.
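
Following up on point 2: if you settle on a 450W cap, one way to make it persist across reboots is a small systemd oneshot unit. This is just a sketch, and the unit name is my own:

# Create a oneshot unit that re-applies the cap at boot (unit name is illustrative)
sudo tee /etc/systemd/system/gpu-power-limit.service >/dev/null <<'EOF'
[Unit]
Description=Cap the GPU power limit at 450W

[Service]
Type=oneshot
ExecStart=/usr/bin/nvidia-smi -pm 1
ExecStart=/usr/bin/nvidia-smi -pl 450

[Install]
WantedBy=multi-user.target
EOF

# Enable it for future boots and apply it now
sudo systemctl enable --now gpu-power-limit.service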

Practical Considerations

Who would benefit from this:

  • Anyone working with 70B+ models
  • Anyone needing to run multiple 30B models simultaneously
  • Teams requiring on-premises deployment of large models
  • Developers hitting context length limits on smaller cards
  • Anyone working with image and video generative models. I actually ran out of VRAM trying to run the Wan2.1-VACE-14B model!
  • Anyone training small/medium size LLM models (allows higher batch sizes without gradient checkpointing, etc.)

Who should wait or skip this:

  • Anyone primarily using ≤13B models (get a used 3090 or 5090 FE at retail price)
  • Those who want somewhere between 32GB and 48GB of VRAM (wait for the upcoming RTX PRO 5000 (48GB) or RTX PRO 4500 (32GB))
  • Anyone who prefers stable drivers and widespread support (give it 3-6 months)
  • If you’re only interested in running inference on similarly sized models, a Mac Studio M4 Max (128GB) or M3 Ultra (96GB) should work well for typical use cases

Comparison to Multi-GPU Setups

I previously ran 5x A4000s (80GB of VRAM total, 600W combined) and have now moved to an RTX PRO 6000 + 4x A4000 setup.

Single-card advantages:

  • No PCIe bottlenecks: Model parallelism across multiple GPUs is limited by the slowest PCIe link (8-64 GB/s).
  • Simpler deployment: No wrestling with DeepSpeed/FSDP configs, no debugging NCCL errors, no gradient synchronization overhead.
  • Cost efficiency: Counter-intuitively, one 96GB card can be more cost-effective than equivalent multi-GPU setups. A system with three 5090s requires a $1000+ motherboard with proper spacing, a beefier PSU, and additional cooling.
  • Density: One PCIe slot vs. an entire workstation chassis for equivalent VRAM.

Multi-GPU still makes sense for:

  • Scaling beyond 96GB (obviously)
  • Redundancy in production environments
  • Mixed workloads where you can dedicate different models to different GPUs (my case)

The 4-8x used 3090 setups are justifiably popular in the LocalLLM community because of their value for money. I’d recommend this route if you can handle 2400W+ power draw and the associated heat and noise.

Final Verdict

The RTX PRO 6000 is simultaneously overkill and exactly what many have wanted for local model training and inference. This is not a cheap GPU, but for specific use cases like fine-tuning 30B+ models, researching sparse models, or deploying multiple medium-to-large models, it’s currently the most straightforward solution.

For casual use, renting an H100 from a cloud provider (e.g., Runpod) costs about $2.20/hr. For the price of this GPU, you could train on the cloud for roughly six months.

For the right use cases this GPU is currently unmatched. The power efficiency findings were a pleasant surprise, making deployment more practical than expected.

But let’s be honest: most people reading this would likely be better served by a used 3090 or a single 5090 at retail price. The 96GB VRAM is transformative for specific workflows but is overkill for casual experimentation.

Just be prepared to deal with early adopter challenges on the software side. Blackwell support is improving, but it’s not entirely seamless yet.

The real win would be if AMD (e.g., R9700 PRO - 32GB) or Intel (e.g., Arc Pro B60 Dual - 48GB) brought competition to the 48GB+ VRAM space. Until then, Nvidia knows exactly what it can charge for 96GB on a single card.

Bonus photos:

Image: Unboxed

Image: Top view of Wattzilla: 160GB VRAM setup

Image: WattWise view while testing DeepSeek R1 0528 Qwen3 8B


If you have any questions or want me to test additional models, feel free to reach out!
