TL;DR: The RTX PRO 6000 Workstation Edition (96GB) delivers impressive performance for local LLMs, but the real story is its power efficiency curve. At 450W (75% power), most models retain 85-95% performance. Blackwell driver support on Linux needs work. Full benchmarks below.
96GB VRAM is pretty great
RTX PRO 6000 Workstation Edition
I recently got an RTX PRO 6000 Workstation Edition for my workstation (Wattzilla), after experimenting with a five-GPU A4000 setup and, at one point, even considering multiple China-modded 48GB 4090s.
With 96GB of VRAM, this card offers a clean one-card solution for anyone interested in training new models, finetuning, or just fast inference, eliminating the PCIe bottlenecks common in distributed training and the complexity of model parallelism.
Hardware Overview
- VRAM: 96GB GDDR7
- Memory Bandwidth: 1792 GB/sec
- CUDA Cores: 24,064
- TDP: 600W (configurable down to 150W)
- Architecture: Blackwell
- Interface: PCIe 5.0 x16
- Power Connector: 12V-2x6
The card is surprisingly dense; it feels like holding a brick. Unlike the consumer 5090 FE, it was delivered in minimal packaging (bubble-wrapped in a generic box from an enterprise reseller).
If you get this card, be sure to triple-check that the 12V-2x6 connector is fully seated on the GPU; there are many stories online about melted 12VHPWR (ATX 3.0) power connectors on the 5090 series GPUs. My 1600W PSU is ATX 3.1 compatible, and I verified the connection on both sides of the cable a couple of times before powering it on.
RTX PRO 6000 in action
The Blackwell Driver Situation
First boot on Ubuntu 22.04 was predictably problematic:
"The NVIDIA GPU at PCI:65:0:0 is not supported by the 570.133.02 NVIDIA driver."Solution:
Remove old driver and install version 575 or later
NVIDIA's driver page recommends 570.153.02, but I had to install version 575.51.03 to get Blackwell support, and the open (GPL) kernel module driver is the recommended option for now:
```bash
sudo apt-get install cuda-toolkit-12-9 nvidia-open cuda-drivers
```
After a reboot, voilà!
```
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 575.51.03              Driver Version: 575.51.03      CUDA Version: 12.9     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|=========================================+========================+======================|
|   2  NVIDIA RTX PRO 6000 Blac...    Off |   00000000:41:00.0 Off |                  Off |
| 30%   30C    P8             17W /  600W |      15MiB /  97887MiB |      0%      Default |
+-----------------------------------------+------------------------+----------------------+
```
I then set up a few of the latest container images with CUDA 12.9 support to run the benchmarks.
Note: Many ML/LLM libraries still lack Blackwell support. Expect compatibility issues.
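A quick sanity check helps before debugging mysterious library failures. This is a rough sketch rather than part of my original setup; the compute-capability query and the PyTorch arch-list check assume a typical CUDA 12.9 environment:

```bash
# Confirm the driver sees the card and reports its compute capability
# (Blackwell parts should report 12.x; the compute_cap field needs a recent driver).
nvidia-smi --query-gpu=name,driver_version,compute_cap --format=csv

# If you use PyTorch, check whether the installed wheel ships Blackwell (sm_120) kernels;
# older builds list only sm_90 and below and will fall back or fail on this GPU.
python3 -c "import torch; print(torch.__version__, torch.cuda.get_arch_list())"
```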
Power Scaling Analysis
This card scales remarkably well when power-limited. I ran the same workloads at several power limits to map the performance-per-watt curve.
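Power limits are normally set with nvidia-smi; a minimal sketch, with GPU index 0 and a 450W target as placeholders:

```bash
# Enable persistence mode so the setting sticks between processes
sudo nvidia-smi -i 0 -pm 1

# Cap board power; this card accepts values from 150W up to 600W
sudo nvidia-smi -i 0 -pl 450

# Confirm the limit and watch actual draw during a run
nvidia-smi -i 0 --query-gpu=power.limit,power.draw --format=csv -l 2
```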
ResNet-50 Training Throughput
| Power Limit | Throughput (img/s) | % of Max Power | % of Max Throughput | Perf/Power Ratio | img/s per Watt |
|---|---|---|---|---|---|
| 600W* | 1279.48 | 100% | 100% | 1.00x | 2.41 |
| 450W | 1213.15 | 75% | 94.8% | 1.26x | 2.70 |
| 300W | 964.51 | 50% | 75.4% | 1.51x | 3.22 |
| 200W | 595.14 | 33% | 46.5% | 1.41x | 2.98 |
| 150W | 359.36 | 25% | 28.1% | 1.12x | 2.40 |
*Card peaked at 530W during this training run, never hit 600W
Key insight: Peak efficiency occurs at 300W, not at maximum power. For sustained workloads, 450W offers an excellent performance/thermal balance.
LLM Inference Benchmarks
I then tested some popular models at three power limits: 600W, 450W, and 300W.
Unsloth is my go-to for quantized models. I used their dynamically quantized GGUF format models. All models were tested with at least a 20k token context window and a 4k token prompt.
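As a rough illustration only (not the exact commands behind these numbers, and assuming a llama.cpp llama-bench runner with a placeholder model path), a sweep over the three power limits could look like this:

```bash
#!/usr/bin/env bash
# Hypothetical benchmark sweep: one GGUF model, three power limits.
MODEL=./models/Qwen3-14B-Q4_K_M.gguf   # placeholder path

for LIMIT in 600 450 300; do
  sudo nvidia-smi -i 0 -pl "$LIMIT"
  echo "=== ${LIMIT}W ==="
  # -ngl 99 offloads all layers to the GPU; -p/-n set prompt and generation token counts
  ./llama-bench -m "$MODEL" -ngl 99 -p 4096 -n 256
done
```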
Small Models (<10B parameters)
| Model | Size | 600W | 450W | 300W | Notes |
|---|---|---|---|---|---|
| Qwen3-0.6B-Q4 | 0.4GB | 168.1 tok/s | 129.4 tok/s (-23%) | 159.5 tok/s (-5%) | Max draw: 80W |
| gemma-3-1b-Q4 | 0.8GB | 212.9 tok/s | 166.0 tok/s (-22%) | 227.1 tok/s (+7%) | Memory-bound |
| Qwen3-4B-Q4 | 2.4GB | 127.7 tok/s | 91.6 tok/s (-28%) | 86.3 tok/s (-32%) | |
| Qwen3-8B-Q4 | 4.8GB | 91.3 tok/s | 81.4 tok/s (-11%) | 79.5 tok/s (-13%) | Excellent scaling |
Medium Models (10-35B parameters)
| Model | Size | 600W | 450W | 300W | Notes |
|---|---|---|---|---|---|
| Qwen3-14B-Q4 | 8.5GB | 76.5 tok/s | 65.9 tok/s (-14%) | 54.9 tok/s (-28%) | 86% perf @ 75% power |
| gemma-3-27b-Q4 | 15.6GB | 53.3 tok/s | 47.4 tok/s (-11%) | 43.2 tok/s (-19%) | 89% perf @ 75% power |
| Qwen3-32B-Q4 | 18.6GB | 34.3 tok/s | 31.1 tok/s (-9%) | 24.9 tok/s (-27%) | 91% perf @ 75% power |
| QwQ-32B-Q4 | 18.7GB | 41.1 tok/s | 35.7 tok/s (-13%) | 28.7 tok/s (-30%) | 87% perf @ 75% power |
These models hit the efficiency sweet spot at 450W.
Large Models (>35B parameters) - Where 96GB Shines
| Model | Size | 600W | 450W | 300W | Notes |
|---|---|---|---|---|---|
| Qwen3-235B-Q2 | 46.4GB | 32.4 tok/s | 28.9 tok/s (-11%) | 22.3 tok/s (-31%) | Fits in VRAM with usable context window |
| Llama-4-Scout-17B-16E-Q4 | 46.2GB | 64.2 tok/s | 60.9 tok/s (-5%) | 52.6 tok/s (-18%) | MoE benefits from bandwidth |
- Size in the table denotes the actual size of the model file
- I didn’t include massive models that required layer offloading in the benchmarks
DeepSeek R1-0528 Distills
| Model | Size | Throughput | Notes |
|---|---|---|---|
| DeepSeek-R1-0528-Qwen3-8B-Q8 | 10.1GB | ~115 tok/s | 131k context window: 36.48GB VRAM used |
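VRAM-use and power-draw figures like the ones in these tables can be captured by polling nvidia-smi while a model is serving; a minimal sketch (the one-second interval and log filename are arbitrary):

```bash
# Log memory use, power draw, and utilization once per second during a run
nvidia-smi --query-gpu=timestamp,memory.used,power.draw,utilization.gpu \
           --format=csv -l 1 | tee gpu_log.csv
```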
Image/Video Generation
| Model | Type | Resolution | Speed | Notes |
|---|---|---|---|---|
| Flux.1-dev | Image | 1024x1024 | 3.10 it/s | 50 iterations |
| Bagel-7B-MoT | Image | 1024x1024 | 0.98 it/s | 50 iterations |
| Wan2.1-VACE-1.3B | Video | 832x480 | 190s for 5s video | BF16, LoRA should be faster |
Generated by Flux.1-dev
These benchmarks are for reference only; exact performance will depend on your system hardware, quantization, libraries used, prompt length, etc.
GPU Noise for Desktop Use
| Load | Power Draw | Noise Level | Distance | Notes |
|---|---|---|---|---|
| Idle | 18W | 41dB | 3ft | open case |
| Full load | 596W | 54dB | 3ft | |
| Full load | 596W | 66dB | 5cm | |
There is some coil whine at full load but it doesn’t bother me from 6ft away.
GPU Temperature
| Condition | Temperature | Fan Speed |
|---|---|---|
| Idle | 28°C | 30% |
| Open case, sustained load | 66°C | 100% |
Key Insights
- Power efficiency is non-linear: Small models (<2B parameters) can sometimes run faster at lower power settings, but overall they don't draw more than 80W during most runs.
- The 450W sweet spot: For most workloads, running at 450W (75% power) provides 85-95% of maximum performance while cutting power consumption by 25%. This is where I'd recommend most users operate.
- 96GB enables new workflows: Running a model like Qwen3-235B locally (even quantized) fundamentally changes the development experience. No external API calls, no rate limits, and complete privacy.
- Idle efficiency matters: At 17-20W idle (compared to 30W+ for a 5090), this difference adds up if the system runs 24/7. A headless setup further helps here.
Practical Considerations
Who would benefit from this:
- Anyone working with 70B+ models
- Anyone needing to run multiple 30B models simultaneously
- Teams requiring on-premises deployment of large models
- Developers hitting context length limits on smaller cards
- Anyone working with image and video generative models. I actually ran out of VRAM trying to run the Wan2.1-VACE-14B model!
- Anyone training small/medium size LLM models (allows higher batch sizes without gradient checkpointing, etc.)
Who should wait or skip this:
- Anyone primarily using ≤13B models (get a used 3090 or 5090 FE at retail price)
- Those who want somewhere between 32GB and 48GB of VRAM (wait for the upcoming RTX PRO 5000 (48GB) or RTX PRO 4500 (32GB))
- Anyone who prefers stable drivers and widespread support (give it 3-6 months)
- If you’re only interested in running inference on similarly sized models, a Mac Studio M4 Max (128GB) or M3 Ultra (96GB) should work well for typical use cases
Comparison to Multi-GPU Setups
I previously ran 5x A4000s (80GB of VRAM total, 600W combined) and have now moved to an RTX PRO 6000 + 4x A4000 setup.
Single-card advantages:
- No PCIe bottlenecks: Model parallelism across multiple GPUs is limited by the slowest PCIe link (8-64 GB/s).
- Simpler deployment: No wrestling with DeepSpeed/FSDP configs, no debugging NCCL errors, no gradient synchronization overhead.
- Cost efficiency: Counter-intuitively, one 96GB card can be more cost-effective than equivalent multi-GPU setups. A system with three 5090s requires a $1000+ motherboard with proper spacing, a beefier PSU, and additional cooling.
- Density: One PCIe slot vs. an entire workstation chassis for equivalent VRAM.
Multi-GPU still makes sense for:
- Scaling beyond 96GB (obviously)
- Redundancy in production environments
- Mixed workloads where you can dedicate different models to different GPUs (my case; see the sketch below)
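For that last case, the simplest approach is to pin each serving process to one GPU with CUDA_VISIBLE_DEVICES. A hypothetical sketch with two llama.cpp servers (model paths, device indices, and ports are placeholders, not my actual setup):

```bash
# Pin one model to the RTX PRO 6000 (device 0) and a smaller one to an A4000 (device 1)
CUDA_VISIBLE_DEVICES=0 ./llama-server -m ./models/Qwen3-32B-Q4_K_M.gguf -ngl 99 --port 8001 &
CUDA_VISIBLE_DEVICES=1 ./llama-server -m ./models/Qwen3-8B-Q4_K_M.gguf  -ngl 99 --port 8002 &
```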
The 4-8x used 3090 setups are justifiably popular in the LocalLLM community because of their value for money. I’d recommend this route if you can handle 2400W+ power draw and the associated heat and noise.
Final Verdict
The RTX PRO 6000 is simultaneously overkill and exactly what many have wanted for local model training and inference. This is not a cheap GPU, but for specific use cases like fine-tuning 30B+ models, researching sparse models, or deploying multiple medium-to-large models, it’s currently the most straightforward solution.
For casual use, renting an H100 from a cloud provider (e.g., Runpod) costs about $2.20/hr, or roughly $1,600 a month running around the clock; for the price of this GPU, you could train on the cloud for roughly six months.
For the right use cases this GPU is currently unmatched. The power efficiency findings were a pleasant surprise, making deployment more practical than expected.
But let’s be honest: most people reading this would likely be better served by a used 3090 or a single 5090 at retail price. The 96GB VRAM is transformative for specific workflows but is overkill for casual experimentation.
Just be prepared to deal with early adopter challenges on the software side. Blackwell support is improving, but it’s not entirely seamless yet.
The real win would be if AMD (e.g., R9700 PRO - 32GB) or Intel (e.g., Arc Pro B60 Dual - 48GB) brought competition to the 48GB+ VRAM space. Until then, Nvidia knows exactly what it can charge for 96GB on a single card.
Bonus photos:
Unboxed
Top view of Wattzilla: 160GB VRAM setup
WattWise view: While testing DeepSeek R1 0528 Qwen3 8B
If you have any questions or want me to test additional models, feel free to reach out!