🎯 Why This Model is Different
If you've tried other 8-bit quantized VibeVoice models, you probably got nothing but static noise. This one actually works.
The secret? Selective quantization: I only quantized the language model (the most robust part), while keeping audio-critical components (diffusion head, VAE, connectors) at full precision.
Results
- ✅ Perfect audio, identical to the original model
- ✅ 11.6 GB instead of 18.7 GB (-38%)
- ✅ Uses ~12 GB VRAM instead of 20 GB
- ✅ Works on 12 GB GPUs (RTX 3060, 4070 Ti, etc.)
🚨 The Problem with Other 8-bit Models
Most 8-bit models you'll find online quantize everything aggressively. The result: audio components get quantized → numerical errors propagate → the output is pure noise.
✅ The Solution: Selective Quantization
I quantized only the components that safely tolerate reduced precision, leaving the audio path untouched.
Result: 52% of parameters quantized, 48% at full precision = perfect audio quality.
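In code terms, this is the skip-modules pattern that `bitsandbytes` exposes. Here is a minimal sketch of how selective 8-bit loading can be expressed with `transformers` + `bitsandbytes` (the skipped module names are illustrative placeholders, not the exact VibeVoice internals):

```python
from transformers import BitsAndBytesConfig

# Quantize only the language model; every module listed here stays at
# full precision. The names below are placeholders for the audio-critical
# parts (diffusion head, VAE, connectors).
quant_config = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_skip_modules=[
        "prediction_head",     # diffusion head (illustrative name)
        "acoustic_tokenizer",  # VAE (illustrative name)
        "connectors",          # connectors (illustrative name)
    ],
)
```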
📊 Quick Comparison
| Model | Size | Audio quality | Notes |
|---|---|---|---|
| Original VibeVoice | 18.7 GB | ⭐⭐⭐⭐⭐ | Full precision |
| Other 8-bit models | 10.6 GB | 💥 NOISE | ❌ Don't work |
| This model | 11.6 GB | ⭐⭐⭐⭐⭐ | ✅ Perfect |
+1.0 GB vs other 8-bit models = perfect audio instead of noise. Worth it.
💻 How to Use It
With Transformers
```python
from transformers import AutoModelForCausalLM, AutoProcessor
import torch
import scipy.io.wavfile as wavfile

# Load the quantized model; device_map="auto" places it on the GPU.
model = AutoModelForCausalLM.from_pretrained(
    "FabioSarracino/VibeVoice-Large-Q8",
    device_map="auto",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
)
processor = AutoProcessor.from_pretrained(
    "FabioSarracino/VibeVoice-Large-Q8",
    trust_remote_code=True,
)

text = "Hello, this is VibeVoice speaking."
inputs = processor(text, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=None)

# VibeVoice generates 24 kHz audio.
audio = output.speech_outputs[0].cpu().numpy()
wavfile.write("output.wav", 24000, audio)
```

With ComfyUI (recommended)
Install the custom node:
```bash
cd ComfyUI/custom_nodes
git clone https://github.com/Enemyx-net/VibeVoice-ComfyUI
```

Download this model to `ComfyUI/models/vibevoice/`
Restart ComfyUI and use it normally!
💾 System Requirements
Minimum
- VRAM: 12 GB
- RAM: 16 GB
- GPU: NVIDIA with CUDA (required)
- Storage: ~12 GB free (the model itself is 11.6 GB)
Recommended
- VRAM: 16+ GB
- RAM: 32 GB
- GPU: RTX 3090/4090, A5000 or better
⚠️ Not supported: CPU, Apple Silicon (MPS), AMD GPUs
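A quick sanity check (not part of the model) to confirm your machine meets these requirements:

```python
import torch

# CUDA is mandatory: CPU, MPS, and AMD GPUs are not supported.
assert torch.cuda.is_available(), "An NVIDIA GPU with CUDA is required."

vram_gb = torch.cuda.get_device_properties(0).total_memory / 1024**3
print(f"GPU: {torch.cuda.get_device_name(0)} | VRAM: {vram_gb:.1f} GB")
if vram_gb < 12:
    print("Warning: less than 12 GB VRAM - loading will likely fail.")
```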
⚠️ Limitations
- Requires NVIDIA GPU with CUDA - won't work on CPU or Apple Silicon
- Inference only - don't use for fine-tuning
- Requires:
- transformers>=4.51.3
- bitsandbytes>=0.43.0
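Both can be installed (or upgraded) in one step:

```bash
pip install "transformers>=4.51.3" "bitsandbytes>=0.43.0"
```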
🆚 When to Use This Model
✅ Use this 8-bit model if:
- You have 12-16 GB VRAM
- You want maximum quality with reduced size
- You need a production-ready model
- You want the best size/quality balance
Use full precision (18.7 GB) if:
- You have unlimited VRAM (24+ GB)
- You're doing research requiring absolute precision
Use 4-bit NF4 (~6.6 GB) if:
- You only have 8-10 GB VRAM
- You can accept a small quality trade-off
🔧 Troubleshooting
"OutOfMemoryError" during loading
- Close other GPU applications
- Use `device_map="auto"`
- Reduce batch size to 1
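If you are right at the 12 GB limit, you can also cap per-GPU memory so that accelerate offloads the overflow to CPU RAM. Offloaded layers run slower, but loading succeeds. A sketch (the `11GiB` / `16GiB` budgets are examples; tune them to your machine):

```python
from transformers import AutoModelForCausalLM
import torch

model = AutoModelForCausalLM.from_pretrained(
    "FabioSarracino/VibeVoice-Large-Q8",
    device_map="auto",
    # Leave headroom on a 12 GB card; anything that doesn't fit goes to CPU RAM.
    max_memory={0: "11GiB", "cpu": "16GiB"},
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
)
```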
"BitsAndBytes not found"
```bash
pip install "bitsandbytes>=0.43.0"
```

Audio sounds distorted
This shouldn't happen! If it does:
- Verify you downloaded the correct model
- Update transformers: `pip install --upgrade transformers`
- Check CUDA: `torch.cuda.is_available()` should return `True`
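As an extra sanity check, you can confirm the model really loaded with mixed precision: with selective quantization you should see 8-bit linear layers (from the LLM) alongside full-precision ones. A sketch, assuming `model` is already loaded as shown above:

```python
import bitsandbytes as bnb
import torch

# Count quantized vs. full-precision linear layers in the loaded model.
n_int8 = sum(isinstance(m, bnb.nn.Linear8bitLt) for m in model.modules())
n_fp = sum(
    isinstance(m, torch.nn.Linear) and not isinstance(m, bnb.nn.Linear8bitLt)
    for m in model.modules()
)
print(f"8-bit linear layers: {n_int8} | full-precision linear layers: {n_fp}")
```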
📚 Citation
```bibtex
@misc{vibevoice-q8-2025,
  title={VibeVoice-Large-Q8: Selective 8-bit Quantization for Audio Quality},
  author={Fabio Sarracino},
  year={2025},
  url={https://huggingface.co/FabioSarracino/VibeVoice-Large-Q8}
}
```

Original Model
```bibtex
@misc{vibevoice2024,
  title={VibeVoice: High-Quality Text-to-Speech with Large Language Models},
  author={Microsoft Research},
  year={2024},
  url={https://github.com/microsoft/VibeVoice}
}
```

🔗 Related Resources
- [Original Model](https://github.com/microsoft/VibeVoice) - Full precision base
- [ComfyUI Node](https://github.com/Enemyx-net/VibeVoice-ComfyUI) - ComfyUI integration
📜 License
MIT License.
🤝 Support
- Issues: GitHub Issues
- Questions: HuggingFace Discussions
If this model helped you, leave a ⭐ on GitHub!