VibeVoice-Large-Q8: a working 8-bit quantized VibeVoice model

🎯 Why This Model is Different

If you've tried other 8-bit quantized VibeVoice models, you probably got nothing but static noise. This one actually works.

The secret? Selective quantization: I only quantized the language model (the most robust part), while keeping audio-critical components (diffusion head, VAE, connectors) at full precision.
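
If you want to reproduce this recipe yourself, the standard hook in transformers is BitsAndBytesConfig's llm_int8_skip_modules, which excludes named modules from 8-bit conversion. A minimal sketch (the module names and checkpoint path below are placeholders, not the actual VibeVoice layer names):

from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 8-bit for the language model, full precision for everything audio-critical.
# The skip-list names are illustrative placeholders: inspect the checkpoint's
# module tree (print(model)) to find the real ones.
bnb_config = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_skip_modules=[
        "prediction_head",     # diffusion head (placeholder name)
        "acoustic_tokenizer",  # VAE (placeholder name)
        "connectors",          # LM-to-audio connectors (placeholder name)
    ],
)
model = AutoModelForCausalLM.from_pretrained(
    "path/to/original-vibevoice",  # the full-precision source checkpoint
    quantization_config=bnb_config,
    trust_remote_code=True,
)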

Results

  • ✅ Perfect audio, identical to the original model
  • ✅ 11.6 GB instead of 18.7 GB (-38%)
  • ✅ Uses ~12 GB VRAM instead of 20 GB
  • ✅ Works on 12 GB GPUs (RTX 3060, 4070 Ti, etc.)

🚨 The Problem with Other 8-bit Models

Most 8-bit models you'll find online quantize everything aggressively. The result: audio components get quantized → numerical errors propagate → audio = pure noise.


✅ The Solution: Selective Quantization

I only quantized what can be safely quantized without losing quality.

Result: 52% of parameters quantized, 48% at full precision = perfect audio quality.
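
You can verify the split yourself once the model is loaded (see the Transformers example below): bitsandbytes stores 8-bit weights with dtype torch.int8, so a dtype census over model.parameters() recovers the ratio.

import torch

# Count parameters by dtype: 8-bit layers hold int8 weights, the rest bf16/fp32.
quantized = sum(p.numel() for p in model.parameters() if p.dtype == torch.int8)
full = sum(p.numel() for p in model.parameters() if p.dtype != torch.int8)
print(f"quantized: {quantized / (quantized + full):.0%}, "
      f"full precision: {full / (quantized + full):.0%}")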


📊 Quick Comparison

Model                 Size       Audio Quality    Status
Original VibeVoice    18.7 GB    ⭐⭐⭐⭐⭐       Full precision
Other 8-bit models    10.6 GB    💥 Noise         ❌ Don't work
This model            11.6 GB    ⭐⭐⭐⭐⭐       Perfect

+1.0 GB vs other 8-bit models = perfect audio instead of noise. Worth it.


💻 How to Use It

With Transformers

from transformers import AutoModelForCausalLM, AutoProcessor
import torch
import scipy.io.wavfile as wavfile

# Load the quantized model; device_map="auto" places it on the GPU.
model = AutoModelForCausalLM.from_pretrained(
    "FabioSarracino/VibeVoice-Large-Q8",
    device_map="auto",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
)
processor = AutoProcessor.from_pretrained(
    "FabioSarracino/VibeVoice-Large-Q8",
    trust_remote_code=True,
)

# Prepare the text and generate speech.
text = "Hello, this is VibeVoice speaking."
inputs = processor(text, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=None)

# Save the waveform as a 24 kHz WAV file.
audio = output.speech_outputs[0].cpu().numpy()
wavfile.write("output.wav", 24000, audio)
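
Note that scipy writes a float array as a 32-bit float WAV, which some players don't handle. Continuing from the snippet above, a conversion to 16-bit PCM is safer (this assumes the model emits audio in the [-1, 1] range):

import numpy as np

# Convert float audio (assumed in [-1, 1]) to 16-bit PCM for compatibility.
pcm = (np.clip(audio, -1.0, 1.0) * 32767).astype(np.int16)
wavfile.write("output.wav", 24000, pcm)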

With ComfyUI (recommended)

  1. Install the custom node:

    cd ComfyUI/custom_nodes
    git clone https://github.com/Enemyx-net/VibeVoice-ComfyUI
  2. Download this model to ComfyUI/models/vibevoice/ (a CLI one-liner is shown after this list)

  3. Restart ComfyUI and use it normally!
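
For step 2, the Hugging Face CLI (shipped with the huggingface_hub package) can fetch the weights directly. The exact subfolder layout the node expects may differ, so adjust --local-dir if needed:

huggingface-cli download FabioSarracino/VibeVoice-Large-Q8 --local-dir ComfyUI/models/vibevoice/VibeVoice-Large-Q8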


💾 System Requirements

Minimum

  • VRAM: 12 GB
  • RAM: 16 GB
  • GPU: NVIDIA with CUDA (required)
  • Storage: 11 GB

Recommended

  • VRAM: 16+ GB
  • RAM: 32 GB
  • GPU: RTX 3090/4090, A5000 or better

⚠️ Not supported: CPU, Apple Silicon (MPS), AMD GPUs
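
A quick preflight check against these requirements (plain PyTorch, nothing model-specific):

import torch

# The model requires an NVIDIA GPU with CUDA; CPU, MPS, and ROCm won't work.
assert torch.cuda.is_available(), "CUDA GPU required"
props = torch.cuda.get_device_properties(0)
print(f"{props.name}: {props.total_memory / 1e9:.1f} GB VRAM")  # want 12+ GB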


⚠️ Limitations

  1. Requires NVIDIA GPU with CUDA - won't work on CPU or Apple Silicon
  2. Inference only - don't use for fine-tuning
  3. Requires:
    • transformers>=4.51.3
    • bitsandbytes>=0.43.0
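
Both can be installed (or upgraded) in one step:

pip install "transformers>=4.51.3" "bitsandbytes>=0.43.0"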

🆚 When to Use This Model

✅ Use this 8-bit if:

  • You have 12-16 GB VRAM
  • You want maximum quality with reduced size
  • You need a production-ready model
  • You want the best size/quality balance

Use full precision (18.7 GB) if:

  • You have unlimited VRAM (24+ GB)
  • You're doing research requiring absolute precision

Use 4-bit NF4 (~6.6 GB) if:

  • You only have 8-10 GB VRAM
  • You can accept a small quality trade-off

🔧 Troubleshooting

"OutOfMemoryError" during loading

  • Close other GPU applications
  • Use device_map="auto"
  • Reduce batch size to 1
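
If a previous run in the same Python session left allocations behind, clearing the CUDA cache before reloading can also help (standard torch calls):

import gc
import torch

# Drop stale references and return cached blocks to the driver.
gc.collect()
torch.cuda.empty_cache()
free, total = torch.cuda.mem_get_info()
print(f"free VRAM: {free / 1e9:.1f} of {total / 1e9:.1f} GB")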

"BitsAndBytes not found"

pip install bitsandbytes>=0.43.0

Audio sounds distorted

This shouldn't happen! If it does:

  1. Verify you downloaded the correct model
  2. Update transformers: pip install --upgrade transformers
  3. Check CUDA: torch.cuda.is_available() should return True
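
For check 3, a shell one-liner:

python -c "import torch; print(torch.cuda.is_available())"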

📚 Citation

@misc{vibevoice-q8-2025,
  title={VibeVoice-Large-Q8: Selective 8-bit Quantization for Audio Quality},
  author={Fabio Sarracino},
  year={2025},
  url={https://huggingface.co/FabioSarracino/VibeVoice-Large-Q8}
}

Original Model

@misc{vibevoice2024,
  title={VibeVoice: High-Quality Text-to-Speech with Large Language Models},
  author={Microsoft Research},
  year={2024},
  url={https://github.com/microsoft/VibeVoice}
}


📜 License

MIT License.


🤝 Support

If this model helped you, leave a ⭐ on GitHub!

