VibeVoice-Large-Q8: a working 8-bit quantized VibeVoice model

🎯 Why This Model is Different

If you've tried other 8-bit quantized VibeVoice models, you probably got nothing but static noise. This one actually works.

The secret? Selective quantization: I only quantized the language model (the most robust part), while keeping audio-critical components (diffusion head, VAE, connectors) at full precision.
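
If you want to reproduce this recipe yourself, the standard hook in transformers is BitsAndBytesConfig's llm_int8_skip_modules, which excludes named modules from 8-bit conversion. A minimal sketch (the module names and checkpoint path below are placeholders, not the actual VibeVoice layer names):

from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 8-bit for the language model, full precision for everything audio-critical.
# The skip-list names are illustrative placeholders: inspect the checkpoint's
# module tree (print(model)) to find the real ones.
bnb_config = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_skip_modules=[
        "prediction_head",     # diffusion head (placeholder name)
        "acoustic_tokenizer",  # VAE (placeholder name)
        "connectors",          # LM-to-audio connectors (placeholder name)
    ],
)
model = AutoModelForCausalLM.from_pretrained(
    "path/to/original-vibevoice",  # the full-precision source checkpoint
    quantization_config=bnb_config,
    trust_remote_code=True,
)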

Results

  • ✅ Perfect audio, identical to the original model
  • ✅ 11.6 GB instead of 18.7 GB (-38%)
  • ✅ Uses ~12 GB VRAM instead of 20 GB
  • ✅ Works on 12 GB GPUs (RTX 3060, 4070 Ti, etc.)

🚨 The Problem with Other 8-bit Models

Most 8-bit models you'll find online quantize everything aggressively. The result: audio components get quantized → numerical errors propagate → audio = pure noise.


✅ The Solution: Selective Quantization

I only quantized what can be safely quantized without losing quality.

Result: 52% of parameters quantized, 48% at full precision = perfect audio quality.
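
You can verify the split yourself once the model is loaded (see the Transformers example below): bitsandbytes stores 8-bit weights with dtype torch.int8, so a dtype census over model.parameters() recovers the ratio.

import torch

# Count parameters by dtype: 8-bit layers hold int8 weights, the rest bf16/fp32.
quantized = sum(p.numel() for p in model.parameters() if p.dtype == torch.int8)
full = sum(p.numel() for p in model.parameters() if p.dtype != torch.int8)
print(f"quantized: {quantized / (quantized + full):.0%}, "
      f"full precision: {full / (quantized + full):.0%}")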


📊 Quick Comparison

Model                 Size       Audio Quality    Status
Original VibeVoice    18.7 GB    ⭐⭐⭐⭐⭐       Full precision
Other 8-bit models    10.6 GB    💥 Noise         ❌ Don't work
This model            11.6 GB    ⭐⭐⭐⭐⭐       Perfect

+1.0 GB vs other 8-bit models = perfect audio instead of noise. Worth it.


💻 How to Use It

With Transformers

from transformers import AutoModelForCausalLM, AutoProcessor
import torch
import scipy.io.wavfile as wavfile

# Load the quantized model; device_map="auto" places it on the GPU.
model = AutoModelForCausalLM.from_pretrained(
    "FabioSarracino/VibeVoice-Large-Q8",
    device_map="auto",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
)
processor = AutoProcessor.from_pretrained(
    "FabioSarracino/VibeVoice-Large-Q8",
    trust_remote_code=True,
)

# Prepare the text and generate speech.
text = "Hello, this is VibeVoice speaking."
inputs = processor(text, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=None)

# Save the waveform as a 24 kHz WAV file.
audio = output.speech_outputs[0].cpu().numpy()
wavfile.write("output.wav", 24000, audio)
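
Note that scipy writes a float array as a 32-bit float WAV, which some players don't handle. Continuing from the snippet above, a conversion to 16-bit PCM is safer (this assumes the model emits audio in the [-1, 1] range):

import numpy as np

# Convert float audio (assumed in [-1, 1]) to 16-bit PCM for compatibility.
pcm = (np.clip(audio, -1.0, 1.0) * 32767).astype(np.int16)
wavfile.write("output.wav", 24000, pcm)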

With ComfyUI (recommended)

  1. Install the custom node:

    cd ComfyUI/custom_nodes
    git clone https://github.com/Enemyx-net/VibeVoice-ComfyUI
  2. Download this model to ComfyUI/models/vibevoice/ (a CLI one-liner is shown after this list)

  3. Restart ComfyUI and use it normally!
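
For step 2, the Hugging Face CLI (shipped with the huggingface_hub package) can fetch the weights directly. The exact subfolder layout the node expects may differ, so adjust --local-dir if needed:

huggingface-cli download FabioSarracino/VibeVoice-Large-Q8 --local-dir ComfyUI/models/vibevoice/VibeVoice-Large-Q8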


💾 System Requirements

Minimum

  • VRAM: 12 GB
  • RAM: 16 GB
  • GPU: NVIDIA with CUDA (required)
  • Storage: 11 GB

Recommended

  • VRAM: 16+ GB
  • RAM: 32 GB
  • GPU: RTX 3090/4090, A5000 or better

⚠️ Not supported: CPU, Apple Silicon (MPS), AMD GPUs
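
A quick preflight check against these requirements (plain PyTorch, nothing model-specific):

import torch

# The model requires an NVIDIA GPU with CUDA; CPU, MPS, and ROCm won't work.
assert torch.cuda.is_available(), "CUDA GPU required"
props = torch.cuda.get_device_properties(0)
print(f"{props.name}: {props.total_memory / 1e9:.1f} GB VRAM")  # want 12+ GB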


⚠️ Limitations

  1. Requires NVIDIA GPU with CUDA - won't work on CPU or Apple Silicon
  2. Inference only - don't use for fine-tuning
  3. Requires:
    • transformers>=4.51.3
    • bitsandbytes>=0.43.0
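
Both can be installed (or upgraded) in one step:

pip install "transformers>=4.51.3" "bitsandbytes>=0.43.0"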

🆚 When to Use This Model

✅ Use this 8-bit if:

  • You have 12-16 GB VRAM
  • You want maximum quality with reduced size
  • You need a production-ready model
  • You want the best size/quality balance

Use full precision (18.7 GB) if:

  • You have unlimited VRAM (24+ GB)
  • You're doing research requiring absolute precision

Use 4-bit NF4 (~6.6 GB) if:

  • You only have 8-10 GB VRAM
  • You can accept a small quality trade-off

🔧 Troubleshooting

"OutOfMemoryError" during loading

  • Close other GPU applications
  • Use device_map="auto"
  • Reduce batch size to 1
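
If a previous run in the same Python session left allocations behind, clearing the CUDA cache before reloading can also help (standard torch calls):

import gc
import torch

# Drop stale references and return cached blocks to the driver.
gc.collect()
torch.cuda.empty_cache()
free, total = torch.cuda.mem_get_info()
print(f"free VRAM: {free / 1e9:.1f} of {total / 1e9:.1f} GB")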

"BitsAndBytes not found"

pip install bitsandbytes>=0.43.0

Audio sounds distorted

This shouldn't happen! If it does:

  1. Verify you downloaded the correct model
  2. Update transformers: pip install --upgrade transformers
  3. Check CUDA: torch.cuda.is_available() should return True
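
For check 3, a shell one-liner:

python -c "import torch; print(torch.cuda.is_available())"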

📚 Citation

@misc{vibevoice-q8-2025,
  title={VibeVoice-Large-Q8: Selective 8-bit Quantization for Audio Quality},
  author={Fabio Sarracino},
  year={2025},
  url={https://huggingface.co/FabioSarracino/VibeVoice-Large-Q8}
}

Original Model

@misc{vibevoice2024,
  title={VibeVoice: High-Quality Text-to-Speech with Large Language Models},
  author={Microsoft Research},
  year={2024},
  url={https://github.com/microsoft/VibeVoice}
}


📜 License

MIT License.


🤝 Support

If this model helped you, leave a ⭐ on GitHub!

