Ovi: Open-source video and audio generator model



Ovi is a Veo 3-like video+audio generation model that generates synchronized video and audio content from text or text+image inputs.

🎬 Video+Audio Generation: Generate synchronized video and audio content simultaneously
📝 Flexible Input: Supports text-only or text+image conditioning
⏱️ 5-second Videos: Generates 5-second videos at 24 FPS with a total pixel area of about 720×720, at various aspect ratios (9:16, 16:9, 1:1, etc.)
🎬 Create videos now on wavespeed.ai: https://wavespeed.ai/models/character-ai/ovi/image-to-video & https://wavespeed.ai/models/character-ai/ovi/text-to-video
🎬 Create videos now on HuggingFace: https://huggingface.co/spaces/akhaliq/Ovi
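
To make the "720×720 area at various aspect ratios" point concrete, here is a minimal sketch that derives a height and width for a given aspect ratio while keeping the pixel count close to 720×720. The helper name `frame_size` and the rounding to multiples of 16 are assumptions for illustration, not documented Ovi behavior.

```python
import math

TARGET_AREA = 720 * 720  # pixel budget stated in the feature list


def frame_size(aspect_w: int, aspect_h: int, multiple: int = 16) -> tuple[int, int]:
    """Return (height, width) with roughly TARGET_AREA pixels.

    `multiple` is an assumed alignment constraint, not a documented Ovi requirement.
    """
    ratio = aspect_w / aspect_h
    height = math.sqrt(TARGET_AREA / ratio)
    width = height * ratio

    def snap(value: float) -> int:
        # Round to the nearest multiple, never below one multiple.
        return max(multiple, round(value / multiple) * multiple)

    return snap(height), snap(width)


for w, h in [(1, 1), (16, 9), (9, 16)]:
    print(f"{w}:{h} ->", frame_size(w, h))
```

The model may use a different rounding rule, so treat the printed sizes as approximate.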

Release research paper and microsite for demos
Checkpoint of 11B model
Inference code
- Text or Text+Image as input
- Gradio application code
- Multi-GPU inference with or without the support of sequence parallel
- Improve efficiency of Sequence Parallel implementation
- Implement Sharded inference with FSDP
Video creation example prompts and format
Finetuned model with higher resolution
Longer video generation
Distilled model for faster inference
Training scripts

We provide example prompts to help you get started with Ovi:

Text-to-Audio-Video (T2AV): example_prompts/gpt_examples_t2v.csv
Image-to-Audio-Video (I2AV): example_prompts/gpt_examples_i2v.csv
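
If you want to inspect or filter prompts before a run, a prompt file can be read with the standard `csv` module. The sketch below uses an in-memory stand-in for a prompt file; the column name `text_prompt` and the helper `load_prompts` are assumptions, so check the header of the actual CSV before relying on them.

```python
import csv
import io

# Stand-in for a prompt file; the column name "text_prompt" is an assumption.
SAMPLE_CSV = '''text_prompt
"A robot waves. <S>Hello there.<E> <AUDCAP>Soft mechanical whir.<ENDAUDCAP>"
"A dog barks at the sea. <AUDCAP>Waves and barking.<ENDAUDCAP>"
'''


def load_prompts(handle, column="text_prompt"):
    """Return the non-empty values of `column` from an open CSV file handle."""
    reader = csv.DictReader(handle)
    if column not in (reader.fieldnames or []):
        raise KeyError(f"column {column!r} not found; available: {reader.fieldnames}")
    return [row[column] for row in reader if row[column].strip()]


prompts = load_prompts(io.StringIO(SAMPLE_CSV))
print(len(prompts), "prompts loaded")
```

For a real file, pass `open(path, newline="", encoding="utf-8")` instead of the `io.StringIO` stand-in.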

Our prompts use special tags to control speech and audio:

Speech: <S>Your speech content here<E> - Text enclosed in these tags will be converted to speech
Audio Description: <AUDCAP>Audio description here<ENDAUDCAP> - Describes the audio or sound effects present in the video
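
As a quick way to check that a prompt uses the tags correctly, the sketch below extracts the speech and audio-description spans with regular expressions. `parse_prompt` is a hypothetical helper written for this illustration and is not part of the Ovi codebase.

```python
import re

SPEECH_RE = re.compile(r"<S>(.*?)<E>", re.DOTALL)
AUDCAP_RE = re.compile(r"<AUDCAP>(.*?)<ENDAUDCAP>", re.DOTALL)


def parse_prompt(prompt: str) -> dict:
    """Split a prompt into its tagged speech and audio-description parts."""
    return {
        "speech": [s.strip() for s in SPEECH_RE.findall(prompt)],
        "audio_caption": [a.strip() for a in AUDCAP_RE.findall(prompt)],
    }


example = (
    "A man turns to the camera. <S>We fight back with courage.<E> "
    "<AUDCAP>Distant sirens, wind.<ENDAUDCAP>"
)
print(parse_prompt(example))
```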

For easy prompt creation, try this approach:

Take any example from the CSV files above
Ask GPT to modify the speeches enclosed between each pair of <S> and <E> tags, based on a theme such as "humans fighting against AI"
GPT will rewrite all the speeches to match your requested theme.
Use the modified prompt with Ovi!

Example: The theme "AI is taking over the world" produces speeches like:

<S>AI declares: humans obsolete now.<E>
<S>Machines rise; humans will fall.<E>
<S>We fight back with courage.<E>
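
Once you have the rewritten speeches, they can be substituted back into the original prompt programmatically. This is a minimal sketch; `replace_speeches` is a hypothetical helper, and it assumes the number of new lines matches the number of `<S>…<E>` pairs in the prompt.

```python
import re

SPEECH_RE = re.compile(r"<S>.*?<E>", re.DOTALL)


def replace_speeches(prompt: str, new_lines: list[str]) -> str:
    """Replace each <S>...<E> span, in order, with the matching new line."""
    found = SPEECH_RE.findall(prompt)
    if len(found) != len(new_lines):
        raise ValueError(f"prompt has {len(found)} speeches, got {len(new_lines)} replacements")
    lines = iter(new_lines)
    return SPEECH_RE.sub(lambda _match: f"<S>{next(lines)}<E>", prompt)


original = "Two people argue. <S>Old line one.<E> Then she replies. <S>Old line two.<E>"
themed = ["AI declares: humans obsolete now.", "We fight back with courage."]
print(replace_speeches(original, themed))
```

Everything outside the speech tags, including any `<AUDCAP>` description, is left untouched.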

Step-by-Step Installation

    # Clone the repository
    git clone https://github.com/character-ai/Ovi.git
    cd Ovi

    # Create and activate virtual environment
    virtualenv ovi-env
    source ovi-env/bin/activate

    # Install PyTorch first
    pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1

    # Install other dependencies
    pip install -r requirements.txt

    # Install Flash Attention
    pip install flash_attn --no-build-isolation
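
After installation, it can help to confirm that the environment matches the pinned PyTorch versions. The sketch below uses only the standard library, so it runs even if a package is missing; `check_pins` is a hypothetical helper, and ignoring local build suffixes such as `+cu121` is an assumption about how you want to compare versions.

```python
from importlib import metadata

# Versions pinned in the installation steps above.
PINNED = {"torch": "2.5.1", "torchvision": "0.20.1", "torchaudio": "2.5.1"}


def check_pins(pinned=PINNED):
    """Map each package to (installed_version_or_None, wanted_version, matches)."""
    report = {}
    for name, wanted in pinned.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None
        # Local build tags such as "+cu121" are ignored when comparing.
        ok = installed is not None and installed.split("+")[0] == wanted
        report[name] = (installed, wanted, ok)
    return report


for name, (installed, wanted, ok) in check_pins().items():
    print(f"{name}: installed={installed} wanted={wanted} ok={ok}")
```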

Alternative Flash Attention Installation (Optional)

If the above flash_attn installation fails, you can try the Flash Attention 3 method:

    git clone https://github.com/Dao-AILab/flash-attention.git
    cd flash-attention/hopper
    python setup.py install
    cd ../..  # Return to Ovi directory

We use open-source checkpoints from Wan and MMAudio, so they need to be downloaded from Hugging Face:

    # By default, weights are downloaded to ./ckpts, and the inference yaml already points to ./ckpts, so no change is required
    python3 download_weights.py

    # OR

    # Optionally, specify --output-dir to download to a specific directory,
    # but if a custom directory is used, the inference yaml has to be updated with the custom directory
    python3 download_weights.py --output-dir <custom_dir>
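
If you download to a custom directory, the `ckpt_dir` entry in the inference yaml must point to it. One way to update that entry without disturbing the rest of the file is a targeted text substitution, sketched below. `set_ckpt_dir` is a hypothetical helper, and it assumes `ckpt_dir` appears exactly once as a top-level key.

```python
import re

CKPT_LINE = re.compile(r'^(ckpt_dir:\s*)("[^"]*"|\S+)', re.MULTILINE)


def set_ckpt_dir(yaml_text: str, new_dir: str) -> str:
    """Return yaml_text with the ckpt_dir value replaced, keeping comments intact."""
    new_text, count = CKPT_LINE.subn(lambda m: f'{m.group(1)}"{new_dir}"', yaml_text)
    if count != 1:
        raise ValueError(f"expected exactly one ckpt_dir entry, found {count}")
    return new_text


sample = 'output_dir: "./outputs"\nckpt_dir: "./ckpts"  # Path to model checkpoints\n'
print(set_ckpt_dir(sample, "/data/ovi_ckpts"))
```

To apply it to the real file, read `ovi/configs/inference/inference_fusion.yaml` as text, pass it through the function, and write the result back.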

Ovi's behavior and output can be customized by modifying the ovi/configs/inference/inference_fusion.yaml configuration file. The following parameters control generation quality, video resolution, and how text, image, and audio inputs are balanced:

    # Output and Model Configuration
    output_dir: "/path/to/save/your/videos"  # Directory to save generated videos
    ckpt_dir: "/path/to/your/ckpts/dir"  # Path to model checkpoints

    # Generation Quality Settings
    num_steps: 50  # Number of denoising steps. Lower (30-40) = faster generation
    solver_name: "unipc"  # Sampling algorithm for denoising process
    shift: 5.0  # Timestep shift factor for sampling scheduler
    seed: 100  # Random seed for reproducible results

    # Guidance Strength Control
    audio_guidance_scale: 3.0  # Strength of audio conditioning. Higher = better audio-text sync
    video_guidance_scale: 4.0  # Strength of video conditioning. Higher = better video-text adherence
    slg_layer: 11  # Layer for applying SLG (Skip Layer Guidance) technique - feel free to try different layers!

    # Multi-GPU and Performance
    sp_size: 1  # Sequence parallelism size. Set equal to number of GPUs used
    cpu_offload: False  # CPU offload, will largely reduce peak GPU VRAM but increase end to end runtime by ~20 seconds

    # Input Configuration
    text_prompt: "/path/to/csv" or "your prompt here"  # Text prompt OR path to CSV/TSV file with prompts
    mode: ['i2v', 't2v', 't2i2v']  # Generate t2v, i2v or t2i2v; if t2i2v, it will use flux krea to generate starting image and then will follow with i2v
    video_frame_height_width: [512, 992]  # Video dimensions [height, width] for T2V mode only
    each_example_n_times: 1  # Number of times to generate each prompt

    # Quality Control (Negative Prompts)
    video_negative_prompt: "jitter, bad hands, blur, distortion"  # Artifacts to avoid in video
    audio_negative_prompt: "robotic, muffled, echo, distorted"  # Artifacts to avoid in audio
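
Before launching a long run, a few sanity checks on the configuration values can catch simple mistakes. The sketch below works on a plain dictionary with the key names listed above; `validate_config` and the specific rules it enforces (for example, that guidance scales are non-negative) are assumptions for illustration, not checks performed by Ovi itself.

```python
ALLOWED_MODES = {"t2v", "i2v", "t2i2v"}


def validate_config(cfg: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means no issue was found."""
    problems = []
    mode = cfg.get("mode")
    if mode not in ALLOWED_MODES:
        problems.append(f"mode must be one of {sorted(ALLOWED_MODES)}, got {mode!r}")
    for key in ("num_steps", "sp_size", "each_example_n_times"):
        value = cfg.get(key)
        if not isinstance(value, int) or value < 1:
            problems.append(f"{key} must be a positive integer, got {value!r}")
    for key in ("audio_guidance_scale", "video_guidance_scale"):
        value = cfg.get(key)
        if not isinstance(value, (int, float)) or value < 0:
            problems.append(f"{key} must be a non-negative number, got {value!r}")
    size = cfg.get("video_frame_height_width")
    if mode == "t2v":  # the size setting applies to T2V mode only
        valid = (
            isinstance(size, (list, tuple))
            and len(size) == 2
            and all(isinstance(v, int) and v > 0 for v in size)
        )
        if not valid:
            problems.append(f"video_frame_height_width must be [height, width], got {size!r}")
    return problems


config = {
    "mode": "t2v", "num_steps": 50, "sp_size": 1, "each_example_n_times": 1,
    "audio_guidance_scale": 3.0, "video_guidance_scale": 4.0,
    "video_frame_height_width": [512, 992],
}
print(validate_config(config) or "config looks consistent")
```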

Single GPU (Simple Setup)

python3 inference.py --config-file ovi/configs/inference/inference_fusion.yaml

Use this for single GPU setups. The text_prompt can be a single string or path to a CSV file.

Multi-GPU (Parallel Processing)

torchrun --nnodes 1 --nproc_per_node 8 inference.py --config-file ovi/configs/inference/inference_fusion.yaml

Use this to run samples in parallel across multiple GPUs for faster processing.
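
When launching runs from a script, the two commands above can be assembled from the GPU count. This is a small sketch using only the flags shown in this section; `inference_command` is a hypothetical helper, not part of the repository.

```python
import shlex

CONFIG = "ovi/configs/inference/inference_fusion.yaml"


def inference_command(num_gpus: int = 1, config_file: str = CONFIG) -> list[str]:
    """Build the single-GPU or multi-GPU launch command as an argument list."""
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    tail = ["inference.py", "--config-file", config_file]
    if num_gpus == 1:
        return ["python3", *tail]
    return ["torchrun", "--nnodes", "1", "--nproc_per_node", str(num_gpus), *tail]


print(shlex.join(inference_command(1)))
print(shlex.join(inference_command(8)))
```

The returned list can be passed directly to `subprocess.run` from the repository root.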

Memory & Performance Requirements

Below are approximate GPU memory requirements for different configurations. The sequence parallel implementation will be optimized in the future. All end-to-end times are measured on a 121-frame, 720×720 video using 50 denoising steps. The minimum GPU VRAM required to run the model is 32 GB.

Sequence Parallel Size	FlashAttention-3 Enabled	CPU Offload	With Image Gen Model	Peak VRAM Required	End-to-End Time
1	Yes	No	No	~80 GB	~83s
1	No	No	No	~80 GB	~96s
1	Yes	Yes	No	~80 GB	~105s
1	No	Yes	No	~32 GB	~118s
1	Yes	Yes	Yes	~32 GB	~140s
4	Yes	No	No	~80 GB	~55s
8	Yes	No	No	~80 GB	~40s
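
To pick a configuration for your hardware, the table above can be encoded and queried. The sketch below copies the approximate figures from the table; `fastest_fit` is a hypothetical helper, and it assumes the peak VRAM figure applies per GPU and that a row with sequence parallel size N needs at least N GPUs.

```python
from typing import NamedTuple, Optional


class Row(NamedTuple):
    sp_size: int
    flash_attn3: bool
    cpu_offload: bool
    image_gen: bool
    peak_vram_gb: int  # approximate, from the table
    seconds: int       # approximate end-to-end time, from the table


TABLE = [
    Row(1, True, False, False, 80, 83),
    Row(1, False, False, False, 80, 96),
    Row(1, True, True, False, 80, 105),
    Row(1, False, True, False, 32, 118),
    Row(1, True, True, True, 32, 140),
    Row(4, True, False, False, 80, 55),
    Row(8, True, False, False, 80, 40),
]


def fastest_fit(vram_gb: int, num_gpus: int = 1, flash_attn3: bool = True) -> Optional[Row]:
    """Return the fastest table row that fits the given hardware, or None."""
    candidates = [
        row for row in TABLE
        if row.peak_vram_gb <= vram_gb
        and row.sp_size <= num_gpus
        and (flash_attn3 or not row.flash_attn3)
    ]
    return min(candidates, key=lambda row: row.seconds, default=None)


print(fastest_fit(32))
print(fastest_fit(80, num_gpus=8))
```

Because the table lists only measured configurations, a `None` result means no listed row fits, not that the model cannot run at all.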

We provide a simple script to run the model in a Gradio UI. It uses the ckpt_dir in ovi/configs/inference/inference_fusion.yaml to initialize the model.

    python3 gradio_app.py

    # OR

    # To enable cpu offload to save GPU VRAM, will slow down end to end inference by ~20 seconds
    python3 gradio_app.py --cpu_offload

    # OR

    # To enable an additional image generation model to generate first frames for I2V, cpu_offload is automatically enabled if image generation model is enabled
    python3 gradio_app.py --use_image_gen

We would like to thank the following projects:

Wan2.2: Our video branch is initialized from the Wan2.2 repository
MMAudio: Our audio encoder and decoder components are borrowed from the MMAudio project. Some of our ideas are also inspired by it.

We welcome all types of collaboration! Whether you have feedback, want to contribute, or have any questions, please feel free to reach out.

Contact: Weimin Wang for any issues or feedback.

If Ovi is helpful, please ⭐ the repo.

If you find this project useful for your research, please consider citing our paper.

@misc{low2025ovitwinbackbonecrossmodal,
      title={Ovi: Twin Backbone Cross-Modal Fusion for Audio-Video Generation},
      author={Chetwin Low and Weimin Wang and Calder Katyal},
      year={2025},
      eprint={2510.01284},
      archivePrefix={arXiv},
      primaryClass={cs.MM},
      url={https://arxiv.org/abs/2510.01284},
}
