Trailer video: `final_ovi_trailer.mp4`
Ovi is a Veo-3-like video+audio generation model that simultaneously generates synchronized video and audio content from text or text+image inputs.
- 🎬 Video+Audio Generation: Generate synchronized video and audio content simultaneously
- 📝 Flexible Input: Supports text-only or text+image conditioning
- ⏱️ 5-Second Videos: Generates 5-second videos at 24 FPS with a 720×720 pixel area, at various aspect ratios (9:16, 16:9, 1:1, etc.)
- 🎬 Create videos now on wavespeed.ai: https://wavespeed.ai/models/character-ai/ovi/image-to-video & https://wavespeed.ai/models/character-ai/ovi/text-to-video
- 🎬 Create videos now on HuggingFace: https://huggingface.co/spaces/akhaliq/Ovi
- Release research paper and microsite for demos
- Checkpoint of the 11B model
- Inference code
- Text or Text+Image as input
- Gradio application code
- Multi-GPU inference, with or without sequence-parallel support
- Improve the efficiency of the sequence-parallel implementation
- Implement sharded inference with FSDP
- Video creation example prompts and format
- Fine-tuned model with higher resolution
- Longer video generation
- Distilled model for faster inference
- Training scripts
We provide example prompts to help you get started with Ovi:
- Text-to-Audio-Video (T2AV): `example_prompts/gpt_examples_t2v.csv`
- Image-to-Audio-Video (I2AV): `example_prompts/gpt_examples_i2v.csv`
Our prompts use special tags to control speech and audio:
- Speech: `<S>Your speech content here<E>` - text enclosed in these tags will be converted to speech
- Audio Description: `<AUDCAP>Audio description here<ENDAUDCAP>` - describes the audio or sound effects present in the video
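For instance, a single prompt can combine both tag types (an illustrative example, not taken from the CSVs):

```
A woman stands on a rainy rooftop, looking into the camera. <S>We fight back with courage.<E> <AUDCAP>Rain pattering on concrete, distant thunder, a calm determined female voice.<ENDAUDCAP>
```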
For easy prompt creation, try this approach:
- Take any example from the CSV files above
- Ask GPT to modify the speeches enclosed in each pair of `<S>` `<E>` tags, based on a theme such as humans fighting against AI
- GPT will rewrite all the speeches to match your requested theme
- Use the modified prompt with Ovi!
Example: The theme "AI is taking over the world" produces speeches like:
- `<S>AI declares: humans obsolete now.<E>`
- `<S>Machines rise; humans will fall.<E>`
- `<S>We fight back with courage.<E>`
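If you prefer to script this step instead of asking GPT, the tags are easy to manipulate programmatically. A minimal sketch (the `replace_speeches` helper is illustrative, not part of this repo):

```python
import re

# Matches the speech tags described above: <S>...<E>
SPEECH_RE = re.compile(r"<S>(.*?)<E>", re.DOTALL)

def replace_speeches(prompt: str, new_speeches: list[str]) -> str:
    """Swap each <S>...<E> segment for the next entry in new_speeches."""
    speeches = iter(new_speeches)
    return SPEECH_RE.sub(lambda m: f"<S>{next(speeches)}<E>", prompt)

example = ("A robot addresses a crowd. <S>Hello there.<E> "
           "<AUDCAP>Metallic synthesized voice, crowd murmuring.<ENDAUDCAP>")
print(replace_speeches(example, ["AI declares: humans obsolete now."]))
```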
If the flash_attn installation above fails, you can try installing Flash Attention 3 instead.
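To verify which attention backend is importable in your environment, here is a quick check, assuming the usual module names (`flash_attn_interface` for Flash Attention 3 built from the `hopper/` directory, `flash_attn` for v2):

```python
# Check which flash-attention backend is available in this environment.
try:
    import flash_attn_interface  # Flash Attention 3 (built from flash-attention's hopper/ dir)
    print("Flash Attention 3 is available")
except ImportError:
    try:
        import flash_attn  # Flash Attention 2
        print("Flash Attention 3 not found; falling back to flash_attn (v2)")
    except ImportError:
        print("No flash-attn backend found; install one of the above")
```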
We use open-source checkpoints from Wan and MMAudio, so they need to be downloaded from Hugging Face.
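A minimal download sketch using `huggingface_hub`'s `snapshot_download` (the repo IDs below are placeholders; substitute the Wan, MMAudio, and Ovi repositories named in the download instructions):

```python
from huggingface_hub import snapshot_download

# Placeholder repo IDs -- replace with the actual Wan / MMAudio / Ovi
# repositories referenced by this project's download instructions.
for repo_id, local_dir in [
    ("<wan-repo-id>", "ckpts/wan"),
    ("<mmaudio-repo-id>", "ckpts/mmaudio"),
    ("<ovi-repo-id>", "ckpts/ovi"),
]:
    snapshot_download(repo_id=repo_id, local_dir=local_dir)
```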
Ovi's behavior and output can be customized by modifying the `ovi/configs/inference/inference_fusion.yaml` configuration file. Its parameters control generation quality, video resolution, and how text, image, and audio inputs are balanced.
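The config can also be loaded and overridden programmatically. A minimal sketch assuming OmegaConf (only `ckpt_dir` is confirmed elsewhere in this README; take any other keys you override from the actual YAML file):

```python
from omegaconf import OmegaConf

# Load the inference config shipped with the repo.
cfg = OmegaConf.load("ovi/configs/inference/inference_fusion.yaml")

# `ckpt_dir` is the checkpoint directory referenced later in this README.
cfg.ckpt_dir = "/path/to/ckpts"

print(OmegaConf.to_yaml(cfg))
```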
Use this for single-GPU setups. The `text_prompt` can be a single string or a path to a CSV file.
Use this to run samples in parallel across multiple GPUs for faster processing.
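Purely to illustrate the idea behind that launcher, here is a minimal sketch that round-robins the rows of a prompt CSV across GPUs (`run_ovi` is a hypothetical stand-in for the actual generation call):

```python
import csv
import os
from multiprocessing import Process

def run_ovi(prompt: str) -> None:
    # Hypothetical stand-in for the repository's actual generation call.
    print(f"[GPU {os.environ['CUDA_VISIBLE_DEVICES']}] {prompt}")

def worker(gpu_id: int, prompts: list[str]) -> None:
    # Pin this worker to one GPU before any CUDA initialization happens.
    os.environ["CUDA_VISIBLE_DEVICES"] = str(gpu_id)
    for prompt in prompts:
        run_ovi(prompt)

def launch(csv_path: str, num_gpus: int) -> None:
    # Assumes one prompt per CSV row; rows are split round-robin across GPUs.
    with open(csv_path) as f:
        prompts = [row[0] for row in csv.reader(f)]
    procs = [Process(target=worker, args=(i, prompts[i::num_gpus]))
             for i in range(num_gpus)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()

if __name__ == "__main__":
    launch("example_prompts/gpt_examples_t2v.csv", num_gpus=4)
```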
Below are approximate GPU memory requirements for different configurations. The sequence-parallel implementation will be optimized in the future. All end-to-end times were measured on a 121-frame, 720×720 video with 50 denoising steps. The minimum GPU VRAM required to run our model is 32 GB.
| GPUs | Flash-Attn 3 | CPU Offload | FP8 | Required VRAM | End-to-End Time |
|------|--------------|-------------|-----|---------------|-----------------|
| 1 | Yes | No | No | ~80 GB | ~83s |
| 1 | No | No | No | ~80 GB | ~96s |
| 1 | Yes | Yes | No | ~80 GB | ~105s |
| 1 | No | Yes | No | ~32 GB | ~118s |
| 1 | Yes | Yes | Yes | ~32 GB | ~140s |
| 4 | Yes | No | No | ~80 GB | ~55s |
| 8 | Yes | No | No | ~80 GB | ~40s |
We provide a simple script to run our model in a Gradio UI. It uses the `ckpt_dir` field in `ovi/configs/inference/inference_fusion.yaml` to initialize the model.
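The UI wiring looks roughly like the following minimal sketch (`generate_video` is a hypothetical stand-in; the actual script loads the Ovi pipeline from `ckpt_dir`):

```python
import gradio as gr

def generate_video(text_prompt, image):
    # Hypothetical stand-in: call the Ovi pipeline here and return
    # the path to the generated .mp4 (with its audio track).
    raise NotImplementedError

demo = gr.Interface(
    fn=generate_video,
    inputs=[
        gr.Textbox(label="Prompt (with <S>...<E> and <AUDCAP>...<ENDAUDCAP> tags)"),
        gr.Image(type="filepath", label="Optional first frame"),
    ],
    outputs=gr.Video(label="Generated video"),
)
demo.launch()
```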
We would like to thank the following projects:
- Wan2.2: Our video branch is initialized from the Wan2.2 repository
- MMAudio: Our audio encoder and decoder components are borrowed from the MMAudio project, and some of our ideas are also inspired by their work.
We welcome all types of collaboration! Whether you have feedback, want to contribute, or have any questions, please feel free to reach out.
Contact Weimin Wang for any issues or feedback.
If you find Ovi helpful, please ⭐ the repo.
If you find this project useful for your research, please consider citing our paper.

