We design and implement Open-Sora, an initiative dedicated to efficiently producing high-quality video. We hope to make the model, tools and all details accessible to all. By embracing open-source principles, Open-Sora not only democratizes access to advanced video generation techniques, but also offers a streamlined and user-friendly platform that simplifies the complexities of video generation. With Open-Sora, our goal is to foster innovation, creativity, and inclusivity within the field of content creation.
🎬 For a professional AI video-generation product, try Video Ocean — powered by a superior model.
- [2025.03.12] 🔥 We released Open-Sora 2.0 (11B). 🎬 The 11B model achieves on-par performance with 11B HunyuanVideo & 30B Step-Video on 📐VBench & 📊Human Preference. 🛠️ Fully open-source: checkpoints and training code for training with only $200K. [report]
- [2025.02.20] 🔥 We released Open-Sora 1.3 (1B). With the upgraded VAE and Transformer architecture, the quality of our generated videos has been greatly improved 🚀. [checkpoints] [report] [demo]
- [2024.12.23] The development cost of video generation models has been reduced by 50%! Open-source solutions are now available with H200 GPU vouchers. [blog] [code] [vouchers]
- [2024.06.17] We released Open-Sora 1.2, which includes 3D-VAE, rectified flow, and score condition. The video quality is greatly improved. [checkpoints] [report] [arxiv]
- [2024.04.25] 🤗 We released the Gradio demo for Open-Sora on Hugging Face Spaces.
- [2024.04.25] We released Open-Sora 1.1, which supports 2s~15s videos, 144p to 720p resolution, any aspect ratio, and text-to-image, text-to-video, image-to-video, video-to-video, and infinite-time generation. In addition, a full video processing pipeline is released. [checkpoints] [report]
- [2024.03.18] We released Open-Sora 1.0, a fully open-source project for video generation. Open-Sora 1.0 supports a full pipeline of video data preprocessing, training with acceleration, inference, and more. Our model can produce 2s 512x512 videos with only 3 days of training. [checkpoints] [blog] [report]
- [2024.03.04] Open-Sora provides training with 46% cost reduction. [blog]
📍 Since Open-Sora is under active development, we maintain separate branches for different versions. The latest version is on the main branch. Older versions include: v1.0, v1.1, v1.2, v1.3.
Demos are presented in compressed GIF format for convenience. For original quality samples and their corresponding prompts, please visit our Gallery.
Demos for OpenSora 1.3, 1.2, 1.1, and 1.0 are available. Videos are downsampled to .gif for display; click for the original videos. Prompts are trimmed for display; see here for full prompts.
- Tech Report of Open-Sora 2.0
- Step by step to train or finetune your own model
- Step by step to train and evaluate a video autoencoder
- Visit the high-compression video autoencoder
- Reports of previous versions (best viewed in the corresponding branch):
- Open-Sora 1.3: shift-window attention, unified spatial-temporal VAE, etc.
- Open-Sora 1.2, Tech Report: rectified flow, 3D-VAE, score condition, evaluation, etc.
- Open-Sora 1.1: multi-resolution/length/aspect-ratio, image/video conditioning/editing, data preprocessing, etc.
- Open-Sora 1.0: architecture, captioning, etc.
 
Optionally, you can install flash attention 3 for faster speed.
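If you do, here is a minimal sketch of building it from source (FlashAttention 3 currently lives in the `hopper` subdirectory of the upstream flash-attention repository and targets Hopper GPUs):

```bash
# Build FlashAttention 3 from source; requires a Hopper GPU (H100/H800)
# and a recent CUDA toolkit.
git clone https://github.com/Dao-AILab/flash-attention.git
cd flash-attention/hopper
python setup.py install
```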
Our 11B model supports 256px and 768px resolutions. Both T2V and I2V are supported by a single model. 🤗 Huggingface 🤖 ModelScope.
Download from huggingface:
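For example, with `huggingface-cli` (assuming the checkpoint repository linked above is `hpcai-tech/Open-Sora-v2`; adjust the repo id and local directory to your setup):

```bash
# Download all released weights into ./ckpts
# (the repo id is an assumption; see the Hugging Face link above).
pip install "huggingface_hub[cli]"
huggingface-cli download hpcai-tech/Open-Sora-v2 --local-dir ./ckpts
```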
Download from ModelScope:
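A corresponding sketch with the ModelScope CLI, assuming the same repo id is mirrored there:

```bash
# Download the weights from ModelScope into ./ckpts.
pip install modelscope
modelscope download --model hpcai-tech/Open-Sora-v2 --local_dir ./ckpts
```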
Our model is optimized for image-to-video generation, but it can also be used for text-to-video generation. To generate high-quality videos, we build a text-to-image-to-video pipeline with the help of the Flux text-to-image model. For 256x256 resolution:
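A sketch of the invocation (the script path `scripts/diffusion/inference.py` and the config name are assumptions; check the repository for the exact files):

```bash
# Text-to-image-to-video at 256x256 on a single GPU.
torchrun --nproc_per_node 1 --standalone scripts/diffusion/inference.py \
    configs/diffusion/inference/t2i2v_256px.py \
    --save-dir samples --prompt "raining, sea"
```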
For 768x768 resolution:
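A corresponding sketch for 768x768, which benefits from multiple GPUs (the config name is again an assumption):

```bash
# Text-to-image-to-video at 768x768 across 8 GPUs.
torchrun --nproc_per_node 8 --standalone scripts/diffusion/inference.py \
    configs/diffusion/inference/t2i2v_768px.py \
    --save-dir samples --prompt "raining, sea"
```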
You can adjust the generation aspect ratio with --aspect_ratio and the generation length with --num_frames. Candidate values for aspect_ratio include 16:9, 9:16, 1:1, and 2.39:1. Candidate values for num_frames should be of the form 4k+1 and less than 129.
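For example, a portrait 9:16 clip with 97 frames (97 = 4*24 + 1, below 129), built on the assumed 256px command above:

```bash
# Override aspect ratio and clip length.
torchrun --nproc_per_node 1 --standalone scripts/diffusion/inference.py \
    configs/diffusion/inference/t2i2v_256px.py \
    --save-dir samples --prompt "raining, sea" \
    --aspect_ratio 9:16 --num_frames 97
```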
You can also run direct text-to-video by:
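A sketch, assuming a plain text-to-video config exists alongside the t2i2v one:

```bash
# Direct text-to-video without the intermediate Flux image step.
torchrun --nproc_per_node 1 --standalone scripts/diffusion/inference.py \
    configs/diffusion/inference/256px.py \
    --save-dir samples --prompt "raining, sea"
```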
Given a prompt and a reference image, you can generate a video with the following command:
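A sketch of the image-to-video call; the `--cond_type` and `--ref` flag names and the reference image path are assumptions:

```bash
# Condition generation on a reference image for image-to-video.
torchrun --nproc_per_node 1 --standalone scripts/diffusion/inference.py \
    configs/diffusion/inference/256px.py \
    --save-dir samples --prompt "a drone shot of a coastline at sunset" \
    --cond_type i2v_head --ref path/to/reference.png
```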
During training, we inject a motion score into the text prompt. During inference, you can use the following command to generate videos with a specified motion score (the default score is 4):
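For example (the `--motion-score` flag name is an assumption):

```bash
# Request more motion than the default score of 4.
torchrun --nproc_per_node 1 --standalone scripts/diffusion/inference.py \
    configs/diffusion/inference/256px.py \
    --save-dir samples --prompt "raining, sea" --motion-score 7
```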
We also provide a dynamic motion score evaluator. After setting your OpenAI API key, you can use the following command to evaluate the motion score of a video:
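One plausible shape of that command, assuming the dynamic evaluator is selected by passing `dynamic` in place of a fixed motion score (the flag and value are assumptions):

```bash
# Let the GPT-based evaluator assign the motion score; requires an OpenAI key.
export OPENAI_API_KEY="your-api-key"
torchrun --nproc_per_node 1 --standalone scripts/diffusion/inference.py \
    configs/diffusion/inference/256px.py \
    --save-dir samples --prompt "raining, sea" --motion-score dynamic
```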
We use ChatGPT to refine the prompt. You can enable prompt refinement with the following command; it is available for both text-to-video and image-to-video generation.
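A sketch, assuming refinement is toggled by a `--prompt-refine` flag (the flag name is an assumption) and that `OPENAI_API_KEY` is set:

```bash
# Ask ChatGPT to rewrite the prompt before generation.
export OPENAI_API_KEY="your-api-key"
torchrun --nproc_per_node 1 --standalone scripts/diffusion/inference.py \
    configs/diffusion/inference/t2i2v_256px.py \
    --save-dir samples --prompt "raining, sea" --prompt-refine True
```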
To make the results reproducible, you can set the random seed by:
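For example (the seed flag name is an assumption):

```bash
# Fix the sampling seed so repeated runs produce the same video.
torchrun --nproc_per_node 1 --standalone scripts/diffusion/inference.py \
    configs/diffusion/inference/t2i2v_256px.py \
    --save-dir samples --prompt "raining, sea" --seed 42
```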
Use --num-sample k to generate k samples for each prompt.
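For example:

```bash
# Generate 3 different samples for the same prompt.
torchrun --nproc_per_node 1 --standalone scripts/diffusion/inference.py \
    configs/diffusion/inference/t2i2v_256px.py \
    --save-dir samples --prompt "raining, sea" --num-sample 3
```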
We test the computational efficiency of text-to-video generation on H100/H800 GPUs. For 256x256, we use ColossalAI's tensor parallelism with --offload True; for 768x768, we use ColossalAI's sequence parallelism. All runs use 50 sampling steps. Results are reported as $\color{blue}{\text{total time (s)}}/\color{red}{\text{peak GPU memory (GB)}}$:
| Resolution | 1 GPU | 2 GPUs | 4 GPUs | 8 GPUs |
| --- | --- | --- | --- | --- |
| 256x256 | $\color{blue}{60}/\color{red}{52.5}$ | $\color{blue}{40}/\color{red}{44.3}$ | $\color{blue}{34}/\color{red}{44.3}$ | |
| 768x768 | $\color{blue}{1656}/\color{red}{60.3}$ | $\color{blue}{863}/\color{red}{48.3}$ | $\color{blue}{466}/\color{red}{44.3}$ | $\color{blue}{276}/\color{red}{44.3}$ |
On VBench, Open-Sora 2.0 significantly narrows the gap with OpenAI's Sora, reducing it from 4.52% (Open-Sora 1.2) to 0.69%.
Human preference results show our model is on par with HunyuanVideo 11B and Step-Video 30B.
Despite this strong performance, Open-Sora 2.0 remains highly cost-effective, with a training cost of only $200K.
Thanks goes to these wonderful contributors:
If you wish to contribute to this project, please refer to the Contribution Guideline.
Here we only list a few of the projects. For other works and datasets, please refer to our report.
- ColossalAI: A powerful large model parallel acceleration and optimization system.
- DiT: Scalable Diffusion Models with Transformers.
- OpenDiT: An acceleration framework for DiT training. We adopt valuable acceleration strategies from OpenDiT for our training process.
- PixArt: An open-source DiT-based text-to-image model.
- Flux: A powerful text-to-image generation model.
- Latte: An attempt to efficiently train DiT for video.
- HunyuanVideo: An open-source text-to-video model.
- StabilityAI VAE: A powerful image VAE model.
- DC-AE: Deep Compression AutoEncoder for image compression.
- CLIP: A powerful text-image embedding model.
- T5: A powerful text encoder.
- LLaVA: A powerful image captioning model based on Mistral-7B and Yi-34B.
- PLLaVA: A powerful video captioning model.
- MiraData: A large-scale video dataset with long durations and structured captions.