TL;DR: Feed-forward 3D and 4D scene generation from a single image/video trained with synthetic data generated by a camera-controlled video diffusion model.
Full Abstract: The ability to generate virtual environments is crucial for applications ranging from gaming to physical AI domains such as robotics, autonomous driving, and industrial AI. Current learning-based 3D reconstruction methods rely on captured real-world multi-view data, which is not always readily available. Recent advancements in video diffusion models have shown remarkable imagination capabilities, yet their 2D nature limits their applicability to simulation, where a robot needs to navigate and interact with the environment. In this paper, we propose a self-distillation framework that distills the implicit 3D knowledge in video diffusion models into an explicit 3D Gaussian Splatting (3DGS) representation, eliminating the need for multi-view training data. Specifically, we augment the typical RGB decoder with a 3DGS decoder, which is supervised by the output of the RGB decoder. With this approach, the 3DGS decoder can be trained purely on synthetic data generated by video diffusion models. At inference time, our model can synthesize 3D scenes from either a text prompt or a single image for real-time rendering. Our framework further extends to dynamic 3D scene generation from a monocular input video. Experimental results show that our framework achieves state-of-the-art performance in static and dynamic 3D scene generation.
Sherwin Bahmani, Tianchang Shen, Jiawei Ren, Jiahui Huang, Yifeng Jiang, Haithem Turki, Andrea Tagliasacchi, David B. Lindell, Zan Gojcic, Sanja Fidler, Huan Ling, Jun Gao, Xuanchi Ren
Please follow INSTALL.md to set up your conda environment and download the pre-trained weights.
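For a rough idea of what the setup involves, here is a minimal sketch assuming a standard conda workflow; the environment name, Python version, and requirements file below are placeholders, and INSTALL.md remains the authoritative reference.

```bash
# Illustrative only -- environment name, Python version, and file names are assumptions;
# follow INSTALL.md for the actual setup and weight download.
conda create -n lyra python=3.10 -y
conda activate lyra
pip install -r requirements.txt   # assumed requirements file
```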
Lyra supports both images and videos as input. Below are examples of running Lyra on single images and videos.
First, you need to download the demo samples:
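The snippet below is only a placeholder sketch of fetching the samples into assets/demo/; <demo-samples-source> is not a real identifier, so use the actual download command from the repo.

```bash
# Placeholder sketch: fetch the demo samples into assets/demo/.
# <demo-samples-source> is NOT a real identifier; substitute the repo's download command.
huggingface-cli download <demo-samples-source> --local-dir assets/demo/
```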
- Generate multi-view video latents from the input image using scripts/bash/static_sdg.sh.
If you want to skip the diffusion part, we provide pre-generated latents in assets/demo/static/diffusion_output, and these are used by default. To use your own generated latents instead, change dataset_name in configs/demo/lyra_static.yaml from lyra_static_demo to lyra_static_demo_generated.
- Reconstruct the multi-view video latents with the 3DGS decoder (if you ran step 1 yourself, change dataset_name in the .yaml to the generated variant as described above); a hedged command sketch of both steps follows this list.
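As a reference, here is a hedged sketch of the static pipeline; scripts/bash/static_sdg.sh, configs/demo/lyra_static.yaml, and the dataset_name values are taken from the steps above, while the 3DGS inference entry point is a placeholder, not the repo's actual script.

```bash
# 1. Generate multi-view video latents from the input image
#    (how the input image is selected -- script argument or config -- is not shown here).
bash scripts/bash/static_sdg.sh

# 2. To use your own latents instead of the pre-generated ones, edit
#    configs/demo/lyra_static.yaml:
#      dataset_name: lyra_static_demo  ->  dataset_name: lyra_static_demo_generated

# 3. Reconstruct the latents with the 3DGS decoder
#    (the entry point below is a placeholder, not the repo's actual script).
python sample.py --config configs/demo/lyra_static.yaml
```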
- Generate multi-view video latents from the input video and ViPE-estimated depth using scripts/bash/dynamic_sdg.sh.
If you want to skip the diffusion part, we provide pre-generated latents in assets/demo/dynamic/diffusion_output, and these are used by default. To use your own generated latents instead, change dataset_name in configs/demo/lyra_dynamic.yaml from lyra_dynamic_demo to lyra_dynamic_demo_generated. Add --flip_supervision if you also want to generate motion-reversed training data (not needed for inference).
- Reconstruct the multi-view video latents with the 3DGS decoder (if you ran step 1 yourself, change dataset_name in the .yaml to the generated variant); a hedged command sketch follows this list.
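Analogously to the static case, here is a hedged sketch of the dynamic pipeline; scripts/bash/dynamic_sdg.sh, --flip_supervision, and the config values are from the steps above, while the 3DGS inference entry point is a placeholder.

```bash
# 1. Generate multi-view video latents from the input video and ViPE depth.
#    Append --flip_supervision only when producing motion-reversed training data.
bash scripts/bash/dynamic_sdg.sh

# 2. To use your own latents instead of the pre-generated ones, edit
#    configs/demo/lyra_dynamic.yaml:
#      dataset_name: lyra_dynamic_demo  ->  dataset_name: lyra_dynamic_demo_generated

# 3. Reconstruct with the 3DGS decoder (the entry point below is a placeholder).
python sample.py --config configs/demo/lyra_dynamic.yaml
```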
Follow the installation instructions for ViPE. Note: ViPE's environment is not compatible with Lyra. We recommend installing ViPE in a separate conda environment. The ViPE results are required for dynamic scene generation. Moreover, we use the depth from ViPE for depth supervision during 3DGS decoder training.
- Run ViPE to extract depth, intrinsics, and camera poses (make sure to use the --lyra flag so that the same depth estimator as ours is used).
- Define the new data path in src/models/data/registry.py as a dataset, following the structure of our provided datasets. A hedged command sketch for these steps follows this list.
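Here is a hedged sketch of this custom-data path; only the --lyra flag and the registry.py location come from the steps above, while the ViPE invocation and environment name are assumptions about ViPE's interface.

```bash
# Run inside the separate ViPE conda environment, not the Lyra one.
conda activate vipe   # assumed environment name

# Extract depth, intrinsics, and camera poses. Only the --lyra flag is taken from the
# step above; the "vipe infer" entry point is an assumption about ViPE's CLI.
vipe infer assets/demo/dynamic/my_video.mp4 --lyra

# Afterwards, add the resulting output directory as a new dataset entry in
# src/models/data/registry.py, mirroring the structure of the provided datasets.
```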
We have tested Lyra only on H100 and A100 GPUs. For GPUs with limited memory, you can fully offload all models by appending the following flags to your SDG command:
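As an illustration only (the real flag names are listed with the SDG commands in the repo), a fully offloaded run might look like the following; the flags shown are placeholders patterned after Cosmos-style offload options.

```bash
# Illustrative only: the offload flags below are placeholders patterned after
# Cosmos-style inference options; append the exact flags listed in the repo instead.
bash scripts/bash/static_sdg.sh \
    --offload_diffusion_transformer \
    --offload_tokenizer \
    --offload_text_encoder_model
```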
Maximum observed memory during inference with full offloading: ~43GB. Note: Memory usage may vary depending on system specifications and is provided for reference only.
We provide training scripts to train from scratch or fine-tune our models. First, you need to download our training data:
Alternatively, use the demo scripts to generate training data. Here, running only the diffusion part is sufficient; the 3DGS decoder does not need to be run, since it is the component we are training. Make sure to update the paths in src/models/data/registry.py for lyra_static / lyra_dynamic to wherever your data is stored. We provide our progressive training script:
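The sketch below outlines the overall flow under stated assumptions: the SDG scripts and --flip_supervision come from the demo section above, while the training entry point shown is a hypothetical placeholder for the provided progressive training script.

```bash
# 1. Generate training data with the demo SDG scripts (diffusion part only; the 3DGS
#    decoder is the component being trained, so it does not need to be run here).
bash scripts/bash/static_sdg.sh
bash scripts/bash/dynamic_sdg.sh --flip_supervision   # also create motion-reversed data

# 2. Update lyra_static / lyra_dynamic in src/models/data/registry.py to point at
#    wherever the generated data is stored.

# 3. Launch progressive training. The script name below is a placeholder for the
#    provided progressive training script.
bash scripts/bash/train_lyra.sh
```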
We provide visualization scripts that export renderings and 3D Gaussians for each training stage:
Our model is based on NVIDIA Cosmos and GEN3C. We use input images generated by Flux.
We are also grateful to several other open-source repositories that we drew inspiration from or built upon during the development of our pipeline:
This project will download and install additional third-party open source software projects. Review the license terms of these open source projects before use.
Lyra source code is released under the Apache License 2.0.
Lyra models are released under the NVIDIA Open Model License. For a custom license, please visit our website and submit the form: NVIDIA Research Licensing.