Lyra by Nvidia: 3D scene generation from a single image or video



TL;DR: Feed-forward 3D and 4D scene generation from a single image/video trained with synthetic data generated by a camera-controlled video diffusion model.

Full Abstract: The ability to generate virtual environments is crucial for applications ranging from gaming to physical AI domains such as robotics, autonomous driving, and industrial AI. Current learning-based 3D reconstruction methods rely on the availability of captured real-world multi-view data, which is not always readily available. Recent advancements in video diffusion models have shown remarkable imagination capabilities, yet their 2D nature limits the applications to simulation where a robot needs to navigate and interact with the environment. In this paper, we propose a self-distillation framework that aims to distill the implicit 3D knowledge in the video diffusion models into an explicit 3D Gaussian Splatting (3DGS) representation, eliminating the need for multi-view training data. Specifically, we augment the typical RGB decoder with a 3DGS decoder, which is supervised by the output of the RGB decoder. In this approach, the 3DGS decoder can be purely trained with synthetic data generated by video diffusion models. At inference time, our model can synthesize 3D scenes from either a text prompt or a single image for real-time rendering. Our framework further extends to dynamic 3D scene generation from a monocular input video. Experimental results show that our framework achieves state-of-the-art performance in static and dynamic 3D scene generation.
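To make the self-distillation idea concrete, here is a minimal, purely illustrative PyTorch sketch (toy module names, random tensors standing in for real latents; this is not the actual Lyra code): a frozen RGB decoder acts as the teacher, and a trainable 3DGS decoder attached to the same video latents is optimized to match its output, so no captured multi-view data is needed.

# Conceptual sketch only -- hypothetical modules, not the Lyra implementation.
# A trainable 3DGS decoder is supervised by the frozen RGB decoder's output
# on synthetic latents from the camera-controlled video diffusion model.
import torch
import torch.nn as nn

class Toy3DGSDecoder(nn.Module):
    """Stand-in for a decoder that maps video latents to renderable images."""
    def __init__(self, latent_dim=16):
        super().__init__()
        self.net = nn.Conv3d(latent_dim, 3, kernel_size=1)  # placeholder

    def forward(self, latents, cameras=None):
        # A real decoder would predict 3D Gaussians and rasterize them from
        # `cameras`; here we just return an image-shaped tensor.
        return self.net(latents)

latent_dim, frames, h, w = 16, 8, 32, 32

rgb_decoder = nn.Conv3d(latent_dim, 3, kernel_size=1)    # frozen "teacher"
for p in rgb_decoder.parameters():
    p.requires_grad_(False)

gs_decoder = Toy3DGSDecoder(latent_dim)                  # trainable "student"
opt = torch.optim.Adam(gs_decoder.parameters(), lr=1e-4)

# One distillation step; random tensors stand in for diffusion-generated
# multi-view video latents.
latents = torch.randn(1, latent_dim, frames, h, w)
with torch.no_grad():
    target_rgb = rgb_decoder(latents)                    # teacher supervision
pred_rgb = gs_decoder(latents)                           # student prediction
loss = nn.functional.mse_loss(pred_rgb, target_rgb)
loss.backward()
opt.step()
opt.zero_grad()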

Paper, Project Page, Dataset

Sherwin Bahmani, Tianchang Shen, Jiawei Ren, Jiahui Huang, Yifeng Jiang, Haithem Turki, Andrea Tagliasacchi, David B. Lindell, Zan Gojcic, Sanja Fidler, Huan Ling, Jun Gao, Xuanchi Ren

Please follow the INSTALL.md to set up your conda environment and download pre-trained weights.

Lyra supports both images and videos as input. Below are examples of running Lyra on single images and videos.

First, you need to download the demo samples:

# Download test samples from Hugging Face
huggingface-cli download nvidia/Lyra-Testing-Example --repo-type dataset --local-dir assets/demo

Example 1: Single Image to 3D Gaussians Generation

  1. Generate multi-view video latents from the input image using scripts/bash/static_sdg.sh.
CUDA_HOME=$CONDA_PREFIX PYTHONPATH=$(pwd) torchrun --nproc_per_node=1 cosmos_predict1/diffusion/inference/gen3c_single_image_sdg.py \
    --checkpoint_dir checkpoints \
    --num_gpus 1 \
    --input_image_path assets/demo/static/diffusion_input/images/00172.png \
    --video_save_folder assets/demo/static/diffusion_output_generated \
    --foreground_masking \
    --multi_trajectory

If you want to skip the diffusion step, we provide pre-generated latents in assets/demo/static/diffusion_output, and these are used by default. To use your own generated latents instead, change dataset_name in configs/demo/lyra_static.yaml from lyra_static_demo to lyra_static_demo_generated (an optional helper for this is sketched after this example).

  2. Reconstruct the multi-view video latents with the 3DGS decoder (if you generated your own latents in step 1, change dataset_name in the .yaml accordingly):
accelerate launch sample.py --config configs/demo/lyra_static.yaml
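If you prefer to script the config switch mentioned above, the following optional helper (not part of the Lyra repo, and it assumes PyYAML is installed) flips dataset_name wherever it appears in the demo config; editing the file by hand works just as well.

# Optional helper -- not part of the Lyra repo. Switches the static demo config
# to your own generated latents. Note that round-tripping through PyYAML drops
# any comments and custom formatting in the file.
import yaml

CFG = "configs/demo/lyra_static.yaml"

def set_key(node, key, value):
    """Recursively set `key` to `value` wherever it appears in a YAML tree."""
    if isinstance(node, dict):
        for k in node:
            if k == key:
                node[k] = value
            else:
                set_key(node[k], key, value)
    elif isinstance(node, list):
        for item in node:
            set_key(item, key, value)

with open(CFG) as f:
    cfg = yaml.safe_load(f)
set_key(cfg, "dataset_name", "lyra_static_demo_generated")
with open(CFG, "w") as f:
    yaml.safe_dump(cfg, f, sort_keys=False)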

Example 2: Single Video to Dynamic 3D Gaussians Generation

  1. Generate multi-view video latents from the input video and ViPE estimated depth using scripts/bash/dynamic_sdg.sh.
CUDA_HOME=$CONDA_PREFIX PYTHONPATH=$(pwd) torchrun --nproc_per_node=1 cosmos_predict1/diffusion/inference/gen3c_dynamic_sdg.py \
    --checkpoint_dir checkpoints \
    --vipe_path assets/demo/dynamic/diffusion_input/rgb/6a71ee0422ff4222884f1b2a3cba6820.mp4 \
    --video_save_folder assets/demo/dynamic/diffusion_output \
    --disable_prompt_upsampler \
    --num_gpus 1 \
    --foreground_masking \
    --multi_trajectory

If you want to skip the diffusion step, we provide pre-generated latents in assets/demo/dynamic/diffusion_output, and these are used by default. To use your own generated latents instead, change dataset_name in configs/demo/lyra_dynamic.yaml from lyra_dynamic_demo to lyra_dynamic_demo_generated. Add --flip_supervision if you also want to generate motion-reversed training data (not needed for inference).

  2. Reconstruct the multi-view video latents with the 3DGS decoder (if you generated your own latents in step 1, change dataset_name in the .yaml accordingly):
accelerate launch sample.py --config configs/demo/lyra_dynamic.yaml

Testing on your own videos using ViPE

Follow the installation instructions for ViPE. Note: ViPE's environment is not compatible with Lyra. We recommend installing ViPE in a separate conda environment. The ViPE results are required for dynamic scene generation. Moreover, we use the depth from ViPE for depth supervision during 3DGS decoder training.

  1. Run ViPE to extract depth, intrinsics, and camera poses (make sure to pass -p lyra so that the same depth estimator as ours is used):
vipe infer YOUR_VIDEO.mp4 -p lyra --output <vipe_results_dir>
  2. Define the new data path as a dataset in src/models/data/registry.py, following the structure of our provided datasets (a hypothetical sketch is shown below).
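The exact schema used in src/models/data/registry.py is not reproduced here, so the snippet below is only a hypothetical illustration of what a new entry might contain (all field names are invented); mirror the existing lyra_static / lyra_dynamic entries when adding your own dataset.

# Hypothetical sketch -- field names and structure are invented for
# illustration; copy the structure of the existing entries in
# src/models/data/registry.py rather than this.
MY_VIPE_SCENE = {
    "name": "my_vipe_scene",
    "root": "/path/to/vipe_results_dir",   # output dir passed to `vipe infer`
    "rgb": "rgb",                          # RGB frames
    "depth": "depth",                      # ViPE depth maps
    "camera": "camera",                    # intrinsics and camera poses
}

DATASET_REGISTRY = {
    "my_vipe_scene": MY_VIPE_SCENE,
}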

We have tested Lyra only on H100 and A100 GPUs. For GPUs with limited memory, you can fully offload all models by appending the following flags to your SDG command:

--offload_diffusion_transformer \
--offload_tokenizer \
--offload_text_encoder_model \
--offload_prompt_upsampler \
--offload_guardrail_models \
--disable_guardrail \
--disable_prompt_encoder

Maximum observed memory during inference with full offloading: ~43GB. Note: Memory usage may vary depending on system specifications and is provided for reference only.
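If you want to check the peak usage on your own system, a small stand-alone helper like the one below (not part of the Lyra scripts) can poll nvidia-smi while the SDG command runs in another terminal; stop it with Ctrl+C to print the peak it observed.

# Optional helper -- polls nvidia-smi once per second and tracks the peak GPU
# memory in use across all GPUs. Run it alongside your SDG command and stop it
# with Ctrl+C when the run finishes.
import subprocess
import time

peak_mib = 0
try:
    while True:
        out = subprocess.check_output(
            ["nvidia-smi", "--query-gpu=memory.used", "--format=csv,noheader,nounits"],
            text=True,
        )
        peak_mib = max([peak_mib] + [int(x) for x in out.split()])
        print(f"\rpeak GPU memory used: {peak_mib} MiB", end="")
        time.sleep(1)
except KeyboardInterrupt:
    print(f"\npeak GPU memory used: {peak_mib} MiB (~{peak_mib / 1024:.1f} GiB)")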

We provide training scripts to train from scratch or fine-tune our models. First, you need to download our training data:

# Download our training datasets from Hugging Face and untar them into a static/dynamic folder
huggingface-cli download nvidia/PhysicalAI-SpatialIntelligence-Lyra-SDG --repo-type dataset --local-dir lyra_dataset/tar
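A minimal sketch for the untar step, assuming the archives land in lyra_dataset/tar and that static and dynamic archives can be told apart by file name (adjust these assumptions to how the downloaded tars are actually organized):

# Sketch only -- extracts each downloaded .tar into lyra_dataset/static or
# lyra_dataset/dynamic, chosen by file name. Adjust paths/heuristics as needed.
import tarfile
from pathlib import Path

tar_dir = Path("lyra_dataset/tar")
for tar_path in sorted(tar_dir.glob("**/*.tar")):
    split = "dynamic" if "dynamic" in tar_path.name else "static"
    out_dir = Path("lyra_dataset") / split
    out_dir.mkdir(parents=True, exist_ok=True)
    with tarfile.open(tar_path) as tf:
        tf.extractall(out_dir)
    print(f"extracted {tar_path.name} -> {out_dir}")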

Alternatively, use the demo script to generate training data yourself. For this, running only the diffusion part is sufficient; the 3DGS decoder does not need to be run, since it is the component being trained. Make sure to update the paths in src/models/data/registry.py for lyra_static / lyra_dynamic to wherever your data is stored. We provide a progressive training script.

We also provide visualization scripts to export renderings and 3D Gaussians for each training stage.

Our model is based on NVIDIA Cosmos and GEN3C. We use input images generated by Flux.

We are also grateful to several other open-source repositories that we drew inspiration from or built upon during the development of our pipeline:

@inproceedings{bahmani2025lyra,
  title={Lyra: Generative 3D Scene Reconstruction via Video Diffusion Model Self-Distillation},
  author={Bahmani, Sherwin and Shen, Tianchang and Ren, Jiawei and Huang, Jiahui and Jiang, Yifeng and Turki, Haithem and Tagliasacchi, Andrea and Lindell, David B. and Gojcic, Zan and Fidler, Sanja and Ling, Huan and Gao, Jun and Ren, Xuanchi},
  booktitle={arXiv preprint arXiv:2509.19296},
  year={2025}
}

This project will download and install additional third-party open source software projects. Review the license terms of these open source projects before use.

Lyra source code is released under the Apache 2 License.

Lyra models are released under the NVIDIA Open Model License. For a custom license, please visit our website and submit the form: NVIDIA Research Licensing.
