Structured 3D Latents
for Scalable and Versatile 3D Generation
CVPR 2025 Highlight
(Note: This is the only official webpage for the TRELLIS project)
TL;DR: A native 3D generative model built on a unified Structured Latent representation and Rectified Flow Transformers, enabling versatile and high-quality 3D asset creation.
We introduce a novel 3D generation method for versatile and high-quality 3D asset creation.
The cornerstone is a unified Structured LATent (SLAT) representation which allows decoding to different output formats, such as Radiance Fields, 3D Gaussians, and meshes.
This is achieved by integrating a sparsely-populated 3D grid with dense multiview visual features extracted from a powerful vision foundation model,
comprehensively capturing both structural (geometry) and textural (appearance) information while maintaining flexibility during decoding.
We employ rectified flow transformers tailored for SLAT as our 3D generation models and train models with up to 2 billion parameters on a large 3D asset dataset of 500K diverse objects.
Our model generates high-quality results with text or image conditions, significantly surpassing existing methods, including recent ones at similar scales.
We showcase flexible output format selection and local 3D editing capabilities which were not offered by previous models.
Code, model, and data will be released.
* Generated by TRELLIS, using its image-to-3D-asset capabilities.
The appearance and geometry shown in this page are rendered from 3D Gaussians and meshes, respectively. GLB files are extracted by baking appearance from 3D Gaussians to meshes.
Generation | Text to 3D Asset
All text prompts are generated by GPT-4. Click on the cards to view extracted GLB files.
Generation | Image to 3D Asset
Image prompts are generated by DALL-E 3. Click on the cards to view extracted GLB files.
TRELLIS can generate variants of a given 3D asset coherent with given text prompts.
Editing | Local Manipulation
TRELLIS can manipulate targeted local regions of a given 3D asset according to given text or image prompts.
Application | 3D Art Designs
By compositing the high-quality 3D assets generated by TRELLIS, complex and vibrant 3D art designs can be created with ease.
We introduce Structured LATents (SLAT), a unified 3D latent representation for high-quality, versatile 3D generation. SLAT marries sparse structures with powerful visual representations. It defines local latents on active voxels intersecting the object's surface. The local latents are encoded by fusing and processing image features from densely rendered views of the 3D asset and attaching them to the active voxels. These features, derived from powerful pretrained vision encoders, capture detailed geometric and visual characteristics, complementing the coarse structure provided by the active voxels. Different decoders can then map SLAT to diverse high-quality 3D representations.
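The representation described above can be pictured as a sparse set of active voxel coordinates, each carrying a local latent vector. The following is a minimal sketch of that data structure, with hypothetical field names chosen for illustration (the `SLAT` class and `to_dense` helper are not the official TRELLIS implementation):

```python
# Hypothetical sketch of a structured-latent container: active voxel
# coordinates on a sparse 3D grid, each paired with a local latent vector.
from dataclasses import dataclass
import numpy as np

@dataclass
class SLAT:
    coords: np.ndarray   # (N, 3) integer indices of active voxels on an R^3 grid
    latents: np.ndarray  # (N, C) local latent vector per active voxel
    resolution: int      # grid resolution R

def to_dense(slat: SLAT) -> np.ndarray:
    """Scatter the sparse local latents into a dense (R, R, R, C) grid;
    cells without an active voxel stay zero."""
    R, C = slat.resolution, slat.latents.shape[1]
    grid = np.zeros((R, R, R, C), dtype=slat.latents.dtype)
    x, y, z = slat.coords.T
    grid[x, y, z] = slat.latents
    return grid

# Toy example: 2 active voxels on a 4^3 grid with 8-dimensional latents.
slat = SLAT(coords=np.array([[0, 1, 2], [3, 3, 3]]),
            latents=np.arange(16, dtype=np.float32).reshape(2, 8),
            resolution=4)
dense = to_dense(slat)
print(dense.shape)  # (4, 4, 4, 8)
```

Only the active voxels near the surface carry latents, which is what keeps the representation compact while the attached features preserve fine geometry and appearance.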
Building on SLAT, we train a family of large 3D generation models, dubbed TRELLIS, conditioned on text prompts or images. A two-stage pipeline is applied: it first generates the sparse structure of SLAT, then generates the latent vectors for the non-empty cells. We employ rectified flow transformers as our backbone models and adapt them to handle the sparsity in SLAT. We train TRELLIS with up to 2 billion parameters on a large dataset of carefully collected 3D assets. TRELLIS can create high-quality 3D assets with detailed geometry and vivid texture, significantly surpassing previous methods. Moreover, it can easily generate 3D assets in different output formats to meet diverse downstream requirements.
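Rectified flow sampling, as used in both stages above, amounts to Euler-integrating a learned velocity field from noise (t=0) toward data (t=1). Below is a hedged toy sketch of that sampler; the `velocity_fn` here is a closed-form stand-in (in TRELLIS it would be a conditioned transformer), and all names are illustrative rather than the project's API:

```python
# Toy rectified-flow sampler: Euler integration of dx/dt = v(x, t)
# from t = 0 (pure noise) to t = 1 (generated sample).
import numpy as np

def rectified_flow_sample(velocity_fn, x_noise, n_steps=100):
    """Integrate the flow ODE with fixed-step Euler updates."""
    x, dt = x_noise, 1.0 / n_steps
    for i in range(n_steps):
        t = i * dt
        x = x + dt * velocity_fn(x, t)
    return x

# Stand-in velocity field whose straight-line flow carries any point x
# toward a fixed target; real models learn v with a transformer
# conditioned on text or image prompts.
target = np.full(4, 2.0)
v = lambda x, t: (target - x) / max(1.0 - t, 1e-6)

x0 = np.zeros(4)                      # "noise" starting point
sample = rectified_flow_sample(v, x0)  # converges to the target
```

Because rectified flow trains the velocity field to follow near-straight paths between noise and data, a small number of Euler steps already gives good samples, which keeps two-stage generation fast.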
If you find our work useful, please consider citing:
@article{xiang2024structured,
  title   = {Structured 3D Latents for Scalable and Versatile 3D Generation},
  author  = {Xiang, Jianfeng and Lv, Zelong and Xu, Sicheng and Deng, Yu and Wang, Ruicheng and Zhang, Bowen and Chen, Dong and Tong, Xin and Yang, Jiaolong},
  journal = {arXiv preprint arXiv:2412.01506},
  year    = {2024}
}
Responsible AI Considerations
TRELLIS is purely a research project. Responsible AI considerations were factored into all stages of the project. The datasets used in this paper are public and have been reviewed to ensure there is no personally identifiable information or harmful content. However, as these datasets are sourced from the Internet, potential bias may still be present. Currently the model excels at generating artistic-style 3D assets, while its capability in generating photorealistic real-world objects is limited.
Material Disclaimer
The materials made available on this page are provided solely for academic and research purposes in connection with the exploration of text-to-3D and image-to-3D generation technologies, as described in the publication accessible at https://arxiv.org/abs/2412.01506. These materials are not intended for commercial exploitation or use.
If you believe that any content on this page infringes upon your intellectual property rights, including but not limited to copyright, please notify us by submitting a takedown request via email to jiaoyan (at) microsoft.com.