MMaDA is a new family of multimodal diffusion foundation models designed to achieve superior performance across diverse domains such as textual reasoning, multimodal understanding, and text-to-image generation. MMaDA is distinguished by three key innovations:
- MMaDA adopts a unified diffusion architecture with a shared probabilistic formulation and a modality-agnostic design, eliminating the need for modality-specific components.
- MMaDA introduces a mixed long chain-of-thought (CoT) fine-tuning strategy that curates a unified CoT format across modalities.
- MMaDA adopts a unified policy-gradient-based RL algorithm, which we call UniGRPO, tailored for diffusion foundation models. Utilizing diversified reward modeling, UniGRPO unifies post-training across both reasoning and generation tasks, ensuring consistent performance improvements (see the illustrative sketch after this list).
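For readers new to the GRPO family, below is a minimal sketch of the group-relative advantage computation that UniGRPO-style training builds on. It illustrates the general idea only, not the repository's UniGRPO implementation; the reward values are placeholders.

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """GRPO-style advantages for a group of responses sampled for the same prompt.

    `rewards` has shape (group_size,), one scalar reward per response.
    Each reward is normalized by the group mean and std, so no learned
    value function (critic) is required.
    """
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Toy usage: four responses to one prompt, scored by some reward model (placeholder values).
rewards = torch.tensor([0.2, 0.9, 0.5, 0.1])
print(group_relative_advantages(rewards))  # responses above the group mean get positive advantage
```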
MMaDA's decoding demo. This video showcases how a diffusion foundation model generates text and images.
The "Text Generation" part uses a semi-autoregressive sampling method, while the "Multimodal Generation" part adopts non-autoregressive diffusion denoising.
- [2025-05-22] We release the inference and training code of MMaDA for text generation, multimodal generation and image generation.
- [2025-05-22] We open-source MMaDA-8B-Base on Hugging Face. MMaDA-8B-MixCoT and MMaDA-8B-Max will be released in the near future.
- [2025-05-22] We release our research paper and demo for the first unified multimodal diffusion model: MMaDA.
MMaDA includes a series of checkpoints reflecting different training stages:
- MMaDA-8B-Base: After pretraining and instruction tuning. Capable of basic text generation, image generation, image captioning, and thinking (see the download example after this list).
- MMaDA-8B-MixCoT (coming soon): After mixed long chain-of-thought (CoT) fine-tuning. Capable of complex textual, multimodal and image generation reasoning. Will be released in 2 weeks.
- MMaDA-8B-Max (coming soon): After UniGRPO reinforcement learning. Excels at complex reasoning and high-quality visual generation. Will be released in 1 month.
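If you want to fetch the released weights ahead of time, a standard huggingface_hub download works. The repository id below is an assumption based on the release announcement; check the actual model card before using it.

```python
from huggingface_hub import snapshot_download

# Download the MMaDA-8B-Base checkpoint to the local Hugging Face cache.
# "Gen-Verse/MMaDA-8B-Base" is an assumed repo id; verify it on the model card.
local_dir = snapshot_download(repo_id="Gen-Verse/MMaDA-8B-Base")
print("Checkpoint downloaded to:", local_dir)
```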
- Release MMaDA-8B-MixCoT and MMaDA-8B-Max
- Release OpenRLHF-based UniGRPO training code.
First, set up the environment:
Launch local Gradio demo:
Or try it online via our Huggingface Demo.
For batch-level inference, we provide our inference scripts here.
For text generation, we follow LLaDA's configuration and generation script. Simply run:
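The exact command is provided in the repository; as a rough Python-level alternative, the sketch below follows LLaDA's published loading recipe (`AutoModel` with `trust_remote_code`). Whether the MMaDA checkpoint supports this path, and the repo id, are assumptions.

```python
import torch
from transformers import AutoModel, AutoTokenizer

repo = "Gen-Verse/MMaDA-8B-Base"  # assumed Hugging Face repo id
tokenizer = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)
model = AutoModel.from_pretrained(repo, trust_remote_code=True,
                                  torch_dtype=torch.bfloat16).eval()

prompt = "Explain why the sky is blue."
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

# From here, the repository's generation script applies the semi-autoregressive
# masked-diffusion sampler with LLaDA-style settings (number of denoising steps,
# generation length, block length).
```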
For multimodal generation, first log in to your wandb account:
Inference demo for multimodal generation; you can view the results on wandb:
For text-to-image generation, first log in to your wandb account:
Inference demo for text-to-image generation; you can view the results on wandb:
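Both demos log their outputs to wandb. If you are scripting these runs, the login step can also be done from Python with the standard wandb API:

```python
import wandb

# Equivalent to running `wandb login` in a shell; reads WANDB_API_KEY
# from the environment if it is set, otherwise prompts for a key.
wandb.login()
```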
Update your training data path in configs/xx.yaml.
Please first prepare your accelerate configs. You can simply run:
Or use our provided configs in accelerate_configs:
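If you prefer not to answer the interactive `accelerate config` prompts, accelerate also ships a small helper that writes a default config file; the bf16 setting below is only an example.

```python
from accelerate.utils import write_basic_config

# Writes a default accelerate config file (same effect as accepting the
# defaults of `accelerate config`); bf16 mixed precision is just an example.
write_basic_config(mixed_precision="bf16")
```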
First, we initialize our model from LLaDA-8B-Instruct and train on ImageNet for basic visual capabilities.
Then we replace the ImageNet dataset from Stage 1.1 with an image-text dataset. Please change the pretrained model path in mmada_pretraining_stage2_llada_instruct.yaml to your checkpoint from Stage 1.1.
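This path update recurs in every later stage, so a small helper like the sketch below can patch the config before each run. The key name `pretrained_model_path` and the paths are assumptions; match them to the actual fields in your yaml.

```python
import yaml  # pip install pyyaml

CONFIG = "configs/mmada_pretraining_stage2_llada_instruct.yaml"
CHECKPOINT = "path/to/your/stage1_1/checkpoint"  # placeholder path

with open(CONFIG) as f:
    cfg = yaml.safe_load(f)

# "pretrained_model_path" is an assumed key name; use the field your config actually defines.
cfg["pretrained_model_path"] = CHECKPOINT

with open(CONFIG, "w") as f:
    yaml.safe_dump(cfg, f, sort_keys=False)
```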
In this stage, we begin training on text instruction following and include corresponding validations. Please change the pretrained model path in mmada_pretraining_stage3_llada_instruct.yaml to your checkpoint from Stage 1.2.
In this stage, we begin our Mix-CoT fine-tuning, starting with text reasoning, along with improved image quality. Please change the pretrained model path in mmada_pretraining_stage3_llada_instruct.yaml to your checkpoint from Stage 1.3, and prepare your CoT data.
In this stage, we add multimodal reasoning, along with improved image quality. Please change the pretrained model path in mmada_pretraining_stage3_llada_instruct.yaml to your checkpoint from Stage 2.1, and prepare your CoT data.
[Will be released once we finish our code transition to OpenRLHF]
This work is heavily based on Show-o, LLaDA, maskgit, transformers, accelerate and webdataset. Thanks to all the authors for their great work.
We welcome discussion and collaboration to continuously improve MMaDA. Reach us via this WeChat QR code!