MMaDA – Open-Sourced Multimodal Large Diffusion Language Models


MMaDA is a new family of multimodal diffusion foundation models designed to achieve superior performance across diverse domains such as textual reasoning, multimodal understanding, and text-to-image generation. MMaDA is distinguished by three key innovations:

  1. MMaDA adopts a unified diffusion architecture with a shared probabilistic formulation and a modality-agnostic design, eliminating the need for modality-specific components.
  2. MMaDA introduces a mixed long chain-of-thought (CoT) fine-tuning strategy that curates a unified CoT format across modalities.
  3. MMaDA adopts a unified policy-gradient-based RL algorithm, which we call UniGRPO, tailored for diffusion foundation models. Utilizing diversified reward modeling, UniGRPO unifies post-training across both reasoning and generation tasks, ensuring consistent performance improvements.

MMaDA decoding demo

MMaDA's decoding demo. This video showcases how a diffusion foundation model generates text and images.
The "Text Generation" part uses a semi-autoregressive sampling method, while the "Multimodal Generation" part adopts non-autoregressive diffusion denoising.

  • [2025-05-22] We release the inference and training code of MMaDA for text generation, multimodal generation and image generation.
  • [2025-05-22] We open-source MMaDA-8B-Base on Hugging Face. MMaDA-8B-MixCoT and MMaDA-8B-Max will be released in the near future.
  • [2025-05-22] We release our research paper and demo for the first unified multimodal diffusion model: MMaDA.

MMaDA includes a series of checkpoints reflecting different training stages:

  1. MMaDA-8B-Base: After pretraining and instruction tuning. Capable of basic text generation, image generation, image captioning, and reasoning.
  2. MMaDA-8B-MixCoT (coming soon): After mixed long chain-of-thought (CoT) fine-tuning. Capable of complex textual, multimodal, and image-generation reasoning. Will be released in 2 weeks.
  3. MMaDA-8B-Max (coming soon): After UniGRPO reinforcement learning. Excels at complex reasoning and high-quality visual generation. Will be released in 1 month.

Overview of MMaDA's capabilities.

  • Release MMaDA-8B-MixCoT and MMaDA-8B-Max
  • Release OpenRLHF-based UniGRPO training code.

First, set up the environment:

pip install -r requirements.txt

Launch the local Gradio demo:
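A typical launch might look like the following; app.py is an assumed entry-point name, so check the repository for the actual demo script:

python3 app.py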

Or try it online via our Hugging Face demo.

For batch-level inference, we provide our inference scripts here.

1. Text Generation

For text generation, we follow LLaDA's configuration and generation script. Simply run:
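A minimal run might look like the following; generate.py is an assumed script name (following LLaDA's released generation script), so check the repository for the exact entry point and arguments:

python3 generate.py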

2. MultiModal Generation

For multimodal generation and text-to-image generation, first log in to your wandb account:
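wandb login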

Run the inference demo for multimodal generation; you can view the results on wandb:

python3 inference_mmu.py config=configs/mmada_demo.yaml mmu_image_root=./mmu_validation question='Please describe this image in detail.'

3. Text-to-Image Generation

For text-to-image generation, first log in to your wandb account:
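wandb login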

Run the inference demo for text-to-image generation; you can view the results on wandb:

python3 inference_t2i.py config=configs/mmada_demo.yaml batch_size=1 validation_prompts_file=validation_prompts/text2image_prompts.txt guidance_scale=3.5 generation_timesteps=15 mode='t2i'

Update your training data path in configs/xx.yaml.

Stage 0. Prepare your accelerate configs

Please first prepare your accelerate configs. You can simply run:
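accelerate config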

Or use our provided configs in accelerate_configs:

├── accelerate_configs/
│   ├── 1_gpu.yaml
│   └── 8_node_8_gpus_deepspeed_zero2.yaml (for 8 * 8 GPUs)

Stage 1.1: Pre-training on ImageNet

First, we use LLaDA-8B-Instruct to initialize our model and train it on ImageNet to build basic visual capabilities.

accelerate launch --config_file path/to/your/accelerate_config --main_process_port=8888 training/train_mmada.py config=configs/mmada_pretraining_stage1_llada_instruct.yaml

Stage 1.2: Pre-training on Image-Text Dataset

Then we replace the ImageNet dataset from Stage 1.1 with an image-text dataset. Please change the pretrained model path in mmada_pretraining_stage2_llada_instruct.yaml to your Stage 1.1 checkpoint.

accelerate launch --config_file path/to/your/accelerate_config --main_process_port=8888 training/train_mmada_stage2.py config=configs/mmada_pretraining_stage2_llada_instruct.yaml

Stage 1.3: Pre-training on Text Instruction Following

In this stage, we begin training on text instruction following and include the corresponding validations. Please change the pretrained model path in mmada_pretraining_stage3_llada_instruct.yaml to your Stage 1.2 checkpoint.

accelerate launch --config_file path/to/your/accelerate_config --main_process_port=8888 training/train_mmada_stage3.py config=configs/mmada_pretraining_stage3_llada_instruct.yaml

Stage 2.1: Mix-CoT Training (Text Only)

In this stage, we begin our Mix-CoT fine-tuning with text reasoning first, along with improved image quality. Please change the pretrained model path in mmada_pretraining_stage3_llada_instruct_512_cot.yaml to your Stage 1.3 checkpoint and prepare your CoT data.

accelerate launch --config_file path/to/your/accelerate_config --main_process_port=8888 training/train_mmada_stage_cot_sft.py config=configs/mmada_pretraining_stage3_llada_instruct_512_cot.yaml

Stage 2.2: Mix-CoT Training (with MultiModal Reasoning)

In this stage, we include multimodal reasoning, along with improved image quality. Please change the pretrained model path in mmada_pretraining_stage4_llada_instruct.yaml to your Stage 2.1 checkpoint and prepare your CoT data.

accelerate launch --config_file path/to/your/accelerate_config --main_process_port=8888 training/train_mmada_stage4.py config=configs/mmada_pretraining_stage4_llada_instruct.yaml

Stage 3: UniGRPO Training

[Will be released once we finish our code transition to OpenRLHF.]

@article{yang2025mmada,
  title   = {Multimodal Large Diffusion Language Models},
  author  = {Yang, Ling and Tian, Ye and Li, Bowen and Zhang, Xinchen and Shen, Ke and Tong, Yunhai and Wang, Mengdi},
  journal = {arXiv preprint arXiv:2505.15809},
  year    = {2025}
}

This work is heavily based on Show-o, LLaDA, maskgit, transformers, accelerate and webdataset. Thanks to all the authors for their great work.

💬 Discussion and Collaboration

You are welcome to discuss and collaborate with us on continuously improving MMaDA. Reach us via our WeChat QR code!

