Fengqi Zhu1, *, §, Rongzhen Wang1, *, Shen Nie1, Xiaolu Zhang3, Chunwei Wu3, Jun Hu3, Jun Zhou3, Jianfei Chen2, Yankai Lin1, †, Ji-Rong Wen1, Chongxuan Li1, †, ‡
1Renmin University of China, 2Tsinghua University, 3Ant Group
* Equal contribution, § Work done during an internship at Ant Group, † Project leader, ‡ Corresponding author
TL;DR: We propose VRPO to reduce gradient variance and improve preference alignment in masked diffusion language models.
Motivation: The Problem with RL-Based Alignment in Diffusion Language Models
Masked Diffusion Models (MDMs) cannot compute exact log-likelihoods directly. Taking DPO as an example, we must approximate the log-likelihoods with Evidence Lower Bounds (ELBOs):
\[\mathcal{L}_{\mathrm{DPO-E}}(\theta) = -\mathbb{E}_{(y_w, y_l)} \left[\log \sigma\left(\beta \left(\mathcal{B}_{\pi_\theta}(y_w) - \mathcal{B}_{\pi_{\mathrm{ref}}}(y_w)\right) - \beta\left(\mathcal{B}_{\pi_\theta}(y_l) - \mathcal{B}_{\pi_{\mathrm{ref}}}(y_l)\right)\right)\right]\]
Key Challenge: In practice, each ELBO is estimated by Monte Carlo sampling over timesteps and masks. This sampling noise propagates through the nonlinear log-sigmoid, introducing both bias and variance into the loss.
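To make the source of the noise concrete, here is a minimal PyTorch sketch of this naive pipeline (an illustration under simplifying assumptions, not the authors' implementation): each ELBO is approximated by Monte Carlo draws of a timestep and a random mask, and the noisy estimates are plugged directly into the DPO loss. The `model` callable, `MASK_ID`, and the toy usage at the end are hypothetical placeholders.

```python
# Minimal sketch (not the authors' implementation) of the naive estimator:
# each ELBO is a Monte Carlo estimate, and the noisy estimates are plugged
# straight into the DPO loss, so the noise passes through log-sigmoid.
import torch
import torch.nn.functional as F

MASK_ID = 0  # hypothetical mask-token id


def elbo_estimate(model, y, n_samples=1):
    """Monte Carlo estimate of the MDM ELBO of a response y (1D token tensor).

    Each draw samples a timestep t ~ U(0, 1], masks every token independently
    with probability t, and scores the masked positions with a 1/t weight.
    """
    estimates = []
    for _ in range(n_samples):
        t = torch.rand(()).clamp_min(1e-3)                       # timestep
        mask = torch.rand(y.shape) < t                           # which tokens to mask
        y_t = torch.where(mask, torch.full_like(y, MASK_ID), y)  # partially masked input
        log_probs = model(y_t)                                   # (seq_len, vocab) log-probs
        token_ll = log_probs.gather(-1, y.unsqueeze(-1)).squeeze(-1)
        estimates.append((token_ll * mask).sum() / t)
    return torch.stack(estimates).mean()


def dpo_e_loss(policy, ref, y_w, y_l, beta=1.0):
    """DPO loss with ELBO estimates substituted for exact log-likelihoods."""
    score = beta * (elbo_estimate(policy, y_w) - elbo_estimate(ref, y_w)) \
          - beta * (elbo_estimate(policy, y_l) - elbo_estimate(ref, y_l))
    return -F.logsigmoid(score)


# Toy usage with a dummy "model" that returns random log-probs.
vocab, seq_len = 32, 16
dummy = lambda y_t: torch.log_softmax(torch.randn(seq_len, vocab), dim=-1)
y_w, y_l = torch.randint(1, vocab, (seq_len,)), torch.randint(1, vocab, (seq_len,))
print(dpo_e_loss(dummy, dummy, y_w, y_l))  # fluctuates run to run: that is the estimator noise
```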
VRPO: Three Simple Techniques for Variance Reduction
Core Insight: We prove that both the bias and the variance of the ELBO-based DPO loss can be bounded by the variance of the preference score estimator. Reducing this variance therefore improves the overall optimization.
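A schematic version of the argument (the paper states the precise bounds): let \(\hat{e}\) denote the Monte Carlo estimate of the preference score \(e\), i.e. the argument of \(\log\sigma\) in \(\mathcal{L}_{\mathrm{DPO-E}}\) above. Assuming the ELBO estimators are unbiased, so that \(\mathbb{E}[\hat{e}] = e\), and using that \(\log\sigma\) is 1-Lipschitz,
\[\left|\mathbb{E}[\log\sigma(\hat{e})] - \log\sigma(e)\right| \le \mathbb{E}\left|\hat{e} - e\right| \le \sqrt{\mathrm{Var}[\hat{e}]}, \qquad \mathrm{Var}[\log\sigma(\hat{e})] \le \mathbb{E}\left[(\hat{e} - e)^2\right] = \mathrm{Var}[\hat{e}].\]
Both error terms shrink with the variance of \(\hat{e}\), which is exactly what the three techniques below target.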
1️⃣ Increased Budget
Use more samples \(n = n_{\mathrm{time}} \times n_{\mathrm{mask}}\) to estimate each ELBO
2️⃣ Optimal Allocation
Set \(n_{\mathrm{time}} = n\) and \(n_{\mathrm{mask}} = 1\) (one mask per timestep)
3️⃣ Antithetic Sampling
Share timesteps and masks between \(\pi_\theta\) and \(\pi_{\mathrm{ref}}\) (a code sketch of all three techniques follows below)
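To show how the three techniques fit together, here is a minimal sketch in the same toy setup (again an illustration, not the released code; `MASK_ID` and the model interface are the same placeholders as above): with a budget of \(n\) draws per ELBO, every draw gets its own timestep and a single mask (\(n_{\mathrm{time}} = n\), \(n_{\mathrm{mask}} = 1\)), and the same draws are reused to score both \(\pi_\theta\) and \(\pi_{\mathrm{ref}}\).

```python
# Minimal sketch (illustrative, not the released code) of the three VRPO
# techniques, using the same toy ELBO setup as the previous snippet.
import torch
import torch.nn.functional as F

MASK_ID = 0  # hypothetical mask-token id


def sample_draws(y, n):
    """Techniques 1 & 2: a budget of n draws, each with its own timestep
    and exactly one mask (n_time = n, n_mask = 1)."""
    draws = []
    for _ in range(n):
        t = torch.rand(()).clamp_min(1e-3)
        mask = torch.rand(y.shape) < t
        draws.append((t, mask))
    return draws


def elbo_with_draws(model, y, draws):
    """ELBO estimate of y computed on a fixed set of (timestep, mask) draws."""
    estimates = []
    for t, mask in draws:
        y_t = torch.where(mask, torch.full_like(y, MASK_ID), y)
        token_ll = model(y_t).gather(-1, y.unsqueeze(-1)).squeeze(-1)
        estimates.append((token_ll * mask).sum() / t)
    return torch.stack(estimates).mean()


def vrpo_loss(policy, ref, y_w, y_l, beta=1.0, n=8):
    # Technique 3 (antithetic sampling): the same draws score both the policy
    # and the reference, so their shared noise cancels in the ELBO difference.
    draws_w, draws_l = sample_draws(y_w, n), sample_draws(y_l, n)
    score = beta * (elbo_with_draws(policy, y_w, draws_w) - elbo_with_draws(ref, y_w, draws_w)) \
          - beta * (elbo_with_draws(policy, y_l, draws_l) - elbo_with_draws(ref, y_l, draws_l))
    return -F.logsigmoid(score)
```

For a fixed budget \(n\), techniques 2 and 3 only change which timesteps and masks are drawn and how they are reused across the ELBO estimates, not how many forward passes are made, which is why they come at no extra cost.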
Impact: VRPO improves LLaDA's performance across a wide range of benchmarks. Techniques 2 & 3 improve results at no additional compute cost.
BibTeX
Please consider citing: