Fengqi Zhu1, *, §, Rongzhen Wang1, *, Shen Nie1, Xiaolu Zhang3, Chunwei Wu3, Jun Hu3, Jun Zhou3, Jianfei Chen2, Yankai Lin1, †, Ji-Rong Wen1, Chongxuan Li1, †, ‡
1Renmin University of China, 2Tsinghua University, 3Ant Group
* Equal contribution, § Work done during an internship at Ant Group, † Project leader, ‡ Corresponding author
TL;DR: We propose VRPO to reduce gradient variance and improve preference alignment in masked diffusion language models.
Motivation: The Problem with RL-Based Alignment in Diffusion Language Models
Masked Diffusion Models (MDMs) cannot compute exact log-likelihoods directly. Taking DPO as an example, we must approximate the log-likelihoods with Evidence Lower Bounds (ELBOs):
\[\mathcal{L}_{\mathrm{DPO-E}}(\theta) = -\mathbb{E}_{(y_w, y_l)} \left[\log \sigma\left(\beta \left(\mathcal{B}_{\pi_\theta}(y_w) - \mathcal{B}_{\pi_{\mathrm{ref}}}(y_w)\right) - \beta\left(\mathcal{B}_{\pi_\theta}(y_l) - \mathcal{B}_{\pi_{\mathrm{ref}}}(y_l)\right)\right)\right]\]
Key Challenge: In practice, each ELBO is estimated by Monte Carlo sampling over timesteps and masks. This sampling noise propagates through the nonlinear log-sigmoid, introducing both bias and variance into the loss.
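To make the source of the noise concrete, here is a minimal PyTorch sketch of this naive pipeline (an illustration under simplifying assumptions, not the authors' implementation): each ELBO is approximated by Monte Carlo draws of a timestep and a random mask, and the noisy estimates are plugged directly into the DPO loss. The `model` callable, `MASK_ID`, and the toy usage at the end are hypothetical placeholders.

```python
# Minimal sketch (not the authors' implementation) of the naive estimator:
# each ELBO is a Monte Carlo estimate, and the noisy estimates are plugged
# straight into the DPO loss, so the noise passes through log-sigmoid.
import torch
import torch.nn.functional as F

MASK_ID = 0  # hypothetical mask-token id


def elbo_estimate(model, y, n_samples=1):
    """Monte Carlo estimate of the MDM ELBO of a response y (1D token tensor).

    Each draw samples a timestep t ~ U(0, 1], masks every token independently
    with probability t, and scores the masked positions with a 1/t weight.
    """
    estimates = []
    for _ in range(n_samples):
        t = torch.rand(()).clamp_min(1e-3)                       # timestep
        mask = torch.rand(y.shape) < t                           # which tokens to mask
        y_t = torch.where(mask, torch.full_like(y, MASK_ID), y)  # partially masked input
        log_probs = model(y_t)                                   # (seq_len, vocab) log-probs
        token_ll = log_probs.gather(-1, y.unsqueeze(-1)).squeeze(-1)
        estimates.append((token_ll * mask).sum() / t)
    return torch.stack(estimates).mean()


def dpo_e_loss(policy, ref, y_w, y_l, beta=1.0):
    """DPO loss with ELBO estimates substituted for exact log-likelihoods."""
    score = beta * (elbo_estimate(policy, y_w) - elbo_estimate(ref, y_w)) \
          - beta * (elbo_estimate(policy, y_l) - elbo_estimate(ref, y_l))
    return -F.logsigmoid(score)


# Toy usage with a dummy "model" that returns random log-probs.
vocab, seq_len = 32, 16
dummy = lambda y_t: torch.log_softmax(torch.randn(seq_len, vocab), dim=-1)
y_w, y_l = torch.randint(1, vocab, (seq_len,)), torch.randint(1, vocab, (seq_len,))
print(dpo_e_loss(dummy, dummy, y_w, y_l))  # fluctuates run to run: that is the estimator noise
```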
VRPO: Three Simple Techniques for Variance Reduction
Core Insight: We prove that both the bias and the variance of the ELBO-based DPO loss can be bounded by the variance of the preference score estimator. Reducing this variance therefore improves the overall optimization.
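A schematic version of the argument (the paper states the precise bounds): let \(\hat{e}\) denote the Monte Carlo estimate of the preference score \(e\), i.e. the argument of \(\log\sigma\) in \(\mathcal{L}_{\mathrm{DPO-E}}\) above. Assuming the ELBO estimators are unbiased, so that \(\mathbb{E}[\hat{e}] = e\), and using that \(\log\sigma\) is 1-Lipschitz,
\[\left|\mathbb{E}[\log\sigma(\hat{e})] - \log\sigma(e)\right| \le \mathbb{E}\left|\hat{e} - e\right| \le \sqrt{\mathrm{Var}[\hat{e}]}, \qquad \mathrm{Var}[\log\sigma(\hat{e})] \le \mathbb{E}\left[(\hat{e} - e)^2\right] = \mathrm{Var}[\hat{e}].\]
Both error terms shrink with the variance of \(\hat{e}\), which is exactly what the three techniques below target.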
1️⃣ Increased Budget
Use more samples \(n = n_{\mathrm{time}} \times n_{\mathrm{mask}}\) to estimate each ELBO
2️⃣ Optimal Allocation
Set \(n_{\mathrm{time}} = n\) and \(n_{\mathrm{mask}} = 1\) (one mask per timestep)
3️⃣ Antithetic Sampling
Share timesteps and masks between \(\pi_\theta\) and \(\pi_{\mathrm{ref}}\) (a code sketch of all three techniques follows below)
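To show how the three techniques fit together, here is a minimal sketch in the same toy setup (again an illustration, not the released code; `MASK_ID` and the model interface are the same placeholders as above): with a budget of \(n\) draws per ELBO, every draw gets its own timestep and a single mask (\(n_{\mathrm{time}} = n\), \(n_{\mathrm{mask}} = 1\)), and the same draws are reused to score both \(\pi_\theta\) and \(\pi_{\mathrm{ref}}\).

```python
# Minimal sketch (illustrative, not the released code) of the three VRPO
# techniques, using the same toy ELBO setup as the previous snippet.
import torch
import torch.nn.functional as F

MASK_ID = 0  # hypothetical mask-token id


def sample_draws(y, n):
    """Techniques 1 & 2: a budget of n draws, each with its own timestep
    and exactly one mask (n_time = n, n_mask = 1)."""
    draws = []
    for _ in range(n):
        t = torch.rand(()).clamp_min(1e-3)
        mask = torch.rand(y.shape) < t
        draws.append((t, mask))
    return draws


def elbo_with_draws(model, y, draws):
    """ELBO estimate of y computed on a fixed set of (timestep, mask) draws."""
    estimates = []
    for t, mask in draws:
        y_t = torch.where(mask, torch.full_like(y, MASK_ID), y)
        token_ll = model(y_t).gather(-1, y.unsqueeze(-1)).squeeze(-1)
        estimates.append((token_ll * mask).sum() / t)
    return torch.stack(estimates).mean()


def vrpo_loss(policy, ref, y_w, y_l, beta=1.0, n=8):
    # Technique 3 (antithetic sampling): the same draws score both the policy
    # and the reference, so their shared noise cancels in the ELBO difference.
    draws_w, draws_l = sample_draws(y_w, n), sample_draws(y_l, n)
    score = beta * (elbo_with_draws(policy, y_w, draws_w) - elbo_with_draws(ref, y_w, draws_w)) \
          - beta * (elbo_with_draws(policy, y_l, draws_l) - elbo_with_draws(ref, y_l, draws_l))
    return -F.logsigmoid(score)
```

For a fixed budget \(n\), techniques 2 and 3 only change which timesteps and masks are drawn and how they are reused across the ELBO estimates, not how many forward passes are made, which is why they come at no extra cost.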
Impact: VRPO improves LLaDA's performance across a wide range of benchmarks. Techniques 2 & 3 improve results at no additional compute cost.
BibTeX
Please consider citing: