
Empirical Analysis of Decoding Biases in Masked Diffusion Models

📖 Introduction · 🎉 News · ✨ Pipeline · ⚡️ Evaluation · 📈 Decoding Trajectory · 💻 Algorithm · 📧 Contact

📖 Introduction

Uɴᴄᴏᴅᴇ is a novel decoding strategy for Masked Diffusion Models (MDMs) that unifies global trajectory planning with content-aware informativeness maximization. It addresses the key limitations of traditional uncertainty-based samplers when applied to MDMs: a rigid boundary bias and a bias toward "trivial tokens." By using a position-aware weighting mechanism and a calibrated confidence score, Uɴᴄᴏᴅᴇ guides the decoding path and prevents the premature selection of unimportant tokens, significantly improving generation quality.

🎉 News

  • 2025-09-12: This release adds enhanced support for decoding with LLaDA, integrating a variety of recent semi- and non-autoregressive sampling strategies, including ReMDM, Fast-dLLM, Semi-AR, and the margin-based, entropy-based, and confidence-based samplers.
  • 2025-08-19: Released our paper on arXiv and our code on GitHub.

✨ Pipeline

(Figure: overview of the Uɴᴄᴏᴅᴇ pipeline.)

Uɴᴄᴏᴅᴇ is a novel decoding strategy designed for advanced Masked Diffusion Models (MDMs) such as LLaDA and Dream. These models are powerful non-autoregressive alternatives for sequence generation, enabling flexible decoding through the iterative denoising of masked tokens.

⚙️ Setup

git clone https://github.com/NEUIR/Uncode.git
cd Uncode
conda create --name uncode python=3.10
conda activate uncode
pip install -r requirements.txt

📃 Evaluation

Our method and all baseline methods can be evaluated on mathematical reasoning, code generation, and question-answering datasets.

Eval Case

This is an example of evaluating Uɴᴄᴏᴅᴇ on the HumanEval dataset. You can change --task and --mode to evaluate other datasets and decoding methods.

cd scripts
python eval.py \
    --task 'humaneval' \
    --model_name 'GSAI-ML/LLaDA-8B-Instruct' \
    --device 'cuda:5' \
    --gen_length 256 \
    --steps 256 \
    --block_length 256 \
    --mode pc_sampler \
    --lambd 0.25 \
    --alpha 10 \
    --data_path ../data/humaneval.jsonl \
    --result_path results/humaneval_pc_sampler

Following are the evaluation bash scripts for all decoding methods; run each from the scripts directory (cd scripts first):

  • Semi-Autoregressive: bash eval_semi_ar.sh
  • Entropy: bash eval_entropy.sh
  • EB-Sampler: bash eval_eb_sampler.sh
  • Fast-dLLM: bash eval_fast_dllm.sh
  • Margin: bash eval_margin.sh
  • PC-sampler: bash eval_pc_sampler.sh
  • ReMDM: bash eval_remdm.sh
  • Linear_Position: bash eval_linear_position.sh

Evaluation of Decoding Methods

All decoding methods are evaluated on the same set of datasets: HumanEval, MBPP, GSM8K, MATH-500, GPQA, Countdown, and Sudoku. Evaluation results are saved in the results folder.

Evaluation Tools

  • For the GSM8K and GPQA datasets, we use lm-eval for evaluation.
  • For the remaining datasets, please refer to scripts/eval.py for more details.

Consistency Note

All methods are evaluated using the same set of evaluation scripts (including both lm-eval and our custom script) to ensure consistent assessment.

Plotting Heatmaps

We provide a script to generate heatmaps for the decoding trajectories of different decoding methods. The script is located in scripts/heatmap.sh.

cd scripts
bash heatmap.sh

Results

The heatmap results are saved in the heatmap_results folder.

📈 Decoding Trajectory

The choice of decoding strategy significantly impacts the generation order of Masked Diffusion Models (MDMs). A critical limitation of existing uncertainty-based methods is their tendency to exhibit a "U-shaped" trajectory (the rigid boundary bias), in which tokens at the sequence boundaries are decoded early, followed by convergence toward the center. This bias stems from the premature unmasking of boundary tokens (BOS and EOS): the attention mechanism's local positional bias inflates confidence for tokens near the sequence boundaries.

In contrast, our Uɴᴄᴏᴅᴇ introduces explicit trajectory control through position-aware weighting, enabling adaptive generation order tailored to task requirements. Below, we visualize the decoding trajectories on the GSM8K dataset for four representative sampling strategies:

🔍 Trajectory Visualizations on GSM8K

(Decoding-trajectory heatmaps on GSM8K for four sampling strategies: confidence-based, entropy-based, margin-based, and Uɴᴄᴏᴅᴇ.)

🔑 Key Observations

  • Rigid Boundary Bias in Uncertainty-based Methods: Confidence, entropy, and margin-based samplers consistently exhibit the characteristic U-shaped pattern, with early decoding of tokens at both sequence boundaries. This behavior limits their ability to capture global dependencies required for complex reasoning tasks like mathematical problem-solving.

  • Trivial Token Bias: Uncertainty-based samplers tend to prioritize semantically trivial, high-frequency tokens (e.g., newline characters, spaces, common words like "the", and punctuation marks such as "." and "!") during decoding, leading to suboptimal reasoning paths.

  • Debias with Uɴᴄᴏᴅᴇ: Our method eliminates the U-shaped bias by regulating the decoding path through exponential positional weighting. This enables a more natural progression that aligns with the logical flow of reasoning tasks, as demonstrated by the sequential trajectory on the GSM8K dataset (see the short sketch below).
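
To make the positional weighting concrete, here is a toy illustration in Python (the prompt length here is an assumed value; λ = 0.25 is the default recommended below):

import math

lambd, prompt_len = 0.25, 16  # assumed values for illustration
for i in range(prompt_len, prompt_len + 5):
    w = math.exp(-lambd * (i - prompt_len))
    print(i - prompt_len, round(w, 3))
# Prints 0 1.0, 1 0.779, 2 0.607, 3 0.472, 4 0.368: earlier answer positions
# receive exponentially larger weights, steering decoding left-to-right
# instead of toward the sequence boundaries.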

The adaptive trajectory control of Uɴᴄᴏᴅᴇ directly contributes to its superior performance on GSM8K (82.2% accuracy) compared to uncertainty-based alternatives, highlighting the importance of aligning decoding order with task-specific structural demands.

💻 Algorithm

Method Overview

Uɴᴄᴏᴅᴇ is a novel decoding strategy for Masked Diffusion Models (MDMs) that addresses key limitations of existing uncertainty-based sampling methods. It unifies global trajectory planning with content-aware informativeness maximization through two core components:

  1. Position-Aware Weighting Mechanism: Regulates the decoding path using an exponential decay function to enable flexible control over the generation order, adapting to task-specific structural demands.

  2. Calibrated Confidence Score: Suppresses premature selection of trivial tokens (e.g., punctuation, filler words) by incorporating frequency-based adjustment from a reference corpus, promoting semantically rich content generation.

Extensive experiments across seven benchmarks demonstrate that Uɴᴄᴏᴅᴇ consistently outperforms existing MDM decoding strategies by more than 10% on average, narrowing the performance gap with state-of-the-art autoregressive models.
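
As a rough sketch of how these two components combine into a selection score (the function names, defaults, and the 1e-9 frequency floor are our own illustration, not the repository's API; p_freq stands for the background distribution $p_{\mathcal{D}'}$):

import math

def position_weight(i, prompt_len, lambd=0.25):
    # Exponential decay over answer positions; lambd = 0 disables the bias.
    return math.exp(-lambd * (i - prompt_len))

def calibrated_confidence(p_hat, token, p_freq, alpha=10.0):
    # Confidence scaled by corpus rarity (-log frequency), clipped at alpha,
    # so frequent "trivial" tokens are deprioritized.
    salience = p_hat * -math.log(p_freq.get(token, 1e-9))
    return min(salience, alpha)

def uncode_score(i, prompt_len, p_hat, token, p_freq, lambd=0.25, alpha=10.0):
    # Final selection score: position weight times calibrated confidence.
    return position_weight(i, prompt_len, lambd) * \
           calibrated_confidence(p_hat, token, p_freq, alpha)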

Algorithm Workflow

The complete workflow of Uɴᴄᴏᴅᴇ is summarized in the following algorithm:

Require: Predictor $p_\theta$, prompt $p_0$, answer length $L$, steps $T$, hyperparameters $\lambda, \alpha$; reference corpus $\mathcal{D}'$

  1. $p_{\mathcal{D}'} \gets \text{FreqDist}(\mathcal{D}')$ // Background token frequencies
  2. $x \gets \text{Concat}(p_0, \text{[MASK]} \times L)$
  3. for $t = 1$ to $T$ do
    • $\mathcal{M}_t \gets \{\, i \mid x^i = \text{[MASK]} \,\}$ // Get mask indices
    • if $\mathcal{M}_t = \emptyset$ then
      • break
    • $\hat{x}_0, \hat{p} \gets p_\theta(\cdot \mid x)$ // Predicted tokens and confidences
    • for each position $i \in \mathcal{M}_t$ do
      • $\mathcal{C}^{(i)} \gets \hat{p}^{(i)} \cdot \bigl(-\log p_{\mathcal{D}'}(\hat{x}_0^{(i)})\bigr)$ // Calibrated confidence: rarer tokens score higher
      • $\mathcal{C}^{(i)} \gets \min(\mathcal{C}^{(i)}, \alpha)$ // Clip salience score
      • $w^{(i)} \gets e^{-\lambda \cdot (i - |p_0|)}$ // Position-aware weight
      • $\text{score}^{(i)} \gets w^{(i)} \cdot \mathcal{C}^{(i)}$
    • $n_t \gets \text{NumToReveal}(t, T, |\mathcal{M}_t|)$
    • $\mathcal{S}_t \gets \text{TopK}(\text{score}, n_t)$ // Select best positions
    • for each index $j \in \mathcal{S}_t$ do
      • $x^j \gets \hat{x}_0^j$ // Reveal selected token
  4. return $x$
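
One reveal step of this loop might look like the following sketch. It presumes a Hugging Face-style model whose forward pass returns .logits, LLaDA's [MASK] token id (assumed here), and the helper conventions from the sketch above; it is an illustration, not the repository's implementation:

import math
import torch

def uncode_step(model, x, prompt_len, p_freq, lambd=0.25, alpha=10.0,
                n_reveal=1, mask_id=126336):
    # One reveal step: score every masked position, unmask the top-n_reveal.
    logits = model(x).logits                   # (1, seq_len, vocab_size)
    probs = torch.softmax(logits, dim=-1)
    p_hat, x0_hat = probs.max(dim=-1)          # confidence and argmax token
    masked = (x[0] == mask_id).nonzero(as_tuple=True)[0].tolist()
    scores = {}
    for i in masked:
        token = x0_hat[0, i].item()
        salience = p_hat[0, i].item() * -math.log(p_freq.get(token, 1e-9))
        salience = min(salience, alpha)        # clip the salience score
        scores[i] = math.exp(-lambd * (i - prompt_len)) * salience
    for i in sorted(scores, key=scores.get, reverse=True)[:n_reveal]:
        x[0, i] = x0_hat[0, i]                 # reveal the selected tokens
    return x

Looping this until no masked positions remain, with n_reveal set by a NumToReveal-style schedule, mirrors the pseudocode above.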

Hyperparameters

  • $\lambda$ (lambda_val): Controls the positional bias strength. Typical values range from 0 (no positional bias) to 1.0 (strong left-to-right bias). Recommended: 0 for Sudoku, 0.25 for most tasks, and 0.5 for Countdown.

  • $\alpha$: Clipping threshold for the calibrated confidence score. Recommended value: 10, which gives stable results across tasks.

  • Background frequency distribution ($p_{\mathcal{D}'}$): Constructed from a comprehensive corpus combining general text, mathematical reasoning problems, and evaluation datasets (see /data/baseline).
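
Building such a background distribution reduces to normalized token counts over the reference corpus. A minimal sketch, assuming a tokenizer object with an encode method (the corpus and tokenizer are stand-ins):

from collections import Counter

def freq_dist(texts, tokenizer):
    # Sketch of FreqDist(D'): relative token frequencies over the corpus.
    counts, total = Counter(), 0
    for text in texts:
        ids = tokenizer.encode(text)
        counts.update(ids)
        total += len(ids)
    return {token: count / total for token, count in counts.items()}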

📧 Contact

If you have questions, suggestions, or bug reports, please email:

pengcheng.neu@outlook.com
