
Empirical Analysis of Decoding Biases in Masked Diffusion Models

📖 Introduction · 🎉 News · ✨ Pipeline · ⚡️ Evaluation · 📈 Decoding Trajectory · 💻 Algorithm · 📧 Contact

📖 Introduction

Uɴᴄᴏᴅᴇ is a novel decoding strategy for Masked Diffusion Models (MDMs) that unifies global trajectory planning with content-aware informativeness maximization. It addresses the key limitations of traditional uncertainty-based samplers when applied to MDMs: a rigid boundary bias and a bias toward "trivial tokens." By using a position-aware weighting mechanism and a calibrated confidence score, Uɴᴄᴏᴅᴇ guides the decoding path and prevents the premature selection of unimportant tokens, significantly improving generation quality.

🎉 News

  • 2025-09-12: This release adds enhanced support for decoding with LLaDA, integrating a variety of recent semi- and non-autoregressive sampling strategies, including ReMDM, Fast-dLLM, Semi-AR, and the margin-based, entropy-based, and confidence-based samplers.
  • 2025-08-19: Released our paper on arXiv and our code on GitHub.

✨ Pipeline

(Figure: overview of the Uɴᴄᴏᴅᴇ pipeline.)

Uɴᴄᴏᴅᴇ is a novel decoding strategy designed for advanced Masked Diffusion Models (MDMs) such as LLaDA and Dream. These models are powerful non-autoregressive alternatives for sequence generation, enabling flexible decoding through the iterative denoising of masked tokens.

⚙️ Setup

git clone https://github.com/NEUIR/Uncode.git
cd Uncode
conda create --name uncode python=3.10
conda activate uncode
pip install -r requirements.txt

📃 Evaluation

Our method and all baseline methods can be evaluated on mathematical reasoning, code generation, and question-answering datasets.

Eval Case

This is an example of evaluating Uɴᴄᴏᴅᴇ on the HumanEval dataset. You can change --task and --mode to evaluate other datasets and decoding methods.

cd scripts
python eval.py \
    --task 'humaneval' \
    --model_name 'GSAI-ML/LLaDA-8B-Instruct' \
    --device 'cuda:5' \
    --gen_length 256 \
    --steps 256 \
    --block_length 256 \
    --mode pc_sampler \
    --lambd 0.25 \
    --alpha 10 \
    --data_path ../data/humaneval.jsonl \
    --result_path results/humaneval_pc_sampler

Following are the evaluation bash scripts for all decoding methods; run each from the scripts directory (cd scripts first):

  • Semi-Autoregressive: bash eval_semi_ar.sh
  • Entropy: bash eval_entropy.sh
  • EB-Sampler: bash eval_eb_sampler.sh
  • Fast-dLLM: bash eval_fast_dllm.sh
  • Margin: bash eval_margin.sh
  • PC-sampler: bash eval_pc_sampler.sh
  • ReMDM: bash eval_remdm.sh
  • Linear_Position: bash eval_linear_position.sh

Evaluation of Decoding Methods

All decoding methods are evaluated on the same set of datasets: HumanEval, MBPP, GSM8K, MATH-500, GPQA, Countdown, and Sudoku. Evaluation results are saved in the results folder.

Evaluation Tools

  • For the GSM8K and GPQA datasets, we use lm-eval for evaluation.
  • For the remaining datasets, please refer to scripts/eval.py for more details.

Consistency Note

All methods are evaluated using the same set of evaluation scripts (including both lm-eval and our custom script) to ensure consistent assessment.

Plotting Heatmaps

We provide a script to generate heatmaps for the decoding trajectories of different decoding methods. The script is located in scripts/heatmap.sh.

cd scripts
bash heatmap.sh

Results

The heatmap results are saved in the heatmap_results folder.

📈 Decoding Trajectory

The choice of decoding strategy significantly impacts the generation order of Masked Diffusion Models (MDMs). A critical limitation of existing uncertainty-based methods is their tendency to exhibit a "U-shaped" trajectory (the rigid boundary bias), in which tokens at the sequence boundaries are decoded early, followed by convergence toward the center. This bias stems from the premature unmasking of boundary tokens (BOS and EOS): the attention mechanism's local positional bias inflates confidence for tokens near the sequence boundaries.

In contrast, our Uɴᴄᴏᴅᴇ introduces explicit trajectory control through position-aware weighting, enabling adaptive generation order tailored to task requirements. Below, we visualize the decoding trajectories on the GSM8K dataset for four representative sampling strategies:

🔍 Trajectory Visualizations on GSM8K

(Decoding-trajectory heatmaps on GSM8K for four sampling strategies: confidence-based, entropy-based, margin-based, and Uɴᴄᴏᴅᴇ.)

🔑 Key Observations

  • Rigid Boundary Bias in Uncertainty-based Methods: Confidence, entropy, and margin-based samplers consistently exhibit the characteristic U-shaped pattern, with early decoding of tokens at both sequence boundaries. This behavior limits their ability to capture global dependencies required for complex reasoning tasks like mathematical problem-solving.

  • Trivial Token Bias: Uncertainty-based samplers tend to prioritize semantically trivial, high-frequency tokens (e.g., newline characters, spaces, common words like "the", and punctuation marks such as "." and "!") during decoding, leading to suboptimal reasoning paths.

  • Debias with Uɴᴄᴏᴅᴇ: Our method eliminates the U-shaped bias by regulating the decoding path through exponential positional weighting. This enables a more natural progression that aligns with the logical flow of reasoning tasks, as demonstrated by the sequential trajectory on the GSM8K dataset (see the short sketch below).
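
To make the positional weighting concrete, here is a toy illustration in Python (the prompt length here is an assumed value; λ = 0.25 is the default recommended below):

import math

lambd, prompt_len = 0.25, 16  # assumed values for illustration
for i in range(prompt_len, prompt_len + 5):
    w = math.exp(-lambd * (i - prompt_len))
    print(i - prompt_len, round(w, 3))
# Prints 0 1.0, 1 0.779, 2 0.607, 3 0.472, 4 0.368: earlier answer positions
# receive exponentially larger weights, steering decoding left-to-right
# instead of toward the sequence boundaries.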

The adaptive trajectory control of Uɴᴄᴏᴅᴇ directly contributes to its superior performance on GSM8K (82.2% accuracy) compared to uncertainty-based alternatives, highlighting the importance of aligning decoding order with task-specific structural demands.

💻 Algorithm

Method Overview

Uɴᴄᴏᴅᴇ is a novel decoding strategy for Masked Diffusion Models (MDMs) that addresses key limitations of existing uncertainty-based sampling methods. It unifies global trajectory planning with content-aware informativeness maximization through two core components:

  1. Position-Aware Weighting Mechanism: Regulates the decoding path using an exponential decay function to enable flexible control over the generation order, adapting to task-specific structural demands.

  2. Calibrated Confidence Score: Suppresses premature selection of trivial tokens (e.g., punctuation, filler words) by incorporating frequency-based adjustment from a reference corpus, promoting semantically rich content generation.

Extensive experiments across seven benchmarks demonstrate that Uɴᴄᴏᴅᴇ consistently outperforms existing MDM decoding strategies by more than 10% on average, narrowing the performance gap with state-of-the-art autoregressive models.
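
As a rough sketch of how these two components combine into a selection score (the function names, defaults, and the 1e-9 frequency floor are our own illustration, not the repository's API; p_freq stands for the background distribution $p_{\mathcal{D}'}$):

import math

def position_weight(i, prompt_len, lambd=0.25):
    # Exponential decay over answer positions; lambd = 0 disables the bias.
    return math.exp(-lambd * (i - prompt_len))

def calibrated_confidence(p_hat, token, p_freq, alpha=10.0):
    # Confidence scaled by corpus rarity (-log frequency), clipped at alpha,
    # so frequent "trivial" tokens are deprioritized.
    salience = p_hat * -math.log(p_freq.get(token, 1e-9))
    return min(salience, alpha)

def uncode_score(i, prompt_len, p_hat, token, p_freq, lambd=0.25, alpha=10.0):
    # Final selection score: position weight times calibrated confidence.
    return position_weight(i, prompt_len, lambd) * \
           calibrated_confidence(p_hat, token, p_freq, alpha)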

Algorithm Workflow

The complete workflow of Uɴᴄᴏᴅᴇ is summarized in the following algorithm:

Require: Predictor $p_\theta$, prompt $p_0$, answer length $L$, steps $T$, hyperparameters $\lambda, \alpha$; reference corpus $\mathcal{D}'$

  1. $p_{\mathcal{D}'} \gets \text{FreqDist}(\mathcal{D}')$ // Background token frequencies
  2. $x \gets \text{Concat}(p_0, \text{[MASK]} \times L)$
  3. for $t = 1$ to $T$ do
    • $\mathcal{M}_t \gets \{\, i \mid x^i = \text{[MASK]} \,\}$ // Get mask indices
    • if $\mathcal{M}_t = \emptyset$ then
      • break
    • $\hat{x}_0, \hat{p} \gets p_\theta(\cdot \mid x)$ // Predicted tokens and confidences
    • for each position $i \in \mathcal{M}_t$ do
      • $\mathcal{C}^{(i)} \gets \hat{p}^{(i)} \cdot \bigl(-\log p_{\mathcal{D}'}(\hat{x}_0^{(i)})\bigr)$ // Calibrated confidence: rarer tokens score higher
      • $\mathcal{C}^{(i)} \gets \min(\mathcal{C}^{(i)}, \alpha)$ // Clip salience score
      • $w^{(i)} \gets e^{-\lambda \cdot (i - |p_0|)}$ // Position-aware weight
      • $\text{score}^{(i)} \gets w^{(i)} \cdot \mathcal{C}^{(i)}$
    • $n_t \gets \text{NumToReveal}(t, T, |\mathcal{M}_t|)$
    • $\mathcal{S}_t \gets \text{TopK}(\text{score}, n_t)$ // Select best positions
    • for each index $j \in \mathcal{S}_t$ do
      • $x^j \gets \hat{x}_0^j$ // Reveal selected token
  4. return $x$
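
One reveal step of this loop might look like the following sketch. It presumes a Hugging Face-style model whose forward pass returns .logits, LLaDA's [MASK] token id (assumed here), and the helper conventions from the sketch above; it is an illustration, not the repository's implementation:

import math
import torch

def uncode_step(model, x, prompt_len, p_freq, lambd=0.25, alpha=10.0,
                n_reveal=1, mask_id=126336):
    # One reveal step: score every masked position, unmask the top-n_reveal.
    logits = model(x).logits                   # (1, seq_len, vocab_size)
    probs = torch.softmax(logits, dim=-1)
    p_hat, x0_hat = probs.max(dim=-1)          # confidence and argmax token
    masked = (x[0] == mask_id).nonzero(as_tuple=True)[0].tolist()
    scores = {}
    for i in masked:
        token = x0_hat[0, i].item()
        salience = p_hat[0, i].item() * -math.log(p_freq.get(token, 1e-9))
        salience = min(salience, alpha)        # clip the salience score
        scores[i] = math.exp(-lambd * (i - prompt_len)) * salience
    for i in sorted(scores, key=scores.get, reverse=True)[:n_reveal]:
        x[0, i] = x0_hat[0, i]                 # reveal the selected tokens
    return x

Looping this until no masked positions remain, with n_reveal set by a NumToReveal-style schedule, mirrors the pseudocode above.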

Hyperparameters

  • $\lambda$ (lambda_val): Controls the positional bias strength. Typical values range from 0 (no positional bias) to 1.0 (strong left-to-right bias). Recommended: 0 for Sudoku, 0.25 for most tasks, and 0.5 for Countdown.

  • $\alpha$: Clipping threshold for the calibrated confidence score. Recommended value: 10, which gives stable results across tasks.

  • Background frequency distribution ($p_{\mathcal{D}'}$): Constructed from a comprehensive corpus combining general text, mathematical reasoning problems, and evaluation datasets (see /data/baseline).
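
Building such a background distribution reduces to normalized token counts over the reference corpus. A minimal sketch, assuming a tokenizer object with an encode method (the corpus and tokenizer are stand-ins):

from collections import Counter

def freq_dist(texts, tokenizer):
    # Sketch of FreqDist(D'): relative token frequencies over the corpus.
    counts, total = Counter(), 0
    for text in texts:
        ids = tokenizer.encode(text)
        counts.update(ids)
        total += len(ids)
    return {token: count / total for token, count in counts.items()}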

📧 Contact

If you have questions, suggestions, or bug reports, please email:

pengcheng.neu@outlook.com
