
[DLLM] Add JointThreshold algorithm for joint M2T and T2T decoding#18171

Merged
ispobock merged 2 commits into sgl-project:main from edwardzjl:dllm-editing on Feb 9, 2026

Conversation

@edwardzjl (Contributor) commented Feb 3, 2026

Motivation

This PR introduces the JointThreshold algorithm, which enables the model to simultaneously fill in masks and refine previously generated tokens in a single iterative loop.
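For reviewers unfamiliar with the two modes, a minimal sketch of one decoding step over a single block may help. It mirrors the loop reviewed below, but the free-standing function and its argument names (input_ids, prompt_mask, mask_id) are illustrative assumptions, not the PR's exact API:

import torch
import torch.nn.functional as F

def joint_threshold_step(input_ids, logits, prompt_mask, mask_id,
                         threshold=0.5, edit_threshold=1.0):
    """One illustrative JointThreshold step over a single 1-D block."""
    probs = F.softmax(logits, dim=-1)
    x = probs.argmax(dim=-1)                            # most likely token per position
    p = probs.gather(-1, x.unsqueeze(-1)).squeeze(-1)   # its probability

    mask_index = input_ids == mask_id

    # M2T: commit masked positions whose top-token confidence clears the threshold.
    m2t = mask_index & (p > threshold)
    if mask_index.any() and not m2t.any():
        # Guarantee progress: commit the single most confident masked position.
        conf = torch.where(mask_index, p, torch.full_like(p, float("-inf")))
        m2t[conf.argmax()] = True

    # T2T: re-edit already generated (non-prompt, non-mask) tokens that the model
    # now disagrees with, provided the new prediction is confident enough.
    t2t = ~mask_index & ~prompt_mask & (p > edit_threshold) & (input_ids != x)

    transfer = m2t | t2t
    input_ids[transfer] = x[transfer]
    return input_ids, bool(transfer.any())

The loop repeats until no position changes, or until max_post_edit_steps refinement passes have run after all masks are filled.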

Modifications

Added JointThreshold Algorithm: Implemented a new DLLM algorithm that supports both Mask-to-Token (M2T) and Token-to-Token (T2T) decoding strategies.

Note on Model Compatibility:

The effectiveness of the T2T (Token-to-Token) refinement depends heavily on the underlying model. This algorithm is most effective with models specifically trained or fine-tuned for token-to-token editing tasks. For standard DLLM models without such training, the T2T component may even harm final generation quality.

Accuracy Tests

Since compatible models specifically trained for token-to-token editing are still awaiting public release, unit tests for this commit are currently omitted. I will add test cases once these models are available.

Benchmarking and Profiling

Checklist

Review Process

  1. Ping Merge Oncalls to start the PR flow. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • /tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci
  4. After green CI and required approvals, ask Merge Oncalls to merge.

@gemini-code-assist (Contributor) commented

Summary of Changes

Hello @edwardzjl, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request integrates a new JointThreshold algorithm into the DLLM framework. This innovative approach aims to improve text generation by unifying Mask-to-Token (M2T) and Token-to-Token (T2T) decoding. The algorithm iteratively fills masked tokens and refines existing ones, offering a more dynamic and potentially higher-quality output, particularly when used with models specifically fine-tuned for T2T editing.

Highlights

  • New Decoding Algorithm: Introduced a new DLLM algorithm named JointThreshold for enhanced text generation.
  • Combined Decoding Strategies: This algorithm uniquely combines Mask-to-Token (M2T) and Token-to-Token (T2T) decoding strategies within a single iterative loop.
  • Iterative Refinement: It allows for simultaneous filling of masked tokens and refinement of previously generated tokens, aiming for more dynamic and accurate outputs.
  • Model Compatibility Note: The effectiveness of the T2T component is highlighted as being highly dependent on the underlying model's specific training or fine-tuning for token editing tasks.


Changelog
  • python/sglang/srt/dllm/algorithm/joint_threshold.py
    • Added a new Python file defining the JointThreshold class, which implements a novel DLLM decoding algorithm.
    • The __init__ method initializes configurable parameters such as threshold, edit_threshold, and max_post_edit_steps (see the configuration sketch after this list).
    • The run method contains the core logic for iterative decoding, handling both M2T (mask filling based on confidence) and T2T (token refinement based on prediction and confidence) operations.
    • Includes logic to manage batch processing, track finished sequences, and determine when to perform additional forward passes for KV cache persistence.
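Based on the changelog above, selecting the algorithm would be configured roughly as below. The keys match the parameters read in JointThreshold.__init__, but how algorithm_config reaches the server depends on sglang's DLLM setup, so treat this as a sketch:

# Hypothetical configuration sketch (exact plumbing may differ in sglang):
algorithm_config = {
    "threshold": 0.9,          # min confidence to commit a masked position (M2T)
    "edit_threshold": 1.0,     # 1.0 effectively disables T2T re-editing
    "max_post_edit_steps": 2,  # extra refinement passes after all masks are filled
}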

@gemini-code-assist bot left a comment

Code Review

This pull request introduces the JointThreshold algorithm for joint Mask-to-Token and Token-to-Token decoding. The implementation is clear and follows the described logic. My review includes suggestions to improve performance by vectorizing parts of the code, a correction to a type hint for better code quality, and a recommendation to change a default parameter value to prevent potentially harmful behavior for users, as noted in the pull request description.

):
    super().__init__(config)
    self.threshold = config.algorithm_config.get("threshold", 0.5)
    self.edit_threshold = config.algorithm_config.get("edit_threshold", 0)

high

The default value for edit_threshold is 0. Since p is a probability from softmax, it will always be greater than 0. This means the Token-to-Token (T2T) editing is enabled by default and will aggressively replace any generated token that is not the most likely one according to the current logits. The PR description warns that T2T can harm generation quality for models not trained for it. A safer default would be 1.0, which would effectively disable T2T unless explicitly configured by the user with a lower value.

Suggested change:
-        self.edit_threshold = config.algorithm_config.get("edit_threshold", 0)
+        self.edit_threshold = config.algorithm_config.get("edit_threshold", 1.0)
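To see why 0 is an aggressive default: softmax assigns strictly positive probability to every token, so the top-token probability p always satisfies p > 0, and T2T fires at every position where the argmax differs from the current token. A quick standalone check illustrates this (not part of the PR):

import torch
import torch.nn.functional as F

logits = torch.randn(4, 32000)                    # arbitrary logits over a vocabulary
p = F.softmax(logits, dim=-1).max(dim=-1).values  # confidence of each argmax token
assert (p > 0).all()                              # always holds, so `p > 0` never gates T2T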

    self,
    model_runner: ModelRunner,
    forward_batch: ForwardBatch,
) -> tuple[LogitsProcessorOutput | torch.Tensor, torch.Tensor | None, bool]:

medium

The return type hint for the run method's second element is torch.Tensor | None, but the implementation returns a list of tensors ([] or next_token_ids_list). This should be updated to list[torch.Tensor] for correctness and consistency with other algorithms like LowConfidence.

Suggested change:
-) -> tuple[LogitsProcessorOutput | torch.Tensor, torch.Tensor | None, bool]:
+) -> tuple[LogitsProcessorOutput | torch.Tensor, list[torch.Tensor], bool]:

Comment on lines +38 to +47
start_list = []
prompt_masks = []
for i in range(batch_size):
    block_start = i * self.block_size
    block_end = block_start + self.block_size
    block_input_ids = forward_batch.input_ids[block_start:block_end]

    prompt_mask = block_input_ids != self.mask_id
    prompt_masks.append(prompt_mask)
    start_list.append(prompt_mask.sum().item())

medium

This loop to compute start_list and prompt_masks can be vectorized for better performance. By reshaping forward_batch.input_ids and using batched tensor operations, you can avoid iterating over the batch size in Python.

        reshaped_input_ids = forward_batch.input_ids.view(batch_size, self.block_size)
        prompt_masks_tensor = reshaped_input_ids != self.mask_id
        prompt_masks = list(torch.unbind(prompt_masks_tensor))
        start_list = prompt_masks_tensor.sum(dim=1).tolist()

Comment on lines 67 to 118
for i in range(batch_size):
    if finished[i]:
        continue

    block_start = i * self.block_size
    block_end = block_start + self.block_size

    curr_input_ids = forward_batch.input_ids[block_start:block_end]
    curr_logits = logits_output.full_logits[block_start:block_end]
    curr_prompt_mask = prompt_masks[i]

    x = torch.argmax(curr_logits, dim=-1)
    p = torch.squeeze(
        torch.gather(
            F.softmax(curr_logits, dim=-1),
            dim=-1,
            index=torch.unsqueeze(x, -1),
        ),
        -1,
    )

    mask_index = curr_input_ids == self.mask_id
    has_mask = mask_index.any()

    # Mask to token (M2T)
    mask_transfer_index = torch.zeros_like(mask_index)
    if has_mask:
        confidence = torch.where(mask_index, p, -np.inf)
        mask_transfer_index = confidence > self.threshold

        if not mask_transfer_index.any():
            _, select_index = torch.topk(confidence, k=1)
            mask_transfer_index[select_index] = True
    else:
        post_edit_steps[i] += 1
        if post_edit_steps[i] > self.max_post_edit_steps:
            finished[i] = True
            continue

    # Token to token (T2T)
    edit_mask = ~mask_index & ~curr_prompt_mask
    edit_transfer_index = (
        (p > self.edit_threshold) & (curr_input_ids != x) & edit_mask
    )

    transfer_index = mask_transfer_index | edit_transfer_index
    if not transfer_index.any():
        finished[i] = True
        continue

    curr_input_ids[transfer_index] = x[transfer_index]
    any_changed_in_last_step = True

medium

The main logic inside the decoding loop iterates over each sequence in the batch individually. This can be a performance bottleneck. Most of the operations within this loop are tensor-based and could be vectorized to operate on the entire batch at once. This would involve reshaping inputs and using masks to handle per-sequence conditional logic. While more complex to implement, it would significantly improve efficiency.
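As a rough illustration of the suggested direction (not a drop-in replacement: it assumes finished is kept as a bool tensor, reuses prompt_masks_tensor from the earlier suggestion, and omits the per-sequence post_edit_steps bookkeeping, which needs masked updates of its own):

# Illustrative batched core, assuming [batch_size, block_size] views:
ids = forward_batch.input_ids.view(batch_size, self.block_size)
logits = logits_output.full_logits.view(batch_size, self.block_size, -1)
probs = F.softmax(logits, dim=-1)
x = probs.argmax(dim=-1)
p = probs.gather(-1, x.unsqueeze(-1)).squeeze(-1)

mask_index = ids == self.mask_id
active = ~finished.unsqueeze(-1)        # finished: bool tensor of shape [batch_size]

# M2T for the whole batch at once.
m2t = mask_index & (p > self.threshold) & active
conf = torch.where(mask_index, p, torch.full_like(p, float("-inf")))
need_fallback = mask_index.any(dim=-1) & ~m2t.any(dim=-1) & ~finished
top1 = conf.argmax(dim=-1)
m2t[need_fallback, top1[need_fallback]] = True  # commit top-1 where nothing cleared

# T2T for the whole batch at once.
t2t = ~mask_index & ~prompt_masks_tensor & (p > self.edit_threshold) & (ids != x) & active

transfer = m2t | t2t
ids[transfer] = x[transfer]             # in-place on the view updates input_ids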

@ClawSeven (Contributor) commented

Please add unit tests for the edit pattern, and include accuracy and performance benchmarks. An analysis using the GSM8K dataset would be sufficient.

github-actions bot added the documentation label Feb 8, 2026
Co-authored-by: Tiwei Bie <tiwei.btw@antgroup.com>
Signed-off-by: Junlin Zhou <zhoujunlin.zjl@antgroup.com>
@ispobock (Collaborator) commented Feb 8, 2026

/tag-and-rerun-ci

github-actions bot added the run-ci label Feb 8, 2026
@ClawSeven (Contributor) commented

/rerun-failed-ci

@zhaochenyang20 (Collaborator) commented

/rerun-failed-ci

@ispobock ispobock merged commit 1465224 into sgl-project:main Feb 9, 2026
58 of 95 checks passed
@wenxuewuhd commented
Hi, is it possible to have the accuracy tests and benchmark results on GSM8K with LLaDA2.1? Thanks.

Johnsonms pushed a commit to Johnsonms/sglang that referenced this pull request Feb 14, 2026
…gl-project#18171)

Signed-off-by: Junlin Zhou <zhoujunlin.zjl@antgroup.com>
Co-authored-by: Tiwei Bie <tiwei.btw@antgroup.com>

Labels

documentation · high priority · run-ci
