Lumina-DiMOO: An Omni Diffusion Large Language Model for Multi-Modal Generation and Understanding

[📑 Technical Report]   [🌐 Project Page (Demo & Benchmark)]   [🤗 Model]

¹Shanghai Innovation Institute, ²Shanghai AI Laboratory, ³Shanghai Jiao Tong University, ⁴Nanjing University

⁵The University of Sydney, ⁶The Chinese University of Hong Kong, ⁷Tsinghua University

📚 Introduction

We introduce Lumina-DiMOO, an omni foundational model for seamless multimodal generation and understanding. Lumina-DiMOO is distinguished by four key innovations:

  • Unified Discrete Diffusion Architecture: Lumina-DiMOO sets itself apart from prior unified models by employing fully discrete diffusion modeling to handle inputs and outputs across various modalities.

  • Versatile Multimodal Capabilities: Lumina-DiMOO supports a broad spectrum of multimodal tasks, including text-to-image generation (supporting arbitrary and high resolutions) and image-to-image generation (e.g., image editing, subject-driven generation, and image inpainting), alongside advanced image understanding.

  • Higher Sampling Efficiency: Compared to previous AR or hybrid AR-diffusion paradigms, Lumina-DiMOO demonstrates remarkable sampling efficiency. Additionally, we design a bespoke caching method that further doubles the sampling speed.

  • Superior Performance: Lumina-DiMOO achieves state-of-the-art performance on multiple benchmarks, surpassing existing open-source unified multimodal models, setting a new standard in the field.

🔥 News

  • [2025-11-27] 🎉 We have released the evaluation code using VLMEvalKit.
  • [2025-10-24] 🎉 We have released a guide for those who want to build worlds with the mask paradigm; see more details on arXiv and GitHub.
  • [2025-10-21] 🎉🎉🎉 We've added support for Diffusers and ComfyUI.
  • [2025-10-06] Training code is released.
  • [2025-09-25] We have released the Technical Report.
  • [2025-09-20] 🎉 In the latest UniGenBench Leaderboard (maintained by the Tencent Hunyuan Team), Lumina-DiMOO's generation evaluation ranks 1st 🥇 among all open-source unified models.
  • [2025-09-12] We have open-sourced the Image Inpainting & Extrapolation code.
  • [2025-09-11] We have open-sourced the Max Logit-based Cache solution, offering a 2x speed improvement for sampling.
  • [2025-09-10] 🎉 We release the initial version of Lumina-DiMOO, including:
    • 🎯 Model Checkpoints on HuggingFace!
    • 🎯 Text-to-Image & Image-to-Image Generation Inference Code!
    • 🎯 Image Understanding Inference Code!
    • 🎯 Website & Demo on Project Page!

๐Ÿ“ Open-Source Plan

  • Image Inpainting & Extrapolation Code
  • Fast Sampling with Max Logit-based Cache
  • Diffusers and ComfyUI
  • Benchmark Evaluation Code
  • Fine-Tuning Code
  • Technical Report

๐Ÿ“ฝ๏ธ Qualitative Results

Here we present comparative generation results against other models. For additional visualization results, please see our Project Page.

Text-to-Image Comparison
Image Editing Comparison
Controllable & Subject-Driven Generation Comparison
Image Inpainting & Extrapolation

📊 Quantitative Performance

GenEval Benchmark
DPG Benchmark
OneIG-EN Benchmark
TIIF Benchmark
Image-to-Image Benchmark
Image Understanding Benchmark

🚀 Sampling Speed Analysis

  • Since text generation is performed block-wise, unlike image generation, which uses a single global decoding pass, its speed is influenced by both the number of blocks and the number of sampling steps. The speed improvement for image understanding is therefore less pronounced than for image generation.

  • Lumina-DiMOO Settings: For image generation, we sample 64 steps. For image understanding, we set the block length to 256 and the number of sampling steps to 128.

📌 Quick Start

โš™๏ธ Installation

1. Create a conda environment

git clone https://github.com/Alpha-VLLM/Lumina-DiMOO.git && cd Lumina-DiMOO
conda create -n lumina_dimoo python=3.10 -y
conda activate lumina_dimoo

2. Install dependencies

pip install -r requirements.txt
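
Optionally, a quick sanity check (assuming requirements.txt installs PyTorch and a CUDA-capable GPU is available) can confirm the environment before running inference:

python -c "import torch; print(torch.__version__, torch.cuda.is_available())"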

🧨 How to Fine-Tune Lumina-DiMOO

Step 1: Pre-extract discrete codes of training images.

For the final record format after processing, refer to the sample JSON files assets/mmu_sample.json and assets/t2i_sample.json.

bash pre_tokenizer/run_pre_token.sh
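
To get a quick look at the expected record layout (a convenience inspection step, not part of the official pipeline), you can print the beginning of the bundled samples:

head -c 1000 assets/t2i_sample.json
head -c 1000 assets/mmu_sample.json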

Step 2: Train Lumina-DiMOO model.

bash train/train.sh

🚗 Text-to-Image Generation Inference

1. Normal Sampling

python inference/inference_t2i.py \
    --checkpoint Alpha-VLLM/Lumina-DiMOO \
    --prompt "A striking photograph of a glass of orange juice on a wooden kitchen table, capturing a playful moment. The orange juice splashes out of the glass and forms the word \"Smile\" in a whimsical, swirling script just above the glass. The background is softly blurred, revealing a cozy, homely kitchen with warm lighting and a sense of comfort." \
    --height 768 \
    --width 1536 \
    --timesteps 64 \
    --cfg_scale 4.0 \
    --seed 65513 \
    --vae_ckpt Alpha-VLLM/Lumina-DiMOO \
    --output_dir output/results_text_to_image

2. DDP Sampling

To support large-scale sampling and testing, we also provide DDP sampling scripts for multi-GPU parallel sampling.

torchrun --nproc_per_node=8 inference/inference_t2i_ddp.py \
    --checkpoint Alpha-VLLM/Lumina-DiMOO \
    --prompt_path /path/to/prompts.jsonl \
    --height 1024 \
    --width 1024 \
    --timesteps 64 \
    --cfg_scale 4.0 \
    --vae_ckpt Alpha-VLLM/Lumina-DiMOO \
    --output_dir output/results_image_to_image_ddp \
    --output_json output/results_image_to_image_ddp/results.json
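
The exact schema expected for --prompt_path is defined by inference/inference_t2i_ddp.py; as a purely illustrative assumption, a common layout is one JSON object per line with a prompt field, e.g.:

{"prompt": "A red vintage bicycle leaning against a brick wall at sunset"}
{"prompt": "A minimalist watercolor painting of a mountain lake at dawn"}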

3. Faster Sampling with Cache

  • Add --use-cache to accelerate sampling via the max logit-based cache (ML-Cache). The efficiency-quality trade-off can be tuned with cache_ratio (in (0, 1); higher is faster), warmup_ratio (in [0, 1); lower is faster), and refresh_interval (in (1, timesteps - int(warmup_ratio * timesteps) - 1]; higher is faster). For example, with timesteps 64 and warmup_ratio 0.3, the warmup phase occupies int(0.3 * 64) = 19 steps, so refresh_interval can be at most 64 - 19 - 1 = 44; the command below uses 5.
python inference/inference_t2i.py \
    --checkpoint Alpha-VLLM/Lumina-DiMOO \
    --prompt "A striking photograph of a glass of orange juice on a wooden kitchen table, capturing a playful moment. The orange juice splashes out of the glass and forms the word \"Smile\" in a whimsical, swirling script just above the glass. The background is softly blurred, revealing a cozy, homely kitchen with warm lighting and a sense of comfort." \
    --height 768 \
    --width 1536 \
    --timesteps 64 \
    --cfg_scale 4.0 \
    --seed 65513 \
    --vae_ckpt Alpha-VLLM/Lumina-DiMOO \
    --output_dir output/results_text_to_image_usecache \
    --use-cache \
    --cache_ratio 0.9 \
    --warmup_ratio 0.3 \
    --refresh_interval 5
  • We provide the inference time and GPU memory on one A800 as a reference:

| Method | Inference Time | Inference GPU Memory |
| --- | --- | --- |
| Lumina-DiMOO | 58.2s | 38.9 GB |
| + ML-Cache | 32.2s | 45.9 GB |

🌟 Image-to-Image Inference

1. Controllable Generation: "hed_control", "depth_control", "openpose_control", "subject_driven".

python inference/inference_i2i.py \
    --checkpoint Alpha-VLLM/Lumina-DiMOO \
    --prompt "A functional wooden printer stand.Nestled next to a brick wall in a bustling city street, it stands firm as pedestrians hustle by, illuminated by the warm glow of vintage street lamps." \
    --image_path examples/example_2.jpg \
    --edit_type depth_control \
    --timesteps 64 \
    --cfg_scale 2.5 \
    --cfg_img 4.0 \
    --vae_ckpt Alpha-VLLM/Lumina-DiMOO \
    --output_dir output/results_image_to_image

2. Subject-Driven Generation.

python inference/inference_i2i.py \
    --checkpoint Alpha-VLLM/Lumina-DiMOO \
    --prompt "A creamy, rich-flavored dark beverage.Captured in a bustling urban street at twilight, this item is placed on an outdoor cafรฉ table, as city lights begin to twinkle and passersby create a lively atmosphere." \
    --image_path examples/example_3.jpg \
    --edit_type subject_driven \
    --timesteps 64 \
    --cfg_scale 2.5 \
    --cfg_img 4.0 \
    --vae_ckpt Alpha-VLLM/Lumina-DiMOO \
    --output_dir output/results_image_to_image

3. Image Editing: "edit_add", "edit_remove", "edit_replace", "edit_background", "edit_text_transfer".

python inference/inference_i2i.py \
    --checkpoint Alpha-VLLM/Lumina-DiMOO \
    --prompt "Add a beige shed with brown trim and double doors with a diamond pattern in the center-right, occupying more than a third of the image." \
    --image_path examples/example_4.png \
    --edit_type edit_add \
    --timesteps 64 \
    --cfg_scale 2.5 \
    --cfg_img 4.0 \
    --vae_ckpt Alpha-VLLM/Lumina-DiMOO \
    --output_dir output/results_image_to_image

4. Style Transfer (An Image as Style Reference)

python inference/inference_i2i.py \
    --checkpoint Alpha-VLLM/Lumina-DiMOO \
    --prompt "Transform the current image into the style of the provided image." \
    --image_path examples/example_5.png \
    --ref_image_path examples/example_5_style.png \
    --edit_type image_ref_transfer \
    --timesteps 64 \
    --cfg_scale 2.5 \
    --cfg_img 4.0 \
    --vae_ckpt Alpha-VLLM/Lumina-DiMOO \
    --output_dir output/results_image_to_image

5. Dense Prediction: "canny_pred", "hed_pred", "depth_pred", "openpose_pred", "canny_control".

python inference/inference_i2i.py \
    --checkpoint Alpha-VLLM/Lumina-DiMOO \
    --prompt "Generate a canny edge map accroding to the image." \
    --image_path examples/example_1.png \
    --edit_type canny_pred \
    --timesteps 64 \
    --cfg_scale 2.5 \
    --cfg_img 4.0 \
    --vae_ckpt Alpha-VLLM/Lumina-DiMOO \
    --output_dir output/results_image_to_image

๐Ÿƒ Image Inpainting & Extrapolation Inference

1. Image Inpainting

python inference/inference_t2i.py \
    --checkpoint Alpha-VLLM/Lumina-DiMOO \
    --prompt "Porsche showroom. Make there be a Porsche logo on the back wall behind the car." \
    --painting_mode inpainting \
    --painting_image examples/example_8.png \
    --mask_h_ratio 0.5 \
    --mask_w_ratio 0.5 \
    --timesteps 64 \
    --cfg_scale 4.0 \
    --seed 65513 \
    --vae_ckpt Alpha-VLLM/Lumina-DiMOO \
    --output_dir output/results_text_to_image

2. Image Extrapolation

python inference/inference_t2i.py \
    --checkpoint Alpha-VLLM/Lumina-DiMOO \
    --prompt "A photograph showcasing a pale gold moon, partially veiled by wispy cirrus clouds, dominating a dramatic twilight sky. The moon's soft glow reflects on the tranquil surface of a lake below, creating a shimmering mirror effect, while a small wooden rowboat gently bobs on the water's edge. Dark silhouettes of tall, ancient pine trees encircle the lake, their branches reaching towards the sky like skeletal fingers, as a gentle mist hangs low, diffusing the moonlight and adding a sense of serene mystery. The scene is bathed in soft, cool lighting, creating an ethereal and captivating atmosphere." \
    --painting_mode outpainting \
    --painting_image examples/example_7.png \
    --mask_h_ratio 1 \
    --mask_w_ratio 0.2 \
    --timesteps 64 \
    --cfg_scale 4.0 \
    --seed 65513 \
    --vae_ckpt Alpha-VLLM/Lumina-DiMOO \
    --output_dir output/results_text_to_image

โšก๏ธ Image Understanding Inference

python inference/inference_mmu.py \
    --checkpoint Alpha-VLLM/Lumina-DiMOO \
    --prompt "Please describe this image." \
    --image_path examples/example_6.jpg \
    --steps 128 \
    --gen_length 128 \
    --block_length 32 \
    --vae_ckpt Alpha-VLLM/Lumina-DiMOO \
    --output_dir output/outputs_text_understanding

๐Ÿ† Benchmark Evaluation

We utilize VLMEvalKit from OpenCompass to evaluate Lumina-DiMOO across multiple benchmarks.

1. Preparation

Navigate to the VLMEvalKit directory and install the required dependencies:

cd VLMEvalKit
pip install -r requirements.txt

โš ๏ธ Important Note: We utilize an LLM as the judge model for answer matching. Before running the evaluation, you need edit the VLMEvalKit/.env file to fill in your OPENAI_API_KEY and OPENAI_API_BASE.

2. Supported Benchmarks

We support evaluation on the following 5 benchmarks. Please use the corresponding Data Name in the command arguments:

| Benchmark | Data Name (--data) |
| --- | --- |
| POPE | POPE |
| MME | MME |
| MMBench | MMBench_DEV_EN |
| SEEDBench | SEEDBench_IMG |
| MMMU | MMMU_DEV_VAL |

3. Run Evaluation

You can perform the evaluation using either a single GPU or multiple GPUs.

Single GPU Evaluation:

python3 run.py --data MMMU_DEV_VAL --model Lumina_DiMOO --verbose

Multi-GPU Evaluation (8 GPUs):

torchrun --nproc-per-node=8 --master_port=29500 run.py \
    --data MMMU_DEV_VAL \
    --model Lumina_DiMOO \
    --verbose

📜 Acknowledgements

This work was also supported and implemented by MindSpeed MM, an open-source training framework for large-scale multimodal models designed for distributed training, developed and maintained by Huawei's Computing Product Line. Specifically optimized for Huawei's Ascend AI chips, MindSpeed MM offers comprehensive support for distributed training and is tailored to a wide range of multimodal tasks.

📖 BibTeX

@article{xin2025lumina,
  title={Lumina-DiMOO: An Omni Diffusion Large Language Model for Multi-Modal Generation and Understanding},
  author={Xin, Yi and Qin, Qi and Luo, Siqi and Zhu, Kaiwen and Yan, Juncheng and Tai, Yan and Lei, Jiayi and Cao, Yuewen and Wang, Keqi and Wang, Yibin and others},
  journal={arXiv preprint arXiv:2510.06308},
  year={2025}
}
