vivoCameraResearch/Magic-World
⭐ MagicWorld achieves 18 FPS on an L40S GPU and stronger VBench results on the RealWM120K-Val dataset. ⭐

arXiv · Project · HuggingFace · ModelScope · License

This repository is the official implementation of MagicWorld, an interactive video world model that lets you explore a dynamic scene created from a single scene image through continuous keyboard actions (W, A, S, D), while maintaining structural and temporal consistency.

💡 Highlights

  • Motion Drift Constraint: We introduce a flow-guided motion preservation constraint that enforces temporal coherence in dynamic regions, preventing motion drift and ensuring realistic motion evolution of dynamic subjects.
  • Long-Horizon Stability: We design a history cache retrieval strategy to preserve historical scene states during autoregressive rollout, and an enhanced interactive training strategy based on multi-shot aggregated DMD with dual-reward weighting, jointly improving long-horizon stability and reducing error accumulation.
  • RealWM120K Dataset: We build the RealWM120K dataset with diverse citywalk videos and multimodal annotations for real-world video world modeling.

📣 News

  • 2026/03/20: We open-source the MagicWorld v1.5 codebase, including training and inference scripts.
  • 2026/02/10: We open-source the MagicWorld v1 codebase, including training and inference scripts.
  • 2025/11/24: Our paper on arXiv is available 🥳!

✅ To-Do List for MagicWorld Release

  • ✅ Release the source code of MagicWorld
  • ✅ Release the source code of MagicWorld-Fast
  • ✅ Update MagicWorld training configuration and instructions
  • ✅ Release the MagicWorld pretrained weights
  • ✅ Release the MagicWorld-Fast pretrained weights
  • [ ] Release the RealWM120K dataset and processing tools
  • [ ] Update the ODE construction and initialization training code
  • [ ] Update the MagicWorld-Fast training code

Video Demo

MagicWorld

💻 Installation

Create a conda environment & install requirements

# Python 3.12.9, CUDA 12.3, PyTorch 2.2
conda create -n magicworld python=3.12.9
conda activate magicworld
pip install -r requirements.txt

If installing Flash Attention fails, manually download a prebuilt wheel that matches your Python, CUDA, and Torch versions, then install it with pip, e.g., pip install flash_attn-2.7.3+cu12torch2.2cxx11abiFALSE-cp312-cp312-linux_x86_64.whl.

📦 Pretrained Model Weights

| Model | Download | Features |
| --- | --- | --- |
| MagicWorld-Fast | 🤗 HuggingFace · 🤖 ModelScope | DMD-based MagicWorld. |
| MagicWorld | 🤗 HuggingFace · 🤖 ModelScope | Basic framework with geometry condition and history cache retrieval. |
| MagicWorld-Base | 🤗 HuggingFace · 🤖 ModelScope | Basic framework. |
| Wan2.1-Fun-V1.1 | 🤗 HuggingFace · 🤖 ModelScope | Other frozen pretrained weights, such as the VAE and CLIP; download them and specify their storage path in --model_name. |

😉 Demo Inference

Before inference, you need to do two things: (1) install the Uni3C library and PyTorch3D, then add the path of your Uni3C installation in uni3c_cam_render_api.py, e.g., sys.path.insert(0, "path of Uni3C-main"); (2) run action2traj.py to map your keyboard actions to a camera trajectory and generate the trajectory .txt file.
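Conceptually, the action-to-trajectory step looks like the sketch below. This is a hypothetical illustration, not the actual action2traj.py: the step size, ground-plane motion model, and "x z per line" file layout are all assumptions.

```python
# Hypothetical sketch of mapping keyboard actions to a camera trajectory file.
# The real logic lives in action2traj.py; step size and file format are assumptions.

STEP = 0.1  # assumed camera translation per key press, in scene units

# Each key moves the camera in the ground plane: W/S along z, A/D along x.
DELTAS = {"W": (0.0, STEP), "S": (0.0, -STEP), "A": (-STEP, 0.0), "D": (STEP, 0.0)}

def actions_to_trajectory(actions):
    """Accumulate key presses into a list of (x, z) camera positions."""
    x, z = 0.0, 0.0
    positions = [(x, z)]
    for key in actions:
        dx, dz = DELTAS[key.upper()]
        x, z = x + dx, z + dz
        positions.append((x, z))
    return positions

def write_trajectory(positions, path):
    """Write one 'x z' pair per line (a guess at the .txt layout)."""
    with open(path, "w") as f:
        for x, z in positions:
            f.write(f"{x:.4f} {z:.4f}\n")
```

For example, the action string "WWD" would produce four positions, ending one step right and two steps forward of the start.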

python inference/inference_magicworld_base.py
python inference/inference_magicworld.py
python inference/inference_magicworld_fast.py \
  --num_output_frames 21 \
  --config_path config/reward_forcing_switch.yaml \
  --checkpoint_path checkpoints/MagicWorld-Fast/model.pt \
  --output_folder videos/ar_mutil_reward \
  --data_path asset/sense_image \
  --extended_prompt_path asset/sense_caption.json \
  --control_camera_txt asset/trajectory.txt \
  --i2v

🚀 Training

MagicWorld can optionally be trained with DeepSpeed, which saves a large amount of GPU memory. The training data format is shown below.

[
    {
      "file_path": "train/00000001.mp4",
      "control_file_path": "camera/trajectory.txt",
      "point_video_path": "render/00000001_render.mp4",
      "text": "A group of young men in suits and sunglasses are walking down a city street.",
      "type": "video"
    },
    ...
]
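Before launching training, it can be worth sanity-checking the metadata file against the format above. The required keys follow the example; the validator itself is a convenience sketch, not part of the repo:

```python
import json

# Keys required by the MagicWorld training data format shown above.
REQUIRED_KEYS = {"file_path", "control_file_path", "point_video_path", "text", "type"}

def validate_metadata(path):
    """Load a metadata JSON and return (index, problems) for bad entries."""
    with open(path) as f:
        entries = json.load(f)
    problems = []
    for i, entry in enumerate(entries):
        missing = REQUIRED_KEYS - entry.keys()
        if missing:
            problems.append((i, sorted(missing)))
        elif entry["type"] != "video":
            problems.append((i, ["unexpected type: " + str(entry["type"])]))
    return problems
```

An empty return value means every entry carries all required keys and has type "video".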

Some parameters in the shell script can be confusing; they are explained below:

  • enable_bucket is used to enable bucket training. When enabled, the model does not crop the images and videos at the center, but instead, it trains the entire images and videos after grouping them into buckets based on resolution.
  • random_frame_crop is used for random cropping on video frames to simulate videos with different frame counts.
  • random_hw_adapt is used to enable automatic height and width scaling for images and videos. When enabled, both training images and training videos will have their height and width set to image_sample_size as the maximum and min(video_sample_size, 512) as the minimum.
    • For example, when random_hw_adapt is enabled, with video_sample_n_frames=49, video_sample_size=1024, and image_sample_size=1024, the resolution of image inputs for training is 512x512 to 1024x1024, and the resolution of video inputs for training is 512x512x49 to 1024x1024x49.
    • For example, when random_hw_adapt is enabled, with video_sample_n_frames=49, video_sample_size=1024, and image_sample_size=256, the resolution of image inputs for training is 256x256 to 1024x1024, and the resolution of video inputs for training is 256x256x49.
  • training_with_video_token_length specifies training the model according to token length. For training images and videos, the height and width will be set to video_sample_size as the maximum and image_sample_size as the minimum.
    • For example, when training_with_video_token_length is enabled, with video_sample_n_frames=49, token_sample_size=1024, video_sample_size=1024, and image_sample_size=256, the resolution of image inputs for training is 256x256 to 1024x1024, and the resolution of video inputs for training is 256x256x49 to 1024x1024x49.
    • For example, when training_with_video_token_length is enabled, with video_sample_n_frames=49, token_sample_size=512, video_sample_size=1024, and image_sample_size=256, the resolution of image inputs for training is 256x256 to 1024x1024, and the resolution of video inputs for training is 256x256x49 to 1024x1024x9.
    • The token length for a video with dimensions 512x512 and 49 frames is 13,312, so we set token_sample_size = 512.
      • At 512x512 resolution, the number of video frames is 49 (~= 512 * 512 * 49 / 512 / 512).
      • At 768x768 resolution, the number of video frames is 21 (~= 512 * 512 * 49 / 768 / 768).
      • At 1024x1024 resolution, the number of video frames is 9 (~= 512 * 512 * 49 / 1024 / 1024).
      • These resolutions combined with their corresponding lengths allow the model to generate videos of different sizes.
  • resume_from_checkpoint is used to set whether the training should resume from a previous checkpoint. Use a path or "latest" to automatically select the last available checkpoint.
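The token-length arithmetic above can be sketched in a few lines. The VAE strides (temporal 4, spatial 8) and the 2x2 patchification are assumptions based on the Wan2.1 base model, and the frame counts follow the document's own pixel-volume approximation:

```python
def token_length(frames, height, width):
    """Approximate DiT token count for a clip, assuming the Wan2.1 VAE
    (temporal stride 4, spatial stride 8) and a 2x2 spatial patch size."""
    temporal_latents = (frames - 1) // 4 + 1
    return temporal_latents * (height // 16) * (width // 16)

def frames_for_resolution(side, token_sample_size=512, base_frames=49):
    """Largest frame count of the form 4k+1 that fits the pixel-volume
    budget implied by token_sample_size (the doc's approximation)."""
    raw = token_sample_size ** 2 * base_frames / side ** 2
    return int((raw - 1) // 4) * 4 + 1
```

This reproduces the numbers above: 13,312 tokens for 512x512x49, and frame counts of 49, 21, and 9 at 512, 768, and 1024 resolution respectively.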

When training the model with multiple machines, please set the params as follows:

export MASTER_ADDR="your master address"
export MASTER_PORT=10086
export WORLD_SIZE=1  # the number of machines
export NUM_PROCESS=8 # the total number of processes, i.e., WORLD_SIZE * 8 for 8 GPUs per machine
export RANK=0        # the rank of this machine

accelerate launch --mixed_precision="bf16" --main_process_ip=$MASTER_ADDR --main_process_port=$MASTER_PORT --num_machines=$WORLD_SIZE --num_processes=$NUM_PROCESS --machine_rank=$RANK scripts/xxx.py

You can run the following command:

bash train_magicworld_v1.sh

Pickup Event Smoke Finetuning

This is the current smoke-oriented pickup workflow. For the full operator notes and acceptance checklist, see docs/pickup_event_smoke_runbook.md.

  1. Prepare a pickup manifest before export.
python scripts/data/prepare_epic_pickup_manifest.py \
  --input-csv tmp/pickup_source.csv \
  --output-json tmp/pickup_manifest.json

Manifest rows must include source_video_path before export.

  2. Run the smoke export.
python scripts/data/export_magicworld_pickup_dataset.py \
  --manifest-json tmp/pickup_manifest.json \
  --output-dir tmp/pickup_smoke_export \
  --dry-run

Note that --dry-run still writes files: it copies the source video bytes into both clips/ and controls/, then writes metadata.json.

  3. Run the smoke finetune command.
python scripts/train_magicworld_v1.5.py \
  --pretrained_model_name_or_path ckpt/Wan2.1-T2V-1.3B \
  --pretrained_transformer_path ckpt/Wan2.1-T2V-1.3B \
  --config_path config/wan2.1/wan_civitai.yaml \
  --train_data_dir tmp/pickup_smoke_export \
  --train_data_meta tmp/pickup_smoke_export/metadata.json \
  --output_dir tmp/pickup_smoke_train \
  --train_mode control \
  --train_batch_size 1 \
  --max_train_steps 1 \
  --checkpointing_steps 1 \
  --smoke_run \
  --max_train_samples 1 \
  --enable_event_text \
  --text_composition_mode event_prefix

The exported metadata keeps text empty and event_text populated so event-prefix composition avoids duplication.
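A minimal sketch of what event-prefix composition might do, assuming (hypothetically) that the final caption is the event_text followed by the base text, with empty or duplicated parts dropped. The actual logic lives behind --text_composition_mode event_prefix:

```python
def compose_event_prefix(event_text, text):
    """Prefix the caption with the event description, avoiding duplication.

    Hypothetical re-implementation of the event_prefix mode: empty parts
    are dropped, and if text already starts with event_text the prefix is
    not repeated.
    """
    event_text, text = event_text.strip(), text.strip()
    if not event_text:
        return text
    if not text or text.lower().startswith(event_text.lower()):
        return text or event_text
    return f"{event_text}. {text}"
```

Under this sketch, the exported rows (empty text, populated event_text) would resolve to the event description alone, which is why keeping text empty avoids duplication.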

Expected outputs:

  • tmp/pickup_smoke_export/metadata.json
  • tmp/pickup_smoke_export/clips/<sample_id>.mp4
  • tmp/pickup_smoke_export/controls/<sample_id>.mp4
  • tmp/pickup_smoke_train/

📕 RealWM120K Dataset

The RealWM120K dataset and its processing tools have not been released yet; see the to-do list above.

⭐ Acknowledgement

Our code is built on VideoX-Fun. We adopt Wan2.1-Fun-V1.1-1.3B as the base model and use Uni3C to generate 3D points. The logo style is borrowed from Helios. We thank Siming Zheng and Shuolin Xu for their initial support and suggestions on our basic framework. Thanks to all the contributors!

🔍 Related Works

Infinite-World
Matrix-Game 2.0
LingBot-World
YUME 1.5
Self-Forcing
LongLive

📜 License

All the materials, including code, checkpoints, and demos, are made available under the Creative Commons BY-NC-SA 4.0 license. You are free to copy, redistribute, remix, transform, and build upon the project for non-commercial purposes, as long as you give appropriate credit and distribute your contributions under the same license.

🎓 Citation

@article{li2026magicworld,
  title={MagicWorld: Towards Long-Horizon Stability for Interactive Video World Exploration},
  author={Li, Guangyuan and Li, Bo and Chen, Jinwei and Hu, Xiaobin and Zhao, Lei and Jiang, Peng-Tao},
  journal={arXiv preprint arXiv:2511.18886v2},
  year={2026}
}
