- Clone the repo with submodules (vjepa2, MAGI-1):

```bash
git clone --recurse-submodules https://github.com/facebookresearch/WMReward.git
cd WMReward
```

If you already cloned without `--recurse-submodules`, initialize the submodules with:

```bash
git submodule update --init --recursive
git submodule sync --recursive
```

- Create a conda environment and install the dependencies (Python 3.10 + PyTorch 2.4 with CUDA 12.4):
```bash
conda env create -f environment.yml
conda activate wmreward
pip install torch==2.4.0 torchvision==0.19.0 --index-url https://download.pytorch.org/whl/cu124
pip install flash-attn==2.4.2 --no-build-isolation
pip install flashinfer-python==0.2.0.post2 --extra-index-url https://flashinfer.ai/whl/cu124/torch2.4/
```

- Download the MAGI-1 model weights (only needed for video generation, not for `compute_wmreward.py`):
Download them from the MAGI-1 Hugging Face repo:

```bash
pip install "huggingface_hub[cli]"
# Download the 24B base model, VAE, and T5 text encoder
huggingface-cli download sand-ai/MAGI-1 --include "ckpt/magi/24B_base/*" --local-dir downloads
huggingface-cli download sand-ai/MAGI-1 --include "ckpt/vae/*" --local-dir downloads
huggingface-cli download sand-ai/MAGI-1 --include "ckpt/t5/*" --local-dir downloads
# Move into the expected layout
mv downloads/ckpt/magi/24B_base downloads/24B_base
mv downloads/ckpt/vae downloads/vae
mv downloads/ckpt/t5 downloads/t5_pretrained
rm -rf downloads/ckpt
```

The expected directory structure:
```
WMReward/
└── downloads/
    ├── 24B_base/        # MAGI-1 DiT model weights
    ├── vae/             # MAGI-1 VAE encoder/decoder
    └── t5_pretrained/   # T5-XXL text encoder
```
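Before launching generation, it can help to verify that the weights landed in this layout. A minimal sketch (the `check_layout` helper is illustrative, not part of the repo; it just mirrors the tree above):

```python
from pathlib import Path

def check_layout(download_dir="downloads"):
    """Return the expected checkpoint subdirectories that are missing."""
    expected = ["24B_base", "vae", "t5_pretrained"]
    missing = [d for d in expected if not (Path(download_dir) / d).is_dir()]
    for d in missing:
        print(f"missing: {download_dir}/{d}")
    return missing

# In a fresh clone with no weights downloaded, all three are reported missing.
check_layout()
```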
Note: VJEPA checkpoints are optional for computing WMReward. The `compute_wmreward.py` script automatically downloads them via `torch.hub`. If you want to use local checkpoints (via `load_vjepa_model_source`), place them in `./checkpoints/` or set `VJEPA_CHECKPOINT_DIR` to your checkpoint directory.
Our WMReward is computed with the central function `compute_vjepa_surprise()`, currently implemented for VJEPA models. Run it on a video with:
```bash
python compute_wmreward.py --video_path /path/to/video.mp4
```

Options:

- `--model`: Model variant (`vith`, `vitg`, `vitg384`, `vitgac`). Default: `vitg`
- `--window_size`: Sliding window size. Default: `16`
- `--context_frames`: Context frames per window. Default: `8`
- `--stride`: Sliding window stride. Default: `2`
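Conceptually, the surprise score slides a window over the video's per-frame latent features and measures how poorly the first `context_frames` frames predict the remaining ones. A toy, self-contained sketch of that idea (the mean-of-context predictor below stands in for the actual VJEPA predictor; only `window_size`, `context_frames`, and `stride` mirror the CLI flags):

```python
import numpy as np

def sliding_window_surprise(features, window_size=16, context_frames=8, stride=2):
    """Mean prediction error over sliding windows of per-frame features.

    features: (T, D) array of per-frame latent features.
    Toy predictor: each target frame is predicted as the mean of the
    context frames; VJEPA would use its learned predictor instead.
    """
    T = features.shape[0]
    errors = []
    for start in range(0, T - window_size + 1, stride):
        window = features[start:start + window_size]
        context = window[:context_frames]
        target = window[context_frames:]
        prediction = context.mean(axis=0, keepdims=True)  # stand-in predictor
        errors.append(np.mean((target - prediction) ** 2))
    return float(np.mean(errors))

# A perfectly static video has zero surprise under this toy predictor.
static = np.ones((32, 8))
print(sliding_window_surprise(static))  # → 0.0
```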
Other models can be integrated with little effort: simply compute a reward score with them, e.g. a yes/no log-likelihood with a VLM. You can also use this function for WMReward guidance on your own model; we implemented the guidance for MAGI-1 in `generator_i2v_multinode.py`.
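For instance, a VLM-based reward along those lines could score the log-likelihood gap between answering "Yes" and "No" to a plausibility question. A hedged sketch (the function name and token ids are illustrative; with a real VLM you would take the next-token logits after prompting something like "Is this video physically plausible? Answer Yes or No."):

```python
import numpy as np

def yes_no_reward(next_token_logits, yes_id, no_id):
    """Reward = log P("Yes") - log P("No") from a VLM's next-token logits.

    next_token_logits: (V,) logits over the vocabulary; yes_id/no_id are
    the token ids of "Yes"/"No" (model-specific; look them up via the tokenizer).
    """
    logits = next_token_logits - next_token_logits.max()  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum())
    return float(log_probs[yes_id] - log_probs[no_id])

# Toy vocabulary of 4 tokens; "Yes" (id 0) is favored over "No" (id 1).
logits = np.array([3.0, 1.0, 0.0, 0.0])
print(yes_no_reward(logits, yes_id=0, no_id=1))  # → 2.0
```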
```bash
python generate_magi1.py \
  --config_file ./MAGI-1/example/24B/24B_base_config.json \
  --prompt "A ball falls from the table onto the floor" \
  --init_image ./example/0001_switch-frames_anyFPS_perspective-left_trimmed-ball-and-block-fall.jpg \
  --output_path ./results/output.mp4 \
  --mode i2v
```

Options:

Input/Output:

- `--prompt`: Text prompt describing the video (required)
- `--config_file`: Path to the MAGI-1 configuration JSON file (required)
- `--output_path`: Path to save the output video (required)
- `--mode`: Generation mode: `t2v` (text-to-video), `i2v` (image-to-video), `v2v` (video-to-video). Default: `i2v`
- `--init_image`: Path to the initial image for I2V mode
- `--init_video`: Path to the prefix video for V2V mode
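To generate many videos, one option is a small driver that builds one such command per prompt. A hypothetical sketch (dry run: it only assembles the argument lists; the prompt list, shared `init.jpg` condition image, and output naming are assumptions, not repo conventions):

```python
# Build one generate_magi1.py invocation per prompt (hypothetical batch driver).
prompts = [
    "A ball falls from the table onto the floor",
    "A block slides down a ramp",
]
commands = []
for i, prompt in enumerate(prompts, start=1):
    commands.append([
        "python", "generate_magi1.py",
        "--config_file", "./MAGI-1/example/24B/24B_base_config.json",
        "--prompt", prompt,
        "--init_image", "./example/init.jpg",   # assumed shared condition image
        "--output_path", f"./results/{i:04d}.mp4",
        "--mode", "i2v",
    ])
# To actually run them: for cmd in commands: subprocess.run(cmd, check=True)
print(len(commands))  # → 2
```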
Please follow the instructions from PhysicsIQ to prepare the condition images and prompts. The prompt lists are provided in the prompt folder. Then run:

```bash
bash generation/generate_i2v_magi1_multinode.sh
```

Thanks to these great repositories: MAGI-1, FrameGuidance, and many other inspiring works in the community.
This project is licensed under the CC BY-NC 4.0 License; see the LICENSE file for details. Wherever we make use of other repos (MAGI-1 and VJEPA), those fall under their own copyrights and licenses; please make sure you adhere to them as well.
If you find this work useful in your research, please consider citing:
```bibtex
@inproceedings{yuan2026inferencetimephysicsalignmentvideo,
  title={Inference-time Physics Alignment of Video Generative Models with Latent World Models},
  author={Jianhao Yuan and Xiaofeng Zhang and Felix Friedrich and Nicolas Beltran-Velez and Melissa Hall and Reyhane Askari-Hemmat and Xiaochuang Han and Nicolas Ballas and Michal Drozdzal and Adriana Romero-Soriano},
  year={2026},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
}
```