Official PyTorch implementation of the following paper:
Next Block Prediction: Video Generation via Semi-Autoregressive Modeling.
Shuhuai Ren¹, Shuming Ma², Xu Sun¹, Furu Wei²

¹Peking University, ²Microsoft Research Asia
We introduce a semi-autoregressive (semi-AR) framework, called Next-Block Prediction (NBP), for video generation. This framework features the following properties:
- 🚀 The generation unit is shifted from individual tokens to blocks (e.g., rows or frames), where each token in the current block simultaneously predicts the corresponding token in the next block;
- 🔥 Bidirectional attention within each block enables tokens to capture more robust spatial dependencies;
- ⚡ By predicting multiple tokens in parallel, NBP models significantly reduce the number of generation steps, leading to 11x faster inference;
- 🥇 State-of-the-art generation performance on video datasets among AR-based models.
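To make the attention pattern concrete, here is a minimal sketch of a block-causal mask: tokens attend bidirectionally within their own block and causally to all earlier blocks. The function name and masking convention are illustrative, not part of this repo's API.

```python
import torch

def block_causal_mask(num_tokens: int, block_size: int) -> torch.Tensor:
    """Boolean attention mask (True = may attend).

    Tokens attend to every token in their own block (bidirectional)
    and to all tokens in earlier blocks (causal across blocks).
    """
    block_ids = torch.arange(num_tokens) // block_size
    # token i may attend to token j iff j's block index <= i's block index
    return block_ids[:, None] >= block_ids[None, :]

mask = block_causal_mask(num_tokens=8, block_size=4)
```

With `block_size=1` this degenerates to the standard causal mask of token-level AR models.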
Please setup the environment using the following commands:
```shell
conda create -n nbp python=3.9
conda activate nbp
sh setup.sh
```
Download the datasets from the official websites.
The file structure should look like:
```
NBP/
|-- data/
|   |-- UCF-101/
|   |   |-- ApplyEyeMakeup/
|   |   |   |-- v_ApplyEyeMakeup_g01_c01.avi
|   |   |   |-- ...
|   |   |-- ...
|   |-- kinetics-dataset/
|       |-- k600/
|           |-- train/
|           |-- val/
|           |-- test/
|-- ckpt/
    |-- NBP-ucf-base/
    |   |-- ucf_base_nbp16_hybrid.ckpt
    |-- NBP-k600-base/
    |   |-- k600_base_nbp16.ckpt
    |-- NBP-tokenizer-ucf/
    |   |-- magvit2_ucf.pt
    |-- NBP-tokenizer-k600/
        |-- magvit2_k600.pt
```
| Training Data | rFVD (128x128) | ckpt |
|---|---|---|
| UCF-101 | 15.50 | magvit2_ucf.pt |
| K600 | 6.73 | magvit2_k600.pt |
You can easily incorporate our tokenizer into your language model with:

```python
import torch
from nbp.download import load_magvit2

# load the tokenizer
tokenizer = load_magvit2('/path/to/tokenizer.pt', resolution=128, device="cuda")
tokenizer.eval()

# encode a raw video into discrete token indices
_, tokens = tokenizer.encode(raw_video, quantize=True)

# decode the token indices back to pixel space
pred_video = tokenizer.decode_from_code_indices(tokens)
pred_video = torch.clamp(pred_video, 0, 1)
```
For the evaluation of our tokenizer, please refer to `scripts/recons/eval_tokenizer.sh`.
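The rFVD numbers above come from the full evaluation script. As a quick sanity check on a decoded reconstruction (not a substitute for rFVD, which requires an I3D feature extractor), a simple PSNR can be computed; this helper is a sketch assuming videos scaled to [0, 1]:

```python
import torch

def psnr(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Peak signal-to-noise ratio in dB for tensors in [0, 1]."""
    mse = torch.mean((pred - target) ** 2)
    return -10.0 * torch.log10(mse)

# e.g. a reconstruction with uniform 0.1 error over a zero video
clean = torch.zeros(1, 3, 16, 64, 64)
noisy = (clean + 0.1).clamp(0, 1)
print(f"PSNR: {psnr(noisy, clean):.1f} dB")  # 20.0 dB
```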
| Training Data | Model Size | #Token | #step | gFVD (128x128) | ckpt |
|---|---|---|---|---|---|
| UCF-101 | 700M | 1280 | 95 | 103.3 | ucf_base_nbp16_hybrid.ckpt |
| UCF-101 | 3B | 1280 | 95 | 55.3 | ucf_3b_nbp16_hybrid.ckpt |
| K600 | 700M | 768 | 48 | 25.5 | k600_base_nbp16.ckpt |
| K600 | 3B | 768 | 48 | 19.5 | k600_3b_nbp16.ckpt |
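The #Token and #step columns above imply the step reduction that semi-AR decoding buys over token-by-token AR generation, which would need one step per token. A quick back-of-the-envelope calculation from the table:

```python
# values taken directly from the generation table above
tokens_per_video = {"UCF-101": 1280, "K600": 768}
nbp_steps = {"UCF-101": 95, "K600": 48}

for name, n_tokens in tokens_per_video.items():
    reduction = n_tokens / nbp_steps[name]
    print(f"{name}: {reduction:.1f}x fewer generation steps")
# UCF-101: 13.5x fewer generation steps
# K600: 16.0x fewer generation steps
```

Fewer steps does not translate one-to-one into wall-clock speedup (each semi-AR step decodes a whole block), which is consistent with the ~11x faster inference reported above.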
Please refer to `scripts/lm_train` for model training.
If you use DeepSpeed, run the following after training to convert the checkpoint to fp32:

```shell
python zero_to_fp32.py /path/to/checkpoint-folder /path/to/output-file
```
Please refer to `scripts/lm_eval` for model evaluation.
Our code is partially built upon OmniTokenizer and FSQ-pytorch.
This project is licensed under the MIT license, as found in the LICENSE file.

