Official code for the paper "Redefining Temporal Modeling in Video Diffusion: The Vectorized Timestep Approach"
Authors: Yaofang Liu, Yumeng Ren, Xiaodong Cun, Aitor Artola, Yang Liu, Tieyong Zeng, Raymond H. Chan, Jean-Michel Morel
FVDM (Frame-aware Video Diffusion Model) introduces a novel vectorized timestep variable (VTV) to revolutionize video generation, addressing limitations in current video diffusion models (VDMs). Unlike previous VDMs, our approach allows each frame to follow an independent noise schedule, enhancing the model's capacity to capture fine-grained temporal dependencies. FVDM's flexibility is demonstrated across multiple tasks, including standard video generation, image-to-video generation, video interpolation, and long video synthesis. Through a diverse set of VTV configurations, we achieve superior quality in generated videos, overcoming challenges such as catastrophic forgetting during fine-tuning and limited generalizability in zero-shot methods.
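The idea of a per-frame noise schedule can be illustrated with a small sketch. This is not the repository's implementation: it assumes a standard DDPM-style forward process with a linear beta schedule, represents each frame as a plain list of floats, and uses the hypothetical helper names `make_alphas_cumprod` and `noise_frames`. A conventional VDM would share one scalar timestep across the clip, whereas here every frame has its own entry in the timestep vector.

```python
import math
import random

def make_alphas_cumprod(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_t) for a linear beta schedule (an assumption)."""
    alphas_cumprod = []
    running = 1.0
    for i in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * i / (num_steps - 1)
        running *= 1.0 - beta
        alphas_cumprod.append(running)
    return alphas_cumprod

def noise_frames(frames, timesteps, alphas_cumprod, rng):
    """Apply x_t = sqrt(a_bar_t) * x_0 + sqrt(1 - a_bar_t) * eps, one timestep per frame."""
    if len(frames) != len(timesteps):
        raise ValueError("need exactly one timestep per frame")
    noisy = []
    for frame, t in zip(frames, timesteps):
        a_bar = alphas_cumprod[t]
        signal, sigma = math.sqrt(a_bar), math.sqrt(1.0 - a_bar)
        noisy.append([signal * x + sigma * rng.gauss(0.0, 1.0) for x in frame])
    return noisy

rng = random.Random(0)
alphas_cumprod = make_alphas_cumprod()
frames = [[1.0] * 4 for _ in range(3)]  # 3 toy "frames" of 4 pixels each

# Conventional VDM: one timestep shared by every frame of the clip.
scalar_t = [500] * len(frames)
# Vectorized timestep: each frame draws its own, independent timestep.
vector_t = [rng.randrange(len(alphas_cumprod)) for _ in frames]

noisy = noise_frames(frames, vector_t, alphas_cumprod, rng)
for t, frame in zip(vector_t, noisy):
    print(t, [round(x, 3) for x in frame])
```

Passing `scalar_t` instead of `vector_t` reproduces the shared-timestep behaviour, so in this sketch the scalar case is simply a special case of the vectorized one.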
- Vectorized Timestep Variable (VTV) for fine-grained temporal modeling
- Great flexibility across a wide range of video generation tasks (in a zero-shot way)
- Superior quality in generated videos
- No additional computation cost during training and inference
With different VTV configurations, FVDM can be extended to numerous tasks (in a zero-shot way).
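As a rough illustration of what a VTV configuration might look like, the sketch below builds per-frame timestep vectors for a few of the tasks named above. The helper `build_vtv`, the choice of timestep 0 to mark given (clean) frames, and the specific frame indices are assumptions made for illustration; the configurations actually used by FVDM are defined in the paper and the sampling scripts.

```python
def build_vtv(num_frames, t, clean_frames=()):
    """Return a per-frame timestep vector.

    Frames listed in `clean_frames` are treated as given and get timestep 0
    (an assumption of this sketch); all other frames get the noise level `t`.
    """
    clean = set(clean_frames)
    out_of_range = [i for i in clean if not 0 <= i < num_frames]
    if out_of_range:
        raise ValueError(f"frame indices out of range: {out_of_range}")
    return [0 if i in clean else t for i in range(num_frames)]

num_frames, t = 8, 600

configs = {
    # Standard generation: every frame shares the same noise level.
    "standard": build_vtv(num_frames, t),
    # Image-to-video: the first frame is given, the rest are generated.
    "image_to_video": build_vtv(num_frames, t, clean_frames=[0]),
    # Interpolation: the first and last frames are given.
    "interpolation": build_vtv(num_frames, t, clean_frames=[0, num_frames - 1]),
    # Long video: the leading frames come from the end of the previous clip.
    "long_video": build_vtv(num_frames, t, clean_frames=range(4)),
}

for name, vtv in configs.items():
    print(f"{name:>15}: {vtv}")
```

In this sketch the same denoiser would be called for every task and only the timestep vector changes, which is one way to read the zero-shot claim above.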
Below are videos generated by FVDM on the FaceForensics, SkyTimelapse, Taichi-HD, and UCF101 datasets. Note that the same models/checkpoints are used across all tasks, which reflects strong zero-shot capabilities.
demo.mp4
git clone https://github.com/Yaofang-Liu/FVDM.git
cd FVDM
conda env create -f environment.yml
conda activate latte
├── configs/ # Training and sampling configurations
│ ├── ffs/ # FaceForensics configs
│ ├── sky/ # SkyTimelapse configs
│ ├── taichi/ # Taichi-HD configs
│ ├── ucf101/ # UCF101 configs
│ └── t2v/ # Text-to-Video configs
├── datasets/ # Dataset loaders
├── diffusers/ # Diffusion model components
├── diffusion/ # Gaussian diffusion utilities
├── models/ # Model architectures
├── sample/ # Sampling scripts
├── tools/ # Evaluation metrics (FVD, FID, IS)
├── train_scripts/ # Training shell scripts
├── train.py # Base training script
├── train_video.py # Video training script
└── train_with_img.py # Video-image joint training script
To train FVDM on different datasets:
# FaceForensics
bash train_scripts/ffs_train_video.sh
# SkyTimelapse
bash train_scripts/sky_train_video.sh
# Taichi-HD
bash train_scripts/taichi_train_video.sh
# UCF101
bash train_scripts/ucf101_train_video.sh
Or use torchrun directly (replace N with the number of processes to launch on the node, typically one per GPU):
torchrun --nnodes=1 --nproc_per_node=N train_video.py --config ./configs/ffs/ffs_train_video.yaml
To generate videos:
# FaceForensics
bash sample/ffs_video.sh
# SkyTimelapse
bash sample/sky_video.sh
# Taichi-HD
bash sample/taichi_video.sh
# UCF101
bash sample/ucf101_video.sh
# Text-to-Video
bash sample/t2v.sh
We provide evaluation scripts for FVD, FID, and IS metrics:
bash tools/eval_metrics_ucf101.sh
bash tools/eval_metrics_taichi.sh
If you find our work useful, please consider citing:
@misc{liu2024redefiningtemporalmodelingvideo,
title={Redefining Temporal Modeling in Video Diffusion: The Vectorized Timestep Approach},
author={Yaofang Liu and Yumeng Ren and Xiaodong Cun and Aitor Artola and Yang Liu and Tieyong Zeng and Raymond H. Chan and Jean-michel Morel},
year={2024},
eprint={2410.03160},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2410.03160},
}
This implementation is built upon Latte. We thank the authors for their excellent work.
For any questions or feedback, please contact yaofanliu2-c@my.cityu.edu.hk.
See LICENSE for details.

