FVDM

Official code for the paper "Redefining Temporal Modeling in Video Diffusion: The Vectorized Timestep Approach"

Authors: Yaofang Liu, Yumeng Ren, Xiaodong Cun, Aitor Artola, Yang Liu, Tieyong Zeng, Raymond H. Chan, Jean-Michel Morel

Paper: https://arxiv.org/abs/2410.03160 · Code: https://github.com/Yaofang-Liu/FVDM

FVDM (Frame-aware Video Diffusion Model) introduces a novel vectorized timestep variable (VTV) to revolutionize video generation, addressing limitations in current video diffusion models (VDMs). Unlike previous VDMs, our approach allows each frame to follow an independent noise schedule, enhancing the model's capacity to capture fine-grained temporal dependencies. FVDM's flexibility is demonstrated across multiple tasks, including standard video generation, image-to-video generation, video interpolation, and long video synthesis. Through a diverse set of VTV configurations, we achieve superior quality in generated videos, overcoming challenges such as catastrophic forgetting during fine-tuning and limited generalizability in zero-shot methods.

Highlights

  • Vectorized Timestep Variable (VTV) for fine-grained temporal modeling
  • Great flexibility across a wide range of video generation tasks (in a zero-shot way)
  • Superior quality in generated videos
  • No additional computational cost during training or inference
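The core idea can be sketched with the standard DDPM forward process, except that each frame draws its own timestep rather than the whole clip sharing one. The following is a toy NumPy illustration, not the repository's implementation; the function name and the linear beta schedule are assumptions made for the sketch:

```python
import numpy as np

rng = np.random.default_rng(0)

def add_noise_per_frame(video, timesteps, alphas_cumprod):
    """Noise each frame at its own timestep (the vectorized timestep variable).

    video:          (F, C, H, W) clean frames
    timesteps:      (F,) integer timestep per frame -- the VTV
    alphas_cumprod: (T,) cumulative products of the noise schedule
    """
    a = alphas_cumprod[timesteps].reshape(-1, 1, 1, 1)  # per-frame alpha-bar
    noise = rng.standard_normal(video.shape)
    # standard DDPM forward process, but with a distinct t for every frame
    return np.sqrt(a) * video + np.sqrt(1.0 - a) * noise

T, F = 1000, 16
betas = np.linspace(1e-4, 0.02, T)          # assumed linear schedule
alphas_cumprod = np.cumprod(1.0 - betas)

video = rng.standard_normal((F, 3, 8, 8))
t_vec = rng.integers(0, T, size=F)          # independent timestep per frame
noised = add_noise_per_frame(video, t_vec, alphas_cumprod)
```

Setting all entries of `t_vec` to the same value recovers the ordinary scalar-timestep VDM forward process as a special case.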

Demos

With different VTV configurations, FVDM can be extended to numerous tasks (in a zero-shot way).

Below are FVDM-generated videos on the FaceForensics, SkyTimelapse, Taichi-HD, and UCF101 datasets. Note that, for each dataset, the same model checkpoint is used across the different tasks, reflecting strong zero-shot capabilities.

demo.mp4
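The task-specific VTV configurations can be sketched as per-frame timestep vectors: frames given as conditioning are pinned to timestep 0 (i.e. kept clean), while frames to be generated carry the current denoising timestep. `vtv_config` below is a hypothetical helper for illustration only; the actual configurations live in `configs/`:

```python
import numpy as np

def vtv_config(task, num_frames, t):
    """Hypothetical helper building a per-frame timestep vector for a task.

    Conditioned (known) frames are pinned to timestep 0, while frames to be
    generated carry the current denoising timestep t.
    """
    vtv = np.full(num_frames, t, dtype=np.int64)
    if task == "image_to_video":
        vtv[0] = 0                      # first frame is the given image
    elif task == "interpolation":
        vtv[0] = vtv[-1] = 0            # keyframes at both ends are given
    # "standard": all frames share one timestep, as in ordinary VDMs
    return vtv
```

Because only the timestep vector changes between tasks, one trained checkpoint serves all of them, which is what enables the zero-shot behavior described above.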

Setup

git clone https://github.com/Yaofang-Liu/FVDM.git
cd FVDM
conda env create -f environment.yml
conda activate latte

Code Structure

.
├── configs/                # Training and sampling configurations
│   ├── ffs/               # FaceForensics configs
│   ├── sky/               # SkyTimelapse configs
│   ├── taichi/            # Taichi-HD configs
│   ├── ucf101/            # UCF101 configs
│   └── t2v/               # Text-to-Video configs
├── datasets/              # Dataset loaders
├── diffusers/             # Diffusion model components
├── diffusion/             # Gaussian diffusion utilities
├── models/                # Model architectures
├── sample/                # Sampling scripts
├── tools/                 # Evaluation metrics (FVD, FID, IS)
├── train_scripts/         # Training shell scripts
├── train.py               # Base training script
├── train_video.py         # Video training script
└── train_with_img.py      # Video-image joint training script

Training

To train FVDM on different datasets:

# FaceForensics
bash train_scripts/ffs_train_video.sh

# SkyTimelapse
bash train_scripts/sky_train_video.sh

# Taichi-HD
bash train_scripts/taichi_train_video.sh

# UCF101
bash train_scripts/ucf101_train_video.sh

Or use torchrun directly (replace N with the number of GPUs per node):

torchrun --nnodes=1 --nproc_per_node=N train_video.py --config ./configs/ffs/ffs_train_video.yaml

Sampling

To generate videos:

# FaceForensics
bash sample/ffs_video.sh

# SkyTimelapse
bash sample/sky_video.sh

# Taichi-HD
bash sample/taichi_video.sh

# UCF101
bash sample/ucf101_video.sh

# Text-to-Video
bash sample/t2v.sh
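Conceptually, zero-shot conditional sampling re-pins the known frames at every reverse step while the remaining frames are denoised. The sketch below assumes a hypothetical `denoise_fn(x, t_vec)` interface and a heavily simplified reverse update; it is not the repository's sampler:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_conditioned(denoise_fn, shape, cond_frames, T=1000):
    """Toy conditional sampling loop (a sketch, not the repo's sampler).

    shape:       (F, C, H, W) of the video to generate
    cond_frames: dict {frame_index: clean_frame}; those frames stay at t=0
    denoise_fn:  hypothetical callable (x, t_vec) -> refined video estimate
    """
    F = shape[0]
    x = rng.standard_normal(shape)
    for t in range(T - 1, -1, -1):
        t_vec = np.full(F, t)
        for i, frame in cond_frames.items():
            t_vec[i] = 0                # conditioned frames are already clean
            x[i] = frame                # re-pin them at every reverse step
        x = denoise_fn(x, t_vec)        # one (simplified) reverse step
    return x
```

Passing an empty `cond_frames` dict recovers unconditional generation; supplying the first frame yields image-to-video, and supplying both endpoints yields interpolation, mirroring the VTV configurations above.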

Evaluation

We provide evaluation scripts for FVD, FID, and IS metrics:

bash tools/eval_metrics_ucf101.sh
bash tools/eval_metrics_taichi.sh
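For reference, the Inception Score reduces to a short formula over classifier probabilities, IS = exp(E_x[KL(p(y|x) || p(y))]). The sketch below is a minimal NumPy version of that formula; the scripts in `tools/` compute FVD, FID, and IS in full from real classifier features:

```python
import numpy as np

def inception_score(probs, eps=1e-12):
    """Inception Score from class probabilities p(y|x), one row per sample.

    IS = exp( E_x [ KL( p(y|x) || p(y) ) ] ), where p(y) is the marginal
    label distribution over all samples.
    """
    p_y = probs.mean(axis=0, keepdims=True)           # marginal distribution
    kl = (probs * (np.log(probs + eps) - np.log(p_y + eps))).sum(axis=1)
    return float(np.exp(kl.mean()))

# A perfectly confident, perfectly diverse classifier maximizes the score:
ideal = np.eye(4)            # 4 samples, each assigned a distinct class
score = inception_score(ideal)   # close to 4.0, the class count
```

Higher is better: confident per-sample predictions with a uniform marginal (diversity) push the score toward the number of classes, while a collapsed generator scores near 1.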

Citation

If you find our work useful, please consider citing:

@misc{liu2024redefiningtemporalmodelingvideo,
      title={Redefining Temporal Modeling in Video Diffusion: The Vectorized Timestep Approach},
      author={Yaofang Liu and Yumeng Ren and Xiaodong Cun and Aitor Artola and Yang Liu and Tieyong Zeng and Raymond H. Chan and Jean-michel Morel},
      year={2024},
      eprint={2410.03160},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2410.03160},
}

Acknowledgments

This implementation is built upon Latte. We thank the authors for their excellent work.

Contact

For any questions or feedback, please contact yaofanliu2-c@my.cityu.edu.hk.

License

See LICENSE for details.
