This repository provides tools for full fine-tuning or partial fine-tuning (e.g., specific DiT blocks) of the Mochi and Pusa V0.5 video generation models. It supports training on both single-node and multi-node configurations with our provided dataset or your own custom data.
- GPU: 8x H800 or above for full fine-tuning
- Video Length Support: Up to 163 frames (~5.4 seconds at 30 FPS)
- Choose frame counts in increments of 6: 25, 31, 37, ... 163.
- Single node (8 GPUs): 163 frames with `batch_size_per_worker=1` uses ~59GB of VRAM per GPU
- Two nodes (16 GPUs): supports up to 163 frames with `batch_size_per_worker=2` (total batch size 32); the Pusa-V0.5 model was trained this way for 500 steps in ~7 hours
- Training on more than two nodes is also supported
Set up the environment and install dependencies:
```bash
git clone https://github.com/Yaofang-Liu/Mochi-Full-Finetuner.git
cd Mochi-Full-Finetuner
pip install uv
uv venv pusa
source pusa/bin/activate
uv pip install setuptools
uv pip install -e . --no-build-isolation
uv pip install torchmetrics ipdb opencv-python pyarrow "ray[train]" lightning
uv pip install flash-attn --no-build-isolation
uv pip install decord PyAV
pip install ray[client]
```
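Optionally, you can verify that PyTorch sees your GPUs before continuing (a quick sanity check, not a required setup step):

```bash
# Optional: confirm CUDA is visible to PyTorch inside the pusa venv
python -c "import torch; print(torch.__version__, torch.cuda.is_available(), torch.cuda.device_count())"
```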
Download Mochi 1 weights:
```bash
huggingface-cli download genmo/mochi-1-preview --repo-type model --local-dir <path_to_model_directory>
```
Download our training dataset (52,695 pre-encoded latent samples from VIDGEN-1M; Pusa V0.5 used only 16,000 of them):
```bash
huggingface-cli download RaphaelLiu/PusaV0.5_Training --repo-type dataset --local-dir <path_to_dataset_directory>
```
Alternatively, you can use your own dataset by following the Mochi LoRA training instructions. Note that your final dataset structure should be arranged like this:
```
path/to/datasets/
  videos/
    xxxx.latent.pt
    xxxx.latent.pt
    ...
  captions/
    xxxx.embed.pt
    xxxx.embed.pt
    ...
```
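If you build your own dataset, it can help to verify that every latent file has a matching caption embedding. Below is a minimal check, assuming the naming scheme above (this script is not part of the repository):

```bash
# List latents that lack a matching caption embedding
DATASET=/path/to/datasets
shopt -s nullglob
for f in "$DATASET"/videos/*.latent.pt; do
  name=$(basename "$f" .latent.pt)
  [ -f "$DATASET/captions/$name.embed.pt" ] || echo "missing caption embedding: $name"
done
```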
```bash
python -u /path/to/src/genmo/mochi_preview/train_xxxx.py \
    --world_size=8 \
    --model_dir="/path/to/model/directory" \
    --data_path="/path/to/datasets/videos"
```
Note:
- `train_xxxx.py` should be `train_mochi.py` to fine-tune the original Mochi model, or `train_pusa.py` to fine-tune the Pusa model (see the example below).
- Pass only the path to the `videos` directory as the `--data_path` argument; the captions directory is derived automatically by replacing the base directory name `videos` with `captions`. Checkpoints are written to `os.path.join(args.model_dir, 'checkpoints')`.
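For example, a single-node Pusa fine-tuning run on the downloaded dataset might look like this (the paths are placeholders; substitute wherever you stored the model weights and dataset):

```bash
python -u ./src/genmo/mochi_preview/train_pusa.py \
    --world_size=8 \
    --model_dir="/data/models/mochi-1-preview" \
    --data_path="/data/datasets/PusaV0.5_Training/videos"
```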
Edit the SLURM configuration in `src/genmo/mochi_preview/train_multi_nodes.sh`:
- Update SLURM parameters (see the example header below):
  - `--partition`: Your cluster's partition
  - `--nodes`: Number of nodes
  - `--nodelist`: Node names (optional)
  - `--cpus-per-task`: CPUs per node
  - `--mem`: Memory per node
  - `--gres`: GPU resources per node
- Update paths:
  - Project directory
  - Model directory
  - Data directory
  - Training script path (`train_mochi.py` or `train_pusa.py`)
- Adjust training parameters:
  - `--num_frames`: Frame count
  - `--frame_interval`: Frame interval
  - `--width` and `--height`: Frame dimensions
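As a rough reference, the SBATCH header you are editing might look like this (placeholder values; the exact directives and values in `train_multi_nodes.sh` may differ):

```bash
#SBATCH --partition=your_partition   # your cluster's partition
#SBATCH --nodes=2                    # number of nodes
#SBATCH --nodelist=node01,node02     # optional: pin specific nodes
#SBATCH --cpus-per-task=64           # CPUs per node
#SBATCH --mem=960G                   # memory per node
#SBATCH --gres=gpu:8                 # GPUs per node
```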
Then submit the job:
```bash
sbatch ./src/genmo/mochi_preview/train_multi_nodes.sh
```
Training logs are saved to:
- `logs/mochi_[job_id].out`: Standard output
- `logs/mochi_[job_id].err`: Standard error
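To follow a run in real time, you can tail the output log (substituting the actual SLURM job id):

```bash
tail -f logs/mochi_<job_id>.out
```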
You can find the training checkpoints in the `os.path.join(args.model_dir, 'checkpoints')` directory. Since we train with FSDP (the model is sharded into multiple parts), saved checkpoints need to be converted to the safetensors format:
```bash
bash ./src/genmo/mochi_preview/convert_checkpoint.sh /path/to/your/xxxxx.ckpt
```
Pass the path to your local checkpoint file as the argument.
After conversion you will have a fine-tuned DiT safetensors file. Use it to replace the original DiT safetensors file, then use it exactly as you would the original Mochi or Pusa model. For example, with Pusa you can generate a video with:
```bash
bash ./demos/cli_test_ti2v_release.sh
```
Set the fine-tuned DiT safetensors file as the checkpoint path in `cli_test_ti2v_release.sh`:
```bash
CHECKPOINTS=(
    "<path_to_finetuned_dit_safetensors_file>"
)
```
For LoRA fine-tuning, refer to the Mochi LoRA training instructions.
Currently, we do not support context parallelism, sequence parallelism, or tensor parallelism, which would make training much more memory-efficient. Contributions implementing these memory-efficient training methods are welcome!
If you use this work in your project, please cite:
```bibtex
@misc{Liu2025pusa,
  title={Pusa: Thousands Timesteps Video Diffusion Model},
  author={Yaofang Liu and Rui Liu},
  year={2025},
  url={https://github.com/Yaofang-Liu/Pusa-VidGen},
}

@article{liu2024redefining,
  title={Redefining Temporal Modeling in Video Diffusion: The Vectorized Timestep Approach},
  author={Liu, Yaofang and Ren, Yumeng and Cun, Xiaodong and Artola, Aitor and Liu, Yang and Zeng, Tieyong and Chan, Raymond H and Morel, Jean-michel},
  journal={arXiv preprint arXiv:2410.03160},
  year={2024}
}
```