Add LongCat T2V (Base, Distillation and Refinement) Support to FastVideo #883

Merged
SolitaryThinker merged 134 commits into hao-ai-lab:main from FoundationResearch:main on Dec 23, 2025
alexzms (Collaborator) commented Nov 18, 2025

Summary

This PR integrates LongCat-Video into FastVideo as a first-class text-to-video (T2V) pipeline, including:

  • Native LongCat DiT model config and registration
  • A LongCat pipeline that supports base 480p generation
  • Distillation via LoRA
  • 480p → 720p refinement (two-stage pipeline)
  • Optional Block Sparse Attention (BSA) and 3D RoPE support
  • Utilities for checkpoint conversion

Key Changes

1. Model & Pipeline Configuration

  • fastvideo/configs/models/dits/longcat.py

    • Defines LongCatVideoArchConfig and LongCatVideoConfig for the native LongCat DiT.
    • Adds parameter name mappings to convert official LongCat weights to FastVideo naming (embedders, AdaLN, self/cross-attn, FFN, final layer).
    • Exposes BSA-related fields and video-specific settings (3D patches, caption channels, etc.).
  • fastvideo/configs/pipelines/longcat.py

    • Adds LongCatDiTArchConfig for Phase 1 wrapper compatibility.
    • Adds LongCatT2V480PConfig (base 480p pipeline) and LongCatT2V704PConfig (704p refinement with BSA enabled).
  • fastvideo/configs/models/dits/__init__.py / fastvideo/models/registry.py / fastvideo/configs/pipelines/registry.py / fastvideo/pipelines/pipeline_registry.py

    • Wires LongCat configs and models into the existing registry:
      • Registers LongCat DiT classes and LongCatPipeline.
      • Adds pipeline detection and fallback under the "longcat" key.
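
The parameter name mappings described above can be pictured as an ordered list of regex rewrites applied to each checkpoint key. The following is only a minimal sketch under that assumption: the patterns, the target names, and the helpers `map_param_name` and `convert_state_dict` are hypothetical and are not copied from fastvideo/configs/models/dits/longcat.py.

```python
import re

# Hypothetical (pattern, replacement) pairs for illustration only.
# The actual table in the PR covers embedders, AdaLN, self/cross-attn,
# FFN and the final layer, and its names will differ from these.
PARAM_NAME_MAPPING = [
    (r"^blocks\.(\d+)\.attn\.qkv\.(weight|bias)$", r"blocks.\1.self_attn.qkv.\2"),
    (r"^blocks\.(\d+)\.ffn\.w1\.(weight|bias)$", r"blocks.\1.ffn.fc_in.\2"),
    (r"^final_layer\.linear\.(weight|bias)$", r"final_layer.proj.\1"),
]


def map_param_name(name):
    """Return the rewritten name for the first matching rule, else the name unchanged."""
    for pattern, replacement in PARAM_NAME_MAPPING:
        new_name, num_subs = re.subn(pattern, replacement, name)
        if num_subs:
            return new_name
    return name


def convert_state_dict(state_dict):
    """Rename every key of a checkpoint dict, refusing silent overwrites."""
    converted = {}
    for name, tensor in state_dict.items():
        new_name = map_param_name(name)
        if new_name in converted:
            raise ValueError(f"two source keys map to {new_name!r}")
        converted[new_name] = tensor
    return converted
```

Keeping the rules as data rather than code means the same table could drive both runtime weight loading and an offline conversion script.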

2. LongCat Pipeline & Stages (Base + Refinement)

  • fastvideo/pipelines/basic/longcat/longcat_pipeline.py

    • Implements LongCatPipeline as a composed pipeline with LoRA support.
    • Assembles stages for:
      • text encoding, timestep prep, latent prep, denoising, decoding
      • LongCat-specific refine stages (LongCatRefineInitStage, LongCatRefineTimestepStage).
    • Enables runtime BSA configuration from pipeline config / CLI and propagates parameters to transformer blocks.
  • fastvideo/pipelines/stages/longcat_refine_init.py

    • Initializes 480p → 720p refinement:
      • Loads stage1 video (path or in-memory frames).
      • Upsamples spatially/temporally, applies temporal padding compatible with VAE/BSA.
      • VAE-encodes, normalizes latents, and mixes with noise according to t_thresh.
      • Stores padding metadata in batch for later cropping.
  • fastvideo/pipelines/stages/longcat_refine_timestep.py

    • Builds LongCat refinement timesteps starting at t_thresh and updates scheduler timesteps/sigmas accordingly.
  • fastvideo/pipelines/stages/longcat_denoising.py

    • LongCat-specific denoising loop:
      • Batched CFG with CFG-zero optimal guidance scale.
      • Negates noise_pred to match flow-matching scheduler convention.
  • fastvideo/pipelines/pipeline_batch_info.py / fastvideo/configs/sample/base.py

    • Extends ForwardBatch and SamplingParam with LongCat refine fields: refine_from, t_thresh, spatial_refine_only, num_cond_frames, stage1_video.
    • Adds corresponding CLI args (--refine-from, --t-thresh, etc.).
  • fastvideo/pipelines/stages/latent_preparation.py / fastvideo/pipelines/stages/decoding.py

    • Adjusts latent scaling for pre-initialized latents in refine mode (no double init_noise_sigma).
    • Crops extra padded frames after decoding using refine padding metadata.
  • fastvideo/pipelines/stages/utils.py

    • Adds aspect-ratio bucket tables and get_bucket_config() used to select resolutions for 480p/720p LongCat.
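
To make the refinement steps above (longcat_refine_init.py and longcat_refine_timestep.py) more concrete, here is a minimal sketch of the two numerical pieces: mixing encoded latents with noise at t_thresh, and building a schedule that starts at t_thresh. It assumes the usual flow-matching interpolation x_t = (1 - t) * x_0 + t * noise and a linear schedule; the PR's actual stages operate on tensors and may use a different schedule. The function names `mix_with_noise` and `refine_sigmas` are hypothetical.

```python
import random


def mix_with_noise(latents, t_thresh, rng):
    """Blend clean latents with Gaussian noise at noise level t_thresh.

    Assumes the flow-matching interpolation x_t = (1 - t) * x_0 + t * noise.
    `latents` is a flat list of floats standing in for a latent tensor.
    """
    return [(1.0 - t_thresh) * x + t_thresh * rng.gauss(0.0, 1.0) for x in latents]


def refine_sigmas(t_thresh, num_steps):
    """Linear noise levels from t_thresh down to 0 (num_steps + 1 values).

    Assumption: a linear schedule; the real stage may space steps differently.
    """
    return [t_thresh * (1.0 - i / num_steps) for i in range(num_steps + 1)]


if __name__ == "__main__":
    rng = random.Random(0)
    stage1_latents = [0.2, -1.0, 0.5]  # stand-in for VAE-encoded, normalized latents
    noisy = mix_with_noise(stage1_latents, t_thresh=0.5, rng=rng)
    print(len(noisy), refine_sigmas(0.5, 5))
```

Because denoising starts at t_thresh rather than at 1.0, the refine pass keeps the coarse content of the stage-1 video and only re-synthesizes detail below that noise level.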

3. RoPE, LoRA & BSA Triton kernel support

  • fastvideo/layers/rotary_embedding_3d.py

    • Adds 3D RoPE implementation for video transformers, splitting head dim into (T, H, W) components.
  • fastvideo/layers/lora/linear.py / fastvideo/pipelines/lora_pipeline.py / fastvideo/fastvideo_args.py

    • Implements LoRA alpha scaling in merge logic (the merged delta is scaled by alpha = lora_alpha / rank).
    • Teaches LoRAPipeline to parse/store *.lora_alpha alongside lora_A, lora_B.
    • Introduces CLI flags for loading LoRA adapters (--lora-path, --lora-nickname, --lora-target-modules).
  • fastvideo/third_party/longcat_video/block_sparse_attention/*

    • Integrates LongCat’s Triton-based Block Sparse Attention kernels and helper utilities (including p2p communication for context parallelism).
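
The LoRA alpha scaling mentioned above amounts to merging W' = W + (lora_alpha / rank) * (B @ A). The sketch below illustrates that arithmetic on plain nested lists. It is not the code in fastvideo/layers/lora/linear.py; `merge_lora` and `matmul` are hypothetical helpers, and falling back to a scale of 1.0 when no lora_alpha is stored is an assumption.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(weight, lora_A, lora_B, lora_alpha=None):
    """Return weight + scale * (lora_B @ lora_A).

    Shapes: weight is (out, in), lora_A is (rank, in), lora_B is (out, rank).
    scale = lora_alpha / rank; if lora_alpha is missing we assume scale 1.0.
    """
    rank = len(lora_A)
    scale = 1.0 if lora_alpha is None else lora_alpha / rank
    delta = matmul(lora_B, lora_A)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]


if __name__ == "__main__":
    base = [[0.0, 0.0], [0.0, 0.0]]
    A = [[1.0, 2.0]]       # rank 1, in_features 2
    B = [[1.0], [3.0]]     # out_features 2, rank 1
    print(merge_lora(base, A, B, lora_alpha=2.0))
```

Without the scale factor, an adapter trained with lora_alpha different from its rank would be merged at the wrong strength, which is why the adapter's *.lora_alpha entries need to be parsed alongside lora_A and lora_B.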

4. Checkpoint Conversion, Inference Scripts

  • scripts/checkpoint_conversion/longcat_to_fastvideo.py

    • Converts official LongCat-Video checkpoints into a FastVideo-compatible layout using the new param name mappings.
  • scripts/inference/v1_inference_longcat*.sh

    • Example scripts for:
      • 480p LongCat T2V generation
      • 480p distillation with LoRA
      • 480p → 720p refinement from an existing video (with BSA + refinement LoRA)

Tests

  • Base LongCat T2V: Ran generation on all prompts under assets/ and verified successful 480p video outputs for each prompt.
  • Distill + Refine pipeline: Used a representative prompt to test the full two-stage flow:
    1. 480p distilled generation, and
    2. 480p → 720p refinement from the generated video.

Known limitations:

  • Distill and refine are currently triggered by two separate scripts and must be chained manually. If needed, we plan to add a follow-up patch that unifies both stages into a single end-to-end command.
  • We observe that 2-GPU generation is currently slower than 1-GPU generation for the LongCat pipeline; this performance issue is under investigation.

alexzms (Collaborator, Author) commented Dec 16, 2025

Implemented the SSIM tests and verified them on the 4x L40S platform. The thresholds are currently set to 0.93 for distill/base and 0.90 for refine. Please let me know whether these values match expectations or need tuning.

alexzms (Collaborator, Author) commented Dec 23, 2025

The SSIM tests are still timing out (>60 mins) even with reduced steps. I suggest merging LongCat T2V without the SSIM tests for now, and I'll handle the SSIM efficiency optimization in a separate PR.

SolitaryThinker merged commit 8f1e6c3 into hao-ai-lab:main on Dec 23, 2025
1 check passed
shijiew555 pushed a commit to Gary-ChenJL/FastVideo that referenced this pull request Apr 8, 2026
…deo (hao-ai-lab#883)

Co-authored-by: Shao Duan <shaoxiongduan@gmail.com>
RandNMR73 pushed a commit that referenced this pull request Apr 8, 2026
…deo (#883)

Co-authored-by: Shao Duan <shaoxiongduan@gmail.com>