
Commit de1ae4e

Author: yiyi@huggingface.co
Merge branch 'main' into helios-modular
2 parents 40c0bd1 + 8ec0a5c

File tree

62 files changed: +10131 additions, −62 deletions


docs/source/en/_toctree.yml

Lines changed: 12 additions & 1 deletion

```diff
@@ -194,6 +194,8 @@
       title: Model accelerators and hardware
   - isExpanded: false
     sections:
+      - local: using-diffusers/helios
+        title: Helios
       - local: using-diffusers/consisid
         title: ConsisID
       - local: using-diffusers/sdxl
@@ -350,6 +352,8 @@
       title: FluxTransformer2DModel
     - local: api/models/glm_image_transformer2d
       title: GlmImageTransformer2DModel
+    - local: api/models/helios_transformer3d
+      title: HeliosTransformer3DModel
     - local: api/models/hidream_image_transformer
       title: HiDreamImageTransformer2DModel
     - local: api/models/hunyuan_transformer2d
@@ -456,6 +460,8 @@
       title: AutoencoderKLQwenImage
     - local: api/models/autoencoder_kl_wan
       title: AutoencoderKLWan
+    - local: api/models/autoencoder_rae
+      title: AutoencoderRAE
     - local: api/models/consistency_decoder_vae
       title: ConsistencyDecoderVAE
     - local: api/models/autoencoder_oobleck
@@ -625,7 +631,6 @@
       title: Image-to-image
     - local: api/pipelines/stable_diffusion/inpaint
       title: Inpainting
-
     - local: api/pipelines/stable_diffusion/latent_upscale
       title: Latent upscaler
     - local: api/pipelines/stable_diffusion/ldm3d_diffusion
@@ -674,6 +679,8 @@
       title: ConsisID
     - local: api/pipelines/framepack
       title: Framepack
+    - local: api/pipelines/helios
+      title: Helios
     - local: api/pipelines/hunyuan_video
       title: HunyuanVideo
     - local: api/pipelines/hunyuan_video15
@@ -745,6 +752,10 @@
       title: FlowMatchEulerDiscreteScheduler
     - local: api/schedulers/flow_match_heun_discrete
       title: FlowMatchHeunDiscreteScheduler
+    - local: api/schedulers/helios_dmd
+      title: HeliosDMDScheduler
+    - local: api/schedulers/helios
+      title: HeliosScheduler
     - local: api/schedulers/heun
       title: HeunDiscreteScheduler
     - local: api/schedulers/ipndm
```

docs/source/en/api/loaders/lora.md

Lines changed: 5 additions & 0 deletions

```diff
@@ -23,6 +23,7 @@ LoRA is a fast and lightweight training method that inserts and trains a signifi
 - [`AuraFlowLoraLoaderMixin`] provides similar functions for [AuraFlow](https://huggingface.co/fal/AuraFlow).
 - [`LTXVideoLoraLoaderMixin`] provides similar functions for [LTX-Video](https://huggingface.co/docs/diffusers/main/en/api/pipelines/ltx_video).
 - [`SanaLoraLoaderMixin`] provides similar functions for [Sana](https://huggingface.co/docs/diffusers/main/en/api/pipelines/sana).
+- [`HeliosLoraLoaderMixin`] provides similar functions for [Helios](https://huggingface.co/docs/diffusers/main/en/api/pipelines/helios).
 - [`HunyuanVideoLoraLoaderMixin`] provides similar functions for [HunyuanVideo](https://huggingface.co/docs/diffusers/main/en/api/pipelines/hunyuan_video).
 - [`Lumina2LoraLoaderMixin`] provides similar functions for [Lumina2](https://huggingface.co/docs/diffusers/main/en/api/pipelines/lumina2).
 - [`WanLoraLoaderMixin`] provides similar functions for [Wan](https://huggingface.co/docs/diffusers/main/en/api/pipelines/wan).
@@ -86,6 +87,10 @@ LoRA is a fast and lightweight training method that inserts and trains a signifi
 
 [[autodoc]] loaders.lora_pipeline.SanaLoraLoaderMixin
 
+## HeliosLoraLoaderMixin
+
+[[autodoc]] loaders.lora_pipeline.HeliosLoraLoaderMixin
+
 ## HunyuanVideoLoraLoaderMixin
 
 [[autodoc]] loaders.lora_pipeline.HunyuanVideoLoraLoaderMixin
```
Lines changed: 89 additions & 0 deletions

<!-- Copyright 2026 The NYU Vision-X and HuggingFace Teams. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License. -->

# AutoencoderRAE

The Representation Autoencoder (RAE) model was introduced in [Diffusion Transformers with Representation Autoencoders](https://huggingface.co/papers/2510.11690) by Boyang Zheng, Nanye Ma, Shengbang Tong, and Saining Xie from NYU VISIONx.

RAE combines a frozen pretrained vision encoder (DINOv2, SigLIP2, or MAE) with a trainable ViT-MAE-style decoder. In the two-stage RAE training recipe, the autoencoder is trained in stage 1 (reconstruction), and a diffusion model is then trained on the resulting latent space in stage 2 (generation).

The following RAE models are released and supported in Diffusers:

| Model | Encoder | Latent shape (224px input) |
|:------|:--------|:---------------------------|
| [`nyu-visionx/RAE-dinov2-wReg-base-ViTXL-n08`](https://huggingface.co/nyu-visionx/RAE-dinov2-wReg-base-ViTXL-n08) | DINOv2-base | 768 x 16 x 16 |
| [`nyu-visionx/RAE-dinov2-wReg-base-ViTXL-n08-i512`](https://huggingface.co/nyu-visionx/RAE-dinov2-wReg-base-ViTXL-n08-i512) | DINOv2-base (512px) | 768 x 32 x 32 |
| [`nyu-visionx/RAE-dinov2-wReg-small-ViTXL-n08`](https://huggingface.co/nyu-visionx/RAE-dinov2-wReg-small-ViTXL-n08) | DINOv2-small | 384 x 16 x 16 |
| [`nyu-visionx/RAE-dinov2-wReg-large-ViTXL-n08`](https://huggingface.co/nyu-visionx/RAE-dinov2-wReg-large-ViTXL-n08) | DINOv2-large | 1024 x 16 x 16 |
| [`nyu-visionx/RAE-siglip2-base-p16-i256-ViTXL-n08`](https://huggingface.co/nyu-visionx/RAE-siglip2-base-p16-i256-ViTXL-n08) | SigLIP2-base | 768 x 16 x 16 |
| [`nyu-visionx/RAE-mae-base-p16-ViTXL-n08`](https://huggingface.co/nyu-visionx/RAE-mae-base-p16-ViTXL-n08) | MAE-base | 768 x 16 x 16 |

## Loading a pretrained model

```python
from diffusers import AutoencoderRAE

model = AutoencoderRAE.from_pretrained(
    "nyu-visionx/RAE-dinov2-wReg-base-ViTXL-n08"
).to("cuda").eval()
```

## Encoding and decoding a real image

```python
import torch
from diffusers import AutoencoderRAE
from diffusers.utils import load_image
from torchvision.transforms.functional import to_tensor, to_pil_image

model = AutoencoderRAE.from_pretrained(
    "nyu-visionx/RAE-dinov2-wReg-base-ViTXL-n08"
).to("cuda").eval()

image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/cat.png")
image = image.convert("RGB").resize((224, 224))
x = to_tensor(image).unsqueeze(0).to("cuda")  # (1, 3, 224, 224), values in [0, 1]

with torch.no_grad():
    latents = model.encode(x).latent      # (1, 768, 16, 16)
    recon = model.decode(latents).sample  # (1, 3, 256, 256)

recon_image = to_pil_image(recon[0].clamp(0, 1).cpu())
recon_image.save("recon.png")
```

## Latent normalization

Some pretrained checkpoints include per-channel `latents_mean` and `latents_std` statistics for normalizing the latent space. When present, `encode` and `decode` automatically apply the normalization and denormalization, respectively.

```python
# Reuses the image tensor `x` from the example above.
model = AutoencoderRAE.from_pretrained(
    "nyu-visionx/RAE-dinov2-wReg-base-ViTXL-n08"
).to("cuda").eval()

# Latent normalization is handled automatically inside encode/decode
# when the checkpoint config includes latents_mean/latents_std.
with torch.no_grad():
    latents = model.encode(x).latent  # normalized latents
    recon = model.decode(latents).sample
```
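Conceptually, the normalization applied at encode time (and inverted at decode time) is a per-channel affine transform. A minimal sketch with dummy statistics; the real `latents_mean`/`latents_std` come from the checkpoint config, and the shapes here are illustrative:

```python
import torch

# Dummy per-channel statistics standing in for the checkpoint's
# latents_mean / latents_std; shape (C,), broadcast over (B, C, H, W).
C = 768
latents_mean = torch.randn(C)
latents_std = torch.rand(C) + 0.5  # keep std strictly positive

z = torch.randn(2, C, 16, 16)  # raw encoder features

# encode: normalize each channel
z_norm = (z - latents_mean.view(1, C, 1, 1)) / latents_std.view(1, C, 1, 1)

# decode: invert the normalization before running the decoder
z_denorm = z_norm * latents_std.view(1, C, 1, 1) + latents_mean.view(1, C, 1, 1)

# The round trip recovers the original features up to float error.
assert torch.allclose(z_denorm, z, atol=1e-4)
```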
## AutoencoderRAE

[[autodoc]] AutoencoderRAE
  - encode
  - decode
  - all

## DecoderOutput

[[autodoc]] models.autoencoders.vae.DecoderOutput
Lines changed: 35 additions & 0 deletions

<!-- Copyright 2025 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License. -->

# HeliosTransformer3DModel

A 14B real-time autoregressive Diffusion Transformer model (supporting T2V, I2V, and V2V) for 3D video-like data from [Helios](https://github.com/PKU-YuanGroup/Helios), introduced in [Helios: Real-Time Long Video Generation Model](https://huggingface.co/papers/2603.04379) by Peking University, ByteDance, et al.

The model can be loaded with the following code snippet.

```python
import torch

from diffusers import HeliosTransformer3DModel

# Best quality
transformer = HeliosTransformer3DModel.from_pretrained("BestWishYsh/Helios-Base", subfolder="transformer", torch_dtype=torch.bfloat16)
# Intermediate weight
transformer = HeliosTransformer3DModel.from_pretrained("BestWishYsh/Helios-Mid", subfolder="transformer", torch_dtype=torch.bfloat16)
# Best efficiency
transformer = HeliosTransformer3DModel.from_pretrained("BestWishYsh/Helios-Distilled", subfolder="transformer", torch_dtype=torch.bfloat16)
```
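As a rough sizing note (a back-of-envelope estimate, not a measured figure): at bfloat16 precision, each parameter takes 2 bytes, so the weights of a 14B-parameter transformer alone occupy about 28 GB before any activations or caches:

```python
# Estimate the weight memory of the 14B transformer at bfloat16.
n_params = 14e9       # 14 billion parameters
bytes_per_param = 2   # bfloat16 = 16 bits = 2 bytes
weights_gb = n_params * bytes_per_param / 1e9
print(f"~{weights_gb:.0f} GB for weights alone")  # ~28 GB
```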
## HeliosTransformer3DModel

[[autodoc]] HeliosTransformer3DModel

## Transformer2DModelOutput

[[autodoc]] models.modeling_outputs.Transformer2DModelOutput
