<!-- Copyright 2026 The NYU Vision-X and HuggingFace Teams. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->

# AutoencoderRAE

The Representation Autoencoder (RAE) model was introduced in [Diffusion Transformers with Representation Autoencoders](https://huggingface.co/papers/2510.11690) by Boyang Zheng, Nanye Ma, Shengbang Tong, and Saining Xie from NYU VISIONx.

RAE combines a frozen pretrained vision encoder (DINOv2, SigLIP2, or MAE) with a trainable ViT-MAE-style decoder. In the two-stage RAE training recipe, the autoencoder is trained in stage 1 (reconstruction), and then a diffusion model is trained on the resulting latent space in stage 2 (generation).
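The stage-2 idea can be sketched with a toy flow-matching training step on RAE-shaped latents. This is only an illustrative sketch: the `denoiser` below is a hypothetical one-layer stand-in for the diffusion transformer the paper trains, and the latents are randomly generated rather than produced by a real encoder.

```python
import torch

# Hypothetical stand-in for the stage-2 diffusion transformer.
denoiser = torch.nn.Conv2d(768, 768, kernel_size=1)

latents = torch.randn(2, 768, 16, 16)   # simulated stage-1 RAE latents
t = torch.rand(2, 1, 1, 1)              # per-sample timestep in [0, 1]
noise = torch.randn_like(latents)

# Linear interpolation between data and noise (flow-matching style),
# with the velocity (noise - latents) as the regression target.
noisy = (1 - t) * latents + t * noise
pred = denoiser(noisy)
loss = torch.nn.functional.mse_loss(pred, noise - latents)
loss.backward()
```

In practice the denoiser is a DiT-style transformer conditioned on the timestep; the point here is only that stage 2 operates entirely in the latent space produced by the frozen RAE encoder.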
The following RAE models are released and supported in Diffusers:

| Model | Encoder | Latent shape (224px input) |
|:------|:--------|:---------------------------|
| [`nyu-visionx/RAE-dinov2-wReg-base-ViTXL-n08`](https://huggingface.co/nyu-visionx/RAE-dinov2-wReg-base-ViTXL-n08) | DINOv2-base | 768 x 16 x 16 |
| [`nyu-visionx/RAE-dinov2-wReg-base-ViTXL-n08-i512`](https://huggingface.co/nyu-visionx/RAE-dinov2-wReg-base-ViTXL-n08-i512) | DINOv2-base (512px) | 768 x 32 x 32 |
| [`nyu-visionx/RAE-dinov2-wReg-small-ViTXL-n08`](https://huggingface.co/nyu-visionx/RAE-dinov2-wReg-small-ViTXL-n08) | DINOv2-small | 384 x 16 x 16 |
| [`nyu-visionx/RAE-dinov2-wReg-large-ViTXL-n08`](https://huggingface.co/nyu-visionx/RAE-dinov2-wReg-large-ViTXL-n08) | DINOv2-large | 1024 x 16 x 16 |
| [`nyu-visionx/RAE-siglip2-base-p16-i256-ViTXL-n08`](https://huggingface.co/nyu-visionx/RAE-siglip2-base-p16-i256-ViTXL-n08) | SigLIP2-base | 768 x 16 x 16 |
| [`nyu-visionx/RAE-mae-base-p16-ViTXL-n08`](https://huggingface.co/nyu-visionx/RAE-mae-base-p16-ViTXL-n08) | MAE-base | 768 x 16 x 16 |

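The latent shapes in the table follow a simple rule: the channel count equals the encoder's hidden width (e.g. 768 for the base encoders, 384 for small, 1024 for large), and the spatial grid is the input resolution divided by the encoder's patch size. The helper below is only an illustrative sketch of that arithmetic, not a Diffusers API.

```python
def latent_shape(hidden_dim: int, image_size: int, patch_size: int) -> tuple:
    """Latent shape (C, H, W) from encoder width, input size, and patch size."""
    grid = image_size // patch_size
    return (hidden_dim, grid, grid)

# A patch-14 encoder such as DINOv2-base at 224px input:
print(latent_shape(768, 224, 14))  # (768, 16, 16)
# A patch-16 encoder such as SigLIP2-base at 256px input:
print(latent_shape(768, 256, 16))  # (768, 16, 16)
```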
## Loading a pretrained model

```python
from diffusers import AutoencoderRAE

model = AutoencoderRAE.from_pretrained(
    "nyu-visionx/RAE-dinov2-wReg-base-ViTXL-n08"
).to("cuda").eval()
```

## Encoding and decoding a real image

```python
import torch
from diffusers import AutoencoderRAE
from diffusers.utils import load_image
from torchvision.transforms.functional import to_tensor, to_pil_image

model = AutoencoderRAE.from_pretrained(
    "nyu-visionx/RAE-dinov2-wReg-base-ViTXL-n08"
).to("cuda").eval()

image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/cat.png")
image = image.convert("RGB").resize((224, 224))
x = to_tensor(image).unsqueeze(0).to("cuda")  # (1, 3, 224, 224), values in [0, 1]

with torch.no_grad():
    latents = model.encode(x).latent      # (1, 768, 16, 16)
    recon = model.decode(latents).sample  # (1, 3, 256, 256)

recon_image = to_pil_image(recon[0].clamp(0, 1).cpu())
recon_image.save("recon.png")
```
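To put a number on reconstruction quality, you can compute the peak signal-to-noise ratio (PSNR) between the input and the reconstruction. The helper below is a generic sketch, not part of the Diffusers API; with the example above you would call it as `psnr(x, recon)` after resizing one tensor to match the other.

```python
import torch

def psnr(x: torch.Tensor, y: torch.Tensor, max_val: float = 1.0) -> torch.Tensor:
    """Peak signal-to-noise ratio in dB; higher means a closer reconstruction."""
    mse = torch.mean((x - y) ** 2)
    return 10.0 * torch.log10(max_val ** 2 / mse)
```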

## Latent normalization

Some pretrained checkpoints include per-channel `latents_mean` and `latents_std` statistics for normalizing the latent space. When present, `encode` and `decode` automatically apply the normalization and denormalization, respectively.

```python
import torch
from diffusers import AutoencoderRAE

model = AutoencoderRAE.from_pretrained(
    "nyu-visionx/RAE-dinov2-wReg-base-ViTXL-n08"
).to("cuda").eval()

# `x` is an image batch in [0, 1], prepared as in the example above.
# Latent normalization is handled automatically inside encode/decode
# when the checkpoint config includes latents_mean/latents_std.
with torch.no_grad():
    latents = model.encode(x).latent  # normalized latents
    recon = model.decode(latents).sample
```
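If you need to apply the statistics yourself, for example when working with latents saved to disk, per-channel normalization is a straightforward broadcast over the channel axis. The functions and toy statistics below are an illustrative sketch, not Diffusers APIs; the real `latents_mean`/`latents_std` come from the checkpoint config.

```python
import torch

def normalize_latents(latents, mean, std):
    # Per-channel stats of shape (C,) broadcast over latents of shape (B, C, H, W).
    return (latents - mean.view(1, -1, 1, 1)) / std.view(1, -1, 1, 1)

def denormalize_latents(latents, mean, std):
    return latents * std.view(1, -1, 1, 1) + mean.view(1, -1, 1, 1)

latents = torch.randn(1, 768, 16, 16)
mean, std = torch.zeros(768), 2.0 * torch.ones(768)  # toy statistics
roundtrip = denormalize_latents(normalize_latents(latents, mean, std), mean, std)
```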

## AutoencoderRAE

[[autodoc]] AutoencoderRAE
  - encode
  - decode
  - all

## DecoderOutput

[[autodoc]] models.autoencoders.vae.DecoderOutput