
Request: Add Flash Attention 2.0 Support for ViTMAEForPreTraining #36527

@noelEOS

Description

Hi Hugging Face team!

I am currently pre-training a foundation model with ViTMAEForPreTraining, and I was hoping to use Flash Attention 2.0 to speed up training and reduce memory usage. However, when I attempted to enable Flash Attention, I encountered the following error:

ValueError: ViTMAEForPreTraining does not support Flash Attention 2.0 yet. Please request to add support where the model is hosted, on its model hub page: https://huggingface.co//discussions/new or in the Transformers GitHub repo: https://github.com/huggingface/transformers/issues/new
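For reference, this is roughly how the error is triggered (a minimal sketch; the checkpoint name `facebook/vit-mae-base` is the public base MAE checkpoint and is illustrative here — any ViT-MAE checkpoint hits the same check):

```python
import torch
from transformers import ViTMAEForPreTraining

# Requesting Flash Attention 2.0 at load time raises the ValueError above,
# because ViTMAE does not declare FA2 support (requires a CUDA GPU and the
# flash-attn package once support is added).
model = ViTMAEForPreTraining.from_pretrained(
    "facebook/vit-mae-base",
    torch_dtype=torch.float16,  # FA2 requires fp16/bf16
    attn_implementation="flash_attention_2",
)
```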

Since MAE pre-training is heavily dependent on the attention mechanism, adding Flash Attention support would be a valuable enhancement, especially for larger ViT models and high-resolution datasets such as the Landsat data we are working with.

Feature Request

  • Please add support for Flash Attention 2.0 to ViTMAEForPreTraining.
  • This would help make MAE pre-training more efficient in terms of speed and memory consumption.

Why This Matters

  • Many users working with large imagery datasets (e.g., remote sensing, medical imaging) would greatly benefit from this.
  • Flash Attention has already proven useful in other ViT variants, so bringing this to MAE feels like a natural next step.

Environment Details

  • Transformers version: v4.41.0.dev0
  • PyTorch version: 2.5.1
  • Running on multi-GPU with NCCL backend

Metadata

Assignees

No one assigned

Labels

  • Feature request
  • Flash Attention
  • Good Second Issue
  • Vision
