
Commit 439abe9

linoytsaban and sayakpaul authored and committed
klein lora training scripts (#3)
* initial commit
* initial commit
* remove remote text encoder
* initial commit
* initial commit
* initial commit
* revert
* img2img fix
* text encoder + tokenizer
* text encoder + tokenizer
* update readme
* guidance
* guidance
* guidance
* test
* test
* revert changes not needed for the non klein model
* Update examples/dreambooth/train_dreambooth_lora_flux2_klein.py
  Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
* fix guidance
* fix validation
* fix validation
* fix validation
* fix path
* space

Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
1 parent eb84d50 commit 439abe9

File tree: 5 files changed, +4204 −23 lines changed

examples/dreambooth/README_flux2.md

Lines changed: 107 additions & 22 deletions
@@ -1,14 +1,22 @@
# DreamBooth training example for FLUX.2 [dev] and FLUX 2 [klein]

[DreamBooth](https://huggingface.co/papers/2208.12242) is a method to personalize image generation models given just a few (3~5) images of a subject/concept.
[LoRA](https://huggingface.co/docs/peft/conceptual_guides/adapter#low-rank-adaptation-lora) is a popular parameter-efficient fine-tuning technique that allows you to achieve full-finetuning-like performance with a fraction of learnable parameters.

The `train_dreambooth_lora_flux2.py` and `train_dreambooth_lora_flux2_klein.py` scripts show how to implement the training procedure for [LoRAs](https://huggingface.co/blog/lora) and adapt it for [FLUX.2 [dev]](https://huggingface.co/black-forest-labs/FLUX.2-dev) and [FLUX 2 [klein]](https://huggingface.co/black-forest-labs/FLUX.2-klein).

> [!NOTE]
> **Model Variants**
>
> We support two FLUX model families:
> - **FLUX.2 [dev]**: The full-size model using Mistral Small 3.1 as the text encoder. Very capable but memory intensive.
> - **FLUX 2 [klein]**: Available in 4B and 9B parameter variants, using Qwen VL as the text encoder. Much more memory efficient and suitable for consumer hardware.

> [!NOTE]
> **Memory consumption**
>
> FLUX.2 [dev] can be quite expensive to run on consumer hardware, and as a result finetuning it comes with high memory requirements -
> a LoRA with a rank of 16 can exceed XXGB of VRAM for training. FLUX 2 [klein] models (4B and 9B) are significantly more memory-efficient alternatives. Below we provide some tips and tricks to reduce memory consumption during training.

> For more tips & guidance on training on a resource-constrained device and general good practices please check out these great guides and trainers for FLUX:
> 1) [`@bghira`'s guide](https://github.com/bghira/SimpleTuner/blob/main/documentation/quickstart/FLUX2.md)
@@ -17,7 +25,7 @@ The `train_dreambooth_lora_flux2.py` script shows how to implement the training
> [!NOTE]
> **Gated model**
>
> As the model is gated, before using it with diffusers you first need to go to the [FLUX.2 [dev] Hugging Face page](https://huggingface.co/black-forest-labs/FLUX.2-dev), fill in the form and accept the gate. Once you are in, you need to log in so that your system knows you've accepted the gate. Use the command below to log in:

```bash
hf auth login
@@ -88,23 +96,32 @@ snapshot_download(

This will also allow us to push the trained LoRA parameters to the Hugging Face Hub platform.

As mentioned, Flux2 LoRA training is *very* memory intensive (especially for FLUX.2 [dev]). Here are memory optimizations we can use (some still experimental) for more memory-efficient training:

## Memory Optimizations
> [!NOTE] many of these techniques complement each other and can be used together to further reduce memory consumption.
> However, some techniques may be mutually exclusive, so be sure to check before launching a training run.

### Remote Text Encoder
FLUX.2 [dev] uses Mistral Small 3.1 as its text encoder, which is quite large and can take up a lot of memory. To mitigate this, we can use the `--remote_text_encoder` flag to enable remote computation of the prompt embeddings using the HuggingFace Inference API.
This way, the text encoder model is not loaded into memory during training.

> [!IMPORTANT]
> **Remote text encoder is only supported for FLUX.2 [dev]**. FLUX 2 [klein] models use the Qwen VL text encoder and do not support remote text encoding.

> [!NOTE]
> To enable remote text encoding you must either be logged in to your HuggingFace account (`hf auth login`) OR pass a token with `--hub_token`.

### FSDP Text Encoder
FLUX.2 [dev] uses Mistral Small 3.1 as its text encoder, which is quite large and can take up a lot of memory. To mitigate this, we can use the `--fsdp_text_encoder` flag to enable distributed computation of the prompt embeddings.
This way, the memory cost is distributed across multiple nodes.

### CPU Offloading
To offload parts of the model to CPU memory, you can use the `--offload` flag. This will offload the VAE and text encoder to CPU memory and only move them to GPU when needed.

### Latent Caching
Pre-encode the training images with the VAE, and then delete it to free up some memory. To enable latent caching, simply pass `--cache_latents`.

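The idea behind latent caching can be sketched in a few lines. This is a simplified illustration, not the script's implementation: `encode` is a hypothetical stand-in for the VAE encoder, and the "images" are just lists of pixel values.

```python
def encode(image):
    # Stand-in for the expensive VAE encode step; here a dummy transform.
    return [px / 255.0 for px in image]

def cache_latents(images):
    # Run the encoder exactly once per training image, up front.
    return [encode(img) for img in images]

images = [[0, 128, 255], [64, 64, 64]]
latent_cache = cache_latents(images)
encode = None  # the encoder can now be deleted / moved off the GPU

# The training loop reads precomputed latents instead of re-encoding.
for latents in latent_cache:
    assert all(0.0 <= v <= 1.0 for v in latents)
```

Since the VAE is only needed for this one pass, its memory is freed for the rest of training.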
### QLoRA: Low Precision Training with Quantization
Perform low precision training using 8-bit or 4-bit quantization to reduce memory usage. You can use the following flags:
- **FP8 training** with `torchao`:
@@ -114,22 +131,29 @@ enable FP8 training by passing `--do_fp8_training`.
- **NF4 training** with `bitsandbytes`:
Alternatively, you can use 8-bit or 4-bit quantization with `bitsandbytes` by passing:
`--bnb_quantization_config_path` to enable 4-bit NF4 quantization.

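The memory saving of 4-bit quantization comes from storing each weight as a 4-bit index into a small table of levels. The sketch below is a simplified illustration, not the bitsandbytes implementation: real NF4 uses non-uniform levels derived from the normal distribution, whereas this uses evenly spaced ones.

```python
# 16 evenly spaced levels in [-1, 1]; a 4-bit index selects one of them.
levels = [i / 7.5 - 1.0 for i in range(16)]

def quantize(w):
    # Store only the 4-bit index of the nearest level.
    return min(range(16), key=lambda i: abs(levels[i] - w))

def dequantize(idx):
    return levels[idx]

weights = [-0.9, -0.2, 0.0, 0.33, 0.99]
restored = [dequantize(quantize(w)) for w in weights]

# Round-trip error is at most half the level spacing (1/15).
assert all(abs(w - r) <= 1 / 15 + 1e-9 for w, r in zip(weights, restored))
```

Each weight thus costs 4 bits instead of 16 (bf16) or 32 (fp32), roughly a 4-8x reduction for the quantized tensors.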
### Gradient Checkpointing and Accumulation
* `--gradient_accumulation_steps` refers to the number of update steps to accumulate before performing a backward/update pass.
By passing a value > 1 you can reduce the number of backward/update passes and hence also the memory requirements.
* With `--gradient_checkpointing` we can save memory by not storing all intermediate activations during the forward pass.
Instead, only a subset of these activations (the checkpoints) are stored and the rest is recomputed as needed during the backward pass. Note that this comes at the expense of a slower backward pass.

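Gradient accumulation can be sketched with a toy one-parameter model. This is an illustration of the technique only (a hypothetical quadratic loss, not the training script): gradients from several micro-batches are averaged before a single optimizer step.

```python
def grad(x, w):
    # Toy per-example gradient of the loss (w - x)^2 / 2 w.r.t. w.
    return w - x

def train(data, w, accum_steps, lr):
    updates = 0
    accum = 0.0
    for i, x in enumerate(data):
        # Accumulate the averaged gradient instead of updating right away.
        accum += grad(x, w) / accum_steps
        if (i + 1) % accum_steps == 0:
            w -= lr * accum  # one optimizer step per accum_steps micro-batches
            accum = 0.0
            updates += 1
    return w, updates

data = [1.0, 2.0, 3.0, 4.0]
w, updates = train(data, w=0.0, accum_steps=4, lr=1.0)
assert updates == 1            # 4 micro-batches, a single update
assert abs(w - 2.5) < 1e-9     # same result as one batch of size 4
```

The update matches what a single batch of size 4 would produce, while only one micro-batch of activations is resident at a time.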
### 8-bit Adam Optimizer
When training with `AdamW` (doesn't apply to `prodigy`) you can pass `--use_8bit_adam` to reduce the memory requirements of training.
Make sure to install `bitsandbytes` if you want to do so.

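The saving comes from the two per-parameter Adam states (the moving averages of the gradient and its square). A rough back-of-the-envelope sketch, using an arbitrary example parameter count (not a FLUX figure), and ignoring the small per-block scale factors that 8-bit optimizers also store:

```python
params = 100_000_000               # example trainable parameter count (hypothetical)
fp32_adam_bytes = params * 2 * 4   # two states (m, v) at 4 bytes each
int8_adam_bytes = params * 2 * 1   # two states at 1 byte each with 8-bit Adam

# 8-bit Adam cuts optimizer-state memory roughly 4x.
assert fp32_adam_bytes // int8_adam_bytes == 4
```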
### Image Resolution
An easy way to mitigate some of the memory requirements is through `--resolution`. `--resolution` refers to the resolution of the input images; all the images in the train/validation dataset are resized to this.
Note that by default, images are resized to a resolution of 512, but it's good to keep in mind in case you're accustomed to training on higher resolutions.

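Why resolution matters so much: activation memory grows roughly with the number of pixels, i.e. quadratically in `--resolution` (a rough rule of thumb, not an exact profile of the model):

```python
def pixels(resolution):
    # Square training images: resolution x resolution pixels.
    return resolution * resolution

# Doubling --resolution quadruples the pixel count,
# and with it (roughly) the activation memory.
assert pixels(1024) / pixels(512) == 4.0
```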
### Precision of saved LoRA layers
By default, trained transformer layers are saved in the precision dtype in which training was performed. E.g. when mixed-precision training is enabled with `--mixed_precision="bf16"`, final finetuned layers will be saved in `torch.bfloat16` as well.
This reduces memory requirements significantly w/o a significant quality loss. Note that if you do wish to save the final layers in float32 at the expense of more memory usage, you can do so by passing `--upcast_before_saving`.
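The size difference is simple dtype arithmetic: bf16 stores 2 bytes per value and fp32 stores 4, so upcasting roughly doubles the saved file. The parameter count below is a hypothetical example, not a measured FLUX LoRA size:

```python
lora_params = 50_000_000            # example LoRA parameter count (hypothetical)
bf16_mb = lora_params * 2 / 2**20   # 2 bytes per value in bf16
fp32_mb = lora_params * 4 / 2**20   # 4 bytes per value with --upcast_before_saving

assert fp32_mb == 2 * bf16_mb
```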

## Training Examples

### FLUX.2 [dev] Training
To perform DreamBooth with LoRA on FLUX.2 [dev], run:
```bash
export MODEL_NAME="black-forest-labs/FLUX.2-dev"
export INSTANCE_DIR="dog"
@@ -161,13 +185,84 @@ accelerate launch train_dreambooth_lora_flux2.py \
  --push_to_hub
```

### FLUX 2 [klein] Training

FLUX 2 [klein] models are more memory-efficient alternatives available in 4B and 9B parameter variants. They use the Qwen VL text encoder instead of Mistral Small 3.1.

> [!NOTE]
> The `--remote_text_encoder` flag is **not supported** for FLUX 2 [klein] models. The Qwen VL text encoder must be loaded locally, but offloading is still supported.

**FLUX 2 [klein] 4B:**

```bash
export MODEL_NAME="black-forest-labs/FLUX.2-klein-4B"
export INSTANCE_DIR="dog"
export OUTPUT_DIR="trained-flux2-klein-4b"

accelerate launch train_dreambooth_lora_flux2_klein.py \
  --pretrained_model_name_or_path=$MODEL_NAME \
  --instance_data_dir=$INSTANCE_DIR \
  --output_dir=$OUTPUT_DIR \
  --do_fp8_training \
  --gradient_checkpointing \
  --cache_latents \
  --instance_prompt="a photo of sks dog" \
  --resolution=1024 \
  --train_batch_size=1 \
  --guidance_scale=1 \
  --use_8bit_adam \
  --gradient_accumulation_steps=4 \
  --optimizer="adamW" \
  --learning_rate=1e-4 \
  --report_to="wandb" \
  --lr_scheduler="constant" \
  --lr_warmup_steps=100 \
  --max_train_steps=500 \
  --validation_prompt="A photo of sks dog in a bucket" \
  --validation_epochs=25 \
  --seed="0" \
  --push_to_hub
```

**FLUX 2 [klein] 9B:**

```bash
export MODEL_NAME="black-forest-labs/FLUX.2-klein-9B"
export INSTANCE_DIR="dog"
export OUTPUT_DIR="trained-flux2-klein-9b"

accelerate launch train_dreambooth_lora_flux2_klein.py \
  --pretrained_model_name_or_path=$MODEL_NAME \
  --instance_data_dir=$INSTANCE_DIR \
  --output_dir=$OUTPUT_DIR \
  --do_fp8_training \
  --gradient_checkpointing \
  --cache_latents \
  --instance_prompt="a photo of sks dog" \
  --resolution=1024 \
  --train_batch_size=1 \
  --guidance_scale=1 \
  --use_8bit_adam \
  --gradient_accumulation_steps=4 \
  --optimizer="adamW" \
  --learning_rate=1e-4 \
  --report_to="wandb" \
  --lr_scheduler="constant" \
  --lr_warmup_steps=100 \
  --max_train_steps=500 \
  --validation_prompt="A photo of sks dog in a bucket" \
  --validation_epochs=25 \
  --seed="0" \
  --push_to_hub
```

To better track our training experiments, we're using the following flags in the command above:

* `report_to="wandb"` will ensure the training runs are tracked on [Weights and Biases](https://wandb.ai/site). To use it, be sure to install `wandb` with `pip install wandb`. Don't forget to call `wandb login <your_api_key>` before training if you haven't done it before.
* `validation_prompt` and `validation_epochs` allow the script to do a few validation inference runs. This lets us qualitatively check whether the training is progressing as expected.

> [!NOTE]
> If you want to train using long prompts, you can use `--max_sequence_length` to set the token limit. Note that this will use more resources and may slow down the training in some cases.
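One reason longer limits cost more: self-attention over the prompt scales quadratically with token count (a rough cost model, not a profile of the actual text encoder):

```python
def attn_pairs(seq_len):
    # Query-key pairs in one self-attention map over the prompt tokens.
    return seq_len * seq_len

# Quadrupling the token limit grows the attention cost 16x.
assert attn_pairs(512) / attn_pairs(128) == 16.0
```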

### FSDP on the transformer
By setting the accelerate configuration with FSDP, the transformer block will be wrapped automatically. E.g. set the configuration to:
@@ -189,12 +284,6 @@ fsdp_config:
  fsdp_cpu_ram_efficient_loading: false
```

### Prodigy Optimizer
Prodigy is an adaptive optimizer that dynamically adjusts the learning rate for learned parameters based on past gradients, allowing for more efficient convergence.
By using prodigy we can "eliminate" the need for manual learning rate tuning. Read more [here](https://huggingface.co/blog/sdxl_lora_advanced_script#adaptive-optimizers).
@@ -206,8 +295,6 @@ to use prodigy, first make sure to install the prodigyopt library: `pip install
> [!TIP]
> When using prodigy it's generally good practice to set `--learning_rate=1.0`

```bash
export MODEL_NAME="black-forest-labs/FLUX.2-dev"
export INSTANCE_DIR="dog"
@@ -271,13 +358,10 @@ the exact modules for LoRA training. Here are some examples of target modules yo
> keep in mind that while training more layers can improve quality and expressiveness, it also increases the size of the output LoRA weights.

## Training Image-to-Image

Flux.2 lets us perform image editing as well as image generation. We provide a simple script for image-to-image (I2I) LoRA fine-tuning in [train_dreambooth_lora_flux2_img2img.py](./train_dreambooth_lora_flux2_img2img.py) for both T2I and I2I. The optimizations discussed above apply to this script, too.

**Important**
To make sure you can successfully run the latest version of the image-to-image example script, we highly recommend installing from source, specifically from the commit mentioned below. To do this, execute the following steps in a new virtual environment:

@@ -334,5 +418,6 @@ we've added aspect ratio bucketing support which allows training on images with
To enable aspect ratio bucketing, pass the `--aspect_ratio_buckets` argument with a semicolon-separated list of height,width pairs, such as:

`--aspect_ratio_buckets="672,1568;688,1504;720,1456;752,1392;800,1328;832,1248;880,1184;944,1104;1024,1024;1104,944;1184,880;1248,832;1328,800;1392,752;1456,720;1504,688;1568,672"`
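The bucket-string format above is easy to work with programmatically. A minimal sketch (a shortened bucket list for brevity; the nearest-aspect-ratio selection rule here is an illustration, the training script's exact assignment may differ):

```python
buckets_arg = "672,1568;688,1504;720,1456;1024,1024;1456,720;1504,688;1568,672"

def parse_buckets(arg):
    # "h,w;h,w;..." -> list of (height, width) tuples.
    return [tuple(int(v) for v in pair.split(",")) for pair in arg.split(";")]

def nearest_bucket(height, width, buckets):
    # Pick the bucket whose aspect ratio best matches the image.
    ratio = height / width
    return min(buckets, key=lambda hw: abs(hw[0] / hw[1] - ratio))

buckets = parse_buckets(buckets_arg)
assert nearest_bucket(1000, 1000, buckets) == (1024, 1024)
assert nearest_bucket(700, 1500, buckets) == (688, 1504)
```

Training on bucketed aspect ratios avoids cropping or distorting non-square images into a single fixed resolution.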

Since Flux.2 finetuning is still in an experimental phase, we encourage you to explore different settings and share your insights! 🤗
