Fix group offloading with block_level and use_stream=True #11375
Merged
Conversation
Member
I did some testing and we get the following numbers:

No record_stream:

=== System Memory Stats (Before encode prompt) ===
Total system memory: 1999.99 GB
Available system memory: 1942.53 GB
=== CUDA Memory Stats Before encode prompt ===
Current allocated: 0.00 GB
Max allocated: 0.00 GB
Current reserved: 0.00 GB
Max reserved: 0.00 GB
=== System Memory Stats (After encode prompt) ===
Total system memory: 1999.99 GB
Available system memory: 1932.83 GB
=== CUDA Memory Stats After encode prompt ===
Current allocated: 15.05 GB
Max allocated: 15.05 GB
Current reserved: 15.29 GB
Max reserved: 15.29 GB
=== System Memory Stats (Before transformer.) ===
Total system memory: 1999.99 GB
Available system memory: 1917.84 GB
=== CUDA Memory Stats Before transformer. ===
Current allocated: 0.10 GB
Max allocated: 0.10 GB
Current reserved: 0.10 GB
Max reserved: 0.10 GB
=== System Memory Stats (After loading transformer.) ===
Total system memory: 1999.99 GB
Available system memory: 1880.56 GB
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████| 50/50 [03:30<00:00, 4.20s/it]
latents.shape=torch.Size([1, 16, 128, 128])
=== CUDA Memory Stats After inference with transformer. ===
Current allocated: 0.10 GB
Max allocated: 0.10 GB
Current reserved: 5.68 GB
Max reserved: 5.68 GB
record_stream:

=== System Memory Stats (start) ===
Total system memory: 1999.99 GB
Available system memory: 1941.94 GB
=== CUDA Memory Stats start ===
Current allocated: 0.00 GB
Max allocated: 0.00 GB
Current reserved: 0.00 GB
Max reserved: 0.00 GB
=== System Memory Stats (Before encode prompt) ===
Total system memory: 1999.99 GB
Available system memory: 1940.32 GB
=== CUDA Memory Stats Before encode prompt ===
Current allocated: 0.00 GB
Max allocated: 0.00 GB
Current reserved: 0.00 GB
Max reserved: 0.00 GB
=== System Memory Stats (After encode prompt) ===
Total system memory: 1999.99 GB
Available system memory: 1930.62 GB
=== CUDA Memory Stats After encode prompt ===
Current allocated: 15.05 GB
Max allocated: 15.05 GB
Current reserved: 15.29 GB
Max reserved: 15.29 GB
=== System Memory Stats (Before transformer.) ===
Total system memory: 1999.99 GB
Available system memory: 1915.65 GB
=== CUDA Memory Stats Before transformer. ===
Current allocated: 0.10 GB
Max allocated: 0.10 GB
Current reserved: 0.10 GB
Max reserved: 0.10 GB
=== System Memory Stats (After loading transformer.) ===
Total system memory: 1999.99 GB
Available system memory: 1883.74 GB
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████| 50/50 [03:14<00:00, 3.89s/it]
latents.shape=torch.Size([1, 16, 128, 128])
=== CUDA Memory Stats After inference with transformer. ===
Current allocated: 0.10 GB
Max allocated: 0.10 GB
Current reserved: 4.30 GB
Max reserved: 4.30 GB
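For reference, here is a minimal sketch of a helper that prints stats in the format above, assuming `psutil` and `torch` are available; `print_memory_stats` is a hypothetical name, and the actual benchmarking script is not shown in this thread.

```python
# Hypothetical helper; the real script behind the numbers above is not in the thread.
import psutil
import torch

GiB = 1024**3

def print_memory_stats(tag: str) -> None:
    vm = psutil.virtual_memory()
    print(f"=== System Memory Stats ({tag}) ===")
    print(f"Total system memory: {vm.total / GiB:.2f} GB")
    print(f"Available system memory: {vm.available / GiB:.2f} GB")
    print(f"=== CUDA Memory Stats {tag} ===")
    print(f"Current allocated: {torch.cuda.memory_allocated() / GiB:.2f} GB")
    print(f"Max allocated: {torch.cuda.max_memory_allocated() / GiB:.2f} GB")
    print(f"Current reserved: {torch.cuda.memory_reserved() / GiB:.2f} GB")
    print(f"Max reserved: {torch.cuda.max_memory_reserved() / GiB:.2f} GB")
```

Environment used for the numbers above: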
- 🤗 Diffusers version: 0.34.0.dev0
- Platform: Linux-5.15.0-1048-aws-x86_64-with-glibc2.31
- Running on Google Colab?: No
- Python version: 3.10.14
- PyTorch version (GPU?): 2.8.0.dev20250417+cu126 (True)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Huggingface_hub version: 0.30.2
- Transformers version: 4.52.0.dev0
- Accelerate version: 1.4.0.dev0
- PEFT version: 0.15.2.dev0
- Bitsandbytes version: 0.45.3
- Safetensors version: 0.4.5
- xFormers version: not installed
- Accelerator: NVIDIA H100 80GB HBM3, 81559 MiB
- Using GPU in script?: <fill in>
- Using distributed or parallel set-up in script?: <fill in>
a-r-r-o-w commented Apr 21, 2025
sayakpaul approved these changes Apr 21, 2025

sayakpaul (Member) left a comment:
Thanks for adding the test! Just two comments.
Contributor (Author):
Failing tests seem unrelated.
DN6 reviewed Apr 27, 2025
option only matters when using streamed CPU offloading (i.e. `use_stream=True`). This can be useful when
the CPU memory is a bottleneck but may counteract the benefits of using streams.
"""
if stream is not None and num_blocks_per_group != 1:
DN6 (Collaborator):
This is potentially breaking, no? What if there is existing code with num_blocks_per_group > 1 and stream=True? If so, it might be better to raise a warning and set num_blocks_per_group to 1 if stream is True.
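For illustration, the warn-and-clamp variant being suggested could look roughly like this (a sketch only; `_validate_num_blocks_per_group` is a hypothetical name, and the merged code may differ):

```python
import logging

logger = logging.getLogger(__name__)

def _validate_num_blocks_per_group(num_blocks_per_group: int, stream) -> int:
    # Instead of raising, warn and force num_blocks_per_group to 1 when a
    # stream is in use, so existing callers keep working.
    if stream is not None and num_blocks_per_group != 1:
        logger.warning(
            "Using streams is only supported with num_blocks_per_group=1, but got "
            f"num_blocks_per_group={num_blocks_per_group}. Overriding it to 1."
        )
        return 1
    return num_blocks_per_group
```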
Fixes #11307
The previous implementation assumed that the layers were instantiated in order of invocation. This is not true for HiDream (caption projection layers are instantiated after transformer layers).
The new implementation makes sure to first capture invocation order and then apply group offloading. In the case of `use_stream=True`, it does not really make sense to onload more than 1 block at a time, so we also now raise an error if `num_blocks_per_group != 1` when `use_stream=True`.

Another possible fix would be to simply move the initialization of the caption layers above the transformer blocks.
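To illustrate the idea behind the fix (a hedged sketch, not the actual diffusers implementation; `capture_invocation_order` is a hypothetical helper): the invocation order can be recorded with forward pre-hooks during a dry run, and offload groups can then be built in execution order instead of instantiation order.

```python
import torch
import torch.nn as nn

def capture_invocation_order(model: nn.Module, *example_args):
    # Record each leaf submodule the first time its forward is entered.
    order, handles = [], []

    def attach_hook(module):
        def hook(mod, args):
            if mod not in order:
                order.append(mod)
        return module.register_forward_pre_hook(hook)

    for submodule in model.modules():
        if len(list(submodule.children())) == 0:  # leaf modules only
            handles.append(attach_hook(submodule))

    with torch.no_grad():
        model(*example_args)  # dry run triggers the hooks in execution order

    for handle in handles:
        handle.remove()
    return order  # offload groups can now be built following this order
```

Because pre-hooks fire when a module's forward is entered, the recorded order matches runtime execution even when modules are instantiated out of order, which is exactly the HiDream case described above.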
@sayakpaul @asomoza Could you verify if this fixes it for you?