
Conversation

@felipemello1 (Contributor) commented on Jun 13, 2024

Context

  • add a new feature
  • fix a bug
  • update tests and/or documentation
  • other (please add here)

Added the image transforms for CLIP.

Builder with the model-specific values in:
torchtune/models/clip/_model_builders.py

CLIP-specific use of transforms in:
torchtune/models/clip/_transforms.py

Vision transforms in:
torchtune/modules/transforms/vision

Every transform is a function.

Algorithm TLDR (see the sketch after this list):

  1. Gets the list of possible resolutions based on tile_size and max_num_tiles;
  2. Finds the resolution that best fits the image;
  3. Resizes without distortion (preserving aspect ratio);
  4. Pads;
  5. Normalizes;
  6. tile_crop -> output is of shape [num_tiles, 3, tile_size, tile_size].
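
For readers who want the steps end to end, here is a compressed, self-contained sketch of the pipeline (torchvision-based; the canvas-selection rule, normalization stats, and helper structure are simplified illustrations, not the PR's exact code):

```python
import math
import torch
import torchvision.transforms.v2.functional as F

def clip_transform_sketch(image: torch.Tensor, tile_size: int = 224, max_num_tiles: int = 4) -> torch.Tensor:
    """Illustrative pipeline: canvas search -> best fit -> resize -> pad -> normalize -> tile_crop."""
    _, h, w = image.shape  # [3, H, W], float in [0, 1]

    # 1. Possible resolutions: tile grids (r x c) with r * c <= max_num_tiles.
    canvases = [(r * tile_size, c * tile_size)
                for r in range(1, max_num_tiles + 1)
                for c in range(1, max_num_tiles + 1)
                if r * c <= max_num_tiles]

    # 2. Best fit: smallest canvas that holds the image without downscaling,
    #    otherwise the canvas that requires the least downscaling (simplified rule).
    def limiting_scale(canvas):
        return min(canvas[0] / h, canvas[1] / w)
    fitting = [c for c in canvases if limiting_scale(c) >= 1]
    best_h, best_w = min(fitting, key=limiting_scale) if fitting else max(canvases, key=limiting_scale)

    # 3. Resize without distortion: one scale factor for both dimensions.
    scale = min(best_h / h, best_w / w)
    new_h = min(math.floor(h * scale), best_h)
    new_w = min(math.floor(w * scale), best_w)
    image = F.resize(image, [new_h, new_w])

    # 4. Pad the right and bottom with zeros up to the canvas size.
    image = F.pad(image, [0, 0, best_w - new_w, best_h - new_h])

    # 5. Normalize with channel-wise stats (standard CLIP values, shown for illustration).
    mean = torch.tensor([0.48145466, 0.4578275, 0.40821073]).view(3, 1, 1)
    std = torch.tensor([0.26862954, 0.26130258, 0.27577711]).view(3, 1, 1)
    image = (image - mean) / std

    # 6. tile_crop -> [num_tiles, 3, tile_size, tile_size].
    tiles = image.unfold(1, tile_size, tile_size).unfold(2, tile_size, tile_size)
    return tiles.permute(1, 2, 0, 3, 4).reshape(-1, 3, tile_size, tile_size)
```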

Changelog

Added torchvision to the requirements

Test plan

Every function is covered by tests, some indirectly. All tests pass.

  • run pre-commit hooks and linters (make sure you've first installed via pre-commit install)
  • add unit tests for any new functionality
  • update docstrings for any new or updated methods or classes
  • run unit tests via pytest tests
  • run recipe tests via pytest tests -m integration_test
  • manually run any new or modified recipes with sufficient proof of correctness
    • include relevant commands and any other artifacts in this summary (pastes of loss curves, eval results, etc.)

docs: [screenshots of the rendered documentation]

pytorch-bot bot commented on Jun 13, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchtune/1084

Note: Links to docs will display an error until the docs builds have been completed.

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot added the CLA Signed label on Jun 13, 2024
@felipemello1 changed the title from "[CLIP][IMAGE TRANSFORMS] Image transforms for clip encoder" to "[CLIP][IMAGE TRANSFORMS] Image transforms for clip encoder - Do not land" on Jun 13, 2024
@felipemello1 changed the title from "[CLIP][IMAGE TRANSFORMS] Image transforms for clip encoder - Do not land" to "Do not land - [CLIP][IMAGE TRANSFORMS] Image transforms for clip encoder" on Jun 13, 2024
@felipemello1 changed the title from "Do not land - [CLIP][IMAGE TRANSFORMS] Image transforms for clip encoder" to "[CLIP][IMAGE TRANSFORMS] Image transforms for clip encoder" on Jun 13, 2024
@felipemello1 changed the title from "[CLIP][IMAGE TRANSFORMS] Image transforms for clip encoder" to "wip [CLIP][IMAGE TRANSFORMS] Image transforms for clip encoder" on Jun 14, 2024
@felipemello1 felipemello1 marked this pull request as draft June 14, 2024 20:39
@pbontrager (Contributor) left a comment

This is a great first pass over adding all the functionality needed for high resolution CLIP image transforms. I've left a lot of comments on the structure of the transforms as you're adding the first ones to the library. After we update these I'll take a pass over the tests and docs.

@felipemello1 changed the title from "wip [CLIP][IMAGE TRANSFORMS] Image transforms for clip encoder" to "[CLIP][IMAGE TRANSFORMS] Image transforms for clip encoder" on Jun 25, 2024
If False, pick the canvas that minimizes downscaling, including no downscaling at all.
Returns:
Tuple[int, int]: The best resolution to fit the image into.
Examples:
Member commented:

Nit: These examples are not rendering correctly in docs.

Collaborator commented:

I think you need to add newlines between Args, Returns, and Examples, or add a colon.

Suggested change:
- Examples:
+ Examples::
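
For reference, a minimal sketch of a docstring that Sphinx/Napoleon renders correctly — the double colon opens a literal block, and blank lines separate the sections (body and values are illustrative, drawn from the excerpt above):

```python
def get_canvas_best_fit(image, possible_resolutions, resize_to_max_canvas):
    """Finds the canvas that best fits the image.

    Args:
        image (torch.Tensor): Image tensor in the format [..., H, W].
        possible_resolutions (torch.Tensor): Candidate (height, width) canvases.
        resize_to_max_canvas (bool): If True, pick the largest fitting canvas.
            If False, pick the canvas that minimizes downscaling, including no
            downscaling at all.

    Returns:
        Tuple[int, int]: The best resolution to fit the image into.

    Examples::

        >>> get_canvas_best_fit(image, possible_resolutions, resize_to_max_canvas=False)
        (224, 448)
    """
```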

If None, will upscale up to target_size.
Returns:
torch.Tensor: The resized and padded image tensor in the format [..., H, W].
Examples:
Member commented:

nit: These examples are not rendering correctly in docs.

@felipemello1 felipemello1 marked this pull request as ready for review June 25, 2024 01:32
@RdoubleA (Collaborator) left a comment

Looks awesome overall, thanks for all the work you put into this. One overall question I had is the distinction between the CLIP transform which lives in torchtune/models and the other transforms in modules/transforms/vision. Will only the model transforms be classes with __call__ and all the transforms that live in modules/transforms/ be normal functions?

Also, what is the expected approach for a user to make a new model transform? I had originally thought there would be common image transforms that they can just Compose together in the model builder, but I see that CLIPImageTransform is a bit more involved than just piping transforms. So is this the overall user flow we want to enforce:

  • all general transforms are standalone functions in modules
  • to make a new model transform, user needs to define a class that uses the general transform functions in any way they want
  • then they create a builder function that parametrizes the class and can be instantiated in the config

from torchtune.models.clip._transforms import CLIPImageTransform

def clip_vit_336_transform():
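
To make that flow concrete, a hypothetical completion of the builder (parameter names and values are assumptions for illustration; the real ones live in torchtune/models/clip/_model_builders.py):

```python
from torchtune.models.clip._transforms import CLIPImageTransform

def clip_vit_336_transform():
    # Values below are illustrative, not the merged defaults.
    return CLIPImageTransform(
        image_mean=[0.48145466, 0.4578275, 0.40821073],  # standard CLIP stats
        image_std=[0.26862954, 0.26130258, 0.27577711],
        tile_size=336,
        possible_resolutions=None,  # derived from max_num_tiles when None
        max_num_tiles=4,
        resample="bilinear",
        resize_to_max_canvas=True,
    )
```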

Collaborator commented:

is there a paper ref for these numbers that we can add to the docstring?

>>> get_canvas_best_fit(image, possible_resolutions, resize_to_max_canvas=False)
(224, 448)

In the example above, we calculate the scaling factors for each possible resolution
Collaborator commented:

or maybe these lines in between the code need to be tabbed back
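
As a sanity check on the intended rendering, a minimal usage sketch (import path follows the PR's torchtune/modules/transforms/vision layout; shapes and output are illustrative):

```python
import torch
from torchtune.modules.transforms.vision import find_supported_resolutions, get_canvas_best_fit

image = torch.rand(3, 200, 300)  # [C, H, W]
possible_resolutions = find_supported_resolutions(max_num_tiles=4, tile_size=224)
best_fit = get_canvas_best_fit(image, possible_resolutions, resize_to_max_canvas=False)
print(best_fit)  # e.g. (224, 448): the canvas that minimizes downscaling
```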

@RdoubleA (Collaborator) left a comment

No major concerns on my end - thanks for pushing this through. You might have to wait until recipe tests are fixed on main before landing this.

Also curious if we plan to add model transforms to the api_ref_models.rst, cc @pbontrager. But this can be a follow-up.

@pbontrager (Contributor) left a comment

Everything looks good here. I'll approve this, but I think you should update the check for possible_resolutions to explicitly check for None.

), f"Either possible_resolutions or max_num_tiles must be given. Got {possible_resolutions=} and {max_num_tiles=}"

# If possible_resolutions are not given, then calculate possible ones based on max_num_tiles
if not possible_resolutions and max_num_tiles:
Contributor commented:

You have to explicitly check for None, otherwise 0 or empty tuples resolve as false

@felipemello1 (Author) replied:

If possible_resolutions is None, empty, or anything else that is not a list with items, we must hit this condition to find possible_resolutions.
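
For context, a standalone illustration of the truthiness behavior under discussion (not the PR's code):

```python
for value in (None, [], 0):
    if not value:
        # `not` treats None, empty containers, and 0 the same way...
        print(f"{value!r}: truthiness check fires")
    if value is None:
        # ...while an explicit None check fires only when the argument was omitted.
        print(f"{value!r}: None check fires")
```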

@felipemello1 (Author) replied with a screenshot.

    possible_resolutions = find_supported_resolutions(
        max_num_tiles=max_num_tiles, tile_size=tile_size
    )
else:
Contributor commented:

This else doesn't make sense to me

@felipemello1 felipemello1 merged commit 06a125e into main Jul 5, 2024
@felipemello1 felipemello1 deleted the image_transforms branch July 5, 2024 19:57
@felipemello1 felipemello1 restored the image_transforms branch July 5, 2024 19:58
Aditya-dom added a commit to Aditya-dom/torchtune that referenced this pull request Jul 6, 2024
[CLIP][IMAGE TRANSFORMS] Image transforms for clip encoder (meta-pytorch#1084)
) -> torch.Tensor:
"""
Places the image at the top left of the canvas and pads with 0 the right and bottom
to fit to the target resolution. If target_size < image_size, it will crop the image.
Contributor commented:

> if target_size < image_size, it will crop the image.

From _get_max_res_without_distortion, it seems like we always resize such that the image remains within bounds of target_size - is cropping still required?

    new_height = min(math.floor(original_height * scale_w), target_height)
else:
    new_height = target_height
    new_width = min(math.floor(original_width * scale_h), target_width)
Contributor commented:

Could this be simplified to:

new_width = min(math.floor(original_width * scale_h), target_width)
new_height = min(math.floor(original_height * scale_w), target_height)
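
For reference, a self-contained sketch of the computation with the suggested simplification folded in — a single scale factor from the limiting dimension (structure mirrors the PR's helper but is illustrative, not the merged code):

```python
import math
from typing import Tuple

def _get_max_res_without_distortion(
    image_size: Tuple[int, int], target_size: Tuple[int, int]
) -> Tuple[int, int]:
    """Largest (height, width) fitting in target_size while preserving aspect ratio."""
    original_height, original_width = image_size
    target_height, target_width = target_size

    # The limiting dimension determines a single scale factor.
    scale = min(target_height / original_height, target_width / original_width)

    # min() with the target guards against floating-point overshoot.
    new_height = min(math.floor(original_height * scale), target_height)
    new_width = min(math.floor(original_width * scale), target_width)
    return new_height, new_width
```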

maximegmd pushed a commit to maximegmd/torchtune that referenced this pull request Jul 13, 2024
FlamingoPg pushed a commit to FlamingoPg/sgl-tune-eagle that referenced this pull request May 26, 2025

Labels: CLA Signed
