
[Trainer] Allow passing image processor#29896

Merged
NielsRogge merged 2 commits into huggingface:main from NielsRogge:add_image_processor_to_trainer
Apr 5, 2024

Conversation

@NielsRogge (Collaborator) commented Mar 27, 2024

What does this PR do?

Fixes #29790. Also reported here: https://discuss.huggingface.co/t/vitimageprocessor-object-has-no-attribute-pad/32511.

Currently, passing an image processor to the Trainer is pretty hacky, as users need to do `tokenizer=image_processor`. 🤔 Yes, that's right.

This PR adds an `image_processor` argument to the Trainer, so that the default data collator is used when an image processor is passed.

To do:

  • update the example scripts to no longer use `tokenizer=image_processor`
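The collator-selection behavior described above can be sketched in plain Python. The stub classes below are illustrative assumptions, not the real transformers classes:

```python
# Toy sketch of the collator-selection logic this PR is about. The stub
# classes are assumptions for illustration, not the actual transformers code.

class FakeTokenizer:
    # Tokenizers expose `pad`, so they work with a padding collator.
    def pad(self, features):
        return {"input_ids": [f["input_ids"] for f in features]}

class FakeImageProcessor:
    # Image processors typically have no `pad` method; they should fall
    # back to the default data collator instead.
    pass

def default_data_collator(features):
    return {"pixel_values": [f["pixel_values"] for f in features]}

def pick_collator(tokenizer=None, image_processor=None):
    # Mirrors the idea of the PR: only wrap a padding collator around an
    # object that can actually pad.
    if tokenizer is not None and hasattr(tokenizer, "pad"):
        return lambda features: tokenizer.pad(features)
    return default_data_collator

collator = pick_collator(image_processor=FakeImageProcessor())
batch = collator([{"pixel_values": [1, 2]}, {"pixel_values": [3, 4]}])
print(batch)  # {'pixel_values': [[1, 2], [3, 4]]}
```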

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@NielsRogge NielsRogge requested a review from ArthurZucker March 27, 2024 10:01
@ArthurZucker (Collaborator) left a comment


Very good to have

```python
self.place_model_on_device = False
```

```diff
- default_collator = default_data_collator if tokenizer is None else DataCollatorWithPadding(tokenizer)
+ default_collator = DataCollatorWithPadding(tokenizer) if tokenizer is not None else default_data_collator
```
@ArthurZucker (Collaborator)

This might not be backwards compatible if people pass an ImageProcessor. Let's deprecate passing image processors by checking the type.

@NielsRogge (Collaborator, Author) Mar 27, 2024

If people currently pass an image processor as `tokenizer`, they also need to pass a data collator, otherwise they hit the issue mentioned in #29790 (image processors typically don't have a `pad` method). So this doesn't break backwards compatibility.
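The failure mode described here can be reproduced with stubs. The class names below are simplified stand-ins (assumptions), not the actual transformers implementation:

```python
# Toy reproduction of the error behind #29790: a padding collator built
# around an image processor crashes, because image processors have no
# `pad` method. The classes are simplified stand-ins (assumptions).

class PaddingCollator:
    def __init__(self, tokenizer):
        self.tokenizer = tokenizer

    def __call__(self, features):
        # DataCollatorWithPadding-style behaviour: delegate to `pad`.
        return self.tokenizer.pad(features)

class StubImageProcessor:
    pass  # like ViTImageProcessor, no `pad` attribute

collator = PaddingCollator(StubImageProcessor())
try:
    collator([{"pixel_values": [0.1, 0.2]}])
except AttributeError as err:
    # Mirrors "'ViTImageProcessor' object has no attribute 'pad'"
    print(type(err).__name__)  # AttributeError
```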

@amyeroberts (Contributor)

Can we also enable passing a `feature_extractor` and `processor`?

@NielsRogge (Collaborator, Author)

@ArthurZucker feel free to approve the PR; @amyeroberts I would add those in a separate PR.

@ArthurZucker (Collaborator) left a comment


Okay, let's do a follow-up PR then for the feature extractor and the rest.

@NielsRogge NielsRogge merged commit 1ab7136 into huggingface:main Apr 5, 2024
@daniellok-db (Contributor)

cc @NielsRogge @tomaarsen The addition of the new param to `trainer_callback.CallbackHandler` is currently causing problems for the `setfit.Trainer` class in the SetFit library. Can we either make the param optional, or update `setfit.Trainer` to pass a value for the new param?
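The breakage can be illustrated with a simplified sketch (the signature below is an assumption, not the real `CallbackHandler`): giving the inserted parameter a default keeps old fixed-arity call sites from raising a TypeError, though callers should still move to keyword arguments so positional bindings do not silently shift.

```python
# Simplified sketch (an assumption, not the actual transformers code) of how
# a defaulted new parameter keeps existing CallbackHandler callers working.

class NewCallbackHandler:
    def __init__(self, callbacks, model, tokenizer, image_processor=None,
                 optimizer=None, lr_scheduler=None):
        self.image_processor = image_processor
        self.optimizer = optimizer

# A keyword-based call site is robust to the inserted parameter:
handler = NewCallbackHandler([], model=None, tokenizer=None, optimizer="opt")
print(handler.image_processor, handler.optimizer)  # None opt
```

Purely positional callers would bind their fourth argument to `image_processor` after the change, which is why updating downstream libraries like SetFit is still the safer fix.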

@chenin-wang (Contributor)

```python
self.callback_handler = CallbackHandler(callbacks, self.model, self.model.model_body.tokenizer, None, None, None)
```

> Can we add enabling passing a feature_extractor and processor too?

> @ArthurZucker feel free to approve the PR, @amyeroberts I would add those in a separate PR

useful, waiting...

@chenin-wang (Contributor)

> cc @NielsRogge @tomaarsen the addition of the new param to trainer_callback.CallbackHandler is currently causing problems for the setfit.Trainer class in the SetFit library. can we either make the param optional, or update setfit.Trainer to pass a value in for the new param?

```python
self.callback_handler = CallbackHandler(callbacks, self.model, self.model.model_body.tokenizer, None, None, None)
```

@chenin-wang (Contributor)

@NielsRogge @ArthurZucker For the preprocessing classes, I suggest renaming the `tokenizer` parameter instead of adding new parameters. Consider changing `tokenizer` to `processor`.

@NielsRogge (Collaborator, Author)

Yes @chenin-wang, I was also considering this. Rather than having separate attributes for tokenizer, image processor, feature extractor and multimodal processor, it probably makes more sense to have a single argument called `processor`. Although that would be a breaking change, since many people already use `tokenizer=...`.

Will await opinion of Arthur and Amy

@chenin-wang (Contributor)

> Yes @chenin-wang I was also considering this, rather than having various attributes for tokenizer, image processor, feature extractor and multimodal processor it probably makes more sense to just have a single argument called processor. Although that would be a breaking change since many people already use tokenizer=...
>
> Will await opinion of Arthur and Amy

You are right, it would change user habits.

@NielsRogge (Collaborator, Author)

Reflecting more on this, I think the best way is to have a proper deprecation message.

Basically, whenever people pass the `tokenizer` argument, we should add a message saying "the `tokenizer` argument is going to be deprecated in v4.xx of Transformers; update your code to pass `processor` instead."

=> This way people can safely update their code without breaking changes from the start. Will open a follow-up PR for this.
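A minimal sketch of such a deprecation path (the function name, argument names, and warning wording below are illustrative assumptions, not the final implementation):

```python
# Hedged sketch of the proposed deprecation: accept both arguments for a
# transition period, warn on `tokenizer`, and funnel both into one value.
import warnings

def resolve_processor(tokenizer=None, processor=None):
    if tokenizer is not None:
        warnings.warn(
            "The `tokenizer` argument is deprecated and will be removed in a "
            "future version of Transformers; pass `processor` instead.",
            FutureWarning,
        )
        # Keep old behaviour working: fall back to the deprecated argument.
        processor = processor if processor is not None else tokenizer
    return processor
```

Using `FutureWarning` (rather than `DeprecationWarning`) makes the message visible to end users by default, which suits a user-facing API change like this.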

NielsRogge added a commit that referenced this pull request on Apr 9, 2024:

* Undo
* Use tokenizer
* Undo data collator

@chenin-wang (Contributor)

@NielsRogge @amyeroberts The problem remains. #30102 (comment)

@NielsRogge (Collaborator, Author)

Hi @chenin-wang could you clarify your issue?

@chenin-wang (Contributor)

@NielsRogge `tokenizer=image_processor` is a confusing API (as highlighted by several user issues), and with more and more multimodal, audio and vision models it is increasingly out of date; see what amyeroberts said in #30102 (comment).

@NielsRogge (Collaborator, Author) commented Apr 10, 2024

Yes @chenin-wang I agree. So indeed we should continue with #30102. However I'm thinking about the term preprocessor rather than processor, as processor is already used for multimodal processors like CLIP, BLIP-2, etc.

Not sure if people prefer processor=image_processor or preprocessor=image_processor.

Hence I'll ask the question to some HF members and see which term they prefer. Will then proceed.



Development

Successfully merging this pull request may close these issues:

* Missing preprocessor_config.json file after training segformer model

6 participants