
Conversation

@tomaarsen
Member

Hello!

Pull Request overview

  • Prevent TypeError on model.predict when using string labels.
  • Added a test case to show correct behaviour.

Details

When training with string labels (which is not strictly recommended, but possible), model.predict broke as of the latest version. See the following script to reproduce:

Reproduction

```python
from datasets import Dataset
from setfit import SetFitModel, SetFitTrainer

dataset = Dataset.from_dict(
    {"text": ["positive sentence", "negative sentence"], "label": ["positive", "negative"]}
)
model = SetFitModel.from_pretrained("sentence-transformers/paraphrase-albert-small-v2")
trainer = SetFitTrainer(
    model=model,
    train_dataset=dataset,
    eval_dataset=dataset,
    num_iterations=1,
)
trainer.train()
# This used to fail with "TypeError: can't convert np.ndarray of type numpy.str_.
# The only supported types are: float64, float32, float16, complex64, complex128, int64, int32, int16, int8, uint8, and bool."
model.predict(["another positive sentence"])
```

This resulted in

```
Traceback (most recent call last):
  File "[sic]demo_string_issue.py", line 17, in <module>
    model.predict(["another positive sentence"])
  File "[sic]src\setfit\modeling.py", line 419, in predict
    outputs = torch.from_numpy(outputs)
TypeError: can't convert np.ndarray of type numpy.str_. The only supported types are: float64, float32, float16, complex64, complex128, int64, int32, int16, int8, uint8, and bool.
```

See also #329, which shows this same issue, but for evaluate (which calls predict behind the scenes).

Why do we get this error?

Consider the following lines in the predict method:

```python
outputs = self.model_head.predict(embeddings)
if as_numpy and self.has_differentiable_head:
    outputs = outputs.detach().cpu().numpy()
elif not as_numpy and not self.has_differentiable_head:
    outputs = torch.from_numpy(outputs)
return outputs
```

And consider the scenario with the (default) non-differentiable head and as_numpy=False. In this case, we reach line 419 and call torch.from_numpy. However, outputs has dtype <U8, where the U indicates that the type is a unicode string. There is no Torch tensor equivalent of this type, and thus we get the error shown above.
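The failure can be reproduced in isolation, without setfit at all (a minimal sketch; the array contents here are just stand-ins for the head's output):

```python
import numpy as np
import torch

# A string array like the one a scikit-learn head returns for string labels.
outputs = np.array(["positive", "negative"])
print(outputs.dtype)  # <U8: unicode strings of up to 8 characters

try:
    torch.from_numpy(outputs)
except TypeError as err:
    print(f"TypeError: {err}")
```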

The fix

The fix is simply to prevent calling torch.from_numpy if the head outputs a numpy array with strings.
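As a sketch, the guard might look like the following (`maybe_to_tensor` is a hypothetical helper name for illustration, not the actual patch):

```python
import numpy as np
import torch

def maybe_to_tensor(outputs: np.ndarray):
    # Hypothetical helper: only hand numeric/bool arrays to torch.from_numpy;
    # string arrays (dtype kind "U") pass through unchanged.
    if outputs.dtype.kind in "fiub":  # float, signed int, unsigned int, bool
        return torch.from_numpy(outputs)
    return outputs

print(maybe_to_tensor(np.array([0, 1])))          # tensor([0, 1])
print(maybe_to_tensor(np.array(["pos", "neg"])))  # stays a numpy string array
```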

Note

The issue from #329 isn't fully fixed: calling evaluate with string labels still fails, as the evaluate library's accuracy metric does not support string labels. This can be worked around by supplying a different metric, e.g. a function that computes some metric with support for strings.
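For example, a plain Python accuracy function that handles string labels could look like this (`string_accuracy` is a hypothetical name; whether your setfit version accepts a callable metric with this `(y_pred, y_test)` signature is an assumption to verify):

```python
def string_accuracy(y_pred, y_test):
    # Hypothetical metric: fraction of exact matches; works for string labels too.
    return sum(p == t for p, t in zip(y_pred, y_test)) / len(y_test)

print(string_accuracy(["positive", "negative"], ["positive", "positive"]))  # 0.5

# Hypothetically passed to the trainer as:
# trainer = SetFitTrainer(..., metric=string_accuracy)
```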

  • Tom Aarsen: …on non-differentiable heads. This used to error out, especially causing issues for model.evaluate()
@tomaarsen tomaarsen added the bug Something isn't working label Mar 13, 2023
@HuggingFaceDocBuilderDev

HuggingFaceDocBuilderDev commented Mar 13, 2023

The documentation is not available anymore as the PR was closed or merged.

@tomaarsen
Member Author

Test failures are unrelated, solved by #332.

@tomaarsen tomaarsen merged commit 83e3cf9 into huggingface:main Apr 12, 2023
@tomaarsen tomaarsen deleted the hotfix/string_predict_error branch April 12, 2023 11:53