with_format("numpy") silently downcasts float64 to float32 features #5517

@ernestum

Describe the bug

When I create a dataset with a float64 feature and then apply numpy formatting, the returned numpy arrays are silently downcast to float32.

Steps to reproduce the bug

import datasets
dataset = datasets.Dataset.from_dict({'a': [1.0, 2.0, 3.0]}).with_format("numpy")
print("feature dtype:", dataset.features['a'].dtype)
print("array dtype:", dataset['a'].dtype)

output:

feature dtype: float64
array dtype: float32

Expected behavior

feature dtype: float64
array dtype: float64
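
To show why the difference matters, here is a small NumPy-only sketch, independent of datasets. It round-trips a single value through float32, and the variable names are only illustrative:

```python
import numpy as np

x = np.float64(0.1)
y = np.float32(x)  # same kind of downcast as in the output above

# After the round trip the value is no longer the same number.
print(x == np.float64(y))             # False
print(abs(x - np.float64(y)) < 1e-8)  # True: the error is small but nonzero
```

The error is tiny for a single value, but it is introduced without any warning, which is the concern here.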

Environment info

  • datasets version: 2.8.0
  • Platform: Linux-5.4.0-135-generic-x86_64-with-glibc2.29
  • Python version: 3.8.10
  • PyArrow version: 10.0.1
  • Pandas version: 1.4.4

Suggested Fix

Changing the _tensorize function of the numpy formatter to

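    def _tensorize(self, value):
        # Strings, bytes and None are returned unchanged.
        if isinstance(value, (str, bytes, type(None))):
            return value
        # NumPy string scalars and string arrays are returned unchanged.
        elif isinstance(value, (np.character, np.ndarray)) and np.issubdtype(value.dtype, np.character):
            return value
        # NumPy numeric scalars are returned unchanged.
        elif isinstance(value, np.number):
            return value

        # Everything else is converted using only the user-supplied kwargs.
        return np.asarray(value, **self.np_array_kwargs)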

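fixes this particular issue for me. I am not sure whether this would break other tests. It should also avoid unnecessary copying, since np.asarray does not copy an input that is already an ndarray unless a dtype conversion is requested.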
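
The dtype and copy behaviour that the suggested fix relies on can be checked with plain NumPy. This is a minimal sketch, not the formatter's actual code. It assumes the downcast comes from a float32 dtype being passed during array construction, which I have not confirmed in the datasets source:

```python
import numpy as np

values = np.array([1.0, 2.0, 3.0], dtype=np.float64)

# Assumption: something like this happens when a float32 default is applied.
forced = np.array(values, dtype=np.float32)

# Without a dtype argument, np.asarray keeps float64 and returns the same object.
kept = np.asarray(values)

print(forced.dtype)    # float32
print(kept.dtype)      # float64
print(kept is values)  # True, so no copy was made
```

If a caller passes dtype explicitly through the formatter kwargs, np.asarray would still convert, so an explicit request for float32 should keep working.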
