Describe the bug
When I create a dataset with a float64 feature and then apply numpy formatting, the returned numpy arrays are silently downcast to float32.
Steps to reproduce the bug
import datasets
dataset = datasets.Dataset.from_dict({'a': [1.0, 2.0, 3.0]}).with_format("numpy")
print("feature dtype:", dataset.features['a'].dtype)
print("array dtype:", dataset['a'].dtype)
output:
feature dtype: float64
array dtype: float32
Expected behavior
feature dtype: float64
array dtype: float64
Environment info
- datasets version: 2.8.0
- Platform: Linux-5.4.0-135-generic-x86_64-with-glibc2.29
- Python version: 3.8.10
- PyArrow version: 10.0.1
- Pandas version: 1.4.4
Suggested Fix
Changing the _tensorize function of the numpy formatter to
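def _tensorize(self, value):
    # Strings, bytes and None are returned untouched.
    if isinstance(value, (str, bytes, type(None))):
        return value
    # String-typed numpy scalars and arrays are returned untouched.
    elif isinstance(value, (np.character, np.ndarray)) and np.issubdtype(value.dtype, np.character):
        return value
    # Numeric numpy scalars already carry their own dtype.
    elif isinstance(value, np.number):
        return value
    # Everything else is converted without forcing a default dtype.
    return np.asarray(value, **self.np_array_kwargs)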
fixes this particular issue for me. I'm not sure whether this would break other tests. It should also avoid unnecessary copying of the array.
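The dtype and copying behaviour can be illustrated with plain NumPy, independent of datasets. This is only a minimal sketch: it assumes the formatter ends up calling np.asarray with a float32 dtype on float64 data (an assumption about the internals, not something verified here), and the array `source` is made-up stand-in data.

```python
import numpy as np

# Illustrative float64 array standing in for a decoded column (hypothetical data).
source = np.array([1.0, 2.0, 3.0], dtype=np.float64)

# Forcing a different dtype converts the values and allocates a new array.
forced = np.asarray(source, dtype=np.float32)

# Without a dtype argument, np.asarray hands back the input array itself.
kept = np.asarray(source)

print("forced:", forced.dtype, "same object:", forced is source)
print("kept:  ", kept.dtype, "same object:", kept is source)
```

If the assumption holds, dropping the forced dtype both preserves float64 and skips the extra allocation, which matches the two effects described above.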