
[C++] Incorrect hash value for Scalars with sliced child data (ignores offset) #35360

@mosalx

Summary

When a pyarrow ListArray or FixedSizeListArray has a struct value type, two scalars that compare equal can end up with different hash values. This violates the contract for Python's __hash__ method, which states: "The only required property is that objects which compare equal have the same hash value".
https://docs.python.org/3/reference/datamodel.html#object.__hash__
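To make the consequence of breaking this contract concrete, here is a minimal pure-Python sketch. `SlicedValue` and `FixedSlicedValue` are hypothetical toy classes, not pyarrow types, and the "hash ignores the offset" behaviour is only a guess at the mechanism based on the issue title, not a description of Arrow's actual code.

```python
class SlicedValue:
    """Toy stand-in for a value viewing part of a shared buffer (hypothetical, not pyarrow)."""

    def __init__(self, data, offset, length):
        self.data = data
        self.offset = offset
        self.length = length

    def logical(self):
        # The values this view actually represents.
        return tuple(self.data[self.offset:self.offset + self.length])

    def __eq__(self, other):
        if not isinstance(other, SlicedValue):
            return NotImplemented
        return self.logical() == other.logical()

    def __hash__(self):
        # Deliberately buggy: reads from index 0 instead of self.offset.
        return hash(tuple(self.data[:self.length]))


class FixedSlicedValue(SlicedValue):
    def __hash__(self):
        # Hash exactly the values that __eq__ compares.
        return hash(self.logical())


view = SlicedValue([10, 20, 30], offset=2, length=1)  # sliced view of a larger buffer
copy = SlicedValue([30], offset=0, length=1)          # compacted copy of the same value

equal = view == copy                  # the two objects compare equal
same_hash = hash(view) == hash(copy)  # but the buggy hash disagrees
found_in_set = view in {copy}         # so set membership fails
```

In this toy model `equal` is true while `same_hash` and `found_in_set` are false, which is the kind of silent lookup failure in sets and dict keys that the contract exists to prevent. Switching to `FixedSlicedValue` restores consistency because the hash then covers the same logical values as the equality check.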

Below is the smallest reproducible example that demonstrates the issue. It uses FixedSizeListArray, but ListArray is affected too.

Environment

Windows 10
python=3.11.2
pyarrow=11.0.0

Details

import pyarrow as pa

# initial array
_type = pa.list_(pa.struct([('a', pa.int32())]), list_size=1)
array = pa.array([[{'a': 1}], [{'a': 1}], [{'a': 1}], None], type=_type)

# Make a deep copy of the last two elements. A full deep copy would copy every array
# buffer and truncate the bytes left unused by the array offset. To keep the example
# small, the `field` array is reused here instead of being copied.
chunk = array[2:]
child = chunk.values[chunk.offset:]  # StructArray
field = child.field('a')  # Int32Array

# create a copy of `child`. Validity buffer could be set to None, it would not change the outcome
validity_buffer_child = pa.array([True, False, False, False, False, False, False, False]).buffers()[1]
child_copy = type(child).from_buffers(child.type, length=len(child), 
                                      buffers=[validity_buffer_child], 
                                      children=[field])
assert child_copy.equals(child)

# create a copy of `chunk` using `child_copy` made above
validity_buffer_chunk = pa.array([True, False, False, False, False, False, False, False]).buffers()[1]
chunk_copy = pa.FixedSizeListArray.from_buffers(type=chunk.type, length=len(chunk), 
                                                buffers=[validity_buffer_chunk], 
                                                children=[child_copy])
assert chunk_copy.equals(chunk)

Now we have two equal arrays whose first element is valid (non-null).
The equality check for the first element passes:

assert chunk_copy[0] == chunk[0]  # Ok

But their hash values differ:

assert hash(chunk_copy[0]) == hash(chunk[0])  # AssertionError
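When checking whether other elements or other array types are affected, a small helper that scans two sequences element by element may be convenient. The sketch below is plain Python; `find_hash_violations` and the `Probe` class are hypothetical names introduced only for illustration, and the helper assumes the elements support `==` and `hash()`.

```python
def find_hash_violations(left, right):
    """Return indices i where left[i] == right[i] but their hashes differ."""
    violations = []
    for i, (a, b) in enumerate(zip(left, right)):
        if a == b and hash(a) != hash(b):
            violations.append(i)
    return violations


class Probe:
    """Toy object whose hash can be set independently of its value (illustration only)."""

    def __init__(self, value, forced_hash):
        self.value = value
        self.forced_hash = forced_hash

    def __eq__(self, other):
        if not isinstance(other, Probe):
            return NotImplemented
        return self.value == other.value

    def __hash__(self):
        return self.forced_hash


# Index 1 holds equal objects with different hashes; the other pairs are consistent.
left = [1, Probe("a", 1), "x"]
right = [1, Probe("a", 2), "x"]
bad_indices = find_hash_violations(left, right)
```

Assuming that iterating a pyarrow array yields its scalars, calling `find_hash_violations(chunk_copy, chunk)` on the arrays built above should flag index 0 on an affected version.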

Component(s)

Python
