Summary
When a pyarrow ListArray or FixedSizeListArray has a struct value type, two scalars that compare equal can end up with different hash values. This violates the contract for Python's hash function, which states: "The only required property is that objects which compare equal have the same hash value"
https://docs.python.org/3/reference/datamodel.html#object.__hash__
Below is the smallest reproducible example that demonstrates this issue. The example uses FixedSizeListArray, but ListArray is affected too.
Environment
Windows 10
python=3.11.2
pyarrow=11.0.0
Details
import pyarrow as pa
# initial array
_type = pa.list_(pa.struct([('a', pa.int32())]), list_size=1)
array = pa.array([[{'a': 1}], [{'a': 1}], [{'a': 1}], None], type=_type)
# Make a deep copy of the last two elements. A full deep copy would copy every array buffer
# and truncate the bytes left unused because of the array offset. To keep the example small,
# not all buffers are copied here (the `field` array is reused as-is).
chunk = array[2:]
child = chunk.values[chunk.offset:] # StructArray
field = child.field('a') # Int32Array
# create a copy of `child`. Validity buffer could be set to None, it would not change the outcome
validity_buffer_child = pa.array([True, False, False, False, False, False, False, False]).buffers()[1]
child_copy = type(child).from_buffers(child.type, length=len(child),
                                      buffers=[validity_buffer_child],
                                      children=[field])
assert child_copy.equals(child)
# create a copy of `chunk` using `child_copy` made above
validity_buffer_chunk = pa.array([True, False, False, False, False, False, False, False]).buffers()[1]
chunk_copy = pa.FixedSizeListArray.from_buffers(type=chunk.type, length=len(chunk),
                                                buffers=[validity_buffer_chunk],
                                                children=[child_copy])
assert chunk_copy.equals(chunk)
Now we have two equal arrays whose first element is valid (non-null).
The equality check for the first element passes:
assert chunk_copy[0] == chunk[0] # Ok
But their hash values are different:
assert hash(chunk_copy[0]) == hash(chunk[0]) # AssertionError
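To illustrate why this matters in practice, here is a minimal sketch that does not use pyarrow at all. `BadScalar` and its `salt` attribute are hypothetical stand-ins, not pyarrow types: the class compares by value but hashes on something unrelated to the value, which is the same contract violation described above. It does not model the actual cause inside pyarrow.

```python
class BadScalar:
    """Hypothetical stand-in for a scalar whose hash disagrees with equality."""

    def __init__(self, value, salt):
        self.value = value
        self.salt = salt  # assumption: some detail that equality ignores

    def __eq__(self, other):
        # Equality looks only at the value.
        return isinstance(other, BadScalar) and self.value == other.value

    def __hash__(self):
        # Deliberately broken: the hash depends on `salt`, not on `value`.
        return self.salt


a = BadScalar(1, salt=1)
b = BadScalar(1, salt=2)

equal = a == b                    # the two objects compare equal
same_hash = hash(a) == hash(b)    # but their hashes differ
found_in_set = b in {a}           # so a set lookup misses the equal object
distinct_count = len({a, b})      # and a set keeps both "equal" objects
```

With such objects, sets and dict keys stop deduplicating or finding equal values, which is presumably the kind of breakage to expect when the affected pyarrow scalars are used as set members or dict keys.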
Component(s)
Python