[C++] Handle only relevant slices of child arrays when hashing scalars from ListArrays #35830

@felipecrv

Describe the enhancement requested

The issue is illustrated by the Python code below:

import pyarrow as pa

a = pa.array([
    [{'a': 5}, {'a': 6}],
    [{'a': 7}, None]
])
b = pa.array([
    [{'a': 7}, None]
])

# a[1] and b[0] are each represented as a 2-element slice of a child array of struct values.
# The slices start at different offsets, but the values compare as equal.
assert a[1] == b[0]

# Logically equal values should hash to the same value, so hashing the child array
# should start at the list's offset and not at 0, as is done by default.
hash1 = hash(a[1])
hash2 = hash(b[0])
assert hash1 == hash2
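
To make the offset problem concrete without depending on Arrow internals, here is a minimal pure-Python sketch. `ToyListArray`, `slice_from_zero` and `slice_from_offset` are hypothetical names made up for illustration; they are not Arrow or pyarrow APIs, and the real C++ hashing code is structured differently. The sketch only assumes what is described above: a list scalar is a slice of a shared child array, and the hash should cover just that slice.

```python
from dataclasses import dataclass
from typing import Any, List, Tuple


@dataclass
class ToyListArray:
    """Hypothetical stand-in for a list array: offsets into one flat child array."""
    offsets: List[int]
    child: List[Any]


def slice_from_zero(arr: ToyListArray, i: int) -> Tuple[Any, ...]:
    # Buggy variant: uses the right length but always starts at child[0].
    length = arr.offsets[i + 1] - arr.offsets[i]
    return tuple(arr.child[:length])


def slice_from_offset(arr: ToyListArray, i: int) -> Tuple[Any, ...]:
    # Intended variant: only the child values that belong to list i.
    return tuple(arr.child[arr.offsets[i]:arr.offsets[i + 1]])


# Same logical data as `a` and `b` above, with each struct reduced to its 'a' field.
a = ToyListArray(offsets=[0, 2, 4], child=[5, 6, 7, None])
b = ToyListArray(offsets=[0, 2], child=[7, None])

# Starting at the offset, both scalars see the same child values and hash alike.
assert slice_from_offset(a, 1) == slice_from_offset(b, 0)
assert hash(slice_from_offset(a, 1)) == hash(slice_from_offset(b, 0))

# Starting at 0, a[1] is hashed over the child values of a[0] instead.
assert slice_from_zero(a, 1) == (5, 6)
assert slice_from_zero(b, 0) == (7, None)
```

In this toy model the two scalars are equal, yet the from-zero variant feeds different child values into the hash for each of them, which is the mismatch the asserts in the pyarrow snippet are meant to catch.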

#35814 fixes the bug for lists of structs, but the same bug might exist for other nested types:

  • struct
  • sparse union
  • dense union
  • run-end encoded
  • more?

Component(s)

C++
