Skip to content

GH-35360: [C++] Take offset into account in ScalarHashImpl::ArrayHash()#35814

Merged
pitrou merged 15 commits intoapache:mainfrom
felipecrv:hash_scalar_fix
Jun 1, 2023
Merged

GH-35360: [C++] Take offset into account in ScalarHashImpl::ArrayHash()#35814
pitrou merged 15 commits intoapache:mainfrom
felipecrv:hash_scalar_fix

Conversation

@felipecrv
Copy link
Copy Markdown
Contributor

@felipecrv felipecrv commented May 30, 2023

Rationale for this change

A fix for #35360.

What changes are included in this PR?

  • A hash function that can hash bitmaps
  • The fix for hashes of equal scalars sometimes not being equal because of offsets

Are these changes tested?

Yes. By unit tests.

Are there any user-facing changes?

No.

@felipecrv felipecrv requested a review from AlenkaF as a code owner May 30, 2023 02:35
@github-actions
Copy link
Copy Markdown

Copy link
Copy Markdown
Member

@pitrou pitrou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @felipecrv . I posted some comments below.

Comment thread cpp/src/arrow/util/hashing.cc Outdated
/// \param seed The seed for the hash function (useful when chaining hash functions).
/// \param bits_offset The offset in bits relative to the start of the bitmap.
/// \param num_bits The number of bits after the offset to be hashed.
uint64_t MurmurHashBitmap64A(const uint8_t* key, uint64_t seed, uint64_t bits_offset,
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We already vendor XXHash64, why add MurmurHash as well?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is MurmurHash modified to take bits starting on non-byte boundaries.

Comment thread cpp/src/arrow/util/hashing.h Outdated
/// \param seed The seed for the hash function (useful when chaining hash functions).
/// \param bits_offset The offset in bits relative to the start of the bitmap.
/// \param num_bits The number of bits after the offset to be hashed.
ARROW_EXPORT hash_t ComputeBitmapHash(const uint8_t* bitmap, int64_t length, hash_t seed,
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why pass length in addition to num_bits?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, no need to pass a seed since we can easily combine separate hash values.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The length in bytes allows me to DCHECK if the it's safe to read that number of bits within that range of bytes. I can remove it if you don't think this check is worth it.

Also, no need to pass a seed since we can easily combine separate hash values.

This way of combining hashes is part of the MurmurHash design and it is inspired by how block ciphers are chained to guarantee that a single bit change early in the input has a huge effect on the output (Avalanche Effect).

The xor we are using to combine hashes in our hash functions is modifying only the LSB bits (biasing them), probably making the hash output worse as a key for hash tables.

       Type::Hash: 0xfc0a5d5d3a4a5d05
          StdHash: 0xfc0a5d5d3a4a5d04 (from 1)
          StdHash: 0xfc0a5d5d3a4a5d06 (from 2)
ComputeBitmapHash: 0xacf1b1adcf3c8bc5
          StdHash: 0xacf1b1adcf3c8bc5 (from 0)
          StdHash: 0xacf1b1adcf3c8bc7 (from 2)
        ArrayHash: 0xacf1b1adcf3c8bc7
        ArrayHash: 0xacf1b1adcf3c8bc7

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I don't think length is useful. I'm not fond of having MurmurHash in our codebase, but I understand the argument of it being easy to inline into the loop (XXH64 does have streaming APIs, but they would probably require a bit more code).

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK. I'm removing the length parameter now.

Comment thread cpp/src/arrow/util/hashing.cc Outdated
h ^= (k); \
h *= m

// Shift key pointer by as many words as possible.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You don't need to do all this manually. BitmapWordReader or BitmapUInt64Reader can automate most of this for you. Another possibility is VisitBitBlocksVoid...

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I considered it, but it's not the same. For hashing, when the input is not word-aligned, I have to load a partial word from current word and next before I pass it into the hash mixing for EVERY block. This is what makes the hash depends solely on the bit values. To be able to use the bitmap readers the hash algorithm would have to be a Rolling Hash at the bit level which I suspect would be expensive to compute.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You seem to be misunderstanding what those utilities do? You could take a look at how they are used, for example:

if (bit_offset || dest_bit_offset) {
auto reader = internal::BitmapWordReader<uint64_t>(data, offset, length);
auto writer = internal::BitmapWordWriter<uint64_t>(dest, dest_offset, length);
auto nwords = reader.words();
while (nwords--) {
auto word = reader.NextWord();
writer.PutNextWord(mode == TransferMode::Invert ? ~word : word);
}
auto nbytes = reader.trailing_bytes();
while (nbytes--) {
int valid_bits;
auto byte = reader.NextTrailingByte(valid_bits);
writer.PutNextTrailingByte(mode == TransferMode::Invert ? ~byte : byte, valid_bits);
}

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alright. Now I see what you mean.

Comment thread cpp/src/arrow/util/hashing_test.cc Outdated
Comment thread cpp/src/arrow/util/hashing_test.cc Outdated
}

TEST(TestBitmapHash, Empty) {
BooleanBuilder builder;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test looks a bit complicated. Would it simplify to:

  • generate random boolean arrays
  • slice them and check hashing the slice vs. hashing an aligned copy of the buffer?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When one of these tests fail, it's easier to see which part of the function is not handling input correctly. Random arrays might not necessarily cover all the range checks.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And the 2,3,5... block was easy to spot in the debugger to see what was going on by printing the words in binary.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As you prefer, though it would be nice to find a way to make the test more compact and easier to read.

Comment thread python/pyarrow/tests/test_scalars.py
Comment thread cpp/src/arrow/scalar.cc Outdated
@github-actions github-actions Bot added awaiting committer review Awaiting committer review and removed awaiting review Awaiting review labels May 30, 2023
Comment thread cpp/src/arrow/util/hashing_test.cc Outdated
felipecrv added 2 commits May 30, 2023 19:29
 - Simplify the tests by dropping many cases
 - Add tests on the C++ side as well
@felipecrv felipecrv requested a review from pitrou May 31, 2023 12:49
Comment thread cpp/src/arrow/scalar_test.cc
Comment thread cpp/src/arrow/util/hashing_test.cc Outdated
}
const auto hash_of_block = HashDataBitmap(*block_of_bools->data());

const auto kStep = 9;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like simply bumping this to 13 makes the test 2x faster (which will help on instrumentation-heavy builds such as ASAN or Valgrind).

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a cubic speed-up. Bumping it to 13.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

felipeo@thinkpad: ~/code/arrow/cpp/ninja (hash_scalar_fix $%>)$ ninja arrow-utility-test && ./**/arrow-utility-test --gtest_break_on_failure --gtest_filter="*BitmapHash*"
[15/15] Linking CXX executable debug/arrow-utility-test
Note: Google Test filter = *BitmapHash*
[==========] Running 2 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 2 tests from BitmapHashTest
[ RUN      ] BitmapHashTest.SmallInputs
[       OK ] BitmapHashTest.SmallInputs (107 ms)
[ RUN      ] BitmapHashTest.LongerInputs
[       OK ] BitmapHashTest.LongerInputs (121 ms)
[----------] 2 tests from BitmapHashTest (228 ms total)

[----------] Global test environment tear-down
[==========] 2 tests from 1 test suite ran. (228 ms total)
[  PASSED  ] 2 tests.

@felipecrv felipecrv requested a review from pitrou May 31, 2023 17:37
@pitrou
Copy link
Copy Markdown
Member

pitrou commented May 31, 2023

Hmm, there's a bunch of CI failures which look related.

@felipecrv
Copy link
Copy Markdown
Contributor Author

felipecrv commented Jun 1, 2023

Hmm, there's a bunch of CI failures which look related.

@pitrou it was a bad index in the hash selectivity test code. It's now fixed and CI is green.

@pitrou pitrou changed the title GH-35360: Take offset into account in ScalarHashImpl::ArrayHash() GH-35360: [C++] Take offset into account in ScalarHashImpl::ArrayHash() Jun 1, 2023
@pitrou pitrou merged commit 3299d12 into apache:main Jun 1, 2023
@felipecrv felipecrv deleted the hash_scalar_fix branch June 1, 2023 16:44
@ursabot
Copy link
Copy Markdown

ursabot commented Jun 2, 2023

Benchmark runs are scheduled for baseline = d20a1d1 and contender = 3299d12. 3299d12 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Finished ⬇️0.74% ⬆️0.09%] test-mac-arm
[Finished ⬇️1.31% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️0.57% ⬆️0.0%] ursa-thinkcentre-m75q
Buildkite builds:
[Finished] 3299d12e ec2-t3-xlarge-us-east-2
[Finished] 3299d12e test-mac-arm
[Finished] 3299d12e ursa-i9-9960x
[Finished] 3299d12e ursa-thinkcentre-m75q
[Finished] d20a1d1e ec2-t3-xlarge-us-east-2
[Finished] d20a1d1e test-mac-arm
[Finished] d20a1d1e ursa-i9-9960x
[Finished] d20a1d1e ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[C++] Incorrect hash value for Scalars with sliced child data (ignores offset)

3 participants