GH-35360: [C++] Take offset into account in ScalarHashImpl::ArrayHash() #35814
pitrou merged 15 commits into apache:main
Conversation
As a basis for what is going to be a bitmap hashing function.
pitrou
left a comment
Thanks @felipecrv . I posted some comments below.
/// \param seed The seed for the hash function (useful when chaining hash functions).
/// \param bits_offset The offset in bits relative to the start of the bitmap.
/// \param num_bits The number of bits after the offset to be hashed.
uint64_t MurmurHashBitmap64A(const uint8_t* key, uint64_t seed, uint64_t bits_offset,
We already vendor XXHash64, why add MurmurHash as well?
This is MurmurHash modified to take bits starting on non-byte boundaries.
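To illustrate the point about non-byte boundaries, here is a minimal sketch (not the PR's actual code; `LoadWordAtBitOffset` is a hypothetical helper): whenever the bit offset is not byte-aligned, each 64-bit word fed to the mixer has to be stitched together from two adjacent loads, so the hash depends only on the bit values and not on their alignment.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical helper (not the actual Arrow implementation): load the i-th
// 64-bit word of a bitmap that starts `bits_offset` bits into `data`.
// Assumes a little-endian host and, when shift != 0, that 8 extra bytes
// past the last full word are readable.
uint64_t LoadWordAtBitOffset(const uint8_t* data, uint64_t bits_offset, size_t i) {
  const uint8_t* base = data + bits_offset / 8;
  const unsigned shift = static_cast<unsigned>(bits_offset % 8);
  uint64_t lo, hi;
  std::memcpy(&lo, base + 8 * i, 8);
  if (shift == 0) return lo;
  std::memcpy(&hi, base + 8 * i + 8, 8);
  // Stitch the word from the low part of the current load and the
  // high part of the next one (bitmaps are LSB-first).
  return (lo >> shift) | (hi << (64 - shift));
}
```

With an aligned offset this degenerates to a single load, which is why the byte-aligned fast path stays as cheap as plain MurmurHash.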
/// \param seed The seed for the hash function (useful when chaining hash functions).
/// \param bits_offset The offset in bits relative to the start of the bitmap.
/// \param num_bits The number of bits after the offset to be hashed.
ARROW_EXPORT hash_t ComputeBitmapHash(const uint8_t* bitmap, int64_t length, hash_t seed,
Why pass length in addition to num_bits?
Also, no need to pass a seed since we can easily combine separate hash values.
The length in bytes allows me to DCHECK that it's safe to read that number of bits within that range of bytes. I can remove it if you don't think this check is worth it.
Also, no need to pass a seed since we can easily combine separate hash values.
This way of combining hashes is part of the MurmurHash design. It is inspired by how block ciphers are chained to guarantee that a single bit change early in the input has a large effect on the output (the avalanche effect).
The xor we are using to combine hashes in our hash functions modifies only the least-significant bits (biasing them), probably making the hash output worse as a key for hash tables.
Type::Hash: 0xfc0a5d5d3a4a5d05
StdHash: 0xfc0a5d5d3a4a5d04 (from 1)
StdHash: 0xfc0a5d5d3a4a5d06 (from 2)
ComputeBitmapHash: 0xacf1b1adcf3c8bc5
StdHash: 0xacf1b1adcf3c8bc5 (from 0)
StdHash: 0xacf1b1adcf3c8bc7 (from 2)
ArrayHash: 0xacf1b1adcf3c8bc7
ArrayHash: 0xacf1b1adcf3c8bc7
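A toy comparison makes the bias concrete. This is illustrative only: `Mix64` is a stand-in finalizer in the spirit of MurmurHash3's fmix64, not code from this PR. Combining by plain xor propagates a 1-bit input change to exactly one output bit, while chaining through a mixer diffuses it across the whole word.

```cpp
#include <cstdint>

// Stand-in 64-bit finalizer (the fmix64 constants from MurmurHash3),
// used only to illustrate diffusion; not the code in this PR.
uint64_t Mix64(uint64_t h) {
  h ^= h >> 33;
  h *= 0xff51afd7ed558ccdULL;
  h ^= h >> 33;
  h *= 0xc4ceb9fe1a85ec53ULL;
  h ^= h >> 33;
  return h;
}

// Chained combination: the second value is mixed into the running state,
// so a small change in `b` flips roughly half of the output bits.
uint64_t CombineChained(uint64_t a, uint64_t b) { return Mix64(a ^ Mix64(b)); }

// Plain xor combination: a 1-bit change in `b` flips exactly one output bit.
uint64_t CombineXor(uint64_t a, uint64_t b) { return a ^ b; }
```

Comparing `Combine*(a, 1)` against `Combine*(a, 2)` shows the difference: the xor variant differs in just the two low bits, while the chained variant differs in many bits spread across the word.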
Yes, I don't think length is useful. I'm not fond of having MurmurHash in our codebase, but I understand the argument of it being easy to inline into the loop (XXH64 does have streaming APIs, but they would probably require a bit more code).
OK. I'm removing the length parameter now.
h ^= (k); \
h *= m

// Shift key pointer by as many words as possible.
You don't need to do all this manually. BitmapWordReader or BitmapUInt64Reader can automate most of this for you. Another possibility is VisitBitBlocksVoid...
I considered it, but it's not the same. For hashing, when the input is not word-aligned, I have to build a partial word from the current word and the next before passing it into the hash mixing for EVERY block. This is what makes the hash depend solely on the bit values. To be able to use the bitmap readers, the hash algorithm would have to be a rolling hash at the bit level, which I suspect would be expensive to compute.
You seem to be misunderstanding what those utilities do? You could take a look at how they are used, for example:
arrow/cpp/src/arrow/util/bitmap_ops.cc
Lines 127 to 141 in 431785f
Alright. Now I see what you mean.
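For readers unfamiliar with those utilities, here is a minimal stand-in showing the kind of interface being referenced. The name and signature only approximate Arrow's BitmapUInt64Reader and are not the real API: the point is that the reader hides the cross-word shifting behind NextWord(), which keeps the caller's loop simple.

```cpp
#include <cstdint>

// Minimal, deliberately naive stand-in for a word-at-a-time bitmap reader
// (interface approximated, not Arrow's actual class).
class WordReader {
 public:
  WordReader(const uint8_t* bitmap, uint64_t start_bit, uint64_t num_bits)
      : bitmap_(bitmap), pos_(start_bit), end_(start_bit + num_bits) {}

  // Returns the next up-to-64 valid bits, LSB-first; sets *valid_bits to
  // how many of the returned bits are meaningful (0 when exhausted).
  uint64_t NextWord(int* valid_bits) {
    uint64_t remaining = end_ - pos_;
    int n = static_cast<int>(remaining < 64 ? remaining : 64);
    uint64_t word = 0;
    for (int i = 0; i < n; ++i, ++pos_) {
      uint64_t bit = (bitmap_[pos_ / 8] >> (pos_ % 8)) & 1;
      word |= bit << i;
    }
    *valid_bits = n;
    return word;
  }

 private:
  const uint8_t* bitmap_;
  uint64_t pos_, end_;
};
```

A hashing loop on top of such a reader just calls NextWord() and feeds each returned word (plus its valid-bit count for the trailing partial word) into the mixer.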
}

TEST(TestBitmapHash, Empty) {
  BooleanBuilder builder;
This test looks a bit complicated. Would it simplify to:
- generate random boolean arrays
- slice them and check hashing the slice vs. hashing an aligned copy of the buffer?
When one of these tests fails, it's easier to see which part of the function is not handling the input correctly. Random arrays would not necessarily cover all the range checks.
And the 2, 3, 5... block sizes were easy to spot in the debugger: printing the words in binary showed what was going on.
As you prefer, though it would be nice to find a way to make the test more compact and easier to read.
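The suggested property test could be sketched as follows. Both helpers are deliberately naive stand-ins (`HashBits` is not Arrow's ComputeBitmapHash): hashing a slice at an arbitrary bit offset must equal hashing an aligned copy of the same bits.

```cpp
#include <cstdint>

// Read bit i of an LSB-first bitmap.
inline bool GetBit(const uint8_t* data, uint64_t i) {
  return (data[i / 8] >> (i % 8)) & 1;
}

// Toy bit-level hash: chains each bit through a multiplicative step so the
// result depends only on the bit values, never on their byte alignment.
uint64_t HashBits(const uint8_t* data, uint64_t bit_offset, uint64_t num_bits) {
  uint64_t h = 0x9e3779b97f4a7c15ULL;  // arbitrary seed
  for (uint64_t i = 0; i < num_bits; ++i) {
    h = (h ^ GetBit(data, bit_offset + i)) * 0x100000001b3ULL;  // FNV-style step
  }
  return h;
}

// Copy num_bits bits starting at bit_offset into dst at offset 0.
// dst must be zero-initialized by the caller.
void CopyBitsToAligned(const uint8_t* src, uint64_t bit_offset,
                       uint64_t num_bits, uint8_t* dst) {
  for (uint64_t i = 0; i < num_bits; ++i) {
    if (GetBit(src, bit_offset + i)) dst[i / 8] |= (uint8_t)(1u << (i % 8));
  }
}
```

The property check then loops over offsets, slices, copies to an aligned buffer, and asserts the two hashes agree.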
- Simplify the tests by dropping many cases
- Add tests on the C++ side as well
}
const auto hash_of_block = HashDataBitmap(*block_of_bools->data());

const auto kStep = 9;
Looks like simply bumping this to 13 makes the test 2x faster (which will help on instrumentation-heavy builds such as ASAN or Valgrind).
It's a cubic speed-up, since the step controls three nested loop dimensions: going from 9 to 13 cuts the iteration count by roughly (13/9)^3 ≈ 3x. Bumping it to 13.
felipeo@thinkpad: ~/code/arrow/cpp/ninja (hash_scalar_fix $%>)$ ninja arrow-utility-test && ./**/arrow-utility-test --gtest_break_on_failure --gtest_filter="*BitmapHash*"
[15/15] Linking CXX executable debug/arrow-utility-test
Note: Google Test filter = *BitmapHash*
[==========] Running 2 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 2 tests from BitmapHashTest
[ RUN ] BitmapHashTest.SmallInputs
[ OK ] BitmapHashTest.SmallInputs (107 ms)
[ RUN ] BitmapHashTest.LongerInputs
[ OK ] BitmapHashTest.LongerInputs (121 ms)
[----------] 2 tests from BitmapHashTest (228 ms total)
[----------] Global test environment tear-down
[==========] 2 tests from 1 test suite ran. (228 ms total)
[ PASSED ] 2 tests.
Co-authored-by: Antoine Pitrou <pitrou@free.fr>
Hmm, there's a bunch of CI failures which look related.
@pitrou it was a bad index in the hash selectivity test code. It's now fixed and CI is green.
Benchmark runs are scheduled for baseline = d20a1d1 and contender = 3299d12. 3299d12 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Rationale for this change
A fix for #35360.
What changes are included in this PR?
Are these changes tested?
Yes. By unit tests.
Are there any user-facing changes?
No.