GH-35360: [C++] Take offset into account in ScalarHashImpl::ArrayHash() #35814
pitrou merged 15 commits into apache:main
Conversation
As a basis for what is going to be a bitmap hashing function.
pitrou
left a comment
Thanks @felipecrv . I posted some comments below.
/// \param seed The seed for the hash function (useful when chaining hash functions).
/// \param bits_offset The offset in bits relative to the start of the bitmap.
/// \param num_bits The number of bits after the offset to be hashed.
uint64_t MurmurHashBitmap64A(const uint8_t* key, uint64_t seed, uint64_t bits_offset,
We already vendor XXHash64, why add MurmurHash as well?
This is MurmurHash modified to take bits starting on non-byte boundaries.
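To illustrate the point about non-byte boundaries, here is a minimal sketch (not the PR's actual code; `LoadWordAtBitOffset` is a hypothetical helper): whenever the bit offset is not byte-aligned, each 64-bit word fed to the mixer has to be stitched together from two adjacent loads, so the hash depends only on the bit values and not on their alignment.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical helper (not the actual Arrow implementation): load the i-th
// 64-bit word of a bitmap that starts `bits_offset` bits into `data`.
// Assumes a little-endian host and, when shift != 0, that 8 extra bytes
// past the last full word are readable.
uint64_t LoadWordAtBitOffset(const uint8_t* data, uint64_t bits_offset, size_t i) {
  const uint8_t* base = data + bits_offset / 8;
  const unsigned shift = static_cast<unsigned>(bits_offset % 8);
  uint64_t lo, hi;
  std::memcpy(&lo, base + 8 * i, 8);
  if (shift == 0) return lo;
  std::memcpy(&hi, base + 8 * i + 8, 8);
  // Stitch the word from the low part of the current load and the
  // high part of the next one (bitmaps are LSB-first).
  return (lo >> shift) | (hi << (64 - shift));
}
```

With an aligned offset this degenerates to a single load, which is why the byte-aligned fast path stays as cheap as plain MurmurHash.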
/// \param seed The seed for the hash function (useful when chaining hash functions).
/// \param bits_offset The offset in bits relative to the start of the bitmap.
/// \param num_bits The number of bits after the offset to be hashed.
ARROW_EXPORT hash_t ComputeBitmapHash(const uint8_t* bitmap, int64_t length, hash_t seed,
Why pass length in addition to num_bits?
Also, no need to pass a seed since we can easily combine separate hash values.
The length in bytes allows me to DCHECK that it's safe to read that number of bits within that range of bytes. I can remove it if you don't think this check is worth it.
Also, no need to pass a seed since we can easily combine separate hash values.
This way of combining hashes is part of the MurmurHash design. It is inspired by how block ciphers are chained to guarantee that a single bit change early in the input has a large effect on the output (the avalanche effect).
The xor we are using to combine hashes in our hash functions modifies only the least-significant bits (biasing them), probably making the hash output worse as a key for hash tables.
Type::Hash: 0xfc0a5d5d3a4a5d05
StdHash: 0xfc0a5d5d3a4a5d04 (from 1)
StdHash: 0xfc0a5d5d3a4a5d06 (from 2)
ComputeBitmapHash: 0xacf1b1adcf3c8bc5
StdHash: 0xacf1b1adcf3c8bc5 (from 0)
StdHash: 0xacf1b1adcf3c8bc7 (from 2)
ArrayHash: 0xacf1b1adcf3c8bc7
ArrayHash: 0xacf1b1adcf3c8bc7
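A toy comparison makes the bias concrete. This is illustrative only: `Mix64` is a stand-in finalizer in the spirit of MurmurHash3's fmix64, not code from this PR. Combining by plain xor propagates a 1-bit input change to exactly one output bit, while chaining through a mixer diffuses it across the whole word.

```cpp
#include <cstdint>

// Stand-in 64-bit finalizer (the fmix64 constants from MurmurHash3),
// used only to illustrate diffusion; not the code in this PR.
uint64_t Mix64(uint64_t h) {
  h ^= h >> 33;
  h *= 0xff51afd7ed558ccdULL;
  h ^= h >> 33;
  h *= 0xc4ceb9fe1a85ec53ULL;
  h ^= h >> 33;
  return h;
}

// Chained combination: the second value is mixed into the running state,
// so a small change in `b` flips roughly half of the output bits.
uint64_t CombineChained(uint64_t a, uint64_t b) { return Mix64(a ^ Mix64(b)); }

// Plain xor combination: a 1-bit change in `b` flips exactly one output bit.
uint64_t CombineXor(uint64_t a, uint64_t b) { return a ^ b; }
```

Comparing `Combine*(a, 1)` against `Combine*(a, 2)` shows the difference: the xor variant differs in just the two low bits, while the chained variant differs in many bits spread across the word.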
Yes, I don't think length is useful. I'm not fond of having MurmurHash in our codebase, but I understand the argument of it being easy to inline into the loop (XXH64 does have streaming APIs, but they would probably require a bit more code).
OK. I'm removing the length parameter now.
h ^= (k); \
h *= m

// Shift key pointer by as many words as possible.
You don't need to do all this manually. BitmapWordReader or BitmapUInt64Reader can automate most of this for you. Another possibility is VisitBitBlocksVoid...
I considered it, but it's not the same. For hashing, when the input is not word-aligned, I have to build a partial word from the current word and the next before passing it into the hash mixing for EVERY block. This is what makes the hash depend solely on the bit values. To be able to use the bitmap readers, the hash algorithm would have to be a rolling hash at the bit level, which I suspect would be expensive to compute.
You seem to be misunderstanding what those utilities do? You could take a look at how they are used, for example:
arrow/cpp/src/arrow/util/bitmap_ops.cc
Lines 127 to 141 in 431785f
Alright. Now I see what you mean.
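For readers unfamiliar with those utilities, here is a minimal stand-in showing the kind of interface being referenced. The name and signature only approximate Arrow's BitmapUInt64Reader and are not the real API: the point is that the reader hides the cross-word shifting behind NextWord(), which keeps the caller's loop simple.

```cpp
#include <cstdint>

// Minimal, deliberately naive stand-in for a word-at-a-time bitmap reader
// (interface approximated, not Arrow's actual class).
class WordReader {
 public:
  WordReader(const uint8_t* bitmap, uint64_t start_bit, uint64_t num_bits)
      : bitmap_(bitmap), pos_(start_bit), end_(start_bit + num_bits) {}

  // Returns the next up-to-64 valid bits, LSB-first; sets *valid_bits to
  // how many of the returned bits are meaningful (0 when exhausted).
  uint64_t NextWord(int* valid_bits) {
    uint64_t remaining = end_ - pos_;
    int n = static_cast<int>(remaining < 64 ? remaining : 64);
    uint64_t word = 0;
    for (int i = 0; i < n; ++i, ++pos_) {
      uint64_t bit = (bitmap_[pos_ / 8] >> (pos_ % 8)) & 1;
      word |= bit << i;
    }
    *valid_bits = n;
    return word;
  }

 private:
  const uint8_t* bitmap_;
  uint64_t pos_, end_;
};
```

A hashing loop on top of such a reader just calls NextWord() and feeds each returned word (plus its valid-bit count for the trailing partial word) into the mixer.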
}

TEST(TestBitmapHash, Empty) {
  BooleanBuilder builder;
This test looks a bit complicated. Would it simplify to:
- generate random boolean arrays
- slice them and check hashing the slice vs. hashing an aligned copy of the buffer?
When one of these tests fails, it's easier to see which part of the function is not handling the input correctly. Random arrays would not necessarily cover all the range checks.
And the 2, 3, 5... block sizes were easy to spot in the debugger: printing the words in binary showed what was going on.
As you prefer, though it would be nice to find a way to make the test more compact and easier to read.
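The suggested property test could be sketched as follows. Both helpers are deliberately naive stand-ins (`HashBits` is not Arrow's ComputeBitmapHash): hashing a slice at an arbitrary bit offset must equal hashing an aligned copy of the same bits.

```cpp
#include <cstdint>

// Read bit i of an LSB-first bitmap.
inline bool GetBit(const uint8_t* data, uint64_t i) {
  return (data[i / 8] >> (i % 8)) & 1;
}

// Toy bit-level hash: chains each bit through a multiplicative step so the
// result depends only on the bit values, never on their byte alignment.
uint64_t HashBits(const uint8_t* data, uint64_t bit_offset, uint64_t num_bits) {
  uint64_t h = 0x9e3779b97f4a7c15ULL;  // arbitrary seed
  for (uint64_t i = 0; i < num_bits; ++i) {
    h = (h ^ GetBit(data, bit_offset + i)) * 0x100000001b3ULL;  // FNV-style step
  }
  return h;
}

// Copy num_bits bits starting at bit_offset into dst at offset 0.
// dst must be zero-initialized by the caller.
void CopyBitsToAligned(const uint8_t* src, uint64_t bit_offset,
                       uint64_t num_bits, uint8_t* dst) {
  for (uint64_t i = 0; i < num_bits; ++i) {
    if (GetBit(src, bit_offset + i)) dst[i / 8] |= (uint8_t)(1u << (i % 8));
  }
}
```

The property check then loops over offsets, slices, copies to an aligned buffer, and asserts the two hashes agree.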
- Simplify the tests by dropping many cases
- Add tests on the C++ side as well
}
const auto hash_of_block = HashDataBitmap(*block_of_bools->data());

const auto kStep = 9;
Looks like simply bumping this to 13 makes the test 2x faster (which will help on instrumentation-heavy builds such as ASAN or Valgrind).
It's a cubic speed-up, since the step controls three nested loop dimensions: going from 9 to 13 cuts the iteration count by roughly (13/9)^3 ≈ 3x. Bumping it to 13.
felipeo@thinkpad: ~/code/arrow/cpp/ninja (hash_scalar_fix $%>)$ ninja arrow-utility-test && ./**/arrow-utility-test --gtest_break_on_failure --gtest_filter="*BitmapHash*"
[15/15] Linking CXX executable debug/arrow-utility-test
Note: Google Test filter = *BitmapHash*
[==========] Running 2 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 2 tests from BitmapHashTest
[ RUN ] BitmapHashTest.SmallInputs
[ OK ] BitmapHashTest.SmallInputs (107 ms)
[ RUN ] BitmapHashTest.LongerInputs
[ OK ] BitmapHashTest.LongerInputs (121 ms)
[----------] 2 tests from BitmapHashTest (228 ms total)
[----------] Global test environment tear-down
[==========] 2 tests from 1 test suite ran. (228 ms total)
[ PASSED ] 2 tests.
Co-authored-by: Antoine Pitrou <pitrou@free.fr>
Hmm, there's a bunch of CI failures which look related.
@pitrou it was a bad index in the hash selectivity test code. It's now fixed and CI is green.
Benchmark runs are scheduled for baseline = d20a1d1 and contender = 3299d12. 3299d12 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Rationale for this change
A fix for #35360.
What changes are included in this PR?
Are these changes tested?
Yes. By unit tests.
Are there any user-facing changes?
No.