GH-47112: [Parquet][C++] Rle BitPacked parser (#47294)
Conversation
Work in progress: this is currently a split of the decoder into a parser and a decoder, but it is not plugged in yet.
Some benchmarks on a Linux x86_64 cloud instance with 8 CPUs, 16 GB of memory, and dependencies/compilers from Conda-Forge.
@pitrou this is ready for review
```cpp
while ((v & 0xFFFFFFFFFFFFFF80ULL) != 0ULL) {
  result &= PutAligned<uint8_t>(static_cast<uint8_t>((v & 0x7F) | 0x80), 1);
  v >>= 7;

constexpr auto kMaxBytes = bit_util::MaxLEB128ByteLenFor<decltype(v)>;
```

Suggested change:
```cpp
constexpr auto kMaxBytes = kMaxVlqByteLengthForInt64;
```
That's a member of the reader that was kept for compatibility.
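For context, the write loop quoted above is the standard LEB128/VLQ encoding. A minimal standalone sketch (the helper name is hypothetical, not the Arrow API):

```cpp
#include <cstddef>
#include <cstdint>

// Minimal LEB128 (VLQ) encoder sketch for uint64_t, mirroring the write loop
// quoted above: emit 7 payload bits per byte, with the continuation bit
// (0x80) set while higher bits remain. Helper name is illustrative only.
inline size_t EncodeLEB128U64(uint64_t v, uint8_t* out) {
  size_t n = 0;
  while ((v & 0xFFFFFFFFFFFFFF80ULL) != 0ULL) {
    out[n++] = static_cast<uint8_t>((v & 0x7F) | 0x80);
    v >>= 7;
  }
  out[n++] = static_cast<uint8_t>(v);  // last byte: continuation bit clear
  return n;
}
```

A uint64_t encodes to at most 10 bytes (ceil(64/7)), which is what a constant like `kMaxVlqByteLengthForInt64` denotes.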
```cpp
// Need to check if there are bits that would overflow the output.
// Also checks that there is no continuation.
if (ARROW_PREDICT_FALSE((byte & kHighForbiddenMask) != 0)) {
```
In my opinion, due to the right shift of uint64_t, the high bit remains 0 when the original BitWriter::PutVlqInt(uint64_t v) writes the last byte. In that case, should we treat the last byte specially in WriteLEB128 and ParseLeadingLEB128?
Take a uint32: that's four bytes, and it can be written in up to five bytes in LEB128.
But not all five-byte LEB128 sequences fit in a uint32.
On the last byte, a possible 01111111 would be shifted by 4*7=28, overflowing some of the bits.
I added a test for that.
Writing does not have this issue.
There is a fuzz regression failure which should probably be addressed @AntoinePrv: https://github.com/apache/arrow/actions/runs/17266367915/job/48999342089?pr=47294#step:7:6933
I ran the benchmarks locally with an AMD Zen 2 CPU on Ubuntu 24.04: there are a couple of significant speedups and a couple of significant regressions, but the overall picture is rather reassuring: this refactoring is globally neutral performance-wise.
pitrou left a comment:
This is a partial code review, I'll finish tomorrow.
```cpp
out += read;

// Stop reading and store remaining decoder
if (ARROW_PREDICT_FALSE(values_read == batch_size || read == 0)) {
```
Not sure about ARROW_PREDICT_FALSE here either. Its probability might depend on the RLE stream and on batching patterns in the upper layers.
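For readers unfamiliar with the macro: ARROW_PREDICT_FALSE is a branch-prediction hint, and the concern above is that such a hint only helps when the branch really is rarely taken. A minimal sketch of how such a macro is commonly defined (this mirrors the idea, not Arrow's definition verbatim):

```cpp
// Sketch of a "predict false" macro as commonly implemented with GCC/Clang's
// __builtin_expect; on other compilers it degrades to a plain boolean test.
// The macro name is ours; Arrow defines its own in its utility headers.
#if defined(__GNUC__) || defined(__clang__)
#  define MY_PREDICT_FALSE(x) (__builtin_expect(!!(x), 0))
#else
#  define MY_PREDICT_FALSE(x) (!!(x))
#endif
```

The hint does not change the condition's value, only the code layout the compiler chooses for the two branches.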
pitrou left a comment:
Sorry, we've been a bit carried away. This is looking excellent now. I've rebased on git main to get up-to-date CI results.
After merging your PR, Conbench analyzed the 4 benchmarking runs that have been run so far on merge-commit 79f9764. There were no benchmark performance regressions. 🎉 The full Conbench report has more details. It also includes information about 5 possible false positives for unstable benchmarks that are known to sometimes produce them.
### Rationale for this change

### What changes are included in this PR?

New independent abstractions:
- A `BitPackedRun` to describe the encoded bytes in a bit-packed run.
- A minimal `BitPackedDecoder` that can decode this type of run (no dict/spaced methods).
- An `RleRun` to describe the encoded value in an RLE run.
- A minimal `RleDecoder` that can decode this type of run (no dict/spaced methods).
- A `RleBitPackedParser` that reads the encoded headers and emits the different runs.

These new abstractions are then plugged into `RleBitPackedDecoder` (formerly `RleDecoder`) to keep compatibility with the rest of Arrow (improvements to using the parser independently can come in a follow-up PR).

Misc changes:
- Separation of LEB128 reading/writing from `BitReader` into free functions, and an added check for a special case to handle undefined-behavior overflow.

### Are these changes tested?

Yes, on top of the existing tests, many more unit tests have been added.

### Are there any user-facing changes?

API changes to internal classes.

* GitHub Issue: apache#47112

Authored-by: AntoinePrv <AntoinePrv@users.noreply.github.com>
Signed-off-by: Antoine Pitrou <antoine@python.org>
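For context on what the parser emits: in the Parquet RLE/bit-packed hybrid encoding, each run starts with a LEB128-encoded header whose low bit selects the run type. A conceptual sketch of header parsing (struct and function names are illustrative, not the new Arrow API):

```cpp
#include <cstdint>

// Parquet RLE/bit-packed hybrid run header, per the Parquet format spec:
// the low bit selects the run type, the remaining bits give the length.
struct RunHeader {
  bool bit_packed;  // true: bit-packed run, false: RLE run
  uint32_t count;   // RLE: number of repeats; bit-packed: number of 8-value groups
};

inline RunHeader ParseRunHeader(uint32_t header) {
  return RunHeader{(header & 1U) != 0, header >> 1};
}
```

Splitting this header parsing (the `RleBitPackedParser` role) from the per-run decoding (the `RleDecoder`/`BitPackedDecoder` roles) is what allows the two concerns to be tested and reused independently.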
We're experiencing a performance regression with the Parquet reader in our internal program. We are still locating the regression and suspect it's related to this PR. I checked the full Conbench report, which mentions several unstable performance issues, all seemingly related to the changes in this PR. However, the URLs in the report are no longer accessible.
@HuaHuaY it is possible this PR (and some follow-up changes found by fuzzing) introduces light regressions in some cases, but it should not be by much (let me know what you find).
We tested using a Parquet file with data generated from SSB Flat. We found that the fewer the distinct values, the larger the performance regression.
Both binaries are compiled in release mode.

```
❯ env LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$LLVM_ROOT/lib/x86_64-unknown-linux-gnu ARROW_RUNTIME_SIMD_LEVEL=AVX2 hyperfine -w 5 -r 20 --sort mean-time "cpp/out/build/ninja-release/release/parquet-scan-main --columns=6 /dev/shm/TableSink0" "cpp/out/build/ninja-release/release/parquet-scan-old --columns=6 /dev/shm/TableSink0"
Benchmark 1: cpp/out/build/ninja-release/release/parquet-scan-main --columns=6 /dev/shm/TableSink0
  Time (mean ± σ):     31.1 ms ±  0.3 ms    [User: 27.2 ms, System: 3.7 ms]
  Range (min … max):   30.7 ms … 31.8 ms    20 runs

Benchmark 2: cpp/out/build/ninja-release/release/parquet-scan-old --columns=6 /dev/shm/TableSink0
  Time (mean ± σ):     22.8 ms ±  0.4 ms    [User: 19.3 ms, System: 3.3 ms]
  Range (min … max):   22.2 ms … 23.3 ms    20 runs

Summary
  cpp/out/build/ninja-release/release/parquet-scan-old --columns=6 /dev/shm/TableSink0 ran
    1.37 ± 0.02 times faster than cpp/out/build/ninja-release/release/parquet-scan-main --columns=6 /dev/shm/TableSink0
```
```
❯ env LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$LLVM_ROOT/lib/x86_64-unknown-linux-gnu cpp/out/build/ninja-release/release/parquet-reader --only-metadata /dev/shm/TableSink0
File Name: /dev/shm/TableSink0
Version: 2.6
Created By: cz-cpp version BuildInfo:GitBranch:release/20240820_rc8,GitVersion:73a4383,BuildTime:1725298762,CloudEnv:ALIYUN
Total rows: 6250733
Number of RowGroups: 1
Number of Real Columns: 40
Number of Columns: 40
Number of Selected Columns: 40
......
Column 6: lo_orderpriority (BYTE_ARRAY / String / UTF8)
......
--- Row Group: 0 ---
--- Total Bytes: 1016215218 ---
--- Total Compressed Bytes: 552911018 ---
--- Sort Columns:
column_idx: 5, descending: 0, nulls_first: 1
column_idx: 0, descending: 0, nulls_first: 1
--- Rows: 6250733 ---
......
Column 6
  Values: 6250733, Null Values: 0, Distinct Values: 5
  Max (exact: unknown): 5-LOW, Min (exact: unknown): 1-URGENT
  Compression: LZ4_RAW, Encodings: PLAIN(DICT_PAGE) RLE_DICTIONARY
  Uncompressed Size: 2267694, Compressed Size: 2092132
......
```
Update from @HuaHuaY's comment: we just located it; the performance regression comes from the API change of the internal
This should not be the case; where do you see it being left to
That's in our internal codebase where the
I tested with arrow's