[C++][Parquet] Refactor RLE decoding by extract a RLE parser

### Describe the enhancement requested

The Parquet [Bit-packed RLE encoding](https://github.com/apache/parquet-format/blob/master/Encodings.md#run-length-encoding--bit-packing-hybrid-rle--3) is used in two performance-critical situations:
1. definition and repetition levels, which are present on many or even most columns (when writing nullable Arrow data, for example, definition levels are always generated even if there are no nulls)
2. dictionary indices for [Dictionary encoding](https://github.com/apache/parquet-format/blob/master/Encodings.md#dictionary-encoding-plain_dictionary--2-and-rle_dictionary--8)

Both these areas lack performance currently, and it has [been difficult to optimize](github.com/apache/arrow/issues/40845).
The RLE decoder is currently spread over two facilities: the [RleDecoder](https://github.com/apache/arrow/blob/460528956fd146a357bcb1afbed345c2ab9312ba/cpp/src/arrow/util/rle_encoding_internal.h#L87-L88) does the actual decoding to a linear array of integers, while the [BitReader](https://github.com/apache/arrow/blob/460528956fd146a357bcb1afbed345c2ab9312ba/cpp/src/arrow/util/bit_stream_utils_internal.h#L125-L128) hosts some low-level reading routines.

This issue is to refactor the `RleDecoder` into a parser+decoder. The parser would be a pure event-driven facility to decompose a RLE stream into its individual tokens (. The parser does not need to use the `BitReader` class, it can just reimplement the corresponding operations.

(an inspiration for event-driven parser APIs can be found in [RapidJSON](https://rapidjson.org/md_doc_sax.html), though the RLE parser would be much simpler as there are only two different kinds of tokens in the RLE stream: literal runs and repeated runs)

This refactoring will probably not bring any performance benefits in itself (though who knows?). However, once we have a RLE parser, we can then avoid the decoder in some situations.

For example, a very common situation is OPTIONAL columns in Parquet, where definition levels can have only two values: 0 and 1. Currently, we decode such levels into 16-bit integers and then re-encode them into a bitmap. This is wasteful and pointless, while we could generate the bitmap directly from the RLE stream.


### Component(s)

C++, Parquet

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[C++][Parquet] Refactor RLE decoding by extract a RLE parser #47112

Describe the enhancement requested

Component(s)

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

[C++][Parquet] Refactor RLE decoding by extract a RLE parser #47112

Description

Describe the enhancement requested

Component(s)

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions