Skip to content

[C++][Parquet] Refactor RLE decoding by extract a RLE parser #47112

@pitrou

Description

@pitrou

Describe the enhancement requested

The Parquet Bit-packed RLE encoding is used in two performance-critical situations:

  1. definition and repetition levels, which are present on many or even most columns (when writing nullable Arrow data, for example, definition levels are always generated even if there are no nulls)
  2. dictionary indices for Dictionary encoding

Both these areas lack performance currently, and it has been difficult to optimize.
The RLE decoder is currently spread over two facilities: the RleDecoder does the actual decoding to a linear array of integers, while the BitReader hosts some low-level reading routines.

This issue is to refactor the RleDecoder into a parser+decoder. The parser would be a pure event-driven facility to decompose a RLE stream into its individual tokens (. The parser does not need to use the BitReader class, it can just reimplement the corresponding operations.

(an inspiration for event-driven parser APIs can be found in RapidJSON, though the RLE parser would be much simpler as there are only two different kinds of tokens in the RLE stream: literal runs and repeated runs)

This refactoring will probably not bring any performance benefits in itself (though who knows?). However, once we have a RLE parser, we can then avoid the decoder in some situations.

For example, a very common situation is OPTIONAL columns in Parquet, where definition levels can have only two values: 0 and 1. Currently, we decode such levels into 16-bit integers and then re-encode them into a bitmap. This is wasteful and pointless, while we could generate the bitmap directly from the RLE stream.

Component(s)

C++, Parquet

Metadata

Metadata

Assignees

Type

No type
No fields configured for issues without a type.

Projects

No projects

Milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions