Skip to content

[C++][Parquet] Overly strict check in LevelDecoder::SetData #48234

@pitrou

Description

@pitrou

Describe the bug, including details regarding any error messages, version, and platform.

The data_size - 4 check in this snippet does not have a rational justification and looks overly strict:

case Encoding::BIT_PACKED: {
int num_bits = 0;
if (MultiplyWithOverflow(num_buffered_values, bit_width_, &num_bits)) {
throw ParquetException(
"Number of buffered values too large (corrupt data page?)");
}
num_bytes = static_cast<int32_t>(bit_util::BytesForBits(num_bits));
if (num_bytes < 0 || num_bytes > data_size - 4) {
throw ParquetException("Received invalid number of bytes (corrupt data page?)");
}
if (!bit_packed_decoder_) {
bit_packed_decoder_ =
std::make_unique<::arrow::bit_util::BitReader>(data, num_bytes);
} else {
bit_packed_decoder_->Reset(data, num_bytes);
}
return num_bytes;
}

It probably doesn't matter in most cases, as there are at least 4 bytes of encoded page values after the levels, but in some fringe cases (such as all/most values being null) there might not.

The original chance from data_size to data_size - 4 was made by me in #6848 and it was probably based on a misunderstanding, or just blindly copying the similar check for Encoding::RLE (which, unlike Encoding::BIT_PACKED, does have a 4-byte length header).

Component(s)

C++, Parquet

Metadata

Metadata

Assignees

Type

No type
No fields configured for issues without a type.

Projects

No projects

Milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions