GH-48206: [C++][Parquet] Fix statistics logic on s390x#48207
GH-48206: [C++][Parquet] Fix statistics logic on s390x#48207Vishwanatha-HD wants to merge 1 commit intoapache:mainfrom
Conversation
|
|
69807c0 to
fc2f62c
Compare
cpp/src/parquet/statistics.cc
Outdated
| if constexpr (std::is_same_v<DType, Int32Type>) { | ||
| uint32_t u; | ||
| std::memcpy(&u, &src, sizeof(u)); | ||
| u = ::arrow::bit_util::ToLittleEndian(u); | ||
| dst->assign(reinterpret_cast<const char*>(&u), sizeof(u)); | ||
| return; | ||
| } else if constexpr (std::is_same_v<DType, Int64Type>) { | ||
| uint64_t u; | ||
| std::memcpy(&u, &src, sizeof(u)); | ||
| u = ::arrow::bit_util::ToLittleEndian(u); | ||
| dst->assign(reinterpret_cast<const char*>(&u), sizeof(u)); | ||
| return; | ||
| } else if constexpr (std::is_same_v<DType, FloatType>) { | ||
| uint32_t u; | ||
| static_assert(sizeof(u) == sizeof(float), "size"); | ||
| std::memcpy(&u, &src, sizeof(u)); | ||
| u = ::arrow::bit_util::ToLittleEndian(u); | ||
| dst->assign(reinterpret_cast<const char*>(&u), sizeof(u)); | ||
| return; | ||
| } else if constexpr (std::is_same_v<DType, DoubleType>) { | ||
| uint64_t u; | ||
| static_assert(sizeof(u) == sizeof(double), "size"); | ||
| std::memcpy(&u, &src, sizeof(u)); | ||
| u = ::arrow::bit_util::ToLittleEndian(u); | ||
| dst->assign(reinterpret_cast<const char*>(&u), sizeof(u)); | ||
| return; | ||
| } |
There was a problem hiding this comment.
Can we do this in XXXEncoder::Put() instead of here?
There was a problem hiding this comment.
@kou.. I tried modifying the XXXEncoder::Put() function to have this functionality.. Unfortunately, it didnt work.
|
Statistics tend to surface all sorts of subtle endian quirks, so it’s always interesting to see how different approaches handle those edge cases. Running things on s390x, I’ve found that the most stable behavior usually comes from treating every numeric value—whether it’s a 32-bit int, a float, or the three-limb INT96—as if it should be serialized in LE form no matter what the host is doing. Once everything passes through that single normalization step, the defaults, comparisons, and encoder paths all line up cleanly across architectures. Here, the explicit BE branches for Int32/Int64/Float/Double make the intention clear and should work fine, though it does mean LE and BE end up taking two quite different routes through the code. That can occasionally lead to tiny differences across platforms, especially when stats pages mix types or include INT96 timestamps. Not raising an issue with the logic—just sharing patterns that have helped keep stats round-trips consistent across hosts. |
cpp/src/parquet/statistics.cc
Outdated
| void TypedStatisticsImpl<DType>::PlainEncode(const T& src, std::string* dst) const { | ||
| #if ARROW_LITTLE_ENDIAN | ||
| auto encoder = MakeTypedEncoder<DType>(Encoding::PLAIN, false, descr_, pool_); | ||
| encoder->Put(&src, 1); |
There was a problem hiding this comment.
Why not call ToLittleEndian here? The added code below seems overly complicated and it's not obvious why it should be behind a #if guard.
There was a problem hiding this comment.
I have the same question. Could you elaborate why MakeTypedEncoder() cannot be used for these four types?
Performance or functional reasons?
There was a problem hiding this comment.
Hi @pitrou & @kiszk
As you both know, the expectation from the PlainEncode is that
Convert the value src into a sequence of bytes and make sure those bytes are little-endian (required by Parquet). Because different numeric types have different sizes, alignment requirements and representations, and Parquet requires that each one be written in strict little-endian format, byte-for-byte, we need to handle the individual types seperately..
If you want I can come up with an optimized version of PlainEncode as shown below. This will simplify the main function since the helper function is handling most of the logic..
PlainEncoder:
template <typename T>
T ToLittleEndianValue(const T& value) {
#if ARROW_LITTLE_ENDIAN
return value;
#else
// Integer → swap using Arrow helpers
if constexpr (std::is_integral_v<T>) {
return arrow::bit_util::ToLittleEndian(value);
}
// Floating point → reinterpret as integer, swap, reinterpret back.
else if constexpr (std::is_floating_point_v<T>) {
using UInt =
std::conditional_t<sizeof(T) == 4, uint32_t,
std::conditional_t<sizeof(T) == 8, uint64_t, void>>;
UInt u;
std::memcpy(&u, &value, sizeof(u));
u = arrow::bit_util::ToLittleEndian(u);
T out;
std::memcpy(&out, &u, sizeof(out));
return out;
}
// Otherwise: return as-is (variable-length types handled by encoder)
else {
return value;
}
#endif
}Simplified main PlainEncode function >>>>>
template <typename DType>
void TypedStatisticsImpl<DType>::PlainEncode(const T& src, std::string* dst) const {
using CType = typename DType::c_type;
if constexpr (std::is_arithmetic_v<CType>) {
// fixed-width numeric fast path (works for LE and BE)
CType le_value = ToLittleEndianValue(src);
dst->assign(reinterpret_cast<const char*>(&le_value), sizeof(le_value));
return;
}
// Fallback for non-numeric types
auto encoder = MakeTypedEncoder<DType>(Encoding::PLAIN, false, descr_, pool_);
encoder->Put(&src, 1);
auto buffer = encoder->FlushValues();
dst->assign(reinterpret_cast<const char*>(buffer->data()),
static_cast<size_t>(buffer->size()));
}Similarly, I can write a generic FromLittleEndian helper function and simplify the PlainDecode function.. Please let me know.. Thanks..
There was a problem hiding this comment.
I'm ok with optimizing/simplifying the encode/decode functions, because it's counter-productive to instantiate an encoder/decoder just for a single PLAIN value.
There was a problem hiding this comment.
Thanks for the clarification. I am also fine with simplifying the encode/decode functions. It is easy to understand and maintain.
fc2f62c to
1a2d962
Compare
Vishwanatha-HD
left a comment
There was a problem hiding this comment.
I have addressed all the review comments. Thanks..
cpp/src/parquet/statistics.cc
Outdated
| void TypedStatisticsImpl<DType>::PlainEncode(const T& src, std::string* dst) const { | ||
| #if ARROW_LITTLE_ENDIAN | ||
| auto encoder = MakeTypedEncoder<DType>(Encoding::PLAIN, false, descr_, pool_); | ||
| encoder->Put(&src, 1); |
There was a problem hiding this comment.
Hi @pitrou & @kiszk
As you both know, the expectation from the PlainEncode is that
Convert the value src into a sequence of bytes and make sure those bytes are little-endian (required by Parquet). Because different numeric types have different sizes, alignment requirements and representations, and Parquet requires that each one be written in strict little-endian format, byte-for-byte, we need to handle the individual types seperately..
If you want I can come up with an optimized version of PlainEncode as shown below. This will simplify the main function since the helper function is handling most of the logic..
PlainEncoder:
template <typename T>
T ToLittleEndianValue(const T& value) {
#if ARROW_LITTLE_ENDIAN
return value;
#else
// Integer → swap using Arrow helpers
if constexpr (std::is_integral_v<T>) {
return arrow::bit_util::ToLittleEndian(value);
}
// Floating point → reinterpret as integer, swap, reinterpret back.
else if constexpr (std::is_floating_point_v<T>) {
using UInt =
std::conditional_t<sizeof(T) == 4, uint32_t,
std::conditional_t<sizeof(T) == 8, uint64_t, void>>;
UInt u;
std::memcpy(&u, &value, sizeof(u));
u = arrow::bit_util::ToLittleEndian(u);
T out;
std::memcpy(&out, &u, sizeof(out));
return out;
}
// Otherwise: return as-is (variable-length types handled by encoder)
else {
return value;
}
#endif
}Simplified main PlainEncode function >>>>>
template <typename DType>
void TypedStatisticsImpl<DType>::PlainEncode(const T& src, std::string* dst) const {
using CType = typename DType::c_type;
if constexpr (std::is_arithmetic_v<CType>) {
// fixed-width numeric fast path (works for LE and BE)
CType le_value = ToLittleEndianValue(src);
dst->assign(reinterpret_cast<const char*>(&le_value), sizeof(le_value));
return;
}
// Fallback for non-numeric types
auto encoder = MakeTypedEncoder<DType>(Encoding::PLAIN, false, descr_, pool_);
encoder->Put(&src, 1);
auto buffer = encoder->FlushValues();
dst->assign(reinterpret_cast<const char*>(buffer->data()),
static_cast<size_t>(buffer->size()));
}Similarly, I can write a generic FromLittleEndian helper function and simplify the PlainDecode function.. Please let me know.. Thanks..
cpp/src/parquet/statistics.cc
Outdated
| if constexpr (std::is_same_v<DType, Int32Type>) { | ||
| uint32_t u; | ||
| std::memcpy(&u, &src, sizeof(u)); | ||
| u = ::arrow::bit_util::ToLittleEndian(u); | ||
| dst->assign(reinterpret_cast<const char*>(&u), sizeof(u)); | ||
| return; | ||
| } else if constexpr (std::is_same_v<DType, Int64Type>) { | ||
| uint64_t u; | ||
| std::memcpy(&u, &src, sizeof(u)); | ||
| u = ::arrow::bit_util::ToLittleEndian(u); | ||
| dst->assign(reinterpret_cast<const char*>(&u), sizeof(u)); | ||
| return; | ||
| } else if constexpr (std::is_same_v<DType, FloatType>) { | ||
| uint32_t u; | ||
| static_assert(sizeof(u) == sizeof(float), "size"); | ||
| std::memcpy(&u, &src, sizeof(u)); | ||
| u = ::arrow::bit_util::ToLittleEndian(u); | ||
| dst->assign(reinterpret_cast<const char*>(&u), sizeof(u)); | ||
| return; | ||
| } else if constexpr (std::is_same_v<DType, DoubleType>) { | ||
| uint64_t u; | ||
| static_assert(sizeof(u) == sizeof(double), "size"); | ||
| std::memcpy(&u, &src, sizeof(u)); | ||
| u = ::arrow::bit_util::ToLittleEndian(u); | ||
| dst->assign(reinterpret_cast<const char*>(&u), sizeof(u)); | ||
| return; | ||
| } |
There was a problem hiding this comment.
@kou.. I tried modifying the XXXEncoder::Put() function to have this functionality.. Unfortunately, it didnt work.
Vishwanatha-HD
left a comment
There was a problem hiding this comment.
I have addressed all the review comments.. Thanks..
1a2d962 to
7eaa4bd
Compare
7eaa4bd to
a6f2bf9
Compare
Rationale for this change
This PR is intended to enable Parquet DB support on Big-endian (s390x) systems. The fix in this PR fixes the Statistics logic.
What changes are included in this PR?
The fix includes changes to following file:
cpp/src/parquet/statistics.cc
Are these changes tested?
Yes. The changes are tested on s390x arch to make sure things are working fine. The fix is also tested on x86 arch, to make sure there is no new regression introduced.
Are there any user-facing changes?
No
GitHub main Issue link: #48151