GH-48206: [C++][Parquet] Fix statistics logic on s390x by Vishwanatha-HD · Pull Request #48207 · apache/arrow

Vishwanatha-HD · 2025-11-21T15:01:46Z

Rationale for this change

This PR is intended to enable Parquet DB support on big-endian (s390x) systems. It fixes the Statistics logic.

What changes are included in this PR?

The fix changes the following file:
cpp/src/parquet/statistics.cc

Are these changes tested?

Yes. The changes were tested on s390x to confirm they work, and on x86 to confirm they introduce no regression.

Are there any user-facing changes?

No

GitHub main Issue link: #48151

GitHub Issue: [C++][Parquet] Fix Statistics logic to enable Parquet DB support on Big-Endian (s390x) systems #48206

github-actions · 2025-11-21T15:02:15Z

⚠️ GitHub issue #48206 has been automatically assigned in GitHub to PR creator.

kou · 2025-11-22T13:40:06Z

cpp/src/parquet/statistics.cc

+  if constexpr (std::is_same_v<DType, Int32Type>) {
+    uint32_t u;
+    std::memcpy(&u, &src, sizeof(u));
+    u = ::arrow::bit_util::ToLittleEndian(u);
+    dst->assign(reinterpret_cast<const char*>(&u), sizeof(u));
+    return;
+  } else if constexpr (std::is_same_v<DType, Int64Type>) {
+    uint64_t u;
+    std::memcpy(&u, &src, sizeof(u));
+    u = ::arrow::bit_util::ToLittleEndian(u);
+    dst->assign(reinterpret_cast<const char*>(&u), sizeof(u));
+    return;
+  } else if constexpr (std::is_same_v<DType, FloatType>) {
+    uint32_t u;
+    static_assert(sizeof(u) == sizeof(float), "size");
+    std::memcpy(&u, &src, sizeof(u));
+    u = ::arrow::bit_util::ToLittleEndian(u);
+    dst->assign(reinterpret_cast<const char*>(&u), sizeof(u));
+    return;
+  } else if constexpr (std::is_same_v<DType, DoubleType>) {
+    uint64_t u;
+    static_assert(sizeof(u) == sizeof(double), "size");
+    std::memcpy(&u, &src, sizeof(u));
+    u = ::arrow::bit_util::ToLittleEndian(u);
+    dst->assign(reinterpret_cast<const char*>(&u), sizeof(u));
+    return;
+  }


Can we do this in XXXEncoder::Put() instead of here?

Vishwanatha-HD · Nov 26, 2025

@kou, I tried modifying the XXXEncoder::Put() function to add this functionality. Unfortunately, it didn't work.
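The four branches in the diff above share one pattern: copy the value into an unsigned integer of the same size, normalize it to little-endian, then copy the raw bytes out. A minimal standalone sketch of the 32-bit case follows. `ToLittleEndianU32` and `PlainEncode32` are hypothetical names used only for illustration; the first is a stand-in for `::arrow::bit_util::ToLittleEndian` so that the example compiles without Arrow, and it is not Arrow's implementation.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <string>

// Hypothetical stand-in for ::arrow::bit_util::ToLittleEndian: returns a value
// whose in-memory bytes are little-endian, whatever the host byte order is.
inline uint32_t ToLittleEndianU32(uint32_t v) {
  const uint16_t probe = 1;
  unsigned char first_byte;
  std::memcpy(&first_byte, &probe, 1);
  if (first_byte == 1) return v;  // little-endian host: nothing to do
  // Big-endian host: reverse the four bytes.
  return ((v & 0x000000FFu) << 24) | ((v & 0x0000FF00u) << 8) |
         ((v & 0x00FF0000u) >> 8) | ((v & 0xFF000000u) >> 24);
}

// Same shape as the Int32Type and FloatType branches in the diff: memcpy into a
// same-sized unsigned integer, normalize, then assign the raw bytes to dst.
template <typename T>
void PlainEncode32(const T& src, std::string* dst) {
  static_assert(sizeof(T) == sizeof(uint32_t), "size");
  assert(dst != nullptr);
  uint32_t u;
  std::memcpy(&u, &src, sizeof(u));
  u = ToLittleEndianU32(u);
  dst->assign(reinterpret_cast<const char*>(&u), sizeof(u));
}
```

Calling `PlainEncode32<int32_t>(0x01020304, &out)` should leave the bytes 04 03 02 01 in `out` on either kind of host. The 64-bit branches differ only in the integer width.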

k8ika0s · 2025-11-23T22:40:31Z

@Vishwanatha-HD

Statistics tend to surface all sorts of subtle endian quirks, so it’s always interesting to see how different approaches handle those edge cases.

Running things on s390x, I’ve found that the most stable behavior usually comes from treating every numeric value—whether it’s a 32-bit int, a float, or the three-limb INT96—as if it should be serialized in LE form no matter what the host is doing. Once everything passes through that single normalization step, the defaults, comparisons, and encoder paths all line up cleanly across architectures.

Here, the explicit BE branches for Int32/Int64/Float/Double make the intention clear and should work fine, though it does mean LE and BE end up taking two quite different routes through the code. That can occasionally lead to tiny differences across platforms, especially when stats pages mix types or include INT96 timestamps.

Not raising an issue with the logic—just sharing patterns that have helped keep stats round-trips consistent across hosts.
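One way to picture the single normalization step described above is a serializer built on shifts rather than on host memory layout, so that LE and BE hosts run exactly the same code. The sketch below is an illustration, not code from this PR or from Arrow. `AppendLE`, `AppendFloatLE` and `AppendInt96LE` are hypothetical names, and the INT96 layout (three 32-bit limbs, least significant first) is an assumption made for the example.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>
#include <type_traits>

// Append an unsigned integer to dst, least-significant byte first. Shifts act on
// the value, not on memory, so the output is identical on LE and BE hosts.
template <typename UInt>
void AppendLE(UInt v, std::string* dst) {
  static_assert(std::is_unsigned_v<UInt>, "unsigned integer expected");
  assert(dst != nullptr);
  for (std::size_t i = 0; i < sizeof(UInt); ++i) {
    const auto byte = static_cast<unsigned char>((v >> (8 * i)) & 0xFF);
    dst->push_back(static_cast<char>(byte));
  }
}

// Floating-point values go through a same-sized unsigned integer first.
inline void AppendFloatLE(float f, std::string* dst) {
  uint32_t bits;
  static_assert(sizeof(bits) == sizeof(f), "size");
  std::memcpy(&bits, &f, sizeof(bits));
  AppendLE(bits, dst);
}

// Hypothetical three-limb INT96 (assumed layout: limbs[0] is least significant).
// Each 32-bit limb is written LE, in limb order, for 12 bytes in total.
inline void AppendInt96LE(const std::array<uint32_t, 3>& limbs, std::string* dst) {
  for (uint32_t limb : limbs) AppendLE(limb, dst);
}
```

Because there is no `#if` on the host byte order, there is only one code path to test. The trade-off is a byte-at-a-time loop instead of a single memcpy, which is unlikely to matter when only one statistics value is encoded at a time.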

pitrou · 2025-11-24T13:54:28Z

cpp/src/parquet/statistics.cc

 void TypedStatisticsImpl<DType>::PlainEncode(const T& src, std::string* dst) const {
+#if ARROW_LITTLE_ENDIAN
  auto encoder = MakeTypedEncoder<DType>(Encoding::PLAIN, false, descr_, pool_);
  encoder->Put(&src, 1);


Why not call ToLittleEndian here? The added code below seems overly complicated and it's not obvious why it should be behind a #if guard.

kiszk · Nov 25, 2025

I have the same question. Could you elaborate on why MakeTypedEncoder() cannot be used for these four types?
Is it for performance or functional reasons?

Vishwanatha-HD · Nov 26, 2025

Hi @pitrou & @kiszk,
As you both know, PlainEncode is expected to convert the value src into a sequence of bytes and make sure those bytes are little-endian (required by Parquet). Because different numeric types have different sizes, alignment requirements and representations, and Parquet requires that each one be written in strict little-endian format, byte for byte, we need to handle the individual types separately.

If you want I can come up with an optimized version of PlainEncode as shown below. This will simplify the main function since the helper function is handling most of the logic..

PlainEncoder:

template <typename T>
T ToLittleEndianValue(const T& value) {
#if ARROW_LITTLE_ENDIAN
  return value;
#else
  // Integer → swap using Arrow helpers
  if constexpr (std::is_integral_v<T>) {
    return arrow::bit_util::ToLittleEndian(value);
  }
  // Floating point → reinterpret as integer, swap, reinterpret back.
  else if constexpr (std::is_floating_point_v<T>) {
    using UInt = std::conditional_t<sizeof(T) == 4, uint32_t,
                                    std::conditional_t<sizeof(T) == 8, uint64_t, void>>;
    UInt u;
    std::memcpy(&u, &value, sizeof(u));
    u = arrow::bit_util::ToLittleEndian(u);
    T out;
    std::memcpy(&out, &u, sizeof(out));
    return out;
  }
  // Otherwise: return as-is (variable-length types handled by encoder)
  else {
    return value;
  }
#endif
}

Simplified main PlainEncode function >>>>>

template <typename DType>
void TypedStatisticsImpl<DType>::PlainEncode(const T& src, std::string* dst) const {
  using CType = typename DType::c_type;
  if constexpr (std::is_arithmetic_v<CType>) {
    // fixed-width numeric fast path (works for LE and BE)
    CType le_value = ToLittleEndianValue(src);
    dst->assign(reinterpret_cast<const char*>(&le_value), sizeof(le_value));
    return;
  }
  // Fallback for non-numeric types
  auto encoder = MakeTypedEncoder<DType>(Encoding::PLAIN, false, descr_, pool_);
  encoder->Put(&src, 1);
  auto buffer = encoder->FlushValues();
  dst->assign(reinterpret_cast<const char*>(buffer->data()),
              static_cast<size_t>(buffer->size()));
}

Similarly, I can write a generic FromLittleEndian helper function and simplify the PlainDecode function.. Please let me know.. Thanks..
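The decode-side helper mentioned above is not shown in the thread. As a rough sketch of what such a helper could look like, the function below rebuilds a 4- or 8-byte numeric value from little-endian bytes. `DecodeFixedWidthLE` is a hypothetical name, and this is not the code that was eventually pushed to the PR. It assembles the value with shifts, so it needs no `#if` on the host byte order.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>
#include <type_traits>

// Hypothetical helper (not Arrow API): rebuild a fixed-width numeric value from
// sizeof(T) little-endian bytes. Returns false if the input has the wrong size.
template <typename T>
bool DecodeFixedWidthLE(const std::string& src, T* out) {
  static_assert(std::is_arithmetic_v<T>, "numeric type expected");
  static_assert(sizeof(T) == 4 || sizeof(T) == 8, "4- or 8-byte type expected");
  using UInt = std::conditional_t<sizeof(T) == 4, uint32_t, uint64_t>;
  assert(out != nullptr);
  if (src.size() != sizeof(T)) return false;
  UInt bits = 0;
  for (std::size_t i = 0; i < sizeof(T); ++i) {
    // Byte i carries bits [8*i, 8*i + 8) of the value, independent of the host.
    bits |= static_cast<UInt>(static_cast<unsigned char>(src[i])) << (8 * i);
  }
  std::memcpy(out, &bits, sizeof(T));  // reinterpret the bit pattern as T
  return true;
}
```

Feeding it the four bytes 04 03 02 01 with `T = int32_t` should yield 0x01020304 on any host, which mirrors the encode path proposed above.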

pitrou · Nov 26, 2025

I'm ok with optimizing/simplifying the encode/decode functions, because it's counter-productive to instantiate an encoder/decoder just for a single PLAIN value.

kiszk · Nov 27, 2025

Thanks for the clarification. I am also fine with simplifying the encode/decode functions. The simplified version is easy to understand and maintain.

Vishwanatha-HD · Nov 28, 2025

@pitrou & @kiszk,
As you both suggested, I have optimized and simplified the PlainEncode and PlainDecode functions. I have tested my changes on both s390x and x86 machines, and they work fine. Please review. Thanks.

cpp/src/parquet/statistics.cc

Vishwanatha-HD

I have addressed all the review comments. Thanks.

Vishwanatha-HD · 2025-11-26T11:20:20Z

cpp/src/parquet/statistics.cc

Vishwanatha-HD · 2025-11-26T11:22:14Z

cpp/src/parquet/statistics.cc

Vishwanatha-HD requested a review from wgtmac as a code owner November 21, 2025 15:01

Vishwanatha-HD mentioned this pull request Nov 21, 2025

[C++][Parquet] Fix Statistics logic to enable Parquet DB support on Big-Endian (s390x) systems #48206

Open

github-actions bot added the Component: Parquet, Component: C++ and awaiting review labels Nov 21, 2025

k8ika0s mentioned this pull request Nov 21, 2025

GH-48213: [C++][Parquet] Fix endianness and test failures on s390x (big-endian) (supersedes partial fixes) #48212

Closed

Vishwanatha-HD mentioned this pull request Nov 21, 2025

[C++][Parquet] Enable Parquet DB support on Big Endian (IBM Z) systems #48151

Open

Vishwanatha-HD force-pushed the fixStatistics branch from 69807c0 to fc2f62c Compare November 22, 2025 05:01

kou reviewed Nov 22, 2025

pitrou reviewed Nov 24, 2025

github-actions bot added the awaiting committer review label and removed the awaiting review label Nov 24, 2025

pitrou reviewed Nov 24, 2025

kou changed the title ~~GH-48206 Fix Statistics logic to enable Parquet DB support on s390x~~ GH-48206: [C++] Fix Statistics logic to enable Parquet DB support on s390x Nov 25, 2025

kou changed the title ~~GH-48206: [C++] Fix Statistics logic to enable Parquet DB support on s390x~~ GH-48206: [C++][Parquet] Fix statistics logic on s390x Nov 25, 2025

github-actions bot added the awaiting changes label and removed the awaiting committer review label Nov 25, 2025

Vishwanatha-HD force-pushed the fixStatistics branch from fc2f62c to 1a2d962 Compare November 26, 2025 12:37

Vishwanatha-HD commented Nov 26, 2025

github-actions bot added the awaiting change review label and removed the awaiting changes label Nov 26, 2025

Vishwanatha-HD commented Nov 26, 2025

github-actions bot added the awaiting changes label and removed the awaiting change review label Nov 27, 2025

Vishwanatha-HD force-pushed the fixStatistics branch from 1a2d962 to 7eaa4bd Compare November 28, 2025 15:10

github-actions bot added the awaiting change review label and removed the awaiting changes label Nov 28, 2025

apacheGH-48206 Fix Statistics logic to enable Parquet DB support on s…

a6f2bf9

…390x

Vishwanatha-HD force-pushed the fixStatistics branch from 7eaa4bd to a6f2bf9 Compare November 29, 2025 13:18

5 participants
