Skip to content

GH-48206: [C++][Parquet] Fix statistics logic on s390x#48207

Open
Vishwanatha-HD wants to merge 1 commit intoapache:mainfrom
Vishwanatha-HD:fixStatistics
Open

GH-48206: [C++][Parquet] Fix statistics logic on s390x#48207
Vishwanatha-HD wants to merge 1 commit intoapache:mainfrom
Vishwanatha-HD:fixStatistics

Conversation

@Vishwanatha-HD
Copy link
Copy Markdown
Contributor

@Vishwanatha-HD Vishwanatha-HD commented Nov 21, 2025

Rationale for this change

This PR is intended to enable Parquet DB support on Big-endian (s390x) systems. The fix in this PR fixes the Statistics logic.

What changes are included in this PR?

The fix includes changes to following file:
cpp/src/parquet/statistics.cc

Are these changes tested?

Yes. The changes are tested on s390x arch to make sure things are working fine. The fix is also tested on x86 arch, to make sure there is no new regression introduced.

Are there any user-facing changes?

No

GitHub main Issue link: #48151

@github-actions
Copy link
Copy Markdown

⚠️ GitHub issue #48206 has been automatically assigned in GitHub to PR creator.

Comment on lines +937 to +963
if constexpr (std::is_same_v<DType, Int32Type>) {
uint32_t u;
std::memcpy(&u, &src, sizeof(u));
u = ::arrow::bit_util::ToLittleEndian(u);
dst->assign(reinterpret_cast<const char*>(&u), sizeof(u));
return;
} else if constexpr (std::is_same_v<DType, Int64Type>) {
uint64_t u;
std::memcpy(&u, &src, sizeof(u));
u = ::arrow::bit_util::ToLittleEndian(u);
dst->assign(reinterpret_cast<const char*>(&u), sizeof(u));
return;
} else if constexpr (std::is_same_v<DType, FloatType>) {
uint32_t u;
static_assert(sizeof(u) == sizeof(float), "size");
std::memcpy(&u, &src, sizeof(u));
u = ::arrow::bit_util::ToLittleEndian(u);
dst->assign(reinterpret_cast<const char*>(&u), sizeof(u));
return;
} else if constexpr (std::is_same_v<DType, DoubleType>) {
uint64_t u;
static_assert(sizeof(u) == sizeof(double), "size");
std::memcpy(&u, &src, sizeof(u));
u = ::arrow::bit_util::ToLittleEndian(u);
dst->assign(reinterpret_cast<const char*>(&u), sizeof(u));
return;
}
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we do this in XXXEncoder::Put() instead of here?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kou.. I tried modifying the XXXEncoder::Put() function to have this functionality.. Unfortunately, it didnt work.

@k8ika0s
Copy link
Copy Markdown

k8ika0s commented Nov 23, 2025

@Vishwanatha-HD

Statistics tend to surface all sorts of subtle endian quirks, so it’s always interesting to see how different approaches handle those edge cases.

Running things on s390x, I’ve found that the most stable behavior usually comes from treating every numeric value—whether it’s a 32-bit int, a float, or the three-limb INT96—as if it should be serialized in LE form no matter what the host is doing. Once everything passes through that single normalization step, the defaults, comparisons, and encoder paths all line up cleanly across architectures.

Here, the explicit BE branches for Int32/Int64/Float/Double make the intention clear and should work fine, though it does mean LE and BE end up taking two quite different routes through the code. That can occasionally lead to tiny differences across platforms, especially when stats pages mix types or include INT96 timestamps.

Not raising an issue with the logic—just sharing patterns that have helped keep stats round-trips consistent across hosts.

void TypedStatisticsImpl<DType>::PlainEncode(const T& src, std::string* dst) const {
#if ARROW_LITTLE_ENDIAN
auto encoder = MakeTypedEncoder<DType>(Encoding::PLAIN, false, descr_, pool_);
encoder->Put(&src, 1);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not call ToLittleEndian here? The added code below seems overly complicated and it's not obvious why it should be behind a #if guard.

Copy link
Copy Markdown
Member

@kiszk kiszk Nov 25, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have the same question. Could you elaborate why MakeTypedEncoder() cannot be used for these four types?
Performance or functional reasons?

Copy link
Copy Markdown
Contributor Author

@Vishwanatha-HD Vishwanatha-HD Nov 26, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @pitrou & @kiszk
As you both know, the expectation from the PlainEncode is that

Convert the value src into a sequence of bytes and make sure those bytes are little-endian (required by Parquet). Because different numeric types have different sizes, alignment requirements and representations, and Parquet requires that each one be written in strict little-endian format, byte-for-byte, we need to handle the individual types seperately..

If you want I can come up with an optimized version of PlainEncode as shown below. This will simplify the main function since the helper function is handling most of the logic..

PlainEncoder:

template <typename T>
T ToLittleEndianValue(const T& value) {
#if ARROW_LITTLE_ENDIAN
  return value;
#else
  // Integer → swap using Arrow helpers
  if constexpr (std::is_integral_v<T>) {
    return arrow::bit_util::ToLittleEndian(value);
  }

  // Floating point → reinterpret as integer, swap, reinterpret back.
  else if constexpr (std::is_floating_point_v<T>) {
    using UInt =
        std::conditional_t<sizeof(T) == 4, uint32_t,
        std::conditional_t<sizeof(T) == 8, uint64_t, void>>;

    UInt u;
    std::memcpy(&u, &value, sizeof(u));
    u = arrow::bit_util::ToLittleEndian(u);

    T out;
    std::memcpy(&out, &u, sizeof(out));
    return out;
  }

  // Otherwise: return as-is (variable-length types handled by encoder)
  else {
    return value;
  }
#endif
}

Simplified main PlainEncode function >>>>>

template <typename DType>
void TypedStatisticsImpl<DType>::PlainEncode(const T& src, std::string* dst) const {
  using CType = typename DType::c_type;

  if constexpr (std::is_arithmetic_v<CType>) {
    // fixed-width numeric fast path (works for LE and BE)
    CType le_value = ToLittleEndianValue(src);
    dst->assign(reinterpret_cast<const char*>(&le_value), sizeof(le_value));
    return;
  }

  // Fallback for non-numeric types
  auto encoder = MakeTypedEncoder<DType>(Encoding::PLAIN, false, descr_, pool_);
  encoder->Put(&src, 1);
  auto buffer = encoder->FlushValues();
  dst->assign(reinterpret_cast<const char*>(buffer->data()),
              static_cast<size_t>(buffer->size()));
}

Similarly, I can write a generic FromLittleEndian helper function and simplify the PlainDecode function.. Please let me know.. Thanks..

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm ok with optimizing/simplifying the encode/decode functions, because it's counter-productive to instantiate an encoder/decoder just for a single PLAIN value.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the clarification. I am also fine with simplifying the encode/decode functions. It is easy to understand and maintain.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pitrou & @kiszk..
As suggested by you both, I have optimized and simplified the PlainEncode & PlainDecode functions.. I have tested my changes both on s390x and on x86 machines. They are working fine. Please have a review.. Thanks..

@github-actions github-actions bot added awaiting committer review Awaiting committer review and removed awaiting review Awaiting review labels Nov 24, 2025
@kou kou changed the title GH-48206 Fix Statistics logic to enable Parquet DB support on s390x GH-48206: [C++] Fix Statistics logic to enable Parquet DB support on s390x Nov 25, 2025
@kou kou changed the title GH-48206: [C++] Fix Statistics logic to enable Parquet DB support on s390x GH-48206: [C++][Parquet] Fix statistics logic on s390x Nov 25, 2025
@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting committer review Awaiting committer review labels Nov 25, 2025
Copy link
Copy Markdown
Contributor Author

@Vishwanatha-HD Vishwanatha-HD left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have addressed all the review comments. Thanks..

void TypedStatisticsImpl<DType>::PlainEncode(const T& src, std::string* dst) const {
#if ARROW_LITTLE_ENDIAN
auto encoder = MakeTypedEncoder<DType>(Encoding::PLAIN, false, descr_, pool_);
encoder->Put(&src, 1);
Copy link
Copy Markdown
Contributor Author

@Vishwanatha-HD Vishwanatha-HD Nov 26, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @pitrou & @kiszk
As you both know, the expectation from the PlainEncode is that

Convert the value src into a sequence of bytes and make sure those bytes are little-endian (required by Parquet). Because different numeric types have different sizes, alignment requirements and representations, and Parquet requires that each one be written in strict little-endian format, byte-for-byte, we need to handle the individual types seperately..

If you want I can come up with an optimized version of PlainEncode as shown below. This will simplify the main function since the helper function is handling most of the logic..

PlainEncoder:

template <typename T>
T ToLittleEndianValue(const T& value) {
#if ARROW_LITTLE_ENDIAN
  return value;
#else
  // Integer → swap using Arrow helpers
  if constexpr (std::is_integral_v<T>) {
    return arrow::bit_util::ToLittleEndian(value);
  }

  // Floating point → reinterpret as integer, swap, reinterpret back.
  else if constexpr (std::is_floating_point_v<T>) {
    using UInt =
        std::conditional_t<sizeof(T) == 4, uint32_t,
        std::conditional_t<sizeof(T) == 8, uint64_t, void>>;

    UInt u;
    std::memcpy(&u, &value, sizeof(u));
    u = arrow::bit_util::ToLittleEndian(u);

    T out;
    std::memcpy(&out, &u, sizeof(out));
    return out;
  }

  // Otherwise: return as-is (variable-length types handled by encoder)
  else {
    return value;
  }
#endif
}

Simplified main PlainEncode function >>>>>

template <typename DType>
void TypedStatisticsImpl<DType>::PlainEncode(const T& src, std::string* dst) const {
  using CType = typename DType::c_type;

  if constexpr (std::is_arithmetic_v<CType>) {
    // fixed-width numeric fast path (works for LE and BE)
    CType le_value = ToLittleEndianValue(src);
    dst->assign(reinterpret_cast<const char*>(&le_value), sizeof(le_value));
    return;
  }

  // Fallback for non-numeric types
  auto encoder = MakeTypedEncoder<DType>(Encoding::PLAIN, false, descr_, pool_);
  encoder->Put(&src, 1);
  auto buffer = encoder->FlushValues();
  dst->assign(reinterpret_cast<const char*>(buffer->data()),
              static_cast<size_t>(buffer->size()));
}

Similarly, I can write a generic FromLittleEndian helper function and simplify the PlainDecode function.. Please let me know.. Thanks..

Comment on lines +937 to +963
if constexpr (std::is_same_v<DType, Int32Type>) {
uint32_t u;
std::memcpy(&u, &src, sizeof(u));
u = ::arrow::bit_util::ToLittleEndian(u);
dst->assign(reinterpret_cast<const char*>(&u), sizeof(u));
return;
} else if constexpr (std::is_same_v<DType, Int64Type>) {
uint64_t u;
std::memcpy(&u, &src, sizeof(u));
u = ::arrow::bit_util::ToLittleEndian(u);
dst->assign(reinterpret_cast<const char*>(&u), sizeof(u));
return;
} else if constexpr (std::is_same_v<DType, FloatType>) {
uint32_t u;
static_assert(sizeof(u) == sizeof(float), "size");
std::memcpy(&u, &src, sizeof(u));
u = ::arrow::bit_util::ToLittleEndian(u);
dst->assign(reinterpret_cast<const char*>(&u), sizeof(u));
return;
} else if constexpr (std::is_same_v<DType, DoubleType>) {
uint64_t u;
static_assert(sizeof(u) == sizeof(double), "size");
std::memcpy(&u, &src, sizeof(u));
u = ::arrow::bit_util::ToLittleEndian(u);
dst->assign(reinterpret_cast<const char*>(&u), sizeof(u));
return;
}
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kou.. I tried modifying the XXXEncoder::Put() function to have this functionality.. Unfortunately, it didnt work.

@github-actions github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels Nov 26, 2025
Copy link
Copy Markdown
Contributor Author

@Vishwanatha-HD Vishwanatha-HD left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have addressed all the review comments.. Thanks..

@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting change review Awaiting change review labels Nov 27, 2025
@github-actions github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels Nov 28, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants