GH-48198: [C++][Parquet] Fix all the testcase issues to enable Parquet DB support on s390x by Vishwanatha-HD · Pull Request #48200 · apache/arrow

Vishwanatha-HD · 2025-11-21T12:11:36Z

Rationale for this change

This PR is intended to enable Parquet DB support on Big-endian (s390x) systems. The fix in this PR fixes all the remaining testcase issues. The fix mainly involves the byte swapping in order to take care of endianness issues.

What changes are included in this PR?

The fix includes changes to following testcase files:
cpp/src/arrow/dataset/file_parquet_test.cc
cpp/src/arrow/util/byte_stream_split_test.cc
cpp/src/parquet/arrow/arrow_reader_writer_test.cc
cpp/src/parquet/column_writer_test.cc
cpp/src/parquet/encoding_test.cc
cpp/src/parquet/level_conversion_test.cc
cpp/src/parquet/metadata_test.cc
cpp/src/parquet/reader_test.cc
cpp/src/parquet/statistics_test.cc
cpp/src/parquet/types_test.cc

Are these changes tested?

Yes. The changes are tested on s390x arch to make sure things are working fine. The fix is also tested on x86 arch, to make sure there is no new regression introduced.

Are there any user-facing changes?

No

GitHub main Issue link: #48151

GitHub Issue: [C++][Parquet] Fix all the Testcase issues to enable Parquet DB support on s390x #48198

github-actions · 2025-11-21T12:12:01Z

⚠️ GitHub issue #48198 has been automatically assigned in GitHub to PR creator.

k8ika0s · 2025-11-23T22:47:31Z

@Vishwanatha-HD

Looking over this batch of tests, a few patterns in how the expected bytes are formed caught my eye. Nothing alarming — just places where BE ends up taking a slightly different journey than LE depending on where the normalization happens.

• Half-floats: the assertions use the raw bits from the array, so BE naturally ends up comparing against host-order patterns unless those values get flipped before they reach the check.

• INT96 and other scalar stats: some expected strings are built straight from native-order limbs, which means the two architectures diverge a bit unless they’re pushed through one consistent conversion point.

• ByteStreamSplit: since the inputs to the encoder come in as native-order integers, the resulting split streams follow whatever the host layout is unless they’re normalized up front.

• Page-index and stats checks: a few comparisons still assume host-order for floats/doubles, while others already expect swapped limbs.

Individually these are all reasonable, but they do lead to small differences on BE because the tests anchor themselves at slightly different spots in the pipeline. Pulling the expectations toward one shared byte layout in a couple of places would likely smooth that out across hosts.

Vishwanatha-HD · 2025-11-24T13:28:03Z

@Vishwanatha-HD

Looking over this batch of tests, a few patterns in how the expected bytes are formed caught my eye. Nothing alarming — just places where BE ends up taking a slightly different journey than LE depending on where the normalization happens.

• Half-floats: the assertions use the raw bits from the array, so BE naturally ends up comparing against host-order patterns unless those values get flipped before they reach the check.

• INT96 and other scalar stats: some expected strings are built straight from native-order limbs, which means the two architectures diverge a bit unless they’re pushed through one consistent conversion point.

• ByteStreamSplit: since the inputs to the encoder come in as native-order integers, the resulting split streams follow whatever the host layout is unless they’re normalized up front.

• Page-index and stats checks: a few comparisons still assume host-order for floats/doubles, while others already expect swapped limbs.

Individually these are all reasonable, but they do lead to small differences on BE because the tests anchor themselves at slightly different spots in the pipeline. Pulling the expectations toward one shared byte layout in a couple of places would likely smooth that out across hosts.

@k8ika0s.. Thanks for your review comments.. But please note that we are already taking care of the conversion required in encoders and decoders.. Hence we need to take care of the corresponding testcases accordingly..
I have tested with my changes on the s390x systems and also on Openshift AI workloads.. It works properly.. Hence there is no concern with these changes..

…Parquet DB support on s390x

Vishwanatha-HD · 2026-06-03T11:49:59Z

@pitrou @kou @kiszk @zanmato1984
Hi All,
I know its been long time since I have my PRs opened.. Can you please help me with review and merging to the upstream. Is there anything more that you people want me to do it, I would be happy to work on it. Please suggest. Thanks.

Vishwanatha-HD · 2026-06-03T12:15:42Z

I verified that with my fix all the 113 testcase passes without any issues.

kou · 2026-06-03T21:27:29Z

Could you check CI failures?

Copilot

Pull request overview

This PR updates C++ Parquet/Arrow test suites to behave correctly on big-endian (s390x) by ensuring test data that represents Parquet-encoded values is consistently expressed in little-endian form (or by avoiding endianness-dependent assumptions in test expectations). This supports the broader effort to enable Parquet DB support on big-endian platforms by eliminating endianness-related test failures/hangs.

Changes:

Convert test-constructed “encoded” min/max (and similar) values to little-endian byte representations before comparing/formatting.
Make INT96 and ByteStreamSplit-related tests endian-aware and/or endian-independent.
Adjust a few endian-guarded test cases/macros for correctness and portability.

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 4 comments.

Show a summary per file

File	Description
cpp/src/parquet/types_test.cc	Updates statistic-formatting and INT96 timestamp tests to avoid endianness-dependent constructions.
cpp/src/parquet/statistics_test.cc	Writes encoded statistics values in little-endian form and updates helper encoding for cross-endian correctness.
cpp/src/parquet/reader_test.cc	Tweaks ByteStreamSplit integration assertions for endian-specific behavior.
cpp/src/parquet/metadata_test.cc	Writes encoded statistics min/max as little-endian bytes when building metadata in tests.
cpp/src/parquet/level_conversion_test.cc	Fixes endian macro usage to avoid incorrectly enabling little-endian-only tests.
cpp/src/parquet/encoding_test.cc	Updates ByteStreamSplit encode/decode expected values to reflect native-order outputs after decoding / before encoding.
cpp/src/parquet/column_writer_test.cc	Adjusts RLE level encoding length prefix handling and endianness for the prefix.
cpp/src/arrow/util/byte_stream_split_test.cc	Updates the reference ByteStreamSplit encoder to account for big-endian byte ordering.
cpp/src/arrow/dataset/file_parquet_test.cc	Ensures Parquet statistics bytes are constructed in little-endian form in dataset tests.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

  int32_t int_min = 1024;
  int32_t int_max = 2048;
-  smin = std::string(reinterpret_cast<char*>(&int_min), sizeof(int32_t));
-  smax = std::string(reinterpret_cast<char*>(&int_max), sizeof(int32_t));
-  ASSERT_STREQ("1024", FormatStatValue(Type::INT32, smin).c_str());
-  ASSERT_STREQ("2048", FormatStatValue(Type::INT32, smax).c_str());
+  int32_t int_min_le = ::arrow::bit_util::ToLittleEndian(int_min);
+  int32_t int_max_le = ::arrow::bit_util::ToLittleEndian(int_max);
+  smin = std::string(reinterpret_cast<char*>(&int_min_le), sizeof(int32_t));


+    // Values are in native byte order after decoding
+    const std::vector<uint32_t> expected_output{0xAA774411U, 0xBB885522U, 0xCC996633U};
+    CheckDecode(span{data}, span{expected_output});


+    // Values are in native byte order after decoding
+    const std::vector<uint64_t> expected_output{0x7755CCAA331137DEULL,
+                                                0x8866DDBB442213C0ULL};
+    CheckDecode(span{data}, span{expected_output});


+#if ARROW_LITTLE_ENDIAN
    encoder.Init(encoding, max_level, num_levels, bytes.data() + sizeof(int32_t),
                 static_cast<int>(bytes.size()));
-
+#else
+    encoder.Init(encoding, max_level, num_levels, bytes.data() + sizeof(int32_t),


Vishwanatha-HD requested a review from wgtmac as a code owner November 21, 2025 12:11

github-actions Bot added Component: Parquet Component: C++ awaiting review Awaiting review labels Nov 21, 2025

This was referenced Nov 21, 2025

[C++][Parquet] Fix all the Testcase issues to enable Parquet DB support on s390x #48198

Open

[C++][Parquet] Enable Parquet DB support on Big Endian (IBM Z) systems #48151

Open

Vishwanatha-HD force-pushed the fixTestCaseIssues branch from f0370d0 to 21c045e Compare November 22, 2025 04:33

apacheGH-48198: [C++][Parquet] Fix all the testcase issues to enable …

7892880

…Parquet DB support on s390x

Vishwanatha-HD force-pushed the fixTestCaseIssues branch from 21c045e to 7892880 Compare November 29, 2025 13:28

Vishwanatha-HD added 3 commits March 24, 2026 12:04

Merge branch 'main' into fixTestCaseIssues

849e449

Remove unused endian.h include from test file

c80bc5b

Merge branch 'main' into fixTestCaseIssues

57cbfe6

kou changed the title ~~GH-48198: [C++][Parquet] Fix all the testcase issues to enable Parque…~~ GH-48198: [C++][Parquet] Fix all the testcase issues to enable Parquet DB support on s390x Jun 3, 2026

kou requested a review from Copilot June 3, 2026 21:26

Copilot started reviewing on behalf of kou June 3, 2026 21:26 View session

Copilot AI reviewed Jun 3, 2026

View reviewed changes

Merge branch 'apache:main' into fixTestCaseIssues

d1bee9e

Vishwanatha-HD requested a review from pitrou as a code owner June 15, 2026 13:34

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

GH-48198: [C++][Parquet] Fix all the testcase issues to enable Parquet DB support on s390x#48200

GH-48198: [C++][Parquet] Fix all the testcase issues to enable Parquet DB support on s390x#48200
Vishwanatha-HD wants to merge 5 commits into
apache:mainfrom
Vishwanatha-HD:fixTestCaseIssues

Vishwanatha-HD commented Nov 21, 2025 •

edited by kou

Loading

Uh oh!

github-actions Bot commented Nov 21, 2025

Uh oh!

k8ika0s commented Nov 23, 2025

Uh oh!

Vishwanatha-HD commented Nov 24, 2025

Uh oh!

Vishwanatha-HD commented Jun 3, 2026

Uh oh!

Vishwanatha-HD commented Jun 3, 2026

Uh oh!

kou commented Jun 3, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Conversation

Vishwanatha-HD commented Nov 21, 2025 • edited by kou Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

Uh oh!

github-actions Bot commented Nov 21, 2025

Uh oh!

k8ika0s commented Nov 23, 2025

Uh oh!

Vishwanatha-HD commented Nov 24, 2025

Uh oh!

Vishwanatha-HD commented Jun 3, 2026

Uh oh!

Vishwanatha-HD commented Jun 3, 2026

Uh oh!

kou commented Jun 3, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Vishwanatha-HD commented Nov 21, 2025 •

edited by kou

Loading