Skip to content

GH-48198: [C++][Parquet] Fix all the testcase issues to enable Parquet DB support on s390x#48200

Open
Vishwanatha-HD wants to merge 5 commits into
apache:mainfrom
Vishwanatha-HD:fixTestCaseIssues
Open

GH-48198: [C++][Parquet] Fix all the testcase issues to enable Parquet DB support on s390x#48200
Vishwanatha-HD wants to merge 5 commits into
apache:mainfrom
Vishwanatha-HD:fixTestCaseIssues

Conversation

@Vishwanatha-HD

@Vishwanatha-HD Vishwanatha-HD commented Nov 21, 2025

Copy link
Copy Markdown
Contributor

Rationale for this change

This PR is intended to enable Parquet DB support on Big-endian (s390x) systems. The fix in this PR fixes all the remaining testcase issues. The fix mainly involves the byte swapping in order to take care of endianness issues.

What changes are included in this PR?

The fix includes changes to following testcase files:
cpp/src/arrow/dataset/file_parquet_test.cc
cpp/src/arrow/util/byte_stream_split_test.cc
cpp/src/parquet/arrow/arrow_reader_writer_test.cc
cpp/src/parquet/column_writer_test.cc
cpp/src/parquet/encoding_test.cc
cpp/src/parquet/level_conversion_test.cc
cpp/src/parquet/metadata_test.cc
cpp/src/parquet/reader_test.cc
cpp/src/parquet/statistics_test.cc
cpp/src/parquet/types_test.cc

Are these changes tested?

Yes. The changes are tested on s390x arch to make sure things are working fine. The fix is also tested on x86 arch, to make sure there is no new regression introduced.

Are there any user-facing changes?

No

GitHub main Issue link: #48151

@github-actions

Copy link
Copy Markdown

⚠️ GitHub issue #48198 has been automatically assigned in GitHub to PR creator.

@k8ika0s

k8ika0s commented Nov 23, 2025

Copy link
Copy Markdown

@Vishwanatha-HD

Looking over this batch of tests, a few patterns in how the expected bytes are formed caught my eye. Nothing alarming — just places where BE ends up taking a slightly different journey than LE depending on where the normalization happens.

• Half-floats: the assertions use the raw bits from the array, so BE naturally ends up comparing against host-order patterns unless those values get flipped before they reach the check.

• INT96 and other scalar stats: some expected strings are built straight from native-order limbs, which means the two architectures diverge a bit unless they’re pushed through one consistent conversion point.

• ByteStreamSplit: since the inputs to the encoder come in as native-order integers, the resulting split streams follow whatever the host layout is unless they’re normalized up front.

• Page-index and stats checks: a few comparisons still assume host-order for floats/doubles, while others already expect swapped limbs.

Individually these are all reasonable, but they do lead to small differences on BE because the tests anchor themselves at slightly different spots in the pipeline. Pulling the expectations toward one shared byte layout in a couple of places would likely smooth that out across hosts.

@Vishwanatha-HD

Copy link
Copy Markdown
Contributor Author

@Vishwanatha-HD

Looking over this batch of tests, a few patterns in how the expected bytes are formed caught my eye. Nothing alarming — just places where BE ends up taking a slightly different journey than LE depending on where the normalization happens.

• Half-floats: the assertions use the raw bits from the array, so BE naturally ends up comparing against host-order patterns unless those values get flipped before they reach the check.

• INT96 and other scalar stats: some expected strings are built straight from native-order limbs, which means the two architectures diverge a bit unless they’re pushed through one consistent conversion point.

• ByteStreamSplit: since the inputs to the encoder come in as native-order integers, the resulting split streams follow whatever the host layout is unless they’re normalized up front.

• Page-index and stats checks: a few comparisons still assume host-order for floats/doubles, while others already expect swapped limbs.

Individually these are all reasonable, but they do lead to small differences on BE because the tests anchor themselves at slightly different spots in the pipeline. Pulling the expectations toward one shared byte layout in a couple of places would likely smooth that out across hosts.

@k8ika0s.. Thanks for your review comments.. But please note that we are already taking care of the conversion required in encoders and decoders.. Hence we need to take care of the corresponding testcases accordingly..
I have tested with my changes on the s390x systems and also on Openshift AI workloads.. It works properly.. Hence there is no concern with these changes..

@Vishwanatha-HD

Copy link
Copy Markdown
Contributor Author

@pitrou @kou @kiszk @zanmato1984
Hi All,
I know its been long time since I have my PRs opened.. Can you please help me with review and merging to the upstream. Is there anything more that you people want me to do it, I would be happy to work on it. Please suggest. Thanks.

@Vishwanatha-HD

Copy link
Copy Markdown
Contributor Author

I verified that with my fix all the 113 testcase passes without any issues.

@kou kou changed the title GH-48198: [C++][Parquet] Fix all the testcase issues to enable Parque… GH-48198: [C++][Parquet] Fix all the testcase issues to enable Parquet DB support on s390x Jun 3, 2026
@kou kou requested a review from Copilot June 3, 2026 21:26
@kou

kou commented Jun 3, 2026

Copy link
Copy Markdown
Member

Could you check CI failures?

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR updates C++ Parquet/Arrow test suites to behave correctly on big-endian (s390x) by ensuring test data that represents Parquet-encoded values is consistently expressed in little-endian form (or by avoiding endianness-dependent assumptions in test expectations). This supports the broader effort to enable Parquet DB support on big-endian platforms by eliminating endianness-related test failures/hangs.

Changes:

  • Convert test-constructed “encoded” min/max (and similar) values to little-endian byte representations before comparing/formatting.
  • Make INT96 and ByteStreamSplit-related tests endian-aware and/or endian-independent.
  • Adjust a few endian-guarded test cases/macros for correctness and portability.

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
cpp/src/parquet/types_test.cc Updates statistic-formatting and INT96 timestamp tests to avoid endianness-dependent constructions.
cpp/src/parquet/statistics_test.cc Writes encoded statistics values in little-endian form and updates helper encoding for cross-endian correctness.
cpp/src/parquet/reader_test.cc Tweaks ByteStreamSplit integration assertions for endian-specific behavior.
cpp/src/parquet/metadata_test.cc Writes encoded statistics min/max as little-endian bytes when building metadata in tests.
cpp/src/parquet/level_conversion_test.cc Fixes endian macro usage to avoid incorrectly enabling little-endian-only tests.
cpp/src/parquet/encoding_test.cc Updates ByteStreamSplit encode/decode expected values to reflect native-order outputs after decoding / before encoding.
cpp/src/parquet/column_writer_test.cc Adjusts RLE level encoding length prefix handling and endianness for the prefix.
cpp/src/arrow/util/byte_stream_split_test.cc Updates the reference ByteStreamSplit encoder to account for big-endian byte ordering.
cpp/src/arrow/dataset/file_parquet_test.cc Ensures Parquet statistics bytes are constructed in little-endian form in dataset tests.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines 78 to +82
int32_t int_min = 1024;
int32_t int_max = 2048;
smin = std::string(reinterpret_cast<char*>(&int_min), sizeof(int32_t));
smax = std::string(reinterpret_cast<char*>(&int_max), sizeof(int32_t));
ASSERT_STREQ("1024", FormatStatValue(Type::INT32, smin).c_str());
ASSERT_STREQ("2048", FormatStatValue(Type::INT32, smax).c_str());
int32_t int_min_le = ::arrow::bit_util::ToLittleEndian(int_min);
int32_t int_max_le = ::arrow::bit_util::ToLittleEndian(int_max);
smin = std::string(reinterpret_cast<char*>(&int_min_le), sizeof(int32_t));
Comment on lines +1611 to +1613
// Values are in native byte order after decoding
const std::vector<uint32_t> expected_output{0xAA774411U, 0xBB885522U, 0xCC996633U};
CheckDecode(span{data}, span{expected_output});
Comment on lines +1618 to +1621
// Values are in native byte order after decoding
const std::vector<uint64_t> expected_output{0x7755CCAA331137DEULL,
0x8866DDBB442213C0ULL};
CheckDecode(span{data}, span{expected_output});
Comment on lines +1277 to +1281
#if ARROW_LITTLE_ENDIAN
encoder.Init(encoding, max_level, num_levels, bytes.data() + sizeof(int32_t),
static_cast<int>(bytes.size()));

#else
encoder.Init(encoding, max_level, num_levels, bytes.data() + sizeof(int32_t),
@Vishwanatha-HD Vishwanatha-HD requested a review from pitrou as a code owner June 15, 2026 13:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants