GH-48198: [C++][Parquet] Fix all the testcase issues to enable Parquet DB support on s390x#48200
GH-48198: [C++][Parquet] Fix all the testcase issues to enable Parquet DB support on s390x#48200Vishwanatha-HD wants to merge 5 commits into
Conversation
|
|
f0370d0 to
21c045e
Compare
|
Looking over this batch of tests, a few patterns in how the expected bytes are formed caught my eye. Nothing alarming — just places where BE ends up taking a slightly different journey than LE depending on where the normalization happens. • Half-floats: the assertions use the raw bits from the array, so BE naturally ends up comparing against host-order patterns unless those values get flipped before they reach the check. • INT96 and other scalar stats: some expected strings are built straight from native-order limbs, which means the two architectures diverge a bit unless they’re pushed through one consistent conversion point. • ByteStreamSplit: since the inputs to the encoder come in as native-order integers, the resulting split streams follow whatever the host layout is unless they’re normalized up front. • Page-index and stats checks: a few comparisons still assume host-order for floats/doubles, while others already expect swapped limbs. Individually these are all reasonable, but they do lead to small differences on BE because the tests anchor themselves at slightly different spots in the pipeline. Pulling the expectations toward one shared byte layout in a couple of places would likely smooth that out across hosts. |
@k8ika0s.. Thanks for your review comments.. But please note that we are already taking care of the conversion required in encoders and decoders.. Hence we need to take care of the corresponding testcases accordingly.. |
…Parquet DB support on s390x
21c045e to
7892880
Compare
|
@pitrou @kou @kiszk @zanmato1984 |
|
I verified that with my fix all the 113 testcase passes without any issues. |
|
Could you check CI failures? |
There was a problem hiding this comment.
Pull request overview
This PR updates C++ Parquet/Arrow test suites to behave correctly on big-endian (s390x) by ensuring test data that represents Parquet-encoded values is consistently expressed in little-endian form (or by avoiding endianness-dependent assumptions in test expectations). This supports the broader effort to enable Parquet DB support on big-endian platforms by eliminating endianness-related test failures/hangs.
Changes:
- Convert test-constructed “encoded” min/max (and similar) values to little-endian byte representations before comparing/formatting.
- Make INT96 and ByteStreamSplit-related tests endian-aware and/or endian-independent.
- Adjust a few endian-guarded test cases/macros for correctness and portability.
Reviewed changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| cpp/src/parquet/types_test.cc | Updates statistic-formatting and INT96 timestamp tests to avoid endianness-dependent constructions. |
| cpp/src/parquet/statistics_test.cc | Writes encoded statistics values in little-endian form and updates helper encoding for cross-endian correctness. |
| cpp/src/parquet/reader_test.cc | Tweaks ByteStreamSplit integration assertions for endian-specific behavior. |
| cpp/src/parquet/metadata_test.cc | Writes encoded statistics min/max as little-endian bytes when building metadata in tests. |
| cpp/src/parquet/level_conversion_test.cc | Fixes endian macro usage to avoid incorrectly enabling little-endian-only tests. |
| cpp/src/parquet/encoding_test.cc | Updates ByteStreamSplit encode/decode expected values to reflect native-order outputs after decoding / before encoding. |
| cpp/src/parquet/column_writer_test.cc | Adjusts RLE level encoding length prefix handling and endianness for the prefix. |
| cpp/src/arrow/util/byte_stream_split_test.cc | Updates the reference ByteStreamSplit encoder to account for big-endian byte ordering. |
| cpp/src/arrow/dataset/file_parquet_test.cc | Ensures Parquet statistics bytes are constructed in little-endian form in dataset tests. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| int32_t int_min = 1024; | ||
| int32_t int_max = 2048; | ||
| smin = std::string(reinterpret_cast<char*>(&int_min), sizeof(int32_t)); | ||
| smax = std::string(reinterpret_cast<char*>(&int_max), sizeof(int32_t)); | ||
| ASSERT_STREQ("1024", FormatStatValue(Type::INT32, smin).c_str()); | ||
| ASSERT_STREQ("2048", FormatStatValue(Type::INT32, smax).c_str()); | ||
| int32_t int_min_le = ::arrow::bit_util::ToLittleEndian(int_min); | ||
| int32_t int_max_le = ::arrow::bit_util::ToLittleEndian(int_max); | ||
| smin = std::string(reinterpret_cast<char*>(&int_min_le), sizeof(int32_t)); |
| // Values are in native byte order after decoding | ||
| const std::vector<uint32_t> expected_output{0xAA774411U, 0xBB885522U, 0xCC996633U}; | ||
| CheckDecode(span{data}, span{expected_output}); |
| // Values are in native byte order after decoding | ||
| const std::vector<uint64_t> expected_output{0x7755CCAA331137DEULL, | ||
| 0x8866DDBB442213C0ULL}; | ||
| CheckDecode(span{data}, span{expected_output}); |
| #if ARROW_LITTLE_ENDIAN | ||
| encoder.Init(encoding, max_level, num_levels, bytes.data() + sizeof(int32_t), | ||
| static_cast<int>(bytes.size())); | ||
|
|
||
| #else | ||
| encoder.Init(encoding, max_level, num_levels, bytes.data() + sizeof(int32_t), |
Rationale for this change
This PR is intended to enable Parquet DB support on Big-endian (s390x) systems. The fix in this PR fixes all the remaining testcase issues. The fix mainly involves the byte swapping in order to take care of endianness issues.
What changes are included in this PR?
The fix includes changes to following testcase files:
cpp/src/arrow/dataset/file_parquet_test.cc
cpp/src/arrow/util/byte_stream_split_test.cc
cpp/src/parquet/arrow/arrow_reader_writer_test.cc
cpp/src/parquet/column_writer_test.cc
cpp/src/parquet/encoding_test.cc
cpp/src/parquet/level_conversion_test.cc
cpp/src/parquet/metadata_test.cc
cpp/src/parquet/reader_test.cc
cpp/src/parquet/statistics_test.cc
cpp/src/parquet/types_test.cc
Are these changes tested?
Yes. The changes are tested on s390x arch to make sure things are working fine. The fix is also tested on x86 arch, to make sure there is no new regression introduced.
Are there any user-facing changes?
No
GitHub main Issue link: #48151