ARROW-62: Clarify null bitmap interpretation, indicate bit-endianness, add null count, remove non-nullable physical distinction by wesm · Pull Request #34 · apache/arrow
changed the title from "ARROW-62: Clarify interpretation of set bits in null bitmaps, indicate bit-endianness" to "ARROW-62: Clarify null bitmap interpretation, indicate bit-endianness, add null count, remove non-nullable physical distinction"
wesm deleted the ARROW-62 branch
wesm added a commit to wesm/arrow that referenced this pull request
Sep 8, 2018

Requires PARQUET-485 (apache#32)

The boolean Encoding::PLAIN code path was using RleDecoder, inconsistent with other implementations of Parquet. This patch adds an implementation of plain encoding and uses BitReader instead of RleDecoder to decode plain-encoded boolean data. Unit tests to verify. Also closes PR apache#12. Thanks to @edani for reporting.

Author: Wes McKinney <wes@cloudera.com>

Closes apache#34 from wesm/PARQUET-454 and squashes the following commits:

01cb5a7 [Wes McKinney] Use a seed in the data generation
0bf5d8a [Wes McKinney] Fix inconsistencies with boolean PLAIN encoding.

Change-Id: I1be5252c654d4864d14c3cdd70d63c507e0a9403
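To illustrate what the commit message above means by reading plain-encoded booleans with a bit reader rather than an RLE decoder, here is a minimal sketch. It assumes the Parquet PLAIN layout for booleans (one bit per value, packed least-significant bit first). `DecodePlainBooleans` is a hypothetical helper written for this example; it is not the parquet-cpp `BitReader` API.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not part of parquet-cpp): unpack `num_values` booleans
// from a PLAIN-encoded buffer, one bit per value, least-significant bit first.
std::vector<bool> DecodePlainBooleans(const std::vector<uint8_t>& data,
                                      std::size_t num_values) {
  // The buffer must hold at least ceil(num_values / 8) bytes.
  assert(data.size() * 8 >= num_values);
  std::vector<bool> out;
  out.reserve(num_values);
  for (std::size_t i = 0; i < num_values; ++i) {
    const uint8_t byte = data[i / 8];              // byte holding value i
    out.push_back(((byte >> (i % 8)) & 1) != 0);   // bit i % 8 within that byte
  }
  return out;
}
```

With this layout there are no run-length headers to interpret: value `i` is always bit `i % 8` of byte `i / 8`, which is why an RLE decoder would misread the same bytes.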
kou pushed a commit that referenced this pull request
May 10, 2020

This PR enables tests for `ARROW_COMPUTE`, `ARROW_DATASET`, `ARROW_FILESYSTEM`, `ARROW_HDFS`, `ARROW_ORC`, and `ARROW_IPC` (default on). #7131 enabled a minimal set of tests as a starting point.

I confirmed that these tests pass locally with the current master. In the current TravisCI environment, we cannot see this result due to a lot of error messages in `arrow-utility-test`.

```
$ git log | head -1
commit ed5f534
% ctest ...
      Start  1: arrow-array-test
 1/51 Test  #1: arrow-array-test ..................... Passed 4.62 sec
      Start  2: arrow-buffer-test
 2/51 Test  #2: arrow-buffer-test .................... Passed 0.14 sec
      Start  3: arrow-extension-type-test
 3/51 Test  #3: arrow-extension-type-test ............ Passed 0.12 sec
      Start  4: arrow-misc-test
 4/51 Test  #4: arrow-misc-test ...................... Passed 0.14 sec
      Start  5: arrow-public-api-test
 5/51 Test  #5: arrow-public-api-test ................ Passed 0.12 sec
      Start  6: arrow-scalar-test
 6/51 Test  #6: arrow-scalar-test .................... Passed 0.13 sec
      Start  7: arrow-type-test
 7/51 Test  #7: arrow-type-test ...................... Passed 0.14 sec
      Start  8: arrow-table-test
 8/51 Test  #8: arrow-table-test ..................... Passed 0.13 sec
      Start  9: arrow-tensor-test
 9/51 Test  #9: arrow-tensor-test .................... Passed 0.13 sec
      Start 10: arrow-sparse-tensor-test
10/51 Test #10: arrow-sparse-tensor-test ............. Passed 0.16 sec
      Start 11: arrow-stl-test
11/51 Test #11: arrow-stl-test ....................... Passed 0.12 sec
      Start 12: arrow-concatenate-test
12/51 Test #12: arrow-concatenate-test ............... Passed 0.53 sec
      Start 13: arrow-diff-test
13/51 Test #13: arrow-diff-test ...................... Passed 1.45 sec
      Start 14: arrow-c-bridge-test
14/51 Test #14: arrow-c-bridge-test .................. Passed 0.18 sec
      Start 15: arrow-io-buffered-test
15/51 Test #15: arrow-io-buffered-test ............... Passed 0.20 sec
      Start 16: arrow-io-compressed-test
16/51 Test #16: arrow-io-compressed-test ............. Passed 3.48 sec
      Start 17: arrow-io-file-test
17/51 Test #17: arrow-io-file-test ................... Passed 0.74 sec
      Start 18: arrow-io-hdfs-test
18/51 Test #18: arrow-io-hdfs-test ................... Passed 0.12 sec
      Start 19: arrow-io-memory-test
19/51 Test #19: arrow-io-memory-test ................. Passed 2.77 sec
      Start 20: arrow-utility-test
20/51 Test #20: arrow-utility-test ...................***Failed 5.65 sec
      Start 21: arrow-threading-utility-test
21/51 Test #21: arrow-threading-utility-test ......... Passed 1.34 sec
      Start 22: arrow-compute-compute-test
22/51 Test #22: arrow-compute-compute-test ........... Passed 0.13 sec
      Start 23: arrow-compute-boolean-test
23/51 Test #23: arrow-compute-boolean-test ........... Passed 0.15 sec
      Start 24: arrow-compute-cast-test
24/51 Test #24: arrow-compute-cast-test .............. Passed 0.22 sec
      Start 25: arrow-compute-hash-test
25/51 Test #25: arrow-compute-hash-test .............. Passed 2.61 sec
      Start 26: arrow-compute-isin-test
26/51 Test #26: arrow-compute-isin-test .............. Passed 0.81 sec
      Start 27: arrow-compute-match-test
27/51 Test #27: arrow-compute-match-test ............. Passed 0.40 sec
      Start 28: arrow-compute-sort-to-indices-test
28/51 Test #28: arrow-compute-sort-to-indices-test ... Passed 3.33 sec
      Start 29: arrow-compute-nth-to-indices-test
29/51 Test #29: arrow-compute-nth-to-indices-test .... Passed 1.51 sec
      Start 30: arrow-compute-util-internal-test
30/51 Test #30: arrow-compute-util-internal-test ..... Passed 0.13 sec
      Start 31: arrow-compute-add-test
31/51 Test #31: arrow-compute-add-test ............... Passed 0.12 sec
      Start 32: arrow-compute-aggregate-test
32/51 Test #32: arrow-compute-aggregate-test ......... Passed 14.70 sec
      Start 33: arrow-compute-compare-test
33/51 Test #33: arrow-compute-compare-test ........... Passed 7.96 sec
      Start 34: arrow-compute-take-test
34/51 Test #34: arrow-compute-take-test .............. Passed 4.80 sec
      Start 35: arrow-compute-filter-test
35/51 Test #35: arrow-compute-filter-test ............ Passed 8.23 sec
      Start 36: arrow-dataset-dataset-test
36/51 Test #36: arrow-dataset-dataset-test ........... Passed 0.25 sec
      Start 37: arrow-dataset-discovery-test
37/51 Test #37: arrow-dataset-discovery-test ......... Passed 0.13 sec
      Start 38: arrow-dataset-file-ipc-test
38/51 Test #38: arrow-dataset-file-ipc-test .......... Passed 0.21 sec
      Start 39: arrow-dataset-file-test
39/51 Test #39: arrow-dataset-file-test .............. Passed 0.12 sec
      Start 40: arrow-dataset-filter-test
40/51 Test #40: arrow-dataset-filter-test ............ Passed 0.16 sec
      Start 41: arrow-dataset-partition-test
41/51 Test #41: arrow-dataset-partition-test ......... Passed 0.13 sec
      Start 42: arrow-dataset-scanner-test
42/51 Test #42: arrow-dataset-scanner-test ........... Passed 0.20 sec
      Start 43: arrow-filesystem-test
43/51 Test #43: arrow-filesystem-test ................ Passed 1.62 sec
      Start 44: arrow-hdfs-test
44/51 Test #44: arrow-hdfs-test ...................... Passed 0.13 sec
      Start 45: arrow-feather-test
45/51 Test #45: arrow-feather-test ................... Passed 0.91 sec
      Start 46: arrow-ipc-read-write-test
46/51 Test #46: arrow-ipc-read-write-test ............ Passed 5.77 sec
      Start 47: arrow-ipc-json-simple-test
47/51 Test #47: arrow-ipc-json-simple-test ........... Passed 0.16 sec
      Start 48: arrow-ipc-json-test
48/51 Test #48: arrow-ipc-json-test .................. Passed 0.27 sec
      Start 49: arrow-json-integration-test
49/51 Test #49: arrow-json-integration-test .......... Passed 0.13 sec
      Start 50: arrow-json-test
50/51 Test #50: arrow-json-test ...................... Passed 0.26 sec
      Start 51: arrow-orc-adapter-test
51/51 Test #51: arrow-orc-adapter-test ............... Passed 1.92 sec

98% tests passed, 1 tests failed out of 51

Label Time Summary:
arrow-tests   = 27.38 sec (27 tests)
arrow_compute = 45.11 sec (14 tests)
arrow_dataset = 1.21 sec (7 tests)
arrow_ipc     = 6.20 sec (3 tests)
unittest      = 79.91 sec (51 tests)

Total Test time (real) = 79.99 sec

The following tests FAILED:
	 20 - arrow-utility-test (Failed)
Errors while running CTest
```

Closes #7142 from kiszk/ARROW-8754

Authored-by: Kazuaki Ishizaki <ishizaki@jp.ibm.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>
zhztheplayer added a commit to zhztheplayer/arrow-1 that referenced this pull request
Oct 8, 2021
zhztheplayer added a commit to zhztheplayer/arrow-1 that referenced this pull request
Feb 8, 2022
zhztheplayer added a commit to zhztheplayer/arrow-1 that referenced this pull request
Mar 3, 2022
rui-mo pushed a commit to rui-mo/arrow-1 that referenced this pull request
Mar 23, 2022
zhouyuan pushed a commit to zhouyuan/arrow that referenced this pull request
Apr 26, 2022
rui-mo pushed a commit to rui-mo/arrow-1 that referenced this pull request
May 27, 2022
cbb330 added a commit to cbb330/arrow that referenced this pull request
Feb 20, 2026

Implemented temporal type (DATE32, TIMESTAMP) predicate pushdown for ORC files.

**Adapter Changes (adapter.cc - 48 lines):**
- Added liborc::DATE case handler
  - Extracts min/max from IntegerColumnStatistics (days since epoch)
  - Returns Date32Scalar for Arrow integration
  - Handles int64 to int32 conversion for days
- Added liborc::TIMESTAMP and TIMESTAMP_INSTANT case handlers
  - Extracts min/max from IntegerColumnStatistics (nanoseconds since epoch)
  - Returns TimestampScalar with NANO unit
  - Supports both TIMESTAMP (local) and TIMESTAMP_INSTANT (UTC) variants
- Comprehensive documentation for temporal type handling
  - DATE: days since Unix epoch (1970-01-01)
  - TIMESTAMP: nanoseconds since Unix epoch
  - TIMESTAMP_INSTANT: UTC-normalized timestamps
  - Timezone handling notes referencing util.cc

**Tests (file_orc_test.cc - 188 lines):**
- Added TemporalPredicatePushdown test with 5 comprehensive scenarios
  - Test 1: DATE32 basic filtering (>= predicate, stripe skipping)
  - Test 2: DATE32 range filtering (compound AND predicate)
  - Test 3: TIMESTAMP basic filtering (>= with nanosecond precision)
  - Test 4: TIMESTAMP less-than filtering (< predicate verification)
  - Test 5: TIMESTAMP equality (single value case with min=max)
- Uses realistic date/timestamp values (2020-2022 date range)
- Verifies stripe-level filtering with row count assertions

**Technical Implementation:**
- DATE: ORC stores as int64 days, Arrow uses int32 Date32Scalar
- TIMESTAMP: ORC stores as int64 nanos, Arrow uses TimestampScalar(NANO)
- No changes needed in file_orc.cc
  - DeriveFieldGuarantee already handles temporal types generically via Scalar comparison
- Follows existing pattern from INT/LONG/FLOAT/DOUBLE/STRING/BINARY

**Code Stats:**
- 3 files would be modified (adapter.cc, file_orc.cc, file_orc_test.cc)
- 235 lines added, 1 line deleted
  - 48 lines in adapter (DATE/TIMESTAMP cases + documentation)
  - 188 lines of tests (5 comprehensive test scenarios)
  - 0 lines in file_orc.cc (no changes needed)

Note: Build verification skipped due to CMake configuration issues in this environment. The implementation follows established patterns and should compile successfully.

Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
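The stripe-skipping idea in the commit message above can be sketched as follows. This is an illustration under stated assumptions, not the code in adapter.cc: `StripeDateStats`, `ToDate32`, and `CanSkipForGreaterEqual` are hypothetical names, and the choice to keep the stripe when a day count does not fit in 32 bits is an assumption of this sketch (the commit message only says the conversion is "handled"). It requires C++17 for `std::optional`.

```cpp
#include <cstdint>
#include <limits>
#include <optional>

// Hypothetical stand-in for per-stripe DATE statistics: days since
// 1970-01-01, held as 64-bit integers as the commit message describes.
struct StripeDateStats {
  int64_t min_days;
  int64_t max_days;
};

// Narrow an int64 day count to the int32 range used by a Date32 value.
// Returns std::nullopt when the value does not fit (assumed behavior), so
// the caller can read the stripe instead of trusting a truncated bound.
std::optional<int32_t> ToDate32(int64_t days) {
  if (days < std::numeric_limits<int32_t>::min() ||
      days > std::numeric_limits<int32_t>::max()) {
    return std::nullopt;
  }
  return static_cast<int32_t>(days);
}

// True when no row in the stripe can satisfy `column >= threshold`,
// that is, when the stripe's maximum lies below the threshold.
bool CanSkipForGreaterEqual(const StripeDateStats& stats, int32_t threshold) {
  const std::optional<int32_t> max_days = ToDate32(stats.max_days);
  if (!max_days.has_value()) {
    return false;  // unusable statistics: never skip
  }
  return *max_days < threshold;
}
```

For a `>=` predicate only the stripe maximum matters; a `<` predicate would compare the stripe minimum against the threshold in the same way.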
cbb330 added a commit to cbb330/arrow that referenced this pull request
Feb 20, 2026
cbb330 added a commit to cbb330/arrow that referenced this pull request
Feb 24, 2026
cbb330 added a commit to cbb330/arrow that referenced this pull request
Feb 24, 2026