ARROW-62: Clarify null bitmap interpretation, indicate bit-endianness, add null count, remove non-nullable physical distinction by wesm · Pull Request #34 · apache/arrow

@wesm changed the title from "ARROW-62: Clarify interpretation of set bits in null bitmaps, indicate bit-endianness" to "ARROW-62: Clarify null bitmap interpretation, indicate bit-endianness, add null count, remove non-nullable physical distinction"

Mar 23, 2016

@wesm deleted the ARROW-62 branch

March 25, 2016 02:20

wesm added a commit to wesm/arrow that referenced this pull request

Sep 8, 2018
Requires PARQUET-485 (apache#32)

The boolean Encoding::PLAIN code path was using RleDecoder, inconsistent with
other implementations of Parquet. This patch adds an implementation of plain
encoding and uses BitReader instead of RleDecoder to decode plain-encoded
boolean data. Unit tests are included to verify the change.
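
To make the distinction concrete: Parquet PLAIN encoding for booleans stores one bit per value, packed least-significant bit first, with no run-length headers. The sketch below is only an illustration of that layout. `DecodePlainBooleans` is a hypothetical helper, not the BitReader-based code from this patch.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not the patch's BitReader): unpack `num_values`
// booleans from a PLAIN-encoded buffer. Value i lives in bit (i % 8) of
// byte (i / 8), counting from the least-significant bit.
std::vector<bool> DecodePlainBooleans(const uint8_t* data, size_t num_bytes,
                                      size_t num_values) {
  // The buffer must hold at least one bit per requested value.
  assert(num_values <= num_bytes * 8);
  std::vector<bool> out;
  out.reserve(num_values);
  for (size_t i = 0; i < num_values; ++i) {
    out.push_back(((data[i / 8] >> (i % 8)) & 1) != 0);
  }
  return out;
}
```

An RLE decoder would instead expect run headers in the stream, which is why feeding it plain bit-packed data gives wrong results.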

Also closes PR apache#12. Thanks to @edani for reporting.

Author: Wes McKinney <wes@cloudera.com>

Closes apache#34 from wesm/PARQUET-454 and squashes the following commits:

01cb5a7 [Wes McKinney] Use a seed in the data generation
0bf5d8a [Wes McKinney] Fix inconsistencies with boolean PLAIN encoding.

Change-Id: I1be5252c654d4864d14c3cdd70d63c507e0a9403

kou pushed a commit that referenced this pull request

May 10, 2020
This PR enables tests for `ARROW_COMPUTE`, `ARROW_DATASET`, `ARROW_FILESYSTEM`, `ARROW_HDFS`, `ARROW_ORC`, and `ARROW_IPC` (default on). #7131 enabled a minimal set of tests as a starting point.

I confirmed that these tests pass locally with the current master. In the current TravisCI environment, we cannot see this result due to a lot of error messages in `arrow-utility-test`.

```
$ git log | head -1
commit ed5f534
% ctest
...
      Start  1: arrow-array-test
 1/51 Test  #1: arrow-array-test .....................   Passed    4.62 sec
      Start  2: arrow-buffer-test
 2/51 Test  #2: arrow-buffer-test ....................   Passed    0.14 sec
      Start  3: arrow-extension-type-test
 3/51 Test  #3: arrow-extension-type-test ............   Passed    0.12 sec
      Start  4: arrow-misc-test
 4/51 Test  #4: arrow-misc-test ......................   Passed    0.14 sec
      Start  5: arrow-public-api-test
 5/51 Test  #5: arrow-public-api-test ................   Passed    0.12 sec
      Start  6: arrow-scalar-test
 6/51 Test  #6: arrow-scalar-test ....................   Passed    0.13 sec
      Start  7: arrow-type-test
 7/51 Test  #7: arrow-type-test ......................   Passed    0.14 sec
      Start  8: arrow-table-test
 8/51 Test  #8: arrow-table-test .....................   Passed    0.13 sec
      Start  9: arrow-tensor-test
 9/51 Test  #9: arrow-tensor-test ....................   Passed    0.13 sec
      Start 10: arrow-sparse-tensor-test
10/51 Test #10: arrow-sparse-tensor-test .............   Passed    0.16 sec
      Start 11: arrow-stl-test
11/51 Test #11: arrow-stl-test .......................   Passed    0.12 sec
      Start 12: arrow-concatenate-test
12/51 Test #12: arrow-concatenate-test ...............   Passed    0.53 sec
      Start 13: arrow-diff-test
13/51 Test #13: arrow-diff-test ......................   Passed    1.45 sec
      Start 14: arrow-c-bridge-test
14/51 Test #14: arrow-c-bridge-test ..................   Passed    0.18 sec
      Start 15: arrow-io-buffered-test
15/51 Test #15: arrow-io-buffered-test ...............   Passed    0.20 sec
      Start 16: arrow-io-compressed-test
16/51 Test #16: arrow-io-compressed-test .............   Passed    3.48 sec
      Start 17: arrow-io-file-test
17/51 Test #17: arrow-io-file-test ...................   Passed    0.74 sec
      Start 18: arrow-io-hdfs-test
18/51 Test #18: arrow-io-hdfs-test ...................   Passed    0.12 sec
      Start 19: arrow-io-memory-test
19/51 Test #19: arrow-io-memory-test .................   Passed    2.77 sec
      Start 20: arrow-utility-test
20/51 Test #20: arrow-utility-test ...................***Failed    5.65 sec
      Start 21: arrow-threading-utility-test
21/51 Test #21: arrow-threading-utility-test .........   Passed    1.34 sec
      Start 22: arrow-compute-compute-test
22/51 Test #22: arrow-compute-compute-test ...........   Passed    0.13 sec
      Start 23: arrow-compute-boolean-test
23/51 Test #23: arrow-compute-boolean-test ...........   Passed    0.15 sec
      Start 24: arrow-compute-cast-test
24/51 Test #24: arrow-compute-cast-test ..............   Passed    0.22 sec
      Start 25: arrow-compute-hash-test
25/51 Test #25: arrow-compute-hash-test ..............   Passed    2.61 sec
      Start 26: arrow-compute-isin-test
26/51 Test #26: arrow-compute-isin-test ..............   Passed    0.81 sec
      Start 27: arrow-compute-match-test
27/51 Test #27: arrow-compute-match-test .............   Passed    0.40 sec
      Start 28: arrow-compute-sort-to-indices-test
28/51 Test #28: arrow-compute-sort-to-indices-test ...   Passed    3.33 sec
      Start 29: arrow-compute-nth-to-indices-test
29/51 Test #29: arrow-compute-nth-to-indices-test ....   Passed    1.51 sec
      Start 30: arrow-compute-util-internal-test
30/51 Test #30: arrow-compute-util-internal-test .....   Passed    0.13 sec
      Start 31: arrow-compute-add-test
31/51 Test #31: arrow-compute-add-test ...............   Passed    0.12 sec
      Start 32: arrow-compute-aggregate-test
32/51 Test #32: arrow-compute-aggregate-test .........   Passed   14.70 sec
      Start 33: arrow-compute-compare-test
33/51 Test #33: arrow-compute-compare-test ...........   Passed    7.96 sec
      Start 34: arrow-compute-take-test
34/51 Test #34: arrow-compute-take-test ..............   Passed    4.80 sec
      Start 35: arrow-compute-filter-test
35/51 Test #35: arrow-compute-filter-test ............   Passed    8.23 sec
      Start 36: arrow-dataset-dataset-test
36/51 Test #36: arrow-dataset-dataset-test ...........   Passed    0.25 sec
      Start 37: arrow-dataset-discovery-test
37/51 Test #37: arrow-dataset-discovery-test .........   Passed    0.13 sec
      Start 38: arrow-dataset-file-ipc-test
38/51 Test #38: arrow-dataset-file-ipc-test ..........   Passed    0.21 sec
      Start 39: arrow-dataset-file-test
39/51 Test #39: arrow-dataset-file-test ..............   Passed    0.12 sec
      Start 40: arrow-dataset-filter-test
40/51 Test #40: arrow-dataset-filter-test ............   Passed    0.16 sec
      Start 41: arrow-dataset-partition-test
41/51 Test #41: arrow-dataset-partition-test .........   Passed    0.13 sec
      Start 42: arrow-dataset-scanner-test
42/51 Test #42: arrow-dataset-scanner-test ...........   Passed    0.20 sec
      Start 43: arrow-filesystem-test
43/51 Test #43: arrow-filesystem-test ................   Passed    1.62 sec
      Start 44: arrow-hdfs-test
44/51 Test #44: arrow-hdfs-test ......................   Passed    0.13 sec
      Start 45: arrow-feather-test
45/51 Test #45: arrow-feather-test ...................   Passed    0.91 sec
      Start 46: arrow-ipc-read-write-test
46/51 Test #46: arrow-ipc-read-write-test ............   Passed    5.77 sec
      Start 47: arrow-ipc-json-simple-test
47/51 Test #47: arrow-ipc-json-simple-test ...........   Passed    0.16 sec
      Start 48: arrow-ipc-json-test
48/51 Test #48: arrow-ipc-json-test ..................   Passed    0.27 sec
      Start 49: arrow-json-integration-test
49/51 Test #49: arrow-json-integration-test ..........   Passed    0.13 sec
      Start 50: arrow-json-test
50/51 Test #50: arrow-json-test ......................   Passed    0.26 sec
      Start 51: arrow-orc-adapter-test
51/51 Test #51: arrow-orc-adapter-test ...............   Passed    1.92 sec

98% tests passed, 1 tests failed out of 51

Label Time Summary:
arrow-tests      =  27.38 sec (27 tests)
arrow_compute    =  45.11 sec (14 tests)
arrow_dataset    =   1.21 sec (7 tests)
arrow_ipc        =   6.20 sec (3 tests)
unittest         =  79.91 sec (51 tests)

Total Test time (real) =  79.99 sec

The following tests FAILED:
	 20 - arrow-utility-test (Failed)
Errors while running CTest
```

Closes #7142 from kiszk/ARROW-8754

Authored-by: Kazuaki Ishizaki <ishizaki@jp.ibm.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>

zhztheplayer added a commit to zhztheplayer/arrow-1 that referenced this pull request

Oct 8, 2021
…he same time (apache#34)

* Revert "Add AutoBufferLedger (apache#31)"

This reverts commit e48da37.

* Commit 1

* Commit 2

* Fix config builder visibility in Scala

* Commit 2 Fixup

* Commit 3

* Commit 3 Fixup

zhztheplayer added a commit to zhztheplayer/arrow-1 that referenced this pull request

Feb 8, 2022
…he same time (apache#34)

* Revert "Add AutoBufferLedger (apache#31)"

This reverts commit e48da37.

* Commit 1

* Commit 2

* Fix config builder visibility in Scala

* Commit 2 Fixup

* Commit 3

* Commit 3 Fixup

zhztheplayer added a commit to zhztheplayer/arrow-1 that referenced this pull request

Mar 3, 2022
…he same time (apache#34)

* Revert "Add AutoBufferLedger (apache#31)"

This reverts commit e48da37.

* Commit 1

* Commit 2

* Fix config builder visibility in Scala

* Commit 2 Fixup

* Commit 3

* Commit 3 Fixup

rui-mo pushed a commit to rui-mo/arrow-1 that referenced this pull request

Mar 23, 2022
…he same time (apache#34)

* Revert "Add AutoBufferLedger (apache#31)"

This reverts commit e48da37.

* Commit 1

* Commit 2

* Fix config builder visibility in Scala

* Commit 2 Fixup

* Commit 3

* Commit 3 Fixup

zhouyuan pushed a commit to zhouyuan/arrow that referenced this pull request

Apr 26, 2022
…he same time (apache#34)

* Revert "Add AutoBufferLedger (apache#31)"

This reverts commit e48da37.

* Commit 1

* Commit 2

* Fix config builder visibility in Scala

* Commit 2 Fixup

* Commit 3

* Commit 3 Fixup

rui-mo pushed a commit to rui-mo/arrow-1 that referenced this pull request

May 27, 2022
…he same time (apache#34)

* Revert "Add AutoBufferLedger (apache#31)"

This reverts commit e48da37.

* Commit 1

* Commit 2

* Fix config builder visibility in Scala

* Commit 2 Fixup

* Commit 3

* Commit 3 Fixup

pribor pushed a commit to GlobalWebIndex/arrow that referenced this pull request

Oct 24, 2025

cbb330 added a commit to cbb330/arrow that referenced this pull request

Feb 20, 2026

cbb330 added a commit to cbb330/arrow that referenced this pull request

Feb 20, 2026
)

Implemented temporal type (DATE32, TIMESTAMP) predicate pushdown for ORC files.

**Adapter Changes (adapter.cc - 48 lines):**
- Added liborc::DATE case handler
  - Extracts min/max from IntegerColumnStatistics (days since epoch)
  - Returns Date32Scalar for Arrow integration
  - Handles int64 to int32 conversion for days
- Added liborc::TIMESTAMP and TIMESTAMP_INSTANT case handlers
  - Extracts min/max from IntegerColumnStatistics (nanoseconds since epoch)
  - Returns TimestampScalar with NANO unit
  - Supports both TIMESTAMP (local) and TIMESTAMP_INSTANT (UTC) variants
- Comprehensive documentation for temporal type handling
  - DATE: days since Unix epoch (1970-01-01)
  - TIMESTAMP: nanoseconds since Unix epoch
  - TIMESTAMP_INSTANT: UTC-normalized timestamps
  - Timezone handling notes referencing util.cc

**Tests (file_orc_test.cc - 188 lines):**
- Added TemporalPredicatePushdown test with 5 comprehensive scenarios
- Test 1: DATE32 basic filtering (>= predicate, stripe skipping)
- Test 2: DATE32 range filtering (compound AND predicate)
- Test 3: TIMESTAMP basic filtering (>= with nanosecond precision)
- Test 4: TIMESTAMP less-than filtering (< predicate verification)
- Test 5: TIMESTAMP equality (single value case with min=max)
- Uses realistic date/timestamp values (2020-2022 date range)
- Verifies stripe-level filtering with row count assertions

**Technical Implementation:**
- DATE: ORC stores as int64 days, Arrow uses int32 Date32Scalar
- TIMESTAMP: ORC stores as int64 nanos, Arrow uses TimestampScalar(NANO)
- No changes needed in file_orc.cc - DeriveFieldGuarantee already handles
  temporal types generically via Scalar comparison
- Follows existing pattern from INT/LONG/FLOAT/DOUBLE/STRING/BINARY
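
As a rough illustration of the idea described above (not the code in adapter.cc), the sketch below narrows int64 day-count statistics to an int32 range and uses the range to decide whether a stripe can be skipped for a `>=` predicate. `Date32Range`, `ToDate32Range`, and `CanSkipForGreaterEqual` are hypothetical names introduced only for this example, and the real implementation works through Arrow scalars rather than raw integers.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>
#include <optional>

// Hypothetical min/max statistics for one stripe, in days since 1970-01-01.
struct Date32Range {
  int32_t min_days;
  int32_t max_days;
};

// Narrow int64 day counts to the int32 range used by Date32.
// Returns nullopt when a bound does not fit, so the caller can fall back
// to reading the stripe instead of trusting a truncated statistic.
std::optional<Date32Range> ToDate32Range(int64_t min_days, int64_t max_days) {
  assert(min_days <= max_days);
  constexpr int64_t kLo = std::numeric_limits<int32_t>::min();
  constexpr int64_t kHi = std::numeric_limits<int32_t>::max();
  if (min_days < kLo || max_days > kHi) return std::nullopt;
  return Date32Range{static_cast<int32_t>(min_days),
                     static_cast<int32_t>(max_days)};
}

// A stripe can be skipped for `col >= threshold` only when even its
// largest value is below the threshold.
bool CanSkipForGreaterEqual(const Date32Range& range, int32_t threshold) {
  return range.max_days < threshold;
}
```

Returning `nullopt` on overflow keeps the check conservative: a stripe whose statistics cannot be represented is read rather than skipped.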

**Code Stats:**
- 3 files would be modified (adapter.cc, file_orc.cc, file_orc_test.cc)
- 235 lines added, 1 line deleted
- 48 lines in adapter (DATE/TIMESTAMP cases + documentation)
- 188 lines of tests (5 comprehensive test scenarios)
- 0 lines in file_orc.cc (no changes needed)

Note: Build verification skipped due to CMake configuration issues in this
environment. The implementation follows established patterns and should
compile successfully.

Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>

cbb330 added a commit to cbb330/arrow that referenced this pull request

Feb 20, 2026

cbb330 added a commit to cbb330/arrow that referenced this pull request

Feb 20, 2026

cbb330 added a commit to cbb330/arrow that referenced this pull request

Feb 20, 2026

cbb330 added a commit to cbb330/arrow that referenced this pull request

Feb 20, 2026
Reverts PRs apache#76-88 which added:
- Task apache#32: Float32/float64 predicate pushdown
- Task apache#33: String/binary predicate pushdown
- Task apache#34: Timestamp/date predicate pushdown
- Task apache#35: Nested type tests

Tasks 32-35 are now marked as pending for future implementation.

cbb330 added a commit to cbb330/arrow that referenced this pull request

Feb 24, 2026

cbb330 added a commit to cbb330/arrow that referenced this pull request

Feb 24, 2026
)

Implemented temporal type (DATE32, TIMESTAMP) predicate pushdown for ORC files.

**Adapter Changes (adapter.cc - 48 lines):**
- Added liborc::DATE case handler
  - Extracts min/max from IntegerColumnStatistics (days since epoch)
  - Returns Date32Scalar for Arrow integration
  - Handles int64 to int32 conversion for days
- Added liborc::TIMESTAMP and TIMESTAMP_INSTANT case handlers
  - Extracts min/max from IntegerColumnStatistics (nanoseconds since epoch)
  - Returns TimestampScalar with NANO unit
  - Supports both TIMESTAMP (local) and TIMESTAMP_INSTANT (UTC) variants
- Comprehensive documentation for temporal type handling
  - DATE: days since Unix epoch (1970-01-01)
  - TIMESTAMP: nanoseconds since Unix epoch
  - TIMESTAMP_INSTANT: UTC-normalized timestamps
  - Timezone handling notes referencing util.cc

**Tests (file_orc_test.cc - 188 lines):**
- Added TemporalPredicatePushdown test with 5 comprehensive scenarios
- Test 1: DATE32 basic filtering (>= predicate, stripe skipping)
- Test 2: DATE32 range filtering (compound AND predicate)
- Test 3: TIMESTAMP basic filtering (>= with nanosecond precision)
- Test 4: TIMESTAMP less-than filtering (< predicate verification)
- Test 5: TIMESTAMP equality (single value case with min=max)
- Uses realistic date/timestamp values (2020-2022 date range)
- Verifies stripe-level filtering with row count assertions

**Technical Implementation:**
- DATE: ORC stores as int64 days, Arrow uses int32 Date32Scalar
- TIMESTAMP: ORC stores as int64 nanos, Arrow uses TimestampScalar(NANO)
- No changes needed in file_orc.cc - DeriveFieldGuarantee already handles
  temporal types generically via Scalar comparison
- Follows existing pattern from INT/LONG/FLOAT/DOUBLE/STRING/BINARY

**Code Stats:**
- 3 files would be modified (adapter.cc, file_orc.cc, file_orc_test.cc)
- 235 lines added, 1 line deleted
- 48 lines in adapter (DATE/TIMESTAMP cases + documentation)
- 188 lines of tests (5 comprehensive test scenarios)
- 0 lines in file_orc.cc (no changes needed)

Note: Build verification skipped due to CMake configuration issues in this
environment. The implementation follows established patterns and should
compile successfully.

Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>

cbb330 added a commit to cbb330/arrow that referenced this pull request

Feb 24, 2026

cbb330 added a commit to cbb330/arrow that referenced this pull request

Feb 24, 2026

cbb330 added a commit to cbb330/arrow that referenced this pull request

Feb 24, 2026

cbb330 added a commit to cbb330/arrow that referenced this pull request

Feb 24, 2026
Reverts PRs apache#76-88 which added:
- Task apache#32: Float32/float64 predicate pushdown
- Task apache#33: String/binary predicate pushdown
- Task apache#34: Timestamp/date predicate pushdown
- Task apache#35: Nested type tests

Tasks 32-35 are now marked as pending for future implementation.