@andishgar
Contributor

@andishgar andishgar commented Oct 27, 2025

Rationale for this change

Enable ARROW:null_count:approximate support for arrow::ArrayStatistics, along with the corresponding Python binding.

What changes are included in this PR?

Enable ARROW:null_count:approximate in C++ and bind it to ArrayStatistics in Python.

Are these changes tested?

Yes, I ran the relevant unit tests.

Are there any user-facing changes?

Yes. The type of arrow::ArrayStatistics::null_count has been changed from std::optional<int64_t> to std::optional<CountType>, and a new field is_null_count_exact has been added to ArrayStatistics in Python.
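To illustrate the user-facing change, here is a minimal, self-contained sketch of how a count stored as `std::optional<CountType>` can be inspected. It does not include any Arrow header: `CountType` is redeclared locally as a stand-in, and `StatisticsSketch`, `IsNullCountExact`, and `DescribeNullCount` are hypothetical names used only for this example.

```cpp
#include <cstdint>
#include <optional>
#include <string>
#include <variant>

// Stand-in for the variant described in this PR: int64_t holds an exact
// count, double holds an approximate one. This is not the Arrow header.
using CountType = std::variant<int64_t, double>;

// Hypothetical, reduced model of ArrayStatistics that keeps only null_count.
struct StatisticsSketch {
  std::optional<CountType> null_count;
};

// True only when a count is present and stored as an exact integer.
inline bool IsNullCountExact(const StatisticsSketch& stats) {
  return stats.null_count.has_value() &&
         std::holds_alternative<int64_t>(*stats.null_count);
}

// Render the count for display, prefixing approximate values with "~".
inline std::string DescribeNullCount(const StatisticsSketch& stats) {
  if (!stats.null_count.has_value()) return "unknown";
  if (const auto* exact = std::get_if<int64_t>(&*stats.null_count)) {
    return std::to_string(*exact);
  }
  return "~" + std::to_string(std::get<double>(*stats.null_count));
}
```

In this sketch, assigning `int64_t{3}` to `null_count` yields an exact count and assigning `2.5` yields an approximate one. Code that previously read the field as a plain `int64_t` would need `std::get<int64_t>` or a similar accessor after this change.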

@github-actions

⚠️ GitHub issue #47103 has been automatically assigned in GitHub to PR creator.

remove logger header
correct is_null_count_exact comment
@andishgar andishgar marked this pull request as ready for review October 28, 2025 08:45
@andishgar
Contributor Author

@kou
Regarding the Ruby binding, would it be possible to ask someone to work on it?

@kou
Member

kou commented Oct 29, 2025

Yes. I'll do it.

@kou kou requested a review from Copilot October 29, 2025 05:50

Copilot AI left a comment

Pull Request Overview

This PR adds support for approximate null counts in Arrow array statistics by extending the null_count field to support both exact (int64_t) and approximate (double) values using std::variant. This aligns with the existing pattern used for distinct_count.

Key changes:

  • Changed null_count from std::optional<int64_t> to std::optional<CountType> (variant of int64_t and double)
  • Added is_null_count_exact property to distinguish between exact and approximate null counts
  • Updated all related tests and comparison logic to handle the variant type
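As a rough sketch of what comparing variant-typed counts involves, the snippet below uses only the standard-library operators on `std::optional<std::variant<...>>`. `CountType` is again a local stand-in, and `NullCountEquals` and `NullCountAsDouble` are hypothetical helpers. The PR's own comparison helper (`ArrayStatisticsOptionalValueEquals`) is not reproduced here and may treat floating-point values differently.

```cpp
#include <cstdint>
#include <optional>
#include <variant>

// Local stand-in for the count variant; not taken from the Arrow headers.
using CountType = std::variant<int64_t, double>;

// Hypothetical helper: two counts are equal when both are absent, or when
// both are present, hold the same alternative, and hold the same value.
inline bool NullCountEquals(const std::optional<CountType>& left,
                            const std::optional<CountType>& right) {
  return left == right;
}

// Hypothetical helper: widen either alternative to double, for example
// when only the magnitude of the count matters.
inline double NullCountAsDouble(const CountType& count) {
  return std::visit([](auto value) { return static_cast<double>(value); },
                    count);
}
```

With the standard operators, an exact count of 3 and an approximate count of 3.0 compare unequal because the variant alternatives differ, so exactness is part of the value being compared.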

Reviewed Changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| cpp/src/arrow/array/statistics.h | Changed null_count type from int64_t to CountType variant |
| cpp/src/arrow/record_batch.cc | Added logic to handle both exact and approximate null counts when creating statistics arrays |
| cpp/src/arrow/compare.cc | Updated equality comparison to use ArrayStatisticsOptionalValueEquals for null_count |
| cpp/src/arrow/record_batch_test.cc | Renamed test and added new test for approximate null count |
| cpp/src/arrow/array/statistics_test.cc | Added test for approximate null count and updated existing tests |
| cpp/src/arrow/array/array_test.cc | Updated variable type and test assertions to handle variant null_count |
| cpp/src/parquet/arrow/arrow_statistics_test.cc | Updated assertions to extract int64_t from variant null_count |
| python/pyarrow/includes/libarrow.pxd | Changed null_count type from int64_t to CArrayStatisticsCountType |
| python/pyarrow/array.pxi | Added is_null_count_exact property and updated null_count to handle variant |
| python/pyarrow/tests/parquet/test_parquet_file.py | Added assertion to verify is_null_count_exact is True |

@kou kou self-requested a review as a code owner October 29, 2025 05:55