-
Notifications
You must be signed in to change notification settings - Fork 3.9k
GH-47103: [Statistics][C++] Implement Statistics specification attribute ARROW:null_count:approximate #47969
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
|
|
remover logger header correct is_null_count_exact comment
|
@kou |
|
Yes. I'll do it. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR adds support for approximate null counts in Arrow array statistics by extending the null_count field to support both exact (int64_t) and approximate (double) values using std::variant. This aligns with the existing pattern used for distinct_count.
Key changes:
- Changed
null_countfromstd::optional<int64_t>tostd::optional<CountType>(variant of int64_t and double) - Added
is_null_count_exactproperty to distinguish between exact and approximate null counts - Updated all related tests and comparison logic to handle the variant type
Reviewed Changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 1 comment.
Show a summary per file
| File | Description |
|---|---|
| cpp/src/arrow/array/statistics.h | Changed null_count type from int64_t to CountType variant |
| cpp/src/arrow/record_batch.cc | Added logic to handle both exact and approximate null counts when creating statistics arrays |
| cpp/src/arrow/compare.cc | Updated equality comparison to use ArrayStatisticsOptionalValueEquals for null_count |
| cpp/src/arrow/record_batch_test.cc | Renamed test and added new test for approximate null count |
| cpp/src/arrow/array/statistics_test.cc | Added test for approximate null count and updated existing tests |
| cpp/src/arrow/array/array_test.cc | Updated variable type and test assertions to handle variant null_count |
| cpp/src/parquet/arrow/arrow_statistics_test.cc | Updated assertions to extract int64_t from variant null_count |
| python/pyarrow/includes/libarrow.pxd | Changed null_count type from int64_t to CArrayStatisticsCountType |
| python/pyarrow/array.pxi | Added is_null_count_exact property and updated null_count to handle variant |
| python/pyarrow/tests/parquet/test_parquet_file.py | Added assertion to verify is_null_count_exact is True |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
fix typo Co-authored-by: Copilot <[email protected]>
Rationale for this change
Enable
ARROW:null_count:approximatesupport forarrow::ArrayStatistics, along with the corresponding Python binding.What changes are included in this PR?
Enable
ARROW:null_count:approximatein C++ and bind it toArrayStatisticsin Python.Are these changes tested?
Yes, I ran the relevant unit tests.
Are there any user-facing changes?
Yes. The type of
arrow::ArrayStatistics::null_counthas been changed fromstd::optional<int64_t>tostd::optional<CountType>, and a new fieldis_null_count_exacthas been added toArrayStatisticsin Python.