GH-32339: [C++][Python][Parquet] Implement direct reads of Parquet RLE encoded data into Arrow REE #46103
### Rationale for this change
Parquet files often contain columns with highly repetitive values (e.g., status codes, categories, constant metadata fields). Currently, Arrow reads these into dense arrays, materializing every value and consuming significant memory.
This PR implements direct reads of Parquet RLE (Run Length Encoded) data into the Arrow REE (Run End Encoded) representation as described in #32339, using the existing `read_dictionary` API as inspiration for the interface-level changes. Like `read_dictionary`, this feature is currently only supported for columns with a Parquet physical type of `BYTE_ARRAY`, such as string or binary types.

Example usage:
This is a fairly hefty feature, so please let me know if there's anything I can do to help the review process (e.g. by splitting this into multiple smaller PRs). The way I implemented this is by adding a `GetNextValueAndNumRepeats` API to the `RleBitPackedDecoder`, which skips materialization of all values for `RleRun`s while still going through `BitPackedRun`s bit by bit. I'm definitely open to suggestions on other approaches here.

Regarding performance, I am anecdotally observing an order-of-magnitude (~10x) speedup when reading columns with many repeated values, and a slight performance degradation for columns containing purely unique values when read with this feature enabled. Note that the feature is toggled by the user, who presumably understands the shape of their data well enough to decide whether to enable it. I haven't done any scientific benchmarking beyond this; let me know if that would be helpful.
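The per-run decoding idea above can be illustrated with a toy sketch: instead of expanding every element, the decoder surfaces one `(value, num_repeats)` pair per RLE run, and the reader accumulates cumulative run ends to build the REE output directly. The real API lives on the C++ `RleBitPackedDecoder`; the function names and shapes below are illustrative only.

```python
def iter_runs(rle_runs):
    """Toy stand-in for GetNextValueAndNumRepeats: yield one
    (value, num_repeats) pair per run without expanding the run."""
    for value, num_repeats in rle_runs:
        yield value, num_repeats

def runs_to_ree(rle_runs):
    """Accumulate runs into REE form: one entry in `values` per run,
    plus the cumulative logical end index of each run in `run_ends`."""
    values, run_ends, length = [], [], 0
    for value, num_repeats in iter_runs(rle_runs):
        length += num_repeats
        values.append(value)
        run_ends.append(length)
    return values, run_ends

# Runs: "ok" x4, "err" x2, "ok" x3 -> 9 logical elements, 3 REE entries.
values, run_ends = runs_to_ree([("ok", 4), ("err", 2), ("ok", 3)])
print(values)    # ['ok', 'err', 'ok']
print(run_ends)  # [4, 6, 9]
```

The cost of decoding is proportional to the number of runs rather than the number of logical elements, which is where the speedup on highly repetitive columns comes from.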
### What changes are included in this PR?
- `ArrowReaderProperties::set_read_ree()` / `read_ree()` methods to enable REE reading per column
- `GetNextValueAndNumRepeats()` and `GetNextValueAndNumRepeatsSpaced()` methods in `RleBitPackedDecoder`
- `ByteArrayReeRecordReader` for decoding Parquet `BYTE_ARRAY` columns to `RunEndEncodedArray`, for both `Plain` and `RLE_DICTIONARY` encodings
- `read_ree_columns` parameter in `ParquetDataset` and `ParquetFile`

### Are these changes tested?
Yes, through included C++ unit tests and pytests.
### Are there any user-facing changes?
Yes.