Conversation

@lesterfan (Contributor) commented Apr 10, 2025

Rationale for this change

Parquet files often contain columns with highly repetitive values (e.g., status codes, categories, constant metadata fields). Currently, Arrow reads these into dense arrays, materializing every value and consuming significant memory.

This PR implements direct reads of Parquet RLE (Run Length Encoded) data into Arrow REE (Run End Encoded) representation as described in #32339, using the existing read_dictionary API as inspiration for the interface-level changes. Like read_dictionary, this feature is currently only supported for columns with a Parquet physical type of BYTE_ARRAY, such as string or binary types.

Example usage:

 import pyarrow.parquet as pq
 # These columns will be directly read as Arrow run-end-encoded without full materialization if their Parquet representation was run-length-encoded.
 table = pq.read_table('data.parquet', read_ree=['category', 'status']) 

This is a fairly hefty feature, so please let me know if there's anything I can do to help the review process (e.g. by splitting this into multiple smaller PRs). I implemented this by adding a GetNextValueAndNumRepeats API to the RleBitPackedDecoder, which skips materialization of all values for RleRuns while still going through BitPackedRuns bit by bit. I'm definitely open to suggestions on other approaches here.
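
To make the idea concrete, here is a toy Python model of that API's contract (the names and the run representation are illustrative, not Arrow's actual internals): an RLE run is surfaced as a single (value, repeat_count) pair without materializing the repeats, while a bit-packed run still produces one pair per value.

```python
def iter_value_and_num_repeats(runs):
    """Toy sketch of GetNextValueAndNumRepeats: yield (value, count) pairs."""
    for kind, payload in runs:
        if kind == "rle":       # payload is (value, repeat_count)
            yield payload       # one pair covers the whole run, no materialization
        else:                   # "bit_packed"; payload is a list of decoded values
            for v in payload:
                yield (v, 1)    # bit-packed values are still walked one by one

runs = [("rle", (7, 1000)), ("bit_packed", [1, 2, 3])]
pairs = list(iter_value_and_num_repeats(runs))
# pairs == [(7, 1000), (1, 1), (2, 1), (3, 1)]
```

The REE record reader can then turn each (value, count) pair directly into a run, rather than expanding 1000 copies of the value into a dense buffer first.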

Regarding performance, I am anecdotally observing an order-of-magnitude (~10x) speedup when reading columns with many repeated values, and a slight performance degradation for columns containing purely unique values when read with this feature enabled (noting that the feature is toggled by the user, who presumably understands the shape of their data well enough to decide whether to enable it). I haven't done any scientific benchmarking beyond this; let me know if that would be helpful.

What changes are included in this PR?

  1. Add ArrowReaderProperties::set_read_ree() / read_ree() methods to enable REE reading per-column
  2. Implement GetNextValueAndNumRepeats() and GetNextValueAndNumRepeatsSpaced() methods in RleBitPackedDecoder
  3. Add ByteArrayReeRecordReader for decoding Parquet BYTE_ARRAY columns to RunEndEncodedArray
  4. Support REE decoding for both Plain and RLE_DICTIONARY encodings
  5. Add Python bindings via read_ree_columns parameter in ParquetDataset and ParquetFile

Are these changes tested?

Yes, through included C++ unit tests and pytests.

Are there any user-facing changes?

Yes.

@lesterfan lesterfan requested a review from wgtmac as a code owner April 10, 2025 21:46
@github-actions

⚠️ GitHub issue #32339 has been automatically assigned in GitHub to PR creator.

@lesterfan lesterfan changed the title GH-32339: [C++][Parquet] Implement direct reads of Parquet RLE encoded data into Arrow REE GH-32339: [C++][Python][Parquet] Implement direct reads of Parquet RLE encoded data into Arrow REE Apr 10, 2025
@lesterfan (Contributor, Author)

Tagging @pitrou and @raulcd for some initial feedback on the implementation. I'm happy to split this into smaller PRs if that's easier for review (though guidance on how to split it would be appreciated), or make design changes based on your input. Tagging you both since we've worked together previously and I see you've reviewed recent REE changes, but feel free to suggest other reviewers if more appropriate.

@pitrou (Member) commented Oct 27, 2025

@lesterfan I suggest you first rebase or merge your work on git main, because the changes in #47294 might make this work easier for you.

@lesterfan lesterfan force-pushed the rle-ree-rebased-on-main branch from 40aedcc to 3547e1c Compare October 27, 2025 14:05
@lesterfan (Contributor, Author)

@pitrou Yup I spent a while rebasing this work on main over the weekend! (This was another reason I wanted to bump this 😄)
