How to improve Parquet reading performances? #7737

devoxi · 2023-10-04T08:50:46Z

Hello!

We've been experimenting over the last couple of days with DataFusion (31.0), comparing its performance with our existing ClickHouse setup. To do so, we exported a 5GB Parquet dataset and, by building some UDFs, managed to replicate some of our queries.

In the end we are running a single, quite simple query over the same Parquet dataset on the same Mac with both DataFusion and ClickHouse. ClickHouse always answers in about 700ms, while DataFusion takes 1.2s.

I've tried multiple settings, verified that our UDF was not the cause, and checked that ClickHouse was not caching anything, but I couldn't make DataFusion any faster. According to the EXPLAIN ANALYZE, the poor performance comes from the Parquet phase.

I have to confess that we are beginners in Rust and we might have missed something, hence this message.
Here is the EXPLAIN ANALYZE of our query, if it can help:

Plan with Metrics | ProjectionExec: expr=[SUM(my_table.sign)@0 as tcount, SUM(my_udf(my_table.nested_field.array_column,List([custom_string])) * my_table.sign)@1 as _ccount_1], metrics=[output_rows=1, elapsed_compute=584ns]
     AggregateExec: mode=Final, gby=[], aggr=[SUM(my_table.sign)@0 as tcount, SUM(my_udf(my_table.nested_field.array_column,List([custom_string])) * my_table.sign)], metrics=[output_rows=1, elapsed_compute=48.459µs]
        CoalescePartitionsExec, metrics=[output_rows=20, elapsed_compute=5.708µs]
            AggregateExec: mode=Partial, gby=[], aggr=[SUM(my_table.sign)@0 as tcount, SUM(my_udf(my_table.nested_field.array_column,List([custom_string])) * my_table.sign)], metrics=[output_rows=20, elapsed_compute=4.769958246s]
                ProjectionExec: expr=[sign@0 as sign, nested_field.array_column@3 as nested_field.array_column], metrics=[output_rows=6185527, elapsed_compute=396.238µs]
                    CoalesceBatchesExec: target_batch_size=8192, metrics=[output_rows=6185527, elapsed_compute=168.795215ms]
                        FilterExec: int_column@1 = 2 AND (CAST(string_column@2 AS Utf8) !~* (.*.|)(word1|word2|word3|word4).*), metrics=[output_rows=6185527, elapsed_compute=652.683132ms]
                            ParquetExec: file_groups={20 groups: [[path/to/parquet/my_dataset.parquet:0..262363404], [path/to/parquet/my_dataset.parquet:262363404..524726808], [path/to/parquet/my_dataset.parquet:524726808..787090212], [path/to/parquet/my_dataset.parquet:787090212..1049453616], [path/to/parquet/my_dataset.parquet:1049453616..1311817020], ...]}, projection=[sign, int_column, string_column, nested_field.array_column], predicate=int_column@5 = 2 AND (CAST(string_column@79 AS Utf8) !~* (.*.|)(word1|word2|word3|word4).*), pruning_predicate=int_column_min@0 <= 2 AND 2 <= int_column_max@1, metrics=[output_rows=6185527, elapsed_compute=20ns, page_index_rows_filtered=0, predicate_evaluation_errors=0, file_scan_errors=0, row_groups_pruned=13, num_predicate_creation_errors=0, file_open_errors=0, pushdown_rows_filtered=2880077, bytes_scanned=311061152, time_elapsed_processing=5.335386071s, pushdown_eval_time=781.00424ms, time_elapsed_scanning_until_data=1.472284126s, time_elapsed_opening=3.032638581s, time_elapsed_scanning_total=18.9360584s, page_index_eval_time=1.318µs]

We also noticed in some other queries that the gap with ClickHouse was much wider when the data was split across several Parquet files than with a single Parquet file.

So is there anything we might have missed that is general knowledge and could explain this performance?
Thanks for your help!

tustvold · 2023-10-04T11:35:47Z

Collaborator

Some ideas:

Run in release mode
Enable SIMD instructions, see performance tips here
Enable parquet filter pushdown - https://docs.rs/datafusion/latest/datafusion/common/config/struct.ParquetOptions.html#structfield.pushdown_filters
Rewrite the regex filter to be a cheaper InList or disjunction of like expressions, to avoid expensive regex evaluation
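
As an illustration of the last idea: the filter in the plan, `string_column !~* (.*.|)(word1|word2|word3|word4).*`, keeps rows whose string contains none of the words, ignoring case. By De Morgan's law, "not (A or B)" is "not A and not B", so the same condition can be written as a chain of NOT ILIKE tests. The sketch below is a hypothetical helper (`not_contains_any` is not a DataFusion API) that builds that SQL text. It assumes the words are plain literals with no `%`, `_` or `'` characters, and that the column values contain no newlines, which the regex `.` would not match.

```rust
/// Hypothetical helper: builds a SQL predicate that is true when `column`
/// contains none of `words` (case-insensitive), using NOT ILIKE instead of
/// a regular expression.
///
/// Assumption: every word is a plain literal without `%`, `_` or `'`,
/// so no escaping is performed here.
fn not_contains_any(column: &str, words: &[&str]) -> String {
    words
        .iter()
        .map(|w| format!("{} NOT ILIKE '%{}%'", column, w))
        .collect::<Vec<_>>()
        .join(" AND ")
}

fn main() {
    // Column and word names are the placeholders used in the plan above.
    let predicate = not_contains_any(
        "string_column",
        &["word1", "word2", "word3", "word4"],
    );
    println!("WHERE int_column = 2 AND {}", predicate);
}
```

Whether this is actually faster than the regex for a given dataset would need to be measured.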

9 replies

devoxi Oct 4, 2023
Author

I'm doing the exact same query on ClickHouse on the exact same Parquet file:

SELECT
    sum(sign)
FROM file('my_parquet_file.parquet', Parquet)
WHERE (int_column = 2)

Also in Datafusion I'm using the default settings to register the parquet file. I tried a few things but it didn't change anything:

ctx.register_parquet("my_table", "my_parquet_file.parquet", ParquetReadOptions::default()).await?;

tustvold Oct 4, 2023
Collaborator

Perhaps you could attach a CPU profile to see where it is actually spending time. Normally I would suggest hotspot, but that's Linux specific. I think Instruments on Mac might have something? Alternatively, cargo-flamegraph might work.

devoxi Oct 4, 2023
Author

I just built the flamegraph and, if I'm not mistaken, a lot of time is spent in parquet::file::footer::read_metadata, and more precisely in <parquet::format::ColumnMetaData as thrift::protocol::TSerializable>::read_from_in_protocol.
Unfortunately it doesn't seem I can attach it here.

tustvold Oct 4, 2023
Collaborator

That's interesting. The likely reason is that it reads the footer once for every file group, which means it is doing that 20 times. Normally that cost is outweighed by the additional parallelism, but it is possible that the parquet file has been written in such a way that this isn't the case. Arrow-cpp had a bug for a very long time where it produced massive row groups, and DuckDB has an interesting approach to the spec 😅

Couple of questions:

What did you use to write the parquet file?
How many columns does the parquet file have?
How many row groups does the file contain? This can be found with https://github.com/apache/arrow-rs/blob/master/parquet/src/bin/parquet-layout.rs

Couple of things to try:

You could disable https://docs.rs/datafusion/latest/datafusion/config/struct.OptimizerOptions.html#structfield.repartition_file_scans
Reduce the number of https://docs.rs/datafusion/latest/datafusion/config/struct.ExecutionOptions.html#structfield.target_partitions
Rewrite the file with a smaller https://docs.rs/datafusion/latest/datafusion/common/config/struct.ParquetOptions.html#structfield.max_row_group_size
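
To see why the first two settings matter, here is a simplified model of how one file ends up as 20 file groups. It is a sketch, not DataFusion's actual repartitioning code: `split_into_ranges` is a hypothetical function, and the file size is assumed to be exactly 20 × 262363404 bytes, inferred from the byte ranges in the plan above. Each resulting range is scanned independently, so each one reads the footer again.

```rust
/// Simplified model (not DataFusion's real code): split a file of
/// `file_size` bytes into at most `target_partitions` contiguous byte
/// ranges of equal size, rounding the chunk size up.
fn split_into_ranges(file_size: u64, target_partitions: u64) -> Vec<(u64, u64)> {
    assert!(target_partitions > 0, "need at least one partition");
    // Ceiling division so the ranges cover the whole file.
    let chunk = (file_size + target_partitions - 1) / target_partitions;
    let mut ranges = Vec::new();
    let mut start = 0;
    while start < file_size {
        let end = (start + chunk).min(file_size);
        ranges.push((start, end));
        start = end;
    }
    ranges
}

fn main() {
    // Assumed file size: 20 ranges of 262363404 bytes, as in the plan.
    let file_size: u64 = 20 * 262_363_404;

    for partitions in [20u64, 4, 1] {
        let ranges = split_into_ranges(file_size, partitions);
        // One footer read per range in this model.
        println!(
            "target_partitions={:>2} -> {} ranges (footer reads), first range {:?}",
            partitions,
            ranges.len(),
            ranges[0]
        );
    }
}
```

In this model, lowering target_partitions or disabling repartition_file_scans reduces the number of footer reads, at the price of less parallelism in the scan.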

Answer selected by Jefffrey

devoxi Oct 4, 2023
Author

In order to write the parquet file I've used ClickHouse (v23.3). (And by the way to perform the comparison with Datafusion I used the v23.9 of ClickHouse)
279 row groups according to parquet-layout. (and the whole parquet file is about 5GB and contains 250 columns, and 5M rows)

I'll try what you suggested and come back later with the result :) Thanks for your help!

devoxi Oct 4, 2023
Author

So, I did rewrite my Parquet file with datafusion-cli with this command:
COPY 'my_file_v1.parquet' TO 'my_file_v2.parquet' (format parquet, single_file_output true, compression snappy);
The number of row groups went down to 10. It drastically improved the query time with only the SUM(sign), and it's now even better than ClickHouse.
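
A rough back-of-the-envelope calculation suggests why the row group count matters so much here. A Parquet footer holds one ColumnMetaData entry per column chunk, that is, per (row group, column) pair, and the flamegraph pointed at decoding exactly those entries. The sketch below uses the figures from this thread (250 columns, 279 or 10 row groups, 20 file groups); the function names are hypothetical, and it assumes the whole footer is decoded once per file group.

```rust
/// Number of ColumnMetaData entries in one footer:
/// one per column chunk, i.e. row groups times columns.
fn column_chunks_per_footer(row_groups: u64, columns: u64) -> u64 {
    row_groups * columns
}

/// Total entries decoded when the footer is read once per file group
/// (assumption: every read decodes the full footer).
fn total_entries_decoded(row_groups: u64, columns: u64, footer_reads: u64) -> u64 {
    column_chunks_per_footer(row_groups, columns) * footer_reads
}

fn main() {
    let columns = 250; // reported for this file
    let footer_reads = 20; // one per file group in the plan

    // Original file written by ClickHouse: 279 row groups.
    let before = total_entries_decoded(279, columns, footer_reads);
    // File rewritten with datafusion-cli: 10 row groups.
    let after = total_entries_decoded(10, columns, footer_reads);

    println!("entries decoded before rewrite: {}", before);
    println!("entries decoded after rewrite:  {}", after);
}
```

Under these assumptions the original file requires decoding 1,395,000 entries per query against 50,000 for the rewritten one, which is consistent with the footer dominating the profile.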

However I couldn't test properly with my full query as the rewritten parquet file has an issue with my nested column leading to this error:
Error: ArrowError(ExternalError(ArrowError("Parquet argument error: Parquet error: Invalid offset in sparse column chunk data: 145661441")))
I now understand the problems that can come from various Parquet implementations and it's definitely something we'll take into account now 😅

I also managed to test my query including the regex, and on this one ClickHouse was still faster.
I also saw that you did a draft PR to mitigate the issue when there are a lot of row groups, so I tried it with your branch, and while it did change the flamegraph shape, the total execution time didn't really decrease. So there might be other issues, but probably not linked to Parquet.

Anyway, thanks a lot for your help, it was really appreciated!

Ted-Jiang Oct 7, 2023
Collaborator

"That's interesting, the reason that might be is because it will read the footer once for every file group, which means it is doing that 20 times."

@tustvold could you please show me where the code is 🤣? It took me a long time trying to find it. Could we read the footer at the file level and pass the offset info down to the file group level 🤔

tustvold Oct 7, 2023
Collaborator

It's a consequence of the repartition file scans pass. #7739 is one option to rectify this, but I'm not sure it is a good idea.

alamb · 2023-10-04T14:04:59Z

Collaborator

I also think @Ted-Jiang added code (not yet released) in #7570 that caches parquet data statistics. Maybe this could help the use case described in this discussion as well. The original use case was a little different (reusing the statistics within a session, rather than within a query).

0 replies
