Skip to content

Teach UnionExec to require its inputs sorted #9898

@NGA-TRAN

Description

@NGA-TRAN

Is your feature request related to a problem or challenge?

In InfluxDB IOx, we want to improve this query plan:

Sort (col1, col2, col3, ... coln)
   Union
       Projection (col1, col2,  col3, ... coln)
       Projection (col1, col2,  col3, ... coln)
       Projection (col1, col2,  col3, ... coln)

Since col1 is the same for each projection, the sort can be pushed down below the Union and the new plan will be like this:

ProgressiveEval                            <-- new operator (available in InfluxDB IOx that output data in their order)
   Union
         Sort(col2, col3, ... coln)     <-- sort is pushed down but only from col2 to coln
              Projection (col1, col2,  col3, ... coln)
        Sort(col2, col3, ... coln)
             Projection (col1, col2,  col3, ... coln)
        .....
             .....
        Sort(col2, col3, ... coln)
             Projection (col1, col2,  col3, ... coln)

There are now many sorts but each only work on a subset of data in parallel. Also, the ProgressiveEval ensure the sort streams is ordered by col1 which is very cheap.

The above plan would work for us, however, we hit an issue in DataFusion that the Sorts under Union are always removed from the plan at the the enforce_sorting step https://github.com/apache/arrow-datafusion/blob/09f5a544d25f36ff1d65cc377123aee9b0e8f538/datafusion/core/src/physical_optimizer/enforce_sorting.rs#L361.

After some investigation, we found the reason the sorts under union are always removed because the Union does not implement required_input_ordering. It uses the default https://github.com/apache/arrow-datafusion/blob/179179c0b719a7f9e33d138ab728fdc2b0e1e1d8/datafusion/physical-plan/src/lib.rs#L155 which is always an array of None

Describe the solution you'd like

Implement required_input_ordering in UnionExec to have it ask its inputs to keep their sort order if the Union has output_ordering

Describe alternatives you've considered

No response

Additional context

No response

Metadata

Metadata

Assignees

No one assigned

    Labels

    enhancementNew feature or request

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions