@yongchul
Contributor

feat: support with ties in FetchRel

WITH TIES is part of the SQL standard, but FetchRel does not support it.

Industry Adoption

  • Transact-SQL variants implement WITH TIES in the non-standard TOP clause.
  • Oracle, PostgreSQL, and MariaDB implement WITH TIES in the ANSI SQL OFFSET ... FETCH ... clause.

All implementations require an ORDER BY clause, as it does not make sense to evaluate ties without a specific order.

jcamachor previously approved these changes Mar 21, 2025
Contributor

@jcamachor jcamachor left a comment

Once there's consensus on the approach among reviewers, please also include the documentation update in site/docs/relations/logical_relations.md as part of this PR.

@EpsilonPrime
Member

I suspect this will be discussed in the Substrait Community meeting this week. There has been some discussion around adding first-class support for ordering. If we had ordering, ties would merely be a boolean option in FetchRel. Concepts for first-class ordering include required ordering (such as after a sort) and optional ordering (perhaps set via a hint in a ReadRel or after some other operation).

Member

@vbarua vbarua left a comment

Based on your point about

we could simply have a boolean field with_ties then requires immediate SortRel below FetchRel. However, such conditional inter-rel dependency is confusing and the consumers requiring extra validation or inference to deduce sort order as the operation is not self-contained.

I think it makes sense to include the ties in the FetchRel as you did to avoid inter-rel dependency.

I don't have a strong opinion on this but now that I enumerated the names, with_ties_sorts seems like a good one to me.

I think your idea for with_ties_sorts, along with an explanatory comments, should suffice. Adding a wrapping message feels a little redundant.

@yongchul
Contributor Author

Hi @vbarua,

Based on your point about

we could simply have a boolean field with_ties then requires immediate SortRel below FetchRel. However, such conditional inter-rel dependency is confusing and the consumers requiring extra validation or inference to deduce sort order as the operation is not self-contained.

I think it makes sense to include the ties in the FetchRel as you did to avoid inter-rel dependency.

Glad to hear that you agree! :)

I don't have a strong opinion on this but now that I enumerated the names, with_ties_sorts seems like a good one to me.

I think your idea for with_ties_sorts, along with an explanatory comments, should suffice. Adding a wrapping message feels a little redundant.

Thank you! While I was revising the PR, including the documentation, I realized that it is better to keep the desired sort separate from with ties (i.e., keep sorts and with_ties as separate fields). If we do so, we can capture the semantics of (ORDER BY)? OFFSET? FETCH (WITH TIES)? in a single logical operator.

An additional benefit is that we can drop the sort that is typically needed to get the desired orderedness from the input.

The change is still completely backward compatible: everything still holds if nothing is set, but producers have the freedom to express the intent without an explicit sort.

PTAL!

Member

@westonpace westonpace left a comment

Minor grammatical suggestions but this seems good otherwise.

What should happen if there are no sort fields and with_ties is true?

  • The plan returns an error
  • The plan runs as if with_ties is false

I see that you state "At least one filed MUST be specified when with_ties is true" so I think the answer is "The plan returns an error" but I want to confirm.

If that is true, can we add something like "If with_ties is true then there must be at least one sort field or else the plan is invalid" to the .md file?
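
To make the rule concrete, here is a minimal Python sketch of such a check. FetchRelSketch and validate_fetch are hypothetical names used only for illustration; they are not part of Substrait or its generated bindings, and sort fields are reduced to plain column names.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class FetchRelSketch:
    """Hypothetical stand-in for the proposed FetchRel fields."""
    offset: int = 0
    count: Optional[int] = None  # None means "no limit" in this sketch
    sorts: List[str] = field(default_factory=list)
    with_ties: bool = False


def validate_fetch(rel: FetchRelSketch) -> None:
    """Reject a plan that asks for ties without saying how rows are ordered."""
    if rel.with_ties and not rel.sorts:
        raise ValueError("invalid plan: with_ties requires at least one sort field")


# Accepted: ties are requested and a sort field is present.
validate_fetch(FetchRelSketch(count=3, sorts=["score"], with_ties=True))
```

Under this reading, a plan with with_ties set and no sort fields fails validation instead of silently running as if with_ties were false.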

Contributor Author

@yongchul yongchul left a comment

Addressed comments.

@vbarua vbarua requested review from jcamachor, vbarua and westonpace May 21, 2025 17:57
Member

@vbarua vbarua left a comment

This looks good to me.

It makes sense, in my opinion, to include the sort fields on the FetchRel because the WITH TIES concept is heavily interlinked with the sort. This does potentially make the SortRel redundant, because every SortRel can be expressed as a FetchRel, but I think that's okay.

An alternative approach to this would be to encode this as some combination of

FetchRel[...]
   SortRel[...]

but that depends on systems understanding physical properties, specifically orderedness, and retaining/propagating them correctly to avoid losing the sort before the fetch during translation/optimization/execution. I don't think we're at a point where we can rely on this, and having the sort on the FetchRel makes it 100% unambiguous what needs to happen.

This does introduce a little bit of redundancy, because in theory all SortRels can be expressed as FetchRels now if we wanted, but I think that's okay.
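
The redundancy can be seen in a small sketch. It assumes a hypothetical fetch_rel helper in which sort_key stands in for the sorts field and count=None means no limit, and it leaves with_ties aside. Under those assumptions, a fetch with no offset and no limit behaves like a plain sort:

```python
def fetch_rel(rows, sort_key=None, offset=0, count=None):
    """Hypothetical FetchRel-with-sorts evaluation, ignoring with_ties."""
    # With a sort key the output is ordered; otherwise input order is preserved.
    ordered = sorted(rows, key=sort_key) if sort_key is not None else list(rows)
    end = None if count is None else offset + count
    return ordered[offset:end]


scores = [80, 95, 70]
as_sort = fetch_rel(scores, sort_key=lambda s: s)            # acts like a SortRel
as_fetch = fetch_rel(scores, sort_key=lambda s: s, count=2)  # sort, then limit
unsorted_fetch = fetch_rel(scores, count=2)                  # input order kept
```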

//
// Note: the output records are in the order of `sorts` if at least one sort field is specified.
// Otherwise, the input orderedness is preserved.
repeated SortField sorts = 7;
Contributor Author

During the sync-up, @jacques-n asked what the minimal thing is that needs to be done to support the scenario. The minimum is to specify the list of column comparisons the same way as the backing Sort. The Fetch requirement is weaker than Sort's because we do not care about the order across columns or about directions, but we do need to align how we calculate the equality of a column.

This (comparison, equality) happens in two places in Substrait today: SortField and ComparisonJoinKey, but they are separate.

Perhaps what we can do is

message EqualField {
  Expression.FieldReference field = 1;
  ComparisonJoinKey.ComparisonType comparison = 2;
}

message FetchRel {
  ..
  repeated EqualField tie_breakers = 7;
  bool with_ties = 8; // optional. maybe implicit based on non-empty `tie_breakers` field.
  ..
}

Then, to implement with_ties, the producer must put the appropriate Sort below and set up the tie-breaking fields consistently with the Sort. I see value in this approach (keeping a naive implementation of fetch simple), but I am not quite sure whether it is simpler than the proposed change, especially in the sense that a producer can concisely express what needs to be done and the consumer has the option of how to implement or execute it.
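
A rough Python sketch of how per-column equality could drive tie detection under this alternative. The Comparison enum, values_tie, and rows_tie are hypothetical names; the two modes are modelled on SQL's = and IS NOT DISTINCT FROM and are not a statement of what ComparisonJoinKey.ComparisonType actually contains.

```python
from enum import Enum


class Comparison(Enum):
    """Hypothetical per-column equality modes for tie detection."""
    EQ = "eq"                                      # like SQL '=': NULL never ties
    IS_NOT_DISTINCT_FROM = "is_not_distinct_from"  # NULL ties with NULL


def values_tie(a, b, mode):
    """Compare two column values under the given equality mode."""
    if a is None or b is None:
        return mode is Comparison.IS_NOT_DISTINCT_FROM and a is None and b is None
    return a == b


def rows_tie(left, right, tie_breakers):
    """tie_breakers is a list of (column, Comparison) pairs, like EqualField."""
    return all(values_tie(left[col], right[col], mode) for col, mode in tie_breakers)


last_kept = {"score": None}
candidate = {"score": None}
strict = rows_tie(last_kept, candidate, [("score", Comparison.EQ)])
null_safe = rows_tie(last_kept, candidate, [("score", Comparison.IS_NOT_DISTINCT_FROM)])
```

The same pair of rows ties under one mode and not the other, which is why the comparison has to be stated per column and kept consistent with the Sort below.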

Contributor

I'm inclined toward this pattern rather than using SortFields, since ordering doesn't matter in this relation.

@benbellick benbellick self-requested a review November 22, 2025 15:30
Member

@benbellick benbellick left a comment

I hadn't heard of this with ties syntax before, so forgive me if I am misunderstanding.

But, I wonder if we can keep the FetchRel and SortRel decoupled.

What if FetchRel specified what we call tie_fields (instead of sorts) to determine what rows are considered "tied". There is no need to have a with_ties field, we just assume that a non-empty tie_fields field implies that you do want the with ties behavior.

Then if tie_fields has at least one field in it, after fetching count rows, continue fetching additional rows where tie_fields match the value at position count.

The key point: tie_fields does not imply sorting - it just defines tie equality based on whatever order the input arrives in.

Example

Input table (unsorted):

row name score
1 Alice 95
2 Bob 80
3 Carol 95
4 Dave 70
5 Eve 80
6 Frank 60

Read -> Fetch(count=3, tie_fields=[score]):

row name score
1 Alice 95
2 Bob 80
3 Carol 95

After fetching 3 rows, row 3 has score=95. Row 4 has score=70 (differs from row 3), so stop.

Read -> Sort(score DESC) -> Fetch(count=3, tie_fields=[score]):

row name score
1 Alice 95
2 Carol 95
3 Bob 80
4 Eve 80

After fetching 3 rows, row 3 has score=80. Row 4 also has score=80, so continue. Row 5 has score=70, so stop.
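
The two cases above can be reproduced with a short Python sketch. fetch_with_tie_fields is a hypothetical helper, not Substrait API; it assumes rows are dicts and that tie equality is plain value equality.

```python
def fetch_with_tie_fields(rows, count, tie_fields):
    """Take `count` rows in arrival order, then keep rows tied with the last one."""
    rows = list(rows)
    if count <= 0:
        return []
    result = rows[:count]
    if not tie_fields or len(rows) <= count:
        return result
    boundary = tuple(result[-1][f] for f in tie_fields)
    for row in rows[count:]:
        if tuple(row[f] for f in tie_fields) != boundary:
            break  # the first non-tied row ends the scan
        result.append(row)
    return result


table = [
    {"name": "Alice", "score": 95},
    {"name": "Bob", "score": 80},
    {"name": "Carol", "score": 95},
    {"name": "Dave", "score": 70},
    {"name": "Eve", "score": 80},
    {"name": "Frank", "score": 60},
]

# Read -> Fetch: no sort, so ties are judged in read order.
unsorted_names = [r["name"] for r in fetch_with_tie_fields(table, 3, ["score"])]

# Read -> Sort(score DESC) -> Fetch: Python's sort is stable, so equal scores
# keep their input order.
by_score_desc = sorted(table, key=lambda r: r["score"], reverse=True)
sorted_names = [r["name"] for r in fetch_with_tie_fields(by_score_desc, 3, ["score"])]

print(unsorted_names)  # ['Alice', 'Bob', 'Carol']
print(sorted_names)    # ['Alice', 'Carol', 'Bob', 'Eve']
```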


So then, if we take this approach, both of these plans make sense:

FetchRel(count=3, tie_fields=[score])
  └─ SortRel(fields=[score DESC])
      └─ ReadRel

and

FetchRel(count=3, tie_fields=[score])
  └─ ReadRel

The first plan (with SortRel) gives you SQL ORDER BY ... FETCH FIRST ... WITH TIES semantics. The second plan is valid but probably less useful - it just returns the first N rows plus any that happen to tie with row N based on read order.

Documentation should clarify that tie_fields typically matches the SortRel fields when composed together.

This approach keeps FetchRel and SortRel fully decoupled, making the operators more composable and easier to reason about independently.

What do you think? I can see a case for saying that this may be more confusing, but I don't love having to enforce that FetchRel with ties must come coupled with a matching SortRel in its input.

@yongchul
Contributor Author

Thanks @benbellick for your thoughts! Comments are inlined.

I hadn't heard of this with ties syntax before, so forgive me if I am misunderstanding.

But, I wonder if we can keep the FetchRel and SortRel decoupled.

What if FetchRel specified what we call tie_fields (instead of sorts) to determine what rows are considered "tied". There is no need to have a with_ties field, we just assume that a non-empty tie_fields field implies that you do want the with ties behavior.

The with_ties field is actually a minor issue. Yes, we can remove it and use the other field as an indicator. That's fine.

The more fundamental issue is that you need to define which values are equivalent in which columns. This is NOT just a list of columns but a list of how you compare equality for each column. You have to compare and determine the equivalence of values for ALL types that can be sorted, including NULLs (e.g., strings with collations, timestamps with time zones with/without collations; NULLs are considered the same). If you decouple, you will have to introduce VERY tight coupling with the sort operation. I don't know whether I can call these two tightly coupled operators decoupled.
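
A small Python illustration of why a bare column list is not enough. The helper names are hypothetical, and str.casefold() is only a rough stand-in for a real case-insensitive collation.

```python
from datetime import datetime, timedelta, timezone


def ties_binary(a: str, b: str) -> bool:
    """Tie test under a binary (byte-for-byte) comparison."""
    return a == b


def ties_case_insensitive(a: str, b: str) -> bool:
    """Tie test under an approximated case-insensitive collation."""
    return a.casefold() == b.casefold()


binary_tie = ties_binary("Smith", "SMITH")              # not tied
collated_tie = ties_case_insensitive("Smith", "SMITH")  # tied

# Aware datetimes compare by instant, so these two render differently but tie.
noon_utc = datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc)
seven_est = datetime(2025, 1, 1, 7, 0, tzinfo=timezone(timedelta(hours=-5)))
instant_tie = noon_utc == seven_est
```

The same column values are tied or not depending on the comparison in force, so the fetch has to agree with the sort on that comparison.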

So to me, if we want a truly decoupled FetchRel, we will have to define equality as well as comparisons. Then we can go and redefine joins, aggregates, and sorts. This FetchRel will list the equality comparisons of the columns that matter. Still, almost identical information is right below it in the SortRel (apart from the fact that we are comparing the values of a single column, as in a sort and unlike a join). But then, this is getting too close to sort order, which is supposed to deal with all of this already.

Having said that, perhaps the right place to add these things is actually SortRel, but then I can hear the "No, but Sort should be simple and composable and we have FetchRel to retrieve first rows!" argument. 😆

I know systems from both camps: fetch with sort or sort with fetch, and we are certainly repeating that history in this project. One thing that I do know, though, is that this is a single logical operation, and I don't really think they should be separate for the sake of pretending decomposability.

I am emphasizing logical because I believe that exchanging logical operators is where Substrait brings most of its value. Decomposability is fine, but I prefer to capture the high-level semantics more concisely and clearly when the operations can't be cleanly decomposed, as in this case.

Note that I have a very strong bias. I can be convinced, though, by a beautiful solution from a fresh angle!

@benbellick
Member

@yongchul thanks for the response! This is definitely helping me to understand the problem more.

However, I wrote a bit unclearly. Before when I mentioned tie_fields, I did really mean for those to be a list of SortField messages, rather than just a generic reference to fields. Sorry about that! I was just trying to keep the "examples" simple.

So the comparison semantics would be available.

That being said, I do see the case that this really should just be one operation. My above comment was partially in response to @vbarua's suggestion of having a restriction of using SortRel as input to FetchRel.

I think that your solution is great and the redundancy is difficult to avoid.


One more point to raise: the documentation contains a reference to a Top-N physical relation that says the following (emphasis mine):

The top-N operator reorders a dataset based on one or more identified sort fields as well as a sorting function. Rather than sort the entire dataset, the top-N will only maintain the total number of records required to ensure a limited output. A top-n is a combination of a logical sort and logical fetch operations.
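
As a rough illustration of the "only maintain the total number of records required" idea, here is a minimal Python sketch using a bounded heap. top_n is a hypothetical helper, not a Substrait operator; it ignores WITH TIES and breaks ties in favor of earlier input rows.

```python
import heapq


def top_n(rows, n, key):
    """Return the n rows with the largest key, holding at most n rows in memory."""
    if n <= 0:
        return []
    heap = []  # min-heap of (key, -position, row); the root is the weakest kept row
    for position, row in enumerate(rows):
        entry = (key(row), -position, row)
        if len(heap) < n:
            heapq.heappush(heap, entry)
        elif entry > heap[0]:
            # Positions are unique, so the comparison never reaches `row` itself.
            heapq.heapreplace(heap, entry)
    # Largest key first; among equal keys, the earlier input row first.
    return [row for _, _, row in sorted(heap, reverse=True)]


people = [("Alice", 95), ("Bob", 80), ("Carol", 95),
          ("Dave", 70), ("Eve", 80), ("Frank", 60)]
best_three = top_n(people, 3, key=lambda p: p[1])
print(best_three)  # [('Alice', 95), ('Carol', 95), ('Bob', 80)]
```

For this input the result matches heapq.nlargest(3, people, key=...), which the standard library offers for the same purpose.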

This sounds a lot like what is trying to be achieved here. However, this documented relation doesn't exist in the protos 😅.

Should we:
a. Remove it from the documentation (since FetchRel with sorts now serves this purpose)
b. Introduce a TopNRel message which does what you are accomplishing here and leave FetchRel untouched

What do you think?
