Skip to content

Conversation

@yongchul
Copy link
Contributor

Motivation

AggreateRel is a very complex operation and implementations may behave differently in corner cases. We have identified one such corner case in at least two popular database systems: SQL server and Oracle.

SELECT COUNT(*) FROM T
SELECT COUNT(*) FROM T GROUP BY GROUPING SET (())

If T is empty, both queries should yield the same result: 1 row, 0. However, the two systems yield no rows on an empty table T.

Proposal

Since there can be other unidentified behavioral differences, we propose to capture these in a Compatbility message in AggregateRel and capture such behaviors in the message rather than adding a new field to the AggregateRel. The default behavior does not change.

The scheme can be extended to other operators and can be included in the dialect, whether certain behavior is supported or not.

We may consider promote this compatibility to plan-level message but it will be a potpourri of all rels in Substrait (i.e., submessage allocated per rel to prevent different behaviors all over the places).

@jacques-n
Copy link
Contributor

Makes sense to me. It would be good to add to the spec docs too and also clarify a couple examples of systems that support each behavior.

@vbarua
Copy link
Member

vbarua commented Nov 20, 2025

I think it makes sense to add this kind of toggle to AggregateRel to capture this behaviour, but I'm not sure I fully understand the scheme you're proposing.

we propose to capture these in a Compatbility message in AggregateRel and capture such behaviors in the message rather than adding a new field to the AggregateRel.

Technically you've added a field

Compatibility compatibility = 6;

to the AggregateRel (which is reasonable)

We may consider promote this compatibility to plan-level message

I don't think this would ever make sense, because as you point out it would be a potpurri of toggles and most toggles would only apply to specific rels.

Like Jacques said, it would be helpful to have this behaviour documented and explained in https://substrait.io/relations/logical_relations/#aggregate-operation

// when specified with non-empty groupings field even when groupings includes
// empty grouping sets.
bool groupings_yield_no_rows_on_empty_input = 1;
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm off two minds on declaring settings like this. On one hand, having single message with a bunch of boolean behavioral toggles makes it easy to add new toggles, because we can just add a new field. We would need to make sure that the default unset value matched the default behaviour when we do this. Generally though, I'm wary of boolean toggles because IMO they can be hard to understand, and are limited to switching between 2 different behaviors.

I personally lean towards the enum style of setting toggles because we can indicate the expected behavior with the name, we can declare more than 2 types of behaviors, we can add behaviors easily if we discover more weird system behaviour and we can explicitly declare the unset values as unspecified.

message EmptyInputMode {
      OUTPUT_MODE_UNSPECIFIED = 0;
      OUTPUT_MODE_YIELD_EMPTY_ROW = 1;
      OUTPUT_MODE_YIELD_NO_ROW = 1;
}

Copy link
Member

@vbarua vbarua Dec 2, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When we add these kinds of compatibility toggles, we should also document them in the website. It would be good to include the systems this is useful for, as well as example queries to trigger the behaviour in the docs for context as well.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I plan to add the documentation -- the weeks have been crazy, can't find time... perhaps later this week or next week, I'll update the PR with the documentation.

@yongchul
Copy link
Contributor Author

yongchul commented Nov 20, 2025

We may consider promote this compatibility to plan-level message
I don't think this would ever make sense, because as you point out it would be a potpurri of toggles and most toggles would only apply to specific rels.

On flip side, if you are talking to one system, the behavior is likely fixed -- i.e., it's not applied to a single (which is very flexible) but all of the operators because they share implementation -- and just applied as blanket. Then, why not just spell out the behaviors once and assume that to the rest of the plan? :smile It is just a thought and probably better to start with this per rel toggles and promote to global if we see things really cropping up in the wild.

Copy link
Member

@benbellick benbellick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One alternative idea. What if we instead model this behavior via a table function which takes in input-relations? We could take a hardline position and say that the AggregateRel in this case always returns a single row with result 0. Then we could model this behavior something like the following:

  TableFunctionRel {
    name: "make_empty_if_single_all_zero_row"
    input: AggregateRel {
      input: T (empty table)
      groupings: [()]
    }
  }

This would keep AggregateRel semantics clean and keep us oriented towards composable primitives, while making the non-standard SQL Server/Oracle behavior explicit in the plan structure rather than hidden behind a flag.

I know table functions don't exist yet (this would require introducing TableFunctionRel as a new relation type in Substrait), but this could be one way to mitigate the problem of a growing grab-bag of config settings.

@yongchul
Copy link
Contributor Author

One alternative idea. What if we instead model this behavior via a table function which takes in input-relations? We could take a hardline position and say that the AggregateRel in this case always returns a single row with result 0. Then we could model this behavior something like the following:

It is an interesting approach but I don't like it because it is way more work for both producers and consumers to support these things correctly. Note that you are not just wrapping the subtree rooted at aggregate rel but TableFunction(AggregateRel) - Input. Also, this approach just happens to work in this case related to input, but it won't work if the compatibility fundamentally affecting the core behavior of aggregate rel itself (let's see how many of those cases we have down the road).

If you want to go with truly composable scenario, we probably better to have a standard wrappers or decorators to correct the behaviors (around superficial input and output) but how many of them will be there? I don't know... How many such behavioral differences to capture as flags? I don't know either but at least it is more practical in terms of implementation unless we introduce yet another "standard" wrapper framework.

We could go with this as an extension without introducing another behavior flags. Interested systems could just use these extensions (somehow discover), not bothering Substrait core. I have discussed this in the past meeting before sending out this PR. 😄

If the community is not agreeing to have this, I'm fine with going with the extension model. In the end, for this one, it is just a flag. We may discuss some standard decorator of rels...

@benbellick
Copy link
Member

@yongchul thanks for the response! I see the point that you are making, especially about the case of affecting the semantics of an operation in a more profound way.

I'm coming at this with a bias toward writing Substrait plans that are generically executable rather than conforming to a specific execution model, but I recognize that might not always be practical.

Given that, the compatibility flag approach does seem like the easiest path forward here.

If we do take this approach, is it possible then to include clear documentation of which systems exhibit which behaviors (as @jacques-n mentioned), along with explicit guidance on when to use compatibility flags vs extensions? (in the site documentation)

@vbarua
Copy link
Member

vbarua commented Dec 2, 2025

I missed your update between all of the other updates going on 😅

I think I understand what you're trying to accomplish with the per relation Compatibility message. It effectively becomes a bag in which we can capture all the toggles we might need for engine Compatibility, and it avoids polluting the relation itself with optional toggle fields.

On flip side, if you are talking to one system, the behavior is likely fixed

I can imagine having plan-level overrides for this in the future, but I think keeping them per relation makes the most sense for now. I'm broadly in favour of a per relation Compatibility message using enum toggles for all the settings. For each setting, we should document a clear default for when it is unset.

Table Function Discussion

I'm also in favour of re-using existing primitives and trying for composability when possible. I could see something like @benbellick's proposed make_empty_if_single_all_zero_row working for this, but like YongChul I'm not sure if this would be applicable for every type of compatibility issue, and I could see it getting a little onerous if we needed to set lots of different toggles for a single relation. Think

toggle_fn3(
    toggle_fn2(
      toggle_fn1(
        AggregateRel 
)))

In cases where the compatibility toggles can expressed as table functions, I can imagine mentioning it as such in the documentation. Then a consumer can handle the toggle by taking the aggregate and wrapping the function around it, and it's effectively sugar on the relation.

When we go to release 1.0, if we realized that all the toggles can be handle like that.

To Ben's point, I do think that guidance around these kinds of toggles, and when to add them, would be useful. While they let us capture behavioral deviations in the ecosystem, they also fragment it. Though, if we can identify the various different behaviour of core relations, we can also start thinking about compatibility shims for them all.

Cursed Golf Question

Does something like this

SELECT SUM(<col>) FROM T

also return 0 for those engines if T is empty?

Then the make_empty_if_single_all_zero_row trick wouldn't work because we couldn't distinguish between an aggregate of an empty table returning 0, and an aggregate where all the values where 0.

@yongchul
Copy link
Contributor Author

yongchul commented Dec 2, 2025

Thanks @vbarua! Comments are inlined... (I personally dislike github comment system especially not having thread if the comment is at PR)

I missed your update between all of the other updates going on 😅

I think I understand what you're trying to accomplish with the per relation Compatibility message. It effectively becomes a bag in which we can capture all the toggles we might need for engine Compatibility, and it avoids polluting the relation itself with optional toggle fields.

On flip side, if you are talking to one system, the behavior is likely fixed

I can imagine having plan-level overrides for this in the future, but I think keeping them per relation makes the most sense for now. I'm broadly in favour of a per relation Compatibility message using enum toggles for all the settings. For each setting, we should document a clear default for when it is unset.

We are in agreement at least compatibility message per relation when it needed for now. I will add documentations for the toggles.

Table Function Discussion

I'm also in favour of re-using existing primitives and trying for composability when possible. I could see something like @benbellick's proposed make_empty_if_single_all_zero_row working for this, but like YongChul I'm not sure if this would be applicable for every type of compatibility issue, and I could see it getting a little onerous if we needed to set lots of different toggles for a single relation.

I'm not against the composability but what you will need for this (and probably other such modifiers) need to separate the core operator and its input. That said, it shoold be

ToggleFunc[AggregateRel] - Input

but not

ToggleFunc - AggregateRel - Input

You see ToggleFunc effectively wrapping the AggregateRel operator and inspect both input and output? Or, the ToggleFunc should take Func<Enumerable<Row> -> Enumerable<Row>>, Func<()->Enumerable<Row> with all schema derivations. I'm not sure whether we want this savvy type inference. Also, we need to define a type hierarchy (or traits) of relops to incorporate the existing relops to capture it as a high-order function IMO.

Or, at least, we would want to define a special modifier table function that takes a relop (no input), AND relops (inputs) to clearly capture above scenario. This is just to workaround the elaborated high order function type system. I can see that special modifier table function is properly composable but not just any table function.

For more simple composability, Assert operator is a very good example.

AssertOneRow(Rel)

which raise runtime error when it sees more than 1 row (useful to guard from decorelated subqueries). More generic version would be

Assert(rel, expression, "message")

where raise runtime error when expression is false if any of the rows from the rel. Note that unlike the make empty if single all zero row, assert does not look into the gut of the rels. It just sees the output and modifies the behavior.

BTW, I may propose the assert rel soon... 😄

Think

toggle_fn3(
    toggle_fn2(
      toggle_fn1(
        AggregateRel 
)))

In cases where the compatibility toggles can expressed as table functions, I can imagine mentioning it as such in the documentation. Then a consumer can handle the toggle by taking the aggregate and wrapping the function around it, and it's effectively sugar on the relation.

Good case but I'm not sure whether we need this... yet. At least, we can capture it in terms of the modifier table functions I put ahead. Perhaps, we can even flatten the hierarchy in that special modifier table functions.

To Ben's point, I do think that guidance around these kinds of toggles, and when to add them, would be useful. While they let us capture behavioral deviations in the ecosystem, they also fragment it. Though, if we can identify the various different behaviour of core relations, we can also start thinking about compatibility shims for them all.

I agree. I hope things not going too crazy and I'm reasonably optimistic that we don't grow beyond 10. Also, we should capture this in the dialect so that when systems talk they should know whether it is supported or not.

Cursed Golf Question

Does something like this

SELECT SUM(<col>) FROM T

also return 0 for those engines if T is empty?

For this, SUM is NULL when T is empty, that's SQL.

Then the make_empty_if_single_all_zero_row trick wouldn't work because we couldn't distinguish between an aggregate of an empty table returning 0, and an aggregate where all the values where 0.

The intent for this compatibility is only applicable when an empty grouping set is defined.

SELECT SUM(<col>) FROM T GROUP BY ()

is equivalent to your example but in Substrait, it can be represented as empty grouping set. If the flag or compat mode is true, then this will yield 0 row. Otherwise, it will return 1 row with NULL.

@vbarua
Copy link
Member

vbarua commented Dec 2, 2025

For future reference I found an online SQL SERVER widget https://onecompiler.com/sqlserver/446em8jwe that helped me verify that

SELECT SUM(<col>) FROM T GROUP BY GROUPING SETS (())

returns 0 rows when T is empty as you said, which also make me more inclined to avoid wrapper approach.

@jacques-n
Copy link
Contributor

The table function composable pattern seems too reductionist to me. As a bit of a history note, a couple of other people were exploring ideas around the same time I started working on Substrait. They were modeling scalar and set operations as simply functions (just different kinds of inputs/outputs). I went with a more structured approach of having those be two different concepts because the cognitive overhead was lower when you're actually working with plans. I think the same is true here. Over-decomposition makes it harder for everyone to work on things.

For the specific of boolean versus enum... I'm generally a pattern of what @vbarua said in situations where there are likely more than two imaginable variants of a behavior. In this particular circumstance, I'm not sure that exists but I'm not sure it doesn't. I'm find with enum OR bool.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants