Skip to content

Conversation

@wackywendell
Copy link
Contributor

@wackywendell wackywendell commented Jul 25, 2025

The Advanced Extensions section of the docs seemed somewhat mismatched from what was in the protobufs, so I tried to clarify it, based on what was in the protobuf definitions.

Thoughts welcome! Also could use some sanity-checking on the details!

Naming

I also included custom relations (ExtensionLeafRel, …) and custom reads and writes (ExtensionTable, …) all under the heading "Advanced Extensions", even though there is an AdvancedExtension message that doesn't cover those. The name "Advanced Extension" seems a bit ambiguous here - does it cover all of the above, or only the enhancements and optimizations in the AdvancedExtension message? - but it seems to be what we have, so I went with it. Thoughts?

Guidance

This is a bit low on guidance on when to use which - e.g. ExtensionLeafRel, ExtensionTable, or ReadRel.advanced_extension.enhancement could potentially all be used for an unusual kind of read. I'm not sure what the guidance here should be, or if there should be any, so I left it out; I'm not sure any guidance here would be that helpful.

@wackywendell wackywendell marked this pull request as ready for review July 25, 2025 20:46

Advanced extensions come in several main forms, discussed below:

1. Embedded extensions: These use the `AdvancedExtension` message for adding custom data to existing Substrait elements
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure or don't remember whether element is commonly used across the doc. Now that I check again, advanced extensions are generously sprinkled over the proto file... :) That said, it would be clearer to mention the scope of the embedded extension like adding custom data to enclosing Substrait message (or element).

The above mentioned protobuf and call out Any. So I'm wondering whether we can stick to Substraite message rather than using element.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense, message makes more sense than element for consistency!

| **`RelCommon`** | Extensions for any relational operator |
| **Relations** (e.g. `ProjectRel`) | Extensions for a specific relation type |
| **Hints** | Extensions within optimization hints |
| **`ReadRel.NamedTable`** | Add custom metadata to named table references |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: remove Add to be consistent with the rest of Usages? (i.e., either all starts with verbs or nouns)

| **`RelCommon`** | Extensions for any relational operator |
| **Relations** (e.g. `ProjectRel`) | Extensions for a specific relation type |
| **Hints** | Extensions within optimization hints |
| **`ReadRel.NamedTable`** | Add custom metadata to named table references |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks like LocalFiles also has advanced extension.

| **Hints** | Extensions within optimization hints |
| **`ReadRel.NamedTable`** | Add custom metadata to named table references |
| **`WriteRel.NamedObjectWrite`** | Add custom metadata to write targets |
| **`DdlRel.NamedObjectWrite`** | Add custom metadata to DDL targets |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

noticed some of the advanced extension tags are not consistent in these Ddl related messages. Just note to self.

@wackywendell
Copy link
Contributor Author

@yongchul - thanks for the review! Good comments - I've updated to address them. Can you give them a look?

Thanks!

Copy link
Contributor

@yongchul yongchul left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for taking suggestions! Looks good to me :)

yongchul
yongchul previously approved these changes Jul 30, 2025
@wackywendell
Copy link
Contributor Author

Thanks again! Incorporated. I also realized I had the !!! note sections incorrect (they needed indenting), so I fixed that, and ran mkdocs locally to verify it worked - and it did!

@wackywendell
Copy link
Contributor Author

^ Thanks, @yongchul for the merge! I'm wondering - what do we need to do next to advance this? Do we need additional reviewers so we can get it approved and merged?

@yongchul
Copy link
Contributor

^ Thanks, @yongchul for the merge! I'm wondering - what do we need to do next to advance this? Do we need additional reviewers so we can get it approved and merged?

Thank you for your patience! We will need another SMC review I believe. I'll bring this up in the sync up.

- Update proto syntax to use new %%% format
- Reorder sections to prioritize custom read/write types over custom relations
Copy link
Member

@benbellick benbellick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left some minor comments but this is a welcome change!

!!! note "Enhancements vs Optimizations"

* Use **optimizations** for performance hints that don't change semantics and can be safely ignored.
* Use **enhancements** for semantic changes that must be understood by consumers or the plan cannot be executed correctly.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: perhaps we could move this section to after the two sections below? It feels like it makes sense to compare them only after explaining what each thing is.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm - I see your point, although there's already a note there about Enhancement Constraints. Should we put the two notes one after the other, or leave this where it is?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah that's a good point. 🤷 lets just leave it as it is for now then.

| Inputs | 2 |
| Outputs | 1 |
| Property Maintenance | Distribution is maintained. Orderedness is eliminated. |
| Input Order | Same as the [Join](logical_relations.md#join-operation) operator. |
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for doing all of these changes here 🚀

@benbellick
Copy link
Member

@yongchul @wackywendell As I understand it, the governance rules dictate that only one SMC approval is necessary to get documentation changes in.

Since @yongchul is an SMC, do we have to wait for any other approvals? Or can we just merge it?

@yongchul
Copy link
Contributor

@yongchul @wackywendell As I understand it, the governance rules dictate that only one SMC approval is necessary to get documentation changes in.

Since @yongchul is an SMC, do we have to wait for any other approvals? Or can we just merge it?

I'm going to merge after re-reading the changes one last time. :smile


Advanced extensions come in several main forms, discussed below:

1. Embedded extensions: These use the `AdvancedExtension` message for adding custom data to existing Substrait messages
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not in this PR but it would be great if we can just easily reference this to the actual proto message definition. perhaps, onetime clean up is necessary.

- Modify the semantic behavior of operations
- Must be understood by consumers or the plan cannot be executed correctly
- Only one enhancement per message
- Examples: specialized join conditions (e.g. fuzzy matching, geospatial) or sorting (e.g. clustering)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

clustering is unclear. can we have more detailed example or expand clustering a bit? Fetch with tie is a good example but I have a pending PR #797 for this 😆

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. I was thinking something like similarity matching, but that's a bit of a complicated example to explain in a few words, so I switched for locale-aware string ordering; does that make sense to you?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

locale aware sort is possible already as you can specify a custom comparison operation per field. Shall we just drop it for now and add an example later once we have a good one?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, you're right - I hadn't seen the custom comparison operator before.

I've dropped the sorting example.

@yongchul
Copy link
Contributor

@wackywendell thank you for the changes and sorry about taking it so long! Please resolve the two minor changes and it looks good to me to go!

@wackywendell
Copy link
Contributor Author

Thanks, @yongchul! I updated, let me know how it looks to you - especially the sorting example, clustering was unclear as you noted!

@jacques-n
Copy link
Contributor

The one issue I see with the extensions updates here is that they act like the only way to express something is using protobuf. Our goal was to spec first, with proto being an implementation. That being said, the reality is a bit different than that.

Clarify in the advanced extensions description that Protobuf is *an* implementation of Substrait. It's the only one for now, but not the only possible one.
@wackywendell
Copy link
Contributor Author

@jacques-n - good point. I was looking at it, and while I think most of the protobuf references are hard to get away from - e.g. field names, hierarchy - the top-level description did clearly treat the Protobuf representation as the only one, so I changed that; see 5fce283.

For the other places where it references Protobuf messages directly, that seems to be in line with the rest of the docs, and also seems like a good place to show implementations; and in theory, they could also be updated to multiple 'dropdowns' showing the representations in various forms if other implementations existed.

Does that make sense to you, or were there more specific ways we could address this?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants