@benbellick
Member

@benbellick benbellick commented Nov 3, 2025

As discovered in this discussion with @mbrobbel, there is a need to clarify a URN ambiguity. The current URN implementations in Java, Python, and Go all assume exactly two colons, i.e., they all use the regex ^extension:[^:]+:[^:]+$.
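To make the two-colon assumption concrete, here is a minimal Python sketch of a check built on the regex quoted above. The helper name `matches_two_colon_form` is hypothetical and is not taken from any of the Substrait libraries.

```python
import re

# The pattern quoted in the description above. With fullmatch the ^/$ anchors
# are redundant, but they are kept so the pattern reads the same as in the text.
TWO_COLON_PATTERN = re.compile(r"^extension:[^:]+:[^:]+$")


def matches_two_colon_form(urn: str) -> bool:
    """Hypothetical helper: True if `urn` is extension:<a>:<b> with no further colons."""
    return TWO_COLON_PATTERN.fullmatch(urn) is not None


if __name__ == "__main__":
    for candidate in (
        "extension:io.substrait:functions_arithmetic",  # exactly two colons
        "extension:io.substrait:functions:arithmetic",  # a third colon
        "extension:io.substrait",                       # only one colon
    ):
        print(candidate, matches_two_colon_form(candidate))
```

Only the first candidate passes: an RFC 8141 NSS may contain further colons, but this pattern does not allow them, which is the ambiguity being discussed.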

Instead, we clarify that URNs as defined here are exactly as in RFC 8141 but with the urn: prefix cut off.

We update the documentation to enforce regex ^extension:[^:]+:[^:]+$.

@benbellick benbellick force-pushed the ben.bellick/clarify-urn-structure branch from 60c70d7 to 2b2591c Compare November 3, 2025 16:30
@benbellick benbellick requested a review from mbrobbel November 3, 2025 16:30
@benbellick benbellick marked this pull request as ready for review November 3, 2025 17:50
Contributor

@yongchul yongchul left a comment

Just a heads-up: I'm not a native speaker, so take my nit comments about sentences and grammar with a grain of salt. :)

@benbellick benbellick changed the title docs: clarify valid URNs ':' usage docs: clarify valid URNs Nov 3, 2025
@benbellick benbellick requested a review from yongchul November 3, 2025 21:14
Member

@vbarua vbarua left a comment

Left a suggestion. The meaning/parsing of Extension URNs changes slightly if we prefix urn: in front of them, versus replacing extension: with urn:. Let me know what you think.

* Table Functions

To extend these items, developers can create one or more YAML files that describe the properties of each of these extensions. Each YAML file must include a required `urn` field that uniquely identifies the extension. While these identifiers are URN-like but not technically URNs (they lack the `urn:` prefix), they will be referred to as `extension URNs` for clarity.
To extend these items, developers can create one or more YAML files that describe the properties of each of these extensions. Each YAML file must include a required `urn` field that uniquely identifies the extension. While these identifiers are URN-like but not technically URNs (they lack the `urn:` prefix), they will be referred to as `extension URNs` for clarity. These URNs must be valid [RFC 8141](https://www.rfc-editor.org/rfc/rfc8141.html) format without the `urn:` prefix.
Member

Minor adjustment for clarity.

Suggested change
To extend these items, developers can create one or more YAML files that describe the properties of each of these extensions. Each YAML file must include a required `urn` field that uniquely identifies the extension. While these identifiers are URN-like but not technically URNs (they lack the `urn:` prefix), they will be referred to as `extension URNs` for clarity. These URNs must be valid [RFC 8141](https://www.rfc-editor.org/rfc/rfc8141.html) format without the `urn:` prefix.
To extend these items, users can create one or more YAML files that describe the properties of each of these extensions. Each YAML file must include a required `urn` field that uniquely identifies the extension. These identifiers are URN-like but not technically URNs (they are prefixed with `extension:` instead of `urn:`), and will be referred to as `extension URNs` for clarity.
Extension URNs must be valid [RFC 8141](https://www.rfc-editor.org/rfc/rfc8141.html) URNs when replacing `extension:` with `urn:`.

Simple extensions within a plan are split into three components: an extension URN, an extension declaration and a number of references.

* **Extension URN**: A unique identifier for the extension following the format `extension:<OWNER>:<ID>` that identifies a YAML document specifying one or more specific extensions. Declares an anchor that can be used in extension declarations.
* **Extension URN**: A unique identifier for the extension following the format `extension:<OWNER>:<ID>` that identifies a YAML document specifying one or more specific extensions. Declares an anchor that can be used in extension declarations. The URN with the `urn:` prefix added must conform to [RFC 8141](https://www.rfc-editor.org/rfc/rfc8141.html).
Member

> The URN with the urn: prefix added must conform to RFC 8141.

It's a bit weird to say this. The way we've structured them now maps to

         | <NID> | <NSS>
extension:<owner>:<id>

I guess they do conform to the RFC if we prefix urn, but the interpretation would be different technically:

   | <NID>   |   <NSS>
urn:extension:<owner>:<id>

What do you think about:

The Extension URN with the extension: replaced with urn: must conform to RFC 8141
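The difference between the two readings can be sketched in Python. This is an illustration only: `split_nid_nss` is a hypothetical helper that splits on the first two colons and does not validate anything against RFC 8141.

```python
def split_nid_nss(urn: str) -> tuple[str, str]:
    """Hypothetical helper: split 'urn:<NID>:<NSS>' on the first two colons.

    This only separates the parts; it is not an RFC 8141 validator.
    """
    prefix, nid, nss = urn.split(":", 2)  # raises ValueError if a part is missing
    if prefix != "urn":
        raise ValueError(f"expected a 'urn:' prefix, got {prefix!r}")
    return nid, nss


extension_urn = "extension:io.substrait:functions_arithmetic"

# Reading 1: prepend "urn:" to the extension URN.
prepended = "urn:" + extension_urn
# Reading 2: replace "extension:" with "urn:".
replaced = "urn:" + extension_urn[len("extension:"):]

print(split_nid_nss(prepended))  # NID is "extension", NSS is "<owner>:<id>"
print(split_nid_nss(replaced))   # NID is "<owner>", NSS is "<id>"
```

With the prefix prepended, `extension` becomes the NID and the owner moves into the NSS; with the prefix replaced, the owner itself is the NID.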

Member Author

Hmm... I see the point you are making. However, we don't actually enforce that the structure of the <owner> part is reverse DNS. Why don't we just loosen the restriction on the urn entirely and say:

The urn is required to be a valid URN when urn: is prepended to the string. The format must conform to urn:extension:<Identifier>. The recommended format for the identifier is <Reverse-DNS-Name>:<any-valid-name>. This is consistent with the default substrait extensions and prevents name collisions.

To me, this feels more consistent with the URN spec. Maybe it's just how my brain works, but saying "urn: added to the front makes it a valid URN" makes more sense to me than saying "urn: replacing extension: makes it a valid URN".

@benbellick
Member Author

benbellick commented Nov 18, 2025

After a discussion with @vbarua, it makes sense to just enforce that the "URN" is something compliant with the regex ^extension:[^:]+:[^:]+$. We can say it is URN-like, but we may as well be overly restrictive.

This regex is exactly what is implemented in the Substrait libs for Java, Go, and Python.

@jacques-n I have altered the regex to be ^extension:[a-z0-9_.-]+:[a-z0-9_.-]+$.

playground

@benbellick benbellick force-pushed the ben.bellick/clarify-urn-structure branch from d0792a9 to a9ed3f5 Compare November 18, 2025 19:52
The required regex is:

^extension:[^:]+:[^:]+$
@benbellick benbellick force-pushed the ben.bellick/clarify-urn-structure branch from a9ed3f5 to 8247607 Compare November 18, 2025 19:54
- `OWNER` represents the organization or entity providing the extension and should follow [reverse domain name convention](https://en.wikipedia.org/wiki/Reverse_domain_name_notation) (e.g., `io.substrait`, `com.example`, `org.apache.arrow`) to prevent name collisions
- `ID` is the specific identifier for the extension (e.g., `functions_arithmetic`, `custom_types`)

These URNs must match the regex `^extension:[a-zA-Z0-9_.-]+:[a-zA-Z0-9_.-]+$`.
Contributor

Now that I see it again, why do we allow upper case if we only allow a narrow set of characters?

Member Author

Do you recommend an even more restrictive urn? How about:
^extension:[a-z0-9_.-]+:[a-z0-9_.-]+$

i.e., the same thing without capital letters.

Member Author

I updated it to be without capital letters, let me know what you think!

`^extension:[a-z0-9_.-]+:[a-z0-9_.-]+$`
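As a rough illustration of what the narrower, lowercase-only pattern changes relative to the earlier `[^:]+` form, here is a Python sketch. The function name `is_valid_extension_urn` is hypothetical, and the example URNs other than the `io.substrait` one are made up for the comparison.

```python
import re

# Earlier, permissive form: anything except ':' in each segment.
PERMISSIVE = re.compile(r"extension:[^:]+:[^:]+")
# Narrowed form from this thread: lowercase letters, digits, '_', '.', '-'.
STRICT = re.compile(r"extension:[a-z0-9_.-]+:[a-z0-9_.-]+")


def is_valid_extension_urn(urn: str) -> bool:
    """Hypothetical validator using the narrowed pattern.

    fullmatch is used instead of match with ^...$ because, in Python, `$`
    also matches just before a trailing newline.
    """
    return STRICT.fullmatch(urn) is not None


if __name__ == "__main__":
    candidates = [
        "extension:io.substrait:functions_arithmetic",
        "extension:com.Example:custom_types",   # uppercase letter
        "extension:com.example:my types",       # space
        "extension:com.example:a:b",            # extra colon
    ]
    for c in candidates:
        permissive_ok = PERMISSIVE.fullmatch(c) is not None
        print(f"{c!r}: permissive={permissive_ok}, strict={is_valid_extension_urn(c)}")
```

The uppercase and whitespace cases pass the permissive pattern but fail the narrowed one; the extra-colon case fails both.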
@benbellick benbellick requested a review from yongchul November 21, 2025 20:36
Member

@mbrobbel mbrobbel left a comment

Looking at this again I'm wondering if it wouldn't be easier to just use URNs as defined in RFC 8141?

@benbellick
Member Author

> Looking at this again I'm wondering if it wouldn't be easier to just use URNs as defined in RFC 8141?

@mbrobbel The problem is that defining in terms of RFC 8141 is inherently clunky because of the decision to not include urn: in the string. This leaves us with three choices:

  1. clarify that an extension urn called extension:<rest> is only valid if urn:extension:<rest> is RFC 8141 valid,
  2. clarify that an extension urn called extension:<rest> is only valid if urn:<rest> is RFC 8141 valid, or
  3. give our own format which we describe as URN-like and then formally give a regex.

The problem with the first approach above is then extension becomes the NID, and so we have to put restrictions on what the NSS can be anyways.

The second approach is technically fine, but a bit clunky IMO.

The third approach seems simpler all in all as you can check the string against a regex directly. It also gives us more flexibility in the future to add things like versioning to the end when we are ready.

Also, the inspiration for this approach of using URN-like things came from Java's Maven, which is also not a general-purpose URN.

I am open to relying on the RFC 8141 spec, but it doesn't seem to me that that is necessarily the simplest solution. In hindsight I wish that we had included urn: at the beginning 😅. Another possible solution is to migrate to using urn: at the start and take option 1 (with an extra regex for the <NSS>), but that means an extra migration. In that case, we might as well go with 3 for now and tackle 1 later so as not to have two URN-related migrations happening at once.

@nielspardon
Member

I would also prefer to stay closer to the RFC.

> The problem with the first approach above is then extension becomes the NID, and so we have to put restrictions on what the NSS can be anyways.

Wikipedia says that the NID should be registered with IANA according to the RFC which probably makes extension not a good choice for a unique namespace identifier and something like substrait would be a better choice so you could have something like:

urn:substrait:extension:<rest>

@benbellick
Member Author

benbellick commented Dec 1, 2025

> I would also prefer to stay closer to the RFC.
>
> > The problem with the first approach above is then extension becomes the NID, and so we have to put restrictions on what the NSS can be anyways.
>
> Wikipedia says that the NID should be registered with IANA according to the RFC which probably makes extension not a good choice for a unique namespace identifier and something like substrait would be a better choice so you could have something like:
>
> urn:substrait:extension:<rest>

@nielspardon I do think that that is a good approach. What if we then said that valid URNs are urn:substrait:extension:<rest> where the urn:substrait portion is optional? That way we don't have to do a migration yet, but we could later transition to the fully explicit URN. I would prefer not to do any migration at the moment, considering we are in the middle of the uri -> urn migration.
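A minimal Python sketch of how an optional `urn:substrait` portion could be handled, assuming the idea floated in this comment. Nothing here is an adopted rule or an existing library API; `to_explicit_urn` and `OPTIONAL_PREFIX` are hypothetical names, and the `<rest>` part is not validated.

```python
# Assumed optional prefix from the proposal "urn:substrait:extension:<rest>".
OPTIONAL_PREFIX = "urn:substrait:"


def to_explicit_urn(identifier: str) -> str:
    """Hypothetical normalizer: accept the short or the explicit form and
    return the fully explicit 'urn:substrait:extension:<rest>' form."""
    if identifier.startswith(OPTIONAL_PREFIX):
        short_form = identifier[len(OPTIONAL_PREFIX):]
    else:
        short_form = identifier
    if not short_form.startswith("extension:"):
        raise ValueError(f"not an extension URN: {identifier!r}")
    return OPTIONAL_PREFIX + short_form


if __name__ == "__main__":
    print(to_explicit_urn("extension:io.substrait:functions_arithmetic"))
    print(to_explicit_urn("urn:substrait:extension:io.substrait:functions_arithmetic"))
```

Both calls yield the same explicit string, so existing short-form identifiers would keep working without a migration.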

@nielspardon
Member

> That way we don't have to do a migration yet, but we could later transition to the fully explicit URN. I would prefer not to do any migration at the moment, considering we are in the middle of the uri -> urn migration.

Sure, we can do the change as a 1.0 item. Just saying that if we want to give this another go, we should probably consider that the NID should be something we could register with IANA if we wanted to.

@benbellick
Member Author

benbellick commented Dec 1, 2025

Sounds good. We will still need to introduce some sort of regex to validate the <NSS> component. This would be required to register with IANA anyways. We may also want to hold off on any registration until 1.0, when we have a stable idea of what these should look like (e.g. we will want to include a version in the string eventually).

6 participants