-
Notifications
You must be signed in to change notification settings - Fork 188
docs: clarify valid URNs #881
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
60c70d7 to
2b2591c
Compare
yongchul
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just heads up. I'm not a native speaker so take lots of grain of salts with my nit comments about sentence and grammar. :)
vbarua
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left a suggestion. The meaning/parsing of Extension URNs changes slightly if we prefix urn: in front of them, versus replacing extension: with urn:. Let me know what you think.
site/docs/extensions/index.md
Outdated
| * Table Functions | ||
|
|
||
| To extend these items, developers can create one or more YAML files that describe the properties of each of these extensions. Each YAML file must include a required `urn` field that uniquely identifies the extension. While these identifiers are URN-like but not technically URNs (they lack the `urn:` prefix), they will be referred to as `extension URNs` for clarity. | ||
| To extend these items, developers can create one or more YAML files that describe the properties of each of these extensions. Each YAML file must include a required `urn` field that uniquely identifies the extension. While these identifiers are URN-like but not technically URNs (they lack the `urn:` prefix), they will be referred to as `extension URNs` for clarity. These URNs must be valid [RFC 8141](https://www.rfc-editor.org/rfc/rfc8141.html) format without the `urn:` prefix. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor adjustment for clarity.
| To extend these items, developers can create one or more YAML files that describe the properties of each of these extensions. Each YAML file must include a required `urn` field that uniquely identifies the extension. While these identifiers are URN-like but not technically URNs (they lack the `urn:` prefix), they will be referred to as `extension URNs` for clarity. These URNs must be valid [RFC 8141](https://www.rfc-editor.org/rfc/rfc8141.html) format without the `urn:` prefix. | |
| To extend these items, users can create one or more YAML files that describe the properties of each of these extensions. Each YAML file must include a required `urn` field that uniquely identifies the extension. These identifiers are URN-like but not technically URNs (they are prefixed with `extension:` instead of `urn:`), and will be referred to as `extension URNs` for clarity. | |
| Extension URNs must be valid [RFC 8141](https://www.rfc-editor.org/rfc/rfc8141.html) URNs when replacing `extension:` with `urn:`. |
| Simple extensions within a plan are split into three components: an extension URN, an extension declaration and a number of references. | ||
|
|
||
| * **Extension URN**: A unique identifier for the extension following the format `extension:<OWNER>:<ID>` that identifies a YAML document specifying one or more specific extensions. Declares an anchor that can be used in extension declarations. | ||
| * **Extension URN**: A unique identifier for the extension following the format `extension:<OWNER>:<ID>` that identifies a YAML document specifying one or more specific extensions. Declares an anchor that can be used in extension declarations. The URN with the `urn:` prefix added must conform to [RFC 8141](https://www.rfc-editor.org/rfc/rfc8141.html). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The URN with the
urn:prefix added must conform to RFC 8141.
It's a bit weird to say this. The way we've structured them now maps to
| <NID> | <NSS>
extension:<owner>:<id>
I guess they do conform to the RFC if we prefix urn, but the interpretation would be different technically:
| <NID> | <NSS>
urn:extension:<owner>:<id>
What do you think about:
The Extension URN with the
extension:replaced withurn:must conform to RFC 8141
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm... I see the point you are making. However, we don't actually enforce that the structure of the <owner> part is reverse DNS. Why don't we just loosen the restriction on the urn entirely and say:
The
urnis required to be a valid URN whenurn:is prepended to the string. The format must conform tourn:extension:<Identifier>. The recommended format for the identifier is<Reverse-DNS-Name>:<any-valid-name>. This is consistent with the default substrait extensions and prevents name collisions.
To me, this feels more consistent with the urn spec. Maybe its just how my brain works, but saying "urn: added to the front makes it a valid URN" makes more sense to me than saying "urn: replacing extension: makes it a valid URN".
|
@jacques-n I have altered the regex to be |
d0792a9 to
a9ed3f5
Compare
The required regex is: ^extension:[^:]+:[^:]+$
a9ed3f5 to
8247607
Compare
site/docs/extensions/index.md
Outdated
| - `OWNER` represents the organization or entity providing the extension and should follow [reverse domain name convention](https://en.wikipedia.org/wiki/Reverse_domain_name_notation) (e.g., `io.substrait`, `com.example`, `org.apache.arrow`) to prevent name collisions | ||
| - `ID` is the specific identifier for the extension (e.g., `functions_arithmetic`, `custom_types`) | ||
|
|
||
| These URNs must match the regex `^extension:[a-zA-Z0-9_.-]+:[a-zA-Z0-9_.-]+$`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now that I see it again, why do we allow upper case if we were to allow only narrow set of characters?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you recommend an even more restrictive urn? How about:
^extension:[a-z0-9_.-]+:[a-z0-9_.-]+$
i.e same thing without capital letters.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I updated it to be without capital letters, let me know what you think!
`^extension:[a-z0-9_.-]+:[a-z0-9_.-]+$`
mbrobbel
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking at this again I'm wondering if it wouldn't be easier to just use URNs as defined in RFC 8141?
@mbrobbel The problem is that defining in terms of RFC 8141 is inherently clunky because of the decision to not include
The problem with the first approach above is then The second approach is technically fine, but a bit clunky IMO. The third approach seems simpler all in all as you can check the string against a regex directly. It also gives us more flexibility in the future to add things like versioning to the end when we are ready. Also, the inspiration for this approach to using URN-like things came from java's maven, which is also not a general purpose URN. I am open to relying on the RFC 8141 spec, but it doesn't seem to me that that is necessarily the simplest solution. In hindsight I wish that we had included |
|
I would also prefer to stay closer to the RFC.
Wikipedia says that the NID should be registered with IANA according to the RFC which probably makes |
@nielspardon I do think that that is a good approach. What if we then said that valid URNs are |
sure, we can do the change as a 1.0 item. Just saying if we want to give this another go we probably should consider that NID should be something we could register with IANA if we wanted to. |
|
Sounds good. We will still need to introduce some sort of regex to validate the |
As discovered in this discussion with @mbrobbel, there is a need to clarify a URN ambiguity. Current
urnimplementation across java, python, and go assumes that there are exactly two colons, i.e. they all are using the regex^extension:[^:]+:[^:]+$.Instead, we clarify that urns as defined here are exactly as in rfc 8141 but with theurn:prefix cut off.We update the documentation to enforce regex
^extension:[^:]+:[^:]+$.