-
Notifications
You must be signed in to change notification settings - Fork 62
Use ICU4X built-in data #482
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
160ed2e to
d637c00
Compare
d637c00 to
5b83760
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like how much simpler this PR makes Parley and Parley Data, but I worry about how it impacts future work, which you may be able to provide expertise in guiding.
Removing the compartmentalisation provided by AnalysisDataSources would make it harder for us to enable a BYO data mechanism IIUC. In the short term, we want to enable support for complex scripts (and for consumers to pass that data in).
For context, we want to enable a workflow such that, on the web, we can ship the binary separately from ICU data to clients. This enables us to evolve a binary (which is more volatile than ICU data) without the consumer needing to download the same ICU data each binary version.
Separately, this enables a more "pay for what you use" approach with the application layer deciding what ICU data may be provided for a given application state (which may evolve during a client's session).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For reference, this was the original idea
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've reverted the changes to AnalysisDataSources
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For context, we want to enable a workflow such that, on the web, we can ship the binary separately from ICU data to clients. This enables us to evolve a binary (which is more volatile than ICU data) without the consumer needing to download the same ICU data each binary version.
Your current architecture of using compiled data does not allow for that. Changing to serde data for low level functionality like normalization and properties is not something that we recommend.
None of this data is particularly big. Once we get into complex segmentation data, that strategy makes more sense, but not for the data that this PR changes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changing to serde data for low level functionality like normalization and properties is not something that we recommend.
I was hoping it could be blob based.
None of this data is particularly big.
I think this is relative. The unicode data stored in Parley represents 6% of my binary quota currently 😅 . We might be able to use lazy module instantiation in Wasm to get around that, but it will become an area of optimisation - but, certainly, in the future (and not in the near term!)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like this change reduces the size of the Vello Editor example from 9.7 MB to 9.57 MB 🎉
274fe95 to
0781142
Compare
taj-p
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM 🎉
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changing to serde data for low level functionality like normalization and properties is not something that we recommend.
I was hoping it could be blob based.
None of this data is particularly big.
I think this is relative. The unicode data stored in Parley represents 6% of my binary quota currently 😅 . We might be able to use lazy module instantiation in Wasm to get around that, but it will become an area of optimisation - but, certainly, in the future (and not in the near term!)
Co-authored-by: Taj Pereira <[email protected]>
|
Thank you @robertbastian for the contribution!! This simplifies |
The data that is currently being generated in
parley_datais the same data that ICU4X ships with. However, usingtry_new_unstableconstructors with custom data providers can be less efficient than enabling thecompiled_datafeature, as these constructors do runtime lookups and branching, whereas mostcompiled_dataconstructors areconst.Benchmarks look neutral: