Use ICU4X built-in data #482

robertbastian · 2025-12-09T23:24:31Z

The data that is currently being generated in parley_data is the same data that ICU4X ships with. However, using try_new_unstable constructors with custom data providers can be less efficient than enabling the compiled_data feature, as these constructors do runtime lookups and branching, whereas most compiled_data constructors are const.
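To illustrate the distinction drawn above, here is a minimal, std-only sketch. It does not use ICU4X's actual API: `Lookup`, `Provider`, `MapProvider`, the `"ccc"` key, and the toy table are all hypothetical stand-ins. The sketch contrasts a `const`, infallible constructor over baked data with a fallible constructor that has to locate its data through a provider at runtime.

```rust
use std::collections::HashMap;

/// Toy property table: (code point, combining class) pairs sorted by code point.
/// Illustrative only; this is not ICU4X's real data layout.
const BAKED_ENTRIES: &[(u32, u8)] = &[(0x0301, 230), (0x0323, 220)];

/// Hypothetical lookup object that borrows its table.
pub struct Lookup<'a> {
    entries: &'a [(u32, u8)],
}

impl Lookup<'static> {
    /// Compiled-data style: infallible and `const`, so it can be
    /// evaluated at compile time with no runtime lookup.
    pub const fn new() -> Self {
        Lookup { entries: BAKED_ENTRIES }
    }
}

/// Hypothetical provider trait: data is located by key at runtime.
pub trait Provider {
    fn load(&self, key: &str) -> Option<&[(u32, u8)]>;
}

/// Hypothetical provider backed by a hash map of tables.
pub struct MapProvider {
    pub tables: HashMap<String, Vec<(u32, u8)>>,
}

impl Provider for MapProvider {
    fn load(&self, key: &str) -> Option<&[(u32, u8)]> {
        self.tables.get(key).map(|v| v.as_slice())
    }
}

impl<'a> Lookup<'a> {
    /// Provider style: a runtime lookup plus an error branch on every construction.
    pub fn try_new_with_provider<P: Provider>(provider: &'a P) -> Result<Self, String> {
        const KEY: &str = "ccc";
        provider
            .load(KEY)
            .map(|entries| Lookup { entries })
            .ok_or_else(|| format!("no data for key {}", KEY))
    }

    /// Binary search over the sorted table; unlisted code points map to 0.
    pub fn get(&self, cp: u32) -> u8 {
        match self.entries.binary_search_by_key(&cp, |&(c, _)| c) {
            Ok(i) => self.entries[i].1,
            Err(_) => 0,
        }
    }
}
```

In this model, `Lookup::new()` can initialise a `const` item, whereas `try_new_with_provider` must run at runtime and its caller has to handle the missing-data case.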

Benchmarks look neutral:

Default Style - arabic 20 characters               [   9.9 us ...   9.7 us ]      -1.46%*
Default Style - latin 20 characters                [   4.5 us ...   4.3 us ]      -4.27%*
Default Style - japanese 20 characters             [   9.1 us ...   8.9 us ]      -2.30%*
Default Style - arabic 1 paragraph                 [  55.5 us ...  55.6 us ]      +0.13%
Default Style - latin 1 paragraph                  [  18.2 us ...  17.9 us ]      -1.49%*
Default Style - japanese 1 paragraph               [  76.8 us ...  76.9 us ]      +0.16%
Default Style - arabic 4 paragraph                 [ 234.0 us ... 235.1 us ]      +0.48%
Default Style - latin 4 paragraph                  [  69.0 us ...  68.2 us ]      -1.05%*
Default Style - japanese 4 paragraph               [ 131.9 us ... 136.0 us ]      +3.11%
Styled - arabic 20 characters                      [  11.3 us ...  11.3 us ]      -0.43%
Styled - latin 20 characters                       [   6.3 us ...   6.3 us ]      -0.99%
Styled - japanese 20 characters                    [   9.9 us ...   9.7 us ]      -1.80%*
Styled - arabic 1 paragraph                        [  59.4 us ...  58.5 us ]      -1.40%
Styled - latin 1 paragraph                         [  23.7 us ...  23.3 us ]      -1.82%*
Styled - japanese 1 paragraph                      [  86.6 us ...  87.5 us ]      +1.05%*
Styled - arabic 4 paragraph                        [ 251.7 us ... 252.5 us ]      +0.32%
Styled - latin 4 paragraph                         [  90.4 us ...  89.1 us ]      -1.45%*
Styled - japanese 4 paragraph                      [ 123.7 us ... 124.0 us ]      +0.24%

taj-p · 2025-12-23T19:35:36Z

parley/src/analysis/mod.rs

I like how much simpler this PR makes Parley and Parley Data, but I worry about how it impacts future work, where your expertise could help guide us.

Removing the compartmentalisation provided by AnalysisDataSources would make it harder for us to enable a BYO data mechanism IIUC. In the short term, we want to enable support for complex scripts (and for consumers to pass that data in).

For context, we want to enable a workflow such that, on the web, we can ship the binary separately from ICU data to clients. This enables us to evolve a binary (which is more volatile than ICU data) without the consumer needing to download the same ICU data with each binary version.

Separately, this enables a more "pay for what you use" approach with the application layer deciding what ICU data may be provided for a given application state (which may evolve during a client's session).
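As a rough sketch of the "bring your own data" workflow described above (not Parley's actual `AnalysisDataSources` API), the following std-only model keeps a small baked default in the binary and lets the application register additional blobs while it runs. `DataSource`, `DataRegistry`, the `"common"` and `"thai"` names, and the byte contents are all hypothetical.

```rust
use std::collections::HashMap;

/// Placeholder for data compiled into the binary (hypothetical contents).
const BAKED_COMMON: &[u8] = &[0x01, 0x02, 0x03];

/// Hypothetical data source: either baked into the binary or
/// supplied by the application as an owned blob at runtime.
pub enum DataSource {
    Baked(&'static [u8]),
    Blob(Vec<u8>),
}

impl DataSource {
    pub fn bytes(&self) -> &[u8] {
        match self {
            DataSource::Baked(b) => *b,
            DataSource::Blob(v) => v.as_slice(),
        }
    }
}

/// Hypothetical registry: the application decides which data is
/// present at any point in the session.
pub struct DataRegistry {
    sources: HashMap<String, DataSource>,
}

impl DataRegistry {
    /// Start with only the data that ships inside the binary.
    pub fn with_baked_defaults() -> Self {
        let mut sources = HashMap::new();
        sources.insert("common".to_string(), DataSource::Baked(BAKED_COMMON));
        DataRegistry { sources }
    }

    /// Register (or replace) a blob that was downloaded separately from the binary.
    pub fn register_blob(&mut self, name: &str, blob: Vec<u8>) {
        self.sources.insert(name.to_string(), DataSource::Blob(blob));
    }

    /// Returns `None` when the application has not supplied this data,
    /// so the caller can fall back to simpler behaviour.
    pub fn get(&self, name: &str) -> Option<&[u8]> {
        self.sources.get(name).map(|s| s.bytes())
    }
}
```

Under these assumptions, updating the binary does not force clients to download the blobs again, and data for a given script is only fetched and registered once the application needs it.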

taj-p · Dec 28, 2025

For reference, this was the original idea

robertbastian · Feb 5, 2026

I've reverted the changes to AnalysisDataSources.

robertbastian · Feb 5, 2026

> For context, we want to enable a workflow such that, on the web, we can ship the binary separately from ICU data to clients. This enables us to evolve a binary (which is more volatile than ICU data) without the consumer needing to download the same ICU data with each binary version.

Your current architecture of using compiled data does not allow for that. Changing to serde data for low level functionality like normalization and properties is not something that we recommend.

None of this data is particularly big. Once we get into complex segmentation data, that strategy makes more sense, but not for the data that this PR changes.

taj-p · 2025-12-23T19:55:22Z

parley/src/analysis/mod.rs

Looks like this change reduces the size of the Vello Editor example from 9.7 MB to 9.57 MB 🎉

taj-p

LGTM 🎉

parley_data/src/lib.rs

taj-p · 2026-02-09T23:21:32Z

parley/src/analysis/mod.rs

> Changing to serde data for low level functionality like normalization and properties is not something that we recommend.

I was hoping it could be blob based.

> None of this data is particularly big.

I think this is relative. The Unicode data stored in Parley currently represents 6% of my binary quota 😅. We might be able to use lazy module instantiation in Wasm to get around that, but it will become an area of optimisation in the future (though certainly not in the near term!).

taj-p · 2026-02-10T19:42:31Z

Thank you @robertbastian for the contribution!! This simplifies parley_data a lot and improves our binary size 🙌 🙏 🎉

robertbastian force-pushed the baked-data branch 2 times, most recently from 160ed2e to d637c00 on December 22, 2025 22:56

robertbastian marked this pull request as ready for review December 22, 2025 23:01

robertbastian added 3 commits December 23, 2025 11:12

use compiled_data feature instead of datagen

32cd871

remove AnalysisDataSources

559028b

slim parley_data

5b83760

robertbastian force-pushed the baked-data branch from d637c00 to 5b83760 on December 23, 2025 10:15

taj-p reviewed Dec 23, 2025

robertbastian mentioned this pull request Feb 5, 2026

Add API for setting complex segmentation data #533

Open

robertbastian added 4 commits February 5, 2026 09:50

Revert "slim parley_data"

983bdc4

This reverts commit 5b83760.

Revert "remove AnalysisDataSources"

e1c11f4

This reverts commit 559028b.

Partially reapply "slim parley_data"

445bee0

Merge remote-tracking branch 'upstream/main' into baked-data

eaebd30

robertbastian requested a review from taj-p February 5, 2026 09:13

Merge remote-tracking branch 'upstream/main' into baked-data

0781142

robertbastian force-pushed the baked-data branch from 274fe95 to 0781142 on February 5, 2026 14:17

taj-p approved these changes Feb 9, 2026

Update parley_data/src/lib.rs

f9c6a74

Co-authored-by: Taj Pereira <[email protected]>

robertbastian requested a review from taj-p February 10, 2026 09:31

taj-p added this pull request to the merge queue Feb 10, 2026

Merged via the queue into linebender:main with commit ac25155 Feb 10, 2026
24 checks passed

robertbastian deleted the baked-data branch February 10, 2026 19:56
