Conversation

@valadaptive (Contributor) commented Feb 1, 2026

This PR adds an API to Parley for loading locale-specific text analysis data at runtime: LayoutContext::load_segmenter_models_auto and LayoutContext::load_segmenter_models_dictionary.

The data itself is now generated by parley_data_gen, and exported as static byte slices from the parley_data crate if the bundled-segmenter-models feature is enabled. The intent is that if you're deploying a desktop application, you can just pass the models in from parley_data directly, but if you're deploying to the web, you can depend on parley_data at build time and export the model data into files to be dynamically loaded.
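For illustration, here is a minimal sketch of the desktop path. The exact signatures of SegmenterModelData::from_blob and load_segmenter_models_auto, and the re-export paths, are assumptions inferred from this description:

use parley::{LayoutContext, SegmenterModelData};

#[derive(Clone, Copy, Default, Debug, PartialEq)]
struct Brush;

fn context_with_thai_segmentation() -> Result<LayoutContext<Brush>, icu_provider::DataError> {
    let mut layout_cx: LayoutContext<Brush> = LayoutContext::new();
    // Wrap a bundled postcard blob from parley_data (requires its
    // `bundled-segmenter-models` feature) in a model handle...
    let thai = SegmenterModelData::from_blob(parley_data::bundled_models::THAI_LSTM)?;
    // ...and load it so Thai word/line breaking uses the LSTM at runtime.
    layout_cx.load_segmenter_models_auto(vec![thai]);
    Ok(layout_cx)
}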

There are some required supporting changes:

  • parley_data now generates dictionary providers too, even though the baked data doesn't provide any dictionaries and just errors out. This matches how it works with LSTMs.

  • I noticed that the code becomes a lot cleaner, especially with some new abstractions, if LineSegmenters doesn't have to borrow itself mutably and uses OnceCell instead of Option.

    Right now, to juggle lifetimes properly, analyze_text moves the LineSegmenters struct out of the parent LayoutContext, and then moves it back in after analysis. I noticed that the LineSegmenters struct is 5 KB(!), so getting rid of those moves probably helps with performance too.

There's an opportunity for future improvement: right now, the data has to be loaded from a &'static [u8] or Box<[u8]>, since that's the interface that icu_provider_blob exposes. We could change the API to use linebender_resource_handle::Blob, but it requires that type to be compatible with Yoke. I did implement this, but it hasn't yet made it into a release. This may be a bit trickier; icu_provider_blob has a private API that it doesn't expose...

SegmenterModelData::from_blob returns an icu_provider::DataError; I'm not sure if we want to "leak" that error type publicly. Parley, as far as I can tell, does not define any error types of its own yet. Would an unwrap/expect be better here?

@conor-93 (Contributor) left a comment

Wow, fantastic work!

Comment on lines 43 to 116
-    normal: Option<LineSegmenter>,
-    keep_all: Option<LineSegmenter>,
-    break_all: Option<LineSegmenter>,
+    normal: OnceCell<LineSegmenter>,
+    keep_all: OnceCell<LineSegmenter>,
+    break_all: OnceCell<LineSegmenter>,
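
The OnceCell version allows lazy initialization through a shared reference. A minimal sketch of the pattern, mirroring the fields above (the constructor closure is left abstract):

use core::cell::OnceCell;

use icu_segmenter::LineSegmenter;

struct LineSegmenters {
    normal: OnceCell<LineSegmenter>,
}

impl LineSegmenters {
    // With OnceCell, `&self` is enough: the cell initializes on first access,
    // so analyze_text no longer has to move the struct out of LayoutContext
    // to satisfy the borrow checker.
    fn normal(&self, build: impl FnOnce() -> LineSegmenter) -> &LineSegmenter {
        self.normal.get_or_init(build)
    }
}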
Contributor

> getting rid of those moves probably helps with performance too.

I'm seeing a nice performance improvement on this PR!

Default Style - arabic 20 characters               [  10.5 us ...  10.1 us ]      -3.50%*
Default Style - latin 20 characters                [   4.7 us ...   4.5 us ]      -4.19%*
Default Style - japanese 20 characters             [   9.5 us ...   9.2 us ]      -3.94%*
Default Style - arabic 1 paragraph                 [  58.3 us ...  57.9 us ]      -0.58%
Default Style - latin 1 paragraph                  [  19.5 us ...  19.2 us ]      -1.63%
Default Style - japanese 1 paragraph               [  86.6 us ...  85.1 us ]      -1.69%*
Default Style - arabic 4 paragraph                 [ 245.2 us ... 245.4 us ]      +0.07%
Default Style - latin 4 paragraph                  [  72.4 us ...  72.4 us ]      +0.03%
Default Style - japanese 4 paragraph               [ 115.8 us ... 114.6 us ]      -0.99%
Styled - arabic 20 characters                      [  11.4 us ...  11.1 us ]      -2.48%*
Styled - latin 20 characters                       [   6.2 us ...   5.9 us ]      -5.24%*
Styled - japanese 20 characters                    [  10.6 us ...  10.1 us ]      -4.33%*
Styled - arabic 1 paragraph                        [  61.7 us ...  60.8 us ]      -1.48%
Styled - latin 1 paragraph                         [  23.6 us ...  23.0 us ]      -2.27%*
Styled - japanese 1 paragraph                      [  91.0 us ...  89.8 us ]      -1.27%
Styled - arabic 4 paragraph                        [ 264.5 us ... 260.1 us ]      -1.68%
Styled - latin 4 paragraph                         [ 122.0 us ... 122.8 us ]      +0.68%
Styled - japanese 4 paragraph                      [ 171.5 us ... 168.7 us ]      -1.66%

Comment on lines 94 to 95
/// Unlike [`Self::load_segmenter_models_auto`] which uses LSTM for Southeast Asian scripts, this uses dictionary
/// data for all complex scripts. Dictionary models may be faster at runtime but require more data.
Contributor

super nit: I think it's more correct to refer to these as just 'dictionaries'; they're not models in the sense that LSTMs are, more just lookup tables of known words. There are a few other usages of this phrasing in this PR, too.

Suggested change
-/// Unlike [`Self::load_segmenter_models_auto`] which uses LSTM for Southeast Asian scripts, this uses dictionary
-/// data for all complex scripts. Dictionary models may be faster at runtime but require more data.
+/// Unlike [`Self::load_segmenter_models_auto`] which uses LSTM for Southeast Asian scripts, this uses dictionary
+/// data for all complex scripts. Dictionaries may be faster at runtime but require more data.

Contributor Author

Reworded the documentation a bit.

Contributor

This is great! I've yet to take a close look at the PR; I have time to do that tomorrow. But one question I have is whether we're still planning on allowing the extraction of all Unicode data from the binary so that it can be provided at runtime.

Contributor Author

> But one question I have is whether we're still planning on allowing the extraction of all Unicode data from the binary so that it can be provided at runtime.

I can move in that direction if it turns out to be necessary, but the existing data (that is, everything but the LSTMs and dictionaries) is pretty small. The source files are ~50KB compressed. Moving those out of the library would require them to be provided before any text can be laid out, and might slow things down (the baked data is regular compiler-optimizable code; dynamically-loaded data would not be).

The tradeoff would become even worse if we decide to switch to Packtab, which (IIRC) produces smaller tables and faster lookups, but requires the compiler to be able to optimize (and inline!) the exact series of shifts/multiplies/etc used in each lookup.

Contributor

> might slow things down (the baked data is regular compiler-optimizable code; dynamically-loaded data would not be)

I think CodePointTrie construction might be slower, but I'm not sure whether it would affect lookup.


Nice! Yes, I validated that finding. It's about 50 kB gzipped (12% of total binary) and 240 kB uncompressed (17% of total binary) if the binary just lays out some text in Wasm:

use parley::{FontContext, LayoutContext, StyleProperty, GenericFamily, Layout, Alignment, AlignmentOptions};

#[derive(Clone, Copy, Default, Debug, PartialEq)]
struct Brush;

#[no_mangle]
pub extern "C" fn create_layout() -> usize {
    let mut font_cx = FontContext::new();
    let mut layout_cx = LayoutContext::new();
    
    let text = "Hello World!";
    let mut builder = layout_cx.ranged_builder(&mut font_cx, text, 1.0, true);
    builder.push_default(GenericFamily::SansSerif);
    builder.push_default(StyleProperty::FontSize(16.0));
    
    let mut layout: Layout<Brush> = builder.build(text);
    layout.break_all_lines(Some(200.0));
    layout.align(Some(200.0), Alignment::Start, AlignmentOptions::default());
    
    layout.width() as usize
}

For me, that represents 6% of my total binary quota 😅. It's not nothing, but I agree it's probably not worth bikeshedding at this stage. I think the first port of call might be to enable ICU data sharing with HarfRust, and if I had to choose between this and general Parley latency performance, I would choose the latter at this time.

@taj-p self-requested a review on February 3, 2026 at 03:56
Contributor

Did you mean to add a Burmese font so we don't need to render tofu glyphs?

I guess it's clear that the 'with models' version is segmenting differently!

Contributor Author

This test was added a bit later than the Thai one; I think a Burmese font wouldn't hurt.

Contributor

This looks correct from my playing around in Google. Does the line segmentation for the Thai text look correct to you, @waywardmonkeys?

}

#[cfg(feature = "runtime-segmenter-data")]
pub(crate) fn load_segmenter_models(
Contributor

@robertbastian - I was wondering if you would be able to give this code a once-over from the perspective of an ICU4X maintainer 🙏

Contributor

How does a consumer know whether they should load in an external segmenter to correctly lay out some input text?

If I understand the API currently, it requires the consumer to decide upfront: either by pre-scanning the text before it reaches Parley, or by making assumptions based on the type of user and where they may be located.

Since there are only a small set of complex scripts needing this attention, I wonder whether we can use a bitset over a single byte to track this during analysis and provide it back to the caller in Layout.
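
As a rough sketch of that idea (the names and the single-byte layout are illustrative, and the ranges cover only the core blocks, not the full SA set):

/// Hypothetical per-analysis flags recording which complex scripts were seen.
#[derive(Clone, Copy, Default)]
struct ComplexScripts(u8);

impl ComplexScripts {
    const THAI: u8 = 1 << 0;
    const LAO: u8 = 1 << 1;
    const MYANMAR: u8 = 1 << 2;
    const KHMER: u8 = 1 << 3;

    /// Record the script of one code point during analysis.
    fn observe(&mut self, c: char) {
        self.0 |= match c {
            '\u{0E00}'..='\u{0E7F}' => Self::THAI,
            '\u{0E80}'..='\u{0EFF}' => Self::LAO,
            '\u{1000}'..='\u{109F}' => Self::MYANMAR,
            '\u{1780}'..='\u{17FF}' => Self::KHMER,
            _ => 0,
        };
    }

    /// The caller can check this after layout to decide whether to load data.
    fn needs_complex_data(self) -> bool {
        self.0 != 0
    }
}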

Contributor Author

This might take a bit more work; I'll need to figure out how ICU4X itself determines characters' locales. Not sure if it's a matter of asking each individual model about the character in turn, or if there's an optimization it uses here.

Contributor

ICU4X uses complex data if a code point has the SA property. Inside that code path we use the table from the spec to decide which model to use.

I don't think you or your callers should pre-scan the text to determine if complex breaking is required. If you do that, you need to have the data available anyway. Loading the data is very cheap (if you use compiled data, not what this PR is doing), having the data is the problem that some users might want to avoid.
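
For reference, the SA check looks roughly like this with icu_properties (compiled-data style of the 1.x line; treat the import paths as approximate):

use icu_properties::{maps, LineBreak};

// A code point takes the complex path when its Line_Break class is
// Complex_Context, i.e. the "SA" class from UAX #14.
fn needs_complex_breaking(c: char) -> bool {
    maps::line_break().get(c) == LineBreak::ComplexContext
}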

}

#[cfg(feature = "runtime-segmenter-data")]
pub(crate) fn load_segmenter_models(
Contributor

My understanding is that, since this is a replace operation, consumers cannot easily append models. For example, if a user starts with Latin, then adds Thai, then a few minutes later adds Burmese text, when we go to load the Burmese text we'll need to re-supply the Thai model and call load_segmenter_models([thai, burmese]).

I'm unsure whether we should support an append API and what that might look like. If it's too complex, happy to proceed as is and look at this another time.

Contributor Author

I assumed ICU4X treated things immutably, but it turns out there is a push method on MultiForkByErrorProvider. I've added an append API.


@valadaptive (Contributor Author)

One thing I just realized about using OnceCell in LineSegmenters: it means that LayoutContext is no longer Sync. We can avoid this by just borrowing each individual field from AnalysisDataSources instead, but that'd involve various other code changes. Should I do that as part of this PR?

@taj-p (Contributor) commented Feb 5, 2026

> One thing I just realized about using OnceCell in LineSegmenters: it means that LayoutContext is no longer Sync. We can avoid this by just borrowing each individual field from AnalysisDataSources instead, but that'd involve various other code changes. Should I do that as part of this PR?

If it's a fast follow, I think it's fine to do in the next PR. We just want to avoid keeping main in a state where Layout isn't Sync. Other consumers who point to main may be impacted.

If it's not a lot of work, I recommend getting that PR up before merging this one. Otherwise, I'll leave it to you on whether you want to merge and fast follow instead.

@taj-p (Contributor) commented Feb 5, 2026

> One thing I just realized about using OnceCell in LineSegmenters: it means that LayoutContext is no longer Sync. We can avoid this by just borrowing each individual field from AnalysisDataSources instead, but that'd involve various other code changes. Should I do that as part of this PR?

I wonder if we should assert this case somewhere to prevent this from happening.
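
One way to assert it is a compile-time guard along these lines (paths and bounds are assumptions; the function fails to compile if LayoutContext stops being Send + Sync):

// Only type-checks if the argument type is Send + Sync, so a regression
// (e.g. an un-synchronized OnceCell field) becomes a build error.
fn assert_send_sync<T: Send + Sync>() {}

fn _layout_context_stays_send_sync<B: parley::Brush + Send + Sync>() {
    assert_send_sync::<parley::LayoutContext<B>>();
}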

@robertbastian (Contributor) left a comment

From an ICU4X perspective this is overly complicated. I don't see a reason to use a buffer provider which has a deserialization cost at runtime. You can also hide the compiled data behind a Cargo feature.

#482 should be merged before any more data provider work.


Contributor

issue: you're generating data without customizing it. this way you pay the cost of deserializing your data at runtime, with no benefit compared to compiled data

you should merge at least the first commit of #482 before doing any more work with ICU4X data

@valadaptive (Contributor Author)

> From an ICU4X perspective this is overly complicated. I don't see a reason to use a buffer provider which has a deserialization cost at runtime. You can also hide the compiled data behind a Cargo feature.

I believe the reason to separate the data is for web use cases. If the code changes more frequently than the ICU4X data, you can ship the two separately and serve the latter with a much longer cache life. That's what I understood from what @taj-p said, anyway.

@nicoburns (Collaborator)

> One thing I just realized about using OnceCell in LineSegmenters: it means that LayoutContext is no longer Sync. We can avoid this by just borrowing each individual field from AnalysisDataSources instead, but that'd involve various other code changes. Should I do that as part of this PR?

As a general rule, I think using OnceCell / OnceLock outside of statics / thread locals is almost always a mistake. You can usually just use Option and handle the initialization checking manually.
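
The Option-based shape, sketched with this PR's field names (the builder closure is left abstract):

use icu_segmenter::LineSegmenter;

struct LineSegmenters {
    normal: Option<LineSegmenter>,
}

impl LineSegmenters {
    // Manual lazy init: requiring `&mut self` instead of interior mutability
    // keeps the containing LayoutContext Sync for free.
    fn normal(&mut self, build: impl FnOnce() -> LineSegmenter) -> &LineSegmenter {
        self.normal.get_or_insert_with(build)
    }
}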

impl_segmenter_break_grapheme_cluster_v1!($provider);
impl_segmenter_break_line_v1!($provider);
impl_normalizer_nfc_v1!($provider);
impl_segmenter_dictionary_auto_v1!($provider);
Contributor

these three contain no data. if you really need them (for bounds somewhere), consider not generating them and just adding empty DataProvider implementations

Contributor Author

They are necessary for bounds. Since impl_segmenter_lstm_auto_v1 is already empty and present before this PR, could I just make that change separately and land it beforehand?

Contributor

well, #482 does this

Comment on lines +164 to +166
icu_segmenter::provider::SegmenterBreakWordV1::INFO,
icu_segmenter::provider::SegmenterBreakWordOverrideV1::INFO,
icu_segmenter::provider::SegmenterBreakLineV1::INFO,
Contributor

you shouldn't need to add these three markers to the postcard data

Contributor Author

This is necessary for trait bounds, otherwise I get errors like:

error[E0277]: the trait bound `BakedProvider: DataProvider<SegmenterDictionaryAutoV1>` is not satisfied
   --> parley/src/analysis/mod.rs:217:54
    |
217 |                 WordSegmenter::try_new_auto_unstable(&combined, WordBreakOptions::default())
    |                 ------------------------------------ ^^^^^^^^^ unsatisfied trait bound
    |                 |
    |                 required by a bound introduced by this call
    |

Contributor

this error cannot be due to that. the compiler does not know which markers are in a postcard file

#[cfg(feature = "bundled-segmenter-models")]
pub mod bundled_models {
/// Thai LSTM model for word/line segmentation.
pub const THAI_LSTM: &[u8] = include_bytes!(
Contributor

consider documenting the sizes of these so clients can make more informed decisions

Contributor Author

Are their sizes expected to change? While exporting them from parley_data means they won't change without explicit regeneration, I'm worried the docs might drift out of sync with the actual data.

Contributor

they might. you could generate the whole bundled_models module in parley_data_gen

///
/// These are pre-generated postcard blobs that can be loaded at runtime.
///
/// You can also depend on `parley_data` in your build, and export these models to files that you can ship alongside
Contributor

I would just re-export this module in parley, so clients don't have to depend on parley_data
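
That re-export could be as small as (a sketch):

// In parley's lib.rs, gated the same way as in parley_data:
#[cfg(feature = "bundled-segmenter-models")]
pub use parley_data::bundled_models;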

}
.expect("Failed to create WordSegmenter with runtime models");

// Clear cached line segmenters; they will be lazily recreated with the new mode.
Contributor

why not do the same for all segmenters?

Contributor Author

I believe the line segmenters and word segmenter are the only ones that use complex script data; correct me if I'm wrong. We eagerly reinitialize the word segmenter, and lazily reinitialize the line segmenters.

Contributor

> I believe the line segmenters and word segmenter are the only ones that use complex script data

yes. I meant why not treat the four segmenters that use complex script data the same

Contributor

you should really do it eagerly, because if you have clients passing you arbitrary data, the constructors can fail and you want to return an error instead of panicking.

the non-custom constructors (try_new_for_non_complex_scripts) are infallible, so you can create these in AnalysisDataSources::new(). however the unstable constructors are pretty fallible with user-supplied data, so you should eagerly construct them in load_segmenter_models to return errors to the caller
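
A sketch of that eager shape (the try_new_auto_unstable call mirrors the compiler error quoted earlier; the surrounding struct, the error type, and the as_deserializing adapter path are assumptions):

use icu_provider::AsDeserializingBufferProvider;
use icu_provider_blob::BlobDataProvider;
use icu_segmenter::{WordBreakOptions, WordSegmenter};

struct AnalysisDataSources {
    word: WordSegmenter,
}

impl AnalysisDataSources {
    // Eager construction: malformed user-supplied blobs surface here as an
    // Err at load time instead of panicking later during analysis.
    fn load_segmenter_models(
        &mut self,
        provider: &BlobDataProvider,
    ) -> Result<(), icu_provider::DataError> {
        let deserializing = provider.as_deserializing();
        self.word =
            WordSegmenter::try_new_auto_unstable(&deserializing, WordBreakOptions::default())?;
        Ok(())
    }
}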

#[cfg(feature = "runtime-segmenter-data")]
pub(crate) fn load_segmenter_models(
&mut self,
providers: Vec<icu_provider_blob::BlobDataProvider>,
Contributor

I'd move the .provider and collect() here, to encapsulate the logic more. Then the SegmenterModelData.provider field can be private, and the fact that you collect into a Vec stays an internal detail as well.

Suggested change
-    providers: Vec<icu_provider_blob::BlobDataProvider>,
+    models: impl IntoIterator<Item = SegmenterModelData>,

Contributor Author

The reason I collected into a Vec in the public methods on Layout is to avoid monomorphizing this function, which contains a lot more actual logic.
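
That split, sketched abstractly (hypothetical names):

struct SegmenterModelData;

struct LayoutContext;

impl LayoutContext {
    // Thin generic shim: monomorphized per call site, but trivially small.
    pub fn load_segmenter_models_auto(
        &mut self,
        models: impl IntoIterator<Item = SegmenterModelData>,
    ) {
        self.load_segmenter_models(models.into_iter().collect());
    }

    // The real logic takes a concrete Vec, so it is compiled exactly once.
    fn load_segmenter_models(&mut self, models: Vec<SegmenterModelData>) {
        let _ = models;
        // ...
    }
}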

    }
    Some(buffer_provider) => {
        let cur_mode = buffer_provider.segmenter_mode;
        assert_eq!(
Contributor

this is an undocumented panic. the public append functions only talk about performance issues with duplicated models

I'm not sold on the use case of the append methods. when would a client not supply the models all at once?

Contributor Author

cc @taj-p

Contributor

> when would a client not supply the models all at once?

For our own web app, this would give us an opportunity to load in models as the user adds to their content, mostly to minimise memory usage (one of our leading concerns).

Contributor

Even then that should be done through the other constructor. If you want to add two new models, calling the append function twice does unnecessary work.


  pub mod bundled_models {
      /// Thai LSTM model for word/line segmentation.
-     pub const THAI_LSTM: &[u8] = include_bytes!(
+     pub static THAI_LSTM: &[u8] = include_bytes!(
Contributor

uhm I think const was better. it's a const reference, so while the reference might be copied around, the thing behind it ('static) won't be

@robertbastian (Contributor)

Can you please rename this PR to something like "Add API for setting complex segmentation data"? "locale data" is very generic.

@valadaptive changed the title from "Add locale data API" to "Add API for setting complex segmentation data" on Feb 5, 2026
@robertbastian (Contributor) commented Feb 5, 2026

I put some suggestions in robertbastian/parley@5aa9888...99b1500 including:

  • propagating errors to the caller, which requires eager construction
  • removing the append logic, so we don't have to store the provider anymore

@robertbastian (Contributor)

Merge for #482 in 5aa9888
