Add API for setting complex segmentation data #533
Conversation
Force-pushed from 11cde1f to 09214bd
conor-93 left a comment
Wow, fantastic work!
parley/src/analysis/mod.rs (outdated):

```diff
-    normal: Option<LineSegmenter>,
-    keep_all: Option<LineSegmenter>,
-    break_all: Option<LineSegmenter>,
+    normal: OnceCell<LineSegmenter>,
+    keep_all: OnceCell<LineSegmenter>,
+    break_all: OnceCell<LineSegmenter>,
```
getting rid of those moves probably helps with performance too.
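For readers following along, here is a minimal sketch of why the `OnceCell` version avoids the moves; the types are stand-ins, not Parley's real definitions:

```rust
use std::cell::OnceCell;

// `Segmenter` stands in for the real `LineSegmenter`, which is expensive to build.
struct Segmenter;

struct LineSegmenters {
    normal: OnceCell<Segmenter>,
}

impl LineSegmenters {
    // `OnceCell::get_or_init` only needs `&self`; writing a lazily-built value
    // back into an `Option` needs `&mut self`. Dropping that `&mut` requirement
    // is what lets `analyze_text` stop moving the ~5 KB struct out of the
    // parent context and back.
    fn normal(&self) -> &Segmenter {
        self.normal.get_or_init(|| Segmenter)
    }
}

fn main() {
    let segmenters = LineSegmenters { normal: OnceCell::new() };
    let _seg = segmenters.normal(); // built on first access, cached afterwards
}
```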
I'm seeing a nice performance improvement on this PR!
```
Default Style - arabic 20 characters   [  10.5 us ...  10.1 us ]  -3.50%*
Default Style - latin 20 characters    [   4.7 us ...   4.5 us ]  -4.19%*
Default Style - japanese 20 characters [   9.5 us ...   9.2 us ]  -3.94%*
Default Style - arabic 1 paragraph     [  58.3 us ...  57.9 us ]  -0.58%
Default Style - latin 1 paragraph      [  19.5 us ...  19.2 us ]  -1.63%
Default Style - japanese 1 paragraph   [  86.6 us ...  85.1 us ]  -1.69%*
Default Style - arabic 4 paragraph     [ 245.2 us ... 245.4 us ]  +0.07%
Default Style - latin 4 paragraph      [  72.4 us ...  72.4 us ]  +0.03%
Default Style - japanese 4 paragraph   [ 115.8 us ... 114.6 us ]  -0.99%
Styled - arabic 20 characters          [  11.4 us ...  11.1 us ]  -2.48%*
Styled - latin 20 characters           [   6.2 us ...   5.9 us ]  -5.24%*
Styled - japanese 20 characters        [  10.6 us ...  10.1 us ]  -4.33%*
Styled - arabic 1 paragraph            [  61.7 us ...  60.8 us ]  -1.48%
Styled - latin 1 paragraph             [  23.6 us ...  23.0 us ]  -2.27%*
Styled - japanese 1 paragraph          [  91.0 us ...  89.8 us ]  -1.27%
Styled - arabic 4 paragraph            [ 264.5 us ... 260.1 us ]  -1.68%
Styled - latin 4 paragraph             [ 122.0 us ... 122.8 us ]  +0.68%
Styled - japanese 4 paragraph          [ 171.5 us ... 168.7 us ]  -1.66%
```
parley/src/context.rs (outdated):

```rust
/// Unlike [`Self::load_segmenter_models_auto`] which uses LSTM for Southeast Asian scripts, this uses dictionary
/// data for all complex scripts. Dictionary models may be faster at runtime but require more data.
```
super nit: I think it's more correct to refer to these as just 'dictionaries'; they're not models in the sense that LSTMs are, more just lookup tables of known words. There are a few other usages of this phrasing in this PR, too.
```diff
 /// Unlike [`Self::load_segmenter_models_auto`] which uses LSTM for Southeast Asian scripts, this uses dictionary
-/// data for all complex scripts. Dictionary models may be faster at runtime but require more data.
+/// data for all complex scripts. Dictionaries may be faster at runtime but require more data.
```
Reworded the documentation a bit.
This is great! I've yet to take a close look at the PR - I have time to do that tomorrow. But, one question I have is whether we're still planning on allowing the extraction of all unicode data from the binary so that it can be provided at runtime
> But, one question I have is whether we're still planning on allowing the extraction of all unicode data from the binary so that it can be provided at runtime
I can move in that direction if it turns out to be necessary, but the existing data (that is, everything but the LSTMs and dictionaries) is pretty small. The source files are ~50KB compressed. Moving those out of the library would require them to be provided before any text can be laid out, and might slow things down (the baked data is regular compiler-optimizable code; dynamically-loaded data would not be).
The tradeoff would become even worse if we decide to switch to Packtab, which (IIRC) produces smaller tables and faster lookups, but requires the compiler to be able to optimize (and inline!) the exact series of shifts/multiplies/etc used in each lookup.
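To make the Packtab point concrete, here is an illustrative two-stage table lookup (not Packtab's actual output): because the tables and shifts are compile-time constants, the optimizer can inline and constant-fold the whole chain, which dynamically loaded tables would prevent.

```rust
// Illustrative two-stage lookup over baked tables. With `const` tables and
// fixed shifts/masks, the compiler can inline this entire function; if the
// tables arrived at runtime, every lookup would go through opaque memory.
const STAGE1: [u8; 256] = [0; 256];
const STAGE2: [u8; 65536] = [0; 65536];

fn lookup(c: u32) -> u8 {
    // Stage 1 maps the high byte to a block index; stage 2 holds the blocks.
    let block = STAGE1[(c >> 8) as usize & 0xFF] as usize;
    STAGE2[(block << 8) | (c as usize & 0xFF)]
}

fn main() {
    assert_eq!(lookup('ก' as u32), 0); // Thai code point, placeholder tables
}
```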
> might slow things down (the baked data is regular compiler-optimizable code; dynamically-loaded data would not be)
I think CodePointTrie construction might be slower, but I'm not sure whether it would affect lookup.
Nice! Yes, I validated that finding. It's about 50 kB gzipped (12% of total binary) and 240 kB uncompressed (17% of total binary) if the binary just lays out some text in Wasm:
```rust
use parley::{FontContext, LayoutContext, StyleProperty, GenericFamily, Layout, Alignment, AlignmentOptions};

#[derive(Clone, Copy, Default, Debug, PartialEq)]
struct Brush;

#[no_mangle]
pub extern "C" fn create_layout() -> usize {
    let mut font_cx = FontContext::new();
    let mut layout_cx = LayoutContext::new();
    let text = "Hello World!";
    let mut builder = layout_cx.ranged_builder(&mut font_cx, text, 1.0, true);
    builder.push_default(GenericFamily::SansSerif);
    builder.push_default(StyleProperty::FontSize(16.0));
    let mut layout: Layout<Brush> = builder.build(text);
    layout.break_all_lines(Some(200.0));
    layout.align(Some(200.0), Alignment::Start, AlignmentOptions::default());
    layout.width() as usize
}
```

For me, that represents 6% of my total binary quota 😅. It's not nothing, but I agree it's probably not worth bikeshedding at this stage. I think the first port of call might be to enable ICU data sharing with HarfRust, and if I had to choose between this and general Parley latency performance, I would choose the latter at this time.
Did you mean to add a Burmese font so we don't need to render tofu glyphs?
I guess it's clear that the with-models version is segmenting differently!
This test was added a bit later than the Thai one; I think a Burmese font wouldn't hurt.
This looks correct per my playing around in Google. Does the line segmentation for the Thai text look correct to you, @waywardmonkeys?
```rust
}

#[cfg(feature = "runtime-segmenter-data")]
pub(crate) fn load_segmenter_models(
```
@robertbastian - I was wondering if you would be able to give this code a once-over from the perspective of an ICU4X maintainer 🙏
How does a consumer know whether they should load in an external segmenter to correctly lay out some input text?
If I understand the API currently, it requires the consumer to decide upfront - either via pre-scanning text prior to Parley or making assumptions based on the type of user and where they may be located.
Since there are only a small set of complex scripts needing this attention, I wonder whether we could use a single-byte bitset to track this during analysis and provide it back to the caller in Layout.
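A hedged sketch of what this suggestion could look like; every name here is invented for illustration, nothing like it exists in Parley today:

```rust
// One byte is enough to flag which complex-script classes were seen during
// analysis; the caller can then fetch only the segmentation data it needs.
#[derive(Clone, Copy, Default, Debug, PartialEq)]
pub struct ComplexScripts(u8);

impl ComplexScripts {
    pub const THAI: Self = Self(1 << 0);
    pub const BURMESE: Self = Self(1 << 1);
    pub const KHMER: Self = Self(1 << 2);
    pub const LAO: Self = Self(1 << 3);

    /// Called from analysis when a matching code point is encountered.
    pub fn insert(&mut self, other: Self) {
        self.0 |= other.0;
    }

    /// Queried by the caller (e.g. via a hypothetical `Layout::complex_scripts()`)
    /// to decide which models to load.
    pub fn contains(&self, other: Self) -> bool {
        self.0 & other.0 == other.0
    }
}

fn main() {
    let mut seen = ComplexScripts::default();
    seen.insert(ComplexScripts::THAI);
    assert!(seen.contains(ComplexScripts::THAI));
    assert!(!seen.contains(ComplexScripts::BURMESE));
}
```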
This might take a bit more work; I'll need to figure out how ICU4X itself determines characters' locales. Not sure if it's a matter of asking each individual model about the character in turn, or if there's an optimization it uses here.
ICU4X uses complex data if a code point has the SA property. Inside that code path we use the table from the spec to decide which model to use.
I don't think you or your callers should pre-scan the text to determine if complex breaking is required. If you do that, you need to have the data available anyway. Loading the data is very cheap (if you use compiled data, not what this PR is doing); having the data is the problem that some users might want to avoid.
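For reference, the property check being described would look roughly like this with icu_properties; the exact constructor names follow my reading of the icu 2.x API and should be treated as an assumption:

```rust
use icu_properties::{props::LineBreak, CodePointMapData};

// A code point with Line_Break=SA ("Complex Context") is what routes ICU4X
// into the dictionary/LSTM code path.
fn needs_complex_segmentation(c: char) -> bool {
    CodePointMapData::<LineBreak>::new().get(c) == LineBreak::ComplexContext
}

fn main() {
    assert!(needs_complex_segmentation('ก')); // Thai
    assert!(!needs_complex_segmentation('a'));
}
```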
```rust
}

#[cfg(feature = "runtime-segmenter-data")]
pub(crate) fn load_segmenter_models(
```
My understanding is that, since this is a replace operation, consumers cannot easily append models. For example, if a user starts with Latin, then adds Thai, then a few minutes later adds Burmese text, when we go to load the Burmese data we'll need to re-prepare the Thai model to call `load_segmenter_models([thai, burmese])`.
I'm unsure whether we should support an append API and what that might look like. If it's too complex, happy to proceed as is and look at this another time.
I assumed ICU4X treated things immutably, but it turns out there is a `push` method on `MultiForkByErrorProvider`. I've added an append API.
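For the curious, the ICU4X mechanism being referred to looks roughly like this; the generic parameters and constructor names follow my reading of icu_provider_adapters/icu_provider_blob 2.x and are assumptions:

```rust
use icu_provider_adapters::fork::MultiForkByErrorProvider;
use icu_provider_blob::BlobDataProvider;

// Appending a model means pushing one more BlobDataProvider into the fork,
// rather than rebuilding the whole provider list from scratch.
fn append_model<F>(
    fork: &mut MultiForkByErrorProvider<BlobDataProvider, F>,
    blob: &'static [u8],
) -> Result<(), icu_provider::DataError> {
    let provider = BlobDataProvider::try_new_from_static_blob(blob)?;
    fork.push(provider);
    Ok(())
}
```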
Force-pushed from 426206c to d176c1d
One thing I just realized about using …

If it's a fast follow, I think it's fine to do in the next PR. We just want to avoid keeping …

If it's not a lot of work, I recommend getting that PR up before merging this one. Otherwise, I'll leave it to you on whether you want to merge and fast follow instead.

I wonder if we should assert this case somewhere to prevent this from happening.
robertbastian left a comment
From an ICU4X perspective this is overly complicated. I don't see a reason to use a buffer provider which has a deserialization cost at runtime. You can also hide the compiled data behind a Cargo feature.
#482 should be merged before any more data provider work.
issue: you're generating data without customizing it. this way you pay the cost of deserializing your data at runtime, with no benefit compared to compiled data
you should merge at least the first commit of #482 before doing any more work with ICU4X data
I believe the reason to separate the data is for web use cases. If the code changes more frequently than the ICU4X data, you can ship the two separately and serve the latter with a much longer cache life. That's what I understood from what @taj-p said, anyway.
As a general rule, I think using …
```rust
impl_segmenter_break_grapheme_cluster_v1!($provider);
impl_segmenter_break_line_v1!($provider);
impl_normalizer_nfc_v1!($provider);
impl_segmenter_dictionary_auto_v1!($provider);
```
these three contain no data. if you really need them (for bounds somewhere), consider not generating them and just adding empty DataProvider implementations
They are necessary for bounds. Since impl_segmenter_lstm_auto_v1 is already empty and present before this PR, could I just make that change separately and land it beforehand?
well, #482 does this
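For illustration, an "empty" implementation in the sense described above might look like this; the marker and error names follow my reading of icu_provider 2.x and should be double-checked:

```rust
use icu_provider::prelude::*;
use icu_segmenter::provider::SegmenterDictionaryAutoV1;

struct BakedProvider;

// Satisfies the trait bound without baking any dictionary data: every request
// is answered with "this marker has no data here".
impl DataProvider<SegmenterDictionaryAutoV1> for BakedProvider {
    fn load(
        &self,
        req: DataRequest,
    ) -> Result<DataResponse<SegmenterDictionaryAutoV1>, DataError> {
        Err(DataErrorKind::MarkerNotFound.with_req(SegmenterDictionaryAutoV1::INFO, req))
    }
}
```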
```rust
icu_segmenter::provider::SegmenterBreakWordV1::INFO,
icu_segmenter::provider::SegmenterBreakWordOverrideV1::INFO,
icu_segmenter::provider::SegmenterBreakLineV1::INFO,
```
you shouldn't need to add these three markers to the postcard data
This is necessary for trait bounds; otherwise I get errors like:
```
error[E0277]: the trait bound `BakedProvider: DataProvider<SegmenterDictionaryAutoV1>` is not satisfied
   --> parley/src/analysis/mod.rs:217:54
    |
217 |         WordSegmenter::try_new_auto_unstable(&combined, WordBreakOptions::default())
    |         ------------------------------------ ^^^^^^^^^ unsatisfied trait bound
    |         |
    |         required by a bound introduced by this call
    |
```
this error cannot be due to that. the compiler does not know which markers are in a postcard file
| #[cfg(feature = "bundled-segmenter-models")] | ||
| pub mod bundled_models { | ||
| /// Thai LSTM model for word/line segmentation. | ||
| pub const THAI_LSTM: &[u8] = include_bytes!( |
consider documenting the sizes of these so clients can make more informed decisions
Are their sizes expected to change? While exporting them from parley_data means they won't change without explicit regeneration, I'm worried the docs might drift out of sync with the actual data.
they might. you could generate the whole bundled_models module in parley_data_gen
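A sketch of that idea: parley_data_gen writes the module source itself, so each blob's documented size is computed from the bytes at generation time and cannot drift. The function shape and file paths here are illustrative.

```rust
use std::fmt::Write;

// Emit the whole `bundled_models` module, documenting each blob's exact size.
fn generate_bundled_models(models: &[(&str, &str, &[u8])]) -> String {
    let mut out = String::from("pub mod bundled_models {\n");
    for (name, description, bytes) in models {
        writeln!(out, "    /// {description} ({} bytes).", bytes.len()).unwrap();
        writeln!(
            out,
            "    pub const {name}: &[u8] = include_bytes!(\"{}.postcard\");",
            name.to_lowercase()
        )
        .unwrap();
    }
    out.push_str("}\n");
    out
}

fn main() {
    let src = generate_bundled_models(&[(
        "THAI_LSTM",
        "Thai LSTM model for word/line segmentation",
        &[0u8; 4],
    )]);
    println!("{src}");
}
```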
```rust
///
/// These are pre-generated postcard blobs that can be loaded at runtime.
///
/// You can also depend on `parley_data` in your build, and export these models to files that you can ship alongside
```
I would just reexport this module in parley, so clients don't have to depend on parley_data
```rust
}
.expect("Failed to create WordSegmenter with runtime models");

// Clear cached line segmenters; they will be lazily recreated with the new mode.
```
why not do the same for all segmenters?
I believe the line segmenters and word segmenter are the only ones that use complex script data (correct me if I'm wrong). We eagerly reinitialize the word segmenter, and lazily reinitialize the line segmenters.
> I believe the line segmenters and word segmenter are the only ones that use complex script data
yes. I meant why not treat the four segmenters that use complex script data the same
you should really do it eagerly, because if you have clients passing you arbitrary data, the constructors can fail and you want to return an error instead of panicking.
the non-custom constructors (try_new_for_non_complex_scripts) are infallible, so you can create these in AnalysisDataSources::new(). however the unstable constructors are pretty fallible with user-supplied data, so you should eagerly construct them in load_segmenter_models to return errors to the caller
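In sketch form (placeholder types, not Parley's real ones), the control-flow change being suggested:

```rust
// Placeholder types sketching the suggested flow: every fallible construction
// happens inside the loading call, so a malformed blob becomes an Err for the
// caller instead of a later panic on first use.
struct Segmenter;
#[derive(Debug)]
struct DataError;

fn try_new_segmenter(data: &[u8]) -> Result<Segmenter, DataError> {
    if data.is_empty() { Err(DataError) } else { Ok(Segmenter) }
}

struct Segmenters {
    word: Segmenter,
    line_normal: Segmenter,
    line_keep_all: Segmenter,
    line_break_all: Segmenter,
}

// Eagerly build all four segmenters that consume user-supplied data.
fn load_segmenter_models(data: &[u8]) -> Result<Segmenters, DataError> {
    Ok(Segmenters {
        word: try_new_segmenter(data)?,
        line_normal: try_new_segmenter(data)?,
        line_keep_all: try_new_segmenter(data)?,
        line_break_all: try_new_segmenter(data)?,
    })
}

fn main() {
    assert!(load_segmenter_models(&[]).is_err()); // error surfaces at load time
    assert!(load_segmenter_models(&[1]).is_ok());
}
```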
```rust
#[cfg(feature = "runtime-segmenter-data")]
pub(crate) fn load_segmenter_models(
    &mut self,
    providers: Vec<icu_provider_blob::BlobDataProvider>,
```
I'd move the `.provider` access and `collect()` in here, to encapsulate the logic more. Then the `SegmenterModelData.provider` field can be private, as can the fact that you collect into a `Vec`.
```diff
-    providers: Vec<icu_provider_blob::BlobDataProvider>,
+    models: impl IntoIterator<Item = SegmenterModelData>,
```
The reason I collected into a Vec in the public methods on Layout is to avoid monomorphizing this function, which contains a lot more actual logic.
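For context, the usual way to reconcile these two suggestions is a thin generic shim over a concrete inner function, sketched here with placeholder types:

```rust
struct SegmenterModelData;

// Only this small shim is duplicated per caller's iterator type.
pub fn load_segmenter_models(models: impl IntoIterator<Item = SegmenterModelData>) {
    load_segmenter_models_inner(models.into_iter().collect());
}

// The heavy logic is compiled exactly once.
fn load_segmenter_models_inner(models: Vec<SegmenterModelData>) {
    let _ = models; // ...actual provider construction would happen here...
}

fn main() {
    load_segmenter_models(vec![SegmenterModelData]);
}
```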
```rust
}
Some(buffer_provider) => {
    let cur_mode = buffer_provider.segmenter_mode;
    assert_eq!(
```
this is an undocumented panic. the public append functions only talk about performance issues with duplicated models
I'm not sold on the use case of the append methods. when would a client not supply the models all at once?
cc @taj-p
> when would a client not supply the models all at once?
For our own web app, this would give us an opportunity to load in models as the user adds to their content, mostly to minimise memory usage (one of our leading concerns).
Even then that should be done through the other constructor. If you want to add two new models, calling the append function twice does unnecessary work.
Force-pushed from 6410d11 to d9d3446
Force-pushed from 441a4c4 to e7cee47
Co-authored-by: Robert Bastian <[email protected]>
Co-authored-by: Robert Bastian <[email protected]>
parley_data/src/lib.rs (outdated):

```diff
 pub mod bundled_models {
     /// Thai LSTM model for word/line segmentation.
-    pub const THAI_LSTM: &[u8] = include_bytes!(
+    pub static THAI_LSTM: &[u8] = include_bytes!(
```
uhm I think const was better. it's a const reference, so while the reference might be copied around, the thing behind it ('static) won't be
Can you please rename this PR to something like "Add API for setting complex segmentation data"? "locale data" is very generic
This reverts commit e7cee47.
I put some suggestions in robertbastian/parley@5aa9888...99b1500 including:
This PR adds an API to Parley for loading locale-specific text analysis data at runtime: `LayoutContext::load_segmenter_models_auto` and `LayoutContext::load_segmenter_models_dictionary`.

The data itself is now generated by `parley_data_gen`, and exported as `static` byte slices from the `parley_data` crate if the `bundled-segmenter-models` feature is enabled. The intent is that if you're deploying a desktop application, you can just pass the models in from `parley_data` directly, but if you're deploying to the web, you can depend on `parley_data` at build time and export the model data into files to be dynamically loaded.

There are some required supporting changes:

- `parley_data` now generates dictionary providers too, even though the baked data doesn't provide any dictionaries and just errors out. This matches how it works with LSTMs.
- I noticed that the code becomes a lot cleaner, especially with some new abstractions, if `LineSegmenters` doesn't have to borrow itself mutably and uses `OnceCell` instead of `Option`.
- Right now, to juggle lifetimes properly, `analyze_text` moves the `LineSegmenters` struct out of the parent `LayoutContext`, and then moves it back in after analysis. I noticed that the `LineSegmenters` struct is 5KB large(!), so getting rid of those moves probably helps with performance too.

There's an opportunity for future improvement: right now, the data has to be loaded from a `&'static [u8]` or `Box<[u8]>`, since that's the interface that icu_provider_blob exposes. We could change the API to use `linebender_resource_handle::Blob`, but that requires the type to be compatible with Yoke. I did implement this, but it hasn't yet made it into a release. This may be a bit trickier; icu_provider_blob has a private API that it doesn't expose...

`SegmenterModelData::from_blob` returns an `icu_provider::DataError`; I'm not sure if we want to "leak" that error type publicly. Parley, as far as I can tell, does not define any error types of its own yet. Would an unwrap/expect be better here?
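To make the intended flow concrete, a hedged usage sketch: the names `LayoutContext::load_segmenter_models_auto`, `SegmenterModelData::from_blob`, and `parley_data::bundled_models::THAI_LSTM` come from this PR's description, but the exact signatures (import paths, argument shape, error handling) are assumptions here.

```rust
use parley::LayoutContext;

#[derive(Clone, Copy, Default, Debug, PartialEq)]
struct Brush;

fn set_up_context() -> LayoutContext<Brush> {
    let mut layout_cx: LayoutContext<Brush> = LayoutContext::new();

    // Desktop deployment: wrap a bundled blob and hand it to the context.
    // (`from_blob` returning `icu_provider::DataError` is per the description;
    // whether the load method itself returns a Result is left open above.)
    let thai = parley::SegmenterModelData::from_blob(parley_data::bundled_models::THAI_LSTM)
        .expect("bundled model data is valid");
    layout_cx.load_segmenter_models_auto(vec![thai]);

    layout_cx
}
```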