-
-
Notifications
You must be signed in to change notification settings - Fork 823
Lazy scorers #2726
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Lazy scorers #2726
Conversation
- Allow lazy evaluation of score. As soon as we identified that a doc won't reach the topK threshold, we can stop the evaluation. - Allow for a different segment level score, segment level score and their conversion. This PR breaks public API, but fixing code is straightforward.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that this interface works! Thanks a lot for exploring it.
My only suggestion would be to break interfaces a little bit harder (naming wise), if possible: the capabilities exposed by this API go way beyond "tweaking a score", and into sorting on arbitrary features.
83308bf to
2fba123
Compare
|
I made a sketch of how this design could support implementing |
291f254 to
259007c
Compare
259007c to
ef32cbb
Compare
|
@stuhood Can you review for good now? I did not integrate your work on batching yet. I think this can be done in a separate PR. The PR ended up larger, because I added the flexibility to deal with ordering equivalent to |
6db70fb to
bfcf508
Compare
cd4e41e to
a55995a
Compare
a55995a to
71d9a5d
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks so much for tackling this! The reduction in API surface area is really wonderful, and it looks like this enables the boxed/erased features in a cleaner way than #2681 did too.
Also, based on another experiment I was doing: I like removing impl Collector for TopDocs (despite the API change), because it cleans up the builder interface to not have to carry the generic type around.
Thanks again!
| false | ||
| } | ||
|
|
||
| /// Sorting by score is special in that it allows for the Block-Wand optimization. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This comment is specific to the overridden method for scores.
| /// Sorts by a fast value (u64, i64, f64, bool). | ||
| /// | ||
| /// The field must appear explicitly in the schema, with the right type, and declared as | ||
| /// a fast field.. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| /// a fast field.. | |
| /// a fast field. |
| /// | ||
| /// If the field is multivalued, only the first value is considered. | ||
| /// | ||
| /// Document that do not have this value are still considered. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| /// Document that do not have this value are still considered. | |
| /// Documents that do not have this value are still considered. |
| /// | ||
| /// If the field is multivalued, only the first value is considered. | ||
| /// | ||
| /// Document that do not have this value are still considered. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| /// Document that do not have this value are still considered. | |
| /// Documents that do not have this value are still considered. |
| // Sorting by score is special in that it allows for the Block-Wand optimization. | ||
| fn collect_segment_top_k( | ||
| &self, | ||
| k: usize, | ||
| weight: &dyn crate::query::Weight, | ||
| reader: &crate::SegmentReader, | ||
| segment_ord: u32, | ||
| ) -> crate::Result<Vec<(Self::SortKey, DocAddress)>> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I completely agree with not looping the batching into this change.
One thing that I wonder is whether batched, lazy collection could be used to implement block-wand, which might make scores less of a special case! Something to experiment with in the future.
|
|
||
| /// Sort by similarity score. | ||
| #[derive(Clone, Debug, Copy)] | ||
| pub struct SortBySimilarityScore; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Naming nit: here we spell out "similarity score", whereas with TopDocs::order_by_score and the other method removals/changes, I think that we recognize rightly that a "score" should probably only mean a "similarity score". So I think that you could probably drop Similarity from the name here, which would make this more symmetrical with order_by_score.
Follow up of #2681
reach the topK threshold, we can stop the evaluation.
The latter is more natural for most search application, but is not what is common in SQL (which I assume is @stuhood / paradedb use case)
This PR breaks public API, but fixing code should be straightforward.