@fulmicoton fulmicoton commented Oct 30, 2025

Follow-up to #2681

  • Allow lazy evaluation of the score. As soon as we have identified that a doc won't
    reach the top-K threshold, we can stop the evaluation.
  • Add a new type of ordering: Asc, Desc, and ascending with None values at the end.
    The latter is more natural for most search applications, but is not what is common in SQL (which I assume is the @stuhood / ParadeDB use case).
  • Allow the segment-level score and the final score to differ, with a conversion between them.
  • Rationalize part of the code / API.
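
To illustrate the lazy evaluation idea from the first bullet, here is a minimal sketch that does not use tantivy's actual API. It assumes a hypothetical two-part sort key: a cheap primary component and an expensive tie-breaker passed as a closure. The function name `beats_threshold` and the tuple threshold are made up for this example.

```rust
use std::cell::Cell;
use std::cmp::Ordering;

/// Hypothetical helper (not part of tantivy): decides whether a candidate
/// beats the current top-K threshold. The expensive component is wrapped in
/// a closure and only evaluated when the cheap component ties.
fn beats_threshold(
    cheap: u64,
    expensive: impl FnOnce() -> u64,
    threshold: (u64, u64),
) -> bool {
    match cheap.cmp(&threshold.0) {
        // The cheap component already decides: no need to go further.
        Ordering::Greater => true,
        Ordering::Less => false,
        // Tie on the cheap component: now we have to pay for the rest.
        Ordering::Equal => expensive() > threshold.1,
    }
}

fn main() {
    // Count how many times the expensive component is actually computed.
    let calls = Cell::new(0u32);
    let expensive = |v: u64| {
        calls.set(calls.get() + 1);
        v
    };
    let threshold = (10, 5);

    // Decided by the cheap component alone.
    assert!(!beats_threshold(3, || expensive(100), threshold));
    assert!(beats_threshold(42, || expensive(0), threshold));
    assert_eq!(calls.get(), 0);

    // Tie on the cheap component: the expensive one is evaluated once.
    assert!(beats_threshold(10, || expensive(7), threshold));
    assert_eq!(calls.get(), 1);
}
```

In a collector, most documents would presumably fall in the first two branches once the top-K heap is full, which is where the savings would come from.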

This PR breaks the public API, but fixing affected code should be straightforward.
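
As a rough sketch of why a separate segment-level score and a conversion step can be useful, consider sorting by a string field. The example below is an assumption-laden illustration, not tantivy's API: `SegmentDict`, `to_final_key` and `segment_top_k` are hypothetical names. Inside a segment it compares cheap integer ordinals, and it only materializes strings for the hits that survive the top-K cut.

```rust
/// Hypothetical stand-in for a segment's term dictionary. The terms are
/// sorted, so comparing ordinals is equivalent to comparing the strings.
struct SegmentDict {
    terms: Vec<String>,
}

impl SegmentDict {
    /// Converts a segment-level key (ordinal) into the final key (string).
    fn to_final_key(&self, ord: u64) -> String {
        self.terms[ord as usize].clone()
    }
}

/// Returns the top `k` hits of one segment in descending key order.
/// `doc_ords` holds (doc id, term ordinal) pairs.
fn segment_top_k(
    dict: &SegmentDict,
    doc_ords: &[(u32, u64)],
    k: usize,
) -> Vec<(String, u32)> {
    let mut hits: Vec<(u64, u32)> =
        doc_ords.iter().map(|&(doc, ord)| (ord, doc)).collect();
    // Descending by ordinal, ties broken by ascending doc id.
    hits.sort_by(|a, b| b.0.cmp(&a.0).then(a.1.cmp(&b.1)));
    hits.truncate(k);
    // Only the k winners pay for the string conversion.
    hits.into_iter()
        .map(|(ord, doc)| (dict.to_final_key(ord), doc))
        .collect()
}

fn main() {
    let dict = SegmentDict {
        terms: vec!["apple".to_string(), "kiwi".to_string(), "pear".to_string()],
    };
    let doc_ords: [(u32, u64); 4] = [(0, 1), (1, 2), (2, 0), (3, 2)];
    let top = segment_top_k(&dict, &doc_ords, 2);
    assert_eq!(top, vec![("pear".to_string(), 1), ("pear".to_string(), 3)]);
}
```

Ordinals like these are only meaningful within one segment, so the conversion has to happen before results from different segments are merged.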

@stuhood stuhood left a comment

I think that this interface works! Thanks a lot for exploring it.

My only suggestion would be to break interfaces a little bit harder (naming wise), if possible: the capabilities exposed by this API go way beyond "tweaking a score", and into sorting on arbitrary features.

@fulmicoton-dd fulmicoton-dd force-pushed the paul.masurel/lazy-scorers branch from 83308bf to 2fba123 Compare November 1, 2025 15:15

stuhood commented Nov 3, 2025

I made a sketch of how this design could support implementing Collector::collect_block over here: #2728 ... seems promising, thanks!

@fulmicoton-dd fulmicoton-dd force-pushed the paul.masurel/lazy-scorers branch 2 times, most recently from 291f254 to 259007c Compare November 7, 2025 21:13
@fulmicoton-dd fulmicoton-dd force-pushed the paul.masurel/lazy-scorers branch from 259007c to ef32cbb Compare November 7, 2025 21:34
@fulmicoton fulmicoton changed the title Paul.masurel/lazy scorers Lazy scorers Nov 10, 2025
@fulmicoton fulmicoton marked this pull request as ready for review November 10, 2025 09:34
@fulmicoton
Collaborator Author

@stuhood Can you do a final review now? I have not integrated your work on batching yet. I think this can be done in a separate PR.

The PR ended up larger because I added the flexibility to handle orderings equivalent to both
ORDER BY col DESC NULLS LAST and ORDER BY col DESC, to be compatible with both Quickwit and ParadeDB.
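
For readers less familiar with the two SQL orderings mentioned here, the difference can be sketched with plain `Option<u64>` values. This is an illustration only: the `Order` enum and `compare` function are hypothetical and are not the types introduced by this PR. It assumes PostgreSQL semantics, where a bare `DESC` places NULLs first.

```rust
use std::cmp::Ordering;

/// Hypothetical ordering modes, named after their SQL equivalents.
#[derive(Clone, Copy)]
enum Order {
    /// `ORDER BY col DESC` (PostgreSQL default: missing values first).
    DescNullsFirst,
    /// `ORDER BY col DESC NULLS LAST`.
    DescNullsLast,
}

/// Compares two optional values under the given ordering mode.
fn compare(order: Order, a: &Option<u64>, b: &Option<u64>) -> Ordering {
    match order {
        // `Option` orders `None` before `Some(_)`, so reversing the
        // built-in comparison yields large values first and `None` last.
        Order::DescNullsLast => b.cmp(a),
        // Missing values first, then present values in descending order.
        Order::DescNullsFirst => match (a, b) {
            (None, None) => Ordering::Equal,
            (None, Some(_)) => Ordering::Less,
            (Some(_), None) => Ordering::Greater,
            (Some(x), Some(y)) => y.cmp(x),
        },
    }
}

fn main() {
    let mut values = vec![Some(3), None, Some(7)];

    values.sort_by(|a, b| compare(Order::DescNullsLast, a, b));
    assert_eq!(values, vec![Some(7), Some(3), None]);

    values.sort_by(|a, b| compare(Order::DescNullsFirst, a, b));
    assert_eq!(values, vec![None, Some(7), Some(3)]);
}
```

With a top-K collector the choice matters: under `DESC NULLS LAST`, documents without a value only appear once all documents with a value have been exhausted.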

@fulmicoton-dd fulmicoton-dd force-pushed the paul.masurel/lazy-scorers branch 6 times, most recently from 6db70fb to bfcf508 Compare November 10, 2025 10:12
@fulmicoton-dd fulmicoton-dd force-pushed the paul.masurel/lazy-scorers branch 2 times, most recently from cd4e41e to a55995a Compare November 10, 2025 12:58
@fulmicoton-dd fulmicoton-dd force-pushed the paul.masurel/lazy-scorers branch from a55995a to 71d9a5d Compare November 10, 2025 13:04

@stuhood stuhood left a comment

Thanks so much for tackling this! The reduction in API surface area is really wonderful, and it looks like this enables the boxed/erased features in a cleaner way than #2681 did too.

Also, based on another experiment I was doing: I like removing impl Collector for TopDocs (despite the API change), because it cleans up the builder interface to not have to carry the generic type around.

Thanks again!

false
}

/// Sorting by score is special in that it allows for the Block-Wand optimization.

This comment is specific to the overridden method for scores.

/// Sorts by a fast value (u64, i64, f64, bool).
///
/// The field must appear explicitly in the schema, with the right type, and declared as
/// a fast field..

Suggested change
/// a fast field..
/// a fast field.

///
/// If the field is multivalued, only the first value is considered.
///
/// Document that do not have this value are still considered.

Suggested change
/// Document that do not have this value are still considered.
/// Documents that do not have this value are still considered.

///
/// If the field is multivalued, only the first value is considered.
///
/// Document that do not have this value are still considered.

Suggested change
/// Document that do not have this value are still considered.
/// Documents that do not have this value are still considered.

Comment on lines +27 to +34
// Sorting by score is special in that it allows for the Block-Wand optimization.
fn collect_segment_top_k(
    &self,
    k: usize,
    weight: &dyn crate::query::Weight,
    reader: &crate::SegmentReader,
    segment_ord: u32,
) -> crate::Result<Vec<(Self::SortKey, DocAddress)>> {

@stuhood stuhood Nov 10, 2025

I completely agree with not looping the batching into this change.

One thing that I wonder is whether batched, lazy collection could be used to implement block-wand, which might make scores less of a special case! Something to experiment with in the future.


/// Sort by similarity score.
#[derive(Clone, Debug, Copy)]
pub struct SortBySimilarityScore;

Naming nit: here we spell out "similarity score", whereas with TopDocs::order_by_score and the other method removals/changes, I think we rightly recognize that a "score" should probably only mean a "similarity score". So I think you could drop Similarity from the name here, which would make it more symmetrical with order_by_score.
