@toulzx toulzx commented Dec 2, 2025

Description

Since rank_bm25 is no longer actively maintained and relevant fixes have not been merged, this PR replaces it with bm25s.
This pull request updates the BM25Retriever implementation in libs/community/langchain_community/retrievers/bm25.py to use the bm25s package instead of rank_bm25, and refactors the code to work with the new API.
The updated code switches from rank_bm25.BM25Okapi() to bm25s.BM25(method="atire", idf_method="lucene"): method="atire" applies the standard Okapi BM25 term-frequency / length normalization (so behavior stays aligned with rank_bm25.BM25Okapi), while idf_method="lucene" uses a Lucene-style IDF that avoids negative values. For other method / idf_method combinations, see variants.
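To make the "avoids negative values" point concrete, here is a minimal self-contained sketch of the two IDF formulas (standard Robertson/Okapi IDF vs. the Lucene-style variant, which adds 1 inside the logarithm). This is an illustration of the math, not code from either library:

```python
import math

def okapi_idf(n_docs: int, doc_freq: int) -> float:
    # Robertson/Okapi IDF: goes negative when a term appears
    # in more than half of the corpus documents.
    return math.log((n_docs - doc_freq + 0.5) / (doc_freq + 0.5))

def lucene_idf(n_docs: int, doc_freq: int) -> float:
    # Lucene-style IDF: the "+ 1" inside the log keeps the
    # value strictly positive for any valid doc_freq.
    return math.log(1 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))

# A very common term, appearing in 8 of 10 documents:
print(okapi_idf(10, 8))   # negative
print(lucene_idf(10, 8))  # positive
```

Negative IDF means a very common term actively *lowers* a document's score, which is the main artifact the Lucene-style formula avoids.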
It also improves _get_relevant_documents by dynamically forwarding only the supported keyword arguments to the retrieve method of the new BM25 implementation, enabling advanced options of bm25s.BM25.retrieve() such as backend_selection and n_threads.
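The "forward only the supported keyword arguments" idea can be sketched with `inspect.signature`. The `retrieve` function below is a stand-in whose parameter names (`n_threads`, `backend_selection`) mirror the options mentioned above; only its signature matters for the illustration:

```python
import inspect

def filter_supported_kwargs(func, kwargs: dict) -> dict:
    """Keep only the keyword arguments that `func` actually accepts."""
    params = inspect.signature(func).parameters
    # If the callable accepts **kwargs, everything is forwardable as-is.
    if any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()):
        return dict(kwargs)
    return {k: v for k, v in kwargs.items() if k in params}

def retrieve(query_tokens, k=10, n_threads=0, backend_selection="auto"):
    # Stand-in for the new retrieve method; returns what it received.
    return k, n_threads

supported = filter_supported_kwargs(retrieve, {"n_threads": 4, "unknown_opt": True})
print(supported)  # {'n_threads': 4}
```

Filtering this way lets callers pass a superset of options without the retriever hard-coding the downstream signature.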

Relevant Issue

None yet.

Dependencies Change

None; these packages are not part of the core dependency set.

Backwards Compatible

Not fully backwards compatible, but the impact should be limited.
The constructor parameters of rank_bm25.BM25Okapi and bm25s.BM25 do not overlap. If users were passing parameters intended for rank_bm25.BM25Okapi via bm25_params in BM25Retriever.from_documents() or BM25Retriever.from_texts(), those parameters will now cause errors because they are not valid for bm25s.BM25.

Reference Version

toulzx (Author) commented Dec 2, 2025

[Feature Request] how to expose BM25 scores?

I would like to expose BM25 scores together with the top‑k documents returned by BM25Retriever. In langchain_core.vectorstores, there are separate methods like similarity_search_with_relevance_scores (vs. similarity_search), but in langchain_core.retrievers the public API only exposes invoke, which returns a list of Document objects without scores.

To minimize API and interface changes on the retriever side, the approach I’m considering is:

  • After calling self.vectorizer.retrieve(...) inside _get_relevant_documents,
  • Take the scores returned by bm25s, and
  • Attach them to each Document via a metadata field such as metadata["bm25_score"].

This keeps the public BM25Retriever interface unchanged (still returning List[Document] from invoke), while allowing downstream code to access the scores if needed via doc.metadata["bm25_score"].
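The metadata-based approach could look roughly like the sketch below. The `Document` class here is a minimal stand-in for `langchain_core.documents.Document`, and the `"bm25_score"` key name is illustrative, not an established convention:

```python
from dataclasses import dataclass, field

@dataclass
class Document:
    # Minimal stand-in for langchain_core.documents.Document.
    page_content: str
    metadata: dict = field(default_factory=dict)

def attach_scores(docs: list[Document], scores) -> list[Document]:
    """Attach per-document BM25 scores under a metadata key (hypothetical name)."""
    for doc, score in zip(docs, scores):
        doc.metadata["bm25_score"] = float(score)
    return docs

# e.g. docs and scores as returned by a retrieve(...) call:
docs = attach_scores([Document("cat"), Document("dog")], [2.5, 1.1])
print(docs[0].metadata["bm25_score"])  # 2.5
```

The return type of invoke stays List[Document], so existing callers are unaffected; score-aware callers read doc.metadata["bm25_score"].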

I’d appreciate feedback on whether this metadata-based approach is acceptable, or if there is a preferred pattern for exposing per-document scores from retrievers in the current API.

@toulzx toulzx changed the title chore(retrievers): replace package rank_bm25 withbm25s chore(retrievers): replace package rank_bm25 with bm25s Dec 3, 2025
@toulzx toulzx marked this pull request as draft December 3, 2025 03:55