@toulzx toulzx commented Dec 2, 2025

Description

Since rank_bm25 is no longer actively maintained and relevant fixes have not been merged, this PR replaces it with bm25s.
This pull request updates the BM25Retriever implementation in libs/community/langchain_community/retrievers/bm25.py to use the bm25s package instead of rank_bm25, and refactors the code to work with the new API.
The updated code switches from rank_bm25.BM25Okapi() to bm25s.BM25(method="atire", idf_method="lucene"): method="atire" applies the standard Okapi BM25 term-frequency / length normalization (so behavior stays aligned with rank_bm25.BM25Okapi), while idf_method="lucene" uses a Lucene-style IDF that avoids negative values. For other method / idf_method combinations, see variants.
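To make the "avoids negative values" point concrete, here is a minimal self-contained sketch of the two IDF formulas (standard Robertson/Okapi IDF vs. the Lucene-style variant, which adds 1 inside the logarithm). This is an illustration of the math, not code from either library:

```python
import math

def okapi_idf(n_docs: int, doc_freq: int) -> float:
    # Robertson/Okapi IDF: goes negative when a term appears
    # in more than half of the corpus documents.
    return math.log((n_docs - doc_freq + 0.5) / (doc_freq + 0.5))

def lucene_idf(n_docs: int, doc_freq: int) -> float:
    # Lucene-style IDF: the "+ 1" inside the log keeps the
    # value strictly positive for any valid doc_freq.
    return math.log(1 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))

# A very common term, appearing in 8 of 10 documents:
print(okapi_idf(10, 8))   # negative
print(lucene_idf(10, 8))  # positive
```

Negative IDF means a very common term actively *lowers* a document's score, which is the main artifact the Lucene-style formula avoids.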
It also improves _get_relevant_documents by dynamically forwarding only the supported keyword arguments to the retrieve method of the new BM25 implementation, enabling advanced options of bm25s.BM25.retrieve() such as backend_selection and n_threads.
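The "forward only the supported keyword arguments" idea can be sketched with `inspect.signature`. The `retrieve` function below is a stand-in whose parameter names (`n_threads`, `backend_selection`) mirror the options mentioned above; only its signature matters for the illustration:

```python
import inspect

def filter_supported_kwargs(func, kwargs: dict) -> dict:
    """Keep only the keyword arguments that `func` actually accepts."""
    params = inspect.signature(func).parameters
    # If the callable accepts **kwargs, everything is forwardable as-is.
    if any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()):
        return dict(kwargs)
    return {k: v for k, v in kwargs.items() if k in params}

def retrieve(query_tokens, k=10, n_threads=0, backend_selection="auto"):
    # Stand-in for the new retrieve method; returns what it received.
    return k, n_threads

supported = filter_supported_kwargs(retrieve, {"n_threads": 4, "unknown_opt": True})
print(supported)  # {'n_threads': 4}
```

Filtering this way lets callers pass a superset of options without the retriever hard-coding the downstream signature.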

Relevant Issue

None yet.

Dependencies Change

None; these packages are not part of the core dependency set.

Backwards Compatible

Not fully backwards compatible, but the impact should be limited.
The constructor parameters of rank_bm25.BM25Okapi and bm25s.BM25 do not overlap. If users were passing parameters intended for rank_bm25.BM25Okapi via bm25_params in BM25Retriever.from_documents() or BM25Retriever.from_texts(), those parameters will now cause errors because they are not valid for bm25s.BM25.

Reference Version

toulzx (Author) commented Dec 2, 2025

[Feature Request] how to expose BM25 scores?

I would like to expose BM25 scores together with the top‑k documents returned by BM25Retriever. In langchain_core.vectorstores, there are separate methods like similarity_search_with_relevance_scores (vs. similarity_search), but in langchain_core.retrievers the public API only exposes invoke, which returns a list of Document objects without scores.

To minimize API and interface changes on the retriever side, the approach I’m considering is:

  • After calling self.vectorizer.retrieve(...) inside _get_relevant_documents,
  • Take the scores returned by bm25s, and
  • Attach them to each Document via a metadata field such as metadata["bm25_score"].

This keeps the public BM25Retriever interface unchanged (still returning List[Document] from invoke), while allowing downstream code to access the scores if needed via doc.metadata["bm25_score"].
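The metadata-based approach could look roughly like the sketch below. The `Document` class here is a minimal stand-in for `langchain_core.documents.Document`, and the `"bm25_score"` key name is illustrative, not an established convention:

```python
from dataclasses import dataclass, field

@dataclass
class Document:
    # Minimal stand-in for langchain_core.documents.Document.
    page_content: str
    metadata: dict = field(default_factory=dict)

def attach_scores(docs: list[Document], scores) -> list[Document]:
    """Attach per-document BM25 scores under a metadata key (hypothetical name)."""
    for doc, score in zip(docs, scores):
        doc.metadata["bm25_score"] = float(score)
    return docs

# e.g. docs and scores as returned by a retrieve(...) call:
docs = attach_scores([Document("cat"), Document("dog")], [2.5, 1.1])
print(docs[0].metadata["bm25_score"])  # 2.5
```

The return type of invoke stays List[Document], so existing callers are unaffected; score-aware callers read doc.metadata["bm25_score"].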

I’d appreciate feedback on whether this metadata-based approach is acceptable, or if there is a preferred pattern for exposing per-document scores from retrievers in the current API.

@toulzx toulzx changed the title chore(retrievers): replace package rank_bm25 withbm25s chore(retrievers): replace package rank_bm25 with bm25s Dec 3, 2025
@toulzx toulzx marked this pull request as draft December 3, 2025 03:55