
Conversation

@edwinyyyu edwinyyyu commented Dec 9, 2025

Purpose of the change

Embedders and rerankers have a context length limit, which is often short. Embedding time may grow with input length, and malicious inputs may be arbitrarily long. This change adds a configuration option that specifies a maximum input length.

Description

Maximum input lengths are specified in Unicode code points because that is the most natural unit in Python, and because, when a hard limit must be respected, a safe value is easy to compute:

  • A Unicode code point occupies at most 4 bytes in the common encodings (UTF-8, UTF-16, UTF-32).
  • A token covers at least 1 byte of input.
  • Thus, for a model with a hard context limit of 8192 tokens, an input length limit of 2048 Unicode code points is guaranteed to fit.
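The bound above can be computed mechanically. The helper below is illustrative only (the function name and the 4-bytes-per-code-point assumption are mine, not part of this PR):

```python
def max_input_length_for(context_limit_tokens: int) -> int:
    """Conservative code-point limit guaranteed to fit a token budget.

    Assumes a code point encodes to at most 4 bytes and a token spans
    at least 1 byte, so N code points yield at most 4 * N tokens.
    """
    return context_limit_tokens // 4

print(max_input_length_for(8192))  # 2048
```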

For embedders:
Chunk into balanced partitions when the text is too long. Average the embeddings of the partitions.
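A minimal sketch of this strategy (balanced splitting plus component-wise averaging); the helper names and signatures here are assumptions for illustration, not the PR's actual `chunk_text_balanced` API:

```python
import math

def split_balanced(text: str, max_length: int) -> list[str]:
    # Fewest chunks of near-equal size, each no longer than max_length.
    num_chunks = max(1, math.ceil(len(text) / max_length))
    base, extra = divmod(len(text), num_chunks)
    chunks, start = [], 0
    for i in range(num_chunks):
        end = start + base + (1 if i < extra else 0)
        chunks.append(text[start:end])
        start = end
    return chunks

def embed_averaged(text: str, max_length: int, embed) -> list[float]:
    # Embed each balanced chunk, then average the vectors component-wise.
    vectors = [embed(chunk) for chunk in split_balanced(text, max_length)]
    return [sum(dim) / len(vectors) for dim in zip(*vectors)]
```

Balanced partitions keep every chunk roughly the same size, so no single chunk dominates the average for an arbitrary split point.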

For cross-encoder rerankers:
Chunk text into maximal partitions when the text is too long. Take the maximum score of the partitions.
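The reranker side can be sketched the same way (again a hypothetical shape, not the PR's exact `chunk_text` signature):

```python
def split_maximal(text: str, max_length: int) -> list[str]:
    # Greedy partitioning: every chunk except possibly the last is full.
    return [text[i:i + max_length] for i in range(0, len(text), max_length)] or [text]

def rerank_max_score(query: str, document: str, max_length: int, score) -> float:
    # Score each chunk against the query; a document is considered to
    # match as well as its best-matching chunk.
    return max(score(query, chunk) for chunk in split_maximal(document, max_length))
```

Taking the maximum (rather than the average) preserves relevance when only one part of a long document matches the query.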

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Refactor (does not change functionality, e.g., code style improvements, linting)
  • Documentation update
  • Project Maintenance (updates to build scripts, CI, etc., that do not affect the main project)
  • Security (improves security without changing functionality)

How Has This Been Tested?

  • Unit Test
  • Integration Test
  • End-to-end Test
  • Test Script (please provide)
  • Manual verification (list step-by-step instructions)

Checklist

  • I have signed the commit(s) within this pull request
  • My code follows the style guidelines of this project (See STYLE_GUIDE.md)
  • I have performed a self-review of my own code
  • I have commented my code
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added unit tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • Any dependent changes have been merged and published in downstream modules
  • I have checked my code and corrected any misspellings

Maintainer Checklist

  • Confirmed all checks passed
  • Contributor has signed the commit(s)
  • Reviewed the code
  • Run, Tested, and Verified the change(s) work as expected

@edwinyyyu edwinyyyu changed the title WIP: Implement embedder input length limits WIP: Implement embedder and reranker input length limits Dec 10, 2025
@edwinyyyu edwinyyyu force-pushed the length_limits branch 6 times, most recently from 68fb4d9 to 5fbc372 Compare December 12, 2025 17:40
Signed-off-by: Edwin Yu <[email protected]>
@edwinyyyu edwinyyyu marked this pull request as ready for review December 12, 2025 17:56
@sscargal sscargal requested a review from Copilot December 12, 2025 20:59
Copilot AI left a comment
Pull request overview

This PR implements input length limits for embedders and rerankers to protect against excessively long inputs that could exceed model context limits or introduce security risks. When inputs exceed the configured maximum length (specified in Unicode code points), embedders partition text into balanced chunks and average the resulting embeddings, while cross-encoder rerankers use maximal chunks and return the maximum score across partitions.

Key changes include:

  • Added max_input_length parameter to embedder and reranker configurations
  • Implemented text chunking utilities (chunk_text, chunk_text_balanced, unflatten_like)
  • Modified embedders to handle long inputs by chunking and averaging embeddings
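For reference, one plausible shape for the `unflatten_like` utility named above (the actual signature in src/memmachine/common/utils.py may differ):

```python
def unflatten_like(flat: list, like: list[list]) -> list[list]:
    # Regroup a flat list into sublists whose lengths mirror `like`,
    # e.g. to map chunk embeddings back to the texts they came from.
    result, offset = [], 0
    for group in like:
        result.append(flat[offset:offset + len(group)])
        offset += len(group)
    return result
```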

Reviewed changes

Copilot reviewed 17 out of 18 changed files in this pull request and generated 6 comments.

Summary per file:

  • tests/memmachine/conftest.py: Added Cohere test fixtures and max_input_length to OpenAI embedder
  • tests/memmachine/common/test_utils.py: New test file for text chunking utilities
  • tests/memmachine/common/resource_manager/test_embedder_manager.py: Fixed class name typo from Config to Conf
  • tests/memmachine/common/reranker/test_cross_encoder_reranker.py: Converted to integration tests with real model and added large input tests
  • tests/memmachine/common/reranker/test_cohere_reranker.py: New integration tests for Cohere reranker with large input tests
  • tests/memmachine/common/embedder/test_sentence_transformer_embedder.py: New integration tests with large input test cases
  • tests/memmachine/common/embedder/test_openai_embedder.py: New integration tests with large input test cases
  • tests/memmachine/common/embedder/test_amazon_bedrock_embedder.py: New integration tests with large input test cases
  • src/memmachine/common/utils.py: Implemented chunking and unflattening utilities
  • src/memmachine/common/resource_manager/reranker_manager.py: Added max_input_length parameter to cross-encoder initialization
  • src/memmachine/common/resource_manager/embedder_manager.py: Added max_input_length parameter to all embedder initializations
  • src/memmachine/common/reranker/cross_encoder_reranker.py: Implemented chunking and max-score logic for long inputs
  • src/memmachine/common/embedder/sentence_transformer_embedder.py: Implemented chunking and averaging for long inputs
  • src/memmachine/common/embedder/openai_embedder.py: Implemented chunking and averaging for long inputs
  • src/memmachine/common/embedder/amazon_bedrock_embedder.py: Implemented chunking and averaging for long inputs
  • src/memmachine/common/configuration/reranker_conf.py: Added max_input_length field to CrossEncoderRerankerConf
  • src/memmachine/common/configuration/embedder_conf.py: Added max_input_length field and renamed Config to Conf


  )

- return response.astype(float).tolist()
+ chunk_embeddings = response.astype(float).tolist()
Since this is common logic, we could create a shared function that performs the chunking and calls the vendor-specific embed functions.
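A sketch of what that shared helper could look like (the function name and the batch-embed callback signature are assumptions for illustration):

```python
def embed_with_chunking(inputs, max_input_length, embed_batch):
    """Chunk over-long inputs, embed all chunks in one vendor call,
    then average each input's chunk embeddings back into one vector."""
    chunks, counts = [], []
    for text in inputs:
        parts = [
            text[i:i + max_input_length]
            for i in range(0, len(text), max_input_length)
        ] or [text]
        chunks.extend(parts)
        counts.append(len(parts))
    embeddings = embed_batch(chunks)  # vendor-specific call goes here
    # Unflatten: average each input's chunk embeddings component-wise.
    results, offset = [], 0
    for count in counts:
        group = embeddings[offset:offset + count]
        results.append([sum(dim) / count for dim in zip(*group)])
        offset += count
    return results
```

Each embedder would then only supply its own `embed_batch`, keeping the chunk/average logic in one place.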

2 participants