WIP: Implement embedder and reranker input length limits #744
base: main
Conversation
Signed-off-by: Edwin Yu <[email protected]>
Pull request overview
This PR implements input length limits for embedders and rerankers to protect against excessively long inputs that could exceed model context limits or introduce security risks. When inputs exceed the configured maximum length (specified in Unicode code points), embedders partition text into balanced chunks and average the resulting embeddings, while cross-encoder rerankers use maximal chunks and return the maximum score across partitions.
Key changes include:
- Added `max_input_length` parameter to embedder and reranker configurations
- Implemented text chunking utilities (`chunk_text`, `chunk_text_balanced`, `unflatten_like`)
- Modified embedders to handle long inputs by chunking and averaging embeddings
Reviewed changes
Copilot reviewed 17 out of 18 changed files in this pull request and generated 6 comments.
Show a summary per file
| File | Description |
|---|---|
| tests/memmachine/conftest.py | Added Cohere test fixtures and max_input_length to OpenAI embedder |
| tests/memmachine/common/test_utils.py | New test file for text chunking utilities |
| tests/memmachine/common/resource_manager/test_embedder_manager.py | Fixed class name typo from Config to Conf |
| tests/memmachine/common/reranker/test_cross_encoder_reranker.py | Converted to integration tests with real model and added large input tests |
| tests/memmachine/common/reranker/test_cohere_reranker.py | New integration tests for Cohere reranker with large input tests |
| tests/memmachine/common/embedder/test_sentence_transformer_embedder.py | New integration tests with large input test cases |
| tests/memmachine/common/embedder/test_openai_embedder.py | New integration tests with large input test cases |
| tests/memmachine/common/embedder/test_amazon_bedrock_embedder.py | New integration tests with large input test cases |
| src/memmachine/common/utils.py | Implemented chunking and unflattening utilities |
| src/memmachine/common/resource_manager/reranker_manager.py | Added max_input_length parameter to cross-encoder initialization |
| src/memmachine/common/resource_manager/embedder_manager.py | Added max_input_length parameter to all embedder initializations |
| src/memmachine/common/reranker/cross_encoder_reranker.py | Implemented chunking and max-score logic for long inputs |
| src/memmachine/common/embedder/sentence_transformer_embedder.py | Implemented chunking and averaging for long inputs |
| src/memmachine/common/embedder/openai_embedder.py | Implemented chunking and averaging for long inputs |
| src/memmachine/common/embedder/amazon_bedrock_embedder.py | Implemented chunking and averaging for long inputs |
| src/memmachine/common/configuration/reranker_conf.py | Added max_input_length field to CrossEncoderRerankerConf |
| src/memmachine/common/configuration/embedder_conf.py | Added max_input_length field and renamed Config to Conf |
```diff
-        return response.astype(float).tolist()
+        chunk_embeddings = response.astype(float).tolist()
```
Since this is common logic, we could factor it into a shared function that performs the chunking and calls the vendor-specific embed functions.
Purpose of the change
Embedders and rerankers have a context length limit, which is often fairly short. Embedding time may also grow with input length, and malicious inputs may be arbitrarily long. This change allows the configuration to specify a length limit.
Description
Maximum input lengths are specified in Unicode code points because that is the most natural unit in Python, and, in the cases where a limit is needed, it is usually sufficient for computing a safe hard limit to set.
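As a quick illustration of why code points are a convenient unit here: in Python, `len()` and slicing on `str` both operate directly on code points, so a code-point limit needs no extra tokenization step.

```python
# len() on a str counts Unicode code points, and slicing
# operates on the same unit, so a code-point limit is trivial to apply.
text = "héllo🙂"
print(len(text))   # 6 code points (the emoji counts as one)
limit = 4
print(text[:limit])  # "héll"
```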
For embedders:
Chunk into balanced partitions when the text is too long. Average the embeddings of the partitions.
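The balanced-chunking-and-averaging behavior described above can be sketched as follows. The helper name `chunk_text_balanced` matches the utility this PR adds, but the body is an illustrative sketch of the described behavior, not the PR's actual code, and `embed` stands in for a vendor-specific embedding call.

```python
import math

def chunk_text_balanced(text: str, max_length: int) -> list[str]:
    """Split text into the fewest chunks that fit max_length,
    with near-equal (balanced) chunk sizes. Illustrative sketch."""
    if len(text) <= max_length:
        return [text]
    num_chunks = math.ceil(len(text) / max_length)
    base, extra = divmod(len(text), num_chunks)
    chunks, start = [], 0
    for i in range(num_chunks):
        # The first `extra` chunks get one additional code point.
        size = base + (1 if i < extra else 0)
        chunks.append(text[start:start + size])
        start += size
    return chunks

def embed_long(text: str, max_length: int, embed) -> list[float]:
    """Embed each chunk and average the vectors element-wise."""
    vectors = [embed(chunk) for chunk in chunk_text_balanced(text, max_length)]
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]
```

For example, an 8-code-point string with a limit of 3 is split into three balanced chunks of sizes 3, 3, and 2 rather than 3, 3, and a tiny remainder plus padding.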
For cross-encoder rerankers:
Chunk text into maximal partitions when the text is too long. Take the maximum score of the partitions.
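The maximal-chunk, max-score behavior for rerankers can be sketched similarly. `chunk_text` matches the utility name from this PR, but the body is again an illustrative sketch, and `score` stands in for a cross-encoder scoring call.

```python
def chunk_text(text: str, max_length: int) -> list[str]:
    """Greedily split text into maximal chunks: every chunk except
    possibly the last is exactly max_length. Illustrative sketch."""
    return [text[i:i + max_length]
            for i in range(0, len(text), max_length)] or [""]

def rerank_score(query: str, candidate: str, max_length: int, score) -> float:
    """Score each chunk against the query and return the maximum:
    a long document is considered relevant if any chunk is relevant."""
    return max(score(query, chunk)
               for chunk in chunk_text(candidate, max_length))
```

Taking the maximum rather than the average preserves relevance signals that appear in only one part of a long document.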
Type of change
How Has This Been Tested?
Checklist
Maintainer Checklist