
Conversation

@edwinyyyu edwinyyyu commented Dec 9, 2025

Purpose of the change

Embedders and rerankers have a context length limit, which is often short. Embedding time may grow with input length, and malicious inputs may be arbitrarily long. This change adds a configuration option that specifies a maximum input length.

Description

Maximum input lengths are specified in Unicode code points because that is the most natural unit in Python, and because, when a hard limit must be respected, a safe value is easy to compute:

  • A Unicode code point occupies at most 4 bytes in the common encodings (UTF-8, UTF-16, UTF-32).
  • A token covers at least 1 byte of input.
  • Thus, for a model with a hard context limit of 8192 tokens, an input length limit of 2048 Unicode code points is guaranteed to fit.
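The bound above can be computed mechanically. The helper below is illustrative only (the function name and the 4-bytes-per-code-point assumption are mine, not part of this PR):

```python
def max_input_length_for(context_limit_tokens: int) -> int:
    """Conservative code-point limit guaranteed to fit a token budget.

    Assumes a code point encodes to at most 4 bytes and a token spans
    at least 1 byte, so N code points yield at most 4 * N tokens.
    """
    return context_limit_tokens // 4

print(max_input_length_for(8192))  # 2048
```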

For embedders:
Chunk into balanced partitions when the text is too long. Average the embeddings of the partitions.
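A minimal sketch of this strategy (balanced splitting plus component-wise averaging); the helper names and signatures here are assumptions for illustration, not the PR's actual `chunk_text_balanced` API:

```python
import math

def split_balanced(text: str, max_length: int) -> list[str]:
    # Fewest chunks of near-equal size, each no longer than max_length.
    num_chunks = max(1, math.ceil(len(text) / max_length))
    base, extra = divmod(len(text), num_chunks)
    chunks, start = [], 0
    for i in range(num_chunks):
        end = start + base + (1 if i < extra else 0)
        chunks.append(text[start:end])
        start = end
    return chunks

def embed_averaged(text: str, max_length: int, embed) -> list[float]:
    # Embed each balanced chunk, then average the vectors component-wise.
    vectors = [embed(chunk) for chunk in split_balanced(text, max_length)]
    return [sum(dim) / len(vectors) for dim in zip(*vectors)]
```

Balanced partitions keep every chunk roughly the same size, so no single chunk dominates the average for an arbitrary split point.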

For cross-encoder rerankers:
Chunk text into maximal partitions when the text is too long. Take the maximum score of the partitions.
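The reranker side can be sketched the same way (again a hypothetical shape, not the PR's exact `chunk_text` signature):

```python
def split_maximal(text: str, max_length: int) -> list[str]:
    # Greedy partitioning: every chunk except possibly the last is full.
    return [text[i:i + max_length] for i in range(0, len(text), max_length)] or [text]

def rerank_max_score(query: str, document: str, max_length: int, score) -> float:
    # Score each chunk against the query; a document is considered to
    # match as well as its best-matching chunk.
    return max(score(query, chunk) for chunk in split_maximal(document, max_length))
```

Taking the maximum (rather than the average) preserves relevance when only one part of a long document matches the query.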

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Refactor (does not change functionality, e.g., code style improvements, linting)
  • Documentation update
  • Project Maintenance (updates to build scripts, CI, etc., that do not affect the main project)
  • Security (improves security without changing functionality)

How Has This Been Tested?

  • Unit Test
  • Integration Test
  • End-to-end Test
  • Test Script (please provide)
  • Manual verification (list step-by-step instructions)

Checklist

  • I have signed the commit(s) within this pull request
  • My code follows the style guidelines of this project (See STYLE_GUIDE.md)
  • I have performed a self-review of my own code
  • I have commented my code
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added unit tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • Any dependent changes have been merged and published in downstream modules
  • I have checked my code and corrected any misspellings

Maintainer Checklist

  • Confirmed all checks passed
  • Contributor has signed the commit(s)
  • Reviewed the code
  • Run, Tested, and Verified the change(s) work as expected

@edwinyyyu edwinyyyu changed the title WIP: Implement embedder input length limits WIP: Implement embedder and reranker input length limits Dec 10, 2025
@edwinyyyu edwinyyyu force-pushed the length_limits branch 6 times, most recently from 68fb4d9 to 5fbc372 Compare December 12, 2025 17:40
Signed-off-by: Edwin Yu <[email protected]>
@edwinyyyu edwinyyyu marked this pull request as ready for review December 12, 2025 17:56
@sscargal sscargal requested a review from Copilot December 12, 2025 20:59
Copilot AI left a comment
Pull request overview

This PR implements input length limits for embedders and rerankers to protect against excessively long inputs that could exceed model context limits or introduce security risks. When inputs exceed the configured maximum length (specified in Unicode code points), embedders partition text into balanced chunks and average the resulting embeddings, while cross-encoder rerankers use maximal chunks and return the maximum score across partitions.

Key changes include:

  • Added max_input_length parameter to embedder and reranker configurations
  • Implemented text chunking utilities (chunk_text, chunk_text_balanced, unflatten_like)
  • Modified embedders to handle long inputs by chunking and averaging embeddings
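For reference, one plausible shape for the `unflatten_like` utility named above (the actual signature in src/memmachine/common/utils.py may differ):

```python
def unflatten_like(flat: list, like: list[list]) -> list[list]:
    # Regroup a flat list into sublists whose lengths mirror `like`,
    # e.g. to map chunk embeddings back to the texts they came from.
    result, offset = [], 0
    for group in like:
        result.append(flat[offset:offset + len(group)])
        offset += len(group)
    return result
```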

Reviewed changes

Copilot reviewed 17 out of 18 changed files in this pull request and generated 6 comments.

Summary per file:

  • tests/memmachine/conftest.py: Added Cohere test fixtures and max_input_length to OpenAI embedder
  • tests/memmachine/common/test_utils.py: New test file for text chunking utilities
  • tests/memmachine/common/resource_manager/test_embedder_manager.py: Fixed class name typo from Config to Conf
  • tests/memmachine/common/reranker/test_cross_encoder_reranker.py: Converted to integration tests with real model and added large input tests
  • tests/memmachine/common/reranker/test_cohere_reranker.py: New integration tests for Cohere reranker with large input tests
  • tests/memmachine/common/embedder/test_sentence_transformer_embedder.py: New integration tests with large input test cases
  • tests/memmachine/common/embedder/test_openai_embedder.py: New integration tests with large input test cases
  • tests/memmachine/common/embedder/test_amazon_bedrock_embedder.py: New integration tests with large input test cases
  • src/memmachine/common/utils.py: Implemented chunking and unflattening utilities
  • src/memmachine/common/resource_manager/reranker_manager.py: Added max_input_length parameter to cross-encoder initialization
  • src/memmachine/common/resource_manager/embedder_manager.py: Added max_input_length parameter to all embedder initializations
  • src/memmachine/common/reranker/cross_encoder_reranker.py: Implemented chunking and max-score logic for long inputs
  • src/memmachine/common/embedder/sentence_transformer_embedder.py: Implemented chunking and averaging for long inputs
  • src/memmachine/common/embedder/openai_embedder.py: Implemented chunking and averaging for long inputs
  • src/memmachine/common/embedder/amazon_bedrock_embedder.py: Implemented chunking and averaging for long inputs
  • src/memmachine/common/configuration/reranker_conf.py: Added max_input_length field to CrossEncoderRerankerConf
  • src/memmachine/common/configuration/embedder_conf.py: Added max_input_length field and renamed Config to Conf


  )

- return response.astype(float).tolist()
+ chunk_embeddings = response.astype(float).tolist()
Since this is common logic, we could create a shared function that performs the chunking and calls the vendor-specific embed functions.
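A sketch of what that shared helper could look like (the function name and the batch-embed callback signature are assumptions for illustration):

```python
def embed_with_chunking(inputs, max_input_length, embed_batch):
    """Chunk over-long inputs, embed all chunks in one vendor call,
    then average each input's chunk embeddings back into one vector."""
    chunks, counts = [], []
    for text in inputs:
        parts = [
            text[i:i + max_input_length]
            for i in range(0, len(text), max_input_length)
        ] or [text]
        chunks.extend(parts)
        counts.append(len(parts))
    embeddings = embed_batch(chunks)  # vendor-specific call goes here
    # Unflatten: average each input's chunk embeddings component-wise.
    results, offset = [], 0
    for count in counts:
        group = embeddings[offset:offset + count]
        results.append([sum(dim) / count for dim in zip(*group)])
        offset += count
    return results
```

Each embedder would then only supply its own `embed_batch`, keeping the chunk/average logic in one place.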

2 participants