-
Notifications
You must be signed in to change notification settings - Fork 74
feat(transformer): Add MRL-E preprocess Transformer #1371
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Reviewer's GuideAdds an MRLE transformer type and dimension parameter into the transform quantization pipeline, wiring it through configuration, type enums, and tests, while enforcing that MRLE appears first in any transformer chain and implementing a basic MRLE transformer class. Sequence diagram for applying MRLE transformer in cosine metricsequenceDiagram
participant TQ as TransformQuantizer
participant MRLE as MRLETransformer_cosine
participant Meta as MRLETMeta
TQ->>MRLE: Transform(original_vec, transformed_vec)
activate MRLE
MRLE->>MRLE: memcpy(transformed_vec, original_vec, output_dim_ * sizeof(float))
MRLE->>MRLE: Normalize(transformed_vec, transformed_vec, output_dim_)
MRLE->>Meta: create MRLETMeta
activate Meta
MRLE-->>TQ: return MRLETMeta
deactivate Meta
deactivate MRLE
TQ-->>TQ: use transformed_vec in quantization pipeline
Class diagram for new MRLE transformer integrationclassDiagram
class VectorTransformer {
<<abstract>>
+Allocator* allocator_
+int64_t input_dim_
+int64_t output_dim_
+VectorTransformerType type_
+Transform(original_vec float*, transformed_vec float*) TransformerMetaPtr
+InverseTransform(transformed_vec float*, original_vec float*) void
+Serialize(writer StreamWriter&) void
+Deserialize(reader StreamReader&) void
+Train(data float*, count uint64_t) void
}
class VectorTransformerType {
<<enum>>
NONE
PCA
RANDOM_ORTHOGONAL
FHT
RESIDUAL
NORMALIZE
MRLE
}
class TransformerMeta {
<<interface>>
}
class MRLETMeta {
}
class MRLETransformer {
<<template<MetricType metric>>
+MRLETransformer(allocator Allocator*, input_dim int64_t, output_dim int64_t)
+~MRLETransformer()
+Transform(original_vec float*, transformed_vec float*) TransformerMetaPtr
+InverseTransform(transformed_vec float*, original_vec float*) void
+Serialize(writer StreamWriter&) void
+Deserialize(reader StreamReader&) void
+Train(data float*, count uint64_t) void
}
class VectorTransformerParameter {
+uint32_t input_dim_ = 0
+uint32_t pca_dim_ = 0
+uint32_t mrle_dim_ = 0
+FromJson(json JsonType) static VectorTransformerParameter
+ToJson() JsonType
+CheckCompatibility(other ParamPtr) bool
}
class TransformQuantizer {
<<template<QuantTmpl, MetricType metric>>
-VectorTransformerParameter transformer_param_
-std::vector<VectorTransformerPtr> transform_chain_
+TransformQuantizer(other const TransformQuantizer&)
+MakeTransformerInstance(transform_str std::string, param const VectorTransformerParameter&) VectorTransformerPtr
}
class InnerStringParams {
<<constants>>
+TRANSFORMER_TYPE_VALUE_PCA : const char* const
+TRANSFORMER_TYPE_VALUE_ROM : const char* const
+TRANSFORMER_TYPE_VALUE_FHT : const char* const
+TRANSFORMER_TYPE_VALUE_MRLE : const char* const
+TRANSFORMER_TYPE_VALUE_RESIDUAL : const char* const
+TRANSFORMER_TYPE_VALUE_NORMALIZE : const char* const
+INPUT_DIM_KEY : const char* const
+PCA_DIM_KEY : const char* const
+MRLE_DIM_KEY : const char* const
}
TransformerMeta <|-- MRLETMeta
VectorTransformer <|-- MRLETransformer
TransformQuantizer ..> VectorTransformer : uses
TransformQuantizer ..> VectorTransformerParameter : config
MRLETransformer ..> MRLETMeta : creates
VectorTransformerType <.. MRLETransformer : sets_type
InnerStringParams <.. TransformQuantizer : transformer_name_strings
InnerStringParams <.. VectorTransformerParameter : json_keys
File-Level Changes
Assessment against linked issues
Possibly linked issues
Tips and commandsInteracting with Sourcery
Customizing Your ExperienceAccess your dashboard to:
Getting Help
|
Summary of ChangesHello @LHT129, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request enhances the vector transformation capabilities by introducing a new MRL-E preprocess transformer. This addition allows for specific vector handling, including direct copying and conditional normalization, within the existing transformation pipeline. The changes involve extending the core transformer framework to recognize and configure the MRL-E type, alongside implementing a crucial validation to maintain the integrity of transformation chains by enforcing MRL-E's position as the initial step when utilized. Highlights
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey there - I've reviewed your changes and found some issues that need to be addressed.
- In mrle_transformer.h, the MRLETransformer template has an extra closing brace after the class definition (
} };), which will break compilation and should be reduced to a single};. - Consider using a more specific error type/message for MRLETransformer::InverseTransform (e.g., NOT_IMPLEMENTED) instead of INTERNAL_ERROR to make error handling and debugging clearer when inverse transforms are intentionally unsupported.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- In mrle_transformer.h, the MRLETransformer template has an extra closing brace after the class definition (`} };`), which will break compilation and should be reduced to a single `};`.
- Consider using a more specific error type/message for MRLETransformer::InverseTransform (e.g., NOT_IMPLEMENTED) instead of INTERNAL_ERROR to make error handling and debugging clearer when inverse transforms are intentionally unsupported.
## Individual Comments
### Comment 1
<location> `src/impl/transform/mrle_transformer.h:46-32` </location>
<code_context>
+ return meta;
+ }
+
+ void
+ InverseTransform(const float* transformed_vec, float* original_vec) const override {
+ throw VsagException(ErrorType::INTERNAL_ERROR, "InverseTransform not implement");
+ }
+
+ void
</code_context>
<issue_to_address>
**issue (bug_risk):** There appears to be an extra closing brace at the end of MRLETransformer, which will cause a compile error.
The class should be closed with a single `};`. Please remove the standalone `}` that appears immediately before the final `};` so the class definition compiles correctly.
</issue_to_address>
### Comment 2
<location> `src/impl/transform/mrle_transformer.h:24` </location>
<code_context>
+#include "vsag_exception.h"
+
+namespace vsag {
+struct MRLETMeta : public TransformerMeta {};
+
+template <MetricType metric = MetricType::METRIC_TYPE_L2SQR>
</code_context>
<issue_to_address>
**suggestion (typo):** The MRLETMeta type name looks inconsistent with the MRLE naming used elsewhere.
If this isn’t intentional, please rename `MRLETMeta` to `MRLEMeta` (to match `MRLETransformer`, `VectorTransformerType::MRLE`, etc.) so the MRLE-related types stay consistent.
Suggested implementation:
```c
namespace vsag {
struct MRLEMeta : public TransformerMeta {};
```
Renaming the type here will require updating all other usages of `MRLETMeta` in the codebase to `MRLEMeta`. Please:
1. Search for `MRLETMeta` in the project and replace each occurrence with `MRLEMeta`.
2. If there are any factory/registration mechanisms or type traits that reference `MRLETMeta`, update those references as well to keep the MRLE-related types consistent.
</issue_to_address>
### Comment 3
<location> `src/impl/transform/mrle_transformer.h:38-39` </location>
<code_context>
+
+ TransformerMetaPtr
+ Transform(const float* original_vec, float* transformed_vec) const override {
+ auto meta = std::make_shared<MRLETMeta>();
+ memcpy(transformed_vec, original_vec, this->output_dim_ * sizeof(float));
+ if constexpr (metric == MetricType::METRIC_TYPE_COSINE) {
+ Normalize(transformed_vec, transformed_vec, this->output_dim_);
</code_context>
<issue_to_address>
**issue (bug_risk):** Using output_dim_ as the memcpy length can cause out-of-bounds reads if output_dim_ > input_dim_.
Here `memcpy` uses `this->output_dim_` elements, but `original_vec` is only guaranteed to be `input_dim_` long. If `output_dim_ > input_dim_` (as might be allowed for MRLE), this will read past `original_vec`. Either guarantee `output_dim_ <= input_dim_` (e.g., via an invariant/assert) or bound the copy with `std::min(this->input_dim_, this->output_dim_)`, depending on the intended contract.
</issue_to_address>
### Comment 4
<location> `src/impl/transform/mrle_transformer.h:48` </location>
<code_context>
+
+ void
+ InverseTransform(const float* transformed_vec, float* original_vec) const override {
+ throw VsagException(ErrorType::INTERNAL_ERROR, "InverseTransform not implement");
+ }
+
</code_context>
<issue_to_address>
**suggestion:** The exception for InverseTransform is a bit misleading and the message is unclear.
Using `ErrorType::INTERNAL_ERROR` for an unimplemented method blurs the line between real internal failures and unsupported functionality. If available, prefer a more specific error code for "not implemented" or "unsupported", and update the message to something clearer like "InverseTransform not implemented".
Suggested implementation:
```c
void
InverseTransform(const float* transformed_vec, float* original_vec) const override {
// Currently, inverse transformation is not supported for MRLET.
// Use a specific error type to distinguish unsupported functionality
// from real internal errors.
throw VsagException(ErrorType::NOT_IMPLEMENTED, "InverseTransform not implemented");
}
// Copyright 2024-present the vsag project
```
- If `ErrorType::NOT_IMPLEMENTED` does not exist in your codebase, replace it with the most appropriate existing value (for example `UNIMPLEMENTED`, `UNSUPPORTED`, or similar) in both the enum definition and this throw site.
- Ensure `VsagException` and `ErrorType` are included/visible in this header (they likely already are; if not, add the appropriate include that is used elsewhere for throwing `VsagException`).
</issue_to_address>
### Comment 5
<location> `src/quantization/transform_quantization/transform_quantizer_test.cpp:97` </location>
<code_context>
TEST_CASE("TQ Compute", "[ut][TransformQuantizer]") {
constexpr MetricType metrics[1] = {MetricType::METRIC_TYPE_L2SQR};
- std::string tq_chain = GENERATE("rom, pca, fp32", "rom, fp32", "fht, fp32");
+ std::string tq_chain = GENERATE("rom, pca, fp32", "rom, fp32", "fht, fp32", "mrle, fp32");
for (auto dim : dims) {
</code_context>
<issue_to_address>
**issue (testing):** Add a negative test to verify that using MRLE not as the first transformer in the chain correctly throws an exception.
Since the code requires MRLE to be first in the chain (`MRLE must be first if exists`), please also add a negative test that builds a chain like `"rom, mrle, fp32"` (and optionally `"pca, mrle, fp32"`) and asserts that constructing `TransformQuantizer` throws the expected `VsagException` with the correct error type/message. This will ensure the MRLE-position constraint is properly enforced and prevent regressions.
</issue_to_address>Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
| #include "vsag_exception.h" | ||
|
|
||
| namespace vsag { | ||
| struct MRLETMeta : public TransformerMeta {}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
suggestion (typo): The MRLETMeta type name looks inconsistent with the MRLE naming used elsewhere.
If this isn’t intentional, please rename MRLETMeta to MRLEMeta (to match MRLETransformer, VectorTransformerType::MRLE, etc.) so the MRLE-related types stay consistent.
Suggested implementation:
namespace vsag {
struct MRLEMeta : public TransformerMeta {};Renaming the type here will require updating all other usages of MRLETMeta in the codebase to MRLEMeta. Please:
- Search for
MRLETMetain the project and replace each occurrence withMRLEMeta. - If there are any factory/registration mechanisms or type traits that reference
MRLETMeta, update those references as well to keep the MRLE-related types consistent.
|
|
||
| void | ||
| InverseTransform(const float* transformed_vec, float* original_vec) const override { | ||
| throw VsagException(ErrorType::INTERNAL_ERROR, "InverseTransform not implement"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
suggestion: The exception for InverseTransform is a bit misleading and the message is unclear.
Using ErrorType::INTERNAL_ERROR for an unimplemented method blurs the line between real internal failures and unsupported functionality. If available, prefer a more specific error code for "not implemented" or "unsupported", and update the message to something clearer like "InverseTransform not implemented".
Suggested implementation:
void
InverseTransform(const float* transformed_vec, float* original_vec) const override {
// Currently, inverse transformation is not supported for MRLET.
// Use a specific error type to distinguish unsupported functionality
// from real internal errors.
throw VsagException(ErrorType::NOT_IMPLEMENTED, "InverseTransform not implemented");
}
// Copyright 2024-present the vsag project- If
ErrorType::NOT_IMPLEMENTEDdoes not exist in your codebase, replace it with the most appropriate existing value (for exampleUNIMPLEMENTED,UNSUPPORTED, or similar) in both the enum definition and this throw site. - Ensure
VsagExceptionandErrorTypeare included/visible in this header (they likely already are; if not, add the appropriate include that is used elsewhere for throwingVsagException).
| TEST_CASE("TQ Compute", "[ut][TransformQuantizer]") { | ||
| constexpr MetricType metrics[1] = {MetricType::METRIC_TYPE_L2SQR}; | ||
| std::string tq_chain = GENERATE("rom, pca, fp32", "rom, fp32", "fht, fp32"); | ||
| std::string tq_chain = GENERATE("rom, pca, fp32", "rom, fp32", "fht, fp32", "mrle, fp32"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
issue (testing): Add a negative test to verify that using MRLE not as the first transformer in the chain correctly throws an exception.
Since the code requires MRLE to be first in the chain (MRLE must be first if exists), please also add a negative test that builds a chain like "rom, mrle, fp32" (and optionally "pca, mrle, fp32") and asserts that constructing TransformQuantizer throws the expected VsagException with the correct error type/message. This will ensure the MRLE-position constraint is properly enforced and prevent regressions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request introduces a new MRLE vector transformer. The implementation is a good start, but I've found a few issues. There is a critical bug in the MRLETransformer that could lead to a buffer over-read, for which I've suggested a fix. I also found a minor typo in an error message. Additionally, some changes in the test files appear to unintentionally alter the behavior of existing tests for PCA; I've provided suggestions to address this. Please review my comments below.
| TransformerMetaPtr | ||
| Transform(const float* original_vec, float* transformed_vec) const override { | ||
| auto meta = std::make_shared<MRLETMeta>(); | ||
| memcpy(transformed_vec, original_vec, this->output_dim_ * sizeof(float)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The memcpy can read beyond the bounds of original_vec if output_dim_ is greater than input_dim_. This is a critical bug that can cause crashes or unpredictable behavior. The copy size should be limited by input_dim_. Additionally, if output_dim_ is greater than input_dim_, the remainder of transformed_vec will contain uninitialized data, which is then passed to Normalize, leading to incorrect results. The rest of the vector should be zero-padded to ensure correctness.
const auto copy_elements = std::min(static_cast<size_t>(this->input_dim_), static_cast<size_t>(this->output_dim_));
memcpy(transformed_vec, original_vec, copy_elements * sizeof(float));
if (this->output_dim_ > this->input_dim_) {
memset(transformed_vec + copy_elements, 0, (this->output_dim_ - copy_elements) * sizeof(float));
}|
|
||
| void | ||
| InverseTransform(const float* transformed_vec, float* original_vec) const override { | ||
| throw VsagException(ErrorType::INTERNAL_ERROR, "InverseTransform not implement"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| }} | ||
| )"; | ||
| auto param_str = fmt::format(param_template, tq_chain, dim - 1); | ||
| auto param_str = fmt::format(param_template, tq_chain, dim, dim - 1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This change unintentionally alters the behavior of existing PCA tests. pca_dim was previously dim - 1, performing dimension reduction. It is now set to dim, which means PCA no longer reduces dimensionality in this test. To preserve the original test behavior for PCA while accommodating the new mrle_dim parameter, consider providing distinct values that maintain the spirit of the original test.
| auto param_str = fmt::format(param_template, tq_chain, dim, dim - 1); | |
| auto param_str = fmt::format(param_template, tq_chain, dim - 1, dim - 1); |
| }} | ||
| )"; | ||
| auto param_str = fmt::format(param_template, tq_chain, dim - 2); | ||
| auto param_str = fmt::format(param_template, tq_chain, dim, dim - 1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similar to the previous comment, this change alters the behavior of existing PCA tests in TestSerializeDeserializeTQ. pca_dim was dim - 2 and is now dim, removing the dimensionality reduction aspect from this test. To maintain the original test's intent, you should restore the previous value for pca_dim.
| auto param_str = fmt::format(param_template, tq_chain, dim, dim - 1); | |
| auto param_str = fmt::format(param_template, tq_chain, dim - 2, dim - 1); |
Codecov Report❌ Patch coverage is ❌ Your patch check has failed because the patch coverage (71.42%) is below the target coverage (80.00%). You can increase the patch coverage or adjust the target coverage. @@ Coverage Diff @@
## main #1371 +/- ##
==========================================
- Coverage 90.95% 90.95% -0.01%
==========================================
Files 322 323 +1
Lines 18363 18393 +30
==========================================
+ Hits 16702 16729 +27
- Misses 1661 1664 +3
Flags with carried forward coverage won't be shown. Click here to find out more.
Continue to review full report in Codecov by Sentry.
🚀 New features to boost your workflow:
|
Signed-off-by: LHT129 <[email protected]>
wxyucs
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm
inabao
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
inabao
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
closed: #1370
Summary by Sourcery
Add support for an MRLE-based vector transformer in the transform quantization pipeline and enforce ordering constraints when it is used in transformer chains.
New Features:
Enhancements:
Tests: