-
Notifications
You must be signed in to change notification settings - Fork 74
feat(hgraph): support pqr in HGraph #1340
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Reviewer's GuideThis PR extends HGraph to support a new PQR-based reordering strategy alongside the existing flatten reorder. It introduces a Reorder interface and parameter framework, implements PqrReorder (with residual computation, bias management, training, query reordering, serialization), integrates configurable reorder instantiation and lifecycle hooks into HGraph, enhances FlattenDataCell for residual calculation, updates JSON schema/parameter mapping, and augments tests to cover PQR scenarios. Sequence diagram for HGraph instantiation with configurable reorder strategysequenceDiagram
participant HGraph
participant HGraphParameter
participant ReorderParameter
participant FlattenReorder
participant PqrReorder
HGraph->>HGraphParameter: access reorder_param
HGraphParameter->>ReorderParameter: create from JSON
alt reorder_param.name_ == PQR_REORDER
HGraph->>PqrReorder: instantiate with basic_flatten_codes, common_param, reorder_param
else
HGraph->>FlattenReorder: instantiate with high_precise_codes, allocator
end
HGraph->>ReorderInterface: assign to reorder_
Sequence diagram for HGraph lifecycle hooks with Reorder integrationsequenceDiagram
participant HGraph
participant ReorderInterface
participant Dataset
participant StreamWriter
participant StreamReader
participant DistHeap
Note over HGraph: During Train
HGraph->>ReorderInterface: Train(base_data, num_elements)
Note over HGraph: During add_one_point
HGraph->>ReorderInterface: InsertVector(data, inner_id)
Note over HGraph: During resize
HGraph->>ReorderInterface: Resize(new_size)
Note over HGraph: During Serialize
HGraph->>ReorderInterface: Serialize(writer)
Note over HGraph: During Deserialize
HGraph->>ReorderInterface: Deserialize(reader)
Note over HGraph: During KnnSearch/RangeSearch/SearchWithRequest
HGraph->>ReorderInterface: Reorder(query, candidate_heap, k)
Class diagram for new and updated Reorder classesclassDiagram
class ReorderInterface {
<<interface>>
+Reorder(input, query, topk, allocator) const
+InsertVector(vector, id)
+Train(vector, count)
+Resize(new_size)
+Serialize(writer) const
+Deserialize(reader)
-size_
}
class FlattenReorder {
+Reorder(input, query, topk, allocator) const
-flatten_
-allocator_
}
class PqrReorder {
+Reorder(input, query, topk, allocator) const
+InsertVector(vector, id)
+Train(vector, count)
+Resize(new_size)
+Serialize(writer) const
+Deserialize(reader)
-flatten_
-reorder_code_
-allocator_
-bias_
-dim_
-metric_
}
ReorderInterface <|.. FlattenReorder
ReorderInterface <|.. PqrReorder
class ReorderParameter {
+name_
}
class FlattenReorderParameter {
+FromJson(json)
+ToJson() const
}
class PqrReorderParameter {
+residual_param_
+FromJson(json)
+ToJson() const
+CheckCompatibility(other) const
}
ReorderParameter <|.. FlattenReorderParameter
ReorderParameter <|.. PqrReorderParameter
class HGraph {
-reorder_
+reorder(query, candidate_heap, k) const
}
HGraph --> ReorderInterface : uses
Class diagram for FlattenDataCell residual calculation extensionclassDiagram
class FlattenInterface {
+CalResidual(vector, residual, count)
}
class FlattenDataCell {
+CalResidual(vector, residual, count)
+quantizer_
+io_
}
FlattenInterface <|.. FlattenDataCell
File-Level Changes
Possibly linked issues
Tips and commandsInteracting with Sourcery
Customizing Your ExperienceAccess your dashboard to:
Getting Help
|
Summary of ChangesHello @inabao, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request significantly enhances the HGraph index by introducing support for Product Quantization Reorder (PQR). This new reordering strategy aims to improve the accuracy of vector similarity search results. The changes involve a modular redesign of the reordering component, allowing for different reorder implementations, and integrating PQR into the core HGraph lifecycle, from parameter configuration and training to search operations and persistence. Highlights
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey there - I've reviewed your changes - here's some feedback:
- Guard calls to reorder_->Serialize and reorder_->Deserialize behind use_reorder_ (or a null check) to avoid dereferencing a null reorder_ when reordering is disabled.
- Consider centralizing the mapping of reorder-related JSON keys and default values (in DEFAULT_MAP, CheckAndMappingExternalParam, and parameter templates) to reduce duplication and make future changes easier.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- Guard calls to reorder_->Serialize and reorder_->Deserialize behind use_reorder_ (or a null check) to avoid dereferencing a null reorder_ when reordering is disabled.
- Consider centralizing the mapping of reorder-related JSON keys and default values (in DEFAULT_MAP, CheckAndMappingExternalParam, and parameter templates) to reduce duplication and make future changes easier.
## Individual Comments
### Comment 1
<location> `tests/test_hgraph.cpp:584-585` </location>
<code_context>
+ auto index = TestIndex::TestFactory(test_index->name, param, true);
+ auto dataset = HGraphTestIndex::pool.GetDatasetAndCreate(
+ dim, resource->base_count, metric_type);
+ TestIndex::TestContinueAdd(index, dataset, true);
+ HGraphTestIndex::TestGeneral(index, dataset, search_param, recall);
+ vsag::Options::Instance().set_block_size_limit(origin_size);
+ }
</code_context>
<issue_to_address>
**suggestion (testing):** No direct assertion of PQR reorder-specific outputs or behaviors.
Consider adding assertions that specifically test PQR reorder outputs, such as verifying reordered results and correct computation/storage of bias and residuals.
Suggested implementation:
```cpp
auto index = TestIndex::TestFactory(test_index->name, param, true);
auto dataset = HGraphTestIndex::pool.GetDatasetAndCreate(
dim, resource->base_count, metric_type);
TestIndex::TestContinueAdd(index, dataset, true);
HGraphTestIndex::TestGeneral(index, dataset, search_param, recall);
// PQR reorder-specific assertions
// 1. Check that reordered results exist and are valid
auto reordered_results = index->GetReorderedResults();
ASSERT_FALSE(reordered_results.empty());
// Optionally, check ordering or values
for (size_t i = 1; i < reordered_results.size(); ++i) {
ASSERT_LE(reordered_results[i-1].score, reordered_results[i].score);
}
// 2. Check that bias and residuals are computed and stored
auto bias = index->GetBias();
auto residuals = index->GetResiduals();
ASSERT_FALSE(bias.empty());
ASSERT_FALSE(residuals.empty());
ASSERT_EQ(bias.size(), resource->base_count);
ASSERT_EQ(residuals.size(), resource->base_count);
vsag::Options::Instance().set_block_size_limit(origin_size);
}
```
- The code assumes the existence of `GetReorderedResults()`, `GetBias()`, and `GetResiduals()` methods on the index object. If these do not exist, you will need to implement them in the index class.
- You may want to adjust the assertions to match the actual data structures and expected values in your test context.
- If you use a different assertion framework, replace `ASSERT_FALSE` and `ASSERT_EQ` with the appropriate macros.
</issue_to_address>Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request introduces support for PQR (Product Quantization Residuals) reordering in HGraph, which is a significant feature enhancement. The changes are extensive, touching the HGraph lifecycle from construction and training to search and serialization to properly integrate the new reordering strategy. My review identified a critical bug in the PQR distance calculation logic and a potential null pointer dereference. I have also included several suggestions to improve code safety, clarity, and adherence to modern C++ practices. Addressing these points will improve the robustness and maintainability of the new feature.
| auto pqr_reorder_param = std::dynamic_pointer_cast<PqrReorderParameter>(reorder_param); | ||
| reorder_code_ = | ||
| FlattenInterface::MakeInstance(pqr_reorder_param->residual_param_, inner_common); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The result of std::dynamic_pointer_cast is not checked. If reorder_param is not of type PqrReorderParameter, pqr_reorder_param will be nullptr, and dereferencing it on the next line will cause a crash. It's crucial to add a null check here to handle potential invalid parameter types gracefully.
auto pqr_reorder_param = std::dynamic_pointer_cast<PqrReorderParameter>(reorder_param);
if (pqr_reorder_param == nullptr) {
throw VsagException(ErrorType::INVALID_ARGUMENT, "Invalid reorder parameter type for PqrReorder");
}
reorder_code_ =
FlattenInterface::MakeInstance(pqr_reorder_param->residual_param_, inner_common);There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
| float final_dist = candidate_result[i].first; | ||
| dists[i] = 1 - dists[i]; | ||
| if (metric_ == MetricType::METRIC_TYPE_L2SQR) { | ||
| final_dist += bias_[ids[i]]; | ||
| dists[i] *= 2; | ||
| } | ||
| reorder_heap->Push(final_dist - dists[i], ids[i]); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The logic for calculating the reordered distance is incorrect and difficult to understand. The line dists[i] = 1 - dists[i]; is applied unconditionally, which leads to incorrect distance calculations for all metric types. For L2, it introduces an error, and for Cosine/IP, it also produces the wrong value. I suggest refactoring this loop for correctness and clarity.
float final_dist = candidate_result[i].first;
if (metric_ == MetricType::METRIC_TYPE_L2SQR) {
final_dist += bias_[ids[i]] - 2 * dists[i];
} else {
final_dist -= dists[i];
}
reorder_heap->Push(final_dist, ids[i]);There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The reorder_code_ uses a fixed IP as the distance metric, so there should be no issues with this
| #include "storage/stream_writer.h" | ||
| namespace vsag { | ||
|
|
||
| static const int64_t TRAIN_MAX_SIZE = 65535; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| FP32ComputeIP(residual_vector.data(), residual_vector.data(), dim_); | ||
| } | ||
| if (need_release) { | ||
| allocator_->Deallocate((void*)codes); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using a C-style cast (void*) to cast away const is unsafe and hides the intent. It's better to use const_cast to make it explicit that const is being cast away, which improves code safety and readability.
| allocator_->Deallocate((void*)codes); | |
| allocator_->Deallocate(const_cast<uint8_t*>(codes)); |
src/impl/reorder/pqr_reorder.cpp
Outdated
|
|
||
| void | ||
| PqrReorder::Train(const void* vector, uint64_t count) { | ||
| int64_t train_size = count > TRAIN_MAX_SIZE ? TRAIN_MAX_SIZE : count; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using std::min is clearer than a ternary operator for capping a value. Additionally, since count is uint64_t, train_size should also be an unsigned type to avoid potential signed/unsigned comparison issues.
| int64_t train_size = count > TRAIN_MAX_SIZE ? TRAIN_MAX_SIZE : count; | |
| const uint64_t train_size = std::min(count, static_cast<uint64_t>(TRAIN_MAX_SIZE)); |
| DEFINE_POINTER(FlattenReorderParameter); | ||
| class FlattenReorderParameter : public ReorderParameter { | ||
| public: | ||
| explicit FlattenReorderParameter() : ReorderParameter(std::move(FLATTEN_REORDER)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using std::move on a const char* literal is unnecessary as it has no effect and can be misleading. The std::string constructor will be called regardless. Removing it makes the code cleaner.
| explicit FlattenReorderParameter() : ReorderParameter(std::move(FLATTEN_REORDER)) { | |
| explicit FlattenReorderParameter() : ReorderParameter(FLATTEN_REORDER) { |
Codecov Report❌ Patch coverage is ❌ Your patch check has failed because the patch coverage (78.72%) is below the target coverage (80.00%). You can increase the patch coverage or adjust the target coverage. @@ Coverage Diff @@
## main #1340 +/- ##
==========================================
+ Coverage 91.83% 91.91% +0.07%
==========================================
Files 320 322 +2
Lines 17857 17901 +44
==========================================
+ Hits 16399 16453 +54
+ Misses 1458 1448 -10
Flags with carried forward coverage won't be shown. Click here to find out more.
Continue to review full report in Codecov by Sentry.
🚀 New features to boost your workflow:
|
Signed-off-by: jinjiabao.jjb <[email protected]>
Signed-off-by: jinjiabao.jjb <[email protected]>
bad3f61 to
76c4906
Compare
#1252
Summary by Sourcery
Enable HGraph to support a PQR reordering stage by adding a PqrReorder implementation, wiring it into the HGraph lifecycle, exposing configurable reorder parameters, and covering it with new tests
New Features:
Enhancements:
Build:
Tests: