feat: support merge insert with duplicated rows in source #5582
Conversation
Code Review: feat: support merge insert with duplicated rows in source

Summary

This PR adds dedupe support to merge_insert.

Review

P0 - Critical Issue: Memory Growth with Large Datasets

The dedupe implementation buffers entire batches in `buffered_batches`. For a dataset with N unique keys spread across M batches where most updates are scattered (not clustered by key), this requires O(M) batch storage. Consider:
P1 - Potential Issue: Stable Row IDs + Dedupe Mode

Combining stable row IDs with dedupe mode could cause issues if:
Consider adding a test case for this scenario.

P1 - NULL Handling Semantics

The PR mentions "NULL handling follows DataFusion semantics for sort ordering" but doesn't document or test this behavior explicitly.
Consider adding a test case with NULL values in the dedupe column to document the expected behavior.

Minor Notes
Automated review by Claude Code
Code Review Summary

This PR adds support for handling duplicate rows in source data during merge_insert operations via `dedupe_by` and `dedupe_sort_options`.

P1: Memory Concern with Buffered Batches

Issue: In dedupe mode, source batches are buffered in memory until they are no longer referenced.

Concern: While there is a reference counting mechanism to clear batches when no longer referenced, in the worst case (many distinct keys with duplicates spread across all batches), all batches will be retained. The PR does have tests with 1000 rows, but production workloads with millions of rows could hit memory limits.

Recommendation: Consider adding documentation about memory implications for very large source datasets with many duplicates, or consider whether a streaming approach with external sorting could be used for extremely large inputs.

P1: Missing NULL Value Test

Issue: The documentation states "NULL handling follows DataFusion semantics for sort ordering" but there are no explicit tests verifying behavior when the dedupe column contains NULL values.

Recommendation: Add a test case that includes NULL values in the dedupe column to verify the expected behavior (typically NULLs sort first or last depending on ordering).

Overall, the implementation is clean and well-tested for the core use cases. The API design follows the existing builder pattern nicely. The feature is properly restricted to the v2 path, with a clear error message for unsupported configurations.
Code Review

This PR adds dedupe support to merge_insert.

P0/P1 Issues

P1 - Memory concern with dedupe buffering: in dedupe mode, source batches are buffered in memory, so in the worst case the entire source may be retained until finalization.
P1 - Test coverage for edge cases: The tests cover basic ascending/descending behavior well, but consider adding tests for:
Minor observations (not blockers)
Summary

The implementation is well-structured with clear separation of concerns (accumulation during streaming, finalization at end). The ref-counting approach for batch memory management is clever. The main consideration is documenting memory behavior for very large source datasets with high key cardinality.
Code Review

This PR adds dedupe support to merge_insert.

Key Observations

Memory Usage Concern (P1): Consider documenting this memory trade-off in the API documentation, or, as future work, implementing a more memory-efficient approach (e.g., streaming dedupe using a sort + aggregate approach). The batch clearing mechanism looks correct: batches are set to `None` once they are no longer referenced.

Minor Notes
Overall this is a well-implemented feature with good tests.
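Several of these reviews describe the same buffering scheme: each source batch is held while some pending deduped row still points into it, and its slot is cleared once nothing references it. Below is a minimal standalone sketch of that general pattern; the `BufferedBatches` type, its fields, and the error handling are illustrative assumptions, not the PR's actual code.

```rust
/// Illustrative sketch (not the PR's implementation): buffered source batches with
/// per-batch reference counts. A slot is set to `None` as soon as no pending
/// deduped row references it, so memory can be released before finalization.
struct BufferedBatches<B> {
    batches: Vec<Option<B>>, // one slot per buffered source batch
    ref_counts: Vec<usize>,  // pending rows still pointing into each batch
}

impl<B> BufferedBatches<B> {
    fn new() -> Self {
        Self { batches: Vec::new(), ref_counts: Vec::new() }
    }

    /// Buffer a new batch and return its index.
    fn push(&mut self, batch: B) -> usize {
        self.batches.push(Some(batch));
        self.ref_counts.push(0);
        self.batches.len() - 1
    }

    /// Record that a pending deduped row references `batch_idx`.
    fn add_ref(&mut self, batch_idx: usize) {
        self.ref_counts[batch_idx] += 1;
    }

    /// Drop one reference; clear the slot once nothing points at it anymore.
    fn release(&mut self, batch_idx: usize) {
        self.ref_counts[batch_idx] -= 1;
        if self.ref_counts[batch_idx] == 0 {
            self.batches[batch_idx] = None; // free the batch's memory early
        }
    }

    /// Finalization-style access: return an error instead of panicking if a slot
    /// was cleared too early (the failure mode the reviews flag).
    fn get(&self, batch_idx: usize) -> Result<&B, String> {
        self.batches[batch_idx]
            .as_ref()
            .ok_or_else(|| format!("batch {batch_idx} was already released"))
    }
}

fn main() {
    let mut buf = BufferedBatches::new();
    let b0 = buf.push(vec![1, 2, 3]); // stand-in for an Arrow RecordBatch
    buf.add_ref(b0);
    buf.release(b0); // last reference gone -> slot cleared
    assert!(buf.get(b0).is_err());
}
```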
Code Review Summary

This PR adds dedupe support to merge_insert.

Overall Assessment

The implementation is well-designed with comprehensive test coverage across Rust, Python, and Java bindings. The deduplication logic follows DataFusion/SQL NULL semantics correctly.

P0/P1 Issues

P1: Memory concern in dedupe mode

In dedupe mode, source batches are buffered in memory until they are no longer referenced. Consider:
This is acceptable for the initial implementation but worth documenting as a known limitation.

P1: Edge case - dedupe column not found at runtime

Minor Observations (not blocking)
Tests

Good test coverage including:
Code Review

This PR adds a useful feature to handle duplicate rows in merge_insert source data by specifying a dedupe column and sort options.

P1 Issues

1. Memory concern with unbounded buffering in dedupe mode

In dedupe mode, source batches are buffered in memory without a bound. Consider:
2. Potential performance issue with scalar comparisons

Suggestions (non-blocking)
Overall, the implementation is solid with comprehensive tests covering ascending/descending ordering, NULL handling, and the case where the dedupe column exists only in the source.
Code Review

This PR adds deduplication support for merge_insert when source data contains duplicate rows matching the same target row. The implementation uses a "running dedupe" approach during the merge phase rather than pre-aggregation, which is memory-efficient for streaming workloads.

Overall Assessment

The implementation is well-structured with good test coverage across Python, Java, and Rust. The design choice to dedupe during merge rather than during join is appropriate for avoiding multiple passes.

Issues

P1: Memory Management in Dedupe Mode

```rust
// In finalize_dedupe around line 2339
for (batch_idx, rows) in batch_rows {
    let buffered_batch = self.buffered_batches[batch_idx].as_ref().ok_or_else(|| {
        // This error path leaves other buffered_batches allocated
```

P1: Potential Panic with Empty Batches

Minor Observations (not blocking)
Tests

Good coverage including:
The tests appropriately verify both the happy path and error conditions.
Code Review Summary

This PR adds support for handling duplicate rows in source data during merge insert operations by introducing `dedupe_by` and `dedupe_sort_options`.

P0/P1 Issues

1. Memory accumulation in dedupe mode (P1 - Performance)

The dedupe implementation buffers all batches in `buffered_batches`. Consider documenting the memory implications for users or adding a warning when dedupe mode is used with very large datasets.

2. Potential panic in finalize_dedupe (P1 - Bug)

In finalize_dedupe:

```rust
let buffered_batch = self.buffered_batches[batch_idx].as_ref().ok_or_else(|| ...)?;
```

This returns an error, but the code assumes ref-counting prevents this. If there is a logic error in ref-count management, this could cause unexpected failures during finalization. The error message is clear, but consider adding a debug assertion to catch ref-count mismatches earlier.

3. Missing test for edge case: equal dedupe values across batches (P1 - Test coverage)

The tests cover ascending/descending ordering and large datasets, but there is no explicit test for the error case when two rows across different batches have equal dedupe values.

Other Observations
Code Review: feat: support merge insert with duplicated rows in source

Summary

This PR adds dedupe functionality to merge_insert v2, allowing users to specify a column and sort order to resolve duplicate source rows matching the same target row. The implementation uses a running dedupe strategy during merge to avoid multiple passes.

P0/P1 Issues

P1: Memory concern with batch buffering

The dedupe implementation buffers entire batches in memory via `buffered_batches`. Consider documenting this memory trade-off for users, or adding a threshold check that falls back to an error when buffered data exceeds a limit.

P1: Error message clarity for equal dedupe values

When two rows have equal dedupe values (line ~2178):

"Cannot deduplicate: multiple rows have the same value ({}) in the dedupe column"

This error could be confusing since users might expect tie-breaking behavior. Consider:
Minor Observations (not blocking)
Reviewed with focus on P0/P1 issues per repository guidelines
Code Review Summary

This PR adds dedupe functionality to merge_insert operations, allowing users to handle source data with duplicate keys by specifying a column to use for deduplication and sort options to control which row to keep.

Overall Assessment

The implementation is well-structured with comprehensive test coverage across Rust, Python, and Java bindings. The API design using the builder pattern is consistent with the rest of merge_insert.

P0/P1 Issues

P0 - Python API inconsistency with Rust/Java: In the Python docstring:

```python
# In docstring examples:
builder.dedupe_by("timestamp").dedupe_ordering("descending")  # Wrong method name
builder.dedupe_by("version").dedupe_ordering("ascending")  # Wrong method name
```

The docstring should be updated to use the actual method names.

P1 - Missing test coverage for nulls_first behavior: While there are extensive tests for ascending/descending deduplication, there are no tests that verify the nulls_first behavior.

Minor Suggestions (Non-blocking)
🤖 Review generated by Claude Code |
So far, merge_insert requires the source table to have no duplicated rows, because if two rows match a row in the target, it is ambiguous which one to merge into the target.
This PR introduces two concepts:
- `dedupe_by`, which allows the user to specify a column (must be a primitive type) in the source used to dedupe duplicate rows, and
- `dedupe_sort_options`, which allows the user to specify which rows take precedence. The ordering can be ascending or descending, nulls first or last (similar to Arrow `SortOptions`); the actual ordering of the data follows DataFusion semantics.

When running merge insert with these two parameters set, we decide during the merge phase which row to merge using the value of the dedupe column and, based on the dedupe order, which one to keep. If there are multiple rows with the same dedupe value, we fail the operation since it is again ambiguous which one to merge into the target.
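A rough standalone illustration of those precedence rules follows. This is a sketch, not the crate's code: the local `SortOptions` struct only mimics Arrow's, and the assumption that the row sorting first under the given options wins is mine; the PR text does not spell that out.

```rust
use std::cmp::Ordering;

/// Arrow-style sort options (a local stand-in for illustration, not the arrow crate type).
#[derive(Clone, Copy)]
struct SortOptions {
    descending: bool,
    nulls_first: bool,
}

/// Decide whether `candidate` should replace `current` when both source rows match
/// the same target row. Assumption: the row that sorts *first* under the given
/// options takes precedence; an exact tie is ambiguous and fails.
fn takes_precedence(
    candidate: Option<i64>,
    current: Option<i64>,
    opts: SortOptions,
) -> Result<bool, String> {
    let ord = match (candidate, current) {
        (None, None) => Ordering::Equal,
        (None, Some(_)) => if opts.nulls_first { Ordering::Less } else { Ordering::Greater },
        (Some(_), None) => if opts.nulls_first { Ordering::Greater } else { Ordering::Less },
        (Some(a), Some(b)) => {
            let o = a.cmp(&b);
            if opts.descending { o.reverse() } else { o }
        }
    };
    if ord == Ordering::Equal {
        return Err("multiple rows share the same dedupe value; merge is ambiguous".to_string());
    }
    Ok(ord == Ordering::Less)
}

fn main() {
    let latest_wins = SortOptions { descending: true, nulls_first: false };
    // version 5 sorts before version 3 when descending, so it takes precedence
    assert_eq!(takes_precedence(Some(5), Some(3), latest_wins), Ok(true));
    // a NULL loses to any value when nulls_first is false
    assert_eq!(takes_precedence(None, Some(3), latest_wins), Ok(false));
}
```

Under these assumptions, `descending: true` would keep, for example, the row with the newest timestamp or highest version.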
For the actual execution, I chose to do a running dedupe during merge, rather than doing an aggregation to get the row to merge during the join phase, to avoid multiple passes against the source dataset.
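A minimal single-pass sketch of such a running dedupe over in-memory tuples, again assuming that with descending order the larger dedupe value wins and that an exact tie aborts the operation; this illustrates the strategy, not the actual merge operator.

```rust
use std::collections::HashMap;

/// One source row: merge key, dedupe column value, and some payload.
type Row = (String, i64, &'static str);

/// Single pass over a stream of batches, keeping one winning row per key.
/// "descending" here simply means the larger dedupe value wins; an exact tie
/// fails, matching the ambiguity rule described above.
fn running_dedupe(batches: &[Vec<Row>], descending: bool) -> Result<HashMap<String, Row>, String> {
    let mut winners: HashMap<String, Row> = HashMap::new();
    for batch in batches {
        for row in batch {
            let candidate_wins = match winners.get(&row.0) {
                None => true, // first row seen for this key
                Some(current) => {
                    if row.1 == current.1 {
                        return Err(format!(
                            "Cannot deduplicate: multiple rows have the same value ({}) in the dedupe column",
                            row.1
                        ));
                    }
                    if descending { row.1 > current.1 } else { row.1 < current.1 }
                }
            };
            if candidate_wins {
                winners.insert(row.0.clone(), row.clone());
            }
        }
    }
    Ok(winners)
}

fn main() {
    // Two batches with duplicate key "a"; with descending order the row carrying
    // the larger dedupe value (3) is the one that gets merged.
    let batches = vec![
        vec![("a".to_string(), 1, "old"), ("b".to_string(), 7, "only")],
        vec![("a".to_string(), 3, "new")],
    ];
    let winners = running_dedupe(&batches, true).unwrap();
    assert_eq!(winners["a"].2, "new");
}
```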
The implementation also works with a nested field as the dedupe column, and it works if the dedupe column only exists in the source, not in the target (which is the main use case: merging a MemTable into the underlying table).
Note: this feature is for merge_insert v2 only.