feat: support merge_insert with source dedupe on first seen value #5603
Conversation
**Code Review**

Summary: This PR adds support for deduplicating the source on the first seen value during `merge_insert`.

**P1: Performance Issue in Duplicate Filtering**

```rust
let keep_mask: BooleanArray = (0..matched.num_rows())
    .map(|i| Some(!duplicate_indices.contains(&i)))
    .collect();
```

This is O(n*m), where n is the batch size and m is the number of duplicates. For batches with many duplicates, this becomes expensive. Consider using a `HashSet`:

```rust
let duplicate_set: std::collections::HashSet<_> = duplicate_indices.iter().collect();
let keep_mask: BooleanArray = (0..matched.num_rows())
    .map(|i| Some(!duplicate_set.contains(&i)))
    .collect();
```

**Other Observations**

Overall this is a straightforward and well-tested feature addition. The main actionable item is the performance concern above.
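To make the complexity argument concrete, here is a minimal standalone sketch of the reviewer's suggestion (names hypothetical, not Lance's actual code): membership checks against a `HashSet` are O(1), so building the keep mask becomes O(n + m) instead of O(n*m).

```rust
use std::collections::HashSet;

// Build a keep mask for `num_rows` rows, dropping the rows whose index
// appears in `duplicate_indices`. The HashSet makes each membership
// check O(1) rather than a linear scan of the duplicate list.
fn keep_mask(num_rows: usize, duplicate_indices: &[usize]) -> Vec<bool> {
    let duplicate_set: HashSet<usize> = duplicate_indices.iter().copied().collect();
    (0..num_rows).map(|i| !duplicate_set.contains(&i)).collect()
}

fn main() {
    // Rows 1 and 3 are duplicates and should be filtered out.
    let mask = keep_mask(5, &[1, 3]);
    assert_eq!(mask, vec![true, false, true, false, true]);
    println!("{:?}", mask);
}
```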
Force-pushed: 4e97413 to 8561b43
wjones127
left a comment
Looks good!
```rust
let row_ids = matched.column(row_id_col).as_primitive::<UInt64Type>();

let mut processed_row_ids = self.processed_row_ids.lock().unwrap();
let mut duplicate_indices = Vec::new();
```
It would be more efficient to just track `keep_indices`, then later directly call `take` on the record batch with those indices. Then you don't need to construct the hash set or boolean mask.
Based on feedback in #5582
Simplified implementation that just keeps the first value in case of duplicated rows in the source during merge insert. Users are expected to sort the source properly before using merge-insert.
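An illustrative sketch of the "first seen value" semantics (not Lance's actual code): rows whose key has already been seen are dropped, so source order determines which duplicate survives.

```rust
use std::collections::HashSet;

// Keep only the first row for each key; later rows with the same key
// are dropped. `HashSet::insert` returns false for keys already seen.
fn dedupe_first_seen(rows: &[(u64, i32)]) -> Vec<(u64, i32)> {
    let mut seen = HashSet::new();
    rows.iter()
        .filter(|(key, _)| seen.insert(*key))
        .cloned()
        .collect()
}

fn main() {
    // Key 1 appears twice; only its first value (10) is kept.
    let src = vec![(1u64, 10), (2, 20), (1, 99)];
    assert_eq!(dedupe_first_seen(&src), vec![(1, 10), (2, 20)]);
}
```

Because the first occurrence wins, sorting the source before calling merge-insert controls which duplicate is retained, as the description notes.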