Skip to content

Conversation

@jackye1995
Copy link
Contributor

Based on feedback in #5582

Simplified implementation that just keep the first value in case of duplicated on rows in source during merge insert. Users are expected to sort the source properly before using merge-insert.

@jackye1995 jackye1995 requested a review from wjones127 December 31, 2025 06:09
@github-actions github-actions bot added the enhancement New feature or request label Dec 31, 2025
@github-actions
Copy link
Contributor

Code Review

Summary: This PR adds a DedupeBehavior option to merge_insert that allows keeping the first-seen value when duplicate source rows match the same target row, instead of failing.

P1: Performance Issue in Duplicate Filtering

In merge_insert.rs around line 2075-2082, the duplicate filtering uses:

let keep_mask: BooleanArray = (0..matched.num_rows())
    .map(|i| Some(!duplicate_indices.contains(&i)))
    .collect();

This is O(n*m) where n is the batch size and m is the number of duplicates. For batches with many duplicates, this becomes expensive. Consider using a HashSet for O(1) lookups:

let duplicate_set: std::collections::HashSet<_> = duplicate_indices.iter().collect();
let keep_mask: BooleanArray = (0..matched.num_rows())
    .map(|i| Some(!duplicate_set.contains(&i)))
    .collect();

Other Observations

  1. Test Coverage: Good parameterized test covering multiple configurations (full schema, stable row IDs, file versions).

  2. API Design: The builder pattern with dedupe_behavior() follows existing conventions well.

  3. Stats Tracking: The num_skipped_duplicates metric is properly propagated through both code paths (full-schema and non-full-schema merge insert).

Overall this is a straightforward and well-tested feature addition. The main actionable item is the performance concern above.

@codecov
Copy link

codecov bot commented Dec 31, 2025

Codecov Report

❌ Patch coverage is 90.76087% with 17 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
rust/lance/src/dataset/write/merge_insert.rs 86.50% 15 Missing and 2 partials ⚠️

📢 Thoughts on this report? Let us know!

@jackye1995 jackye1995 force-pushed the merge-insert-duplicated-simplified branch from 4e97413 to 8561b43 Compare January 8, 2026 19:59
@jackye1995 jackye1995 changed the title feat: support merge_insert with dedupe using first seen value feat: support merge_insert with source dedupe on first seen value Jan 8, 2026
Copy link
Contributor

@wjones127 wjones127 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good!

let row_ids = matched.column(row_id_col).as_primitive::<UInt64Type>();

let mut processed_row_ids = self.processed_row_ids.lock().unwrap();
let mut duplicate_indices = Vec::new();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be more efficient to just track keep_indices, then later you can directly call take on the record batch with those indices. Then you don't need to construct the hashmap or boolean mask.

@jackye1995 jackye1995 merged commit f1e398d into lance-format:main Jan 9, 2026
26 of 28 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants