feat: support merge_insert with source dedupe on first seen value #5603
Conversation
**Code Review**

Summary: This PR adds support for deduplicating the source on the first seen value during `merge_insert`.

**P1: Performance Issue in Duplicate Filtering**

```rust
let keep_mask: BooleanArray = (0..matched.num_rows())
    .map(|i| Some(!duplicate_indices.contains(&i)))
    .collect();
```

This is O(n*m), where n is the batch size and m is the number of duplicates. For batches with many duplicates, this becomes expensive. Consider using a `HashSet`:

```rust
let duplicate_set: std::collections::HashSet<_> = duplicate_indices.iter().collect();
let keep_mask: BooleanArray = (0..matched.num_rows())
    .map(|i| Some(!duplicate_set.contains(&i)))
    .collect();
```

**Other Observations**

Overall this is a straightforward and well-tested feature addition. The main actionable item is the performance concern above.
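To make the complexity argument concrete, here is a minimal standalone sketch of the reviewer's suggestion (names hypothetical, not Lance's actual code): membership checks against a `HashSet` are O(1), so building the keep mask becomes O(n + m) instead of O(n*m).

```rust
use std::collections::HashSet;

// Build a keep mask for `num_rows` rows, dropping the rows whose index
// appears in `duplicate_indices`. The HashSet makes each membership
// check O(1) rather than a linear scan of the duplicate list.
fn keep_mask(num_rows: usize, duplicate_indices: &[usize]) -> Vec<bool> {
    let duplicate_set: HashSet<usize> = duplicate_indices.iter().copied().collect();
    (0..num_rows).map(|i| !duplicate_set.contains(&i)).collect()
}

fn main() {
    // Rows 1 and 3 are duplicates and should be filtered out.
    let mask = keep_mask(5, &[1, 3]);
    assert_eq!(mask, vec![true, false, true, false, true]);
    println!("{:?}", mask);
}
```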
Force-pushed: 4e97413 to 8561b43
wjones127
left a comment
Looks good!
```rust
let row_ids = matched.column(row_id_col).as_primitive::<UInt64Type>();

let mut processed_row_ids = self.processed_row_ids.lock().unwrap();
let mut duplicate_indices = Vec::new();
```
It would be more efficient to just track `keep_indices`, then later directly call `take` on the record batch with those indices. Then you don't need to construct the hash set or boolean mask.
Based on feedback in #5582
Simplified implementation that just keeps the first value in case of duplicated rows in the source during merge insert. Users are expected to sort the source properly before using merge-insert.
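An illustrative sketch of the "first seen value" semantics (not Lance's actual code): rows whose key has already been seen are dropped, so source order determines which duplicate survives.

```rust
use std::collections::HashSet;

// Keep only the first row for each key; later rows with the same key
// are dropped. `HashSet::insert` returns false for keys already seen.
fn dedupe_first_seen(rows: &[(u64, i32)]) -> Vec<(u64, i32)> {
    let mut seen = HashSet::new();
    rows.iter()
        .filter(|(key, _)| seen.insert(*key))
        .cloned()
        .collect()
}

fn main() {
    // Key 1 appears twice; only its first value (10) is kept.
    let src = vec![(1u64, 10), (2, 20), (1, 99)];
    assert_eq!(dedupe_first_seen(&src), vec![(1, 10), (2, 20)]);
}
```

Because the first occurrence wins, sorting the source before calling merge-insert controls which duplicate is retained, as the description notes.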