feat: support when_matched_delete in merge_insert #4939

jtuglu1 · 2025-10-13T05:30:59Z

codecov-commenter · 2025-10-14T02:50:57Z

Codecov Report

❌ Patch coverage is 95.83333% with 1 line in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
...nce/src/dataset/write/merge_insert/logical_plan.rs	87.50%	1 Missing ⚠️


jtuglu1 · 2025-10-14T16:37:08Z

@wjones127 I plan to push some changes to bump the coverage a bit. The missing coverage comes from the explain/analyze plan methods. To future-proof this file, I think it would be a good idea to start standardizing how these various combinations of configs are unit tested together. We can always try to cover all relevant combinations with targeted tests, but I'm not sure how scalable that is. Maybe there's a better approach with parameterized tests, etc.?
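One way to picture the "test every combination" idea is to generate the cartesian product of the options and loop over it. The sketch below is only an illustration: the enums and the `Case` struct are simplified, hypothetical stand-ins defined locally, not the actual Lance types, and the per-case body is left as a placeholder.

```rust
// Hypothetical, simplified stand-ins for the merge-insert options.
// The real types in the codebase may have more variants and carry payloads.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum WhenMatched {
    DoNothing,
    UpdateAll,
    Delete,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum WhenNotMatched {
    DoNothing,
    InsertAll,
}

/// One configuration to exercise in a test run.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Case {
    when_matched: WhenMatched,
    when_not_matched: WhenNotMatched,
    full_schema: bool,
}

/// Build the cartesian product of all options so that no combination
/// is skipped by accident when a new variant is added to a list below.
fn all_cases() -> Vec<Case> {
    let mut cases = Vec::new();
    for when_matched in [
        WhenMatched::DoNothing,
        WhenMatched::UpdateAll,
        WhenMatched::Delete,
    ] {
        for when_not_matched in [WhenNotMatched::DoNothing, WhenNotMatched::InsertAll] {
            for full_schema in [true, false] {
                cases.push(Case {
                    when_matched,
                    when_not_matched,
                    full_schema,
                });
            }
        }
    }
    cases
}

fn main() {
    let cases = all_cases();
    for case in &cases {
        // Placeholder: a real test would build a dataset, run the merge with
        // this configuration, and compare against an expected result.
        println!("{case:?}");
    }
    println!("{} combinations", cases.len());
}
```

Since the tests in this PR already use `rstest`, its `#[values(...)]` argument attribute could likely express the same product declaratively, with each combination reported as a separate test case.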

wjones127 · 2025-10-14T17:22:24Z

rust/lance/src/dataset/write/merge_insert.rs

+    #[rstest::rstest]
+    #[tokio::test]
+    async fn test_when_matched_delete_no_matches(


One case you are missing that is likely to be important: Doing WhenMatched::Delete with just a list of ids (omitting all other columns). Could you add that?

I think when you do that, you'll find we are missing some implementation on one of the code paths.

Yeah, @wjones127, non-full-schema matching merges will break with this. From what I can see, there are three code paths (fast path conditions, full_schema=true, full_schema=false). For my own understanding, why do these code paths need to be different? Is there a way we can condense them somehow? At the very least, it might be good to document this for future changes (I can do this).

Yes. We are in the middle of a refactor to consolidate these code paths. That refactor is tracked in this milestone: https://github.com/lancedb/lance/milestone/7

In the end all of them should go through the fast path. If you want to just say that non-full schema isn't supported for now, that's okay.

Gotcha. What do you recommend? I think the primary use case of this feature might be non-full-schema batch deletes (like the example you gave above), but I don't want to create more work for you folks if the impl will be converted to use DataFusion anyway.

WhenMatched::Delete will probably only be used in cases where the only resulting action is Delete. In that case, the current fast path physical plan probably won't be used in the end anyway. I think it would be worth creating a specific fast code path for it, which generates a specialized physical plan for bulk deletion. The first part of the plan can be the same, but the write node at the end should be specialized for deletion.

This is because, in the delete case, the final write step doesn't need to read all the columns, not even those of the source data. They should be projected away before the join, since all you need to do procedurally is:

1. Join the id columns to find the matching row ids

2. Apply the row id deletions

So I would look into creating a specific physical write node that handles that step (2).
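The two-step procedure above can be sketched on plain in-memory data. This is a toy model under stated assumptions, not the Lance implementation: `TargetRow`, `matching_row_addrs`, and `apply_row_deletions` are hypothetical names, the join is a simple hash lookup, and the "deletion" is just a set of row addresses standing in for whatever deletion record the real write node would produce.

```rust
use std::collections::{BTreeSet, HashSet};

/// One row of the target table as seen by the delete-only path:
/// a physical row address plus the join key. All other columns are
/// assumed to have been projected away before the join.
struct TargetRow {
    row_addr: u64,
    id: i64,
}

/// Step 1: join on the id column to find the addresses of matching rows.
/// The source contributes only its ids; no other source column is read.
fn matching_row_addrs(target: &[TargetRow], source_ids: &[i64]) -> BTreeSet<u64> {
    let keys: HashSet<i64> = source_ids.iter().copied().collect();
    target
        .iter()
        .filter(|row| keys.contains(&row.id))
        .map(|row| row.row_addr)
        .collect()
}

/// Step 2: record the deletions. Nothing is rewritten; the matched
/// addresses are added to a deletion set. Returns how many rows were
/// newly marked as deleted.
fn apply_row_deletions(deleted: &mut BTreeSet<u64>, addrs: &BTreeSet<u64>) -> usize {
    let before = deleted.len();
    deleted.extend(addrs.iter().copied());
    deleted.len() - before
}

fn main() {
    let target = vec![
        TargetRow { row_addr: 0, id: 10 },
        TargetRow { row_addr: 1, id: 11 },
        TargetRow { row_addr: 2, id: 12 },
        TargetRow { row_addr: 3, id: 13 },
    ];
    // Source data is just a list of ids; 99 has no match in the target.
    let source_ids = [11, 13, 99];

    let addrs = matching_row_addrs(&target, &source_ids);
    let mut deleted = BTreeSet::new();
    let newly_deleted = apply_row_deletions(&mut deleted, &addrs);
    println!("marked {newly_deleted} rows as deleted: {deleted:?}");
}
```

The point of the sketch is that neither step touches any column other than the join key and the row address, which is why a dedicated write node can skip the regular write step entirely.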

In particular, the necessary_children_exprs() implementation can just say it needs _rowaddr:

https://github.com/lancedb/lance/blob/89c266f5ab442c778d37a82b01a09a6f912ce812/rust/lance/src/dataset/write/merge_insert/logical_plan.rs#L145

And then you should create a new physical node like FullSchemaMergeInsertExec that uses apply_deletions:

https://github.com/lancedb/lance/blob/89c266f5ab442c778d37a82b01a09a6f912ce812/rust/lance/src/dataset/write/merge_insert/exec/write.rs#L479-L482

wjones127 · 2025-10-14T17:28:30Z

> Maybe there's a better approach with parameterized tests, etc.?

There's probably a better parametrized test. I'm working on adding a larger test suite, which can cover more write cases.

python/python/lance/dataset.py

rust/lance/src/dataset/write/merge_insert.rs

jackye1995

mostly looks good to me, some minor comments

jtuglu1 · 2025-12-30T20:19:12Z

@wjones127 @jackye1995 thoughts here?

jackye1995 · 2025-12-31T23:24:53Z

@jtuglu1 I thought about this a bit more and also saw Will's old comment (sorry, I totally missed that previously). I think what Will suggested makes sense. I pushed a commit to add DeleteOnlyMergeInsertExec with some additional refactoring. Let me know if this looks good to you.

jtuglu1 · 2025-12-31T23:27:46Z

Yeah, I considered this (and Will's comment), but I didn't really like the idea of adding a physical plan for every single enum variant (.*MergeInsertExec). It seemed easier (to me) to handle everything in the same code path (the projection optimizations, etc.). I haven't really looked through your changes here yet, but are there any performance benefits compared to the current implementation?

Edit: looks like this is skipping write step which is I guess some time saved.

jackye1995 · 2025-12-31T23:57:21Z

> Edit: looks like this is skipping write step which is I guess some time saved.

yes that's the main goal

github-actions bot added the enhancement and java labels Oct 13, 2025

jtuglu1 force-pushed the add-when-match-delete-functionality branch 3 times, most recently from db7cd37 to 925f954 October 14, 2025 02:03

github-actions bot added the python label Oct 14, 2025

jtuglu1 force-pushed the add-when-match-delete-functionality branch 3 times, most recently from 4898546 to b793aba October 14, 2025 03:42

jtuglu1 marked this pull request as ready for review October 14, 2025 04:35

wjones127 self-assigned this Oct 14, 2025

wjones127 reviewed Oct 14, 2025


jtuglu1 force-pushed the add-when-match-delete-functionality branch 3 times, most recently from d1b0c26 to b17144d December 29, 2025 22:55

Refactor

13fac77

jtuglu1 requested a review from wjones127 December 30, 2025 03:55