Skip to content

Conversation

@jtuglu1
Copy link
Contributor

@jtuglu1 jtuglu1 commented Oct 13, 2025

Closes #2271

@github-actions github-actions bot added enhancement New feature or request java labels Oct 13, 2025
@jtuglu1 jtuglu1 force-pushed the add-when-match-delete-functionality branch 3 times, most recently from db7cd37 to 925f954 Compare October 14, 2025 02:03
@codecov-commenter
Copy link

codecov-commenter commented Oct 14, 2025

Codecov Report

❌ Patch coverage is 95.83333% with 1 line in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
...nce/src/dataset/write/merge_insert/logical_plan.rs 87.50% 1 Missing ⚠️

📢 Thoughts on this report? Let us know!

@jtuglu1 jtuglu1 force-pushed the add-when-match-delete-functionality branch 3 times, most recently from 4898546 to b793aba Compare October 14, 2025 03:42
@jtuglu1 jtuglu1 marked this pull request as ready for review October 14, 2025 04:35
@wjones127 wjones127 self-assigned this Oct 14, 2025
@jtuglu1
Copy link
Contributor Author

jtuglu1 commented Oct 14, 2025

@wjones127 I plan to push some changes to bump the coverage a bit. Lack of coverage is from explain/analyze plan methods. Something I think would be a good idea to future-proof this file is starting to standardize how these various combinations of configs should be unit tested together. We can always attempt to test all relevant combinations with targeted tests, but not sure how scalable that is. Maybe there's a better approach with parameterized tests, etc.?

Comment on lines 4919 to 4921
#[rstest::rstest]
#[tokio::test]
async fn test_when_matched_delete_no_matches(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One case you are missing that is likely to be important: Doing WhenMatched::Delete with just a list of ids (omitting all other columns). Could you add that?

I think when you do that, you'll find we are missing some implementation on one of the code paths.

Copy link
Contributor Author

@jtuglu1 jtuglu1 Oct 14, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, @wjones127 non-full schema matching merges will break with this. From what I can see, there's 3 codepaths (fast path conditions, full_schema=true, full_schema=false). For my own understanding, I'm wondering why these codepaths need to be different? Is there a way we can condense these somehow? At the very least it might be good to document this for future changes (I can do this).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. We are in the middle of a refactor to consolidate these code paths. That refactor is tracked in this milestone: https://github.com/lancedb/lance/milestone/7

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the end all of them should go through the fast path. If you want to just say that non-full schema isn't supported for now, that's okay.

Copy link
Contributor Author

@jtuglu1 jtuglu1 Oct 14, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Gotcha. What do you recommend? I think the primary use-case of this feature might be for non-full schema batch deletes (like the example you gave above), but I don't want to create more work for you folks if the impl will be converted anyways to use datafusion.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

WhenMatched::Delete will probably only be used in cases where the only resulting action is Delete. In which case, the current fast path physical plan probably won't be used in the end anyways. I think it would be worth creating a specific fast code path for it, which generates a specialized physical plan for bulk deletion. The first part of the plan can be the same, but the write node at the end should be specialized for deletion.

This is because in the delete case, the final write step doesn't actually require reading all the columns even of the source data. They should be projected away before the join, since all you need to do procedurally is:

  1. Join the id columns to find the matching row ids
  2. Apply the row id deletions

So I would look into creating a specific physical write node that handles that step (2).

In particular, the necessary_children_exprs() implementation can just say it needs _rowaddr:

https://github.com/lancedb/lance/blob/89c266f5ab442c778d37a82b01a09a6f912ce812/rust/lance/src/dataset/write/merge_insert/logical_plan.rs#L145

And then you should create a new physical node like FullSchemaMergeInsertExec that uses apply_deletions:

https://github.com/lancedb/lance/blob/89c266f5ab442c778d37a82b01a09a6f912ce812/rust/lance/src/dataset/write/merge_insert/exec/write.rs#L479-L482

@wjones127
Copy link
Contributor

Maybe there's a better approach with parameterized tests, etc.?

There's probably a better parametrized test. I'm working on adding a larger test suite, which can cover more write cases.

@jtuglu1 jtuglu1 force-pushed the add-when-match-delete-functionality branch 3 times, most recently from d1b0c26 to b17144d Compare December 29, 2025 22:55
@jtuglu1 jtuglu1 requested a review from wjones127 December 30, 2025 03:55
Copy link
Contributor

@jackye1995 jackye1995 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mostly looks good to me, some minor comments

@jtuglu1 jtuglu1 force-pushed the add-when-match-delete-functionality branch from b5a6597 to 8f63701 Compare December 30, 2025 07:04
@jtuglu1 jtuglu1 force-pushed the add-when-match-delete-functionality branch from 8f63701 to d647e6a Compare December 30, 2025 07:05
@jtuglu1 jtuglu1 requested a review from jackye1995 December 30, 2025 14:00
@jtuglu1
Copy link
Contributor Author

jtuglu1 commented Dec 30, 2025

@wjones127 @jackye1995 thoughts here?

@jackye1995
Copy link
Contributor

@jtuglu1 I thought about this a bit more, and also saw Will's old comment (sorry totally missed that previously) I think what Will suggested makes sense, I pushed a commit to add DeleteOnlyMergeInsertExec with some additional refactoring, let me know if this looks good to you

@jtuglu1
Copy link
Contributor Author

jtuglu1 commented Dec 31, 2025

Yeah I considered this (and Will's comment), but I didn't really like the idea of adding physical plans for every single type of enum (.*MergeInsertExec). It seemed easier (to me) to handle things all in the same codepath (the projection optimizations, etc.). I haven't really looked through your changes here yet, but are there any performance benefits as compared to the current implementation?

Edit: looks like this is skipping write step which is I guess some time saved.

@jackye1995 jackye1995 force-pushed the add-when-match-delete-functionality branch from 59fbe58 to d95d7ae Compare December 31, 2025 23:56
@jackye1995
Copy link
Contributor

Edit: looks like this is skipping write step which is I guess some time saved.

yes that's the main goal

@jackye1995 jackye1995 force-pushed the add-when-match-delete-functionality branch from d95d7ae to 03c9ff8 Compare January 1, 2026 00:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request java python

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add when_matched_delete to merge_insert

4 participants