feat: support when_matched_delete in merge_insert #4939
@wjones127 I plan to push some changes to bump the coverage a bit. The lack of coverage comes from the explain/analyze plan methods. To future-proof this file, I think it would be a good idea to start standardizing how these various combinations of configs are unit tested together. We can always try to test all relevant combinations with targeted tests, but I'm not sure how scalable that is. Maybe there's a better approach with parameterized tests, etc.?
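To make the "combinations of configs" idea concrete, here is a minimal std-only sketch of building a test matrix once and running a single test body over it. The enums and the `config_matrix` helper are hypothetical stand-ins for illustration, not Lance's actual types. Since the file already uses rstest, its `#[values(...)]` argument attribute could probably express the same cartesian product declaratively.

```rust
/// Hypothetical stand-in for the "when matched" behavior under test.
#[derive(Clone, Copy, Debug, PartialEq)]
enum WhenMatched {
    DoNothing,
    UpdateAll,
    Delete,
}

/// Hypothetical stand-in for whether the source carries every column or only the keys.
#[derive(Clone, Copy, Debug, PartialEq)]
enum SourceSchema {
    Full,
    KeysOnly,
}

/// Enumerate the full cartesian product of options so that one test body
/// can be run against every combination instead of hand-picking cases.
fn config_matrix() -> Vec<(WhenMatched, SourceSchema)> {
    let matched = [
        WhenMatched::DoNothing,
        WhenMatched::UpdateAll,
        WhenMatched::Delete,
    ];
    let schemas = [SourceSchema::Full, SourceSchema::KeysOnly];
    let mut cases = Vec::new();
    for m in matched {
        for s in schemas {
            cases.push((m, s));
        }
    }
    cases
}
```

A test would then loop over `config_matrix()` and assert the expected outcome per case, which keeps new options from silently missing coverage.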
#[rstest::rstest]
#[tokio::test]
async fn test_when_matched_delete_no_matches(
One case you are missing that is likely to be important: Doing WhenMatched::Delete with just a list of ids (omitting all other columns). Could you add that?
I think when you do that, you'll find we are missing some implementation on one of the code paths.
Yeah, @wjones127, non-full-schema merges will break with this. From what I can see, there are three code paths (fast path conditions, full_schema=true, full_schema=false). For my own understanding, why do these code paths need to be different? Is there a way we can condense them? At the very least, it might be good to document this for future changes (I can do this).
Yes. We are in the middle of a refactor to consolidate these code paths. That refactor is tracked in this milestone: https://github.com/lancedb/lance/milestone/7
In the end all of them should go through the fast path. If you want to just say that non-full schema isn't supported for now, that's okay.
Gotcha. What do you recommend? I think the primary use case of this feature might be non-full-schema batch deletes (like the example you gave above), but I don't want to create more work for you folks if the impl will be converted to use DataFusion anyway.
WhenMatched::Delete will probably only be used in cases where the only resulting action is Delete. In that case, the current fast path physical plan probably won't be used in the end anyway. I think it would be worth creating a specific fast code path for it, which generates a specialized physical plan for bulk deletion. The first part of the plan can be the same, but the write node at the end should be specialized for deletion.
This is because, in the delete case, the final write step doesn't need to read all the columns, even from the source data. They should be projected away before the join, since all you need to do procedurally is:
1. Join the id columns to find the matching row ids
2. Apply the row id deletions
So I would look into creating a specific physical write node that handles step (2).
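The two steps above can be sketched in plain Rust as follows. This is only an illustration under stated assumptions: `TargetRow`, `matching_row_addrs`, and `apply_row_deletions` are hypothetical names, the tables are in-memory vectors, and ids are assumed unique in the target. It is not Lance's execution node or its deletion API.

```rust
use std::collections::{BTreeSet, HashMap};

/// Hypothetical stand-in for a target-table row: the join key plus its row address.
struct TargetRow {
    id: u64,
    row_addr: u64,
}

/// Step 1: join on the id column only and collect the row addresses of matches.
/// The source carries nothing but keys, so no other source column is ever read.
/// Assumes ids are unique in the target.
fn matching_row_addrs(target: &[TargetRow], source_ids: &[u64]) -> BTreeSet<u64> {
    let by_id: HashMap<u64, u64> = target.iter().map(|r| (r.id, r.row_addr)).collect();
    source_ids
        .iter()
        .filter_map(|id| by_id.get(id).copied())
        .collect()
}

/// Step 2: apply the deletions by dropping every row whose address matched.
fn apply_row_deletions(target: Vec<TargetRow>, deleted: &BTreeSet<u64>) -> Vec<TargetRow> {
    target
        .into_iter()
        .filter(|r| !deleted.contains(&r.row_addr))
        .collect()
}
```

The point of the split is that step 2 only ever sees row addresses, which is why a delete-only write node would not need any of the other columns.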
In particular, the necessary_children_exprs() implementation can just say it needs _rowaddr:
And then you should create a new physical node like FullSchemaMergeInsertExec that uses apply_deletions:
There's probably a better parametrized test. I'm working on adding a larger test suite, which can cover more write cases.
jackye1995 left a comment
mostly looks good to me, some minor comments
@wjones127 @jackye1995 thoughts here?
@jtuglu1 I thought about this a bit more and also saw Will's old comment (sorry, I totally missed that previously). I think what Will suggested makes sense. I pushed a commit that adds DeleteOnlyMergeInsertExec with some additional refactoring; let me know if this looks good to you.
Yeah, I considered this (and Will's comment), but I didn't really like the idea of adding a physical plan for every single enum variant (.*MergeInsertExec). It seemed easier (to me) to handle everything in the same code path (the projection optimizations, etc.). I haven't really looked through your changes yet, but are there any performance benefits compared to the current implementation? Edit: it looks like this skips the write step, which I guess saves some time.
Yes, that's the main goal.
Closes #2271