QoL: Destructive schema sync after manual column dropping #2909
base: devel
Conversation
(force-pushed: c280794 → c37c422 → 598ca5b)
rudolfix left a comment:
This is a very good idea but we need to approach it in a more systematic way:
- (almost) all of our destinations have `def get_storage_table(self, table_name: str) -> Tuple[str, TTableSchemaColumns]:` and/or `def get_storage_tables(self, table_names: Iterable[str]) -> Iterable[Tuple[str, TTableSchemaColumns]]:` implemented. This reflects the storage to get the table schema out of it; you can use it to compare with the pipeline schema.
- let's formalize it: add a mixin class like `WithTableReflection`, in the same manner `WithStateSync` is done. `get_storage_tables` is the more general method, so you can add only this one to the mixin (a sketch follows this list)
- now add this mixin to all `JobClientBase` implementations for which you want to support our new schema sync
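A minimal sketch of such a mixin, assuming the signatures quoted above (the class name `WithTableReflection` is the reviewer's proposal, not an existing dlt class):

```py
from abc import ABC, abstractmethod
from typing import Iterable, Tuple

from dlt.common.schema.typing import TTableSchemaColumns


class WithTableReflection(ABC):
    """Marker mixin, analogous to WithStateSync, for job clients that can
    reflect table schemas from the destination storage."""

    @abstractmethod
    def get_storage_tables(
        self, table_names: Iterable[str]
    ) -> Iterable[Tuple[str, TTableSchemaColumns]]:
        """Yield (table_name, columns) pairs reflected from storage; this is
        the more general of the two methods, so it is the only one required."""
        ...
```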
When the above is done we are able to actually compute the schema diff.
Top-level interface:
- we have `sync_schema` that does a regular schema migration (adds missing columns and tables in the destination) - we need another method which is the reverse: it deletes columns and tables from the pipeline schema that are not present in the destination and then does the schema sync above (sketched after this comment)
- the method above should have a dry-run mode, where we do not change the pipeline schema and we do not sync it
- it should make sure that `destination_client()` implements `WithTableReflection` before continuing
- it should allow selecting the tables to be affected

when this is done we can think about extending the CLI, i.e. a `dlt pipeline <name> schema` command
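A sketch of what that top-level method could look like. The function name, the diff object, and the helpers `compute_schema_diff` / `apply_diff` are hypothetical; `destination_client()`, `sync_schema()`, and the `WithTableReflection` check come from the comment above:

```py
from typing import Iterable, Optional

import dlt


def sync_schema_from_destination(
    pipeline: dlt.Pipeline,
    table_names: Optional[Iterable[str]] = None,  # None means all tables
    dry_run: bool = False,
):
    """Reverse of sync_schema: drop columns and tables from the pipeline
    schema that no longer exist in the destination, then run a regular sync."""
    client = pipeline.destination_client()
    if not isinstance(client, WithTableReflection):
        raise NotImplementedError(
            f"{type(client).__name__} does not support table reflection"
        )
    # compute_schema_diff is hypothetical: compare the pipeline schema with
    # what client.get_storage_tables() reports for the selected tables
    diff = compute_schema_diff(pipeline.default_schema, client, table_names)
    if dry_run:
        return diff  # report only: schema untouched, nothing synced
    apply_diff(pipeline.default_schema, diff)  # hypothetical helper
    pipeline.sync_schema()  # the existing forward migration
    return diff
```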
(force-pushed: c256d51 → d4aadb2 → ab715ff → 17651e2)
rudolfix left a comment:
some changes needed
(force-pushed: 354680c → 85448dc; b1c3f3f → d2ad80e)
rudolfix left a comment:
this is really good! you need to resolve conflicts and maybe simplify the code that deals with nested tables.
maybe we could document this? not sure where in the documentation it should go
(force-pushed: d2ad80e → b340618)
Deploying with

| Status | Name | Latest Commit | Preview URL | Updated (UTC) |
|---|---|---|---|---|
| ✅ Deployment successful! | docs | fee497d | Commit Preview URL / Branch Preview URL | Nov 25 2025, 12:41 PM |
(force-pushed: 739025e → 2f85143)
rudolfix left a comment:
this is really good and is doing a lot of refactoring which we need. but we need even more :)
also, when this is finished we need two new CLI commands that will use it:
- `dlt pipeline fruitshop schema [foo] sync-to-destination` (attach to the pipeline, use the old sync method)
- `dlt pipeline fruitshop schema [foo] sync-from-destination` (the reverse operation, with an optional sync back to the destination)
- `dlt pipeline fruitshop schema [foo]` should still show the schema (you can add "show" as well)

heh, a lot of work... sorry for that
is a dry run possible for the destructive sync?
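On the dry-run question: with an interface like the earlier sketch, the destructive sync could preview the diff before applying anything (the function and the `"fruits"` table are the hypothetical names from above):

```py
# Preview what the destructive sync would remove, without touching the schema
diff = sync_schema_from_destination(pipeline, dry_run=True)
print(diff)

# Apply it for selected tables only, once the diff looks right
sync_schema_from_destination(pipeline, table_names=["fruits"])
```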
(force-pushed: 0504401 → 9471bc4; 934764a → cbd5a93 → fee497d)
Description
This PR adds a new pipeline function that syncs the dlt schema with the destination (not vice versa) by removing a column from the schema if that column has been manually deleted in the destination.
Related PRs:
#2754
Further:
This should be extended to table drop syncs as well.
Note:
This essentially solves the problem where the user manually drops things in the destination and the dlt pipeline breaks.
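For illustration, the failure mode and the recovery could look like this (the `duckdb` destination, the sample data, and the function name are assumptions; the actual name of the new pipeline function may differ):

```py
import dlt

pipeline = dlt.pipeline("fruitshop", destination="duckdb")
pipeline.run(
    [{"id": 1, "name": "apple", "color": "red"}],
    table_name="fruits",
)

# The user manually drops the "color" column in the destination database.
# The dlt schema still expects it, so subsequent loads can break.

# Recovery with the new sync direction: remove the dropped column from the
# dlt schema so it matches the destination again (hypothetical name).
sync_schema_from_destination(pipeline, table_names=["fruits"])
```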