-
Notifications
You must be signed in to change notification settings - Fork 400
Closed
Labels
bugSomething isn't workingSomething isn't working
Description
Summary
Running a dlt.hub.transformation that returns a dlt.Relation where the first row contains zeros incorrectly treats the columns as missing.
Repro
# repro.py
import dlt
from dlt.sources.rest_api import rest_api_resources
import dlthub.data_quality as dq
@dlt.source
def jaffleshop():
jaffle_rest_resources: list = rest_api_resources({
"client": {
"base_url": "https://jaffle-shop.dlthub.com/api/v1",
"paginator": {"type": "header_link"},
},
"resources": [ # individual resources
"customers",
"products",
"orders",
],
# set the time range for all resources
"resource_defaults": {
"endpoint": {
"params": {
"start_date": "2017-01-01",
"end_date": "2017-01-15",
},
},
},
})
return jaffle_rest_resources
@dlt.hub.transformation
def jaffle_checks(dataset: dlt.Dataset) -> dlt.Relation:
checks = {
"orders": [dq.checks.is_unique("id"), dq.checks.case("subtotal > 0")]
}
return dq.prepare_checks(dataset, checks=checks)
pipeline = dlt.pipeline("jaffle_dq", destination="duckdb")
if __name__ == "__main__":
pipeline.run([jaffleshop()])Output of running the code
❯ uv run python repro.py
DATE|[WARNING]|356876|137503479621440|dlt|validate.py|verify_normalized_table:100|In schema `jaffleshop`: The following columns in table `jaffle_checks` did not receive any data during this load:
- success_count
- success_rate
This can happen if you specify columns manually, for example, using the `merge_key`, `primary_key` or `columns` argument but they do not exist in the data.If I change the if __name__ == "__main__": clause, I get a different result from what is materialized
if __name__ == "__main__":
query = dq.prepare_checks(
pipeline.dataset(),
checks={
"orders": [dq.checks.is_unique("id"), dq.checks.case("subtotal > 0")],
},
)
print(query.arrow())Output
❯ uv run python option1.py
pyarrow.Table
table_name: string
check_qualified_name: string
row_count: int64
success_count: int32
success_rate: double
----
table_name: [["orders","orders"]]
check_qualified_name: [["id__is_unique","subtotal__case__GT"]]
row_count: [[1570,1570]]
success_count: [[1570,1557]]
success_rate: [[1,0.9917197452229299]]Suggested solution
I'm unsure about the origin of the bug. I wonder if we can simplify the lifecycle of dlt.Relation and SqlModel
Metadata
Metadata
Assignees
Labels
bugSomething isn't workingSomething isn't working
Type
Projects
Status
Done