Skip to content

bug: missing column with SqlModel #3345

@zilto

Description

@zilto

Summary

Running a dlt.hub.transformation that returns a dlt.Relation where the first row contains zeros incorrectly treats the columns as missing.

Repro

# repro.py
import dlt
from dlt.sources.rest_api import rest_api_resources
import dlthub.data_quality as dq

@dlt.source
def jaffleshop():
    jaffle_rest_resources: list = rest_api_resources({
        "client": {
            "base_url": "https://jaffle-shop.dlthub.com/api/v1",
            "paginator": {"type": "header_link"},
        },
        "resources": [  # individual resources
            "customers",
            "products",
            "orders",
        ],
        # set the time range for all resources
        "resource_defaults": {
            "endpoint": {
                "params": {
                    "start_date": "2017-01-01",
                    "end_date": "2017-01-15",
                },
            },
        },
    })
    
    return jaffle_rest_resources


@dlt.hub.transformation
def jaffle_checks(dataset: dlt.Dataset) -> dlt.Relation:
    checks = {
        "orders": [dq.checks.is_unique("id"), dq.checks.case("subtotal > 0")]
    }
    return dq.prepare_checks(dataset, checks=checks)


pipeline = dlt.pipeline("jaffle_dq", destination="duckdb")


if __name__ == "__main__":
    pipeline.run([jaffleshop()])

Output of running the code

❯ uv run python repro.py
DATE|[WARNING]|356876|137503479621440|dlt|validate.py|verify_normalized_table:100|In schema `jaffleshop`: The following columns in table `jaffle_checks` did not receive any data during this load:
  - success_count
  - success_rate

This can happen if you specify columns manually, for example, using the `merge_key`, `primary_key` or `columns` argument but they do not exist in the data.

If I change the if __name__ == "__main__": clause, I get a different result from what is materialized

if __name__ == "__main__":
    query = dq.prepare_checks(
        pipeline.dataset(),
        checks={
            "orders": [dq.checks.is_unique("id"), dq.checks.case("subtotal > 0")],
        },
    )
    print(query.arrow())

Output

❯ uv run python option1.py
pyarrow.Table
table_name: string
check_qualified_name: string
row_count: int64
success_count: int32
success_rate: double
----
table_name: [["orders","orders"]]
check_qualified_name: [["id__is_unique","subtotal__case__GT"]]
row_count: [[1570,1570]]
success_count: [[1570,1557]]
success_rate: [[1,0.9917197452229299]]

Suggested solution

I'm unsure about the origin of the bug. I wonder if we can simplify the lifecycle of dlt.Relation and SqlModel

Metadata

Metadata

Assignees

Labels

bugSomething isn't working

Type

No type

Projects

Status

Done

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions