
Commit 95675ad
Improved lifecycle details
1 parent b6da7c0 commit 95675ad

File tree: 2 files changed, +39 -25 lines


docs/uv.lock

Lines changed: 2 additions & 2 deletions

docs/website/docs/hub/features/transformations/index.md

Lines changed: 37 additions & 23 deletions
@@ -157,20 +157,24 @@ Downstream of the transformation layer, we may want to know which columns origin
 
 ## Lifecycle of a SQL transformation
 
-Just like regular dlt resources, dlt transformations go through the three stages of extract, normalize, and load when a pipeline is run.
+In this section, we focus on the lifecycle of transformations that yield a `Relation` object, which we call SQL transformations here. This is in contrast to Python-based transformations that yield dataframes or arrow tables, which go through the regular extract, normalize, and load lifecycle of a `dlt` resource.
 
 ### Extract
 
 In the extract stage, a `Relation` yielded by a transformation is converted into a SQL string and saved as a `.model` file along with its source SQL dialect.
-At this stage, the SQL string is just the user's original query — either the string that was explicitly provided or the one generated by `Relation.to_sql()`. No dlt-specific columns like `_dlt_id` or `_dlt_load_id` are added yet.
+At this stage, the SQL string is just the user's original query — either the string that was explicitly provided or the one generated by `Relation.to_sql()`. No `dlt`-specific columns like `_dlt_id` or `_dlt_load_id` are added yet.
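As a rough sketch of what the extract stage produces, the following stores a query together with its source dialect in a `.model` file. The file layout and the helper name are illustrative assumptions, not dlt's actual implementation:

```python
import tempfile
from pathlib import Path


def save_model_file(name: str, sql: str, dialect: str, folder: str) -> Path:
    """Persist a transformation's SQL plus its source dialect as a .model file.

    Illustrative only: the real .model layout is a dlt internal detail.
    """
    path = Path(folder) / f"{name}.model"
    # The user's query is stored untouched: no _dlt_id or _dlt_load_id yet.
    path.write_text(f"-- dialect: {dialect}\n{sql}\n")
    return path


# Example: the raw query exactly as Relation.to_sql() might have produced it
model_path = save_model_file(
    "my_transformation",
    "SELECT id, value FROM table",
    dialect="duckdb",
    folder=tempfile.gettempdir(),
)
print(model_path.read_text())
```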
 
 ### Normalize
 
-In the normalize stage, `.model` files are read and processed. This is where the main transformation logic happens.
+In the normalize stage, `.model` files are read and processed. The normalization process modifies your SQL queries to ensure they execute correctly and integrate with `dlt`'s features.
 
-#### `dlt` columns
+:::info
+The normalization described here applies only to SQL-based transformations. Python-based transformations, such as those using dataframes or arrow tables, follow the [regular normalization process](../../../reference/explainers/how-dlt-works.md#normalize).
+:::
+
+#### Adding `dlt` columns
 
-During normalization, `dlt` will add internal dlt columns to your SQL queries depending on the configuration:
+During normalization, `dlt` adds internal `dlt` columns to your SQL queries depending on the configuration:
 
 - `_dlt_load_id`, which tracks which load operation created or modified each row, is **added by default**. Even if present in your query, the `_dlt_load_id` column will be **replaced with a constant value** corresponding to the current load ID. To disable this behavior, set:
   ```toml
@@ -190,35 +194,45 @@ During normalization, `dlt` will add internal dlt columns to your SQL queries de
 - In **Redshift**, `_dlt_id` is generated using an `MD5` hash of the load ID and row number.
 - In **SQLite**, `_dlt_id` is simulated using `lower(hex(randomblob(16)))`.
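As a rough illustration of these per-destination strategies (the exact concatenation format for the Redshift-style hash is an assumption; the SQLite expression is evaluated directly), this sketch reproduces both:

```python
import hashlib
import sqlite3


def redshift_style_dlt_id(load_id: str, row_number: int) -> str:
    # Redshift approach: MD5 hash of the load ID and row number.
    # The separator and exact input format are assumptions for illustration.
    return hashlib.md5(f"{load_id}:{row_number}".encode()).hexdigest()


def sqlite_style_dlt_id(conn: sqlite3.Connection) -> str:
    # SQLite approach: evaluate the expression the docs mention.
    return conn.execute("SELECT lower(hex(randomblob(16)))").fetchone()[0]


conn = sqlite3.connect(":memory:")
# Deterministic per (load_id, row_number):
print(redshift_style_dlt_id("1749134128.17655", 1))
# Random 32-character lowercase hex string per row:
print(sqlite_style_dlt_id(conn))
```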
 
-Additionally, column names are normalized according to the naming schema selected and the identifier capabilities of the destinations. This ensures compatibility and consistent naming conventions across different data sources and destination systems.
-
-This allows `dlt` to maintain data lineage and enables features like incremental loading and merging, even when working with raw SQL queries.
-
-:::info
-The normalization described here, including automatic injection or replacement of dlt columns, applies only to SQL-based transformations. Python-based transformations, such as those using dataframes or arrow tables, follow the [regular normalization process](../../../reference/explainers/how-dlt-works.md#normalize).
-:::
 
-#### Query Processing
+#### Query transformations
 
-Additionally, the normalization process in `dlt` takes care of several important steps to ensure your queries are executed smoothly and correctly on the input dataset:
+The normalization process also applies the following transformations to ensure your queries work correctly:
 
-1. Adds special dlt columns (see above for details).
-2. Fully qualifies all identifiers by adding database and dataset prefixes, so tables are always referenced unambiguously during query execution.
-3. Properly quotes and, if necessary, adjusts the case of your identifiers to match the destination's requirements.
-4. Handles differences in naming conventions by aliasing columns and tables as needed, so names always match those in the destination.
-5. Reorders columns to match the expected order in the destination table.
-6. Fills in default `NULL` values for any columns that exist in the destination table but are not selected in your query.
+1. Fully qualifies all identifiers with database and dataset prefixes
+2. Quotes and adjusts identifier casing to match destination requirements
+3. Normalizes column names according to the selected naming convention
+4. Aliases columns and tables to handle naming convention differences
+5. Reorders columns to match the destination table schema
+6. Fills in `NULL` values for columns that exist in the destination but aren't in your query
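The steps above can be sketched as a toy rewriter. Real dlt parses and rewrites the SQL properly; this illustration assumes the destination column order is known and uses simple double-quote quoting:

```python
def normalize_query(table: str, dataset: str, query_columns: list[str],
                    destination_columns: list[str]) -> str:
    """Toy version of the normalize-stage rewriting: qualify the table,
    quote identifiers, reorder to the destination schema, and fill NULLs
    for destination columns the query does not select."""
    select_items = []
    for col in destination_columns:           # reorder to destination schema
        if col in query_columns:
            # quote and alias identifiers so names match the destination
            select_items.append(f'"{table}"."{col}" AS "{col}"')
        else:
            # fill NULL for destination columns missing from the query
            select_items.append(f'NULL AS "{col}"')
    # fully qualify the table reference with the dataset prefix
    return (f'SELECT {", ".join(select_items)} '
            f'FROM "{dataset}"."{table}" AS "{table}"')


print(normalize_query("my_table", "my_pipeline_dataset",
                      ["id", "value"], ["id", "value", "comment"]))
```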
 
 ### Load
 
-In the load stage, the normalized SELECT queries from `.model` files are wrapped in INSERT statements and executed on the destination.
+In the load stage, the normalized queries from `.model` files are wrapped in INSERT statements and executed on the destination.
 For example, given this query from the extract stage:
 
 ```sql
 SELECT id, value FROM table
 ```
 
-After the normalize stage processes it (adding dlt columns, wrapping in subquery, etc.), the load stage executes:
+The normalize stage processes it (adding dlt columns, wrapping it in a subquery, etc.), resulting in:
+
+```sql
+SELECT
+  _dlt_subquery."id" AS "id",
+  _dlt_subquery."value" AS "value",
+  '1749134128.17655' AS "_dlt_load_id",
+  UUID() AS "_dlt_id"
+FROM (
+  SELECT
+    "my_table"."id" AS "id",
+    "my_table"."value" AS "value"
+  FROM "my_pipeline_dataset"."my_table" AS "my_table"
+)
+AS _dlt_subquery
+```
+
+The load stage executes:
 
 ```sql
 INSERT INTO
@@ -237,7 +251,7 @@ FROM (
 AS _dlt_subquery
 ```
 
-The SELECT portion is what was produced during the normalize stage. In the load stage, this query is executed via the destination's SQL client, materializing the transformation result directly in the database.
+The query is executed via the destination's SQL client, materializing the transformation result directly in the database.
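To make the load step concrete, here is a minimal self-contained sketch using sqlite3. The table names and load ID come from the example above; since SQLite has no `UUID()` function and does not support dataset-qualified names without attaching a database, this sketch uses the `lower(hex(randomblob(16)))` simulation and an unqualified table name:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE "my_table" ("id" INTEGER, "value" TEXT)')
conn.executemany('INSERT INTO "my_table" VALUES (?, ?)', [(1, "a"), (2, "b")])

# Destination table includes the dlt columns added during normalization.
conn.execute('CREATE TABLE "result" ("id" INTEGER, "value" TEXT, '
             '"_dlt_load_id" TEXT, "_dlt_id" TEXT)')

# The normalized SELECT wrapped in an INSERT, as the load stage does.
conn.execute("""
INSERT INTO "result"
SELECT
  _dlt_subquery."id" AS "id",
  _dlt_subquery."value" AS "value",
  '1749134128.17655' AS "_dlt_load_id",
  lower(hex(randomblob(16))) AS "_dlt_id"
FROM (
  SELECT "my_table"."id" AS "id", "my_table"."value" AS "value"
  FROM "my_table" AS "my_table"
) AS _dlt_subquery
""")

rows = conn.execute('SELECT "id", "value", "_dlt_load_id" FROM "result"').fetchall()
print(rows)
```

Every loaded row carries the same constant `_dlt_load_id` and a fresh random `_dlt_id`, which is what lets dlt trace each row back to the load that produced it.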
 
 ## Examples
 