docs/website/docs/hub/features/transformations/index.md

## Lifecycle of a SQL transformation

In this section, we focus on the lifecycle of transformations that yield a `Relation` object, which we call SQL transformations here. This is in contrast to Python-based transformations that yield dataframes or arrow tables, which go through the regular extract, normalize, and load lifecycle of a `dlt` resource.
### Extract
In the extract stage, a `Relation` yielded by a transformation is converted into a SQL string and saved as a `.model` file along with its source SQL dialect.
At this stage, the SQL string is just the user's original query — either the string that was explicitly provided or the one generated by `Relation.to_sql()`. No `dlt`-specific columns like `_dlt_id` or `_dlt_load_id` are added yet.
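
As a rough illustration of what the extract stage persists, the sketch below models a `.model` file as nothing more than the raw query text plus its dialect. The `ModelFile` class is a made-up stand-in, not dlt's internal representation:

```python
from dataclasses import dataclass

@dataclass
class ModelFile:
    """Hypothetical stand-in for the contents of a .model file."""
    query: str    # the user's original SELECT, unmodified
    dialect: str  # the source SQL dialect the query was written in

# At extract time the saved query contains no dlt columns yet.
model = ModelFile(query="SELECT id, value FROM table", dialect="duckdb")
assert "_dlt_id" not in model.query and "_dlt_load_id" not in model.query
```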
### Normalize

In the normalize stage, `.model` files are read and processed. The normalization process modifies your SQL queries to ensure they execute correctly and integrate with `dlt`'s features.

:::info
The normalization described here applies only to SQL-based transformations. Python-based transformations, such as those using dataframes or arrow tables, follow the [regular normalization process](../../../reference/explainers/how-dlt-works.md#normalize).
:::

#### Adding `dlt` columns

During normalization, `dlt` adds internal `dlt` columns to your SQL queries depending on the configuration:

- `_dlt_load_id`, which tracks which load operation created or modified each row, is **added by default**. Even if present in your query, the `_dlt_load_id` column will be **replaced with a constant value** corresponding to the current load ID. To disable this behavior, set:
```toml
# ...
```

- In **Redshift**, `_dlt_id` is generated using an `MD5` hash of the load ID and row number.
- In **SQLite**, `_dlt_id` is simulated using `lower(hex(randomblob(16)))`.
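
The SQLite expression mentioned above can be checked directly with Python's built-in `sqlite3` module. This is a quick sketch of the expression itself, not dlt code:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# On SQLite, _dlt_id is simulated with lower(hex(randomblob(16))):
row_id = conn.execute("SELECT lower(hex(randomblob(16)))").fetchone()[0]
conn.close()

# 16 random bytes hex-encode to 32 lowercase hex characters.
assert len(row_id) == 32
assert set(row_id) <= set("0123456789abcdef")
```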
#### Query transformations

The normalization process also applies the following transformations to ensure your queries work correctly:

1. Fully qualifies all identifiers with database and dataset prefixes
2. Quotes and adjusts identifier casing to match destination requirements
3. Normalizes column names according to the selected naming convention
4. Aliases columns and tables to handle naming convention differences
5. Reorders columns to match the destination table schema
6. Fills in `NULL` values for columns that exist in the destination but aren't in your query
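
Steps 4–6 can be sketched in plain Python. This is only an illustration of the idea — dlt's actual normalizer works on parsed SQL, and `build_select_list` is a made-up helper:

```python
def build_select_list(query_columns: list[str], destination_columns: list[str]) -> str:
    """Alias, reorder, and NULL-fill a select list against the destination schema."""
    items = []
    for dest_col in destination_columns:  # destination column order wins (step 5)
        if dest_col in query_columns:
            items.append(f'"{dest_col}" AS "{dest_col}"')  # alias to the destination name (step 4)
        else:
            items.append(f'NULL AS "{dest_col}"')          # fill columns missing from the query (step 6)
    return ", ".join(items)

select_list = build_select_list(["value", "id"], ["id", "value", "comment"])
# → "id" AS "id", "value" AS "value", NULL AS "comment"
```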
### Load
In the load stage, the normalized queries from `.model` files are wrapped in INSERT statements and executed on the destination.
For example, given this query from the extract stage:
```sql
SELECT id, value FROM table
```

The normalize stage processes it (adding `dlt` columns, wrapping it in a subquery, etc.), resulting in:
```sql
SELECT
  _dlt_subquery."id" AS "id",
  _dlt_subquery."value" AS "value",
  '1749134128.17655' AS "_dlt_load_id",
  UUID() AS "_dlt_id"
FROM (
  SELECT
    "my_table"."id" AS "id",
    "my_table"."value" AS "value"
  FROM "my_pipeline_dataset"."my_table" AS "my_table"
)
AS _dlt_subquery
```
The load stage executes:
```sql
INSERT INTO
  -- ...
AS _dlt_subquery
```
The query is executed via the destination's SQL client, materializing the transformation result directly in the database.
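
The end-to-end effect of the load stage can be reproduced with Python's built-in `sqlite3` module. This is a self-contained sketch: the table names, schema, and load ID are invented for illustration, and `lower(hex(randomblob(16)))` stands in for `UUID()` on SQLite:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE my_table (id INTEGER, value TEXT)")
conn.executemany("INSERT INTO my_table VALUES (?, ?)", [(1, "a"), (2, "b")])
conn.execute(
    "CREATE TABLE my_transformed (id INTEGER, value TEXT, _dlt_load_id TEXT, _dlt_id TEXT)"
)

# The normalized SELECT is wrapped in an INSERT and executed on the destination.
conn.execute("""
    INSERT INTO my_transformed
    SELECT
        _dlt_subquery."id" AS "id",
        _dlt_subquery."value" AS "value",
        '1749134128.17655' AS "_dlt_load_id",
        lower(hex(randomblob(16))) AS "_dlt_id"
    FROM (SELECT "id", "value" FROM my_table) AS _dlt_subquery
""")

rows = conn.execute(
    "SELECT id, value, _dlt_load_id FROM my_transformed ORDER BY id"
).fetchall()
assert rows == [(1, "a", "1749134128.17655"), (2, "b", "1749134128.17655")]
conn.close()
```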