`docs/concepts/models/model_kinds.md` — 64 additions, 4 deletions
Depending on the target engine, models of the `INCREMENTAL_BY_TIME_RANGE` kind are materialized using the following strategies:

| Engine | Strategy |
| --- | --- |
| Postgres | DELETE by time range, then INSERT |
| DuckDB | DELETE by time range, then INSERT |
## INCREMENTAL_BY_PARTITION
Models of the `INCREMENTAL_BY_PARTITION` kind are computed incrementally based on partition. A set of columns defines the model's partitioning key, and a partition is the group of rows with the same partitioning key value.
This model kind is designed for the scenario where data rows should be loaded and updated as a group based on their shared value for the partitioning key. This kind may be used with any SQL engine; SQLMesh will automatically create partitioned tables on engines that support explicit table partitioning (e.g., [BigQuery](https://cloud.google.com/bigquery/docs/creating-partitioned-tables), [Databricks](https://docs.databricks.com/en/sql/language-manual/sql-ref-partition.html)).
If a partitioning key in newly loaded data is not present in the model table, the new partitioning key and its data rows are inserted. If a partitioning key in newly loaded data is already present in the model table, **all the partitioning key's existing data rows in the model table are replaced** with the partitioning key's data rows in the newly loaded data. If a partitioning key is present in the model table but not present in the newly loaded data, the partitioning key's existing data rows are not modified and remain in the model table.
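These replace-by-partition semantics can be sketched in plain Python. This is a hypothetical illustration of the behavior described above (the helper name and tuple representation are invented for the example), not SQLMesh's actual implementation:

```python
# Illustrative sketch of INCREMENTAL_BY_PARTITION load semantics
# (hypothetical helper, not SQLMesh internals).
# Each row is a (partition_key, payload) tuple.

def load_by_partition(existing, new_rows):
    new_keys = {key for key, _ in new_rows}
    # Rows whose partition key is absent from the new batch are untouched...
    kept = [row for row in existing if row[0] not in new_keys]
    # ...while every partition present in the new batch is fully replaced.
    return kept + list(new_rows)

existing = [("east", 1), ("east", 2), ("west", 3)]
new_batch = [("east", 9), ("north", 4)]

# "east" is replaced wholesale, "west" is untouched, "north" is inserted.
print(load_by_partition(existing, new_batch))
# → [('west', 3), ('east', 9), ('north', 4)]
```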
This kind is a good fit for datasets that have the following traits:
* The dataset's records can be grouped by a partitioning key.
* Each record has a partitioning key associated with it.
* It is appropriate to upsert records, so existing records can be overwritten by new arrivals when their partitioning keys match.
* All existing records associated with a given partitioning key can be removed or overwritten when any new record has the partitioning key value.
The column defining the partitioning key is specified in the model's `MODEL` DDL `partitioned_by` key. This example shows the `MODEL` DDL for an `INCREMENTAL_BY_PARTITION` model whose partition key is the row's value for the `region` column:
```sql linenums="1" hl_lines="4"
MODEL (
  name db.events,
  kind INCREMENTAL_BY_PARTITION,
  partitioned_by region,
);
```
Compound partition keys are also supported, such as `region` and `department`:
```sql linenums="1" hl_lines="4"
MODEL (
  name db.events,
  kind INCREMENTAL_BY_PARTITION,
  partitioned_by (region, department),
);
```
Date and/or timestamp column expressions are also supported (varies by SQL engine). This BigQuery example's partition key is based on the month each row's `event_date` occurred:
```sql linenums="1" hl_lines="4"
MODEL (
  name db.events,
  kind INCREMENTAL_BY_PARTITION,
  partitioned_by DATETIME_TRUNC(event_date, MONTH)
);
```
**Note**: Partial data [restatement](../plans.md#restatement-plans) is not supported for this model kind, which means that the entire table will be recreated from scratch if restated. This may lead to data loss, so data restatement is disabled for models of this kind by default.
### Materialization strategy
Depending on the target engine, models of the `INCREMENTAL_BY_PARTITION` kind are materialized using the following strategies:

| Engine | Strategy |
| --- | --- |
| Databricks | REPLACE WHERE by partitioning key |
| Spark | INSERT OVERWRITE by partitioning key |
| Snowflake | DELETE by partitioning key, then INSERT |
| BigQuery | DELETE by partitioning key, then INSERT |
| Redshift | DELETE by partitioning key, then INSERT |
| Postgres | DELETE by partitioning key, then INSERT |
| DuckDB | DELETE by partitioning key, then INSERT |
## INCREMENTAL_BY_UNIQUE_KEY
Models of the `INCREMENTAL_BY_UNIQUE_KEY` kind are computed incrementally based on a key that is unique for each data row.
If a key in newly loaded data is not present in the model table, the new data row is inserted. If a key in newly loaded data is already present in the model table, the existing row is updated with the new data. If a key is present in the model table but not present in the newly loaded data, its row is not modified and remains in the model table.
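The upsert behavior described above can be sketched in plain Python. This is a hypothetical illustration (the helper name and dict representation are invented for the example), not SQLMesh's actual implementation:

```python
# Illustrative sketch of INCREMENTAL_BY_UNIQUE_KEY merge semantics
# (hypothetical helper, not SQLMesh internals).
# Tables are modeled as {unique_key: row} mappings.

def merge_by_unique_key(existing, new_rows):
    merged = dict(existing)
    # Keys only in the new data are inserted; keys already present are
    # updated; keys absent from the new data keep their existing rows.
    merged.update(new_rows)
    return merged

existing = {1: "old-a", 2: "old-b"}
new_rows = {2: "new-b", 3: "new-c"}

print(merge_by_unique_key(existing, new_rows))
# → {1: 'old-a', 2: 'new-b', 3: 'new-c'}
```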
This kind is a good fit for datasets that have the following traits:
* Each record has a unique key associated with it.
* There is at most one record associated with each unique key.
* It is appropriate to upsert records, so existing records can be overwritten by new arrivals when their keys match.
A [Slowly Changing Dimension](../glossary.md#slowly-changing-dimension-scd) (SCD) is one approach that fits this description well. See [SCD Type 2](#scd-type-2) for a model kind designed specifically for SCD Type 2 models.

SCD Type 2 is a model kind that supports [slowly changing dimensions](https://en.wikipedia.org/wiki/Slowly_changing_dimension).
SQLMesh achieves this by adding a `valid_from` and `valid_to` column to your model. The `valid_from` column is the timestamp that the record became valid (inclusive) and the `valid_to` column is the timestamp that the record became invalid (exclusive). The `valid_to` column is set to `NULL` for the latest record.
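The `valid_from` (inclusive) / `valid_to` (exclusive) convention can be illustrated with a small Python sketch. The data, record shape, and `as_of` helper are hypothetical examples, not part of SQLMesh:

```python
from datetime import datetime

# Hypothetical SCD Type 2 history for one record: valid_from is inclusive,
# valid_to is exclusive, and NULL (None) marks the current row.
history = [
    {"id": 1, "plan": "trial",
     "valid_from": datetime(2023, 1, 1), "valid_to": datetime(2023, 6, 1)},
    {"id": 1, "plan": "paid",
     "valid_from": datetime(2023, 6, 1), "valid_to": None},
]

def as_of(rows, record_id, ts):
    """Return the version of the record that was valid at timestamp ts."""
    for row in rows:
        if (row["id"] == record_id
                and row["valid_from"] <= ts
                and (row["valid_to"] is None or ts < row["valid_to"])):
            return row
    return None

print(as_of(history, 1, datetime(2023, 3, 1))["plan"])  # → trial
print(as_of(history, 1, datetime(2024, 1, 1))["plan"])  # → paid
```

Because `valid_to` is exclusive and `valid_from` is inclusive, a lookup exactly at a transition timestamp returns the newer row.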
Therefore, you can use these models not only to tell you what the latest value is for a given record, but also what the values were at any time in the past. Note that maintaining this history comes at a cost of increased storage and compute, so this kind may not be a good fit for sources that change frequently, since the history could get very large.
**Note**: Partial data [restatement](../plans.md#restatement-plans) is not supported for this model kind, which means that the entire table will be recreated from scratch if restated. This may lead to data loss, so data restatement is disabled for models of this kind by default.
There are two ways to track changes: By Time (Recommended) or By Column.

`docs/concepts/models/overview.md` — 1 addition, 1 deletion
Name is ***required*** and must be ***unique***.
- Storage format is a property for engines such as Spark or Hive that support storage formats such as `parquet` and `orc`.
### partitioned_by
- Partitioned by plays two roles. For most model kinds, it is an optional property for engines that support table partitioning such as Spark or BigQuery. For the [`INCREMENTAL_BY_PARTITION` model kind](./model_kinds.md#incremental_by_partition), it defines the partition key used to incrementally load data. It can specify a multi-column partition key or modify a date column for partitioning. For example, in BigQuery you could partition by day by extracting the day component of a timestamp column `event_ts` with `partitioned_by TIMESTAMP_TRUNC(event_ts, DAY)`.
### clustered_by
- Clustered by is an optional property for engines such as BigQuery that support clustering.