Commit fee780f

Feat: Allow some control of table naming at the physical layer
1 parent 2953942 commit fee780f

11 files changed

+505
-14
lines changed

docs/guides/configuration.md

Lines changed: 89 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -320,10 +320,14 @@ The cache directory is automatically created if it doesn't exist. You can clear
320320

321321
SQLMesh creates schemas, physical tables, and views in the data warehouse/engine. Learn more about why and how SQLMesh creates schema in the ["Why does SQLMesh create schemas?" FAQ](../faq/faq.md#schema-question).
322322

323-
The default SQLMesh behavior described in the FAQ is appropriate for most deployments, but you can override where SQLMesh creates physical tables and views with the `physical_schema_mapping`, `environment_suffix_target`, and `environment_catalog_mapping` configuration options. These options are in the [environments](../reference/configuration.md#environments) section of the configuration reference page.
323+
The default SQLMesh behavior described in the FAQ is appropriate for most deployments, but you can override *where* SQLMesh creates physical tables and views with the `physical_schema_mapping`, `environment_suffix_target`, and `environment_catalog_mapping` configuration options.
324+
325+
You can also override *what* the physical tables are called by using the `physical_table_naming_convention` option.
326+
327+
These options are in the [environments](../reference/configuration.md#environments) section of the configuration reference page.
324328

325329
#### Physical table schemas
326-
By default, SQLMesh creates physical tables for a model with a naming convention of `sqlmesh__[model schema]`.
330+
By default, SQLMesh creates physical schemas for a model with a naming convention of `sqlmesh__[model schema]`.
327331

328332
This can be overridden on a per-schema basis using the `physical_schema_mapping` option, which removes the `sqlmesh__` prefix and uses the [regex pattern](https://docs.python.org/3/library/re.html#regular-expression-syntax) you provide to map the schemas defined in your model to their corresponding physical schemas.
329333

@@ -436,6 +440,89 @@ Given the example of a model called `my_schema.users` with a default catalog of
436440
- Using `environment_suffix_target: catalog` only works on engines that support querying across different catalogs. If your engine does not support cross-catalog queries then you will need to use `environment_suffix_target: schema` or `environment_suffix_target: table` instead.
437441
- Automatic catalog creation is not supported on all engines even if they support cross-catalog queries. For engines where it is not supported, the catalogs must be managed externally from SQLMesh and exist prior to invoking SQLMesh.
438442

443+
#### Physical table naming convention
444+
445+
Out of the box, SQLMesh has the following defaults set:
446+
447+
- `environment_suffix_target: schema`
448+
- `physical_table_naming_convention: schema_and_table`
449+
450+
Given a catalog of `warehouse` and a model named `finance_mart.transaction_events_over_threshold`, this causes SQLMesh to create physical tables using the following convention:
451+
452+
```
453+
# <catalog>.sqlmesh__<schema>.<schema>__<table>__<fingerprint>
454+
455+
warehouse.sqlmesh__finance_mart.finance_mart__transaction_events_over_threshold__<fingerprint>
456+
```
457+
458+
This deliberately contains some redundancy with the *model* schema as it's repeated at the physical layer in both the physical schema name as well as the physical table name.
459+
460+
##### Table only
461+
462+
Some engines have object name length limitations which cause them to [silently truncate](https://www.postgresql.org/docs/current/sql-syntax-lexical.html#SQL-SYNTAX-IDENTIFIERS) table and view names that exceed this limit. This behaviour breaks SQLMesh, so we raise a runtime error if we detect the engine would silently truncate the name of the table we are trying to create.
463+
464+
This redundancy in the physical table names reduces the number of characters available for model names. To make more characters available to model names, set `physical_table_naming_convention` like so:
465+
466+
=== "YAML"
467+
468+
```yaml linenums="1"
469+
physical_table_naming_convention: table_only
470+
```
471+
472+
=== "Python"
473+
474+
```python linenums="1"
475+
from sqlmesh.core.config import Config, ModelDefaultsConfig, TableNamingConvention
476+
477+
config = Config(
478+
model_defaults=ModelDefaultsConfig(dialect=<dialect>),
479+
physical_table_naming_convention=TableNamingConvention.TABLE_ONLY,
480+
)
481+
```
482+
483+
This will cause SQLMesh to omit the model schema from the table name and generate physical names that look like (using the above example):
484+
```
485+
# <catalog>.sqlmesh__<schema>.<table>__<fingerprint>
486+
487+
warehouse.sqlmesh__finance_mart.transaction_events_over_threshold__<fingerprint>
488+
```
489+
490+
Notice that the model schema name is no longer part of the physical table name. This allows for slightly longer model names on engines with low identifier length limits, which may be useful for your project.
491+
492+
##### MD5 hash
493+
494+
If you *still* need more characters, you can set `physical_table_naming_convention: hash_md5` like so:
495+
496+
=== "YAML"
497+
498+
```yaml linenums="1"
499+
physical_table_naming_convention: hash_md5
500+
```
501+
502+
=== "Python"
503+
504+
```python linenums="1"
505+
from sqlmesh.core.config import Config, ModelDefaultsConfig, TableNamingConvention
506+
507+
config = Config(
508+
model_defaults=ModelDefaultsConfig(dialect=<dialect>),
509+
physical_table_naming_convention=TableNamingConvention.HASH_MD5,
510+
)
511+
```
512+
513+
This will cause SQLMesh to generate physical names that are always 45-50 characters in length and look something like:
514+
515+
```
516+
# sqlmesh_md5__<hash of what we would have generated using 'schema_and_table'>
517+
518+
sqlmesh_md5__d3b07384d113edec49eaa6238ad5ff00
519+
520+
# or, for a dev preview
521+
sqlmesh_md5__d3b07384d113edec49eaa6238ad5ff00__dev
522+
```
523+
524+
The downside is that it becomes much more difficult to determine which table corresponds to which model just by looking at the database with a SQL client. However, the table names now have a predictable length, so there are no longer any surprises with identifiers exceeding the max length at the physical layer.
525+
439526
#### Environment view catalogs
440527

441528
By default, SQLMesh creates an environment view in the same [catalog](../concepts/glossary.md#catalog) as the physical table the view points to. The physical table's catalog is determined by either the catalog specified in the model name or the default catalog defined in the connection.
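
For illustration only (not part of this commit), the length trade-off described in the "Physical table naming convention" section can be sketched with plain string arithmetic. The 63-character limit is PostgreSQL's default maximum identifier length; the `fingerprint` value is a made-up placeholder, so the numbers only indicate the relative savings between conventions.

```python
# PostgreSQL's default limit: NAMEDATALEN (64) minus one byte.
MAX_IDENTIFIER_LENGTH = 63

schema = "finance_mart"
table = "transaction_events_over_threshold"
fingerprint = "1234567890"  # placeholder; real fingerprints are generated by SQLMesh

candidates = {
    "schema_and_table": f"{schema}__{table}__{fingerprint}",
    "table_only": f"{table}__{fingerprint}",
    # fixed 13-character prefix followed by the 32 hex digits of an MD5 digest
    "hash_md5": "sqlmesh_md5__" + "0" * 32,
}

lengths = {convention: len(name) for convention, name in candidates.items()}

for convention, length in lengths.items():
    spare = MAX_IDENTIFIER_LENGTH - length
    print(f"{convention:<17} {length:>3} chars, {spare:>3} to spare")
```

With `hash_md5` the length no longer depends on the model name at all, which is where the fixed 45-50 character range quoted in the guide comes from (45 without a suffix, 50 with `__dev`).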

docs/reference/configuration.md

Lines changed: 3 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -36,8 +36,9 @@ Configuration options for SQLMesh environment creation and promotion.
3636
| `physical_schema_override` | (Deprecated) Use `physical_schema_mapping` instead. A mapping from model schema names to names of schemas in which physical tables for the corresponding models will be placed. | dict[string, string] | N |
3737
| `physical_schema_mapping` | A mapping from regular expressions to names of schemas in which physical tables for the corresponding models [will be placed](../guides/configuration.md#physical-table-schemas). (Default physical schema name: `sqlmesh__[model schema]`) | dict[string, string] | N |
3838
| `environment_suffix_target` | Whether SQLMesh views should append their environment name to the `schema` or `table` - [additional details](../guides/configuration.md#view-schema-override). (Default: `schema`) | string | N |
39-
| `gateway_managed_virtual_layer` | Whether SQLMesh views of the virtual layer will be created by the default gateway or model specified gateways - [additional details](../guides/multi_engine.md#gateway-managed-virtual-layer). (Default: False) | boolean | N |
40-
| `infer_python_dependencies` | Whether SQLMesh will statically analyze Python code to automatically infer Python package requirements. (Default: True) | boolean | N |
39+
| `physical_table_naming_convention` | Sets how physical table names are derived from the model name. Options are `schema_and_table`, `table_only`, or `hash_md5` - [additional details](../guides/configuration.md#physical-table-naming-convention). (Default: `schema_and_table`) | string | N |
40+
| `gateway_managed_virtual_layer` | Whether SQLMesh views of the virtual layer will be created by the default gateway or model specified gateways - [additional details](../guides/multi_engine.md#gateway-managed-virtual-layer). (Default: False) | boolean | N |
41+
| `infer_python_dependencies` | Whether SQLMesh will statically analyze Python code to automatically infer Python package requirements. (Default: True) | boolean | N |
4142
| `environment_catalog_mapping` | A mapping from regular expressions to catalog names. The catalog name is used to determine the target catalog for a given environment. | dict[string, string] | N |
4243
| `log_limit` | The default number of logs to keep (Default: `20`) | int | N |
4344

sqlmesh/core/config/__init__.py

Lines changed: 4 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -2,7 +2,10 @@
22
AutoCategorizationMode as AutoCategorizationMode,
33
CategorizerConfig as CategorizerConfig,
44
)
5-
from sqlmesh.core.config.common import EnvironmentSuffixTarget as EnvironmentSuffixTarget
5+
from sqlmesh.core.config.common import (
6+
EnvironmentSuffixTarget as EnvironmentSuffixTarget,
7+
TableNamingConvention as TableNamingConvention,
8+
)
69
from sqlmesh.core.config.connection import (
710
AthenaConnectionConfig as AthenaConnectionConfig,
811
BaseDuckDBConnectionConfig as BaseDuckDBConnectionConfig,

sqlmesh/core/config/common.py

Lines changed: 28 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -49,6 +49,34 @@ def __repr__(self) -> str:
4949
return str(self)
5050

5151

52+
class TableNamingConvention(str, Enum):
53+
# Causes table names at the physical layer to follow the convention:
54+
# <schema-name>__<table-name>__<fingerprint>
55+
SCHEMA_AND_TABLE = "schema_and_table"
56+
57+
# Causes table names at the physical layer to follow the convention:
58+
# <table-name>__<fingerprint>
59+
TABLE_ONLY = "table_only"
60+
61+
# Takes the table name that would be returned from SCHEMA_AND_TABLE and wraps it in md5()
62+
# to generate a hash and prefixes the hash with `sqlmesh_md5__`, for the following reasons:
63+
# - at a glance, you can still see it's managed by sqlmesh and that md5 was used to generate the hash
64+
# - unquoted identifiers that start with numbers can trip up DB engine parsers, so having a text prefix prevents this
65+
# This causes table names at the physical layer to follow the convention:
66+
# sqlmesh_md5__d3b07384d113edec49eaa6238ad5ff00
67+
HASH_MD5 = "hash_md5"
68+
69+
@classproperty
70+
def default(cls) -> TableNamingConvention:
71+
return TableNamingConvention.SCHEMA_AND_TABLE
72+
73+
def __str__(self) -> str:
74+
return self.name
75+
76+
def __repr__(self) -> str:
77+
return str(self)
78+
79+
5280
def _concurrent_tasks_validator(v: t.Any) -> int:
5381
if isinstance(v, str):
5482
v = int(v)
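
As an aside (not part of this commit), the following standalone sketch shows how a `str`-based enum with an overridden `__str__` behaves. It re-declares a simplified copy of `TableNamingConvention` without the `default` class property, since `classproperty` is a SQLMesh helper that is not reproduced here.

```python
from enum import Enum


class TableNamingConvention(str, Enum):
    # Simplified copy of the enum in this diff, for illustration only.
    SCHEMA_AND_TABLE = "schema_and_table"
    TABLE_ONLY = "table_only"
    HASH_MD5 = "hash_md5"

    def __str__(self) -> str:
        return self.name

    def __repr__(self) -> str:
        return str(self)


# A lowercase string such as the one used in YAML resolves to a member by value.
convention = TableNamingConvention("table_only")

print(convention is TableNamingConvention.TABLE_ONLY)
print(str(convention))    # the member name, because __str__ is overridden
print(convention.value)   # the lowercase value used in configuration
print(convention == "table_only")  # str mixin: compares equal to its value
```

Note that `str()` yields the upper-case member name while the configuration files use the lower-case value, so the two should not be used interchangeably when serializing.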

sqlmesh/core/config/root.py

Lines changed: 3 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -14,7 +14,7 @@
1414
from sqlmesh.cicd.config import CICDBotConfig
1515
from sqlmesh.core import constants as c
1616
from sqlmesh.core.console import get_console
17-
from sqlmesh.core.config import EnvironmentSuffixTarget
17+
from sqlmesh.core.config import EnvironmentSuffixTarget, TableNamingConvention
1818
from sqlmesh.core.config.base import BaseConfig, UpdateStrategy
1919
from sqlmesh.core.config.common import variables_validator, compile_regex_mapping
2020
from sqlmesh.core.config.connection import (
@@ -106,6 +106,7 @@ class Config(BaseConfig):
106106
model_defaults: Default values for model definitions.
107107
physical_schema_mapping: A mapping from regular expressions to names of schemas in which physical tables for corresponding models will be placed.
108108
environment_suffix_target: Indicates whether to append the environment name to the schema or table name.
109+
physical_table_naming_convention: Indicates how tables should be named at the physical layer.
109110
gateway_managed_virtual_layer: Whether the models' views in the virtual layer are created by the model-specific gateway rather than the default gateway.
110111
infer_python_dependencies: Whether to statically analyze Python code to automatically infer Python package requirements.
111112
environment_catalog_mapping: A mapping from regular expressions to catalog names. The catalog name is used to determine the target catalog for a given environment.
@@ -147,6 +148,7 @@ class Config(BaseConfig):
147148
environment_suffix_target: EnvironmentSuffixTarget = Field(
148149
default=EnvironmentSuffixTarget.default
149150
)
151+
physical_table_naming_convention: t.Optional[TableNamingConvention] = None
150152
gateway_managed_virtual_layer: bool = False
151153
infer_python_dependencies: bool = True
152154
environment_catalog_mapping: RegexKeyDict = {}

sqlmesh/core/context.py

Lines changed: 4 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -2904,9 +2904,11 @@ def _nodes_to_snapshots(self, nodes: t.Dict[str, Node]) -> t.Dict[str, Snapshot]
29042904
fingerprint_cache: t.Dict[str, SnapshotFingerprint] = {}
29052905

29062906
for node in nodes.values():
2907-
kwargs = {}
2907+
kwargs: t.Dict[str, t.Any] = {}
29082908
if node.project in self._projects:
2909-
kwargs["ttl"] = self.config_for_node(node).snapshot_ttl
2909+
config = self.config_for_node(node)
2910+
kwargs["ttl"] = config.snapshot_ttl
2911+
kwargs["table_naming_convention"] = config.physical_table_naming_convention
29102912

29112913
snapshot = Snapshot.from_node(
29122914
node,

sqlmesh/core/snapshot/definition.py

Lines changed: 42 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -13,6 +13,7 @@
1313
from sqlglot import exp
1414
from sqlglot.optimizer.normalize_identifiers import normalize_identifiers
1515

16+
from sqlmesh.core.config import TableNamingConvention
1617
from sqlmesh.core import constants as c
1718
from sqlmesh.core.audit import StandaloneAudit
1819
from sqlmesh.core.environment import EnvironmentSuffixTarget
@@ -44,7 +45,7 @@
4445
format_evaluated_code_exception,
4546
Executable,
4647
)
47-
from sqlmesh.utils.hashing import hash_data
48+
from sqlmesh.utils.hashing import hash_data, md5
4849
from sqlmesh.utils.pydantic import PydanticModel, field_validator
4950

5051
if t.TYPE_CHECKING:
@@ -333,6 +334,7 @@ class SnapshotInfoMixin(ModelKindMixin):
333334
# This can be removed from this model once Pydantic 1 support is dropped (must remain in `Snapshot` though)
334335
base_table_name_override: t.Optional[str]
335336
dev_table_suffix: str
337+
table_naming_convention: t.Optional[TableNamingConvention] = None
336338

337339
@cached_property
338340
def identifier(self) -> str:
@@ -451,6 +453,7 @@ def _table_name(self, version: str, is_deployable: bool) -> str:
451453
version,
452454
catalog=self.fully_qualified_table.catalog,
453455
suffix=self.dev_table_suffix if is_dev_table else None,
456+
naming_convention=self.table_naming_convention,
454457
)
455458

456459
@property
@@ -580,6 +583,7 @@ class Snapshot(PydanticModel, SnapshotInfoMixin):
580583
migrated: Whether or not this snapshot has been created as a result of migration.
581584
unrestorable: Whether or not this snapshot can be used to revert its model to a previous version.
582585
next_auto_restatement_ts: The timestamp which indicates when is the next time this snapshot should be restated.
586+
table_naming_convention: Convention to follow when generating the physical table name
583587
"""
584588

585589
name: str
@@ -605,6 +609,9 @@ class Snapshot(PydanticModel, SnapshotInfoMixin):
605609
base_table_name_override: t.Optional[str] = None
606610
next_auto_restatement_ts: t.Optional[int] = None
607611
dev_table_suffix: str = "dev"
612+
table_naming_convention_: t.Optional[TableNamingConvention] = Field(
613+
default=None, alias="table_naming_convention"
614+
)
608615

609616
@field_validator("ttl")
610617
@classmethod
@@ -656,6 +663,7 @@ def from_node(
656663
ttl: str = c.DEFAULT_SNAPSHOT_TTL,
657664
version: t.Optional[str] = None,
658665
cache: t.Optional[t.Dict[str, SnapshotFingerprint]] = None,
666+
table_naming_convention: t.Optional[TableNamingConvention] = None,
659667
) -> Snapshot:
660668
"""Creates a new snapshot for a node.
661669
@@ -666,6 +674,7 @@ def from_node(
666674
ttl: A TTL to determine how long orphaned (snapshots that are not promoted anywhere) should live.
667675
version: The version that a snapshot is associated with. Usually set during the planning phase.
668676
cache: Cache of node name to fingerprints.
677+
table_naming_convention: Convention to follow when generating the physical table name
669678
670679
Returns:
671680
The newly created snapshot.
@@ -697,6 +706,7 @@ def from_node(
697706
updated_ts=created_ts,
698707
ttl=ttl,
699708
version=version,
709+
table_naming_convention=table_naming_convention,
700710
)
701711

702712
def __eq__(self, other: t.Any) -> bool:
@@ -1206,6 +1216,7 @@ def table_info(self) -> SnapshotTableInfo:
12061216
custom_materialization=custom_materialization,
12071217
dev_table_suffix=self.dev_table_suffix,
12081218
model_gateway=self.model_gateway,
1219+
table_naming_convention=self.table_naming_convention, # type: ignore
12091220
)
12101221

12111222
@property
@@ -1568,14 +1579,41 @@ def table_name(
15681579
version: str,
15691580
catalog: t.Optional[str] = None,
15701581
suffix: t.Optional[str] = None,
1582+
naming_convention: t.Optional[TableNamingConvention] = None,
15711583
) -> str:
15721584
table = exp.to_table(name)
15731585

1574-
# bigquery projects usually have "-" in them which is illegal in the table name, so we aggressively prune
1575-
name = "__".join(sanitize_name(part.name) for part in table.parts)
1586+
naming_convention = naming_convention or TableNamingConvention.default
1587+
1588+
if naming_convention == TableNamingConvention.HASH_MD5:
1589+
# just take an MD5 hash of what we would have generated anyway using SCHEMA_AND_TABLE
1590+
value_to_hash = table_name(
1591+
physical_schema=physical_schema,
1592+
name=name,
1593+
version=version,
1594+
catalog=catalog,
1595+
suffix=suffix,
1596+
naming_convention=TableNamingConvention.SCHEMA_AND_TABLE,
1597+
)
1598+
full_name = f"{c.SQLMESH}_md5__{md5(value_to_hash)}"
1599+
else:
1600+
# note: Snapshot._table_name() already strips the catalog from the model name before calling this function
1601+
# Therefore, a model with 3-part naming like "foo.bar.baz" gets passed as (name="bar.baz", catalog="foo") to this function
1602+
# This is why there is no TableNamingConvention.CATALOG_AND_SCHEMA_AND_TABLE
1603+
table_parts = table.parts
1604+
parts_to_consider = 2 if naming_convention == TableNamingConvention.SCHEMA_AND_TABLE else 1
1605+
1606+
# in case the parsed table name has less parts than what the naming convention says we should be considering
1607+
parts_to_consider = min(len(table_parts), parts_to_consider)
1608+
1609+
# bigquery projects usually have "-" in them which is illegal in the table name, so we aggressively prune
1610+
name = "__".join(sanitize_name(part.name) for part in table_parts[-parts_to_consider:])
1611+
1612+
full_name = f"{name}__{version}"
1613+
15761614
suffix = f"__{suffix}" if suffix else ""
15771615

1578-
table.set("this", exp.to_identifier(f"{name}__{version}{suffix}"))
1616+
table.set("this", exp.to_identifier(f"{full_name}{suffix}"))
15791617
table.set("db", exp.to_identifier(physical_schema))
15801618
if not table.catalog and catalog:
15811619
table.set("catalog", exp.to_identifier(catalog))
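
To make the branching in `table_name` easier to follow, here is a minimal standalone sketch of the same idea using only the standard library. It is not the SQLMesh implementation: `Convention`, `sanitize`, and `physical_table_name` are hypothetical stand-ins, the model name is split on `.` instead of being parsed with sqlglot, identifiers are not quoted, and `hashlib.md5` is assumed in place of SQLMesh's `md5` helper, so the hashes will not match what SQLMesh actually produces.

```python
import hashlib
import re
from enum import Enum
from typing import Optional


class Convention(str, Enum):
    # Hypothetical stand-in for TableNamingConvention.
    SCHEMA_AND_TABLE = "schema_and_table"
    TABLE_ONLY = "table_only"
    HASH_MD5 = "hash_md5"


def sanitize(part: str) -> str:
    # Hypothetical stand-in for sanitize_name: replace non-word characters.
    return re.sub(r"\W", "_", part)


def physical_table_name(
    physical_schema: str,
    name: str,
    version: str,
    catalog: Optional[str] = None,
    suffix: Optional[str] = None,
    convention: Optional[Convention] = None,
) -> str:
    # None falls back to the default convention, as in the diff.
    convention = convention or Convention.SCHEMA_AND_TABLE

    if convention is Convention.HASH_MD5:
        # Hash the name the default convention would have produced.
        to_hash = physical_table_name(
            physical_schema, name, version, catalog, suffix, Convention.SCHEMA_AND_TABLE
        )
        base = "sqlmesh_md5__" + hashlib.md5(to_hash.encode("utf-8")).hexdigest()
    else:
        parts = name.split(".")  # simplification: ignores quoted identifiers
        wanted = 2 if convention is Convention.SCHEMA_AND_TABLE else 1
        wanted = min(len(parts), wanted)  # name may have fewer parts than wanted
        base = "__".join(sanitize(p) for p in parts[-wanted:]) + f"__{version}"

    if suffix:
        base += f"__{suffix}"  # appended after the hash too, so it stays readable

    return ".".join(p for p in (catalog, physical_schema, base) if p)


for conv in Convention:
    print(
        conv.value,
        physical_table_name(
            "sqlmesh__finance_mart",
            "finance_mart.transaction_events_over_threshold",
            "123",
            catalog="warehouse",
            convention=conv,
        ),
    )
```

As in the diff, the suffix takes part in the hashed value and is also appended after the hash, which keeps dev tables recognisable even when the rest of the name is opaque.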
