Feat: prevent other processes seeing missing intervals during restatement#5285
Conversation
sqlmesh/core/plan/evaluator.py (Outdated)

```diff
 self.state_sync.remove_intervals(
-    snapshot_intervals=list(snapshot_intervals_to_restate),
+    snapshot_intervals=[(s.table_info, s.interval) for s in intervals_to_clear.values()],
```
I believe we should not nuke intervals for snapshots that have changed while the restatement was running. As per our discussion earlier:
There’s a chance that another snapshot sneaks into prod while the restatement was running but before we nuked the intervals. This snapshot’s intervals will be nuked, but now we have an intermittent period during which prod is in an incorrect state. The subsequent run will, of course, catch up the missing intervals. However, this is confusing to the user since their next run will now backfill models it wasn’t supposed to. To prevent this, the restatement plan will now check whether the snapshots for restated models changed in prod while this plan was running. If some snapshots were changed, the intervals for those won’t be removed and instead the plan will return an error stating that the restatement for these models wasn’t successful. The user will then have a choice whether to reissue restatement for these models or accept that these models haven’t been restated properly and move on.
Specifically:
If some snapshots were changed, the intervals for those won’t be removed
Ah, good point, I glossed over that when I was focusing on identifying the affected snapshots to begin with.
I've updated it to only clear intervals for snapshots whose names are not in the "deployed during restatement" list.
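The fix described above can be sketched as follows. Note that `SnapshotStub`, `intervals_safe_to_clear`, and the sample model names are hypothetical stand-ins for illustration, not SQLMesh's actual API:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SnapshotStub:
    """Hypothetical stand-in for a snapshot: just a name and a version."""

    name: str
    version: str


def intervals_safe_to_clear(
    intervals_to_clear: dict[str, SnapshotStub],
    deployed_during_restatement: set[str],
) -> dict[str, SnapshotStub]:
    """Keep only snapshots whose names were NOT redeployed to prod while the
    restatement plan was running; the rest are left untouched and reported
    back to the user as an error."""
    return {
        name: snapshot
        for name, snapshot in intervals_to_clear.items()
        if name not in deployed_during_restatement
    }


safe = intervals_safe_to_clear(
    {"db.a": SnapshotStub("db.a", "v1"), "db.b": SnapshotStub("db.b", "v2")},
    deployed_during_restatement={"db.b"},
)
print(sorted(safe))  # "db.b" changed during restatement, so only "db.a" survives
```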
sqlmesh/core/plan/evaluator.py (Outdated)

```python
    plan.environment.naming_info,
    self.default_catalog,
)
# note: the plan will automatically fail at the promotion stage with a ConflictingPlanError because the environment was changed by another plan
```
I see. Please note that this is not always going to be the case since we plan to support concurrent plan applications soon. So I'd suggest we fail explicitly and not rely on a failure downstream.
Additionally, we won't need log_models_updated_during_restatement since we can just put the entire message into the exception.
I think we still need log_models_updated_during_restatement to have usefully formatted console output about the affected models, rather than trying to jam all this info into an exception message.
However, I've moved the "what to do about it" part into the exception message and now raise an exception here instead of relying on the finalize stage throwing the error.
I've updated the screenshot in the PR description to give an indication of what it looks like.
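The outcome of this thread can be sketched as below: per-model details go to the console, while the exception carries only the actionable message. The exception class, function name, and message text here are hypothetical, not SQLMesh's actual code:

```python
class RestatementNotCompleteError(Exception):
    """Hypothetical exception type; the real error class in SQLMesh may differ."""


def check_restated_snapshots(updated_models: list[str]) -> None:
    if not updated_models:
        return
    # Detailed per-model information goes to the console (the role played by
    # log_models_updated_during_restatement), while the exception carries only
    # the actionable "what to do about it" message.
    for name in updated_models:
        print(f"Model '{name}' changed in prod while the restatement was running")
    raise RestatementNotCompleteError(
        "Some models were not restated because their snapshots changed in prod "
        "during the plan. Re-issue the restatement for these models, or accept "
        "that they were not restated and move on."
    )
```

Raising explicitly here, rather than relying on a downstream ConflictingPlanError, keeps the failure correct even once concurrent plan applications are supported.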
sqlmesh/core/plan/explainer.py (Outdated)

```python
# allows us to print the physical names of the tables affected in the console output
# note that we can't use the DeployabilityIndex on the plan because it only includes
# snapshots for the current environment, not across all environments
deployability_index = DeployabilityIndex.create(
```
We should only do this if the target environment is not prod.
Actually, I believe we should always use DeployabilityIndex.all_deployable() since "restatement" doesn't apply to "dev" tables. For "dev" tables it's always a preview.
RestatementStage isn't even produced if the target is not prod, because as of this PR it only refers to clearing intervals from any environment that isn't prod, which is only necessary if the target is prod.
I think I see what you're saying - a prod restatement would never care about dev previews because that data has no chance of ever being deployed. So there is no need to build a deployability index, and in the console output we should just reference the deployable version of the model's physical table, which we can do with DeployabilityIndex.all_deployable().
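The conclusion above - always reference the deployable table when explaining a prod restatement - can be illustrated with a small sketch. `PhysicalTables` and the table-name format are hypothetical, not SQLMesh's actual naming scheme:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PhysicalTables:
    """Hypothetical stand-in: the two physical tables a snapshot can have."""

    deployable: str  # the table prod reads from
    preview: str  # the dev-only preview table


def table_for_restatement_output(tables: PhysicalTables) -> str:
    # A prod restatement never touches dev previews (that data can never be
    # deployed), so, mirroring DeployabilityIndex.all_deployable() semantics,
    # the console output always shows the deployable table.
    return tables.deployable


tables = PhysicalTables("sqlmesh__db.db__a__1234", "sqlmesh__db.db__a__1234__temp")
print(table_for_restatement_output(tables))  # sqlmesh__db.db__a__1234
```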
```python
)


def model_display_name(
```
I added this because I wanted to output model display names based on SnapshotIdAndVersion, and the existing display_name function required an entire SnapshotTableInfo object just to read the node type and name.
If we already know it's a model, then all we need to produce a display name is the snapshot name string, so model_display_name reduces the requirement to just this.
SnapshotIdLike would work as well, to help narrow the inputs to still be related to Snapshots rather than any random str; let me know if this is preferred.
Should we just add display_name to SnapshotIdAndVersion for consistency? Then you no longer need to do the isinstance check.
I didn't do that initially because SnapshotIdAndVersion didn't have enough information to implement SnapshotInfoMixin, and I didn't want to duplicate the method signature for display_name.
However, I've added it, and I guess mypy will keep the method signatures in sync.
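The resolution can be sketched like this. The classes below are simplified, hypothetical stand-ins for the real SQLMesh types, with only enough detail to show why exposing display_name on SnapshotIdAndVersion removes the isinstance check:

```python
from dataclasses import dataclass


def model_display_name(node_name: str, qualified: bool = True) -> str:
    """Hypothetical helper: build a display name from just the snapshot name
    string, without requiring a full SnapshotTableInfo."""
    return node_name if qualified else node_name.split(".")[-1]


@dataclass(frozen=True)
class SnapshotIdAndVersion:
    """Simplified stand-in for SQLMesh's class of the same name."""

    name: str
    version: str

    def display_name(self, qualified: bool = True) -> str:
        # With display_name available here, callers can format any
        # snapshot-like object uniformly instead of first checking
        # isinstance(snapshot, SnapshotInfoMixin).
        return model_display_name(node_name=self.name, qualified=qualified)


print(SnapshotIdAndVersion("db.a", "v1").display_name())  # db.a
```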
```python
version = snapshot.table_info.version
if (
    prod_snapshot := promoted_snapshots_by_name.get(name)
) and prod_snapshot.version != version:
```
Isn't it sufficient to just check the version? I'm sure the version will match if the snapshots match
Resolved internally - this is checking the version; the := just got misread as =.
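For readers who, like the reviewer, initially misread := as =: the walrus operator both binds the looked-up value and yields it to the surrounding condition, so the snippet above does compare versions. A standalone illustration (the dict and version values here are made up):

```python
promoted_snapshots_by_name = {"db.a": {"version": "v2"}}
name, version = "db.a", "v1"

# := binds prod_snapshot to the .get() result AND returns it, so the `and`
# clause only runs when the snapshot exists in prod, and then compares versions.
if (prod_snapshot := promoted_snapshots_by_name.get(name)) and prod_snapshot[
    "version"
] != version:
    changed_in_prod = True
else:
    changed_in_prod = False

print(changed_in_prod)  # True: prod moved to v2 while this plan restated v1
```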
sqlmesh/core/plan/explainer.py (Outdated)

```python
if isinstance(snapshot, SnapshotInfoMixin):
    return snapshot.display_name(**naming_kwargs)

return model_display_name(node_name=snapshot.name, **naming_kwargs)
```
izeigerman left a comment:
Few more comments. LGTM once addressed.
This PR builds on #5273 and #5274.

Currently, the `RestatementStage` clears intervals from state before the `BackfillStage` populates data. This means that other processes that look at state while the restatement is running will see missing intervals and try to fill them, competing with the restatement process.

This PR adjusts the plan evaluation order to:
- Move `RestatementStage` to after `BackfillStage`. Note that `RestatementStage` doesn't actually perform restatement; it's just responsible for clearing intervals from state.
- Change `RestatementStage` to clear intervals from all environments except prod.

The result of this is that:
- New data is backfilled into `prod` before intervals are cleared, so no other processes see missing intervals and compete with the current plan to try and fill them.

There are some new failure modes:
- If the plan fails partway through, `prod` will be in an inconsistent state. It's expected that the user will run the plan again until it succeeds. Once it succeeds, `prod` will be back in a consistent state.
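The reordering above can be sketched as a toy pipeline. The stage names mirror the description, but `run_plan` and the event strings are illustrative only, not SQLMesh's actual plan evaluator:

```python
OLD_ORDER = ["RestatementStage", "BackfillStage"]
NEW_ORDER = ["BackfillStage", "RestatementStage"]


def run_plan(stages: list[str]) -> list[str]:
    """Record the order of side effects for a given stage ordering."""
    events = []
    for stage in stages:
        if stage == "BackfillStage":
            events.append("backfill restated data into prod")
        elif stage == "RestatementStage":
            # In the new order this runs after the backfill, clearing intervals
            # from every environment EXCEPT prod, which already holds the
            # restated data by the time the clearing happens.
            events.append("clear intervals in non-prod environments")
    return events


print(run_plan(NEW_ORDER))
```

With `OLD_ORDER`, intervals are cleared before the backfill, which is the window in which other processes could observe missing intervals; `NEW_ORDER` eliminates that window for prod.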