Feat: prevent other processes seeing missing intervals during restatement#5285
Conversation
sqlmesh/core/plan/evaluator.py (Outdated)

```diff
 self.state_sync.remove_intervals(
-    snapshot_intervals=list(snapshot_intervals_to_restate),
+    snapshot_intervals=[(s.table_info, s.interval) for s in intervals_to_clear.values()],
```
I believe we should not nuke intervals for snapshots that have changed while the restatement was running. As per our discussion earlier:
There’s a chance that another snapshot sneaks into prod while the restatement was running but before we nuked the intervals. This snapshot’s intervals will be nuked, but now we have an intermittent period during which prod is in an incorrect state. The subsequent run will, of course, catch up the missing intervals. However, this is confusing to the user since their next run will now backfill models it wasn’t supposed to. To prevent this, the restatement plan will now check whether the snapshots for restated models changed in prod while this plan was running. If some snapshots were changed, the intervals for those won’t be removed and instead the plan will return an error stating that the restatement for these models wasn’t successful. The user will then have a choice whether to reissue restatement for these models or accept that these models haven’t been restated properly and move on.
Specifically:
If some snapshots were changed, the intervals for those won’t be removed
Ah, good point, I glossed over that when I was focusing on identifying the affected snapshots to begin with.
I've updated it to only clear intervals for snapshots whose names are not in the "deployed during restatement" list.
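The fix described above can be sketched as follows. Note that `SnapshotStub`, `intervals_safe_to_clear`, and the sample model names are hypothetical stand-ins for illustration, not SQLMesh's actual API:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SnapshotStub:
    """Hypothetical stand-in for a snapshot: just a name and a version."""

    name: str
    version: str


def intervals_safe_to_clear(
    intervals_to_clear: dict[str, SnapshotStub],
    deployed_during_restatement: set[str],
) -> dict[str, SnapshotStub]:
    """Keep only snapshots whose names were NOT redeployed to prod while the
    restatement plan was running; the rest are left untouched and reported
    back to the user as an error."""
    return {
        name: snapshot
        for name, snapshot in intervals_to_clear.items()
        if name not in deployed_during_restatement
    }


safe = intervals_safe_to_clear(
    {"db.a": SnapshotStub("db.a", "v1"), "db.b": SnapshotStub("db.b", "v2")},
    deployed_during_restatement={"db.b"},
)
print(sorted(safe))  # "db.b" changed during restatement, so only "db.a" survives
```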
sqlmesh/core/plan/evaluator.py (Outdated)

```python
    plan.environment.naming_info,
    self.default_catalog,
)
# note: the plan will automatically fail at the promotion stage with a ConflictingPlanError because the environment was changed by another plan
```
I see. Please note that this is not always going to be the case since we plan to support concurrent plan applications soon. So I'd suggest we fail explicitly and not rely on a failure downstream.
Additionally, we won't need log_models_updated_during_restatement since we can just put the entire message into the exception.
I think we still need log_models_updated_during_restatement to have usefully formatted console output about the affected models, rather than trying to jam all this info into an exception message.
However, I've moved the "what to do about it" part into the exception message and now raise an exception here instead of relying on the finalize stage throwing the error.
I've updated the screenshot in the PR description to give an indication of what it looks like.
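The outcome of this thread can be sketched as below: per-model details go to the console, while the exception carries only the actionable message. The exception class, function name, and message text here are hypothetical, not SQLMesh's actual code:

```python
class RestatementNotCompleteError(Exception):
    """Hypothetical exception type; the real error class in SQLMesh may differ."""


def check_restated_snapshots(updated_models: list[str]) -> None:
    if not updated_models:
        return
    # Detailed per-model information goes to the console (the role played by
    # log_models_updated_during_restatement), while the exception carries only
    # the actionable "what to do about it" message.
    for name in updated_models:
        print(f"Model '{name}' changed in prod while the restatement was running")
    raise RestatementNotCompleteError(
        "Some models were not restated because their snapshots changed in prod "
        "during the plan. Re-issue the restatement for these models, or accept "
        "that they were not restated and move on."
    )
```

Raising explicitly here, rather than relying on a downstream ConflictingPlanError, keeps the failure correct even once concurrent plan applications are supported.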
sqlmesh/core/plan/explainer.py (Outdated)

```python
# allows us to print the physical names of the tables affected in the console output
# note that we can't use the DeployabilityIndex on the plan because it only includes
# snapshots for the current environment, not across all environments
deployability_index = DeployabilityIndex.create(
```
We should only do this if the target environment is not prod.
Actually, I believe we should always use DeployabilityIndex.all_deployable() since "restatement" doesn't apply to "dev" tables. For "dev" tables it's always a preview.
RestatementStage isn't even produced if the target is not prod, because as of this PR it only refers to clearing intervals from any environment that isn't prod, which is only necessary if the target is prod.
I think I see what you're saying - a prod restatement would never care about dev previews because that data has no chance of ever being deployed. So there is no need to build a deployability index, and in the console output we should just reference the deployable version of the model's physical table, which we can do with DeployabilityIndex.all_deployable().
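The conclusion above - always reference the deployable table when explaining a prod restatement - can be illustrated with a small sketch. `PhysicalTables` and the table-name format are hypothetical, not SQLMesh's actual naming scheme:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PhysicalTables:
    """Hypothetical stand-in: the two physical tables a snapshot can have."""

    deployable: str  # the table prod reads from
    preview: str  # the dev-only preview table


def table_for_restatement_output(tables: PhysicalTables) -> str:
    # A prod restatement never touches dev previews (that data can never be
    # deployed), so, mirroring DeployabilityIndex.all_deployable() semantics,
    # the console output always shows the deployable table.
    return tables.deployable


tables = PhysicalTables("sqlmesh__db.db__a__1234", "sqlmesh__db.db__a__1234__temp")
print(table_for_restatement_output(tables))  # sqlmesh__db.db__a__1234
```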
```python
)


def model_display_name(
```
I added this because I wanted to output model display names based on SnapshotIdAndVersion, and the existing display_name function required an entire SnapshotTableInfo object just to read the node type and name.
If we already know it's a model, then all we need to produce a display name is the snapshot name string, so model_display_name reduces the requirement to just this.
SnapshotIdLike would work as well, to help narrow the inputs to still be related to Snapshots rather than any random str; let me know if this is preferred.
Should we just add display_name to SnapshotIdAndVersion for consistency? Then you no longer need to do the isinstance check.
I didn't do that initially because SnapshotIdAndVersion didn't have enough information to implement SnapshotInfoMixin, and I didn't want to duplicate the method signature for display_name.
However, I've added it, and I guess mypy will keep the method signatures in sync.
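The resolution can be sketched like this. The classes below are simplified, hypothetical stand-ins for the real SQLMesh types, with only enough detail to show why exposing display_name on SnapshotIdAndVersion removes the isinstance check:

```python
from dataclasses import dataclass


def model_display_name(node_name: str, qualified: bool = True) -> str:
    """Hypothetical helper: build a display name from just the snapshot name
    string, without requiring a full SnapshotTableInfo."""
    return node_name if qualified else node_name.split(".")[-1]


@dataclass(frozen=True)
class SnapshotIdAndVersion:
    """Simplified stand-in for SQLMesh's class of the same name."""

    name: str
    version: str

    def display_name(self, qualified: bool = True) -> str:
        # With display_name available here, callers can format any
        # snapshot-like object uniformly instead of first checking
        # isinstance(snapshot, SnapshotInfoMixin).
        return model_display_name(node_name=self.name, qualified=qualified)


print(SnapshotIdAndVersion("db.a", "v1").display_name())  # db.a
```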
```python
version = snapshot.table_info.version
if (
    prod_snapshot := promoted_snapshots_by_name.get(name)
) and prod_snapshot.version != version:
```
Isn't it sufficient to just check the version? I'm sure the version will match if the snapshots match
Resolved internally - this is checking the version; the := just got misread as =.
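For readers who, like the reviewer, initially misread := as =: the walrus operator both binds the looked-up value and yields it to the surrounding condition, so the snippet above does compare versions. A standalone illustration (the dict and version values here are made up):

```python
promoted_snapshots_by_name = {"db.a": {"version": "v2"}}
name, version = "db.a", "v1"

# := binds prod_snapshot to the .get() result AND returns it, so the `and`
# clause only runs when the snapshot exists in prod, and then compares versions.
if (prod_snapshot := promoted_snapshots_by_name.get(name)) and prod_snapshot[
    "version"
] != version:
    changed_in_prod = True
else:
    changed_in_prod = False

print(changed_in_prod)  # True: prod moved to v2 while this plan restated v1
```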
sqlmesh/core/plan/explainer.py (Outdated)

```python
if isinstance(snapshot, SnapshotInfoMixin):
    return snapshot.display_name(**naming_kwargs)

return model_display_name(node_name=snapshot.name, **naming_kwargs)
```
izeigerman left a comment:
Few more comments. LGTM once addressed.
This PR builds on #5273 and #5274.

Currently, the `RestatementStage` clears intervals from state before the `BackfillStage` populates data. This means that other processes that look at state while the restatement is running will see missing intervals and try to fill them, competing with the restatement process.

This PR adjusts the plan evaluation order to:
- Move `RestatementStage` to after `BackfillStage`. Note that `RestatementStage` doesn't actually perform restatement; it's just responsible for clearing intervals from state.
- Change `RestatementStage` to clear intervals from all environments except prod.

The result of this is that:
- New data is backfilled into `prod` before intervals are cleared, so no other processes see missing intervals and compete with the current plan to try and fill them.

There are some new failure modes:
- If the plan fails partway through, `prod` will be in an inconsistent state. It's expected that the user will run the plan again until it succeeds. Once it succeeds, `prod` will be back in a consistent state.
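The reordering above can be sketched as a toy pipeline. The stage names mirror the description, but `run_plan` and the event strings are illustrative only, not SQLMesh's actual plan evaluator:

```python
OLD_ORDER = ["RestatementStage", "BackfillStage"]
NEW_ORDER = ["BackfillStage", "RestatementStage"]


def run_plan(stages: list[str]) -> list[str]:
    """Record the order of side effects for a given stage ordering."""
    events = []
    for stage in stages:
        if stage == "BackfillStage":
            events.append("backfill restated data into prod")
        elif stage == "RestatementStage":
            # In the new order this runs after the backfill, clearing intervals
            # from every environment EXCEPT prod, which already holds the
            # restated data by the time the clearing happens.
            events.append("clear intervals in non-prod environments")
    return events


print(run_plan(NEW_ORDER))
```

With `OLD_ORDER`, intervals are cleared before the backfill, which is the window in which other processes could observe missing intervals; `NEW_ORDER` eliminates that window for prod.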