Conversation
| return list(sorted(self.environment_names)) | ||
|
|
||
|
|
||
| def identify_restatement_intervals_across_snapshot_versions( |
There was a problem hiding this comment.
This got moved to common because I plan to call it outside the evaluator in an upcoming PR that improves the console output as well as the --explain output around restatements
6040d43 to
48d45aa
Compare
042e3a1 to
f33490a
Compare
48d45aa to
f234322
Compare
f33490a to
81bac93
Compare
| } | ||
|
|
||
| # identify the ones that we havent picked up yet, which are the ones that dont exist in any environment | ||
| if remaining_snapshot_ids := set(all_matching_non_prod_snapshots).difference( |
There was a problem hiding this comment.
I guess it's worth noting that there's still a slight risk of missing a relevant dependency. For example, if there's a snapshot A' that is not promoted anywhere which has a downstream dependency D which is a model that has been removed from all existing environments, we won't drop intervals for D like we're suppose to because its name won't show up here.
There was a problem hiding this comment.
Yes, that's true. We only determine the dependency tree based on promoted snapshots.
I couldn't think of a sane way to push dependency resolution to the db layer, and reading all snapshots to figure this out is a non-starter, so you're right in that this PR improves rather than fully eliminates the current situation.
sqlmesh/core/plan/common.py
Outdated
| # required by StateSync.remove_intervals() | ||
| # but at this point we have minimized the list by excluding the ones that are already present in prod | ||
| # and also excluding the ones we have already matched earlier while traversing the environment DAGs | ||
| remaining_snapshots = state_reader.get_snapshots(snapshot_ids=remaining_snapshot_ids) |
There was a problem hiding this comment.
Why are we still fetching full snapshots for all remaining snapshots and not just the ones that are full_history_restatement_only?
There was a problem hiding this comment.
The handling of full_history_restatement_only happens next on line 183.
This step is just for snapshots that are not promoted in any environment. They may or may not be full_history_restatement_only, at this point we are just identifying them and setting the intervals to clear based on the requested restatements.
Note that Environments only have SnapshotTableInfo available, not full Snapshot, so everything within this method is based on SnapshotTableInfo. The other reason for needing SnapshotTableInfo is that ultimately these get passed to StateSync.remove_intervals(), which requires SnapshotTableInfo at a minimum.
So the order currently goes:
- Iterate through each environment, pick up any snapshots and downstream dependencies matching the requested restatements and add the
SnapshotId/SnapshotTableInfosnapshot_intervals_to_clear. - Then, based on
name, pick up any remaining snapshots that weren't promoted in any environment, and add theSnapshotId/SnapshotTableInfotosnapshot_intervals_to_clear. In order to getSnapshotTableInfofor them, we have to look up the fullSnapshotrecord. Note that we extendloaded_snapshotswith these so we can use them in the next step. - Next, for everything we have identified in
snapshot_intervals_to_clear, we check if any arefull_history_restatement_onlyand if they are selectively widen intervals for those ones. To selectively widen, we need full snapshots to pass toSnapshot.get_removal_interval, so we fetch fullSnapshots here.
Snapshot-fetching optimizations exist in the following places:
remaining_snapshot_idsare just the ones we haven't already fetched. The ones we have already fetched are insnapshot_intervals_to_clearso we do a set difference to limit to just the additional ones we need and put those inremaining_snapshot_ids.- After fetching full snapshots for
remaining_snapshot_idsto get theirSnapshotTableInfo's, we add them to the list ofloaded_snapshotsso they dont need to be looked up again, - When handling
full_history_restatement_only, we only fetch fullSnapshots for the ones that don't already exist inloaded_snapshots
And of course if any of the affected snapshot id's had ever been loaded at a different time, they will already exist in the disk cache
There was a problem hiding this comment.
In order to get SnapshotTableInfo for them, we have to look up the full Snapshot record. Note that we extend loaded_snapshots with these so we can use them in the next step.
Didn't we introduce SnapshotIdAndVersion precisely to avoid fetching full snapshots here?
The other reason for needing SnapshotTableInfo is that ultimately these get passed to StateSync.remove_intervals(), which requires SnapshotTableInfo at a minimum.
I'm pretty sure we can update that interface to accept SnapshotIdAndVersion, no?
There was a problem hiding this comment.
Ok, i've:
- added
kind_nametoSnapshotIdAndVersion(so we can determinefull_history_restatement_onlywithout fetching fullSnapshots) - Created a
SnapshotIdAndVersionLiketype that encompassesSnapshotIdAndVersionplus its supersets - reduced the
StateSync.remove_intervalsapi to only requireSnapshotIdAndVersionLikerather thanSnapshotInfoLike
81bac93 to
1438970
Compare
sqlmesh/core/snapshot/definition.py
Outdated
|
|
||
| name: str | ||
| version: str | ||
| kind_name: t.Optional[ModelKindName] = None |
There was a problem hiding this comment.
Note: this is optional because it's defined as optional on SnapshotTableInfo and I wanted to keep treating SnapshotTableInfo as a superset
…s, not just promoted ones
1438970 to
924e59e
Compare
sqlmesh/core/plan/common.py
Outdated
| # So for now, these are not considered | ||
| s_id | ||
| for s_id, s in snapshot_intervals_to_clear.items() | ||
| if s.snapshot.kind_name and s.snapshot.kind_name.full_history_restatement_only |
There was a problem hiding this comment.
I'm pretty sure SnapshotIdAndVersion can implement ModelKindMixin and then you can continue doing just snapshot.full_history_restatement_only .
There was a problem hiding this comment.
Good idea, that didnt occur to me. Done
This builds on
StateSync.get_snapshots_by_namesintroduced in #5273The current implementation of prod restatements only prevents dev data from getting promoted to prod for snapshots that are currently "active" in a dev environment.
However, it's possible for snapshots to exist that aren't currently promoted in any dev environment (since they represent an older version of a model) but may get promoted if a user reverts to that older version.
If that revert happens and then the environment is subsequently deployed, the old data may still make its way to prod.
So this PR ensures that all versions of affected snapshots have their intervals cleared, regardless of if they are promoted in an environment or not