Skip to content

Fix: When restating prod, clear intervals across all related snapshots, not just promoted ones#5274

Merged
erindru merged 3 commits intomainfrom
erin/fix-restatement-clear-across-all-environments
Sep 10, 2025
Merged

Fix: When restating prod, clear intervals across all related snapshots, not just promoted ones#5274
erindru merged 3 commits intomainfrom
erin/fix-restatement-clear-across-all-environments

Conversation

@erindru
Copy link
Collaborator

@erindru erindru commented Sep 2, 2025

This builds on StateSync.get_snapshots_by_names introduced in #5273

The current implementation of prod restatements only prevents dev data from getting promoted to prod for snapshots that are currently "active" in a dev environment.

However, it's possible for snapshots to exist that aren't currently promoted in any dev environment (since they represent an older version of a model) but may get promoted if a user reverts to that older version.

If that revert happens and then the environment is subsequently deployed, the old data may still make its way to prod.

So this PR ensures that all versions of affected snapshots have their intervals cleared, regardless of if they are promoted in an environment or not

return list(sorted(self.environment_names))


def identify_restatement_intervals_across_snapshot_versions(
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This got moved to common because I plan to call it outside the evaluator in an upcoming PR that improves the console output as well as the --explain output around restatements

@erindru erindru force-pushed the erin/state-sync-snapshots-by-name branch 3 times, most recently from 6040d43 to 48d45aa Compare September 7, 2025 21:10
@erindru erindru force-pushed the erin/fix-restatement-clear-across-all-environments branch from 042e3a1 to f33490a Compare September 7, 2025 23:54
@erindru erindru force-pushed the erin/state-sync-snapshots-by-name branch from 48d45aa to f234322 Compare September 8, 2025 00:49
Base automatically changed from erin/state-sync-snapshots-by-name to main September 8, 2025 01:36
@erindru erindru force-pushed the erin/fix-restatement-clear-across-all-environments branch from f33490a to 81bac93 Compare September 8, 2025 01:46
@erindru erindru marked this pull request as ready for review September 8, 2025 01:46
}

# identify the ones that we havent picked up yet, which are the ones that dont exist in any environment
if remaining_snapshot_ids := set(all_matching_non_prod_snapshots).difference(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess it's worth noting that there's still a slight risk of missing a relevant dependency. For example, if there's a snapshot A' that is not promoted anywhere which has a downstream dependency D which is a model that has been removed from all existing environments, we won't drop intervals for D like we're suppose to because its name won't show up here.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that's true. We only determine the dependency tree based on promoted snapshots.

I couldn't think of a sane way to push dependency resolution to the db layer, and reading all snapshots to figure this out is a non-starter, so you're right in that this PR improves rather than fully eliminates the current situation.

# required by StateSync.remove_intervals()
# but at this point we have minimized the list by excluding the ones that are already present in prod
# and also excluding the ones we have already matched earlier while traversing the environment DAGs
remaining_snapshots = state_reader.get_snapshots(snapshot_ids=remaining_snapshot_ids)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are we still fetching full snapshots for all remaining snapshots and not just the ones that are full_history_restatement_only?

Copy link
Collaborator Author

@erindru erindru Sep 8, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The handling of full_history_restatement_only happens next on line 183.

This step is just for snapshots that are not promoted in any environment. They may or may not be full_history_restatement_only, at this point we are just identifying them and setting the intervals to clear based on the requested restatements.

Note that Environments only have SnapshotTableInfo available, not full Snapshot, so everything within this method is based on SnapshotTableInfo. The other reason for needing SnapshotTableInfo is that ultimately these get passed to StateSync.remove_intervals(), which requires SnapshotTableInfo at a minimum.

So the order currently goes:

  • Iterate through each environment, pick up any snapshots and downstream dependencies matching the requested restatements and add the SnapshotId/SnapshotTableInfo snapshot_intervals_to_clear.
  • Then, based on name, pick up any remaining snapshots that weren't promoted in any environment, and add the SnapshotId/SnapshotTableInfo to snapshot_intervals_to_clear. In order to get SnapshotTableInfo for them, we have to look up the full Snapshot record. Note that we extend loaded_snapshots with these so we can use them in the next step.
  • Next, for everything we have identified in snapshot_intervals_to_clear, we check if any are full_history_restatement_only and if they are selectively widen intervals for those ones. To selectively widen, we need full snapshots to pass to Snapshot.get_removal_interval, so we fetch full Snapshots here.

Snapshot-fetching optimizations exist in the following places:

  • remaining_snapshot_ids are just the ones we haven't already fetched. The ones we have already fetched are in snapshot_intervals_to_clear so we do a set difference to limit to just the additional ones we need and put those in remaining_snapshot_ids.
  • After fetching full snapshots for remaining_snapshot_ids to get their SnapshotTableInfo's, we add them to the list of loaded_snapshots so they dont need to be looked up again,
  • When handling full_history_restatement_only, we only fetch full Snapshots for the ones that don't already exist in loaded_snapshots

And of course if any of the affected snapshot id's had ever been loaded at a different time, they will already exist in the disk cache

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In order to get SnapshotTableInfo for them, we have to look up the full Snapshot record. Note that we extend loaded_snapshots with these so we can use them in the next step.

Didn't we introduce SnapshotIdAndVersion precisely to avoid fetching full snapshots here?

The other reason for needing SnapshotTableInfo is that ultimately these get passed to StateSync.remove_intervals(), which requires SnapshotTableInfo at a minimum.

I'm pretty sure we can update that interface to accept SnapshotIdAndVersion, no?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, i've:

  • added kind_name to SnapshotIdAndVersion (so we can determine full_history_restatement_only without fetching full Snapshots)
  • Created a SnapshotIdAndVersionLike type that encompasses SnapshotIdAndVersion plus its supersets
  • reduced the StateSync.remove_intervals api to only require SnapshotIdAndVersionLike rather than SnapshotInfoLike

@erindru erindru force-pushed the erin/fix-restatement-clear-across-all-environments branch from 81bac93 to 1438970 Compare September 9, 2025 22:48

name: str
version: str
kind_name: t.Optional[ModelKindName] = None
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: this is optional because it's defined as optional on SnapshotTableInfo and I wanted to keep treating SnapshotTableInfo as a superset

@erindru erindru force-pushed the erin/fix-restatement-clear-across-all-environments branch from 1438970 to 924e59e Compare September 9, 2025 23:39
# So for now, these are not considered
s_id
for s_id, s in snapshot_intervals_to_clear.items()
if s.snapshot.kind_name and s.snapshot.kind_name.full_history_restatement_only
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm pretty sure SnapshotIdAndVersion can implement ModelKindMixin and then you can continue doing just snapshot.full_history_restatement_only .

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea, that didnt occur to me. Done

Copy link
Collaborator

@izeigerman izeigerman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

very nice 👍

@erindru erindru merged commit 203b74e into main Sep 10, 2025
36 checks passed
@erindru erindru deleted the erin/fix-restatement-clear-across-all-environments branch September 10, 2025 21:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants