[WIP] Fix stale atomic_update_versions_ trapping version installation in secondary DB#14439
Open
hx235 wants to merge 1 commit into facebook:main from
Conversation
✅ clang-tidy: No findings on changed lines. Completed in 288.1s.
Force-pushed from 3e409ae to b3010a1
hx235 (author): The linter warning is not related to this change.
hx235 (author): Ah, HISTORY.md was not auto-generated; just pushed a new one on top of clean testing signals (excluding the unrelated linter warning).
Fix stale atomic_update_versions_ trapping version installation in secondary DB
Summary:
When `atomic_flush=true`, flush writes version edits for all CFs as an
atomic group in the manifest. `VersionEditHandlerPointInTime` replays
these groups using a staging map (`atomic_update_versions_`): versions
are held there until ALL CFs in the group have valid versions, then
moved to `versions_` for installation. If any CF has missing files, the
map is never cleared.
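The staging behavior above can be sketched in a few lines. This is an illustrative model with simplified types, not the actual RocksDB `VersionEditHandlerPointInTime` code: an empty string stands in for the `nullptr` case where a CF's version could not be built.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <utility>

// Sketch of the all-or-nothing staging described above (illustrative
// types, not RocksDB's). An empty string models a CF whose version
// could not be built because its files are missing.
struct AtomicGroupStager {
  explicit AtomicGroupStager(size_t group_size) : group_size_(group_size) {}

  // Stage one CF's replayed version; the whole group is installed only
  // once every CF in the group has a valid version.
  void Stage(uint32_t cf, std::string version) {
    atomic_update_versions_[cf] = std::move(version);
    TryInstallGroup();
  }

  std::map<uint32_t, std::string> atomic_update_versions_;  // staging map
  std::map<uint32_t, std::string> versions_;                // installed

 private:
  void TryInstallGroup() {
    if (atomic_update_versions_.size() < group_size_) return;
    for (const auto& [cf, v] : atomic_update_versions_) {
      if (v.empty()) return;  // some CF has no valid version: stay staged
    }
    for (auto& [cf, v] : atomic_update_versions_) {
      versions_[cf] = std::move(v);  // all valid: move into versions_
    }
    atomic_update_versions_.clear();
  }
  size_t group_size_;
};
```

With a complete group both versions install and the map clears; with one invalid CF, nothing installs and the map keeps both entries, which is the stale state this PR fixes.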
For secondary/follower DB (`ManifestTailer`), files can become
permanently unavailable when manifest rotation races with primary
compaction. Example: secondary opens MANIFEST-4, then the primary
compacts CF0 (deleting CF0's atomic-flushed SST) and rotates to
MANIFEST-5. When the secondary reads the manifest it already opened,
CF0's file is gone — the atomic group is permanently incomplete:
`atomic_update_versions_` stays {CF0: nullptr, CF1: valid_version}.
Once stuck, all subsequent version creation for CF0 and CF1 is routed
into this stale map, so nothing reaches `versions_` and no versions
are ever installed. Symptoms: remote compaction "Cannot find matched
SST files", secondary DB serving stale data.
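The "stuck" routing can be made concrete with a small model. This is an assumption-laden sketch of the behavior the text describes (later version creation being diverted into the non-empty staging map), not the real replay code path:

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <utility>

// Sketch of the trap described above (illustrative, not RocksDB's
// actual routing logic): while the staging map is non-empty, newly
// created versions are staged rather than installed, so a permanently
// incomplete group starves versions_ forever.
struct StuckReplay {
  std::map<uint32_t, std::string> atomic_update_versions_;  // staging map
  std::map<uint32_t, std::string> versions_;                // installed

  void OnVersionCreated(uint32_t cf, std::string v) {
    if (!atomic_update_versions_.empty()) {
      // A group is (still) in flight: route into the staging map.
      atomic_update_versions_[cf] = std::move(v);
    } else {
      versions_[cf] = std::move(v);
    }
  }
};
```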
TryCatchUpWithPrimary cannot be used as a fix because the primary
keeps writing new atomic groups, each of which can also become
incomplete due to the same manifest rotation race. The secondary
would be endlessly chasing a moving target — each catch-up attempt
encounters new incomplete atomic groups that re-pollute the staging
map.
We can relax the all-or-nothing guarantee for ManifestTailer because
it exists for primary best-effort recovery where missing files mean
data loss on a frozen filesystem. For secondary DB, files are missing
because the primary compacted them away — not data loss. Cross-CF
read consistency in secondary is provided at a different layer
(NewIterators/MultiGet use shared sequence numbers), so installing
valid CF versions individually is safe.
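The shared-sequence-number point can be illustrated with a toy read path. The types below are hypothetical stand-ins, not RocksDB's iterator machinery: the claim is only that reading every CF at one shared sequence number yields a consistent cross-CF cut, even when the CFs' versions were installed at different times.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Illustrative sketch: cross-CF read consistency comes from a shared
// snapshot sequence number, not from installing CF versions atomically.
struct Entry {
  uint64_t seq;  // sequence number the write was tagged with
  int value;
};

// Return the newest value in `cf` visible at `snapshot_seq`
// (i.e. the entry with the largest seq <= snapshot_seq), or -1.
int ReadAtSeq(const std::vector<Entry>& cf, uint64_t snapshot_seq) {
  int result = -1;
  uint64_t best = 0;
  for (const auto& e : cf) {
    if (e.seq <= snapshot_seq && e.seq >= best) {
      best = e.seq;
      result = e.value;
    }
  }
  return result;
}
```

Both CFs are read at the same sequence number, so the cut is consistent regardless of how far each CF's installed version has individually advanced.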
Fix: override OnAtomicGroupReplayEnd() in ManifestTailer to salvage
valid versions from incomplete atomic groups and clear the staging
map. The all-or-nothing invariant for primary best-effort recovery
is preserved.
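A minimal sketch of the override, with simplified stand-in types (the real `VersionEditHandlerPointInTime`/`ManifestTailer` signatures differ): the base handler keeps the all-or-nothing behavior, while the tailer salvages whatever versions are valid and always clears the staging map.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <utility>

// Illustrative sketch of the fix; "" models a CF whose version could
// not be built (missing files). Not the actual RocksDB classes.
struct PointInTimeHandler {
  std::map<uint32_t, std::string> atomic_update_versions_;  // staging map
  std::map<uint32_t, std::string> versions_;                // installed

  virtual ~PointInTimeHandler() = default;

  // Primary best-effort recovery: all-or-nothing. An incomplete group
  // stays staged and nothing is installed.
  virtual void OnAtomicGroupReplayEnd() {
    for (const auto& [cf, v] : atomic_update_versions_) {
      if (v.empty()) return;  // incomplete: leave everything staged
    }
    InstallStagedVersions();
  }

 protected:
  void InstallStagedVersions() {
    for (auto& [cf, v] : atomic_update_versions_) {
      if (!v.empty()) versions_[cf] = std::move(v);  // salvage valid CFs
    }
    atomic_update_versions_.clear();  // map can no longer trap versions
  }
};

// Secondary-DB tailer: salvage valid versions from incomplete groups
// and clear the staging map unconditionally.
struct TailerHandler : PointInTimeHandler {
  void OnAtomicGroupReplayEnd() override { InstallStagedVersions(); }
};
```

The override changes behavior only on the tailer path; the base class, standing in for primary best-effort recovery, still refuses to install an incomplete group.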
Test Plan:
New test `AtomicFlushRemoteCompactionMissingFile` reproduces the race
where secondary DB gets an incomplete atomic group due to concurrent
primary compaction + manifest rotation.
- Before fix: "Cannot find matched SST files for the following file
numbers: 9 14"
- After fix: compaction succeeds, both CFs compacted to L1
Force-pushed from b3010a1 to 5fbf050
hx235 (author): Turning this into a draft to work on "TODO: consider cross-CF read consistency requirement in secondary DB and whether that conflicts with this change."