
[WIP] Fix stale atomic_update_versions_ trapping version installation in secondary DB#14439

Open
hx235 wants to merge 1 commit into facebook:main from hx235:repro_af_secondary

Conversation


@hx235 hx235 commented Mar 9, 2026

Context/Summary:

When atomic_flush=true, flush writes version edits for all CFs as an atomic group in the manifest. VersionEditHandlerPointInTime replays these groups using a staging map (atomic_update_versions_): versions are held there until ALL CFs in the group have valid versions, then moved to versions_ for installation. If any CF has missing files, the map is never cleared.

Bug:
For secondary/follower DB (ManifestTailer), files can become permanently unavailable when manifest rotation races with primary compaction. Example: secondary opens MANIFEST-4, then the primary compacts CF0 (deleting CF0's atomic-flushed SST) and rotates to MANIFEST-5. When the secondary reads the manifest it already opened, CF0's file is gone — the atomic group is permanently incomplete: atomic_update_versions_ stays {CF0: nullptr, CF1: valid_version}.

Once stuck, all subsequent version creation for CF0 and CF1 is routed into this stale map, so nothing reaches versions_ and no versions are ever installed. Symptoms: remote compaction "Cannot find matched SST files", secondary DB serving stale data.

TryCatchUpWithPrimary cannot be used as a fix because the primary keeps writing new atomic groups, each of which can also become incomplete due to the same manifest rotation race. The secondary would be endlessly chasing a moving target — each catch-up attempt encounters new incomplete atomic groups that re-pollute the staging map.

We can relax the all-or-nothing guarantee for ManifestTailer because it exists for primary best-effort recovery where missing files mean data loss on a frozen filesystem. For secondary DB, files are missing because the primary compacted them away — not data loss. TODO: consider cross-CF read consistency requirement in secondary db and whether that conflicts with this change.

Fix: override OnAtomicGroupReplayEnd() in ManifestTailer to salvage valid versions from incomplete atomic groups and clear the staging map. The all-or-nothing invariant for primary best-effort recovery is preserved.

Test Plan:

New test AtomicFlushRemoteCompactionMissingFile reproduces the race where secondary DB gets an incomplete atomic group due to concurrent primary compaction + manifest rotation.

  • Before fix:
[ RUN      ] CompactionServiceTest.AtomicFlushRemoteCompactionMissingFile
db/compaction/compaction_service_test.cc:2924: Failure
db_->CompactRange(cro, handles_[1], nullptr, nullptr)
Result incomplete: CompactionService failed to run the compaction job (even though the internal status is okay).
[  FAILED  ] CompactionServiceTest.AtomicFlushRemoteCompactionMissingFile (1211 ms)
[----------] 1 test from CompactionServiceTest (1211 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test case ran. (1211 ms total)
[  PASSED  ] 0 tests.
[  FAILED  ] 1 test, listed below:
[  FAILED  ] CompactionServiceTest.AtomicFlushRemoteCompactionMissingFile
  • After fix: compaction succeeds, both CFs compacted to L1

@meta-cla meta-cla bot added the CLA Signed label Mar 9, 2026

github-actions bot commented Mar 9, 2026

✅ clang-tidy: No findings on changed lines

Completed in 288.1s.

hx235 force-pushed the repro_af_secondary branch from 3e409ae to b3010a1 on March 9, 2026 at 02:11

meta-codesync bot commented Mar 9, 2026

@hx235 has imported this pull request. If you are a Meta employee, you can view this in D95749838.


hx235 commented Mar 9, 2026

Linter failure is not related to this change.

@hx235 hx235 requested review from anand1976 and jaykorean and removed request for jaykorean March 9, 2026 05:27
Contributor Author

hx235 commented Mar 9, 2026

Ah, HISTORY.md was not auto-generated; just pushed a new one on top of clean testing signals (excluding the unrelated linter warning).

…condary DB

Summary:
When `atomic_flush=true`, flush writes version edits for all CFs as an
atomic group in the manifest. `VersionEditHandlerPointInTime` replays
these groups using a staging map (`atomic_update_versions_`): versions
are held there until ALL CFs in the group have valid versions, then
moved to `versions_` for installation. If any CF has missing files, the
map is never cleared.

For secondary/follower DB (`ManifestTailer`), files can become
permanently unavailable when manifest rotation races with primary
compaction. Example: secondary opens MANIFEST-4, then the primary
compacts CF0 (deleting CF0's atomic-flushed SST) and rotates to
MANIFEST-5. When the secondary reads the manifest it already opened,
CF0's file is gone — the atomic group is permanently incomplete:
`atomic_update_versions_` stays {CF0: nullptr, CF1: valid_version}.

Once stuck, all subsequent version creation for CF0 and CF1 is routed
into this stale map, so nothing reaches `versions_` and no versions
are ever installed. Symptoms: remote compaction "Cannot find matched
SST files", secondary DB serving stale data.

TryCatchUpWithPrimary cannot be used as a fix because the primary
keeps writing new atomic groups, each of which can also become
incomplete due to the same manifest rotation race. The secondary
would be endlessly chasing a moving target — each catch-up attempt
encounters new incomplete atomic groups that re-pollute the staging
map.

We can relax the all-or-nothing guarantee for ManifestTailer because
it exists for primary best-effort recovery where missing files mean
data loss on a frozen filesystem. For secondary DB, files are missing
because the primary compacted them away — not data loss. Cross-CF
read consistency in secondary is provided at a different layer
(NewIterators/MultiGet use shared sequence numbers), so installing
valid CF versions individually is safe.

Fix: override OnAtomicGroupReplayEnd() in ManifestTailer to salvage
valid versions from incomplete atomic groups and clear the staging
map. The all-or-nothing invariant for primary best-effort recovery
is preserved.

Test Plan:
New test `AtomicFlushRemoteCompactionMissingFile` reproduces the race
where secondary DB gets an incomplete atomic group due to concurrent
primary compaction + manifest rotation.
- Before fix: "Cannot find matched SST files for the following file
  numbers: 9 14"
- After fix: compaction succeeds, both CFs compacted to L1
hx235 force-pushed the repro_af_secondary branch from b3010a1 to 5fbf050 on March 9, 2026 at 21:45

meta-codesync bot commented Mar 9, 2026

@hx235 has imported this pull request. If you are a Meta employee, you can view this in D95749838.


hx235 commented Mar 9, 2026

Turning this into a draft to work further on: "TODO: consider cross-CF read consistency requirement in secondary db and whether that conflicts with this change."

@hx235 hx235 removed the request for review from anand1976 March 9, 2026 22:30
@hx235 hx235 changed the title Fix stale atomic_update_versions_ trapping version installation in secondary DB [WIP] Fix stale atomic_update_versions_ trapping version installation in secondary DB Mar 9, 2026