Convert sequential single deletes into range tombstones #14448
Draft
joshkang97 wants to merge 1 commit into facebook:main
Conversation
✅ clang-tidy: No findings on changed lines. Completed in 632.6s.
Summary
Add a read-path optimization that converts contiguous point tombstones into range tombstones during forward iteration. When a configurable threshold of consecutive point deletions (with no live keys between them) is detected, a range tombstone covering
[first_tombstone_key, next_live_key) is inserted into the active mutable memtable. This benefits future iterators by enabling MergingIterator's cascading seek optimization to skip over large deleted ranges instead of scanning through individual tombstones one by one. If the memtable is switched during the read iteration, the range deletion entry is discarded.
The inserted range tombstones are logically redundant (they don't delete anything that isn't already deleted by point tombstones), skip the WAL (as a derived optimization, they are simply regenerated by future reads after a crash), and use the read snapshot's sequence number so they don't interfere with newer writes.
Key changes
- `min_tombstones_for_range_conversion` (`AdvancedColumnFamilyOptions`): Threshold of contiguous point tombstones before converting to a range tombstone. Default 0 (disabled). Dynamically changeable via `SetOptions()`.
- `DBIter` tracking logic (`db/db_iter.cc`, `db/db_iter.h`): Tracks contiguous tombstones during `FindNextUserEntryInternal()`. When a live key (`kTypeValue`, `kTypeMerge`, etc.) is encountered after enough tombstones, calls `MaybeInsertRangeTombstone()`. Tracking is reset on Seek/Prev/direction changes.
- `MemTable::AddLogicallyRedundantRangeTombstone()` (`db/memtable.cc`, `db/memtable.h`): New method on the `ReadOnlyMemTable` interface that inserts a range tombstone using concurrent memtable insert. Returns false if the memtable has already been switched to immutable (its fragmented range tombstone list has already been constructed).
- New statistics `READ_PATH_RANGE_TOMBSTONES_INSERTED` and `READ_PATH_RANGE_TOMBSTONES_TOSSED` track how many range tombstones are inserted vs. discarded.

Test Plan
Benchmark results
Setup: `fillseq,compact` 1M entries, then `deleteseq,flush` 500K entries
Workload: seekrandom, seek_nexts=100, threads=8, duration=10, 3 runs averaged
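As a rough sketch, the setup and workload above correspond to db_bench invocations along these lines. The `--min_tombstones_for_range_conversion` flag is an assumption (the presumed db_bench mapping of the new option added by this PR), and the threshold value 4 and DB path are arbitrary; the standard flags (`--benchmarks`, `--num`, `--use_existing_db`, `--seek_nexts`, `--threads`, `--duration`) are existing db_bench options.

```shell
# Build the dataset: sequentially fill and compact 1M entries,
# then sequentially delete and flush 500K entries.
./db_bench --db=/tmp/rangeconv_bench --benchmarks=fillseq,compact --num=1000000
./db_bench --db=/tmp/rangeconv_bench --use_existing_db=1 \
    --benchmarks=deleteseq,flush --num=500000

# Measure: random seeks, each followed by 100 Next() calls, 8 threads, 10s.
# --min_tombstones_for_range_conversion is the assumed flag for the new
# option; 0 would disable the conversion (baseline), a positive value enables it.
./db_bench --db=/tmp/rangeconv_bench --use_existing_db=1 \
    --benchmarks=seekrandom --seek_nexts=100 --threads=8 --duration=10 \
    --min_tombstones_for_range_conversion=4
```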
No-regression benchmark (compacted DB, no tombstones):
Setup: `fillseq,compact` 1M entries (no deletes)
Workload: seekrandom, seek_nexts=100, threads=8, duration=10, 3 runs averaged
No measurable regression on a clean compacted DB with no tombstones.