
Convert sequential single deletes into range tombstones#14448

Draft
joshkang97 wants to merge 1 commit into facebook:main from joshkang97:range_tombstone_insert

Conversation

@joshkang97
Contributor

Summary

Add a read-path optimization that converts contiguous point tombstones into range tombstones during forward iteration. When a configurable threshold of consecutive point deletions (with no live keys between them) is detected, a range tombstone covering [first_tombstone_key, next_live_key) is inserted into the active mutable memtable. This benefits future iterators by enabling MergingIterator's cascading seek optimization to skip over large deleted ranges instead of scanning through individual tombstones one by one.

If there is a memtable switch during the read iteration, then the range deletion entry is discarded.

The inserted range tombstones are logically redundant (they don't delete anything that isn't already deleted by the point tombstones), skip the WAL (as a derived optimization, they are simply regenerated by future reads after a crash), and use the read snapshot's sequence number so they don't interfere with newer writes.
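The conversion rule above can be sketched as a small self-contained model (the names `Entry` and `ConvertTombstones` are illustrative, not RocksDB's; the real logic lives in `DBIter::FindNextUserEntryInternal()` and `MaybeInsertRangeTombstone()`):

```cpp
#include <cassert>
#include <optional>
#include <string>
#include <vector>

// Simplified model of the read-path conversion: entries seen by a forward
// iterator are either point tombstones or live keys. When `threshold`
// consecutive tombstones (with no live key between them) are followed by a
// live key, emit a range tombstone [first_tombstone_key, next_live_key).
struct Entry {
  std::string key;
  bool is_tombstone;
};

struct RangeTombstone {
  std::string start;  // inclusive
  std::string end;    // exclusive
};

std::vector<RangeTombstone> ConvertTombstones(
    const std::vector<Entry>& entries, size_t threshold) {
  std::vector<RangeTombstone> out;
  if (threshold == 0) return out;  // 0 disables the optimization
  std::optional<std::string> first_tombstone;
  size_t run = 0;
  for (const Entry& e : entries) {
    if (e.is_tombstone) {
      if (run == 0) first_tombstone = e.key;
      ++run;
    } else {
      if (run >= threshold) {
        // Range covers [first_tombstone_key, next_live_key).
        out.push_back({*first_tombstone, e.key});
      }
      run = 0;
      first_tombstone.reset();
    }
  }
  return out;
}
```

Note the run counter resets whenever a live key is seen, mirroring the PR's "no live keys between them" requirement (and, in the real iterator, the reset on Seek/Prev/direction changes).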

Key changes

  • New option min_tombstones_for_range_conversion (AdvancedColumnFamilyOptions): Threshold of contiguous point tombstones before converting to a range tombstone. Default 0 (disabled). Dynamically changeable via SetOptions().
  • DBIter tracking logic (db/db_iter.cc, db/db_iter.h): Tracks contiguous tombstones during FindNextUserEntryInternal(). When a live key (kTypeValue, kTypeMerge, etc.) is encountered after enough tombstones, calls MaybeInsertRangeTombstone(). Tracking is reset on Seek/Prev/direction changes.
  • MemTable::AddLogicallyRedundantRangeTombstone() (db/memtable.cc, db/memtable.h): New method on ReadOnlyMemTable interface that inserts a range tombstone using concurrent memtable insert. Returns false if the memtable has already been switched to immutable (fragmented range tombstone list already constructed).
  • New statistics tickers: READ_PATH_RANGE_TOMBSTONES_INSERTED and READ_PATH_RANGE_TOMBSTONES_TOSSED to track how many range tombstones are inserted vs. discarded.

Test Plan

  • New unit tests
  • Stress test coverage

Benchmark results

Setup: fillseq,compact 1M entries, then deleteseq,flush 500K entries
Workload: seekrandom, seek_nexts=100, threads=8, duration=10, 3 runs averaged

| Branch | ops/s | % change vs main |
| --- | --- | --- |
| main | 300 | — |
| range_tombstone_insert (threshold=8) | 269,389 | +89,696% |
| range_tombstone_insert (threshold=0) | 290 | -3.3% |

No-regression benchmark (compacted DB, no tombstones):

Setup: fillseq,compact 1M entries (no deletes)
Workload: seekrandom, seek_nexts=100, threads=8, duration=10, 3 runs averaged

| Branch | ops/s | % change vs main |
| --- | --- | --- |
| main | 341,378 | — |
| range_tombstone_insert (threshold=8) | 342,449 | +0.3% |
| range_tombstone_insert (threshold=0) | 344,238 | +0.8% |

No measurable regression on a clean compacted DB with no tombstones.

@meta-cla meta-cla bot added the CLA Signed label Mar 10, 2026
@github-actions

github-actions bot commented Mar 10, 2026

✅ clang-tidy: No findings on changed lines

Completed in 632.6s.

@joshkang97 joshkang97 force-pushed the range_tombstone_insert branch 2 times, most recently from 796ffa2 to 75dc66d Compare March 10, 2026 20:54
@joshkang97 joshkang97 requested review from hx235, pdillinger and xingbowang and removed request for hx235 March 10, 2026 21:05
@joshkang97 joshkang97 force-pushed the range_tombstone_insert branch from 75dc66d to b4664be Compare March 10, 2026 22:14
