feat: add range tombstones (delete_range / delete_prefix) #21
Conversation
- Add RangeTombstone, ActiveTombstoneSet, IntervalTree core types (ported from upstream PR fjall-rs#242 algorithms, own SST persistence)
- Add RangeTombstoneFilter for bidirectional iteration suppression
- Integrate into Memtable with interval tree for O(log n) queries
- Add remove_range(start, end, seqno) and remove_prefix(prefix, seqno) to AbstractTree trait, implemented for Tree and BlobTree
- Wire suppression into point reads (memtable + sealed + SST layers)
- Wire RangeTombstoneFilter into range/prefix iteration pipeline
- SST persistence: raw wire format in BlockType::RangeTombstone block
- Flush: collect range tombstones from sealed memtables, write to SSTs
- Compaction: propagate RTs from input to output tables with clipping
- GC: evict range tombstones below watermark at bottom level
- Table-skip: skip tables fully covered by a range tombstone
- MultiWriter: clip RTs to each output table's key range on rotation
- Handle RT-only tables (derive key range from tombstone bounds)
- 16 integration tests covering all paths

Closes #16
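The suppression predicate this feature hinges on can be sketched as follows. This is a hypothetical Rust sketch, not the PR's actual API: the field names, the `Vec<u8>` key type, and the exclusive `seqno < read_seqno` visibility bound (mentioned in a later commit in this thread) are assumptions.

```rust
// Hypothetical sketch of the core range-tombstone type; names and the
// exact visibility semantics are assumptions, not the crate's real API.
#[derive(Debug, Clone)]
pub struct RangeTombstone {
    pub start: Vec<u8>, // inclusive
    pub end: Vec<u8>,   // exclusive
    pub seqno: u64,
}

impl RangeTombstone {
    pub fn new(start: Vec<u8>, end: Vec<u8>, seqno: u64) -> Self {
        debug_assert!(start < end);
        Self { start, end, seqno }
    }

    /// An entry written at `key_seqno` is suppressed when it falls inside
    /// [start, end), was written before the tombstone, and the tombstone
    /// itself is visible to the reading snapshot.
    pub fn should_suppress(&self, key: &[u8], key_seqno: u64, read_seqno: u64) -> bool {
        self.seqno < read_seqno          // tombstone visible at snapshot (exclusive bound)
            && key_seqno < self.seqno    // entry is older than the tombstone
            && self.start.as_slice() <= key
            && key < self.end.as_slice()
    }
}
```

Writes newer than the tombstone survive, and a snapshot taken before the delete never sees it — that is the MVCC-aware visibility the PR description refers to.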
📝 Walkthrough

Adds end-to-end range-tombstone support: new RangeTombstone types and filters, interval-tree-backed memtable storage and active trackers, SST on-disk tombstone blocks and writer distribution, flush/compaction wiring, read-path suppression, public remove_range/remove_prefix APIs, tests, and architecture/known-gaps docs.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Client
    participant Tree
    participant Memtable
    participant Flusher as Flush/Writer
    participant SST as SST Files
    participant Compactor
    rect rgba(200,230,255,0.5)
        Client->>Tree: remove_range(start,end,seq)
        Tree->>Memtable: insert_range_tombstone(start,end,seq)
    end
    rect rgba(220,255,200,0.5)
        Client->>Tree: read / iterate (read_seqno)
        Tree->>Memtable: is_suppressed_by_range_tombstones(key, read_seqno)
        Note over Tree,Memtable: suppression may consult sealed memtables and SSTs
    end
    rect rgba(255,240,200,0.5)
        Flusher->>Memtable: flush_to_tables_with_rt(stream, tombstones)
        Flusher->>SST: write_range_tombstone block into table
    end
    rect rgba(255,220,240,0.5)
        Compactor->>SST: collect range tombstones from inputs
        Compactor->>SST: write surviving tombstones to output tables
    end
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~70 minutes
🚥 Pre-merge checks: ✅ 5/5 passed
Pull request overview
Adds first-class range tombstones to the LSM-tree to enable efficient range/prefix deletes (remove_range, remove_prefix) with MVCC-aware visibility, and persists tombstones through flush/compaction/recovery via a new SST block and table metadata.
Changes:
- Introduces core range-tombstone types and in-memory structures (interval tree in memtables; sweep-line active sets + iterator filter).
- Extends flush/compaction/table IO to write/read range tombstones from SSTs (`BlockType::RangeTombstone`, TOC region, metadata count).
- Adds integration tests validating suppression semantics and persistence across flush/compaction/recovery.
Reviewed changes
Copilot reviewed 21 out of 21 changed files in this pull request and generated 8 comments.
Show a summary per file
| File | Description |
|---|---|
| tests/range_tombstone.rs | Integration tests for range tombstone behavior and persistence. |
| src/abstract_tree.rs | Adds remove_range API and remove_prefix helper; extends flush path to pass tombstones. |
| src/tree/mod.rs | Tree integration: point-read suppression check + flush now accepts tombstones. |
| src/blob_tree/mod.rs | BlobTree integration: passes tombstones through flush; exposes remove_range. |
| src/memtable/mod.rs | Stores memtable range tombstones in a mutex-protected interval tree and exposes query/flush helpers. |
| src/memtable/interval_tree.rs | AVL interval tree backing for memtable tombstone queries. |
| src/range_tombstone.rs | Defines RangeTombstone, clipping/cover helpers, and key upper-bound helper. |
| src/active_tombstone_set.rs | Sweep-line “active tombstone” trackers for forward/reverse iteration. |
| src/range_tombstone_filter.rs | Bidirectional iterator wrapper to suppress entries covered by visible range tombstones. |
| src/range.rs | Collects tombstones across layers, adds table-skip optimization, and wraps iterators with tombstone filter. |
| src/table/block/type.rs | Adds BlockType::RangeTombstone = 4. |
| src/table/writer/mod.rs | Writes range tombstone blocks + metadata; supports RT-only tables by deriving key range. |
| src/table/regions.rs | Adds range_tombstones section handle in parsed table regions. |
| src/table/multi_writer.rs | Distributes/clips tombstones per output table via set_range_tombstones. |
| src/table/inner.rs | Stores loaded range tombstones in table inner state. |
| src/table/mod.rs | Loads/decodes range tombstone block on table recovery; exposes range_tombstones(). |
| src/compaction/flavour.rs | Extends compaction flavour to accept output range tombstones. |
| src/compaction/worker.rs | Collects tombstones from input tables and propagates them to compaction outputs (with last-level GC filtering). |
| src/lib.rs | Registers new internal modules for range tombstone support. |
| .forge/16/known-gaps.md | Documents remaining limitations and known gaps for Issue #16. |
| .forge/16/analysis.md | Architecture/design analysis and rationale for the chosen approach. |
Actionable comments posted: 9
🧹 Nitpick comments (2)
src/table/mod.rs (2)
570-577: Consider validating the block type for range tombstones.

The recovery path loads the range tombstone block with `CompressionType::None` but doesn't validate that the loaded block has `BlockType::RangeTombstone`. This differs from other block loading paths (e.g., the filter block at lines 553-560) that verify the block type.

♻️ Proposed fix
```diff
 let range_tombstones = if let Some(rt_handle) = regions.range_tombstones {
     log::trace!("Loading range tombstone block, with rt_ptr={rt_handle:?}");
     let block = Block::from_file(&file, rt_handle, crate::CompressionType::None)?;
+    if block.header.block_type != BlockType::RangeTombstone {
+        return Err(crate::Error::InvalidTag((
+            "BlockType",
+            block.header.block_type.into(),
+        )));
+    }
     Self::decode_range_tombstones(block)?
 } else {
     Vec::new()
 };
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/table/mod.rs` around lines 570 - 577, The range tombstone loading path doesn't validate the loaded block type; when regions.range_tombstones is Some you call Block::from_file(..., crate::CompressionType::None) and then Self::decode_range_tombstones(block) without checking block.block_type. Update the code to verify the loaded block's type equals BlockType::RangeTombstone (or return an error if not) before calling Self::decode_range_tombstones; reference the regions.range_tombstones branch, the Block::from_file call, crate::CompressionType::None, and Self::decode_range_tombstones to locate and modify the logic.
620-660: Consider validating the tombstone invariant `start < end` during decoding.

`RangeTombstone::new` has a `debug_assert!(start < end)` that only fires in debug builds. Corrupted on-disk data could produce invalid tombstones with `start >= end` in release builds, potentially causing incorrect behavior in the interval tree or filtering logic.

🛡️ Proposed validation
```diff
 tombstones.push(RangeTombstone::new(
     UserKey::from(start_buf),
     UserKey::from(end_buf),
     seqno,
 ));
+// Validate tombstone invariant
+if let Some(rt) = tombstones.last() {
+    if rt.start >= rt.end {
+        return Err(crate::Error::Unrecoverable);
+    }
+}
```

Or validate before constructing:
```diff
+let start_key = UserKey::from(start_buf);
+let end_key = UserKey::from(end_buf);
+if start_key >= end_key {
+    return Err(crate::Error::Unrecoverable);
+}
 tombstones.push(RangeTombstone::new(
-    UserKey::from(start_buf),
-    UserKey::from(end_buf),
+    start_key,
+    end_key,
     seqno,
 ));
```
Verify each finding against the current code and only fix it if needed. In `@src/table/mod.rs` around lines 620 - 660, The decoder (decode_range_tombstones) must validate the invariant start < end after reading start_buf and end_buf and before calling RangeTombstone::new; if the condition fails, return an appropriate crate::Error (e.g., a corruption/unrecoverable error) instead of constructing the tombstone so release builds don't accept invalid on-disk data; update the logic around UserKey::from(start_buf)/UserKey::from(end_buf) to compare the keys and early-return Err(...) when start >= end.
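A hedged sketch of the validation this prompt asks for, written against the wire format a later test comment in this thread spells out ([start_len][start][end_len][end][seqno], with little-endian lengths and seqno). The `DecodeError` enum and `decode_one` helper are invented for illustration; the crate's real decoder and error types differ.

```rust
use std::io::{Cursor, Read};

// Illustrative error type; the real crate uses its own Error enum.
#[derive(Debug)]
pub enum DecodeError {
    Io,
    InvalidInterval,
}

fn read_u16_le(c: &mut Cursor<&[u8]>) -> Result<u16, DecodeError> {
    let mut b = [0u8; 2];
    c.read_exact(&mut b).map_err(|_| DecodeError::Io)?;
    Ok(u16::from_le_bytes(b))
}

fn read_u64_le(c: &mut Cursor<&[u8]>) -> Result<u64, DecodeError> {
    let mut b = [0u8; 8];
    c.read_exact(&mut b).map_err(|_| DecodeError::Io)?;
    Ok(u64::from_le_bytes(b))
}

/// Decode one tombstone record, rejecting corrupted intervals in
/// release builds too (not just via debug_assert).
pub fn decode_one(c: &mut Cursor<&[u8]>) -> Result<(Vec<u8>, Vec<u8>, u64), DecodeError> {
    let start_len = read_u16_le(c)? as usize;
    let mut start = vec![0u8; start_len];
    c.read_exact(&mut start).map_err(|_| DecodeError::Io)?;

    let end_len = read_u16_le(c)? as usize;
    let mut end = vec![0u8; end_len];
    c.read_exact(&mut end).map_err(|_| DecodeError::Io)?;

    let seqno = read_u64_le(c)?;

    // Enforce the invariant before constructing the tombstone.
    if start >= end {
        return Err(DecodeError::InvalidInterval);
    }
    Ok((start, end, seqno))
}
```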
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In @.forge/16/analysis.md:
- Around line 85-95: MD040 is triggered because the two fenced diff-stat blocks
in .forge/16/analysis.md lack a language tag; update the opening fences for the
two blocks that start with "origin/main vs upstream/main:" and "origin/main vs
upstream-pr-242:" to use a language tag (e.g., change ``` to ```text) so the
blocks become ```text; keep the content of the blocks unchanged.
In `@src/abstract_tree.rs`:
- Around line 722-743: The current remove_prefix implementation incorrectly
treats Bound::Unbounded by synthesizing a truncated exclusive end via
range_tombstone::upper_bound_exclusive(prefix), which doesn't form a true
lexicographic upper bound (e.g., b"\xff" still leaves b"\xff\x01"); fix by
adding an explicit unbounded-path instead of synthesizing an end: when
prefix_to_range yields hi == Bound::Unbounded, call a true unbounded-range
removal API (e.g., implement/use self.remove_range_unbounded(start, seqno) or
self.remove_range_from(start, seqno)) to remove [start, +∞), or if an unbounded
removal is not supported, explicitly reject the operation (return 0 or propagate
an error) rather than creating a truncated tombstone; update remove_prefix, and
add/modify remove_range* helper(s) accordingly (referencing remove_prefix,
prefix_to_range, range_tombstone::upper_bound_exclusive, and remove_range).
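The edge case behind this finding is that a prefix ending in 0xFF has no finite exclusive upper bound of the same length: truncating-and-incrementing b"\xff" leaves keys like b"\xff\x01" uncovered. A hedged sketch of a correct helper (the name is an assumption, not the crate's `upper_bound_exclusive`):

```rust
// Illustrative helper: the smallest key strictly greater than every key
// carrying `prefix`. Returns None when the prefix is empty or all 0xFF,
// where no finite exclusive bound exists and the caller must fall back
// to an unbounded range (or reject the operation, returning 0).
pub fn prefix_upper_bound(prefix: &[u8]) -> Option<Vec<u8>> {
    let mut out = prefix.to_vec();
    // Drop trailing 0xFF bytes, then increment the last remaining byte.
    while let Some(&last) = out.last() {
        if last == 0xFF {
            out.pop();
        } else {
            *out.last_mut().unwrap() = last + 1;
            return Some(out);
        }
    }
    None // prefix was empty or entirely 0xFF
}
```

Treating the `None` case explicitly (unbounded removal or rejection) is exactly what the prompt above asks for instead of synthesizing a truncated tombstone.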
In `@src/active_tombstone_set.rs`:
- Around line 110-123: The initialize_from method currently activates every
provided RangeTombstone via activate(&rt) even if it starts in the future
relative to the iterator's current_key; change initialize_from to only activate
truly overlapping tombstones by either (A) adding a current_key parameter (e.g.,
initialize_from(&mut self, current_key: &KeyType, tombstones: impl
IntoIterator<Item = RangeTombstone>)) and filtering each rt with rt.start <=
*current_key && *current_key < rt.end before calling activate, or (B) make
initialize_from private/internal and require callers to pre-filter by that same
overlap predicate; apply the same fix to the analogous method/block at the
second occurrence (the code referenced around lines 226–230) and ensure
expire_until usage still enforces the end bound after activation.
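Option (A) from this comment can be sketched as below. The types are invented for illustration; the real `ActiveTombstoneSet` tracks considerably more state.

```rust
// Minimal stand-in types; names and fields are assumptions.
#[derive(Clone)]
struct RangeTombstone {
    start: Vec<u8>, // inclusive
    end: Vec<u8>,   // exclusive
    seqno: u64,
}

#[derive(Default)]
struct ActiveTombstoneSet {
    active: Vec<RangeTombstone>,
}

impl ActiveTombstoneSet {
    fn activate(&mut self, rt: &RangeTombstone) {
        self.active.push(rt.clone());
    }

    /// Seed the set when iteration starts mid-range: only tombstones
    /// whose interval actually contains the current key get activated,
    /// so a tombstone that starts in the future cannot suppress keys
    /// before its start bound.
    fn initialize_from<I>(&mut self, current_key: &[u8], tombstones: I)
    where
        I: IntoIterator<Item = RangeTombstone>,
    {
        for rt in tombstones {
            if rt.start.as_slice() <= current_key && current_key < rt.end.as_slice() {
                self.activate(&rt);
            }
        }
    }
}
```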
In `@src/memtable/interval_tree.rs`:
- Around line 266-269: The insert method incorrectly increments self.len
unconditionally; change insert_node (the helper called by insert) to return
whether it actually created a new node (e.g., return a (node, bool) or a Node
plus created flag) when handling RangeTombstone duplicates, then update insert
to set self.root from the returned node and only increment self.len when the
created flag is true; update any calling sites or signatures for insert_node and
keep references to RangeTombstone, insert_node, and insert consistent.
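The len-accounting pattern this prompt asks for can be shown with a simplified map-backed store (the real structure is an AVL interval tree; these names are illustrative):

```rust
use std::collections::BTreeMap;

// Simplified stand-in for the interval tree, keyed by (start, end, seqno).
// The point is the accounting fix: the inner insert reports whether a new
// node was actually created, and len only grows in that case.
#[derive(Default)]
struct TombstoneStore {
    nodes: BTreeMap<(Vec<u8>, Vec<u8>, u64), ()>,
    len: usize,
}

impl TombstoneStore {
    fn insert(&mut self, start: Vec<u8>, end: Vec<u8>, seqno: u64) {
        let created = self.nodes.insert((start, end, seqno), ()).is_none();
        if created {
            self.len += 1; // duplicates no longer inflate len
        }
    }
}
```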
In `@src/memtable/mod.rs`:
- Around line 31-36: Memtables that only contain range tombstones are treated as
empty because Memtable::is_empty() only checks items; update logic so
Memtable::is_empty() also inspects range_tombstones by locking the Mutex and
calling the interval tree's is_empty (e.g., check
self.range_tombstones.lock().is_empty()), and propagate this change to callers
(Tree::rotate_memtable(), Tree::clear_active_memtable()) so they seal/flush
memtables and compute get_highest_seqno() correctly when only tombstones exist;
ensure get_highest_seqno() and any durability paths (remove_range/remove_prefix)
treat a memtable with non-empty range_tombstones as non-empty.
- Around line 183-190: The insert_range_tombstone method must reject empty or
reversed intervals before persisting: in insert_range_tombstone (and before
calling RangeTombstone::new or inserting into self.range_tombstones), validate
that start < end (using UserKey ordering) and if not, return 0 (do not
compute/record the tombstone). Move or gate the size calculation and the
lock+insert behind this check so reversed/empty ranges are never stored; keep
the rest of the method behavior unchanged.
In `@src/range_tombstone_filter.rs`:
- Around line 39-52: The constructor new() should ensure fwd_tombstones are
ordered by start ascending (and tie-break by seqno descending) before storing,
because fwd_activate_up_to() relies on forward tombstones being sorted by start;
add a sort on fwd_tombstones in new() such as sorting by (start asc, seqno desc)
prior to assigning Self.fwd_tombstones so out-of-order input Vecs can't skip
activation.
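The ordering the comment asks `RangeTombstoneFilter::new` to enforce — start ascending, ties broken by seqno descending — might look like this (names assumed):

```rust
#[derive(Debug)]
struct RangeTombstone {
    start: Vec<u8>,
    end: Vec<u8>,
    seqno: u64,
}

// Sort forward tombstones so the sweep activates them in start order and,
// for equal starts, sees the newest tombstone first.
fn sort_forward(tombstones: &mut Vec<RangeTombstone>) {
    tombstones.sort_by(|a, b| {
        a.start
            .cmp(&b.start)
            .then_with(|| b.seqno.cmp(&a.seqno)) // reversed: seqno descending
    });
}
```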
In `@tests/range_tombstone.rs`:
- Around line 330-352: Add a new test that writes a range tombstone then tampers
with the SST on-disk range_tombstones block and asserts recovery fails: use
get_tmp_folder(), open_tree(), remove_range(), and flush_active_memtable(0) to
create an SST with the new range tombstone format
([start_len][start][end_len][end][seqno]), then locate the produced SST file in
the folder, corrupt/truncate bytes from the range_tombstones payload (e.g.
remove bytes from the end of the block or zero out length fields), attempt to
reopen the tree with open_tree(folder.path()) and assert that reopening returns
an Err (i.e. recovery fails) rather than succeeding; keep the test focused on
both truncated and malformed length scenarios so it fails on invalid RT
encoding.
- Around line 168-176: The test currently relies on behavior that could pass
even if the tombstone never left the active memtable; modify the test around the
remove_range/rotate_memtable sequence (the block using tree.remove_range,
tree.rotate_memtable, tree.insert, and tree.get) to explicitly assert that the
tombstone was actually sealed/flushed by checking rotate_memtable() returned
Some (or that tree.sealed_memtable_count() > 0) immediately after rotation, and
similarly for the later analogous block (lines ~367-374) assert table_count()
increases or reopen the DB to force RT-only flush visibility; update the
assertions in both places so the test fails if the tombstone stayed in the
active memtable.
---
Nitpick comments:
In `@src/table/mod.rs`:
- Around line 570-577: The range tombstone loading path doesn't validate the
loaded block type; when regions.range_tombstones is Some you call
Block::from_file(..., crate::CompressionType::None) and then
Self::decode_range_tombstones(block) without checking block.block_type. Update
the code to verify the loaded block's type equals BlockType::RangeTombstone (or
return an error if not) before calling Self::decode_range_tombstones; reference
the regions.range_tombstones branch, the Block::from_file call,
crate::CompressionType::None, and Self::decode_range_tombstones to locate and
modify the logic.
- Around line 620-660: The decoder (decode_range_tombstones) must validate the
invariant start < end after reading start_buf and end_buf and before calling
RangeTombstone::new; if the condition fails, return an appropriate crate::Error
(e.g., a corruption/unrecoverable error) instead of constructing the tombstone
so release builds don't accept invalid on-disk data; update the logic around
UserKey::from(start_buf)/UserKey::from(end_buf) to compare the keys and
early-return Err(...) when start >= end.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 0a4b651a-d997-4ee5-9e1a-ba1e0f088648
📒 Files selected for processing (21)
- .forge/16/analysis.md
- .forge/16/known-gaps.md
- src/abstract_tree.rs
- src/active_tombstone_set.rs
- src/blob_tree/mod.rs
- src/compaction/flavour.rs
- src/compaction/worker.rs
- src/lib.rs
- src/memtable/interval_tree.rs
- src/memtable/mod.rs
- src/range.rs
- src/range_tombstone.rs
- src/range_tombstone_filter.rs
- src/table/block/type.rs
- src/table/inner.rs
- src/table/mod.rs
- src/table/multi_writer.rs
- src/table/regions.rs
- src/table/writer/mod.rs
- src/tree/mod.rs
- tests/range_tombstone.rs
- Fix non-exhaustive BlockType match in table/util.rs (metrics feature)
- Add #[allow(dead_code)] for future-use API (CoveringRt, initialize_from)
- Add #[allow(clippy::unnecessary_box_returns)] for AVL tree functions
- Fix Option<&T> vs &Option<T> in interval_tree traversal functions
- Add missing # Panics and # Errors doc sections
- Add backticks in doc comments for identifiers
- Use let...else patterns where clippy suggests
- Fix redundant clone, pass-by-value, pub(crate) visibility
- Remove unfulfilled lint expectations (cfg-gated params)

- Validate start < end on SST decode (reject corrupted RT blocks)
- Validate start >= end in insert_range_tombstone (release-safe guard)
- Fix duplicate insert len accounting in IntervalTree
- Fix 0xFF prefix edge case in remove_prefix (return 0 for unrepresentable)
- Re-insert tombstones into active memtable when RT-only flush produces no valid SST (no index possible without KV items)
- Account for range tombstones in Memtable::is_empty
- Add cheap pre-check before collecting range tombstones in iteration
- Sort fwd_tombstones inside RangeTombstoneFilter::new for safety
- Harden test assertions (sealed count, table count)
Verify loaded block has BlockType::RangeTombstone before decoding, consistent with filter block validation pattern.
Actionable comments posted: 1
🧹 Nitpick comments (4)
src/table/iter.rs (1)
122-122: Prefer `#[expect(clippy::too_many_arguments)]` over `#[allow(...)]`.

Per coding guidelines, `#[expect]` should be used instead of `#[allow]` because it warns when the suppression becomes unnecessary (e.g., if future refactoring reduces the argument count).

♻️ Suggested fix

```diff
-    #[allow(clippy::too_many_arguments)]
+    #[expect(clippy::too_many_arguments)]
```

As per coding guidelines: "Prefer `#[expect(lint)]` over `#[allow(lint)]` — `#[expect]` warns when suppression becomes unnecessary"

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/table/iter.rs` at line 122, Replace the #[allow(clippy::too_many_arguments)] attribute on the function/impl annotated in this diff with #[expect(clippy::too_many_arguments)] so the suppression still applies but will warn if it becomes unnecessary; locate the exact attribute instance (the attribute decorating the function that currently uses #[allow(clippy::too_many_arguments)]) and change the attribute name only, leaving the lint and surrounding code unchanged.

src/abstract_tree.rs (1)
239-245: Consider `#[expect]` instead of `#[warn]` for lint attribute.

The `#[warn(clippy::type_complexity)]` attribute is unusual here — it would emit warnings rather than suppress them. If the intent is to suppress the lint, use `#[expect(clippy::type_complexity)]` or `#[allow(clippy::type_complexity)]`.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/abstract_tree.rs` around lines 239 - 245, The attribute on function flush_to_tables uses #[warn(clippy::type_complexity)] which will emit warnings instead of suppressing them; change it to #[expect(clippy::type_complexity)] (or #[allow(clippy::type_complexity)] if you prefer permanent suppression) placed on the flush_to_tables function declaration so the clippy type_complexity lint is treated as expected/allowed for the flush_to_tables / flush_to_tables_with_rt code path.

src/range.rs (1)
101-101: Prefer `#[expect(clippy::too_many_lines)]` over `#[allow(...)]`.

Per coding guidelines, use `#[expect]` so you get a warning when the suppression becomes unnecessary.

♻️ Suggested fix

```diff
-    #[allow(clippy::too_many_lines)]
+    #[expect(clippy::too_many_lines)]
```

As per coding guidelines: "Prefer `#[expect(lint)]` over `#[allow(lint)]` — `#[expect]` warns when suppression becomes unnecessary"

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/range.rs` at line 101, Replace the attribute #[allow(clippy::too_many_lines)] with #[expect(clippy::too_many_lines)] to follow the coding guideline; locate the attribute (attached to the function/module declared near the top of src/range.rs) and change the lint suppression to an expectation so you receive a warning if the suppression becomes unnecessary.

src/tree/mod.rs (1)
778-816: Performance concern: linear scan of all SST tables for range tombstone suppression.

The current implementation iterates over every table in every level (lines 802-813) to check range tombstones. For point reads, this could be expensive with many tables.

Consider filtering tables by key range before checking their tombstones, similar to how `get_for_key` filters tables in `get_internal_entry_from_tables`. The PR notes this as a known limitation ("linear SST scan for point-read suppression").

💡 Potential optimization sketch
```diff
-    for table in super_version
-        .version
-        .iter_levels()
-        .flat_map(|lvl| lvl.iter())
-        .flat_map(|run| run.iter())
-    {
+    // Only check tables whose key range contains the key
+    for table in super_version
+        .version
+        .iter_levels()
+        .flat_map(|lvl| lvl.iter())
+        .filter_map(|run| run.get_for_key(key))
+    {
         for rt in table.range_tombstones() {
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/tree/mod.rs` around lines 778 - 816, The is_suppressed_by_range_tombstones function currently scans every SST table; limit that by first filtering tables whose key range could contain the target key (like get_for_key / get_internal_entry_from_tables does) before iterating range_tombstones(): for each table from super_version.version.iter_levels().flat_map(...).flat_map(...), check the table's key range (e.g., smallest/largest key or a key_range() accessor on the Table) and skip tables where key is outside that range, then only call table.range_tombstones() and rt.should_suppress; keep the existing active_memtable/sealed_memtables checks and ensure you use the same table-range helpers used elsewhere to avoid duplicating range logic.
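The pruning idea behind this finding can be sketched independently of the crate's types. The `Table` shape below is invented for illustration; the real crate stores a `KeyRange` per table and also has run-level lookup helpers.

```rust
// Hypothetical table descriptor for illustration only.
struct Table {
    key_range: (Vec<u8>, Vec<u8>), // inclusive min/max user key
    range_tombstone_count: usize,
}

impl Table {
    fn may_contain_key(&self, key: &[u8]) -> bool {
        self.key_range.0.as_slice() <= key && key <= self.key_range.1.as_slice()
    }
}

/// Only consult a table's range tombstones when it has any and its key
/// range can contain the probed key, instead of scanning every table.
fn tables_to_check<'a>(tables: &'a [Table], key: &'a [u8]) -> impl Iterator<Item = &'a Table> {
    tables
        .iter()
        .filter(move |t| t.range_tombstone_count > 0 && t.may_contain_key(key))
}
```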
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/table/util.rs`:
- Line 31: The attribute #[allow(clippy::too_many_arguments)] in
src/table/util.rs should be changed to #[expect(clippy::too_many_arguments)] to
comply with the repo's policy; locate the function or item annotated with
#[allow(clippy::too_many_arguments)] (the attribute on the item in util.rs) and
replace the allow attribute with #[expect(clippy::too_many_arguments)]
preserving any existing whitespace/comments and ensuring the attribute exactly
matches the expect form.
---
Nitpick comments:
In `@src/abstract_tree.rs`:
- Around line 239-245: The attribute on function flush_to_tables uses
#[warn(clippy::type_complexity)] which will emit warnings instead of suppressing
them; change it to #[expect(clippy::type_complexity)] (or
#[allow(clippy::type_complexity)] if you prefer permanent suppression) placed on
the flush_to_tables function declaration so the clippy type_complexity lint is
treated as expected/allowed for the flush_to_tables / flush_to_tables_with_rt
code path.
In `@src/range.rs`:
- Line 101: Replace the attribute #[allow(clippy::too_many_lines)] with
#[expect(clippy::too_many_lines)] to follow the coding guideline; locate the
attribute (attached to the function/module declared near the top of
src/range.rs) and change the lint suppression to an expectation so you receive a
warning if the suppression becomes unnecessary.
In `@src/table/iter.rs`:
- Line 122: Replace the #[allow(clippy::too_many_arguments)] attribute on the
function/impl annotated in this diff with #[expect(clippy::too_many_arguments)]
so the suppression still applies but will warn if it becomes unnecessary; locate
the exact attribute instance (the attribute decorating the function that
currently uses #[allow(clippy::too_many_arguments)]) and change the attribute
name only, leaving the lint and surrounding code unchanged.
In `@src/tree/mod.rs`:
- Around line 778-816: The is_suppressed_by_range_tombstones function currently
scans every SST table; limit that by first filtering tables whose key range
could contain the target key (like get_for_key / get_internal_entry_from_tables
does) before iterating range_tombstones(): for each table from
super_version.version.iter_levels().flat_map(...).flat_map(...), check the
table's key range (e.g., smallest/largest key or a key_range() accessor on the
Table) and skip tables where key is outside that range, then only call
table.range_tombstones() and rt.should_suppress; keep the existing
active_memtable/sealed_memtables checks and ensure you use the same table-range
helpers used elsewhere to avoid duplicating range logic.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: b5037095-c6a0-47c3-a880-07c3d65eaa33
📒 Files selected for processing (14)
- .forge/16/analysis.md
- src/abstract_tree.rs
- src/active_tombstone_set.rs
- src/memtable/interval_tree.rs
- src/memtable/mod.rs
- src/range.rs
- src/range_tombstone.rs
- src/range_tombstone_filter.rs
- src/table/iter.rs
- src/table/mod.rs
- src/table/util.rs
- src/table/writer/mod.rs
- src/tree/mod.rs
- tests/range_tombstone.rs
🚧 Files skipped from review as they are similar to previous changes (1)
- src/range_tombstone.rs
Pull request overview
Adds MVCC-aware range tombstones to the LSM-tree, enabling efficient contiguous deletions via remove_range / remove_prefix, and persists/propagates tombstones through flush + compaction + recovery.
Changes:
- Introduces core range-tombstone data structures (`RangeTombstone`, memtable `IntervalTree`, sweep-line `ActiveTombstoneSet`, and `RangeTombstoneFilter`).
- Extends read/write paths to apply range-tombstone suppression (point reads, iterators, flush, compaction, and SST recovery).
- Adds integration + regression tests for correctness across memtables/SSTs/compaction/recovery.
Reviewed changes
Copilot reviewed 23 out of 23 changed files in this pull request and generated 8 comments.
Show a summary per file
| File | Description |
|---|---|
| tests/range_tombstone.rs | Integration tests covering range tombstone behavior across layers and lifecycle events. |
| src/abstract_tree.rs | Adds remove_range API and default remove_prefix; extends flush to pass RTs through. |
| src/tree/mod.rs | Implements remove_range; applies RT suppression on point reads; flush now supports RT distribution. |
| src/blob_tree/mod.rs | Forwards range deletes through BlobTree to underlying index tree; flush now supports RT distribution. |
| src/memtable/mod.rs | Stores RTs in a mutex-protected interval tree; updates emptiness + adds RT query/insert APIs. |
| src/memtable/interval_tree.rs | New AVL interval tree for efficient RT overlap/suppression queries in memtables. |
| src/active_tombstone_set.rs | New forward/reverse active RT tracking for sweep-line iteration filtering. |
| src/range_tombstone.rs | Defines the RT type, ordering, and helper utilities (intersection, coverage checks). |
| src/range_tombstone_filter.rs | New bidirectional iterator wrapper that suppresses KV entries covered by RTs. |
| src/range.rs | Collects RTs from all layers; wraps merged iterator with RangeTombstoneFilter; adds table-skip optimization. |
| src/table/block/type.rs | Adds new BlockType::RangeTombstone = 4. |
| src/table/regions.rs | Extends table TOC parsing to include optional range_tombstones region. |
| src/table/inner.rs | Stores recovered table RTs in Inner. |
| src/table/mod.rs | Loads/decodes RT blocks on table open; exposes range_tombstones() accessor. |
| src/table/writer/mod.rs | Adds RT collection and writes RT block + metadata count during table finish. |
| src/table/multi_writer.rs | Distributes/clips RTs across rotated output tables (flush/compaction output). |
| src/table/util.rs | Treats RT blocks like data/meta blocks for metrics counters; changes lint suppression. |
| src/table/iter.rs | Changes lint suppression for constructor with many args. |
| src/compaction/flavour.rs | Adds write_range_tombstones hook so compaction can propagate RTs to outputs. |
| src/compaction/worker.rs | Collects RTs from input tables and propagates them to compaction outputs (with GC at last level). |
| src/lib.rs | Exposes new internal modules for RT support. |
| .forge/16/known-gaps.md | Documents known limitations and optimization opportunities for RT feature. |
| .forge/16/analysis.md | Design/architecture analysis and upstream context for RT implementation choices. |
Actionable comments posted: 1
🧹 Nitpick comments (1)
src/table/mod.rs (1)
648-666: Consider preserving error context for easier debugging.

The `map_err(|_| crate::Error::Unrecoverable)` discards the original I/O error. While `Unrecoverable` is semantically appropriate for corrupted data, preserving context (e.g., which field failed, byte offset) could aid debugging in production.

💡 Optional: add context to decode errors

If a more descriptive error variant exists or is added later (e.g., `Error::RangeTombstoneCorrupted { offset, reason }`), consider using it here to preserve the cursor position and failure reason.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/table/mod.rs` around lines 648 - 666, The current reads (cursor.read_u16::<LE>(), cursor.read_exact(...), cursor.read_u64::<LE>()) map all IO errors to crate::Error::Unrecoverable, discarding the original error and offset; update these map_err calls to capture and wrap the original error and context (e.g., which field: "start_len"/"start_buf"/"end_len"/"end_buf"/"seqno" and the cursor position/byte offset) into a more informative Error variant or into Unrecoverable with a message, so failures when reading start_len, start_buf, end_len, end_buf, and seqno preserve the underlying io::Error and an indication of the failing field/offset for easier debugging.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/table/mod.rs`:
- Around line 671-675: Add a unit test named
decode_range_tombstones_invalid_interval_returns_error() that builds a
serialized range tombstone blob matching the production format (little-endian
lengths and seqno fields), intentionally sets the interval so start >= end,
calls decode_range_tombstones(...) and asserts it returns
Err(crate::Error::Unrecoverable); reference the decode_range_tombstones function
and use the same byteorder::<LE> encoding for length/seqno fields so the
deserializer exercises the start >= end validation path.
---
Nitpick comments:
In `@src/table/mod.rs`:
- Around line 648-666: The current reads (cursor.read_u16::<LE>(),
cursor.read_exact(...), cursor.read_u64::<LE>()) map all IO errors to
crate::Error::Unrecoverable, discarding the original error and offset; update
these map_err calls to capture and wrap the original error and context (e.g.,
which field: "start_len"/"start_buf"/"end_len"/"end_buf"/"seqno" and the cursor
position/byte offset) into a more informative Error variant or into
Unrecoverable with a message, so failures when reading start_len, start_buf,
end_len, end_buf, and seqno preserve the underlying io::Error and an indication
of the failing field/offset for easier debugging.
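The review prompts above describe a per-tombstone wire layout of `[start_len: u16 LE][start bytes][end_len: u16 LE][end bytes][seqno: u64 LE]`, with length validation before allocation, per-field error context, and a `start < end` check. A minimal sketch under those assumptions (the struct, function, and error shapes here are illustrative, not the crate's actual API, which uses `byteorder` and `crate::Error`):

```rust
#[derive(Debug, PartialEq)]
pub struct RangeTombstone {
    pub start: Vec<u8>,
    pub end: Vec<u8>, // exclusive upper bound (half-open interval)
    pub seqno: u64,
}

fn read_u16_le(buf: &[u8], pos: &mut usize) -> Option<u16> {
    let bytes = buf.get(*pos..*pos + 2)?;
    *pos += 2;
    Some(u16::from_le_bytes([bytes[0], bytes[1]]))
}

fn read_u64_le(buf: &[u8], pos: &mut usize) -> Option<u64> {
    let bytes = buf.get(*pos..*pos + 8)?;
    *pos += 8;
    Some(u64::from_le_bytes(bytes.try_into().ok()?))
}

/// Decodes `count` tombstones, validating each length against the remaining
/// bytes before allocating, and rejecting `start >= end` intervals.
pub fn decode_range_tombstones(buf: &[u8], count: usize) -> Result<Vec<RangeTombstone>, String> {
    let mut pos = 0;
    let mut out = Vec::with_capacity(count);
    for i in 0..count {
        // Each failed read reports the field name, tombstone index, and offset.
        let start_len = usize::from(
            read_u16_le(buf, &mut pos).ok_or_else(|| format!("start_len of rt {i} at offset {pos}"))?,
        );
        let start = buf
            .get(pos..pos + start_len)
            .ok_or_else(|| format!("start_buf of rt {i} at offset {pos}"))?
            .to_vec();
        pos += start_len;
        let end_len = usize::from(
            read_u16_le(buf, &mut pos).ok_or_else(|| format!("end_len of rt {i} at offset {pos}"))?,
        );
        let end = buf
            .get(pos..pos + end_len)
            .ok_or_else(|| format!("end_buf of rt {i} at offset {pos}"))?
            .to_vec();
        pos += end_len;
        let seqno =
            read_u64_le(buf, &mut pos).ok_or_else(|| format!("seqno of rt {i} at offset {pos}"))?;
        // The deserializer enforces the interval invariant start < end.
        if start >= end {
            return Err(format!("invalid interval (start >= end) in rt {i}"));
        }
        out.push(RangeTombstone { start, end, seqno });
    }
    Ok(out)
}
```

Because every read goes through `buf.get(..)`, a truncated or corrupted block fails with a field/offset message instead of allocating an oversized buffer.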
- Fix visible_at to use exclusive boundary (seqno < read_seqno), consistent with seqno_filter and Memtable::get conventions - Validate decoded field lengths against remaining block data before allocating buffers (prevent OOM on corrupted blocks) - Skip SST tables by key_range in point read RT suppression check - Add code comments explaining #[allow] vs #[expect] for cfg-gated params - Use #[expect] with reason for too_many_lines on create_range - Fix remove_range doc to match no-op behavior for invalid intervals - Update PR body to reflect RT-only flush semantics
Actionable comments posted: 1
♻️ Duplicate comments (1)
src/table/util.rs (1)
31-32: ⚠️ Potential issue | 🟡 Minor

Use `cfg_attr(..., expect(...))` instead of a permanent `#[allow]`.

The feature-gated parameter count is exactly what `cfg_attr(feature = "metrics", expect(...))` is for. Leaving `#[allow]` here suppresses the lint in the one build where it should still fire.

Suggested patch

```diff
-// NOTE: #[allow] not #[expect] because arg count changes with cfg(feature = "metrics")
-#[allow(clippy::too_many_arguments)]
+#[cfg_attr(
+    feature = "metrics",
+    expect(
+        clippy::too_many_arguments,
+        reason = "metrics adds the extra parameter; without that feature this stays at the lint threshold"
+    )
+)]
 pub fn load_block(
```

As per coding guidelines, "Prefer `#[expect(lint)]` over `#[allow(lint)]` — `#[expect]` warns when suppression becomes unnecessary."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/table/util.rs` around lines 31 - 32, Replace the permanent attribute #[allow(clippy::too_many_arguments)] with a conditional expect so the lint still fires when the metrics feature is disabled; specifically change the attribute to #[cfg_attr(feature = "metrics", expect(clippy::too_many_arguments))] (or the inverse depending on feature semantics) at the same item (the attribute applied in src/table/util.rs) so the suppression only applies under the cfg where extra arguments exist and the lint will warn in builds where it should.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/range.rs`:
- Around line 217-226: The table-cover check uses table.metadata.key_range.max()
which is an exclusive upper bound, but RangeTombstone::fully_covers() expects an
inclusive max; convert the exclusive upper bound to an inclusive one before
calling fully_covers (e.g., compute an inclusive_max from
table.metadata.key_range.max() using the appropriate predecessor/prev-key API or
method on the key type, handle empty ranges by skipping the check if min >
inclusive_max), and then call rt.fully_covers(table.metadata.key_range.min(),
inclusive_max) inside the all_range_tombstones.iter().any(...) closure while
keeping the visible_at and seqno comparison logic unchanged.
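The boundary mismatch flagged above can be made concrete: a half-open tombstone `[start, end)` fully covers a table's inclusive key range `[min, max]` only when `max < end`, while comparing an exclusive upper bound with `<=` would wrongly report coverage at the boundary. A sketch of the check (function and parameter names are assumptions for illustration):

```rust
/// Does the half-open tombstone [rt_start, rt_end) cover every key in the
/// inclusive table range [table_min, table_max_inclusive]?
pub fn fully_covers(
    rt_start: &[u8],
    rt_end: &[u8],
    table_min: &[u8],
    table_max_inclusive: &[u8],
) -> bool {
    // The inclusive max must be strictly below the exclusive tombstone end.
    rt_start <= table_min && table_max_inclusive < rt_end
}
```

If the stored `key_range.max()` were exclusive, it would first have to be converted to an inclusive bound (a predecessor key) before this check, which is exactly the conversion the prompt asks for.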
---
Duplicate comments:
In `@src/table/util.rs`:
- Around line 31-32: Replace the permanent attribute
#[allow(clippy::too_many_arguments)] with a conditional expect so the lint still
fires when the metrics feature is disabled; specifically change the attribute to
#[cfg_attr(feature = "metrics", expect(clippy::too_many_arguments))] (or the
inverse depending on feature semantics) at the same item (the attribute applied
in src/table/util.rs) so the suppression only applies under the cfg where extra
arguments exist and the lint will warn in builds where it should.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: f4d3251b-ba68-46f2-8418-870e509cdcef
📒 Files selected for processing (7)
src/abstract_tree.rs, src/range.rs, src/range_tombstone.rs, src/table/iter.rs, src/table/mod.rs, src/table/util.rs, src/tree/mod.rs
🚧 Files skipped from review as they are similar to previous changes (1)
- src/table/iter.rs
Replace #[allow(clippy::too_many_arguments)] with #[cfg_attr(feature = "metrics", expect(...))] so the suppression only activates when the extra metrics parameter is compiled in.
Pull request overview
This PR introduces MVCC-correct range tombstones to support efficient delete_range / delete_prefix semantics across memtables and SSTs, including persistence through flush/compaction/recovery and iterator-time suppression.
Changes:
- Add
- Add `remove_range`/`remove_prefix` to `AbstractTree`, with memtable storage via an AVL interval tree and iterator-time filtering via a sweep-line `RangeTombstoneFilter`.
- Persist range tombstones into SSTs via a new `BlockType::RangeTombstone` and load/validate them on table recovery.
- Propagate/GC range tombstones during compaction and add an integration test suite for end-to-end semantics.
Reviewed changes
Copilot reviewed 23 out of 23 changed files in this pull request and generated 10 comments.
Show a summary per file
| File | Description |
|---|---|
| tests/range_tombstone.rs | Integration tests for point reads, iteration, flush/compaction/recovery, prefix deletes, and RT-only flush behavior |
| src/abstract_tree.rs | Adds public remove_range/remove_prefix API + flush plumbing for range tombstones |
| src/tree/mod.rs | Flush writes RTs to SSTs and point-read path checks suppression by RTs across layers |
| src/blob_tree/mod.rs | RT support for blob-tree flush path + forwards remove_range |
| src/memtable/mod.rs | Stores RTs in a mutex-protected interval tree; exposes query/sorted export helpers |
| src/memtable/interval_tree.rs | New AVL interval tree implementation for efficient overlap/suppression queries |
| src/range.rs | Wraps merged iterators with RangeTombstoneFilter and adds table-skip optimization |
| src/range_tombstone.rs | Defines RangeTombstone semantics, ordering, clipping, and helpers |
| src/range_tombstone_filter.rs | Bidirectional sweep-line iterator filter for RT suppression |
| src/active_tombstone_set.rs | Active set structures used by the iterator filter (forward + reverse) |
| src/table/block/type.rs | Adds new on-disk block tag RangeTombstone = 4 |
| src/table/writer/mod.rs | Writes an RT block into SSTs and records range_tombstone_count in metadata |
| src/table/regions.rs | TOC parsing extended to locate the range_tombstones section |
| src/table/inner.rs | Stores loaded RTs in table inner state |
| src/table/mod.rs | Loads/decodes RT blocks on recovery and exposes range_tombstones() |
| src/table/multi_writer.rs | Attempts to distribute/clamp RTs across rotated output tables |
| src/table/util.rs | Treats RT blocks like data/meta for metrics bookkeeping; lint suppression change |
| src/table/iter.rs | Lint suppression change for constructor arg count |
| src/compaction/worker.rs | Collects RTs from input tables and attempts to propagate/GC them |
| src/compaction/flavour.rs | Adds write_range_tombstones hook and wires it to the table writer |
| src/lib.rs | Exposes new internal modules for range tombstone support |
| .forge/16/known-gaps.md | Documents limitations/known gaps for the feature |
| .forge/16/analysis.md | Architecture analysis and design trade-offs documentation |
🧹 Nitpick comments (2)
src/range.rs (2)
302-307: Strengthen the fast path to skip wrapping when no tombstone is visible.

A non-empty `all_range_tombstones` can still have zero entries visible at `seqno`, so this currently wraps with a no-op filter in those cases.

⚡ Proposed patch

```diff
-    // Fast path: skip filter wrapping when no range tombstones exist
-    if all_range_tombstones.is_empty() {
+    // Fast path: skip wrapping when no tombstone is visible for this read seqno
+    if !all_range_tombstones.iter().any(|rt| rt.visible_at(seqno)) {
         Box::new(iter)
     } else {
         Box::new(RangeTombstoneFilter::new(iter, all_range_tombstones, seqno))
     }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/range.rs` around lines 302 - 307, The fast-path should skip wrapping not only when all_range_tombstones.is_empty() but also when none of the tombstones are visible at the given seqno; update the conditional around Box::new(...) to inspect all_range_tombstones for any entry visible at seqno (e.g. using the visibility predicate/method used by RangeTombstoneFilter or by filtering/map-checking the collection) and only construct RangeTombstoneFilter::new(iter, all_range_tombstones, seqno) if at least one tombstone is visible; otherwise return Box::new(iter) unchanged.
176-196: Remove redundant tombstone sorting in this path.
`rts.sort()` here is duplicate work: `RangeTombstoneFilter::new` already sorts the forward list (and builds a reverse-sorted copy). This adds avoidable `O(n log n)` cost per range open.

♻️ Proposed patch

```diff
     let all_range_tombstones = if rt_count > 0 {
         let mut rts: Vec<RangeTombstone> = Vec::with_capacity(rt_count);
         rts.extend(lock.version.active_memtable.range_tombstones_sorted());
         for mt in lock.version.sealed_memtables.iter() {
             rts.extend(mt.range_tombstones_sorted());
         }
         if let Some((mt, _)) = &lock.ephemeral {
             rts.extend(mt.range_tombstones_sorted());
         }
         for table in lock
             .version
             .version
             .iter_levels()
             .flat_map(|lvl| lvl.iter())
             .flat_map(|run| run.iter())
         {
             rts.extend(table.range_tombstones().iter().cloned());
         }
-        rts.sort();
         rts
     } else {
         Vec::new()
     };
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/range.rs` around lines 176 - 196, The vector `rts` constructed for `all_range_tombstones` is being sorted with `rts.sort()` redundantly; `RangeTombstoneFilter::new` already sorts the forward list and produces the reverse-sorted copy, so remove the `rts.sort()` call in the block that builds `rts` (the code that extends from `lock.version.active_memtable.range_tombstones_sorted()`, sealed memtables, `lock.ephemeral`, and version runs) and leave the unsorted `rts` to be passed into `RangeTombstoneFilter::new`; this avoids the extra O(n log n) work while preserving correctness for `RangeTombstone` handling.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@src/range.rs`:
- Around line 302-307: The fast-path should skip wrapping not only when
all_range_tombstones.is_empty() but also when none of the tombstones are visible
at the given seqno; update the conditional around Box::new(...) to inspect
all_range_tombstones for any entry visible at seqno (e.g. using the visibility
predicate/method used by RangeTombstoneFilter or by filtering/map-checking the
collection) and only construct RangeTombstoneFilter::new(iter,
all_range_tombstones, seqno) if at least one tombstone is visible; otherwise
return Box::new(iter) unchanged.
- Around line 176-196: The vector `rts` constructed for `all_range_tombstones`
is being sorted with `rts.sort()` redundantly; `RangeTombstoneFilter::new`
already sorts the forward list and produces the reverse-sorted copy, so remove
the `rts.sort()` call in the block that builds `rts` (the code that extends from
`lock.version.active_memtable.range_tombstones_sorted()`, sealed memtables,
`lock.ephemeral`, and version runs) and leave the unsorted `rts` to be passed
into `RangeTombstoneFilter::new`; this avoids the extra O(n log n) work while
preserving correctness for `RangeTombstone` handling.
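Both review prompts above hinge on the same predicate: an entry is suppressed only by a tombstone that is visible to the read snapshot (exclusive boundary, `rt.seqno < read_seqno`), was written after the entry, and covers the key in its half-open `[start, end)` interval. A simplified sketch of that predicate, with illustrative names rather than the crate's API:

```rust
pub struct Rt {
    pub start: Vec<u8>,
    pub end: Vec<u8>, // exclusive
    pub seqno: u64,
}

impl Rt {
    /// Exclusive MVCC boundary: a snapshot at `read_seqno` only sees
    /// tombstones written strictly before it.
    pub fn visible_at(&self, read_seqno: u64) -> bool {
        self.seqno < read_seqno
    }
}

/// Is a KV entry (key, key_seqno) hidden by any tombstone for this read?
pub fn suppresses(rts: &[Rt], key: &[u8], key_seqno: u64, read_seqno: u64) -> bool {
    rts.iter().any(|rt| {
        rt.visible_at(read_seqno)
            && rt.seqno > key_seqno            // tombstone is newer than the entry
            && rt.start.as_slice() <= key      // key inside [start, end)
            && key < rt.end.as_slice()
    })
}
```

Under this model, the strengthened fast path is simply `rts.iter().any(|rt| rt.visible_at(read_seqno))` before constructing the filter at all.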
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 36192350-8de8-4946-a95f-d0e3da61329a
📒 Files selected for processing (3)
src/range.rs, src/table/iter.rs, src/table/util.rs
Codecov Report: ❌ Patch coverage is
…unds - Set range tombstones on MultiWriter before KV write loop in flush (Tree, BlobTree) so rotations carry RT metadata in earlier tables - Propagate RTs before merge_iter loop in compaction worker - Assert u16 bound on start/end key lengths in insert_range_tombstone - Use exclusive seqno boundary in visible_at doc strings - Strengthen iteration fast path: check visible_at not just is_empty - Remove redundant sort (RangeTombstoneFilter::new sorts internally)
- Compaction with MultiWriter rotation preserves RTs across tables - Table-skip optimization skips fully-covered tables - BlobTree range tombstone insert, suppress, flush - Invalid interval (start >= end) silently returns 0 - Multiple compaction rounds preserve range tombstones
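The `remove_prefix` helper mentioned throughout maps a prefix delete onto a range delete whose exclusive end bound is the prefix's successor (the smallest byte string strictly greater than every key sharing the prefix). A hedged sketch of that derivation, with an assumed helper name:

```rust
/// Exclusive upper bound for a prefix delete: drop trailing 0xFF bytes,
/// then bump the last remaining byte. An all-0xFF prefix has no finite
/// successor, so None signals "unbounded to the right".
pub fn prefix_upper_bound(prefix: &[u8]) -> Option<Vec<u8>> {
    let mut end = prefix.to_vec();
    while let Some(&last) = end.last() {
        if last == 0xFF {
            end.pop(); // 0xFF cannot be incremented; shorten and retry
        } else {
            *end.last_mut()? += 1; // bump the last non-0xFF byte
            return Some(end);
        }
    }
    None
}
```

With this, a prefix delete of `p` becomes a range tombstone over `[p, prefix_upper_bound(p))`, which covers exactly the keys starting with `p`.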
Pull request overview
Adds MVCC-aware range tombstones to enable efficient range/prefix deletion (remove_range / remove_prefix) with persistence across flush/compaction/recovery, plus iterator- and point-read suppression.
Changes:
- Introduces core range-tombstone data structures (range tombstone model, sweep-line active sets, memtable interval tree, iterator filter).
- Persists range tombstones in SSTs via a new
BlockType::RangeTombstoneand TOC section, propagating them through compaction. - Updates read path (point reads + range/prefix iteration) to suppress covered entries and adds comprehensive tests/docs.
Reviewed changes
Copilot reviewed 23 out of 23 changed files in this pull request and generated 6 comments.
Show a summary per file
| File | Description |
|---|---|
| tests/range_tombstone.rs | New integration test suite for RT semantics across memtable/SST/compaction/recovery. |
| src/abstract_tree.rs | Adds remove_range API + remove_prefix helper; flush now forwards RTs to table writers. |
| src/blob_tree/mod.rs | Wires RT-aware flush through blob tree and forwards remove_range. |
| src/tree/mod.rs | Implements remove_range for Tree, RT-aware flush-to-tables, and point-read suppression checks. |
| src/range.rs | Collects RTs across layers for range/prefix iteration; adds RT filtering and table-skip optimization. |
| src/memtable/mod.rs | Stores memtable RTs (mutex-protected interval tree) and exposes query/flush helpers. |
| src/memtable/interval_tree.rs | New AVL interval tree for RT overlap and suppression queries in memtables. |
| src/active_tombstone_set.rs | New forward/reverse sweep-line active tombstone sets for iteration. |
| src/range_tombstone.rs | Defines RangeTombstone and helpers (visibility, suppression, clipping, coverage). |
| src/range_tombstone_filter.rs | New bidirectional iterator wrapper that suppresses keys covered by visible RTs. |
| src/table/block/type.rs | Adds BlockType::RangeTombstone = 4 to the SST block type enum. |
| src/table/writer/mod.rs | Writes RT blocks + count metadata during SST creation; updates writer state to track RTs. |
| src/table/regions.rs | Parses new "range_tombstones" TOC section into table regions. |
| src/table/mod.rs | Loads/decodes RT blocks on table recovery and exposes Table::range_tombstones(). |
| src/table/inner.rs | Stores recovered table RTs in Table inner state. |
| src/table/multi_writer.rs | Distributes/clips RTs across rotated output tables during flush/compaction. |
| src/table/util.rs | Includes RT blocks in metrics accounting for block loads (cache + IO). |
| src/table/iter.rs | Adjusts clippy lint gating for metrics-conditional argument count. |
| src/compaction/worker.rs | Collects RTs from compaction inputs and propagates them to outputs (with GC eviction at last level). |
| src/compaction/flavour.rs | Extends compaction flavour interface to pass RTs into table writers. |
| src/lib.rs | Exposes new internal modules for RT support. |
| .forge/16/known-gaps.md | Documents known limitations/optimization gaps for Issue #16 implementation. |
| .forge/16/analysis.md | Architecture/design analysis and upstream context for range tombstones. |
- Write synthetic weak tombstone in RT-only tables to produce valid index - Fix interval_tree pruning: use >= for exclusive seqno boundary - Replace #[allow(dead_code)] with #[expect(dead_code, reason)] throughout - Remove key_range filter from SST RT suppression (RT range may exceed table key range)
Pull request overview
Adds MVCC-aware range tombstones to the LSM tree to support efficient delete_range/delete_prefix-style deletes that persist through flush/compaction/recovery and are honored across point reads and iteration.
Changes:
- Introduces core range-tombstone data structures (tombstone model, memtable interval tree, active sweep sets, iterator filter).
- Extends the read/write/format pipeline to persist range tombstones in SSTs and propagate/GC them during compaction.
- Adds public APIs (`remove_range`, `remove_prefix`) and comprehensive integration tests for semantics and persistence.
Reviewed changes
Copilot reviewed 21 out of 22 changed files in this pull request and generated 9 comments.
Show a summary per file
| File | Description |
|---|---|
| tests/range_tombstone.rs | New integration test suite covering MVCC visibility, iteration, flush/compaction/recovery, RT-only flush, GC, and BlobTree. |
| src/abstract_tree.rs | Adds remove_range/remove_prefix API and a flush path that passes range tombstones into table writers. |
| src/active_tombstone_set.rs | Implements forward/reverse active tombstone tracking for iterator-time suppression. |
| src/blob_tree/mod.rs | Wires range tombstones through BlobTree flush and exposes remove_range. |
| src/compaction/flavour.rs | Adds compaction flavour hook to propagate range tombstones into output writers. |
| src/compaction/worker.rs | Collects and propagates input-table range tombstones during compaction, with last-level GC by watermark. |
| src/lib.rs | Exposes new internal modules used by the range tombstone feature. |
| src/memtable/mod.rs | Adds a mutex-protected interval tree for range tombstones and memtable APIs for insertion/query/flush extraction. |
| src/memtable/interval_tree.rs | New AVL interval tree implementation backing memtable range-tombstone queries and table-skip decisions. |
| src/range.rs | Integrates range tombstones into range iteration, including table-skip optimization and filter wrapping. |
| src/range_tombstone.rs | Defines RangeTombstone semantics (half-open intervals, MVCC visibility) plus helpers for clipping/coverage. |
| src/range_tombstone_filter.rs | Adds a bidirectional iterator wrapper that suppresses KVs covered by active range tombstones. |
| src/table/block/type.rs | Adds BlockType::RangeTombstone = 4 for persisted SST blocks. |
| src/table/inner.rs | Stores decoded per-table range tombstones in table inner state. |
| src/table/iter.rs | Adjusts clippy expectation gating for metrics feature builds. |
| src/table/mod.rs | Loads/decodes the SST range tombstone block and exposes Table::range_tombstones(). |
| src/table/multi_writer.rs | Distributes/clips range tombstones across rotated output tables (including RT-only output). |
| src/table/regions.rs | Adds TOC parsing support for the "range_tombstones" section handle. |
| src/table/util.rs | Includes RangeTombstone blocks in metrics accounting and gates clippy expects by feature. |
| src/table/writer/mod.rs | Persists range tombstones in an SST section and handles RT-only tables by emitting a sentinel for index validity. |
| src/tree/mod.rs | Adds remove_range and point-read suppression checks against range tombstones across memtable/sealed/SST. |
| .gitignore | Ignores .forge/ directory. |
- Convert #[allow(clippy::unnecessary_box_returns)] to #[expect] with reasons in interval_tree.rs AVL rotation/balance/insert functions - Convert #[allow(dead_code)] to #[expect] on CoveringRt::covers_table - Remove unfired #[warn] attributes on flush_to_tables, finish - Optimize query_suppression to use early-exit tree traversal instead of collecting into Vec — avoids allocation on the hot read path - Add comment explaining range_tombstones.clone() necessity in flush
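The `query_suppression` optimization in the commit above replaces collect-into-`Vec` with an early-exit traversal. The essence of that query is finding the highest visible tombstone seqno covering a key without allocating; the real interval tree additionally prunes whole subtrees by stored max-endpoints. A linear-scan stand-in that shows just the predicate and the allocation-free accumulation (names and the tuple layout are assumptions):

```rust
/// Highest seqno of any tombstone (start, end, seqno) that covers `key`
/// in [start, end) and is visible at `read_seqno`, or None. Performs no
/// heap allocation on the read path.
pub fn max_covering_seqno(
    rts: &[(Vec<u8>, Vec<u8>, u64)],
    key: &[u8],
    read_seqno: u64,
) -> Option<u64> {
    let mut best: Option<u64> = None;
    for (start, end, seqno) in rts {
        // Visible to this snapshot (exclusive boundary) and covering the key.
        if *seqno < read_seqno && start.as_slice() <= key && key < end.as_slice() {
            best = Some(best.map_or(*seqno, |b| b.max(*seqno)));
        }
    }
    best
}
```

A point read can then compare the result against the entry's own seqno to decide suppression, which is the check the memtable/sealed/SST layers each perform.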
clippy::unnecessary_box_returns doesn't fire on functions that both accept and return Box<Node>, so #[expect] was unfulfilled and failed CI with --all-features -D warnings.
Pull request overview
Adds end-to-end range tombstone (delete-range / delete-prefix) support to the LSM tree, including memtable representation, iterator-time suppression, SST persistence, and compaction propagation.
Changes:
- Introduces core range-tombstone types and sweep-line filtering for forward/reverse iteration.
- Persists range tombstones in SSTs via a dedicated `range_tombstones` TOC section and propagates them through compaction.
- Extends `AbstractTree` with `remove_range`/`remove_prefix` and adds broad integration test coverage.
Reviewed changes
Copilot reviewed 21 out of 21 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
| tests/range_tombstone.rs | New integration tests for suppression, MVCC, persistence, compaction, and prefix deletes |
| src/abstract_tree.rs | Adds remove_range / remove_prefix and flush plumbing for RT persistence |
| src/tree/mod.rs | Implements remove_range and point-read suppression across memtables/SSTs |
| src/blob_tree/mod.rs | Wires range tombstones through BlobTree’s flush path + API |
| src/memtable/mod.rs | Stores/query RTs in memtable via a mutex-protected interval tree |
| src/memtable/interval_tree.rs | New AVL interval tree for RT overlap/suppression queries |
| src/range.rs | Adds RT collection + iteration filter; adds table-skip based on covering RTs |
| src/range_tombstone.rs | Defines RangeTombstone, helpers (intersection/coverage), and upper-bound utility |
| src/range_tombstone_filter.rs | New bidirectional iterator wrapper to suppress covered entries |
| src/active_tombstone_set.rs | New active-set trackers for sweep-line suppression (fwd + rev) |
| src/table/block/type.rs | Adds BlockType::RangeTombstone = 4 |
| src/table/regions.rs | Parses range_tombstones TOC section |
| src/table/inner.rs | Stores decoded per-table RTs |
| src/table/mod.rs | Loads/decodes RT blocks on open; exposes range_tombstones() |
| src/table/writer/mod.rs | Writes RT blocks; supports RT-only table creation via a synthetic sentinel |
| src/table/multi_writer.rs | Distributes/clips RTs across rotated output tables |
| src/table/util.rs | Counts RT blocks in metrics as data/meta blocks |
| src/table/iter.rs | Adjusts clippy::too_many_arguments suppression to only apply under metrics |
| src/compaction/worker.rs | Propagates RTs from input tables into compaction output; evicts below GC watermark at bottom |
| src/compaction/flavour.rs | Adds compaction hook for writing RTs to output |
…rsized keys - `upper_bound_exclusive` now returns `Option<UserKey>`, returning `None` when key length is at u16::MAX (appending 0x00 would overflow the u16 key-length invariant used in on-disk RT block encoding) - Multi-writer clipping handles None by writing tombstones unclipped when the table's last key is at max length, avoiding corrupt RT blocks - RT block serialization uses `u16::try_from(len)` instead of `as u16` cast, returning an error instead of silently truncating oversized keys - Add test for upper_bound_exclusive with max-length key
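The two invariants in this commit can be sketched directly: `upper_bound_exclusive` refuses to append `0x00` to a key already at the `u16::MAX` length cap used by the on-disk encoding, and serialization converts lengths with `try_from` instead of an `as` cast. Function names follow the commit; the exact signatures are assumptions:

```rust
use std::num::TryFromIntError;

/// Smallest key strictly greater than `key` is `key ++ 0x00`; returns None
/// when appending would exceed the u16 key-length invariant of the RT block
/// encoding, so callers fall back to writing the tombstone unclipped.
pub fn upper_bound_exclusive(key: &[u8]) -> Option<Vec<u8>> {
    if key.len() >= u16::MAX as usize {
        return None;
    }
    let mut out = Vec::with_capacity(key.len() + 1);
    out.extend_from_slice(key);
    out.push(0x00);
    Some(out)
}

/// Checked length encoding: errors on oversized keys instead of silently
/// truncating them the way `key.len() as u16` would.
pub fn encode_key_len(key: &[u8]) -> Result<u16, TryFromIntError> {
    u16::try_from(key.len())
}
```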
Pull request overview
Adds native range tombstone support (delete-range / delete-prefix) to the LSM-tree, including MVCC-correct read suppression, persistence in SSTs, and propagation through compaction/GC.
Changes:
- Introduces `RangeTombstone` primitives + in-mem interval tree tracking and iterator-time suppression filter.
- Persists range tombstones in SSTs via a new `BlockType::RangeTombstone` and TOC section, with writer/reader support.
- Wires range tombstones through flush + compaction pipelines, and updates point/range/prefix reads to respect suppression.
Reviewed changes
Copilot reviewed 21 out of 21 changed files in this pull request and generated 6 comments.
Show a summary per file
| File | Description |
|---|---|
| tests/range_tombstone.rs | Integration tests covering MVCC suppression, persistence, compaction propagation, prefix delete, RT-only flush, and table-skip behavior. |
| src/abstract_tree.rs | Extends flush pipeline to pass range tombstones into table creation; adds remove_range API + default remove_prefix. |
| src/tree/mod.rs | Implements remove_range; applies RT suppression to point reads; plumbs RTs into flush-to-SST. |
| src/blob_tree/mod.rs | Plumbs RTs into BlobTree flush; delegates remove_range to underlying index. |
| src/memtable/mod.rs | Adds mutex-protected interval-tree storage for memtable range tombstones; exposes query/iteration helpers. |
| src/memtable/interval_tree.rs | AVL interval tree for efficient O(log n) overlap/suppression checks and sorted extraction for flush. |
| src/range.rs | Wraps range iteration with RangeTombstoneFilter and adds table-skip logic for fully-covered tables. |
| src/range_tombstone_filter.rs | Bidirectional sweep-line filter that suppresses KV items covered by visible range tombstones. |
| src/active_tombstone_set.rs | Active-set trackers for forward/reverse iteration (activate/expire + max-seqno suppression). |
| src/range_tombstone.rs | Core RT type + intersection/coverage helpers and upper_bound_exclusive helper. |
| src/lib.rs | Exposes new internal RT modules. |
| src/table/block/type.rs | Adds BlockType::RangeTombstone = 4. |
| src/table/regions.rs | Adds range_tombstones TOC region parsing. |
| src/table/inner.rs | Stores loaded per-table range tombstones in table inner state. |
| src/table/mod.rs | Loads and decodes RT blocks on table recovery; exposes Table::range_tombstones(). |
| src/table/writer/mod.rs | Writes RT blocks + range_tombstone_count metadata; supports RT-only tables via sentinel KV. |
| src/table/multi_writer.rs | Distributes/clips RTs across rotated output tables during flush/compaction. |
| src/table/util.rs | Extends metrics accounting to include RT blocks as “data/meta-like” blocks. |
| src/table/iter.rs | Adjusts clippy gating for too_many_arguments when metrics feature is enabled. |
| src/compaction/worker.rs | Collects RTs from input tables and propagates them to compaction outputs (with last-level GC eviction). |
| src/compaction/flavour.rs | Adds a write_range_tombstones hook and wires it for standard/relocating compactions. |
Pull request overview
Adds native range tombstone support (delete-range / delete-prefix) to the LSM tree, integrating tombstones into the memtable, iterators, SST persistence, compaction propagation, and recovery.
Changes:
- Add `remove_range`/`remove_prefix` API on `AbstractTree` with memtable storage via an interval tree and read-path suppression.
- Persist range tombstones into SSTs via a dedicated `BlockType::RangeTombstone` TOC section, and load them on recovery.
- Propagate range tombstones through compaction and support forward/reverse iteration filtering with `RangeTombstoneFilter`, plus broad test coverage.
Reviewed changes
Copilot reviewed 21 out of 21 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| tests/range_tombstone.rs | New integration tests covering MVCC visibility, iteration suppression, persistence (flush/compaction/recovery), GC eviction, and BlobTree support. |
| src/abstract_tree.rs | Extends public tree API with remove_range / remove_prefix; flush path collects and passes range tombstones to table writers. |
| src/tree/mod.rs | Implements remove_range for Tree and checks range tombstone suppression in point reads. |
| src/blob_tree/mod.rs | Wires range tombstones through BlobTree flush and delegates remove_range to the index tree. |
| src/memtable/mod.rs | Adds interval-tree-backed range tombstone storage, insertion, query helpers, and adjusts emptiness semantics. |
| src/memtable/interval_tree.rs | Implements an AVL-based interval tree specialized for range tombstone queries and pruning. |
| src/range.rs | Applies range tombstone suppression during range scans via RangeTombstoneFilter and adds table-skip logic. |
| src/range_tombstone_filter.rs | New bidirectional iterator wrapper that suppresses KV entries covered by visible range tombstones. |
| src/active_tombstone_set.rs | New sweep-line “active set” helpers for forward/reverse tombstone activation/expiry during iteration. |
| src/range_tombstone.rs | Defines RangeTombstone, ordering, intersection helpers, and upper_bound_exclusive. |
| src/table/writer/mod.rs | Writes a new RT block/TOC section and adds RT-only table handling via a synthetic sentinel KV. |
| src/table/multi_writer.rs | Distributes (and currently clips) range tombstones across rotated output tables. |
| src/table/block/type.rs | Adds BlockType::RangeTombstone = 4. |
| src/table/regions.rs | Parses "range_tombstones" TOC section handle. |
| src/table/mod.rs | Loads/decodes range tombstone blocks during table recovery and exposes Table::range_tombstones(). |
| src/table/inner.rs | Stores loaded range tombstones in the table inner struct. |
| src/table/util.rs | Updates metrics categorization to include range tombstone blocks. |
| src/table/iter.rs | Makes #[expect(clippy::too_many_arguments)] conditional on the metrics feature. |
| src/compaction/worker.rs | Collects and propagates range tombstones from compaction inputs, with last-level GC eviction. |
| src/compaction/flavour.rs | Extends compaction flavour interface to pass range tombstones into output writers. |
| src/lib.rs | Registers new internal modules for range tombstone functionality. |
Two correctness fixes for range tombstone handling: 1. Flush must NOT clip RTs to the output table's KV key range. During memtable flush, the KV data only reflects keys from that memtable, but range tombstones must cover keys in older SSTs outside this range. Clipping would cause deleted keys in older tables to "resurrect" after flush. Added `clip_range_tombstones` flag to MultiWriter (true only for compaction where input tables are consumed). 2. RT-only tables (no KV data, only range tombstones) now set `meta.first_key = min(rt.start)` and `meta.last_key = max(rt.end)` so the table's key range conservatively covers all its tombstones. Previously the key range was derived solely from the synthetic sentinel key, making the table invisible to compaction overlap checks and potentially preventing GC.
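The two fixes in this commit can be sketched as a pair of helpers: clipping intersects a tombstone with an output table's half-open key range (applied only during compaction, where the inputs are consumed), and an RT-only table derives a conservative key range from its tombstone bounds. All names here are illustrative, not the crate's API:

```rust
#[derive(Debug, Clone, PartialEq)]
pub struct Rt {
    pub start: Vec<u8>,
    pub end: Vec<u8>, // exclusive
    pub seqno: u64,
}

/// Intersect `rt` with the table range [lo, hi); None if the result is empty.
/// Used with clip_range_tombstones = true (compaction only); flush must skip
/// this so tombstones keep covering keys in older, unconsumed tables.
pub fn clip(rt: &Rt, lo: &[u8], hi: &[u8]) -> Option<Rt> {
    let start = rt.start.as_slice().max(lo).to_vec();
    let end = rt.end.as_slice().min(hi).to_vec();
    if start < end {
        Some(Rt { start, end, seqno: rt.seqno })
    } else {
        None
    }
}

/// For an RT-only table: first_key = min(rt.start), last_key = max(rt.end),
/// so compaction overlap checks see every key the tombstones can affect.
pub fn rt_only_key_range(rts: &[Rt]) -> Option<(Vec<u8>, Vec<u8>)> {
    let first = rts.iter().map(|rt| rt.start.clone()).min()?;
    let last = rts.iter().map(|rt| rt.end.clone()).max()?;
    Some((first, last))
}
```

Deriving the range from tombstone bounds rather than the synthetic sentinel key is what keeps RT-only tables visible to overlap checks and eligible for GC.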
Pull request overview
Adds first-class range tombstone support to the LSM tree, enabling efficient delete_range / delete_prefix semantics with MVCC visibility, persistence in SSTs, and iterator-time suppression.
Changes:
- Introduces `RangeTombstone` core types plus in-memory tracking (AVL interval tree in memtables, active sweep-line sets) and an iterator wrapper (`RangeTombstoneFilter`) to suppress covered KVs.
- Persists range tombstones in SSTs via a new TOC section (`"range_tombstones"`) and `BlockType::RangeTombstone`.
- Wires end-to-end support through flush + compaction + recovery, and adds extensive integration tests.
Reviewed changes
Copilot reviewed 21 out of 21 changed files in this pull request and generated 1 comment.
Show a summary per file
| File | Description |
|---|---|
| tests/range_tombstone.rs | Integration tests covering MVCC behavior, iteration suppression, flush/compaction/recovery persistence, RT-only flush, and BlobTree. |
| src/abstract_tree.rs | Extends flush plumbing to pass range tombstones to table writers; adds remove_range/remove_prefix API. |
| src/tree/mod.rs | Implements remove_range; applies RT suppression to point reads across memtables and SSTs; updates flush impl to accept RTs. |
| src/blob_tree/mod.rs | Delegates remove_range to the index tree; updates flush impl to accept RTs. |
| src/memtable/mod.rs | Stores RTs in a mutex-protected interval tree; adds insert/query/sorted extraction APIs; updates is_empty. |
| src/memtable/interval_tree.rs | New AVL interval tree implementation for efficient RT overlap/suppression queries in memtables. |
| src/range.rs | Collects RTs across layers for range iteration; adds table-skip check and wraps iterators in RangeTombstoneFilter. |
| src/range_tombstone.rs | New RT type (half-open interval) with MVCC suppression logic and helpers (intersection, coverage, upper-bound). |
| src/range_tombstone_filter.rs | New bidirectional iterator filter using active tombstone sets to suppress covered entries. |
| src/active_tombstone_set.rs | New forward/reverse “active tombstone” sweep-line trackers for iteration-time suppression. |
| src/table/block/type.rs | Adds BlockType::RangeTombstone (tag 4). |
| src/table/regions.rs | Extends TOC region parsing to include an optional "range_tombstones" section. |
| src/table/mod.rs | Loads and decodes persisted RT blocks on table open; exposes Table::range_tombstones(). |
| src/table/inner.rs | Stores loaded RTs in the table’s inner state. |
| src/table/writer/mod.rs | Writes RT blocks (and RT-only sentinel tables) and records RT metadata in table meta. |
| src/table/multi_writer.rs | Adds RT distribution/clipping behavior across rotated output tables. |
| src/compaction/worker.rs | Collects RTs from compaction inputs; propagates them to compaction outputs (with GC eviction on last level). |
| src/compaction/flavour.rs | Wires RT propagation into both compaction flavours; enables RT clipping on compaction MultiWriter. |
| src/table/util.rs | Treats RT blocks as “data-like” for metrics accounting; adjusts lint gating for metrics feature. |
| src/table/iter.rs | Adjusts lint gating for metrics feature in iterator constructor. |
| src/lib.rs | Registers new internal modules for RT functionality. |
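The "active tombstone" sweep-line idea from the review above can be sketched as a forward-only filter. This is a simplified stand-in with hypothetical `Rt`/`filter_forward` names; the real `RangeTombstoneFilter` is bidirectional and snapshot-aware:

```rust
/// Sketch of forward sweep-line suppression: a tombstone [start, end) with
/// seqno s hides an entry (key, seqno) iff start <= key < end and seqno < s.
#[derive(Clone)]
struct Rt {
    start: Vec<u8>,
    end: Vec<u8>,
    seqno: u64,
}

/// Filter `entries` (sorted ascending by key) against `rts` (sorted by start).
fn filter_forward(entries: &[(Vec<u8>, u64)], rts: &[Rt]) -> Vec<(Vec<u8>, u64)> {
    let mut out = Vec::new();
    let mut next_rt = 0; // index of the next tombstone to activate
    let mut active: Vec<Rt> = Vec::new();
    for (key, seqno) in entries {
        // Activate tombstones whose start bound has been reached.
        while next_rt < rts.len() && rts[next_rt].start.as_slice() <= key.as_slice() {
            active.push(rts[next_rt].clone());
            next_rt += 1;
        }
        // Expire tombstones whose (exclusive) end bound has been passed.
        active.retain(|rt| rt.end.as_slice() > key.as_slice());
        // Suppress the entry if any active tombstone is newer than it.
        if !active.iter().any(|rt| *seqno < rt.seqno) {
            out.push((key.clone(), *seqno));
        }
    }
    out
}

fn main() {
    let rts = vec![Rt { start: b"b".to_vec(), end: b"d".to_vec(), seqno: 10 }];
    let entries = vec![
        (b"a".to_vec(), 5),  // outside the range: kept
        (b"b".to_vec(), 5),  // covered and older than the RT: suppressed
        (b"c".to_vec(), 12), // covered but written after the RT: kept
        (b"d".to_vec(), 5),  // end bound is exclusive: kept
    ];
    let kept = filter_forward(&entries, &rts);
    assert_eq!(kept.len(), 3);
    assert!(kept.iter().all(|(k, _)| k.as_slice() != b"b".as_slice()));
    println!("ok");
}
```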
In flush mode, RTs disjoint from the table's KV range (e.g., delete_range targeting keys only in older SSTs) were silently dropped by the overlap check. Remove the filter so all RTs are persisted in at least one output table during flush.
Regression tests for the flush clipping fix: verify that a range tombstone targeting keys only in older SSTs (disjoint from the memtable's KV range) survives flush and subsequent compaction.
Pull request overview
Adds first-class range tombstones (delete_range/delete_prefix) to the LSM tree, including in-memory indexing, read-path suppression, SST persistence, and compaction propagation.
Changes:
- Introduces core range-tombstone types and iterator filtering (forward + reverse) to suppress covered keys under MVCC.
- Persists range tombstones in SSTs via a dedicated `range_tombstones` TOC section and `BlockType::RangeTombstone`.
- Extends flush/compaction plumbing (`MultiWriter`, compaction worker/flavours, tree APIs) to carry and (optionally) clip tombstones.
Reviewed changes
Copilot reviewed 21 out of 21 changed files in this pull request and generated 1 comment.
Show a summary per file
| File | Description |
|---|---|
| tests/range_tombstone.rs | New integration test suite exercising suppression, persistence, compaction, recovery, prefix deletes, and RT-only flushes. |
| src/abstract_tree.rs | Adds remove_range + default remove_prefix; threads RTs through flush via flush_to_tables_with_rt. |
| src/tree/mod.rs | Implements remove_range; point-read path checks RT suppression across memtables and tables. |
| src/blob_tree/mod.rs | Threads RT flush + exposes remove_range via underlying index tree. |
| src/memtable/mod.rs | Stores RTs in a mutex-protected interval tree; updates emptiness/size/seqno tracking. |
| src/memtable/interval_tree.rs | New AVL interval tree for RT overlap/suppression queries. |
| src/range.rs | Wraps range iteration with RangeTombstoneFilter and adds a table-skip precheck. |
| src/range_tombstone.rs | New RangeTombstone type + helpers (intersection, coverage, upper-bound). |
| src/range_tombstone_filter.rs | New bidirectional iterator filter using sweep-line activation/expiry sets. |
| src/active_tombstone_set.rs | New forward/reverse active-tombstone tracking sets for iteration. |
| src/table/block/type.rs | Adds BlockType::RangeTombstone = 4. |
| src/table/writer/mod.rs | Writes RT blocks into SSTs; supports RT-only tables via a synthetic sentinel KV; writes RT count to meta. |
| src/table/multi_writer.rs | Tracks RTs across writer rotation; supports optional clipping in compaction mode. |
| src/table/regions.rs | Parses new range_tombstones TOC section handle. |
| src/table/mod.rs | Loads and decodes RT blocks on table open; exposes Table::range_tombstones(). |
| src/table/inner.rs | Stores loaded table RTs. |
| src/table/util.rs | Metrics accounting includes RangeTombstone block type. |
| src/table/iter.rs | Makes clippy::too_many_arguments expect conditional on metrics. |
| src/compaction/flavour.rs | Enables RT clipping for compaction writers; adds write_range_tombstones hook. |
| src/compaction/worker.rs | Collects input RTs and forwards them to compaction output (with last-level filtering). |
| src/lib.rs | Wires new internal modules. |
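The `remove_prefix` API reduces to `remove_range` with an exclusive upper bound derived from the prefix. A sketch of that derivation using the usual byte-increment approach (the crate's actual `upper_bound_exclusive` also returns `None` in the `u16::MAX`-length edge case noted elsewhere in this PR):

```rust
/// Smallest key strictly greater than every key carrying the given prefix.
/// Returns None when the prefix is empty or all 0xFF bytes, in which case
/// there is no finite exclusive end and the range is unbounded above.
fn upper_bound_exclusive(prefix: &[u8]) -> Option<Vec<u8>> {
    let mut end = prefix.to_vec();
    // Drop trailing 0xFF bytes, then increment the last remaining byte.
    while let Some(&last) = end.last() {
        if last == 0xFF {
            end.pop();
        } else {
            *end.last_mut().expect("non-empty") = last + 1;
            return Some(end);
        }
    }
    None
}

fn main() {
    // delete_prefix("abc") becomes remove_range("abc".."abd").
    assert_eq!(upper_bound_exclusive(b"abc"), Some(b"abd".to_vec()));
    // A trailing 0xFF rolls over into the previous byte.
    assert_eq!(upper_bound_exclusive(&[b'a', 0xFF]), Some(b"b".to_vec()));
    // An all-0xFF prefix has no finite exclusive upper bound.
    assert_eq!(upper_bound_exclusive(&[0xFF, 0xFF]), None);
    println!("ok");
}
```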
…collision

SequenceNumberCounter starts at 0 (pre-increment), so seqno=0 is a valid user seqno. The synthetic weak tombstone in RT-only tables used seqno=0, which could collide with a real KV entry at the same user key, making merge order nondeterministic. Use the max RT seqno instead: by the time the RT was written, the counter has already passed this value.
Pull request overview
Adds first-class range tombstones to the LSM-tree, enabling efficient [start, end) and prefix deletions that are MVCC-aware and persist through flush/compaction/recovery.
Changes:
- Introduces core range-tombstone types (range tombstone model, active sweep-line sets, memtable interval tree, iterator filter).
- Persists range tombstones in SSTs via a new `BlockType::RangeTombstone` and TOC section, with flush/compaction propagation via `MultiWriter`.
- Wires read-path suppression into point reads and range/prefix iteration, plus extensive unit/integration tests.
Reviewed changes
Copilot reviewed 21 out of 21 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| tests/range_tombstone.rs | Integration test suite covering MVCC suppression, iteration, flush/compaction/recovery, RT-only flush, GC, etc. |
| src/abstract_tree.rs | Adds remove_range/remove_prefix API and extends flush to pass range tombstones to writers. |
| src/tree/mod.rs | Implements remove_range, flush plumbing to pass RTs, and point-read suppression checks. |
| src/blob_tree/mod.rs | Delegates remove_range and plumbs RT-aware flushing for blob-separated trees. |
| src/memtable/mod.rs | Adds RT storage to memtables (mutex-protected interval tree) and query APIs. |
| src/memtable/interval_tree.rs | New AVL interval tree for RT queries (suppression/overlap/covering RT). |
| src/range.rs | Wraps merged iterators with RangeTombstoneFilter and adds table-skip optimization for fully-covered tables. |
| src/range_tombstone.rs | Defines RangeTombstone, intersection/coverage helpers, and upper_bound_exclusive. |
| src/active_tombstone_set.rs | New forward/reverse active RT sets used by the iterator filter. |
| src/range_tombstone_filter.rs | New bidirectional iterator wrapper that suppresses KVs covered by RTs. |
| src/table/block/type.rs | Adds BlockType::RangeTombstone = 4. |
| src/table/regions.rs | Parses "range_tombstones" TOC section handle. |
| src/table/inner.rs | Stores recovered per-table RTs in Table::Inner. |
| src/table/mod.rs | Recovers and decodes RT blocks; exposes Table::range_tombstones(). |
| src/table/writer/mod.rs | Writes RT block; adds RT-only table handling via a synthetic sentinel KV. |
| src/table/multi_writer.rs | Propagates RTs across output tables; adds optional clipping for compaction. |
| src/table/util.rs | Updates metrics bookkeeping to include RT blocks. |
| src/table/iter.rs | Adjusts lint suppression to be cfg(feature="metrics")-dependent. |
| src/compaction/worker.rs | Collects and propagates input-table RTs into compaction outputs; GC eviction at bottom level. |
| src/compaction/flavour.rs | Enables RT clipping for compaction output and adds RT propagation hook to flavours. |
| src/lib.rs | Registers new internal RT modules. |
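The MVCC suppression rule that `src/range_tombstone.rs` and the read path rely on boils down to one predicate: a tombstone hides a version iff the key lies in `[start, end)` and that version's seqno is strictly older than the tombstone's. A minimal sketch with field names assumed from the PR description:

```rust
/// Half-open [start, end) range tombstone, per the PR description.
/// Field names are assumed for illustration.
struct RangeTombstone {
    start: Vec<u8>,
    end: Vec<u8>,
    seqno: u64,
}

impl RangeTombstone {
    /// Key containment: the end bound is exclusive.
    fn covers(&self, key: &[u8]) -> bool {
        self.start.as_slice() <= key && key < self.end.as_slice()
    }

    /// MVCC suppression: only versions written before the tombstone
    /// (strictly smaller seqno) are hidden; later writes survive.
    fn suppresses(&self, key: &[u8], kv_seqno: u64) -> bool {
        kv_seqno < self.seqno && self.covers(key)
    }
}

fn main() {
    let rt = RangeTombstone { start: b"b".to_vec(), end: b"d".to_vec(), seqno: 100 };
    assert!(rt.suppresses(b"c", 50));   // older version inside the range
    assert!(!rt.suppresses(b"c", 100)); // equal seqno is not suppressed
    assert!(!rt.suppresses(b"c", 150)); // newer write survives
    assert!(!rt.suppresses(b"d", 50));  // exclusive end bound
    println!("ok");
}
```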
Comments suppressed due to low confidence (1)
src/abstract_tree.rs:137
`flush()` calls `register_tables(&tables, ...)` unconditionally when `flush_to_tables_with_rt` returns `Some`, even if `tables.is_empty()`. If a flush ever produces no tables (e.g. an empty sealed memtable / all items dropped), this would still delete the sealed memtables via `sealed_ids`, potentially losing data. Consider guarding `register_tables` behind `!tables.is_empty()` (or making `flush_to_tables_with_rt` return `Ok(None)` when nothing is written).
self.register_tables(
&tables,
blob_files.as_deref(),
None,
&sealed_ids,
seqno_threshold,
)?;
// Set range tombstones BEFORE writing KV items so that if MultiWriter
// rotates to a new table during the write loop, earlier tables already
// carry the RT metadata.
table_writer.set_range_tombstones(range_tombstones);

// Use the minimum RT start for the sentinel key. The sentinel
// seqno must not collide with real user data: SequenceNumberCounter
// starts at 0 (pre-increment), so seqno 0 is a valid user seqno.
// Use max RT seqno instead — by the time the RT was written, the
// counter has already passed this value, so no real KV entry will
// share this (user_key, seqno) pair.
self.write(InternalValue::new_weak_tombstone(
    start.clone(),
    max_rt_seqno,
table_writer.set_range_tombstones(range_tombstones);

for item in stream {
    table_writer.write(item?)?;
}
Summary
- `remove_range(start, end, seqno)` and `remove_prefix(prefix, seqno)` API on `AbstractTree`

Technical Details
Core types (ported algorithms from upstream PR fjall-rs#242, own SST format):
- `RangeTombstone`: half-open `[start, end)` interval with seqno
- `ActiveTombstoneSet` / `ActiveTombstoneSetReverse`: sweep-line trackers for iteration
- `IntervalTree`: AVL-balanced interval tree in memtable for O(log n) queries
- `RangeTombstoneFilter`: bidirectional wrapper that suppresses covered entries

Read path:
- `RangeTombstoneFilter` with sweep-line activation/expiry
- Covering-RT query (`CoveringRt`) exists but is not yet wired into the read path

Write path:
- Flush: `MultiWriter` uses `clip_range_tombstones=false`
- Clipping is done with `intersect_opt()`; for compaction, `MultiWriter` uses `clip_range_tombstones=true` (via `.use_clip_range_tombstones()`)
- RT-only tables: `meta.first_key`/`meta.last_key` set to `min(rt.start)`/`max(rt.end)` so the key range covers all tombstones for compaction overlap checks

SST format:
- `BlockType::RangeTombstone = 4`
- Wire format: `[start_len:u16_le][start][end_len:u16_le][end][seqno:u64_le]` (repeated)
- `"range_tombstones"` section in the table TOC
- Key lengths go through `u16::try_from`: oversized keys are rejected with an error instead of silently truncated
- `upper_bound_exclusive` returns `Option<UserKey>` (`None` when the key is at `u16::MAX` length)

Known Limitations
Test Plan
- `remove_prefix` API

Closes #16
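The raw wire format given under "SST format" can be round-tripped in a few lines. This is a sketch against the documented layout only, not the crate's actual encoder (which, per the summary, additionally validates key lengths via `u16::try_from`):

```rust
/// Encode one tombstone in the documented layout:
/// [start_len:u16_le][start][end_len:u16_le][end][seqno:u64_le]
fn encode(start: &[u8], end: &[u8], seqno: u64, out: &mut Vec<u8>) {
    // Sketch only: assumes key lengths fit in u16 (the real writer rejects
    // oversized keys with an error instead of truncating).
    out.extend_from_slice(&(start.len() as u16).to_le_bytes());
    out.extend_from_slice(start);
    out.extend_from_slice(&(end.len() as u16).to_le_bytes());
    out.extend_from_slice(end);
    out.extend_from_slice(&seqno.to_le_bytes());
}

/// Decode a whole block of repeated records (panics on malformed input;
/// fine for a sketch, not for production).
fn decode(mut buf: &[u8]) -> Vec<(Vec<u8>, Vec<u8>, u64)> {
    let mut out = Vec::new();
    while !buf.is_empty() {
        let start_len = u16::from_le_bytes([buf[0], buf[1]]) as usize;
        let start = buf[2..2 + start_len].to_vec();
        buf = &buf[2 + start_len..];
        let end_len = u16::from_le_bytes([buf[0], buf[1]]) as usize;
        let end = buf[2..2 + end_len].to_vec();
        buf = &buf[2 + end_len..];
        let seqno = u64::from_le_bytes(buf[..8].try_into().unwrap());
        buf = &buf[8..];
        out.push((start, end, seqno));
    }
    out
}

fn main() {
    let mut block = Vec::new();
    encode(b"a", b"f", 7, &mut block);
    encode(b"m", b"z", 9, &mut block);
    let decoded = decode(&block);
    assert_eq!(decoded.len(), 2);
    assert_eq!(decoded[0], (b"a".to_vec(), b"f".to_vec(), 7));
    assert_eq!(decoded[1], (b"m".to_vec(), b"z".to_vec(), 9));
    println!("ok");
}
```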