
feat: add range tombstones (delete_range / delete_prefix)#21

Open
polaz wants to merge 25 commits into main from
feat/#16-feat-range-tombstones--deleterange--deleteprefix

Conversation


polaz (Member) commented Mar 16, 2026

Summary

  • Add native range tombstone support for efficient deletion of contiguous key ranges
  • remove_range(start, end, seqno) and remove_prefix(prefix, seqno) APIs on AbstractTree
  • MVCC-correct: tombstones respect snapshot visibility and do not suppress newer values
  • Full persistence through flush, compaction, and recovery

Technical Details

Core types (ported algorithms from upstream PR fjall-rs#242, own SST format):

  • RangeTombstone — half-open [start, end) interval with seqno
  • ActiveTombstoneSet / ActiveTombstoneSetReverse — sweep-line trackers for iteration
  • IntervalTree — AVL-balanced interval tree in memtable for O(log n) queries
  • RangeTombstoneFilter — bidirectional wrapper that suppresses covered entries
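As a rough illustration of the first of these, a half-open tombstone interval with MVCC-aware coverage could look like the following. This is a minimal sketch; the field and method names are assumptions, not the crate's actual API.

```rust
// Hypothetical stand-in for the PR's RangeTombstone type.
#[derive(Debug, Clone, PartialEq, Eq)]
struct RangeTombstone {
    start: Vec<u8>, // inclusive
    end: Vec<u8>,   // exclusive
    seqno: u64,
}

impl RangeTombstone {
    fn new(start: Vec<u8>, end: Vec<u8>, seqno: u64) -> Self {
        debug_assert!(start < end, "tombstone interval must be non-empty");
        Self { start, end, seqno }
    }

    /// A key is suppressed when it lies inside the half-open interval
    /// [start, end), the covered entry is older than the tombstone, and
    /// the tombstone itself is visible at the read snapshot.
    fn suppresses(&self, key: &[u8], entry_seqno: u64, read_seqno: u64) -> bool {
        self.seqno <= read_seqno          // tombstone visible to this snapshot
            && entry_seqno < self.seqno   // never suppress newer values
            && self.start.as_slice() <= key
            && key < self.end.as_slice()
    }
}

fn main() {
    let rt = RangeTombstone::new(b"a".to_vec(), b"m".to_vec(), 10);
    assert!(rt.suppresses(b"c", 5, 20));   // older entry inside range: hidden
    assert!(!rt.suppresses(b"c", 15, 20)); // newer value survives
    assert!(!rt.suppresses(b"c", 5, 9));   // snapshot predates the tombstone
    assert!(!rt.suppresses(b"z", 5, 20));  // outside [start, end)
}
```

The three conditions in `suppresses` encode the MVCC-correctness bullet above: snapshot visibility, no suppression of newer values, and half-open interval membership.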

Read path:

  • Point reads check range tombstone suppression across all layers (active memtable → sealed → SST)
  • Range/prefix iteration uses RangeTombstoneFilter with sweep-line activation/expiry
  • Table-skip infrastructure prepared (CoveringRt) but not yet wired into read path
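The sweep-line activation/expiry used for iteration can be sketched as follows for the forward direction. The names are hypothetical; the real RangeTombstoneFilter wraps a fallible KV iterator and also handles reverse iteration.

```rust
#[derive(Clone)]
struct Rt {
    start: Vec<u8>, // inclusive
    end: Vec<u8>,   // exclusive
    seqno: u64,
}

/// Filter a key-ascending stream, suppressing entries covered by a range
/// tombstone visible at `read_seqno`. `tombstones` must be sorted by start.
fn filter_forward(
    entries: Vec<(Vec<u8>, u64)>,
    tombstones: Vec<Rt>,
    read_seqno: u64,
) -> Vec<(Vec<u8>, u64)> {
    let mut out = Vec::new();
    let mut upcoming = tombstones.into_iter().peekable();
    let mut active: Vec<Rt> = Vec::new();
    for (key, seqno) in entries {
        // Sweep-line advance: activate tombstones whose start <= key.
        while upcoming
            .peek()
            .is_some_and(|rt| rt.start.as_slice() <= key.as_slice())
        {
            active.push(upcoming.next().unwrap());
        }
        // Expire tombstones whose (exclusive) end has been passed.
        active.retain(|rt| key.as_slice() < rt.end.as_slice());
        let covered = active
            .iter()
            .any(|rt| rt.seqno <= read_seqno && seqno < rt.seqno);
        if !covered {
            out.push((key, seqno));
        }
    }
    out
}

fn main() {
    let entries = vec![
        (b"a".to_vec(), 1u64),
        (b"c".to_vec(), 2), // covered by [b, f) @ seqno 5
        (b"e".to_vec(), 9), // newer than the tombstone: survives
        (b"x".to_vec(), 3), // past the tombstone's end
    ];
    let rts = vec![Rt { start: b"b".to_vec(), end: b"f".to_vec(), seqno: 5 }];
    let kept = filter_forward(entries, rts, 100);
    let keys: Vec<_> = kept.iter().map(|(k, _)| k.clone()).collect();
    assert_eq!(keys, vec![b"a".to_vec(), b"e".to_vec(), b"x".to_vec()]);
}
```

Because both the KV stream and the tombstone list are sorted, each tombstone is activated and expired exactly once, keeping the filter's overhead linear in the number of entries plus tombstones.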

Write path:

  • Flush: range tombstones written unclipped to SST — they must cover keys in older SSTs outside the memtable's key range. MultiWriter uses clip_range_tombstones=false for flush
  • Compaction: range tombstones clipped to each output table's key range via intersect_opt(). MultiWriter uses clip_range_tombstones=true (via .use_clip_range_tombstones())
  • RT-only tables (no KV data): synthetic sentinel key written for valid index; meta.first_key/meta.last_key set to min(rt.start)/max(rt.end) so key range covers all tombstones for compaction overlap checks
  • GC: range tombstones below MVCC watermark evicted at bottom level
  • MultiWriter rotation: RTs distributed across all output tables
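The clipping step in the compaction bullet amounts to a plain interval intersection. This is only a sketch of what intersect_opt is described as doing; the real helper operates on the crate's key types, and a table's key range is treated here as half-open for simplicity.

```rust
/// Intersect a tombstone's half-open [start, end) with a table's key range,
/// returning None when they do not overlap (the tombstone is then dropped
/// from that output table). All byte slices compare lexicographically.
fn intersect_opt(
    rt_start: &[u8],
    rt_end: &[u8],
    table_start: &[u8],
    table_end: &[u8],
) -> Option<(Vec<u8>, Vec<u8>)> {
    let start = rt_start.max(table_start).to_vec();
    let end = rt_end.min(table_end).to_vec();
    // An empty or inverted result means no overlap.
    (start < end).then_some((start, end))
}

fn main() {
    // Tombstone [c, m) against a table covering [a, g): clipped to [c, g).
    assert_eq!(
        intersect_opt(b"c", b"m", b"a", b"g"),
        Some((b"c".to_vec(), b"g".to_vec()))
    );
    // No overlap with a table covering [n, z): dropped.
    assert_eq!(intersect_opt(b"c", b"m", b"n", b"z"), None);
}
```

This also motivates the flush/compaction asymmetry above: at flush time the tombstone may need to reach keys in older SSTs, so no intersection is applied; at compaction time each output table only needs the portion overlapping its own key range.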

SST format:

  • New BlockType::RangeTombstone = 4
  • Raw wire format: [start_len:u16_le][start][end_len:u16_le][end][seqno:u64_le] (repeated)
  • New "range_tombstones" section in table TOC
  • RT-only memtable flush: tombstones re-inserted into active memtable (SST requires KV data for valid index)
  • Key length validated via u16::try_from — oversized keys rejected with error instead of silent truncation
  • upper_bound_exclusive returns Option<UserKey> (None when key at u16::MAX length)
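The raw wire format above round-trips as in this standalone sketch. The real writer validates key lengths via u16::try_from and returns errors instead of panicking, and a real decoder would bounds-check rather than trust the input.

```rust
// [start_len:u16_le][start][end_len:u16_le][end][seqno:u64_le], repeated.
fn encode(tombstones: &[(Vec<u8>, Vec<u8>, u64)]) -> Vec<u8> {
    let mut buf = Vec::new();
    for (start, end, seqno) in tombstones {
        // Sketch only: real code rejects oversized keys with an error
        // instead of panicking.
        let start_len = u16::try_from(start.len()).expect("key too long");
        let end_len = u16::try_from(end.len()).expect("key too long");
        buf.extend_from_slice(&start_len.to_le_bytes());
        buf.extend_from_slice(start);
        buf.extend_from_slice(&end_len.to_le_bytes());
        buf.extend_from_slice(end);
        buf.extend_from_slice(&seqno.to_le_bytes());
    }
    buf
}

fn decode(mut buf: &[u8]) -> Vec<(Vec<u8>, Vec<u8>, u64)> {
    let mut out = Vec::new();
    while !buf.is_empty() {
        // Read a u16-length-prefixed key (panics on truncated input; a
        // production decoder returns a corruption error instead).
        let read_key = |b: &mut &[u8]| {
            let n = u16::from_le_bytes([b[0], b[1]]) as usize;
            let v = b[2..2 + n].to_vec();
            *b = &b[2 + n..];
            v
        };
        let start = read_key(&mut buf);
        let end = read_key(&mut buf);
        let seqno = u64::from_le_bytes(buf[..8].try_into().unwrap());
        buf = &buf[8..];
        out.push((start, end, seqno));
    }
    out
}

fn main() {
    let rts = vec![
        (b"a".to_vec(), b"f".to_vec(), 7u64),
        (b"m".to_vec(), b"z".to_vec(), 9),
    ];
    assert_eq!(decode(&encode(&rts)), rts);
}
```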

Known Limitations

  • SST block format is fork-specific (will need migration when upstream settles on format)
  • Linear scan for SST range tombstone suppression in point reads (table-skip optimization prepared but not wired)
  • No WAL persistence (consistent with crate design — caller manages WAL)

Test Plan

  • 64 unit tests for core types (RangeTombstone, ActiveTombstoneSet, IntervalTree, RangeTombstoneFilter)
  • 16 integration tests covering:
    • Point read suppression, MVCC visibility, newer values not suppressed
    • Forward and reverse range iteration
    • Cross-layer suppression (memtable → SST)
    • Sealed memtable tombstones
    • remove_prefix API
    • Overlapping tombstones
    • Persistence through flush, compaction, and recovery
    • RT-only memtable flush
    • GC eviction at bottom level
    • Prefix iteration with SST data
  • Full existing test suite passes (0 regressions)

Closes #16

- Add RangeTombstone, ActiveTombstoneSet, IntervalTree core types
  (ported from upstream PR fjall-rs#242 algorithms, own SST persistence)
- Add RangeTombstoneFilter for bidirectional iteration suppression
- Integrate into Memtable with interval tree for O(log n) queries
- Add remove_range(start, end, seqno) and remove_prefix(prefix, seqno)
  to AbstractTree trait, implemented for Tree and BlobTree
- Wire suppression into point reads (memtable + sealed + SST layers)
- Wire RangeTombstoneFilter into range/prefix iteration pipeline
- SST persistence: raw wire format in BlockType::RangeTombstone block
- Flush: collect range tombstones from sealed memtables, write to SSTs
- Compaction: propagate RTs from input to output tables with clipping
- GC: evict range tombstones below watermark at bottom level
- Table-skip: skip tables fully covered by a range tombstone
- MultiWriter: clip RTs to each output table's key range on rotation
- Handle RT-only tables (derive key range from tombstone bounds)
- 16 integration tests covering all paths

Closes #16
Copilot AI review requested due to automatic review settings March 16, 2026 19:59

coderabbitai bot commented Mar 16, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

📝 Walkthrough

Adds end-to-end range-tombstone support: new RangeTombstone types and filters, interval-tree-backed memtable storage and active trackers, SST on-disk tombstone blocks and writer distribution, flush/compaction wiring, read-path suppression, public remove_range/remove_prefix APIs, tests, and architecture/known-gaps docs.

Changes

Cohort / File(s) / Summary:

  • Documentation & Planning (.forge/16/analysis.md, .forge/16/known-gaps.md):
    New architecture analysis and known-gaps documents for Issue #16 describing approaches, trade-offs, implementation options, and remaining limitations.
  • Core Types & Filters (src/range_tombstone.rs, src/range_tombstone_filter.rs):
    Adds RangeTombstone and CoveringRt types, ordering/utilities, and a bidirectional RangeTombstoneFilter that suppresses KV stream items during forward/reverse iteration.
  • Memtable Storage (src/memtable/interval_tree.rs, src/memtable/mod.rs):
    Introduces an AVL IntervalTree for tombstones; integrates it into Memtable with Mutex protection and APIs to insert/query/summarize tombstones.
  • Active Tombstone Tracking (src/active_tombstone_set.rs):
    Implements forward/reverse ActiveTombstoneSet trackers (activation/expiry, seqno multisets, suppression checks) with tests for iteration semantics.
  • Tree, BlobTree & Flush Integration (src/abstract_tree.rs, src/tree/mod.rs, src/blob_tree/mod.rs):
    Adds flush_to_tables_with_rt (range-tombstone-aware flush), remove_range/remove_prefix APIs, and centralized is_suppressed_by_range_tombstones suppression checks integrated into read/get paths.
  • Table Writer & On-disk Format (src/table/writer/mod.rs, src/table/block/type.rs, src/table/inner.rs, src/table/mod.rs, src/table/regions.rs):
    Adds BlockType::RangeTombstone, writer support to record tombstones and metadata, on-table decoding for tombstone blocks, and a Table::range_tombstones() accessor.
  • MultiWriter & Distribution (src/table/multi_writer.rs, src/table/mod.rs):
    Adds MultiWriter::set_range_tombstones; tombstones are clipped/distributed per output table (including RT-only tables) and written on rotation/finish.
  • Compaction Path (src/compaction/flavour.rs, src/compaction/worker.rs):
    Extends the compaction flavour API with write_range_tombstones; the compaction worker collects tombstones from inputs, filters by GC watermark at the last level, and propagates surviving tombstones to outputs.
  • Range Read & Iteration (src/range.rs, src/table/iter.rs, src/table/util.rs):
    Range iterator fast path when no tombstones exist; conditional wrapping with RangeTombstoneFilter; metrics/lint updates to account for the tombstone block type.
  • Crate Root & Modules (src/lib.rs):
    Registers crate-private modules active_tombstone_set, range_tombstone, range_tombstone_filter.
  • Tests (tests/range_tombstone.rs):
    Comprehensive MVCC/tombstone tests covering memtables, SST persistence, iteration, remove_prefix, flush/compaction/recovery, and GC interactions.
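The interval-tree storage in the Memtable Storage entry can be pictured with a classic max-end augmentation: each node records the maximum interval end in its subtree so whole branches can be pruned during a stabbing query. This sketch uses an unbalanced BST for brevity, whereas the PR's tree is AVL-balanced to keep queries at O(log n); all names here are illustrative.

```rust
struct Node {
    start: Vec<u8>,   // inclusive
    end: Vec<u8>,     // exclusive
    seqno: u64,
    max_end: Vec<u8>, // max `end` over this node's subtree
    left: Option<Box<Node>>,
    right: Option<Box<Node>>,
}

fn insert(node: Option<Box<Node>>, start: Vec<u8>, end: Vec<u8>, seqno: u64) -> Box<Node> {
    match node {
        None => Box::new(Node {
            max_end: end.clone(),
            start,
            end,
            seqno,
            left: None,
            right: None,
        }),
        Some(mut n) => {
            if start < n.start {
                n.left = Some(insert(n.left.take(), start, end.clone(), seqno));
            } else {
                n.right = Some(insert(n.right.take(), start, end.clone(), seqno));
            }
            // Maintain the augmentation on the way back up.
            if end > n.max_end {
                n.max_end = end;
            }
            n
        }
    }
}

/// Collect the seqnos of all intervals [start, end) containing `key`.
fn stab(node: &Option<Box<Node>>, key: &[u8], out: &mut Vec<u64>) {
    let Some(n) = node else { return };
    // Prune: no interval in this subtree ends after `key`.
    if key >= n.max_end.as_slice() {
        return;
    }
    stab(&n.left, key, out);
    if n.start.as_slice() <= key && key < n.end.as_slice() {
        out.push(n.seqno);
    }
    // Right subtree only holds starts >= n.start, so skip it if key < start.
    if key >= n.start.as_slice() {
        stab(&n.right, key, out);
    }
}

fn main() {
    let mut root = None;
    root = Some(insert(root, b"a".to_vec(), b"f".to_vec(), 1));
    root = Some(insert(root, b"c".to_vec(), b"k".to_vec(), 2));
    root = Some(insert(root, b"m".to_vec(), b"z".to_vec(), 3));
    let mut hits = Vec::new();
    stab(&root, b"d", &mut hits);
    hits.sort_unstable();
    assert_eq!(hits, vec![1, 2]); // "d" is covered by [a,f) and [c,k)
}
```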

Sequence Diagram(s)

sequenceDiagram
  participant Client
  participant Tree
  participant Memtable
  participant Flusher as Flush/Writer
  participant SST as SST Files
  participant Compactor

  rect rgba(200,230,255,0.5)
  Client->>Tree: remove_range(start,end,seq)
  Tree->>Memtable: insert_range_tombstone(start,end,seq)
  end

  rect rgba(220,255,200,0.5)
  Client->>Tree: read / iterate (read_seqno)
  Tree->>Memtable: is_suppressed_by_range_tombstones(key, read_seqno)
  Note over Tree,Memtable: suppression may consult sealed memtables and SSTs
  end

  rect rgba(255,240,200,0.5)
  Flusher->>Memtable: flush_to_tables_with_rt(stream, tombstones)
  Flusher->>SST: write_range_tombstone block into table
  end

  rect rgba(255,220,240,0.5)
  Compactor->>SST: collect range tombstones from inputs
  Compactor->>SST: write surviving tombstones to output tables
  end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~70 minutes

Poem

🐰 I hopped through nodes and trimmed the vine,

laid gentle ranges where old keys resign.
With sorted roots and careful sweep,
I tuck small bytes where shadows sleep.
Happy hops — the graph grows fine!

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
  • Title check: Passed. The title 'feat: add range tombstones (delete_range / delete_prefix)' accurately and concisely describes the main feature addition: implementing range tombstone support with the two primary APIs for bulk deletion.
  • Linked Issues check: Passed. The PR fully implements the requirements from Issue #16: provides remove_range and remove_prefix APIs, ensures MVCC correctness, integrates across memtable/flush/SST/compaction/recovery, supports bulk deletions for graph workloads, and includes comprehensive tests.
  • Out of Scope Changes check: Passed. All changes are directly related to implementing range tombstone support. No unrelated modifications detected; the feature is self-contained and focused on the delete_range/delete_prefix implementation.
  • Docstring Coverage: Passed. Docstring coverage is 100.00%, which is sufficient; the required threshold is 80.00%.
  • Description check: Skipped, since CodeRabbit's high-level summary is enabled.




Copilot AI left a comment


Pull request overview

Adds first-class range tombstones to the LSM-tree to enable efficient range/prefix deletes (remove_range, remove_prefix) with MVCC-aware visibility, and persists tombstones through flush/compaction/recovery via a new SST block and table metadata.

Changes:

  • Introduces core range-tombstone types and in-memory structures (interval tree in memtables; sweep-line active sets + iterator filter).
  • Extends flush/compaction/table IO to write/read range tombstones from SSTs (BlockType::RangeTombstone, TOC region, metadata count).
  • Adds integration tests validating suppression semantics and persistence across flush/compaction/recovery.

Reviewed changes

Copilot reviewed 21 out of 21 changed files in this pull request and generated 8 comments.

Summary per file:

  • tests/range_tombstone.rs: Integration tests for range tombstone behavior and persistence.
  • src/abstract_tree.rs: Adds the remove_range API and remove_prefix helper; extends the flush path to pass tombstones.
  • src/tree/mod.rs: Tree integration: point-read suppression check; flush now accepts tombstones.
  • src/blob_tree/mod.rs: BlobTree integration: passes tombstones through flush; exposes remove_range.
  • src/memtable/mod.rs: Stores memtable range tombstones in a mutex-protected interval tree and exposes query/flush helpers.
  • src/memtable/interval_tree.rs: AVL interval tree backing memtable tombstone queries.
  • src/range_tombstone.rs: Defines RangeTombstone, clipping/cover helpers, and a key upper-bound helper.
  • src/active_tombstone_set.rs: Sweep-line "active tombstone" trackers for forward/reverse iteration.
  • src/range_tombstone_filter.rs: Bidirectional iterator wrapper to suppress entries covered by visible range tombstones.
  • src/range.rs: Collects tombstones across layers, adds the table-skip optimization, and wraps iterators with the tombstone filter.
  • src/table/block/type.rs: Adds BlockType::RangeTombstone = 4.
  • src/table/writer/mod.rs: Writes range tombstone blocks and metadata; supports RT-only tables by deriving the key range.
  • src/table/regions.rs: Adds the range_tombstones section handle in parsed table regions.
  • src/table/multi_writer.rs: Distributes/clips tombstones per output table via set_range_tombstones.
  • src/table/inner.rs: Stores loaded range tombstones in table inner state.
  • src/table/mod.rs: Loads/decodes the range tombstone block on table recovery; exposes range_tombstones().
  • src/compaction/flavour.rs: Extends the compaction flavour to accept output range tombstones.
  • src/compaction/worker.rs: Collects tombstones from input tables and propagates them to compaction outputs (with last-level GC filtering).
  • src/lib.rs: Registers new internal modules for range tombstone support.
  • .forge/16/known-gaps.md: Documents remaining limitations and known gaps for Issue #16.
  • .forge/16/analysis.md: Architecture/design analysis and rationale for the chosen approach.



coderabbitai bot left a comment


Actionable comments posted: 9

🧹 Nitpick comments (2)
src/table/mod.rs (2)

570-577: Consider validating the block type for range tombstones.

The recovery path loads the range tombstone block with CompressionType::None but doesn't validate that the loaded block has BlockType::RangeTombstone. This differs from other block loading paths (e.g., filter block at lines 553-560) that verify the block type.

♻️ Proposed fix
         let range_tombstones = if let Some(rt_handle) = regions.range_tombstones {
             log::trace!("Loading range tombstone block, with rt_ptr={rt_handle:?}");
             let block = Block::from_file(&file, rt_handle, crate::CompressionType::None)?;
+            if block.header.block_type != BlockType::RangeTombstone {
+                return Err(crate::Error::InvalidTag((
+                    "BlockType",
+                    block.header.block_type.into(),
+                )));
+            }
             Self::decode_range_tombstones(block)?
         } else {
             Vec::new()
         };

620-660: Consider validating tombstone invariant start < end during decoding.

RangeTombstone::new has a debug_assert!(start < end) that only fires in debug builds. Corrupted on-disk data could produce invalid tombstones with start >= end in release builds, potentially causing incorrect behavior in the interval tree or filtering logic.

🛡️ Proposed validation
             tombstones.push(RangeTombstone::new(
                 UserKey::from(start_buf),
                 UserKey::from(end_buf),
                 seqno,
             ));
+            // Validate tombstone invariant
+            if let Some(rt) = tombstones.last() {
+                if rt.start >= rt.end {
+                    return Err(crate::Error::Unrecoverable);
+                }
+            }

Or validate before constructing:

+            let start_key = UserKey::from(start_buf);
+            let end_key = UserKey::from(end_buf);
+            if start_key >= end_key {
+                return Err(crate::Error::Unrecoverable);
+            }
             tombstones.push(RangeTombstone::new(
-                UserKey::from(start_buf),
-                UserKey::from(end_buf),
+                start_key,
+                end_key,
                 seqno,
             ));
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.forge/16/analysis.md:
- Around line 85-95: MD040 is triggered because the two fenced diff-stat blocks
in .forge/16/analysis.md lack a language tag; update the opening fences for the
two blocks that start with "origin/main vs upstream/main:" and "origin/main vs
upstream-pr-242:" to use a language tag (e.g., change ``` to ```text) so the
blocks become ```text; keep the content of the blocks unchanged.

In `@src/abstract_tree.rs`:
- Around line 722-743: The current remove_prefix implementation incorrectly
treats Bound::Unbounded by synthesizing a truncated exclusive end via
range_tombstone::upper_bound_exclusive(prefix), which doesn't form a true
lexicographic upper bound (e.g., b"\xff" still leaves b"\xff\x01"); fix by
adding an explicit unbounded-path instead of synthesizing an end: when
prefix_to_range yields hi == Bound::Unbounded, call a true unbounded-range
removal API (e.g., implement/use self.remove_range_unbounded(start, seqno) or
self.remove_range_from(start, seqno)) to remove [start, +∞), or if an unbounded
removal is not supported, explicitly reject the operation (return 0 or propagate
an error) rather than creating a truncated tombstone; update remove_prefix, and
add/modify remove_range* helper(s) accordingly (referencing remove_prefix,
prefix_to_range, range_tombstone::upper_bound_exclusive, and remove_range).
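The unbounded-prefix edge case flagged here comes from how a prefix's exclusive upper bound is computed: a correct helper must signal "no finite bound" for an all-0xFF prefix rather than emit a truncated bound that leaves keys like b"\xff\x01" alive. A hypothetical sketch of such a helper:

```rust
/// Exclusive upper bound for a prefix: increment the last byte, dropping
/// trailing 0xFF bytes (carry). Returns None when no finite bound exists,
/// i.e. the prefix consists entirely of 0xFF bytes; that case needs an
/// unbounded range removal (or an explicit rejection), not truncation.
fn upper_bound_exclusive(prefix: &[u8]) -> Option<Vec<u8>> {
    let mut out = prefix.to_vec();
    while let Some(&last) = out.last() {
        if last == 0xFF {
            out.pop(); // 0xFF rolls over: carry into the previous byte
        } else {
            *out.last_mut().unwrap() = last + 1;
            return Some(out);
        }
    }
    None // all bytes were 0xFF: the upper bound is +infinity
}

fn main() {
    assert_eq!(upper_bound_exclusive(b"abc"), Some(b"abd".to_vec()));
    assert_eq!(upper_bound_exclusive(b"a\xff"), Some(b"b".to_vec()));
    assert_eq!(upper_bound_exclusive(b"\xff"), None);
}
```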

In `@src/active_tombstone_set.rs`:
- Around line 110-123: The initialize_from method currently activates every
provided RangeTombstone via activate(&rt) even if it starts in the future
relative to the iterator's current_key; change initialize_from to only activate
truly overlapping tombstones by either (A) adding a current_key parameter (e.g.,
initialize_from(&mut self, current_key: &KeyType, tombstones: impl
IntoIterator<Item = RangeTombstone>)) and filtering each rt with rt.start <=
*current_key && *current_key < rt.end before calling activate, or (B) make
initialize_from private/internal and require callers to pre-filter by that same
overlap predicate; apply the same fix to the analogous method/block at the
second occurrence (the code referenced around lines 226–230) and ensure
expire_until usage still enforces the end bound after activation.

In `@src/memtable/interval_tree.rs`:
- Around line 266-269: The insert method incorrectly increments self.len
unconditionally; change insert_node (the helper called by insert) to return
whether it actually created a new node (e.g., return a (node, bool) or a Node
plus created flag) when handling RangeTombstone duplicates, then update insert
to set self.root from the returned node and only increment self.len when the
created flag is true; update any calling sites or signatures for insert_node and
keep references to RangeTombstone, insert_node, and insert consistent.

In `@src/memtable/mod.rs`:
- Around line 31-36: Memtables that only contain range tombstones are treated as
empty because Memtable::is_empty() only checks items; update logic so
Memtable::is_empty() also inspects range_tombstones by locking the Mutex and
calling the interval tree's is_empty (e.g., check
self.range_tombstones.lock().is_empty()), and propagate this change to callers
(Tree::rotate_memtable(), Tree::clear_active_memtable()) so they seal/flush
memtables and compute get_highest_seqno() correctly when only tombstones exist;
ensure get_highest_seqno() and any durability paths (remove_range/remove_prefix)
treat a memtable with non-empty range_tombstones as non-empty.
- Around line 183-190: The insert_range_tombstone method must reject empty or
reversed intervals before persisting: in insert_range_tombstone (and before
calling RangeTombstone::new or inserting into self.range_tombstones), validate
that start < end (using UserKey ordering) and if not, return 0 (do not
compute/record the tombstone). Move or gate the size calculation and the
lock+insert behind this check so reversed/empty ranges are never stored; keep
the rest of the method behavior unchanged.

In `@src/range_tombstone_filter.rs`:
- Around line 39-52: The constructor new() should ensure fwd_tombstones are
ordered by start ascending (and tie-break by seqno descending) before storing,
because fwd_activate_up_to() relies on forward tombstones being sorted by start;
add a sort on fwd_tombstones in new() such as sorting by (start asc, seqno desc)
prior to assigning Self.fwd_tombstones so out-of-order input Vecs can't skip
activation.

In `@tests/range_tombstone.rs`:
- Around line 330-352: Add a new test that writes a range tombstone then tampers
with the SST on-disk range_tombstones block and asserts recovery fails: use
get_tmp_folder(), open_tree(), remove_range(), and flush_active_memtable(0) to
create an SST with the new range tombstone format
([start_len][start][end_len][end][seqno]), then locate the produced SST file in
the folder, corrupt/truncate bytes from the range_tombstones payload (e.g.
remove bytes from the end of the block or zero out length fields), attempt to
reopen the tree with open_tree(folder.path()) and assert that reopening returns
an Err (i.e. recovery fails) rather than succeeding; keep the test focused on
both truncated and malformed length scenarios so it fails on invalid RT
encoding.
- Around line 168-176: The test currently relies on behavior that could pass
even if the tombstone never left the active memtable; modify the test around the
remove_range/rotate_memtable sequence (the block using tree.remove_range,
tree.rotate_memtable, tree.insert, and tree.get) to explicitly assert that the
tombstone was actually sealed/flushed by checking rotate_memtable() returned
Some (or that tree.sealed_memtable_count() > 0) immediately after rotation, and
similarly for the later analogous block (lines ~367-374) assert table_count()
increases or reopen the DB to force RT-only flush visibility; update the
assertions in both places so the test fails if the tombstone stayed in the
active memtable.


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 0a4b651a-d997-4ee5-9e1a-ba1e0f088648

📥 Commits

Reviewing files that changed from the base of the PR and between 5103c70 and a3bc6c9.

📒 Files selected for processing (21)
  • .forge/16/analysis.md
  • .forge/16/known-gaps.md
  • src/abstract_tree.rs
  • src/active_tombstone_set.rs
  • src/blob_tree/mod.rs
  • src/compaction/flavour.rs
  • src/compaction/worker.rs
  • src/lib.rs
  • src/memtable/interval_tree.rs
  • src/memtable/mod.rs
  • src/range.rs
  • src/range_tombstone.rs
  • src/range_tombstone_filter.rs
  • src/table/block/type.rs
  • src/table/inner.rs
  • src/table/mod.rs
  • src/table/multi_writer.rs
  • src/table/regions.rs
  • src/table/writer/mod.rs
  • src/tree/mod.rs
  • tests/range_tombstone.rs

polaz added 2 commits March 16, 2026 22:26
- Fix non-exhaustive BlockType match in table/util.rs (metrics feature)
- Add #[allow(dead_code)] for future-use API (CoveringRt, initialize_from)
- Add #[allow(clippy::unnecessary_box_returns)] for AVL tree functions
- Fix Option<&T> vs &Option<T> in interval_tree traversal functions
- Add missing # Panics and # Errors doc sections
- Add backticks in doc comments for identifiers
- Use let...else patterns where clippy suggests
- Fix redundant clone, pass-by-value, pub(crate) visibility
- Remove unfulfilled lint expectations (cfg-gated params)
- Validate start < end on SST decode (reject corrupted RT blocks)
- Validate start >= end in insert_range_tombstone (release-safe guard)
- Fix duplicate insert len accounting in IntervalTree
- Fix 0xFF prefix edge case in remove_prefix (return 0 for unrepresentable)
- Re-insert tombstones into active memtable when RT-only flush
  produces no valid SST (no index possible without KV items)
- Account for range tombstones in Memtable::is_empty
- Add cheap pre-check before collecting range tombstones in iteration
- Sort fwd_tombstones inside RangeTombstoneFilter::new for safety
- Harden test assertions (sealed count, table count)
Verify loaded block has BlockType::RangeTombstone before decoding,
consistent with filter block validation pattern.

coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (4)
src/table/iter.rs (1)

122-122: Prefer #[expect(clippy::too_many_arguments)] over #[allow(...)].

Per coding guidelines, #[expect] should be used instead of #[allow] because it warns when the suppression becomes unnecessary (e.g., if future refactoring reduces the argument count).

♻️ Suggested fix
-    #[allow(clippy::too_many_arguments)]
+    #[expect(clippy::too_many_arguments)]

As per coding guidelines: "Prefer #[expect(lint)] over #[allow(lint)]#[expect] warns when suppression becomes unnecessary"

src/abstract_tree.rs (1)

239-245: Consider #[expect] instead of #[warn] for lint attribute.

The #[warn(clippy::type_complexity)] attribute is unusual here — it would emit warnings rather than suppress them. If the intent is to suppress the lint, use #[expect(clippy::type_complexity)] or #[allow(clippy::type_complexity)].

src/range.rs (1)

101-101: Prefer #[expect(clippy::too_many_lines)] over #[allow(...)].

Per coding guidelines, use #[expect] so you get a warning when the suppression becomes unnecessary.

♻️ Suggested fix
-    #[allow(clippy::too_many_lines)]
+    #[expect(clippy::too_many_lines)]

As per coding guidelines: "Prefer #[expect(lint)] over #[allow(lint)]#[expect] warns when suppression becomes unnecessary"

src/tree/mod.rs (1)

778-816: Performance concern: Linear scan of all SST tables for range tombstone suppression.

The current implementation iterates over every table in every level (lines 802-813) to check range tombstones. For point reads, this could be expensive with many tables.

Consider filtering tables by key range before checking their tombstones, similar to how get_for_key filters tables in get_internal_entry_from_tables. The PR notes this as a known limitation ("linear SST scan for point-read suppression").

💡 Potential optimization sketch
-        for table in super_version
-            .version
-            .iter_levels()
-            .flat_map(|lvl| lvl.iter())
-            .flat_map(|run| run.iter())
-        {
+        // Only check tables whose key range contains the key
+        for table in super_version
+            .version
+            .iter_levels()
+            .flat_map(|lvl| lvl.iter())
+            .filter_map(|run| run.get_for_key(key))
+        {
             for rt in table.range_tombstones() {
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/table/util.rs`:
- Line 31: The attribute #[allow(clippy::too_many_arguments)] in
src/table/util.rs should be changed to #[expect(clippy::too_many_arguments)] to
comply with the repo's policy; locate the function or item annotated with
#[allow(clippy::too_many_arguments)] (the attribute on the item in util.rs) and
replace the allow attribute with #[expect(clippy::too_many_arguments)]
preserving any existing whitespace/comments and ensuring the attribute exactly
matches the expect form.
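For reference, the mechanics of the suggested change can be sketched with a plain rustc lint (the PR's cases use clippy lints, but `#[expect]` behaves the same way: it suppresses the lint while the lint actually fires, and raises `unfulfilled_lint_expectations` once the suppression becomes unnecessary). The function and lint below are purely illustrative:

```rust
// `#[expect]` acts like `#[allow]` as long as the lint fires; if the unused
// binding below were removed, the compiler would warn that the expectation
// is no longer fulfilled — exactly the safety net the guideline wants.
#[expect(unused_variables, reason = "placeholder until the field is wired up")]
fn sketch() -> u32 {
    let pending_field = 0u32; // fulfills the expectation
    42
}
```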

---

Nitpick comments:
In `@src/abstract_tree.rs`:
- Around line 239-245: The attribute on function flush_to_tables uses
#[warn(clippy::type_complexity)] which will emit warnings instead of suppressing
them; change it to #[expect(clippy::type_complexity)] (or
#[allow(clippy::type_complexity)] if you prefer permanent suppression) placed on
the flush_to_tables function declaration so the clippy type_complexity lint is
treated as expected/allowed for the flush_to_tables / flush_to_tables_with_rt
code path.

In `@src/range.rs`:
- Line 101: Replace the attribute #[allow(clippy::too_many_lines)] with
#[expect(clippy::too_many_lines)] to follow the coding guideline; locate the
attribute (attached to the function/module declared near the top of
src/range.rs) and change the lint suppression to an expectation so you receive a
warning if the suppression becomes unnecessary.

In `@src/table/iter.rs`:
- Line 122: Replace the #[allow(clippy::too_many_arguments)] attribute on the
function/impl annotated in this diff with #[expect(clippy::too_many_arguments)]
so the suppression still applies but will warn if it becomes unnecessary; locate
the exact attribute instance (the attribute decorating the function that
currently uses #[allow(clippy::too_many_arguments)]) and change the attribute
name only, leaving the lint and surrounding code unchanged.

In `@src/tree/mod.rs`:
- Around line 778-816: The is_suppressed_by_range_tombstones function currently
scans every SST table; limit that by first filtering tables whose key range
could contain the target key (like get_for_key / get_internal_entry_from_tables
does) before iterating range_tombstones(): for each table from
super_version.version.iter_levels().flat_map(...).flat_map(...), check the
table's key range (e.g., smallest/largest key or a key_range() accessor on the
Table) and skip tables where key is outside that range, then only call
table.range_tombstones() and rt.should_suppress; keep the existing
active_memtable/sealed_memtables checks and ensure you use the same table-range
helpers used elsewhere to avoid duplicating range logic.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: b5037095-c6a0-47c3-a880-07c3d65eaa33

📥 Commits

Reviewing files that changed from the base of the PR and between a3bc6c9 and 718f2ba.

📒 Files selected for processing (14)
  • .forge/16/analysis.md
  • src/abstract_tree.rs
  • src/active_tombstone_set.rs
  • src/memtable/interval_tree.rs
  • src/memtable/mod.rs
  • src/range.rs
  • src/range_tombstone.rs
  • src/range_tombstone_filter.rs
  • src/table/iter.rs
  • src/table/mod.rs
  • src/table/util.rs
  • src/table/writer/mod.rs
  • src/tree/mod.rs
  • tests/range_tombstone.rs
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/range_tombstone.rs


Copilot AI left a comment

Pull request overview

Adds MVCC-aware range tombstones to the LSM-tree, enabling efficient contiguous deletions via remove_range / remove_prefix, and persists/propagates tombstones through flush + compaction + recovery.

Changes:

  • Introduces core range-tombstone data structures (RangeTombstone, memtable IntervalTree, sweep-line ActiveTombstoneSet, and RangeTombstoneFilter).
  • Extends read/write paths to apply range-tombstone suppression (point reads, iterators, flush, compaction, and SST recovery).
  • Adds integration + regression tests for correctness across memtables/SSTs/compaction/recovery.

Reviewed changes

Copilot reviewed 23 out of 23 changed files in this pull request and generated 8 comments.

Show a summary per file
File Description
tests/range_tombstone.rs Integration tests covering range tombstone behavior across layers and lifecycle events.
src/abstract_tree.rs Adds remove_range API and default remove_prefix; extends flush to pass RTs through.
src/tree/mod.rs Implements remove_range; applies RT suppression on point reads; flush now supports RT distribution.
src/blob_tree/mod.rs Forwards range deletes through BlobTree to underlying index tree; flush now supports RT distribution.
src/memtable/mod.rs Stores RTs in a mutex-protected interval tree; updates emptiness + adds RT query/insert APIs.
src/memtable/interval_tree.rs New AVL interval tree for efficient RT overlap/suppression queries in memtables.
src/active_tombstone_set.rs New forward/reverse active RT tracking for sweep-line iteration filtering.
src/range_tombstone.rs Defines the RT type, ordering, and helper utilities (intersection, coverage checks).
src/range_tombstone_filter.rs New bidirectional iterator wrapper that suppresses KV entries covered by RTs.
src/range.rs Collects RTs from all layers; wraps merged iterator with RangeTombstoneFilter; adds table-skip optimization.
src/table/block/type.rs Adds new BlockType::RangeTombstone = 4.
src/table/regions.rs Extends table TOC parsing to include optional range_tombstones region.
src/table/inner.rs Stores recovered table RTs in Inner.
src/table/mod.rs Loads/decodes RT blocks on table open; exposes range_tombstones() accessor.
src/table/writer/mod.rs Adds RT collection and writes RT block + metadata count during table finish.
src/table/multi_writer.rs Distributes/clips RTs across rotated output tables (flush/compaction output).
src/table/util.rs Treats RT blocks like data/meta blocks for metrics counters; changes lint suppression.
src/table/iter.rs Changes lint suppression for constructor with many args.
src/compaction/flavour.rs Adds write_range_tombstones hook so compaction can propagate RTs to outputs.
src/compaction/worker.rs Collects RTs from input tables and propagates them to compaction outputs (with GC at last level).
src/lib.rs Exposes new internal modules for RT support.
.forge/16/known-gaps.md Documents known limitations and optimization opportunities for RT feature.
.forge/16/analysis.md Design/architecture analysis and upstream context for RT implementation choices.



@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (1)
src/table/mod.rs (1)

648-666: Consider preserving error context for easier debugging.

The map_err(|_| crate::Error::Unrecoverable) discards the original I/O error. While Unrecoverable is semantically appropriate for corrupted data, preserving context (e.g., which field failed, byte offset) could aid debugging in production.

💡 Optional: Add context to decode errors

If a more descriptive error variant exists or is added later (e.g., Error::RangeTombstoneCorrupted { offset, reason }), consider using it here to preserve the cursor position and failure reason.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/table/mod.rs` around lines 648 - 666, The current reads
(cursor.read_u16::<LE>(), cursor.read_exact(...), cursor.read_u64::<LE>()) map
all IO errors to crate::Error::Unrecoverable, discarding the original error and
offset; update these map_err calls to capture and wrap the original error and
context (e.g., which field: "start_len"/"start_buf"/"end_len"/"end_buf"/"seqno"
and the cursor position/byte offset) into a more informative Error variant or
into Unrecoverable with a message, so failures when reading start_len,
start_buf, end_len, end_buf, and seqno preserve the underlying io::Error and an
indication of the failing field/offset for easier debugging.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/table/mod.rs`:
- Around line 671-675: Add a unit test named
decode_range_tombstones_invalid_interval_returns_error() that builds a
serialized range tombstone blob matching the production format (little-endian
lengths and seqno fields), intentionally sets the interval so start >= end,
calls decode_range_tombstones(...) and asserts it returns
Err(crate::Error::Unrecoverable); reference the decode_range_tombstones function
and use the same byteorder::<LE> encoding for length/seqno fields so the
deserializer exercises the start >= end validation path.
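A self-contained sketch of such a test, using a hypothetical minimal decoder that mirrors the described little-endian layout (field names, error type, and the `decode_one` helper are assumptions for illustration; `std` from-le helpers stand in for `byteorder`):

```rust
use std::io::{Cursor, Read};

// Hypothetical decoder for one tombstone record:
// [start_len: u16 LE][start][end_len: u16 LE][end][seqno: u64 LE]
fn decode_one(cur: &mut Cursor<&[u8]>) -> Result<(Vec<u8>, Vec<u8>, u64), String> {
    let read_buf = |cur: &mut Cursor<&[u8]>| -> Result<Vec<u8>, String> {
        let mut len = [0u8; 2];
        cur.read_exact(&mut len)
            .map_err(|e| format!("len @ {}: {e}", cur.position()))?;
        let mut buf = vec![0u8; u16::from_le_bytes(len) as usize];
        cur.read_exact(&mut buf)
            .map_err(|e| format!("key @ {}: {e}", cur.position()))?;
        Ok(buf)
    };
    let start = read_buf(cur)?;
    let end = read_buf(cur)?;
    let mut seqno = [0u8; 8];
    cur.read_exact(&mut seqno)
        .map_err(|e| format!("seqno @ {}: {e}", cur.position()))?;
    // The validation path the suggested test exercises:
    if start >= end {
        return Err("invalid interval: start >= end".into());
    }
    Ok((start, end, u64::from_le_bytes(seqno)))
}
```

The suggested test would then feed a blob whose interval is inverted (start >= end) and assert the error, plus a well-formed blob as a control.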

---

Nitpick comments:
In `@src/table/mod.rs`:
- Around line 648-666: The current reads (cursor.read_u16::<LE>(),
cursor.read_exact(...), cursor.read_u64::<LE>()) map all IO errors to
crate::Error::Unrecoverable, discarding the original error and offset; update
these map_err calls to capture and wrap the original error and context (e.g.,
which field: "start_len"/"start_buf"/"end_len"/"end_buf"/"seqno" and the cursor
position/byte offset) into a more informative Error variant or into
Unrecoverable with a message, so failures when reading start_len, start_buf,
end_len, end_buf, and seqno preserve the underlying io::Error and an indication
of the failing field/offset for easier debugging.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 2d0210b5-a114-4709-840a-169886e4144c

📥 Commits

Reviewing files that changed from the base of the PR and between 718f2ba and 343f31c.

📒 Files selected for processing (1)
  • src/table/mod.rs

- Fix visible_at to use exclusive boundary (seqno < read_seqno),
  consistent with seqno_filter and Memtable::get conventions
- Validate decoded field lengths against remaining block data before
  allocating buffers (prevent OOM on corrupted blocks)
- Skip SST tables by key_range in point read RT suppression check
- Add code comments explaining #[allow] vs #[expect] for cfg-gated params
- Use #[expect] with reason for too_many_lines on create_range
- Fix remove_range doc to match no-op behavior for invalid intervals
- Update PR body to reflect RT-only flush semantics
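The exclusive-boundary convention in the first bullet can be sketched as follows. The type and method names follow the PR description (RangeTombstone, visible_at), but the bodies here are assumptions, not the actual implementation:

```rust
// Hypothetical sketch of the tombstone type; field names assumed.
struct RangeTombstone {
    start: Vec<u8>,
    end: Vec<u8>, // half-open [start, end)
    seqno: u64,
}

impl RangeTombstone {
    // Exclusive boundary: a tombstone written at `seqno` is visible only to
    // reads at strictly higher sequence numbers, matching seqno_filter and
    // Memtable::get conventions.
    fn visible_at(&self, read_seqno: u64) -> bool {
        self.seqno < read_seqno
    }

    // MVCC suppression: the entry must be covered by the half-open interval
    // AND be older than the tombstone, so newer writes survive.
    fn should_suppress(&self, key: &[u8], key_seqno: u64, read_seqno: u64) -> bool {
        self.visible_at(read_seqno)
            && key_seqno < self.seqno
            && self.start.as_slice() <= key
            && key < self.end.as_slice()
    }
}
```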

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (1)
src/table/util.rs (1)

31-32: ⚠️ Potential issue | 🟡 Minor

Use cfg_attr(..., expect(...)) instead of a permanent #[allow].

The feature-gated parameter count is exactly what cfg_attr(feature = "metrics", expect(...)) is for. Leaving #[allow] here suppresses the lint in the one build where it should still fire.

Suggested patch
-// NOTE: #[allow] not #[expect] because arg count changes with cfg(feature = "metrics")
-#[allow(clippy::too_many_arguments)]
+#[cfg_attr(
+    feature = "metrics",
+    expect(
+        clippy::too_many_arguments,
+        reason = "metrics adds the extra parameter; without that feature this stays at the lint threshold"
+    )
+)]
 pub fn load_block(

As per coding guidelines, "Prefer #[expect(lint)] over #[allow(lint)]; #[expect] warns when suppression becomes unnecessary."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/table/util.rs` around lines 31 - 32, Replace the permanent attribute
#[allow(clippy::too_many_arguments)] with a conditional expect so the lint still
fires when the metrics feature is disabled; specifically change the attribute to
#[cfg_attr(feature = "metrics", expect(clippy::too_many_arguments))] (or the
inverse depending on feature semantics) at the same item (the attribute applied
in src/table/util.rs) so the suppression only applies under the cfg where extra
arguments exist and the lint will warn in builds where it should.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/range.rs`:
- Around line 217-226: The table-cover check uses table.metadata.key_range.max()
which is an exclusive upper bound, but RangeTombstone::fully_covers() expects an
inclusive max; convert the exclusive upper bound to an inclusive one before
calling fully_covers (e.g., compute an inclusive_max from
table.metadata.key_range.max() using the appropriate predecessor/prev-key API or
method on the key type, handle empty ranges by skipping the check if min >
inclusive_max), and then call rt.fully_covers(table.metadata.key_range.min(),
inclusive_max) inside the all_range_tombstones.iter().any(...) closure while
keeping the visible_at and seqno comparison logic unchanged.

---

Duplicate comments:
In `@src/table/util.rs`:
- Around line 31-32: Replace the permanent attribute
#[allow(clippy::too_many_arguments)] with a conditional expect so the lint still
fires when the metrics feature is disabled; specifically change the attribute to
#[cfg_attr(feature = "metrics", expect(clippy::too_many_arguments))] (or the
inverse depending on feature semantics) at the same item (the attribute applied
in src/table/util.rs) so the suppression only applies under the cfg where extra
arguments exist and the lint will warn in builds where it should.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: f4d3251b-ba68-46f2-8418-870e509cdcef

📥 Commits

Reviewing files that changed from the base of the PR and between 343f31c and 3e23f57.

📒 Files selected for processing (7)
  • src/abstract_tree.rs
  • src/range.rs
  • src/range_tombstone.rs
  • src/table/iter.rs
  • src/table/mod.rs
  • src/table/util.rs
  • src/tree/mod.rs
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/table/iter.rs

Replace #[allow(clippy::too_many_arguments)] with
#[cfg_attr(feature = "metrics", expect(...))] so the suppression
only activates when the extra metrics parameter is compiled in.

Copilot AI left a comment

Pull request overview

This PR introduces MVCC-correct range tombstones to support efficient delete_range / delete_prefix semantics across memtables and SSTs, including persistence through flush/compaction/recovery and iterator-time suppression.

Changes:

  • Add remove_range / remove_prefix to AbstractTree, with memtable storage via an AVL interval tree and iterator-time filtering via a sweep-line RangeTombstoneFilter.
  • Persist range tombstones into SSTs via a new BlockType::RangeTombstone and load/validate them on table recovery.
  • Propagate/GC range tombstones during compaction and add an integration test suite for end-to-end semantics.

Reviewed changes

Copilot reviewed 23 out of 23 changed files in this pull request and generated 10 comments.

Show a summary per file
File Description
tests/range_tombstone.rs Integration tests for point reads, iteration, flush/compaction/recovery, prefix deletes, and RT-only flush behavior
src/abstract_tree.rs Adds public remove_range/remove_prefix API + flush plumbing for range tombstones
src/tree/mod.rs Flush writes RTs to SSTs and point-read path checks suppression by RTs across layers
src/blob_tree/mod.rs RT support for blob-tree flush path + forwards remove_range
src/memtable/mod.rs Stores RTs in a mutex-protected interval tree; exposes query/sorted export helpers
src/memtable/interval_tree.rs New AVL interval tree implementation for efficient overlap/suppression queries
src/range.rs Wraps merged iterators with RangeTombstoneFilter and adds table-skip optimization
src/range_tombstone.rs Defines RangeTombstone semantics, ordering, clipping, and helpers
src/range_tombstone_filter.rs Bidirectional sweep-line iterator filter for RT suppression
src/active_tombstone_set.rs Active set structures used by the iterator filter (forward + reverse)
src/table/block/type.rs Adds new on-disk block tag RangeTombstone = 4
src/table/writer/mod.rs Writes an RT block into SSTs and records range_tombstone_count in metadata
src/table/regions.rs TOC parsing extended to locate the range_tombstones section
src/table/inner.rs Stores loaded RTs in table inner state
src/table/mod.rs Loads/decodes RT blocks on recovery and exposes range_tombstones()
src/table/multi_writer.rs Attempts to distribute/clamp RTs across rotated output tables
src/table/util.rs Treats RT blocks like data/meta for metrics bookkeeping; lint suppression change
src/table/iter.rs Lint suppression change for constructor arg count
src/compaction/worker.rs Collects RTs from input tables and attempts to propagate/GC them
src/compaction/flavour.rs Adds write_range_tombstones hook and wires it to the table writer
src/lib.rs Exposes new internal modules for range tombstone support
.forge/16/known-gaps.md Documents limitations/known gaps for the feature
.forge/16/analysis.md Architecture analysis and design trade-offs documentation



@coderabbitai coderabbitai bot left a comment

🧹 Nitpick comments (2)
src/range.rs (2)

302-307: Strengthen the fast path to skip wrapping when no tombstone is visible.

A non-empty all_range_tombstones can still have zero entries visible at seqno, so this currently wraps with a no-op filter in those cases.

⚡ Proposed patch
-            // Fast path: skip filter wrapping when no range tombstones exist
-            if all_range_tombstones.is_empty() {
+            // Fast path: skip wrapping when no tombstone is visible for this read seqno
+            if !all_range_tombstones.iter().any(|rt| rt.visible_at(seqno)) {
                 Box::new(iter)
             } else {
                 Box::new(RangeTombstoneFilter::new(iter, all_range_tombstones, seqno))
             }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/range.rs` around lines 302 - 307, The fast-path should skip wrapping not
only when all_range_tombstones.is_empty() but also when none of the tombstones
are visible at the given seqno; update the conditional around Box::new(...) to
inspect all_range_tombstones for any entry visible at seqno (e.g. using the
visibility predicate/method used by RangeTombstoneFilter or by
filtering/map-checking the collection) and only construct
RangeTombstoneFilter::new(iter, all_range_tombstones, seqno) if at least one
tombstone is visible; otherwise return Box::new(iter) unchanged.

176-196: Remove redundant tombstone sorting in this path.

rts.sort() here is duplicate work: RangeTombstoneFilter::new already sorts the forward list (and builds a reverse-sorted copy). This adds avoidable O(n log n) cost per range open.

♻️ Proposed patch
             let all_range_tombstones = if rt_count > 0 {
                 let mut rts: Vec<RangeTombstone> = Vec::with_capacity(rt_count);

                 rts.extend(lock.version.active_memtable.range_tombstones_sorted());
                 for mt in lock.version.sealed_memtables.iter() {
                     rts.extend(mt.range_tombstones_sorted());
                 }
                 if let Some((mt, _)) = &lock.ephemeral {
                     rts.extend(mt.range_tombstones_sorted());
                 }
                 for table in lock
                     .version
                     .version
                     .iter_levels()
                     .flat_map(|lvl| lvl.iter())
                     .flat_map(|run| run.iter())
                 {
                     rts.extend(table.range_tombstones().iter().cloned());
                 }
-                rts.sort();
                 rts
             } else {
                 Vec::new()
             };
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/range.rs` around lines 176 - 196, The vector `rts` constructed for
`all_range_tombstones` is being sorted with `rts.sort()` redundantly;
`RangeTombstoneFilter::new` already sorts the forward list and produces the
reverse-sorted copy, so remove the `rts.sort()` call in the block that builds
`rts` (the code that extends from
`lock.version.active_memtable.range_tombstones_sorted()`, sealed memtables,
`lock.ephemeral`, and version runs) and leave the unsorted `rts` to be passed
into `RangeTombstoneFilter::new`; this avoids the extra O(n log n) work while
preserving correctness for `RangeTombstone` handling.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@src/range.rs`:
- Around line 302-307: The fast-path should skip wrapping not only when
all_range_tombstones.is_empty() but also when none of the tombstones are visible
at the given seqno; update the conditional around Box::new(...) to inspect
all_range_tombstones for any entry visible at seqno (e.g. using the visibility
predicate/method used by RangeTombstoneFilter or by filtering/map-checking the
collection) and only construct RangeTombstoneFilter::new(iter,
all_range_tombstones, seqno) if at least one tombstone is visible; otherwise
return Box::new(iter) unchanged.
- Around line 176-196: The vector `rts` constructed for `all_range_tombstones`
is being sorted with `rts.sort()` redundantly; `RangeTombstoneFilter::new`
already sorts the forward list and produces the reverse-sorted copy, so remove
the `rts.sort()` call in the block that builds `rts` (the code that extends from
`lock.version.active_memtable.range_tombstones_sorted()`, sealed memtables,
`lock.ephemeral`, and version runs) and leave the unsorted `rts` to be passed
into `RangeTombstoneFilter::new`; this avoids the extra O(n log n) work while
preserving correctness for `RangeTombstone` handling.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 36192350-8de8-4946-a95f-d0e3da61329a

📥 Commits

Reviewing files that changed from the base of the PR and between 3e23f57 and 954c044.

📒 Files selected for processing (3)
  • src/range.rs
  • src/table/iter.rs
  • src/table/util.rs

@codecov

codecov bot commented Mar 16, 2026

…unds

- Set range tombstones on MultiWriter before KV write loop in flush
  (Tree, BlobTree) so rotations carry RT metadata in earlier tables
- Propagate RTs before merge_iter loop in compaction worker
- Assert u16 bound on start/end key lengths in insert_range_tombstone
- Use exclusive seqno boundary in visible_at doc strings
- Strengthen iteration fast path: check visible_at not just is_empty
- Remove redundant sort (RangeTombstoneFilter::new sorts internally)
- Compaction with MultiWriter rotation preserves RTs across tables
- Table-skip optimization skips fully-covered tables
- BlobTree range tombstone insert, suppress, flush
- Invalid interval (start >= end) silently returns 0
- Multiple compaction rounds preserve range tombstones
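The half-open interval intersection used when clipping tombstones to an output table's key range (the MultiWriter rotation work above) can be sketched with a hypothetical helper named after the `intersect_opt` mentioned in the PR summary; the signature is assumed:

```rust
// Intersect a half-open tombstone [start, end) with a table's half-open key
// range, returning None when they are disjoint. Byte-slice ordering is the
// same lexicographic order the tree uses for user keys.
fn intersect_opt(rt: (&[u8], &[u8]), table: (&[u8], &[u8])) -> Option<(Vec<u8>, Vec<u8>)> {
    let start = rt.0.max(table.0);
    let end = rt.1.min(table.1);
    (start < end).then(|| (start.to_vec(), end.to_vec()))
}
```

A tombstone that only partially overlaps a rotated output table is thus clipped to the overlap, and a disjoint tombstone is dropped from that table entirely.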

Copilot AI left a comment

Pull request overview

Adds MVCC-aware range tombstones to enable efficient range/prefix deletion (remove_range / remove_prefix) with persistence across flush/compaction/recovery, plus iterator- and point-read suppression.

Changes:

  • Introduces core range-tombstone data structures (range tombstone model, sweep-line active sets, memtable interval tree, iterator filter).
  • Persists range tombstones in SSTs via a new BlockType::RangeTombstone and TOC section, propagating them through compaction.
  • Updates read path (point reads + range/prefix iteration) to suppress covered entries and adds comprehensive tests/docs.

Reviewed changes

Copilot reviewed 23 out of 23 changed files in this pull request and generated 6 comments.

Show a summary per file
File Description
tests/range_tombstone.rs New integration test suite for RT semantics across memtable/SST/compaction/recovery.
src/abstract_tree.rs Adds remove_range API + remove_prefix helper; flush now forwards RTs to table writers.
src/blob_tree/mod.rs Wires RT-aware flush through blob tree and forwards remove_range.
src/tree/mod.rs Implements remove_range for Tree, RT-aware flush-to-tables, and point-read suppression checks.
src/range.rs Collects RTs across layers for range/prefix iteration; adds RT filtering and table-skip optimization.
src/memtable/mod.rs Stores memtable RTs (mutex-protected interval tree) and exposes query/flush helpers.
src/memtable/interval_tree.rs New AVL interval tree for RT overlap and suppression queries in memtables.
src/active_tombstone_set.rs New forward/reverse sweep-line active tombstone sets for iteration.
src/range_tombstone.rs Defines RangeTombstone and helpers (visibility, suppression, clipping, coverage).
src/range_tombstone_filter.rs New bidirectional iterator wrapper that suppresses keys covered by visible RTs.
src/table/block/type.rs Adds BlockType::RangeTombstone = 4 to the SST block type enum.
src/table/writer/mod.rs Writes RT blocks + count metadata during SST creation; updates writer state to track RTs.
src/table/regions.rs Parses new "range_tombstones" TOC section into table regions.
src/table/mod.rs Loads/decodes RT blocks on table recovery and exposes Table::range_tombstones().
src/table/inner.rs Stores recovered table RTs in Table inner state.
src/table/multi_writer.rs Distributes/clips RTs across rotated output tables during flush/compaction.
src/table/util.rs Includes RT blocks in metrics accounting for block loads (cache + IO).
src/table/iter.rs Adjusts clippy lint gating for metrics-conditional argument count.
src/compaction/worker.rs Collects RTs from compaction inputs and propagates them to outputs (with GC eviction at last level).
src/compaction/flavour.rs Extends compaction flavour interface to pass RTs into table writers.
src/lib.rs Exposes new internal modules for RT support.
.forge/16/known-gaps.md Documents known limitations/optimization gaps for Issue #16 implementation.
.forge/16/analysis.md Architecture/design analysis and upstream context for range tombstones.


polaz added 2 commits March 17, 2026 00:52
- Write synthetic weak tombstone in RT-only tables to produce valid index
- Fix interval_tree pruning: use >= for exclusive seqno boundary
- Replace #[allow(dead_code)] with #[expect(dead_code, reason)] throughout
- Remove key_range filter from SST RT suppression (RT range may exceed table key range)

Copilot AI left a comment

Pull request overview

Adds MVCC-aware range tombstones to the LSM tree to support efficient delete_range/delete_prefix-style deletes that persist through flush/compaction/recovery and are honored across point reads and iteration.

Changes:

  • Introduces core range-tombstone data structures (tombstone model, memtable interval tree, active sweep sets, iterator filter).
  • Extends the read/write/format pipeline to persist range tombstones in SSTs and propagate/GC them during compaction.
  • Adds public APIs (remove_range, remove_prefix) and comprehensive integration tests for semantics and persistence.

Reviewed changes

Copilot reviewed 21 out of 22 changed files in this pull request and generated 9 comments.

Show a summary per file
File Description
tests/range_tombstone.rs New integration test suite covering MVCC visibility, iteration, flush/compaction/recovery, RT-only flush, GC, and BlobTree.
src/abstract_tree.rs Adds remove_range/remove_prefix API and a flush path that passes range tombstones into table writers.
src/active_tombstone_set.rs Implements forward/reverse active tombstone tracking for iterator-time suppression.
src/blob_tree/mod.rs Wires range tombstones through BlobTree flush and exposes remove_range.
src/compaction/flavour.rs Adds compaction flavour hook to propagate range tombstones into output writers.
src/compaction/worker.rs Collects and propagates input-table range tombstones during compaction, with last-level GC by watermark.
src/lib.rs Exposes new internal modules used by the range tombstone feature.
src/memtable/mod.rs Adds a mutex-protected interval tree for range tombstones and memtable APIs for insertion/query/flush extraction.
src/memtable/interval_tree.rs New AVL interval tree implementation backing memtable range-tombstone queries and table-skip decisions.
src/range.rs Integrates range tombstones into range iteration, including table-skip optimization and filter wrapping.
src/range_tombstone.rs Defines RangeTombstone semantics (half-open intervals, MVCC visibility) plus helpers for clipping/coverage.
src/range_tombstone_filter.rs Adds a bidirectional iterator wrapper that suppresses KVs covered by active range tombstones.
src/table/block/type.rs Adds BlockType::RangeTombstone = 4 for persisted SST blocks.
src/table/inner.rs Stores decoded per-table range tombstones in table inner state.
src/table/iter.rs Adjusts clippy expectation gating for metrics feature builds.
src/table/mod.rs Loads/decodes the SST range tombstone block and exposes Table::range_tombstones().
src/table/multi_writer.rs Distributes/clips range tombstones across rotated output tables (including RT-only output).
src/table/regions.rs Adds TOC parsing support for the "range_tombstones" section handle.
src/table/util.rs Includes RangeTombstone blocks in metrics accounting and gates clippy expects by feature.
src/table/writer/mod.rs Persists range tombstones in an SST section and handles RT-only tables by emitting a sentinel for index validity.
src/tree/mod.rs Adds remove_range and point-read suppression checks against range tombstones across memtable/sealed/SST.
.gitignore Ignores .forge/ directory.

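The interval tree summarized above for src/memtable/interval_tree.rs can be sketched in simplified, unbalanced form (the PR's tree is AVL-balanced; balancing is omitted here, and all names are illustrative). Nodes are ordered by interval start, and a per-node `max_end` lets whole subtrees be pruned during a point query:

```rust
struct Node {
    start: Vec<u8>,
    end: Vec<u8>, // half-open [start, end)
    max_end: Vec<u8>, // max `end` in this subtree, enables pruning
    left: Option<Box<Node>>,
    right: Option<Box<Node>>,
}

fn insert(node: Option<Box<Node>>, start: &[u8], end: &[u8]) -> Box<Node> {
    match node {
        None => Box::new(Node {
            start: start.to_vec(),
            end: end.to_vec(),
            max_end: end.to_vec(),
            left: None,
            right: None,
        }),
        Some(mut n) => {
            if end > n.max_end.as_slice() {
                n.max_end = end.to_vec();
            }
            if start < n.start.as_slice() {
                n.left = Some(insert(n.left.take(), start, end));
            } else {
                n.right = Some(insert(n.right.take(), start, end));
            }
            n
        }
    }
}

// Does any stored interval cover `key`? Subtrees whose max_end <= key cannot
// contain a covering interval; right subtrees are skipped when key < start.
fn covers(node: &Option<Box<Node>>, key: &[u8]) -> bool {
    let Some(n) = node else { return false };
    if key >= n.max_end.as_slice() {
        return false;
    }
    if n.start.as_slice() <= key && key < n.end.as_slice() {
        return true;
    }
    if covers(&n.left, key) {
        return true;
    }
    key >= n.start.as_slice() && covers(&n.right, key)
}
```

With AVL balancing on top, this yields the O(log n) memtable suppression queries the PR describes.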

polaz added 3 commits March 17, 2026 04:08
- Convert #[allow(clippy::unnecessary_box_returns)] to #[expect] with
  reasons in interval_tree.rs AVL rotation/balance/insert functions
- Convert #[allow(dead_code)] to #[expect] on CoveringRt::covers_table
- Remove unfired #[warn] attributes on flush_to_tables, finish
- Optimize query_suppression to use early-exit tree traversal instead
  of collecting into Vec — avoids allocation on the hot read path
- Add comment explaining range_tombstones.clone() necessity in flush
clippy::unnecessary_box_returns doesn't fire on functions that both
accept and return Box<Node>, so #[expect] was unfulfilled and failed
CI with --all-features -D warnings.

Copilot AI left a comment

Pull request overview

Adds end-to-end range tombstone (delete-range / delete-prefix) support to the LSM tree, including memtable representation, iterator-time suppression, SST persistence, and compaction propagation.

Changes:

  • Introduces core range-tombstone types and sweep-line filtering for forward/reverse iteration.
  • Persists range tombstones in SSTs via a dedicated range_tombstones TOC section and propagates them through compaction.
  • Extends AbstractTree with remove_range / remove_prefix and adds broad integration test coverage.

Reviewed changes

Copilot reviewed 21 out of 21 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
tests/range_tombstone.rs New integration tests for suppression, MVCC, persistence, compaction, and prefix deletes
src/abstract_tree.rs Adds remove_range / remove_prefix and flush plumbing for RT persistence
src/tree/mod.rs Implements remove_range and point-read suppression across memtables/SSTs
src/blob_tree/mod.rs Wires range tombstones through BlobTree’s flush path + API
src/memtable/mod.rs Stores/queries RTs in memtable via a mutex-protected interval tree
src/memtable/interval_tree.rs New AVL interval tree for RT overlap/suppression queries
src/range.rs Adds RT collection + iteration filter; adds table-skip based on covering RTs
src/range_tombstone.rs Defines RangeTombstone, helpers (intersection/coverage), and upper-bound utility
src/range_tombstone_filter.rs New bidirectional iterator wrapper to suppress covered entries
src/active_tombstone_set.rs New active-set trackers for sweep-line suppression (fwd + rev)
src/table/block/type.rs Adds BlockType::RangeTombstone = 4
src/table/regions.rs Parses range_tombstones TOC section
src/table/inner.rs Stores decoded per-table RTs
src/table/mod.rs Loads/decodes RT blocks on open; exposes range_tombstones()
src/table/writer/mod.rs Writes RT blocks; supports RT-only table creation via a synthetic sentinel
src/table/multi_writer.rs Distributes/clips RTs across rotated output tables
src/table/util.rs Counts RT blocks in metrics as data/meta blocks
src/table/iter.rs Adjusts clippy::too_many_arguments suppression to only apply under metrics
src/compaction/worker.rs Propagates RTs from input tables into compaction output; evicts below GC watermark at bottom
src/compaction/flavour.rs Adds compaction hook for writing RTs to output
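The MVCC suppression rule these reviews describe (a tombstone hides only older, snapshot-visible values over a half-open [start, end) interval) can be sketched as follows. This is an illustrative stand-alone sketch; the struct name mirrors the PR, but the fields and `covers` helper are assumptions, not the PR's actual code:

```rust
// Hypothetical sketch of MVCC-correct range tombstone suppression.
struct RangeTombstone {
    start: Vec<u8>, // inclusive
    end: Vec<u8>,   // exclusive
    seqno: u64,
}

impl RangeTombstone {
    /// A tombstone suppresses an item iff the key falls in [start, end),
    /// the tombstone is newer than the item, and the tombstone itself is
    /// visible to the reading snapshot.
    fn covers(&self, key: &[u8], item_seqno: u64, snapshot: u64) -> bool {
        self.start.as_slice() <= key
            && key < self.end.as_slice()
            && self.seqno > item_seqno
            && self.seqno <= snapshot
    }
}

fn main() {
    let rt = RangeTombstone { start: b"b".to_vec(), end: b"d".to_vec(), seqno: 10 };
    assert!(rt.covers(b"c", 5, 20));   // in range, older item, visible RT
    assert!(!rt.covers(b"d", 5, 20));  // end bound is exclusive
    assert!(!rt.covers(b"c", 15, 20)); // newer value is not suppressed
    assert!(!rt.covers(b"c", 5, 9));   // RT not visible at this snapshot
}
```

The last two assertions capture the "MVCC-correct" claim: a delete never hides values written after it, and a snapshot taken before the delete still sees the old data.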

polaz added 2 commits March 17, 2026 05:48
…rsized keys

- `upper_bound_exclusive` now returns `Option<UserKey>`, returning
  `None` when key length is at u16::MAX (appending 0x00 would overflow
  the u16 key-length invariant used in on-disk RT block encoding)
- Multi-writer clipping handles None by writing tombstones unclipped
  when the table's last key is at max length, avoiding corrupt RT blocks
- RT block serialization uses `u16::try_from(len)` instead of `as u16`
  cast, returning an error instead of silently truncating oversized keys
- Add test for upper_bound_exclusive with max-length key
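The steps above can be sketched as follows; `upper_bound_exclusive` matches the commit's description, but this body is an assumption about the implementation, not the PR's code:

```rust
// The smallest key strictly greater than `key` in bytewise order is
// `key ++ 0x00`. If the key is already at the u16::MAX length limit used
// by the on-disk RT block encoding, appending would break that invariant,
// so None is returned and the caller writes the tombstone unclipped.
fn upper_bound_exclusive(key: &[u8]) -> Option<Vec<u8>> {
    if key.len() >= u16::MAX as usize {
        return None;
    }
    let mut out = Vec::with_capacity(key.len() + 1);
    out.extend_from_slice(key);
    out.push(0x00);
    Some(out)
}

fn main() {
    assert_eq!(upper_bound_exclusive(b"abc"), Some(b"abc\x00".to_vec()));
    assert_eq!(upper_bound_exclusive(&vec![0u8; u16::MAX as usize]), None);
}
```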
Copilot AI left a comment

Reviewed changes

Copilot reviewed 21 out of 21 changed files in this pull request and generated 6 comments.


Copilot AI left a comment

Reviewed changes

Copilot reviewed 21 out of 21 changed files in this pull request and generated 2 comments.


Two correctness fixes for range tombstone handling:

1. Flush must NOT clip RTs to the output table's KV key range. During
   memtable flush, the KV data only reflects keys from that memtable,
   but range tombstones must cover keys in older SSTs outside this
   range. Clipping would cause deleted keys in older tables to
   "resurrect" after flush. Added `clip_range_tombstones` flag to
   MultiWriter (true only for compaction where input tables are consumed).

2. RT-only tables (no KV data, only range tombstones) now set
   `meta.first_key = min(rt.start)` and `meta.last_key = max(rt.end)`
   so the table's key range conservatively covers all its tombstones.
   Previously the key range was derived solely from the synthetic
   sentinel key, making the table invisible to compaction overlap
   checks and potentially preventing GC.
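The flush-vs-compaction clipping rule described in this commit can be sketched as follows. The `intersect_opt` helper name comes from the PR summary, but this signature and body are assumptions for illustration:

```rust
// During compaction every input table is consumed, so each output table
// may safely clip its tombstones to its own key range (clip=true).
// During flush, tombstones must stay unclipped (clip=false), because
// they can cover keys that live only in older SSTs.
#[derive(Debug, PartialEq)]
struct Rt {
    start: Vec<u8>, // inclusive
    end: Vec<u8>,   // exclusive
}

/// Intersect a tombstone with a table's [lo, hi) key range; None if disjoint.
fn intersect_opt(rt: &Rt, lo: &[u8], hi: &[u8]) -> Option<Rt> {
    let start = rt.start.clone().max(lo.to_vec());
    let end = rt.end.clone().min(hi.to_vec());
    if start < end { Some(Rt { start, end }) } else { None }
}

fn main() {
    let rt = Rt { start: b"b".to_vec(), end: b"f".to_vec() };
    // Compaction: clipped to the output table's own key range.
    assert_eq!(
        intersect_opt(&rt, b"d", b"z"),
        Some(Rt { start: b"d".to_vec(), end: b"f".to_vec() })
    );
    // Disjoint from this output table: dropped (safe only in compaction).
    assert_eq!(intersect_opt(&rt, b"g", b"z"), None);
}
```

The second assertion is exactly why clipping is unsafe at flush time: a tombstone disjoint from the memtable's KV range would be dropped entirely, resurrecting deleted keys in older tables.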
Copilot AI left a comment

Reviewed changes

Copilot reviewed 21 out of 21 changed files in this pull request and generated 1 comment.


In flush mode, RTs disjoint from the table's KV range (e.g.,
delete_range targeting keys only in older SSTs) were silently dropped
by the overlap check. Remove the filter so all RTs are persisted in
at least one output table during flush.
Regression tests for the flush clipping fix: verify that a range
tombstone targeting keys only in older SSTs (disjoint from the
memtable's KV range) survives flush and subsequent compaction.
Copilot AI left a comment

Reviewed changes

Copilot reviewed 21 out of 21 changed files in this pull request and generated 1 comment.


…collision

SequenceNumberCounter starts at 0 (pre-increment), so seqno=0 is a
valid user seqno. The synthetic weak tombstone in RT-only tables used
seqno=0, which could collide with a real KV entry at the same user key,
making merge order nondeterministic. Use the max RT seqno instead — by
the time the RT was written, the counter had already passed this value.
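The sentinel choice described in this commit can be sketched as follows (illustrative names; the PR's actual sentinel construction lives in src/table/writer/mod.rs and may differ):

```rust
// For an RT-only table, the synthetic sentinel key is the minimum RT
// start, and its seqno is the maximum RT seqno: the global counter has
// already passed that value, so no real KV entry can share the same
// (user_key, seqno) pair — unlike seqno 0, which is a valid user seqno.
struct Rt {
    start: Vec<u8>,
    seqno: u64,
}

/// Pick (key, seqno) for the synthetic sentinel of an RT-only table.
fn sentinel(rts: &[Rt]) -> Option<(Vec<u8>, u64)> {
    let key = rts.iter().map(|rt| rt.start.clone()).min()?;
    let seqno = rts.iter().map(|rt| rt.seqno).max()?;
    Some((key, seqno))
}

fn main() {
    let rts = vec![
        Rt { start: b"m".to_vec(), seqno: 7 },
        Rt { start: b"a".to_vec(), seqno: 3 },
    ];
    assert_eq!(sentinel(&rts), Some((b"a".to_vec(), 7)));
}
```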
Copilot AI left a comment

Reviewed changes

Copilot reviewed 21 out of 21 changed files in this pull request and generated 3 comments.

Comments suppressed due to low confidence (1)

src/abstract_tree.rs:137

  • flush() calls register_tables(&tables, ...) unconditionally when flush_to_tables_with_rt returns Some, even if tables.is_empty(). If a flush ever produces no tables (e.g. an empty sealed memtable / all items dropped), this would still delete the sealed memtables via sealed_ids, potentially losing data. Consider guarding register_tables behind !tables.is_empty() (or making flush_to_tables_with_rt return Ok(None) when nothing is written).
            self.register_tables(
                &tables,
                blob_files.as_deref(),
                None,
                &sealed_ids,
                seqno_threshold,
            )?;
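The guard this comment suggests could be sketched as follows (a hypothetical wrapper; `register_tables` itself and the surrounding types are elided):

```rust
// Only register output tables (and thereby retire the sealed memtables)
// when the flush actually produced tables; otherwise keep the sealed
// memtables so their data is not lost.
fn should_register(tables: &[String], sealed_ids: &[u64]) -> bool {
    if tables.is_empty() {
        // Nothing was written — skipping registration keeps sealed_ids alive.
        let _ = sealed_ids;
        return false;
    }
    // register_tables(&tables, ..., &sealed_ids, ...) would run here.
    true
}

fn main() {
    assert!(!should_register(&[], &[1, 2]));
    assert!(should_register(&["t1".into()], &[1, 2]));
}
```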

Comment on lines +443 to +447
// Set range tombstones BEFORE writing KV items so that if MultiWriter
// rotates to a new table during the write loop, earlier tables already
// carry the RT metadata.
table_writer.set_range_tombstones(range_tombstones);

Comment on lines +424 to +432
// Use the minimum RT start for the sentinel key. The sentinel
// seqno must not collide with real user data: SequenceNumberCounter
// starts at 0 (pre-increment), so seqno 0 is a valid user seqno.
// Use max RT seqno instead — by the time the RT was written, the
// counter has already passed this value, so no real KV entry will
// share this (user_key, seqno) pair.
self.write(InternalValue::new_weak_tombstone(
    start.clone(),
    max_rt_seqno,
Comment on lines +403 to 407
table_writer.set_range_tombstones(range_tombstones);

for item in stream {
    table_writer.write(item?)?;
}
Development

Successfully merging this pull request may close these issues.

feat: range tombstones / delete_range / delete_prefix
