[wip] Read-triggered compactions#14426

Draft
joshkang97 wants to merge 1 commit into facebook:main from joshkang97:read_triggered_compactions

Conversation


@joshkang97 joshkang97 commented Mar 4, 2026

Summary

Add read-triggered compaction, a new feature that reduces read amplification by compacting SST files that receive high read traffic. When an SST file's read frequency (num_reads_sampled / file_size) exceeds a configurable threshold, it is marked for compaction to a lower level.
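The ratio test described above can be sketched in a few lines of C++. Names like `FileMeta` and `MaybeMarkForCompaction` are invented for illustration; this is a simplified sketch, not the patch's actual code:

```cpp
#include <cstdint>

// Illustrative sketch of the marking rule: an SST file whose sampled read
// frequency (reads per byte) exceeds the configured threshold is flagged
// for compaction. Field and function names are invented for this sketch.
struct FileMeta {
  uint64_t num_reads_sampled;         // reads observed on this SST file
  uint64_t file_size;                 // in bytes
  bool marked_for_compaction = false;
};

// threshold <= 0 disables the feature, matching the default of 0.
void MaybeMarkForCompaction(FileMeta& f, double threshold) {
  if (threshold <= 0.0 || f.file_size == 0) return;
  double reads_per_byte =
      static_cast<double>(f.num_reads_sampled) / f.file_size;
  if (reads_per_byte > threshold) f.marked_for_compaction = true;
}
```

For example, a 4 MB file with 5000 sampled reads has roughly 1.2e-3 reads/byte, so any threshold below that ratio would mark it.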

The feature introduces two new options: a CF option read_triggered_compaction_threshold (default 0, disabled) and a DB option max_periodic_compaction_trigger_seconds (default 43200s) that controls how often the background thread re-evaluates compaction scores on quiet databases. Both options are dynamically changeable.

Lowering max_periodic_compaction_trigger_seconds adds some overhead, but it is generally minimal, so running this check every couple of minutes in a production environment seems reasonable.

Key changes

  • New CF option read_triggered_compaction_threshold (advanced_options.h): When positive, files with reads_per_byte > threshold are marked for compaction. Files at the last non-empty level are skipped (bottommost compaction handles those separately). Marked files are sorted by hotness (reads_per_byte descending).
  • New DB option max_periodic_compaction_trigger_seconds (options.h): Replaces the hardcoded 12-hour ceiling in ComputeTriggerCompactionPeriod(). Essential for read-triggered compaction on quiet DBs since there are no writes to trigger score re-evaluation.
  • Leveled compaction picker (compaction_picker_level.cc): Adds read-triggered as the lowest-priority compaction reason in SetupInitialFiles(), using the existing PickFileToCompact helper.
  • Universal compaction picker (compaction_picker_universal.cc): Adds PickReadTriggeredCompaction as lowest priority. Refactors shared "find output level + compute overlapping inputs + create Compaction" logic from both PickDeleteTriggeredCompaction and PickReadTriggeredCompaction into BuildCompactionToNextLevel, handling both single-level and multi-level universal cases.
  • Periodic trigger integration (db_impl.cc): TriggerPeriodicCompaction now also fires for CFs with read_triggered_compaction_threshold > 0, even without time-based compaction configured.
  • Stress test & db_bench support: Both db_stress and db_bench support the new options. db_crashtest.py randomly enables read-triggered compaction and sets a short periodic trigger interval when enabled.
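To make the selection order from the first bullet concrete, here is a self-contained sketch (all names invented, and simplified relative to the actual pickers) of collecting marked files, skipping the last non-empty level, and sorting by hotness descending:

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical illustration of read-triggered candidate selection:
// files above the last non-empty level whose reads_per_byte exceeds the
// threshold are collected and sorted hottest-first. Not the patch's code.
struct SstFile {
  int level;
  uint64_t num_reads_sampled;
  uint64_t file_size;
  double ReadsPerByte() const {
    return file_size ? static_cast<double>(num_reads_sampled) / file_size
                     : 0.0;
  }
};

std::vector<SstFile> PickReadTriggeredCandidates(
    const std::vector<SstFile>& files, double threshold,
    int last_nonempty_level) {
  std::vector<SstFile> marked;
  for (const auto& f : files) {
    // Skip the last non-empty level: bottommost compaction handles it.
    if (f.level >= last_nonempty_level) continue;
    if (f.ReadsPerByte() > threshold) marked.push_back(f);
  }
  std::sort(marked.begin(), marked.end(),
            [](const SstFile& a, const SstFile& b) {
              return a.ReadsPerByte() > b.ReadsPerByte();
            });
  return marked;
}
```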

Test Plan

Unit tests:

  • compaction_picker_test — 8 new tests: ReadTriggeredCompactionDisabled, ReadTriggeredCompactionBelowThreshold, ReadTriggeredCompactionAboveThreshold, NeedsCompactionReadTriggered, ReadTriggeredPicksFile, UniversalReadTriggeredCompaction, ReadTriggeredSkipsLastLevel, UniversalReadTriggeredNoPickWhenNotMarked
  • db_compaction_test — ReadTriggeredCompaction integration test verifying end-to-end behavior with sync points
  • Stress test coverage

Benchmark (db_bench):

Setup: 5M keys (100B values, 16B keys), leveled compaction, 5 levels, 4MB target file size. DB fully compacted, then 2M overlapping keys written without compaction to create L0/L1 overlap (82 files, ~294MB).

LSM shape change during readrandom with read-triggered compaction:

BEFORE: L0=9 files (15MB), L1=4 (16MB), L2=20 (69MB), L3=49 (194MB) — 82 files, 294MB
AFTER:  L3=66 files (223MB)
| Benchmark | Config | avg ops/s | % change |
| --- | --- | --- | --- |
| readrandom (8 threads, 5M reads) | baseline (threshold=0) | 1,086,965 | |
| readrandom (8 threads, 5M reads) | threshold=0.000001, trigger=5s | 1,453,697 | +33.7% |

@meta-cla meta-cla bot added the CLA Signed label Mar 4, 2026

github-actions bot commented Mar 4, 2026

✅ clang-tidy: No findings on changed lines

Completed in 780.9s.

@joshkang97 force-pushed the read_triggered_compactions branch from b2abb89 to 2e89ff5 on March 5, 2026 00:54

meta-codesync bot commented Mar 5, 2026

@hx235 has imported this pull request. If you are a Meta employee, you can view this in D95307362.


hx235 commented Mar 5, 2026

Since this is a WIP requested review, I assume early directional feedback is desired.

My understanding is that excessive tombstone scanning was the main near-term motivation—please correct me if that's no longer true.

If still true, how does the proposed read-triggered compaction address this? Does it mainly reduce IOs by decreasing the number of files (levels) to open post-compaction? What about the case where many tombstones exist in a few files? Does it help drop the tombstones earlier, or somehow allow reading fewer tombstones during a scan, with DTC? Any data to show that? Edit: it also appears to me that the read frequency is counted per file open, not per entry of a file. Does that mean it treats "one scan over 100 tombstones in file 10.sst" as less urgent to compact than "100 scans that each touch 11.sst once" (for simplicity, assuming kFileReadSampleRate = 1)?

In addition, I think it's important to highlight the trade-off between read and write amplification across varying degrees of overlapping keys and thresholds during the WIP phase. This would tell us the cost of doing more compaction, and could lead to better guidance or automation for the threshold=..../trigger=....s settings, which are apparently hard to tune.

@pdillinger

(In addition to the other feedback...)

Please update description in light of #14396 landing.

max_periodic_compaction_trigger_seconds

"Periodic compaction" is a specific kind of compaction, but this trigger is used for several other things, now one more. I would prefer a name that acknowledges the more general nature of it. Perhaps max_compaction_trigger_wakeup_seconds.

With this feature increasing the importance of that wakeup period on a quiet DB, it might be worth the complexity to fix the "known limitation" about dynamic option changes in the implementation of that PR.

@joshkang97

@hx235 Thanks for the feedback.

I originally started this work following what LevelDB originally implemented for generic read-triggered compaction (later removed from RocksDB), which also used reads_per_byte.

My rough understanding is that tombstone reads are more problematic on a "quiet DB" than on an active one, since an active one is constantly compacting and tombstones get removed within some reasonable time. Once a DB becomes quiet, the LSM shape is stable and there may be many tombstones that never reach the last level to get cleaned up. But with read-triggered compactions, even on a "quiet DB", the hot ranges of the LSM will collapse into the last level.

It also doesn't only benefit tombstone reads. For example, if you have hot keys that exist only in the last level, each level above is still searched and its bloom filter queried.

Does that mean it treats "one scan over 100 tombstones in file 10.sst" as less urgent to compact than "100 scans that each touch 11.sst once"

Yes, the current implementation does that, because sampled_reads is only incremented on Get, MultiGet, and Seek. But I am thinking of reworking the heuristic so that it is some f(num_missed_reads, tombstone_reads, file_size), as files scoring high on those are the main ones that benefit from being pushed down.
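Purely as an illustration of what such a reworked heuristic could look like, here is one possible shape; the weights and the combining function below are guesses for the sketch, not anything in the patch:

```cpp
#include <cstdint>

// Hypothetical per-byte score of the form f(num_missed_reads,
// tombstone_reads, file_size). Misses and tombstone reads are the accesses
// that benefit most from pushing the file down, so they are weighted above
// plain sampled reads. Weights are illustrative only.
double ReadTriggeredScore(uint64_t num_missed_reads,
                          uint64_t tombstone_reads,
                          uint64_t file_size,
                          double miss_weight = 4.0,
                          double tombstone_weight = 8.0) {
  if (file_size == 0) return 0.0;
  double weighted = miss_weight * static_cast<double>(num_missed_reads) +
                    tombstone_weight * static_cast<double>(tombstone_reads);
  return weighted / static_cast<double>(file_size);  // per-byte, as before
}
```

Normalizing by file size keeps the score comparable to the existing reads_per_byte threshold, while the weights let tombstone-heavy files rise to the top of the marked list.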


hx235 commented Mar 5, 2026

My rough understanding is that tombstone reads are more problematic on a "quiet DB" than on an active one, since an active one is constantly compacting and tombstones get removed within some reasonable time. Once a DB becomes quiet, the LSM shape is stable and there may be many tombstones that never reach the last level to get cleaned up. But with read-triggered compactions, even on a "quiet DB", the hot ranges of the LSM will collapse into the last level.

Can I understand this as an upgraded periodic compaction (time triggered) but, instead of compacting all files meeting the duration limitation, it tries to compact just the ones being read the most?

One important aspect I found missing in the WIP (in addition to "the read frequency" not yet counting tombstone reads per file) is how the read-hotness gets relayed from compaction input to output when we can't have a one-shot compaction directly to the bottommost level, or to the level where we can drop stale versions (more likely for the current leveled compaction, since it only touches two levels per compaction). It seems like the WIP resets the file hotness between input and output files. Do reads need to suffer at least once per compaction output at each new level until we can drop stale entries?

Somewhat related to the above is another interesting aspect where more data can give early insight: how fast those compactions can actually improve scans (assuming repeated scans, as in retry cases) under different interleaving patterns of deletes and puts (e.g., queued vs. random) scattered across levels (e.g., large vs. small distance between delete and put), in both CPU-bound and IO-bound cases. Edit: for example, when many levels of pushing down are needed (e.g., 7 levels), or much other data has to be included due to overlap with the selected input, making it a giant compaction, how soon can reads after the triggering read benefit?

Edit: Note that the readrandom in your benchmark is a point-read workload. While it may demonstrate some of the issues faced by range scans, it would be more direct to benchmark range scans, which could expose more issues.

@joshkang97

Can I understand this as an upgraded periodic compaction (time triggered) but, instead of compacting all files meeting the duration limitation, it tries to compact just the ones being read the most?

Yes

One important aspect I found missing in the WIP (in addition to "the read frequency" not yet counting tombstone reads per file) is how the read-hotness gets relayed from compaction input to output when we can't have a one-shot compaction directly to the bottommost level, or to the level where we can drop stale versions (more likely for the current leveled compaction, since it only touches two levels per compaction). It seems like the WIP resets the file hotness between input and output files. Do reads need to suffer at least once per compaction output at each new level until we can drop stale entries?

For simplicity we do not pass the read hotness on to the output file. This is because, for it to be accurate, we'd also need to account for the additional reads that no longer apply once the file is compacted with the next level.
