Conversation
Since this is a WIP requested review, I assume early directional feedback is desired. My understanding is that excessive tombstone scanning was the main near-term motivation (please correct me if that's no longer true). If it still is, how does the proposed read-triggered compaction address it? Does it mainly reduce IOs by decreasing the number of files (levels) to open post-compaction? What about the case where many tombstones exist in a few files? Does it help drop the tombstones earlier, or somehow allow reading fewer tombstones during a scan, as with DTC? Any data to show that?

Edit: it also appears to me that the read frequency is counted per file open, not per entry of a file. Does that mean it treats "one scan over 100 tombstones in file 10.sst" as less urgent to compact than "100 scans that each touch 11.sst once" (for simplicity assuming kFileReadSampleRate = 1)?

In addition, I think it's important to highlight the trade-off between read and write amplification across varying degrees of overlapping keys and thresholds during the WIP phase. This informs us of the cost of doing more compaction, and could lead to better guidance or automation for the threshold=..../trigger=....s settings that are apparently hard to tune.
(In addition to the other feedback...) Please update description in light of #14396 landing.
"Periodic compaction" is a specific kind of compaction, but this trigger is used for several other things, now one more. I would prefer a name that acknowledges the more general nature of it. Perhaps With this feature increasing the importance of that wakeup period on a quiet DB, it might be worth the complexity to fix the "known limitation" about dynamic option changes in the implementation of that PR. |
@hx235 Thanks for the feedback. I originally started this work from what LevelDB originally implemented as a generic read-triggered compaction (later removed from RocksDB), which also utilized reads_per_byte. My rough understanding is that tombstone reads are more problematic on a "quiet DB" than an active one, since an active one is constantly compacting and tombstones get removed within some reasonable time. Once a DB becomes quiet, the LSM shape is stable and there may be many tombstones that never reach the last level to get cleaned up. But with read-triggered compactions, even for a "quiet DB", the hot ranges of the LSM will collapse into the last level. It also doesn't only benefit tombstone reads: for example, if you have hot keys that exist only in the last level, each level above is searched and its bloom filter queried.
Yes, the current implementation does do that, because sampled_reads is only incremented on get, multiget, and seek. But I am thinking of reworking the heuristic so that it is some …
Can I understand this as an upgraded periodic compaction (time-triggered) that, instead of compacting all files meeting the duration limit, tries to compact just the ones being read the most?

One important aspect I found missing in the WIP (in addition to "the read frequency" not yet counting tombstone reads per file) is how the read-hotness gets relayed from compaction input to output when we can't have a one-shot compaction directly to the bottommost level, or to a level where we can drop stale versions (more likely for the current level compaction, since it only touches two levels per compaction). It seems like the WIP resets the file hotness between input and output files. Do we need reads to suffer at least once for each compaction output in a new level, until we can drop stale entries?

Somewhat related is another interesting aspect where more data could give early insight: how fast these compactions can actually improve scans (assuming repeated scans, as in retry cases) under different interleaving patterns of deletions and puts (e.g., queued vs. random) scattered across different levels (e.g., far apart vs. close distance between delete and put), in both CPU-bound and IO-bound cases.

Edit: for example, in the case where many levels of pushing down are needed (like 7), or much other data has to be included due to overlapping with the selected input, making it a giant compaction, how immediately can the read after the triggering read benefit?

Edit: Note that the readrandom in your benchmark is a point read. While it may demonstrate some issues faced by range scans, it would be more direct to show range scans, potentially exposing more issues.
Yes
For simplicity we do not pass the read hotness onto the output file. This is because, for it to be accurate, we'd also need to account for additional reads that would no longer apply once compacted with the next level.
Summary
Add read-triggered compaction, a new feature that reduces read amplification by compacting SST files that receive high read traffic. When an SST file's read frequency (`num_reads_sampled / file_size`) exceeds a configurable threshold, it is marked for compaction to a lower level.

The feature introduces two new options: a CF option `read_triggered_compaction_threshold` (default 0, disabled) and a DB option `max_periodic_compaction_trigger_seconds` (default 43200s) that controls how often the background thread re-evaluates compaction scores on quiet databases. Both options are dynamically changeable. Lowering `max_periodic_compaction_trigger_seconds` does add some overhead, but it is generally minimal, so running this every couple of minutes in a production environment seems fairly reasonable.

Key changes

- `read_triggered_compaction_threshold` (advanced_options.h): When positive, files with `reads_per_byte > threshold` are marked for compaction. Files at the last non-empty level are skipped (bottommost compaction handles those separately). Marked files are sorted by hotness (`reads_per_byte` descending).
- `max_periodic_compaction_trigger_seconds` (options.h): Replaces the hardcoded 12-hour ceiling in `ComputeTriggerCompactionPeriod()`. Essential for read-triggered compaction on quiet DBs, since there are no writes to trigger score re-evaluation.
- Leveled picker (compaction_picker_level.cc): Adds read-triggered as the lowest-priority compaction reason in `SetupInitialFiles()`, using the existing `PickFileToCompact` helper.
- Universal picker (compaction_picker_universal.cc): Adds `PickReadTriggeredCompaction` as the lowest priority. Refactors the shared "find output level + compute overlapping inputs + create Compaction" logic from both `PickDeleteTriggeredCompaction` and `PickReadTriggeredCompaction` into `BuildCompactionToNextLevel`, handling both single-level and multi-level universal cases.
- Periodic trigger (db_impl.cc): `TriggerPeriodicCompaction` now also fires for CFs with `read_triggered_compaction_threshold > 0`, even without time-based compaction configured.
- Tooling: `db_stress` and `db_bench` support the new options; `db_crashtest.py` randomly enables read-triggered compaction and sets a short periodic trigger interval when enabled.

Test Plan
Unit tests:
- `compaction_picker_test` — 7 new tests: `ReadTriggeredCompactionDisabled`, `ReadTriggeredCompactionBelowThreshold`, `ReadTriggeredCompactionAboveThreshold`, `NeedsCompactionReadTriggered`, `ReadTriggeredPicksFile`, `UniversalReadTriggeredCompaction`, `ReadTriggeredSkipsLastLevel`, `UniversalReadTriggeredNoPickWhenNotMarked`
- `db_compaction_test` — `ReadTriggeredCompaction` integration test verifying end-to-end behavior with sync points

Benchmark (`db_bench`):

Setup: 5M keys (100B values, 16B keys), leveled compaction, 5 levels, 4MB target file size. DB fully compacted, then 2M overlapping keys written without compaction to create L0/L1 overlap (82 files, ~294MB).
LSM shape change during readrandom with read-triggered compaction: