Prepopulate block cache during compaction #14445
mszeszko-meta wants to merge 1 commit into facebook:main from
Conversation
✅ clang-tidy: No findings on changed lines (completed in 194.8s).
Force-pushed from 333c727 to 19e1238 (Compare)
@mszeszko-meta has imported this pull request. If you are a Meta employee, you can view this in D95997952.
@@ -1203,6 +1204,14 @@ struct BlockBasedTableBuilder::Rep {
    switch (table_options.prepopulate_block_cache) {
      case BlockBasedTableOptions::PrepopulateBlockCache::kFlushOnly:
        warm_cache = (reason == TableFileCreationReason::kFlush);
Should this be updated? Also, I would suggest moving it below the switch block:
warm_cache = (reason != TableFileCreationReason::kDisable);
This would result in warm_cache = true for table_options.prepopulate_block_cache = kFlushOnly and reason TableFileCreationReason::kCompaction, while we would expect the value to be false.
Some AI review results:
- H3: Missing release notes — no entry in unreleased_history/new_features/ or unreleased_history/public_api_changes/. Required per RocksDB conventions, and especially important given the forward-incompatibility with older OPTIONS file parsers. (3 agents agreed)
- H1: No test verifies BOTTOM priority — the core differentiator (compaction at BOTTOM, flush at LOW) has zero test coverage. The existing MockCache conflates BOTTOM with HIGH. A reproducer using PriorityTrackingCache was built and passes; the feature works correctly but lacks verification. (3 agents agreed)
Summary:
When RocksDB operates with tiered or remote storage (e.g., Warm Storage,
HDFS, S3), reading recently compacted data incurs high-latency remote reads
because compaction output files are not present in the block cache. The
existing `prepopulate_block_cache = kFlushOnly` avoids this for flush output
but leaves compaction output cold until first access.
Add a new `PrepopulateBlockCache::kFlushAndCompaction` enum value that warms
all block types (data, index, filter, compression dict) into the block cache
during both flush and compaction. Flush-warmed blocks use `LOW` priority
(unchanged from kFlushOnly behavior), while compaction-warmed blocks use
`BOTTOM` priority — compaction data is less temporally local than freshly
flushed data, so it should be the first to be evicted when the cache is full.
This gives the remote-read avoidance benefit without risking cache thrashing.
Unlike flush output (which is inherently hot — just written by the user), it
is hard to distinguish hot from cold blocks in compaction output. Warming all
compaction output therefore risks polluting the block cache and evicting
genuinely hot entries. The kFlushAndCompaction mode is recommended only for
use cases where most or all of the database is expected to reside in cache
(e.g., the working set fits in cache). For workloads where only a fraction
of the data is hot, kFlushOnly remains the safer choice.
The enum uses `kFlushAndCompaction` rather than separate `kCompactionOnly` +
`kFlushAndCompaction` values because there is no practical use case for
warming compaction output without also warming flush output. Flush output is
by definition the hottest data (just written by the user), so if a workload
benefits from warming the colder compaction output, it would always benefit
from warming flush output too.
The implementation reuses the existing `InsertBlockInCacheHelper` /
`WarmInCache` infrastructure in BlockBasedTableBuilder. The only internal
change is adding a `warm_cache_priority` field to `Rep` alongside the
existing `warm_cache` bool, and plumbing it through to the `WarmInCache`
call instead of the previously hardcoded `Cache::Priority::LOW`.
Key changes:
- New `PrepopulateBlockCache::kFlushAndCompaction` enum value in table.h
- `Rep::warm_cache_priority` field in BlockBasedTableBuilder for
per-reason priority control
- Serialization support ("kFlushAndCompaction" in string map)
- db_bench support (--prepopulate_block_cache=2)
- Crash test coverage (random choice includes new value)
- Release notes in unreleased_history/
Test Plan:
- New `WarmCacheWithDataBlocksDuringCompaction` test: verifies data blocks
from compaction output are present in the block cache and served without
misses
- New `WarmCachePriorityFlushVsCompaction` test: uses a PriorityTrackingCache
wrapper to verify flush inserts at LOW and compaction inserts at BOTTOM
- Extended `DynamicOptions` test: verifies dynamic switching through
kDisable -> kFlushAndCompaction -> kFlushOnly -> kDisable via SetOptions
- Existing `WarmCacheWithDataBlocksDuringFlush` and parameterized
`WarmCacheWithBlocksDuringFlush` tests continue to pass (kFlushOnly
behavior unchanged)
- db_block_cache_test: 82/82 passed
- options_test: 74/74 passed
- table_test: 6910/6910 passed
Force-pushed from 19e1238 to fef6353 (Compare)
Addressed.