Conversation
Signed-off-by: Little-Wallace <bupt2013211450@gmail.com>
Signed-off-by: Little-Wallace <bupt2013211450@gmail.com>
This reverts commit b84bd77. Signed-off-by: Little-Wallace <bupt2013211450@gmail.com>
Signed-off-by: Little-Wallace <bupt2013211450@gmail.com>
Signed-off-by: Little-Wallace <bupt2013211450@gmail.com>
|
Could you pass facebook's CI, or at least figure out why they failed? I see you have several builds timed out. That usually means certain cases are hanging (they don't output PASSED log nor error log). |
|
/test |
It seems to be a unstable test. These tests run a long time and may fail because of environment. |
|
@Little-Wallace I don't think so. There are some unstable ones, but they won't cause nearly half of the tests to fail (usually at most 2). |
The tests did not fail. But the process was killed because out of time |
|
@Little-Wallace Some tests are hanging, mostly likely on some sync point conditions. It's better to figure out exactly what are those conditions, even though we might not need to fix them. |
Signed-off-by: Little-Wallace <bupt2013211450@gmail.com>
tabokie
left a comment
There was a problem hiding this comment.
Could you explain the bug you just fixed in upstream? So I know where the review process is flawed.
db/error_handler.h
Outdated
| // A flag indicating whether automatic recovery from errors is enabled | ||
| bool auto_recovery_; | ||
| bool recovery_in_prog_; | ||
| std::atomic<bool> stop_state_; |
There was a problem hiding this comment.
| std::atomic<bool> stop_state_; | |
| std::atomic<bool> db_stopped_; |
db/error_handler.h
Outdated
| return !bg_error_.ok() && | ||
| bg_error_.severity() >= Status::Severity::kHardError; | ||
| } | ||
| bool IsDBStopped() { return stop_state_.load(std::memory_order_acquire); } |
There was a problem hiding this comment.
Add comment "Do not require DB mutex held."
| new_options.max_background_compactions, | ||
| new_options.max_background_jobs, | ||
| new_options.base_background_compactions, | ||
| /* parallelize_compactions */ true); |
There was a problem hiding this comment.
| /* parallelize_compactions */ true); | |
| true /* parallelize_compactions */); |
There was a problem hiding this comment.
It seems that all code of rocksdb place the comment before param
There was a problem hiding this comment.
You're half right. It's a per-file styling and this file happens to use postposition which is rare within the whole codebase...
| // "number < current_log_number". | ||
| MarkLogsSynced(current_log_number - 1, true, s); | ||
| if (!s.ok()) { | ||
| error_handler_.SetBGError(s, BackgroundErrorReason::kFlush); |
There was a problem hiding this comment.
Where do you set this error now?
There was a problem hiding this comment.
It will be saved at "db/db_impl/db_impl_compaction_flush.cc" L549 and L210
There was a problem hiding this comment.
In this function, we will not hold mutex so that we can not save BG error here.
The function |
|
I fix the problem of facebook#7516 in this commit: facebook@5fe832a There are some other bugs which are bout code only in main branch of facebook/rocksdb, so I did not port them here. |
db/db_impl/db_impl.cc
Outdated
| ROCKS_LOG_INFO(immutable_db_options_.info_log, | ||
| "Created column family [%s] (ID %u)", | ||
| column_family_name.c_str(), (unsigned)cfd->GetID()); | ||
| single_column_family_mode_.store(false, std::memory_order_release); |
There was a problem hiding this comment.
I forget why I place it here.... Maybe I can try to place it back to origin
| @@ -175,7 +174,9 @@ Status DBImpl::FlushMemTableToOutputFile( | |||
| // flushed SST may contain data from write batches whose updates to | |||
| // other column families are missing. | |||
| // SyncClosedLogs() may unlock and re-lock the db_mutex. | |||
| ret = s; | ||
| { | ||
| InstrumentedMutexLock lock(&log_write_mutex_); | ||
| for (auto l : logs_to_free_) { |
There was a problem hiding this comment.
What's the benefit to make logs_to_free_ be protected by both log_write_mutex_ and mutex_
There was a problem hiding this comment.
It makes no sense. But i think unlock mutex_ during this function may cause other bad case.
tabokie
left a comment
There was a problem hiding this comment.
Looks good overall. But I'd prefer seeing a full test report before merging.
| } | ||
| }; | ||
|
|
||
| struct LogContext { |
There was a problem hiding this comment.
Add newlines around this struct definition.
db/db_impl/db_impl.h
Outdated
| InstrumentedCondVar atomic_flush_install_cv_; | ||
|
|
||
| bool wal_in_db_path_; | ||
| std::atomic<uint64_t> max_total_wal_size_; |
There was a problem hiding this comment.
Put it together with max_total_in_memory_state_.
db/db_impl/db_impl_write.cc
Outdated
| if (max_total_wal_size > 0) { | ||
| return max_total_wal_size; | ||
| } | ||
| return 4 * max_total_in_memory_state_; |
There was a problem hiding this comment.
max_total_in_memory_state_ is an atomic now, need to specify memory order as well?
|
I have finished the benchmark test. See details in https://docs.google.com/document/d/1f72BGmN1kREorkaCzhbsB0oocTP13IKJw4jy5mfm-TE/edit#heading=h.mtjjjfyw1ei |
Signed-off-by: Little-Wallace <bupt2013211450@gmail.com>
| mutex_.Lock(); | ||
| bool need_log_sync = !write_options.disableWAL && write_options.sync; | ||
| bool need_log_dir_sync = need_log_sync && !log_dir_synced_; | ||
| LogContext log_context; |
There was a problem hiding this comment.
why not init need_log_dir_sync here?
There was a problem hiding this comment.
need_log_dir_sync was protected by mutex_ before. This pr will protected this member variable by log_write_mutex_ so it could only be accessed in the method PreprocessWrite.
There was a problem hiding this comment.
you mean log_dir_synced_ is protected by mutex_? Oh, reasonable.
Connor1996
left a comment
There was a problem hiding this comment.
rocksdb/db/db_impl/db_impl_write.cc
Line 1194 in 46d7b19
change mutex_ to log_write_mutex_
| @@ -220,10 +216,7 @@ void DBImpl::FindObsoleteFiles(JobContext* job_context, bool force, | |||
| continue; | |||
There was a problem hiding this comment.
the log_sync_cv_ may wait here while holding the mutex
There was a problem hiding this comment.
I think Oh, you mean condvar::Wait will implicitly release mutex (a mutex is paired with it when constructed).mutex_. That's a good catch.
There was a problem hiding this comment.
yes, it will release the pairing mutex log_write_mutex_, but not mutex_
There was a problem hiding this comment.
Good catch... I'm thinking about how to solve this problem....
There was a problem hiding this comment.
I'm not sure whether it is still correctly to release mutex_ here because if some other thread calling FindObsoleteFiles but this thread is only finish half of its task.
There was a problem hiding this comment.
mutex_ is released before, I don't see a problem here. But you can't simply re-acquire the mutex after Wait(). That will break the lock order constraint.
Maybe you can skip those unsync-ed log files, and clean them in the next round?
There was a problem hiding this comment.
OK I will unlock mutex_ before FindObsoleteLogFiles method. PTAL
Signed-off-by: Little-Wallace <bupt2013211450@gmail.com>
Signed-off-by: Little-Wallace <bupt2013211450@gmail.com>
Connor1996
left a comment
There was a problem hiding this comment.
rest LGTM. Can we add a test to cover the case of the mutex?
I thought there are enough tests in RocksDB to cover this condition? I have passed the CI of facebook to make sure that it is no problem |

Signed-off-by: Little-Wallace bupt2013211450@gmail.com
Summary
Cherry-pick facebook#7516
This PR is to avoid write operation blocking on mutex when compaction or other method hold the mutex too long.