Add injected error log ring buffer for fault injection diagnostics#14431
Add injected error log ring buffer for fault injection diagnostics#14431xingbowang wants to merge 2 commits intofacebook:mainfrom
Conversation
Add a circular ring buffer (InjectedErrorLog) to FaultInjectionTestFS
that records the last 1000 injected errors. Each entry captures the
timestamp, thread ID, FS API name with all arguments (file path, offset,
buffer size, first 8 bytes of data in hex), and the injected error
status. For example:
Append("/path/035354.sst", size=4096, head=[1a 2b 3c ...]) -> IOError (Retryable): injected write error
RenameFile("/path/tmp.sst", "/path/035355.sst") -> IOError: injected metadata write error
Read(offset=16384, size=4096) -> IOError (Retryable): injected read error
The ring buffer is printed to stderr automatically:
- On any fatal signal (SIGABRT, SIGSEGV, SIGTERM, SIGINT, SIGHUP,
SIGFPE, SIGBUS, SIGILL, SIGQUIT, SIGXCPU, SIGXFSZ, SIGSYS) via a
registered crash callback
- At the end of db_stress main(), for diagnostic visibility even when
the test completes normally
This addresses a key debugging gap: when write fault injection causes
secondary failures (e.g., the builder error propagation issue in
T257612259), the injected errors were previously completely silent with
no logging trail. The ring buffer provides the missing diagnostic
context to correlate fault injection with downstream failures.
Changes:
- port/stack_trace.h/.cc: Add RegisterCrashCallback() API; extend
InstallStackTraceHandler() to catch all catchable termination signals
- utilities/fault_injection_fs.h: Add InjectedErrorLog class with
printf-style Record(), HexHead() for data bytes, and signal-safe
PrintAll()
- utilities/fault_injection_fs.cc: Record full API arguments and error
status at all 31 fault injection call sites
- db_stress_tool/db_stress_tool.cc: Register crash callback and print
ring buffer at end of main()
|
@xingbowang has imported this pull request. If you are a Meta employee, you can view this in D95435430. |
|
| Check | Count |
|---|---|
cert-err58-cpp |
1 |
concurrency-mt-unsafe |
1 |
| Total | 2 |
Details
db_stress_tool/db_stress_tool.cc (1 warning(s))
db_stress_tool/db_stress_tool.cc:289:7: warning: function is not thread safe [concurrency-mt-unsafe]
tools/ldb_cmd.cc (1 warning(s))
tools/ldb_cmd.cc:91:31: warning: initialization of 'ARG_UNIFORM_CV_THRESHOLD' with static storage duration may throw an exception that cannot be caught [cert-err58-cpp]
|
I think it's useful to be able to debug the crash tests more easily, but doesn't this also signal a broader problem. If these IO problems do show up in other filesystems, then it will be very difficult to debug? |
- Change PrintAll() to use fprintf(stderr) instead of fprintf(stdout) for consistency with StackTraceHandler and reliable signal-handler output - Add comment documenting the benign torn-read race in PrintAll() - Add missing error status to first Append() Record() call for consistency with all other injection sites - Replace strsignal() with async-signal-safe SignalName() lookup table in TerminationHandler - Fix TerminationHandler comment to list only the signals actually handled - Use value-initialization instead of memset in InjectedErrorLog constructor
In most of the case, RocksDB reports the IO error or retry, but in some cases, mostly bug, it silently swallow the error. This is what was observed in stress test and triggered this change. You are right the goal is to make it easier to discover errors from file system. It is one of the goal for the fault inject in our stress test. But it does not harm to add the the additional logging, as it will help us to root cause some of the bugs that silent swallow the error much more easily. |
Add a circular ring buffer (InjectedErrorLog) to FaultInjectionTestFS that records the last 1000 injected errors. Each entry captures the timestamp, thread ID, FS API name with all arguments (file path, offset, buffer size, first 8 bytes of data in hex), and the injected error status. For example:
Append("/path/035354.sst", size=4096, head=[1a 2b 3c ...]) -> IOError (Retryable): injected write error
RenameFile("/path/tmp.sst", "/path/035355.sst") -> IOError: injected metadata write error
Read(offset=16384, size=4096) -> IOError (Retryable): injected read error
The ring buffer is printed to stderr automatically:
This addresses a key debugging gap: when write fault injection causes secondary failures (e.g., the builder error propagation issue in T257612259), the injected errors were previously completely silent with no logging trail. The ring buffer provides the missing diagnostic context to correlate fault injection with downstream failures.
Changes: