
Add injected error log ring buffer for fault injection diagnostics#14431

Open
xingbowang wants to merge 2 commits into facebook:main from xingbowang:2026_03_05_fault_inject_log

Conversation

@xingbowang
Contributor

Add a circular ring buffer (InjectedErrorLog) to FaultInjectionTestFS that records the last 1000 injected errors. Each entry captures the timestamp, thread ID, FS API name with all arguments (file path, offset, buffer size, first 8 bytes of data in hex), and the injected error status. For example:

Append("/path/035354.sst", size=4096, head=[1a 2b 3c ...]) -> IOError (Retryable): injected write error
RenameFile("/path/tmp.sst", "/path/035355.sst") -> IOError: injected metadata write error
Read(offset=16384, size=4096) -> IOError (Retryable): injected read error

The ring buffer is printed to stderr automatically:

  • On any fatal signal (SIGABRT, SIGSEGV, SIGTERM, SIGINT, SIGHUP, SIGFPE, SIGBUS, SIGILL, SIGQUIT, SIGXCPU, SIGXFSZ, SIGSYS) via a registered crash callback
  • At the end of db_stress main(), for diagnostic visibility even when the test completes normally

This addresses a key debugging gap: when write fault injection causes secondary failures (e.g., the builder error propagation issue in T257612259), the injected errors were previously completely silent with no logging trail. The ring buffer provides the missing diagnostic context to correlate fault injection with downstream failures.

Changes:

  • port/stack_trace.h/.cc: Add RegisterCrashCallback() API; extend InstallStackTraceHandler() to catch all catchable termination signals
  • utilities/fault_injection_fs.h: Add InjectedErrorLog class with printf-style Record(), HexHead() for data bytes, and signal-safe PrintAll()
  • utilities/fault_injection_fs.cc: Record full API arguments and error status at all 31 fault injection call sites
  • db_stress_tool/db_stress_tool.cc: Register crash callback and print ring buffer at end of main()

@meta-cla meta-cla bot added the CLA Signed label Mar 5, 2026
@xingbowang xingbowang requested a review from hx235 March 5, 2026 21:09
@meta-codesync

meta-codesync bot commented Mar 5, 2026

@xingbowang has imported this pull request. If you are a Meta employee, you can view this in D95435430.

@github-actions

github-actions bot commented Mar 5, 2026

⚠️ clang-tidy: 2 warning(s) on changed lines

Completed in 1304.4s.

Summary by check

| Check | Count |
| --- | --- |
| cert-err58-cpp | 1 |
| concurrency-mt-unsafe | 1 |
| Total | 2 |

Details

db_stress_tool/db_stress_tool.cc (1 warning(s))
db_stress_tool/db_stress_tool.cc:289:7: warning: function is not thread safe [concurrency-mt-unsafe]
tools/ldb_cmd.cc (1 warning(s))
tools/ldb_cmd.cc:91:31: warning: initialization of 'ARG_UNIFORM_CV_THRESHOLD' with static storage duration may throw an exception that cannot be caught [cert-err58-cpp]

@joshkang97
Contributor

I think it's useful to be able to debug the crash tests more easily, but doesn't this also signal a broader problem? If these IO problems show up on other filesystems, won't they be very difficult to debug?

- Change PrintAll() to use fprintf(stderr) instead of fprintf(stdout)
  for consistency with StackTraceHandler and reliable signal-handler output
- Add comment documenting the benign torn-read race in PrintAll()
- Add missing error status to first Append() Record() call for consistency
  with all other injection sites
- Replace strsignal() with async-signal-safe SignalName() lookup table
  in TerminationHandler
- Fix TerminationHandler comment to list only the signals actually handled
- Use value-initialization instead of memset in InjectedErrorLog constructor
@xingbowang
Contributor Author

> I think it's useful to be able to debug the crash tests more easily, but doesn't this also signal a broader problem? If these IO problems show up on other filesystems, won't they be very difficult to debug?

In most cases, RocksDB reports the IO error or retries, but in some cases, mostly bugs, it silently swallows the error. This is what was observed in the stress test and what triggered this change. You are right that the goal is to make it easier to discover errors from the file system; that is one of the goals of fault injection in our stress test. But it does no harm to add the additional logging, as it will help us root-cause the bugs that silently swallow errors much more easily.
