
Add injected error log ring buffer for fault injection diagnostics#14431

Open
xingbowang wants to merge 2 commits into facebook:main from xingbowang:2026_03_05_fault_inject_log

Conversation

@xingbowang
Contributor

Add a circular ring buffer (InjectedErrorLog) to FaultInjectionTestFS that records the last 1000 injected errors. Each entry captures the timestamp, thread ID, FS API name with all arguments (file path, offset, buffer size, first 8 bytes of data in hex), and the injected error status. For example:

Append("/path/035354.sst", size=4096, head=[1a 2b 3c ...]) -> IOError (Retryable): injected write error
RenameFile("/path/tmp.sst", "/path/035355.sst") -> IOError: injected metadata write error
Read(offset=16384, size=4096) -> IOError (Retryable): injected read error

The ring buffer is printed to stderr automatically:

  • On any fatal signal (SIGABRT, SIGSEGV, SIGTERM, SIGINT, SIGHUP, SIGFPE, SIGBUS, SIGILL, SIGQUIT, SIGXCPU, SIGXFSZ, SIGSYS) via a registered crash callback
  • At the end of db_stress main(), for diagnostic visibility even when the test completes normally

This addresses a key debugging gap: when write fault injection causes secondary failures (e.g., the builder error propagation issue in T257612259), the injected errors were previously completely silent with no logging trail. The ring buffer provides the missing diagnostic context to correlate fault injection with downstream failures.

Changes:

  • port/stack_trace.h/.cc: Add RegisterCrashCallback() API; extend InstallStackTraceHandler() to catch all catchable termination signals
  • utilities/fault_injection_fs.h: Add InjectedErrorLog class with printf-style Record(), HexHead() for data bytes, and signal-safe PrintAll()
  • utilities/fault_injection_fs.cc: Record full API arguments and error status at all 31 fault injection call sites
  • db_stress_tool/db_stress_tool.cc: Register crash callback and print ring buffer at end of main()

@meta-cla meta-cla bot added the CLA Signed label Mar 5, 2026
@xingbowang xingbowang requested a review from hx235 March 5, 2026 21:09
@meta-codesync

meta-codesync bot commented Mar 5, 2026

@xingbowang has imported this pull request. If you are a Meta employee, you can view this in D95435430.

@github-actions

github-actions bot commented Mar 5, 2026

⚠️ clang-tidy: 2 warning(s) on changed lines

Completed in 1304.4s.

Summary by check

| Check | Count |
| --- | --- |
| cert-err58-cpp | 1 |
| concurrency-mt-unsafe | 1 |
| Total | 2 |

Details

db_stress_tool/db_stress_tool.cc (1 warning(s))
db_stress_tool/db_stress_tool.cc:289:7: warning: function is not thread safe [concurrency-mt-unsafe]
tools/ldb_cmd.cc (1 warning(s))
tools/ldb_cmd.cc:91:31: warning: initialization of 'ARG_UNIFORM_CV_THRESHOLD' with static storage duration may throw an exception that cannot be caught [cert-err58-cpp]

@joshkang97
Contributor

I think it's useful to be able to debug the crash tests more easily, but doesn't this also signal a broader problem? If these IO problems show up on other filesystems, won't they be very difficult to debug?

- Change PrintAll() to use fprintf(stderr) instead of fprintf(stdout)
  for consistency with StackTraceHandler and reliable signal-handler output
- Add comment documenting the benign torn-read race in PrintAll()
- Add missing error status to first Append() Record() call for consistency
  with all other injection sites
- Replace strsignal() with async-signal-safe SignalName() lookup table
  in TerminationHandler
- Fix TerminationHandler comment to list only the signals actually handled
- Use value-initialization instead of memset in InjectedErrorLog constructor
@xingbowang
Contributor Author

> I think it's useful to be able to debug the crash tests more easily, but doesn't this also signal a broader problem? If these IO problems show up on other filesystems, won't they be very difficult to debug?

In most cases, RocksDB reports the IO error or retries, but in some cases, mostly bugs, it silently swallows the error. This is what was observed in the stress test and what triggered this change. You are right that the goal is to make it easier to discover errors from the file system; that is one of the goals of fault injection in our stress test. But it does no harm to add the additional logging, as it will help us root-cause the bugs that silently swallow errors much more easily.
