From cd6902483ac3cfd3a85c1ab87a9c268ab1fcc0cb Mon Sep 17 00:00:00 2001 From: Pawel Rutka Date: Mon, 22 Dec 2025 11:36:16 +0100 Subject: [PATCH 1/3] fmea: initial FMEA for discussion --- .../health_monitor/safety_analysis/fmea.rst | 98 +++++++++++++++++++ 1 file changed, 98 insertions(+) diff --git a/docs/module/health_monitor/safety_analysis/fmea.rst b/docs/module/health_monitor/safety_analysis/fmea.rst index 9916551e..7e39f145 100644 --- a/docs/module/health_monitor/safety_analysis/fmea.rst +++ b/docs/module/health_monitor/safety_analysis/fmea.rst @@ -34,6 +34,104 @@ FMEA (Failure Modes and Effects Analysis) - Adjust ``status`` to be ``valid`` - Adjust ``safety`` and ``tags`` according to your needs + +Failure Mode Evaluation Table +----------------- + +.. comp_saf_fmea:: + +.. list-table:: + :header-rows: 1 + :widths: auto + + * - Title + - id + - failure_effect + - mitigation_proposal + - sufficient + * - Missing processing time + - HM_FMEA_001 + - Background thread does not receive CPU time slice, leading to miss specified alive notification internal towards Launch Daemon + - | **Detection:** + - Missing notifications will be detected by Launch Daemon and lead to safety reaction at Launch Daemon. + | **Mitigation:** + - Provide `AoU` that integrator has to ensure Health Monitor background thread receives sufficient CPU time slice by configuring it's scheduling parameters accordingly. + - Yes + * - Loss of execution + - HM_FMEA_002 + - Background thread does not advance in its execution (ie. deadlock, endless loop failure), leading to miss specified alive notification internal towards Launch Daemon + - | **Detection:** + - Missing notifications will be detected by Launch Daemon and lead to safety reaction at Launch Daemon. + - Yes + * - Memory corruption of monitoring data structures + - HM_FMEA_003 + - Corruption of internal data structures used for monitoring, leading to missed detection of failure of monitored components (bitflips, out of range data, etc.) 
+ - | **Detection:** + - SEE NOTES BELOW + - Yes + +HM_FMEA_003 +------------ + +The Health Monitoring Library is placed in the same process as the monitored components. Therefore, any other component that shares this process can corrupt the memory of the Health Monitoring Library, which can lead to missed detection of failures of monitored components. +Since the Health Monitoring Library is implemented in **Rust**, we could rely on Rust's memory safety guarantees to avoid memory corruption due to programming errors in the library itself. However, we also support C/C++ components +that can introduce memory issues due to programming errors. Therefore, we need to consider additional detection mechanisms. The possible detection mechanisms are described below: + +Checksums +^^^^^^^^^^ +One possible detection mechanism is to use checksums for the data structures used by monitors. + +**Pros** + +- Low performance overhead + +**Cons** + +- Hard to implement for complex data structures that are mutated frequently + +.. note:: + Not implemented because of the implementation complexity for complex data structures. + +Protected pages +^^^^^^^^^^^^^^^^ +Another possible detection mechanism is to use protected memory pages. Region before and after data structures used for monitors can be marked as non-accessible. +Since other components do not have knowledge where internal structures were allocated, **likelihood** of memory corruption only in data structures used for monitors and not around them is **low**. + +**Pros** +- Easy to implement +- 0 performance overhead + +**Cons** +- Increased memory overhead +- Can detect only `pass through` corruption - ie. bulk write over memory area +- Protection can be disabled by malicious component however this shall be judged as **extremely low** likelihood because this requires knowledge of internal memory layout of Health Monitoring Library and code review approval. 
+ +Dual banking +^^^^^^^^^^^^^ +Another possible detection mechanism is to use dual banking for data structures used for monitors. One bank is actively used, while the other one is kept on side +as either **mirrored data** or **inverse byte copy**. During runtime, background checking thread will fetch both banks and compare them. If mismatch is detected, then memory corruption is detected. + +.. note:: + This pattern can be extended to triple banking where voting can be used to **recover** corrupted data if needed. + +**Pros** +- Can recover corrupted data if triple banking with voting is used +- Detects wide range of memory corruption patterns + +**Cons** +- Increased memory overhead +- More complex internal implementation + + + +Summary +^^^^^^^^ + +In above considerations we assumed: + - Performance overhead shall be minimal since Monitoring entries are simple operation on low memory chunks + +**Take into consideration that those are protections to `discuss` and agree what makes most sense in our use case.** + Failure Mode List ----------------- From b3698ef136c8559cafa727d3823a4ac2d205cc82 Mon Sep 17 00:00:00 2001 From: Pawel Rutka Date: Fri, 9 Jan 2026 13:40:46 +0100 Subject: [PATCH 2/3] Update after CFT meeting --- .../health_monitor/safety_analysis/fmea.rst | 56 ++++++++++--------- 1 file changed, 31 insertions(+), 25 deletions(-) diff --git a/docs/module/health_monitor/safety_analysis/fmea.rst b/docs/module/health_monitor/safety_analysis/fmea.rst index 7e39f145..e07a6526 100644 --- a/docs/module/health_monitor/safety_analysis/fmea.rst +++ b/docs/module/health_monitor/safety_analysis/fmea.rst @@ -24,21 +24,8 @@ FMEA (Failure Modes and Effects Analysis) :realizes: wp__sw_component_fmea :tags: template -.. note:: Use the content of the document to describe e.g. why a fault model is not applicable for the diagram. - -.. attention:: - The above directive must be updated according to your Component. 
- - - Modify ``Your Component Name`` to be your Component Name - - Modify ``id`` to be your Component Name in upper snake case preceded by ``doc__`` and succeeded by ``_fmea`` - - Adjust ``status`` to be ``valid`` - - Adjust ``safety`` and ``tags`` according to your needs - - Failure Mode Evaluation Table ------------------ - -.. comp_saf_fmea:: +------------------------------ .. list-table:: :header-rows: 1 @@ -53,21 +40,32 @@ Failure Mode Evaluation Table - HM_FMEA_001 - Background thread does not receive CPU time slice, leading to miss specified alive notification internal towards Launch Daemon - | **Detection:** - - Missing notifications will be detected by Launch Daemon and lead to safety reaction at Launch Daemon. + | - Missing notifications will be detected by Launch Daemon and lead to safety reaction at Launch Daemon. + | **Mitigation:** - - Provide `AoU` that integrator has to ensure Health Monitor background thread receives sufficient CPU time slice by configuring it's scheduling parameters accordingly. + | - Provide an `AoU` requiring the integrator to ensure that the Health Monitor background thread receives a sufficient CPU time slice by configuring its scheduling parameters accordingly. + | - All code within process is developed according to **ASIL-B** development process + - Yes * - Loss of execution - HM_FMEA_002 - Background thread does not advance in its execution (ie. deadlock, endless loop failure), leading to miss specified alive notification internal towards Launch Daemon - | **Detection:** - - Missing notifications will be detected by Launch Daemon and lead to safety reaction at Launch Daemon. + | - Missing notifications will be detected by Launch Daemon and lead to safety reaction at Launch Daemon. 
+ + | **Mitigation:** + | - All code within process is developed according to **ASIL-B** development process + - Yes * - Memory corruption of monitoring data structures - HM_FMEA_003 - Corruption of internal data structures used for monitoring, leading to missed detection of failure of monitored components (bitflips, out of range data, etc.) - | **Detection:** - - SEE NOTES BELOW + | - Using protected memory pages around internal data structures used for monitoring to detect memory corruption (see below) + + | **Mitigation:** + | - All code within process is developed according to **ASIL-B** development process + - Yes HM_FMEA_003 @@ -98,10 +96,12 @@ Another possible detection mechanism is to use protected memory pages. Region be Since other components do not have knowledge where internal structures were allocated, **likelihood** of memory corruption only in data structures used for monitors and not around them is **low**. **Pros** + - Easy to implement - 0 performance overhead **Cons** + - Increased memory overhead - Can detect only `pass through` corruption - ie. bulk write over memory area - Protection can be disabled by malicious component however this shall be judged as **extremely low** likelihood because this requires knowledge of internal memory layout of Health Monitoring Library and code review approval. @@ -115,22 +115,28 @@ as either **mirrored data** or **inverse byte copy**. During runtime, backgroun This pattern can be extended to triple banking where voting can be used to **recover** corrupted data if needed. 
**Pros** + - Can recover corrupted data if triple banking with voting is used - Detects wide range of memory corruption patterns **Cons** + - Increased memory overhead - More complex internal implementation +Decision +========= +- Status: Accepted +- Date: 2026-01-09 + +After evaluation of above detection mechanisms, it was decided that **Health Monitoring Library** shall be implemented as library within monitored process as FMEA confirms safety goals are met. +Rationale +========== +- All code within process is developed according to **ASIL-B** development process +- Library will use **protected pages** mechanism to detect `pass through` memory corruption +- Lifecycle CFT will investigate possibility to harden memory protection using **ARM MTE** extension (`more here `_ ) in future releases - https://github.com/eclipse-score/score/issues/2397 -Summary -^^^^^^^^ - -In above considerations we assumed: - - Performance overhead shall be minimal since Monitoring entries are simple operation on low memory chunks - -**Take into consideration that those are protections to `discuss` and agree what makes most sense in our use case.** Failure Mode List ----------------- From 5e759c603c4555a451afc35de608285a255da9f5 Mon Sep 17 00:00:00 2001 From: Pawel Rutka Date: Fri, 9 Jan 2026 13:47:44 +0100 Subject: [PATCH 3/3] Add missing test target --- src/health_monitoring_lib/BUILD | 5 +++++ 1 file changed, 5 insertions(+) diff --git a/src/health_monitoring_lib/BUILD b/src/health_monitoring_lib/BUILD index 54f25cb8..69e352ef 100644 --- a/src/health_monitoring_lib/BUILD +++ b/src/health_monitoring_lib/BUILD @@ -19,3 +19,8 @@ rust_library( crate_name = "health_monitoring_lib", visibility = ["//visibility:public"], ) + +rust_test( + name = "tests", + crate = ":health_monitoring_lib", +)