Implement Diagnostic Fault Library with basic DFM, SOVD interface, and CI infrastructure #5

Open
bburda42dot wants to merge 7 commits into eclipse-opensovd:main from bburda42dot:main

Conversation

@bburda42dot

Summary

Complete implementation of the Diagnostic Fault Library - a Rust library for managing diagnostic fault reporting, processing, and querying in Software-Defined Vehicles. Replaces the initial scaffold (src/lib.rs, api.rs, catalog.rs, etc.) with a production-grade multi-crate workspace aligned with the S-CORE module template.

What changed

Architecture - multi-crate workspace

  • Reorganized from a single flat crate into three workspace crates:
    • common - shared types: FaultId, FaultRecord, FaultCatalog, DebounceMode, IPC service types, compliance tags
    • fault_lib - reporter-side API: Reporter with debounce filtering, enabling-condition guards, IpcWorker with retry queue (exponential backoff), LogHook observability, FaultManagerSink
    • dfm_lib - Diagnostic Fault Manager: FaultRecordProcessor, AgingManager, SovdFaultManager with KVS-backed storage, EnablingConditionRegistry, OperationCycle provider abstraction
  • Added xtask crate for developer automation
  • Deleted original scaffold files (src/lib.rs, src/api.rs, src/model.rs, src/catalog.rs, src/config.rs, src/ids.rs, src/sink.rs, src/utils.rs)
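
The retry queue's exponential backoff can be illustrated with a minimal, std-only sketch. `BackoffPolicy`, its fields, and `delay_for` are hypothetical names chosen for illustration and are not taken from fault_lib; the assumed policy doubles the delay per attempt and caps it at a maximum.

```rust
use std::time::Duration;

/// Hypothetical retry policy; names and fields are illustrative, not the fault_lib API.
#[derive(Debug, Clone, Copy)]
pub struct BackoffPolicy {
    pub base: Duration,
    pub max: Duration,
    pub max_attempts: u32,
}

impl BackoffPolicy {
    /// Delay before retry number `attempt` (0-based): base * 2^attempt, capped at `max`.
    /// Returns `None` once the attempt budget is exhausted.
    pub fn delay_for(&self, attempt: u32) -> Option<Duration> {
        if attempt >= self.max_attempts {
            return None;
        }
        // checked_shl / checked_mul guard against overflow for large attempt numbers.
        let factor = 1u32.checked_shl(attempt).unwrap_or(u32::MAX);
        let delay = self.base.checked_mul(factor).unwrap_or(self.max);
        Some(delay.min(self.max))
    }
}
```

With a 10 ms base and a 100 ms cap, the delays under this assumed policy grow as 10, 20, 40, 80, 100 ms, after which the message would be dropped or reported as failed.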

Features

  • Reporter-side debounce filtering - CountWithinWindow, HoldTime, EdgeWithCooldown, CountThreshold modes
  • Enabling conditions - E2E flow: reporters register conditions, DFM evaluates before processing
  • IPC worker - iceoryx2-based transport with bounded channel, backpressure, and retry queue with exponential backoff
  • Fault aging & reset - policy-driven aging evaluation with operation cycle integration
  • SOVD fault API - typed status, counters, ISO 8601 timestamps, full FaultId variant support (Numeric/Text/Uuid)
  • Graceful shutdown - deadlock prevention via cooperative shutdown mechanism
  • Memory safety - replaced Box::leak with Cow<str>, bounded channels
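
As a rough illustration of the CountWithinWindow debounce mode listed above, the sketch below passes a fault on only once a minimum number of occurrences fall inside a trailing time window. The struct layout and the `on_event` method are assumptions made for this example; the real Reporter API may differ.

```rust
use std::collections::VecDeque;
use std::time::{Duration, Instant};

/// Illustrative count-within-window debouncer (not the fault_lib type).
pub struct CountWithinWindow {
    min_count: usize,
    window: Duration,
    events: VecDeque<Instant>,
}

impl CountWithinWindow {
    pub fn new(min_count: usize, window: Duration) -> Self {
        Self { min_count, window, events: VecDeque::new() }
    }

    /// Records an occurrence at `now` and returns true when at least
    /// `min_count` occurrences fall within the trailing `window`.
    pub fn on_event(&mut self, now: Instant) -> bool {
        // Drop occurrences that have slid out of the window.
        while let Some(&oldest) = self.events.front() {
            if now.duration_since(oldest) > self.window {
                self.events.pop_front();
            } else {
                break;
            }
        }
        self.events.push_back(now);
        self.events.len() >= self.min_count
    }
}
```

Taking the timestamp as a parameter instead of calling `Instant::now()` internally keeps the filter deterministic in tests.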

Safety & quality

  • #[deny(clippy::unwrap_used)] enforced in runtime code - all todo!(), expect(), and unwrap() replaced with proper error handling
  • Raw TODO comments replaced with documented error paths
  • Comprehensive test suite: unit tests inlined in source, integration tests (tests/integration/) covering lifecycle transitions, multi-catalog scenarios, persistent storage, and report-query flows
  • Miri-compatible for memory safety validation

CI/CD (6 new workflows)

  • build_test.yml - Cargo build + test
  • lint.yml - Clippy with deny warnings
  • format.yml - rustfmt check
  • coverage.yml - Code coverage reporting
  • miri.yml - Memory safety checks
  • copyright.yml - License header validation

All workflows aligned with S-CORE patterns.

Project structure alignment

  • .bazelrc, MODULE.bazel, BUILD files for Bazel 8 support
  • .vscode/settings.json and extensions.json for development environment
  • .ruff.toml, .yamlfmt, rustfmt.toml for formatting consistency
  • Updated README.md with architecture overview, getting started, and examples
  • Issue and PR templates added

Checklist

  • I have tested my changes locally
  • I have added or updated documentation
  • I have linked related issues or discussions
  • I have added or updated tests

Related

This work is a continuation of #4.

Notes for Reviewers

The change set is quite large, so it is best reviewed commit by commit. I split the commits into categories: "common", "fault-lib", "dfm", etc.

Migrate from single-crate layout to multi-crate workspace with
Bazel 8.3 + Cargo dual build system. Add xtask runner for common
development commands.

IPC-safe types (IpcDuration, IpcTimestamp), fault descriptors,
catalog configuration, debounce/enabling condition config,
query protocol definitions, and iceoryx2 service types.

Fault reporter API, IPC worker with exponential backoff retry,
fault catalog validation, enabling condition management, and
FaultManagerSink for iceoryx2 transport.

SOVD-compliant fault manager with KVS persistent storage, aging
manager, operation cycle tracking, fault record processor, and
query server with iceoryx2 IPC transport.

E2E tests covering lifecycle transitions, debounce/aging/cycles,
persistent storage, concurrent access, boundary values, error
paths, multi-catalog, JSON catalog loading, IPC query/clear,
and report-and-query flow.

Workflows: build/test, clippy lint, rustfmt, miri, coverage,
copyright header check, cargo audit (pinned to SHA), Bazel
format check. All workflows set permissions: contents: read.

…rence

Architecture overview, fault catalog/reporter/DFM sequence
diagrams, library architecture drawing, Sphinx docs scaffold,
and HVAC component design reference example.
bburda42dot requested a review from a team as a code owner on February 25, 2026, 14:52
@vinodreddy-g

vinodreddy-g commented Feb 26, 2026

@bburda42dot Just wanted to know: why was this PR not started on top of the initial commit in #4 from Qorix? It looks like it was started from scratch with all the files moved here, even though it says it is a continuation of #4.

@bburda42dot

@vinodreddy-g I did start on top of Qorix's initial commit from #4 - this PR is a direct continuation of that work. On top of the original ~4.9k lines, I added 63 commits (21k+ lines added, ~800 removed) with significant changes and improvements.

The resulting 64-commit history was hard to review as-is, so before opening this PR I squashed them all into a cleaner, logically grouped commit history specifically to enable commit-by-commit review. That squash is why the git history may look like it was started from scratch, but the code lineage traces directly back to #4.

If proper attribution is important to you, feel free to point out which parts of the current code originate from the original PR and I can add Co-Authored-By to the relevant commits.

@vinodreddy-g

vinodreddy-g commented Feb 26, 2026

@bburda42dot OK, so you split and changed the initial commit for easier review and, of course, added a lot of changes.
We made some design decisions and assumptions in our initial implementation. I wanted to know whether these were discussed or aligned in OpenSOVD discussions over the last few weeks:
1 - The handling of the fault catalog shown in fault_catalog.svg
2 - The interfaces between the lib and the DFM shown in Registering new fault in the system.svg, etc.
3 - The behaviour when a fault doesn't exist, shown in new_fault.svg
4 - We currently use iceoryx2 mostly directly. Should it be moved to the mw::com API to connect to S-CORE as a next step?

Could you also update the svg/puml files with the design changes, so that the changes since #4 are easy to follow?

@FScholPer

FScholPer commented Feb 26, 2026

To 4: we should start with what we have now (iceoryx2); later we can evaluate the migration to mw::com. For the artifacts, a potential next step (not now) could be using Sphinx-Needs.

@bburda42dot

ok so you split/changed the initial commit for easy review [...] wanted to know if these were discussed or aligned [...]

@vinodreddy-g Thanks for the detailed questions. These changes weren't discussed in OpenSOVD architecture meetings - they follow from the design doc requirements and the code review feedback on #4. Happy to discuss any of them in the next Architecture meeting if needed.

I've updated all diagrams in the latest force-push, so you can follow the design changes visually. Here's the breakdown:

1. Fault catalog (fault_catalog.svg)

The core idea is the same: builder pattern, SHA-256 hash verification with the DFM, decentralized catalogs. The main change: the original diagram had an UpdateFaultCatalogRequest path (hash mismatch -> push catalog to DFM). The current version is simpler: a hash mismatch yields Err(CatalogVerification), with no catalog sync over IPC. Both sides load the same catalog files, so a mismatch is a deployment error that is caught early. (The #4 code had check_fault_catalog() returning Result<bool>, but the result was discarded in FaultApi::new(), so mismatches were silently ignored.)

FaultCatalog/FaultCatalogBuilder moved to the common crate (as flagged in review - dfm_lib depended on fault_lib just for catalog types).
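
The fail-fast hash check described above can be sketched as follows. This is a minimal illustration only: std has no SHA-256, so `DefaultHasher` stands in for the real digest, and `CatalogError`, its payload, `catalog_digest`, and `verify_catalog` are hypothetical names apart from the `CatalogVerification` variant mentioned in the text.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

#[derive(Debug, PartialEq, Eq)]
pub enum CatalogError {
    /// Variant name follows Err(CatalogVerification) from the text; the payload is assumed.
    CatalogVerification { local: u64, remote: u64 },
}

/// Stand-in digest: the PR uses SHA-256; DefaultHasher keeps this sketch std-only.
pub fn catalog_digest(catalog_bytes: &[u8]) -> u64 {
    let mut hasher = DefaultHasher::new();
    catalog_bytes.hash(&mut hasher);
    hasher.finish()
}

/// Fails fast on a mismatch instead of ignoring it or syncing the catalog over IPC.
pub fn verify_catalog(local: &[u8], remote_digest: u64) -> Result<(), CatalogError> {
    let local_digest = catalog_digest(local);
    if local_digest == remote_digest {
        Ok(())
    } else {
        Err(CatalogError::CatalogVerification {
            local: local_digest,
            remote: remote_digest,
        })
    }
}
```

Returning `Result<(), CatalogError>` rather than `Result<bool>` means the caller cannot silently drop a mismatch without at least handling or propagating the error.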

2. Interfaces between lib and DFM (Registering new fault in the system.svg)

That diagram showed FaultMonitor + explicit register_fault() + get_fault() returning Some<Fault>/None. The #4 code was already heading in a different direction - it used Reporter (not FaultMonitor) and had no register_fault() IPC call. I kept the approach the code actually took: Reporter as per-fault handle, catalog hash handshake as implicit registration, iceoryx2 pub-sub services.

I removed Registering new fault in the system.svg since it was the only diagram without a .puml source and didn't match the implementation. new_fault.svg already covers the full registration + reporting flow.

On the DFM side, I added a DfmTransport trait (in dfm_lib/src/transport.rs) to abstract the IPC transport; the original FaultLibCommunicator used iceoryx2 directly. The FaultSinkApi trait on the reporter side was already there from #4, and I kept it.

3. Fault doesn't exist in catalog (new_fault.svg)

The #4 code had Reporter::new() returning Option<Self> - missing fault IDs returned None. I changed it to Result<Self, ReporterError> with a dedicated FaultIdNotFound(FaultId) variant, so the caller gets a clear error with the fault ID. Same fail-fast intent, more informative. The new_fault.puml/.svg are rewritten to match.
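
A minimal sketch of the Option-to-Result change, for readers who want to see the shape of the new error path. Only `Reporter::new()`, `ReporterError`, `FaultIdNotFound(FaultId)`, and the Numeric/Text/Uuid variant names come from this PR; the variant payload types, the `FaultCatalog` internals, and the constructor parameters are assumptions.

```rust
use std::collections::HashSet;
use std::fmt;

/// Variant names follow the PR description; payload types are assumed.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub enum FaultId {
    Numeric(u32),
    Text(String),
    Uuid([u8; 16]),
}

#[derive(Debug, PartialEq, Eq)]
pub enum ReporterError {
    FaultIdNotFound(FaultId),
}

impl fmt::Display for ReporterError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            ReporterError::FaultIdNotFound(id) => {
                write!(f, "fault id {:?} not found in catalog", id)
            }
        }
    }
}

impl std::error::Error for ReporterError {}

/// Simplified stand-in for the catalog: just a set of known fault IDs.
pub struct FaultCatalog {
    ids: HashSet<FaultId>,
}

impl FaultCatalog {
    pub fn new(ids: Vec<FaultId>) -> Self {
        Self { ids: ids.into_iter().collect() }
    }
}

#[derive(Debug)]
pub struct Reporter {
    fault_id: FaultId,
}

impl Reporter {
    /// Fails fast with the offending ID instead of returning a bare `None`.
    pub fn new(catalog: &FaultCatalog, fault_id: FaultId) -> Result<Self, ReporterError> {
        if catalog.ids.contains(&fault_id) {
            Ok(Self { fault_id })
        } else {
            Err(ReporterError::FaultIdNotFound(fault_id))
        }
    }

    pub fn fault_id(&self) -> &FaultId {
        &self.fault_id
    }
}
```

Because the error carries the `FaultId`, a caller can log or propagate it with `?` without having to remember which ID it passed in.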

4. iceoryx2 vs mw::com

Agreed with @FScholPer - iceoryx2 for now, evaluate mw::com migration later. The transport is now isolated behind traits on both sides (FaultSinkApi + DfmTransport), so swapping means implementing those two traits with no business logic changes.
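
To make the "swap the transport by implementing two traits" point concrete, here is a hedged sketch with an in-memory transport built on a bounded std channel. Only the trait names `FaultSinkApi` and `DfmTransport` come from this PR; the method signatures, `FaultRecord` fields, `TransportError`, and the in-memory types are assumptions for illustration.

```rust
use std::sync::mpsc::{sync_channel, Receiver, SyncSender, TrySendError};

/// Simplified record; the real FaultRecord carries more fields.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct FaultRecord {
    pub fault_id: u32,
    pub active: bool,
}

#[derive(Debug, PartialEq, Eq)]
pub enum TransportError {
    QueueFull,
    Disconnected,
}

/// Reporter-side abstraction (trait name from the PR, signature assumed).
pub trait FaultSinkApi {
    fn publish(&self, record: FaultRecord) -> Result<(), TransportError>;
}

/// DFM-side abstraction (trait name from the PR, signature assumed).
pub trait DfmTransport {
    fn try_receive(&self) -> Option<FaultRecord>;
}

pub struct InMemorySink(SyncSender<FaultRecord>);
pub struct InMemoryTransport(Receiver<FaultRecord>);

/// Builds a connected sink/transport pair backed by a bounded channel.
pub fn in_memory_pair(capacity: usize) -> (InMemorySink, InMemoryTransport) {
    let (tx, rx) = sync_channel(capacity);
    (InMemorySink(tx), InMemoryTransport(rx))
}

impl FaultSinkApi for InMemorySink {
    fn publish(&self, record: FaultRecord) -> Result<(), TransportError> {
        // try_send never blocks: a full queue surfaces as backpressure to the caller.
        self.0.try_send(record).map_err(|e| match e {
            TrySendError::Full(_) => TransportError::QueueFull,
            TrySendError::Disconnected(_) => TransportError::Disconnected,
        })
    }
}

impl DfmTransport for InMemoryTransport {
    fn try_receive(&self) -> Option<FaultRecord> {
        self.0.try_recv().ok()
    }
}

/// Business logic depends only on the trait, not on the concrete transport.
pub fn drain_active(transport: &dyn DfmTransport) -> Vec<u32> {
    let mut active = Vec::new();
    while let Some(record) = transport.try_receive() {
        if record.active {
            active.push(record.fault_id);
        }
    }
    active
}
```

An iceoryx2-backed or mw::com-backed implementation would provide the same two traits, and `drain_active` would stay unchanged.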

5. Diagram updates

All diagrams are now up to date in docs/puml/:

| Diagram | Status |
|---|---|
| fault_catalog.puml/.svg | Updated - current init + hash verification flow |
| new_fault.puml/.svg | Updated - ReporterApi::new() with Err(FaultIdNotFound) path |
| new_enable_condition.puml/.svg | New - EC registration and report_status() flow |
| enable_condition_ntf.puml/.svg | New - cross-app EC status notifications via DFM |
| local_enable_condition_ntf.puml/.svg | New - local (same-app) EC notifications, fast path without IPC |
| query_clear.puml/.svg | New - DFM query/clear IPC protocol |
| Registering new fault in the system.svg | Removed - outdated, new_fault.svg covers this |
| lib_arch.drawio/.svg | Updated - renamed FaultMgrClient to FaultManagerSink, fixed typo |
