Merged
Conversation
Enable caching of unknown events. This behavior could be switched on by setting ENABLE_DEFAULT_EVENT_CACHING at compile time. Now it is always on.
Adapt wireshark plugin for newer wireshark versions Wireshark 4.6.0 already decodes vsomeip internal messages. https://www.wireshark.org/docs/dfref/v/vsomeip.html https://gitlab.com/wireshark/wireshark/-/merge_requests/18760 As both protocols have the same name, wireshark complains of our plugin having the same name as the official one. This just changes the protocol name to vsomeip3 to allow having both plugins.
Improve logs for registration process Logs during registration process could be a bit misleading. Clarify when a client is being queued for registration/deregistration, and when the client is actually being processed Example: Current logs // app is requesting to register Application/Client 1235 @ 160.10.4.1:31400 is registering // we do not know when the daemon is actually processing the registration Proposed logs // register request from the app Queueing a register request for application/client 1235 @ 160.10.4.1:31400 // actual log when daemon processes the registration Application/Client 1235 is registering
It is a somewhat psychotic test, where first host offers a service, and a second host offers the same service again, to check that collisions do not break communication for the first offer That is fine, however the second host would.. trigger a shutdown method.. through the service that has a collision? Which makes no sense, and was not deterministic. It is a leftover from b69aac5 Instead, have the high-level test logic stop the second host
Rework "Routing Ready" message introduce internal "Routing Ready" message (configurable by setting INTERNAL_ROUTING_READY_MESSAGE flag during compilation) log external "Routing Ready" also after resume
If an application (or libvsomeip itself) accidentally closes a descriptor, following syscalls receive EBADF There are plenty of libraries (e.g., libsystemd, libudev, ...) that will immediately SIGABRT on such a case, to signal application misbehavior Follow the same principle, and add a SIGABRT on EBADF. Furthermore, add a helper that can be used in the codebase to trigger SIGABRTs
Fix pending subscription tests return codes and add documentation Fixed return codes. Fixed send_alternating_subscribe_unsubscribe_same_port and send_subscribe_stop_subscribe_subscribe. Added documentation.
0aee1ea to
0f7f7b3
Compare
Update documentation based on recent commits. No functional change other than updating documentation. Remove VSOMEIP_ABORT_ON_CRIT_SYSCALL_ERROR: missing 93a0ffd. While at it add a comment that non-JSON files are now ignored.
Skip processing for non-JSON extension files showing a warning when it is routing manager. While iterating directory paths for non-mandatory files the routing manager ends up opening/reading files that are not even JSON. Parser fails causing a warning "Configuration may be incomplete" after the costs of opening & reading the file. Projects have files (e.g. vsomeip security policies) that ended up deciding for the same paths used by vsomeip JSON files.
This commit restructures the local communication endpoints, replacing an inheritance-based design with a composition-based approach. Key Changes New Abstraction Layers local_socket: Protocol-agnostic interface for bidirectional socket communication Implementations: local_socket_tcp_impl and local_socket_uds_impl Handles async read/write operations and connection management local_acceptor: Protocol-agnostic server socket interface Implementations: local_acceptor_tcp_impl and local_acceptor_uds_impl Manages incoming connection acceptance local_receive_buffer: Message parsing and buffer management Handles protocol framing (start/end tags removed in favor of direct command parsing) Implements buffer growth/shrink strategies Unified Endpoint Classes local_endpoint: Single class replacing 4 endpoint types: local_tcp_client_endpoint_impl local_tcp_server_endpoint_impl local_uds_client_endpoint_impl local_uds_server_endpoint_impl local_server: Manages server-side connection lifecycle Handles client handshake (ASSIGN_CLIENT/CONFIG commands) Creates and manages local_endpoint instances for each client Uses tmp_connection helper for initial handshake State Management Changes Explicit state machine in local_endpoint: STOPPED → CONNECTING → CONNECTED Non-restartable design (endpoints cannot be restarted after stop) Type Safety Changes endpoint_manager_base methods now return specific types: find_local() returns std::shared_ptr<local_endpoint> (previously endpoint) create_routing_root() returns std::shared_ptr<local_server> (previously endpoint) Protocol Handling Changes Magic cookie tags (0x67 0x37 0x6d 0x07 / 0x07 0x6d 0x37 0x67) removed from local communication Handshake uses protocol commands directly (instead of magic packet for uds) Client assignment moved to local_server::tmp_connection Messages are not send separately Code Organization Socket I/O, message parsing, and endpoint logic separated into distinct components Protocol-specific behavior isolated in socket implementations Implementation Changes ~2000 lines of duplicated client/server TCP/UDS endpoint code consolidated into local_endpoint routing_manager_impl, routing_manager_client, and routing_manager_stub updated to use new types unit tests were introduced for the endpoint layer to catch simple mistakes quicker Migration Impact Internal API changes (public vsomeip API unchanged) Endpoint restart behavior changed: restart() now stops and recreates endpoints rather than reusing them Socket options configured in stream_socket implementations rather than endpoint classes This refactoring consolidates local communication code while maintaining protocol-specific behavior in dedicated socket implementations.
There is Android-specific "fallback loading" of security policies, introduced in 94f113c, which adds the following logic: if given client host is unknown (e.g., ""), load own host (e.g., "android-idc") The loading is not idempotent, so every time an Android libvsomeip client would receive a RIE_ADD_CLIENT, it would re-load its' policy, due to add_known_client(.., "") 94f113c then massively worsens the situation, by adding another add_known_client(..., "") at every RIE_ADD_SERVICE_INSTANCE This is what causes the security policy to be loaded countless times in boardnetsettings HAL in , causing it to consume immense amounts of CPU - it is a HAL that creates 15 libvsomeip applications and requests >100 services Fix this by making sure security policy loading is idempotent, including the Android-specific fallback
Bump minimum version of Boost to 1.74
If a user either closes or messes with the descriptors that the boost::asio::io_context uses (in Linux, a epollfd), it is possible that the io threads exit This is a terrible scenario - libvsomeip application no longer works, few signs aside the threads exit (and application::start terminating, which libCommonAPI does nothing about) Therefore: if the io threads exit, and the user did not invoke application::stop, do SIGABRT There is an unfortunately long history to the topic, see 71cd2d5, , , ...
It is better to do std::abort from a library point-of-view, because it guarantees a SIGABRT, whereas std::terminate may still invoke user code See for example While at it, fix a double-evaluation in VSOMEIP_TERMINATE macro
Reset is sending flag for UDP server endpoint. An issue has been identified after STR where the the unicast socket for SD stopped sending data to only one of its targets. Moreover, the dangerous ESQ trace that depicts when a sending queue surpasses the warning limit was being logged for that endpoint, this indicated that the data for that target reached the internal queue but never the network, leading to believe that could be related with the is_sending flag not being reset if an asynchronous operation was somehow halted/corrupted during STR. This PR introduces some further traces to track targets unable to send data for the SD endpoint and resets the is_sending_ flag for all UDP server endpoint on resume.
Check config for app name and id If an app with clientID X and name Y de-registers and another Y name app comes up and takes its clientID, if the original app re-registers again, routing manager will accept the connection believing it's the exact same application. This will then lead to issues where the second app will detect lost of connection due to the replacing of connection and try to register again, leading to a loop of register->de-register between the two apps that share the same clientID X and name Y. To fix this, check the config for configured app names, if it exists, then it's OK, otherwise, continue with the clientID assignment Note: there are several apps that do not specify a name, so it will default to "" which can lead to the issue described above.
Change epoll_wait to check EINVAL in addition to EBADF. Change react to support an errno (0 means use global) and add an additional check for epoll_wait checking for EINVAL. While at it, support epoll_pwait in addition to epoll_wait. As a cleanup, fix return type for several other wrappers. Keep original strings unchanged because of existing tools/parsers. ,
Minor documentation update
Silly regression from 7b3dedc, was causing local endpoints to re-connect forever When using TCP, it can cause "Cannot assign requested address", due to old-endpoints-connecting-forever mixed with new endpoint using the same address
vsomeip_sec_sync_client is what allows mapping an ip/port to a uid, when using libvsomeip-sec Unfortunately there is a gap in the configuration, and trying to map the ip/port of the outgoing connection to vsomeipd to the uid of vsomeipd results in an error (this mapping now happens due to 7b3dedc) Workaround this by "pretending" that the outgoing connection uses the remote address of the incoming connection - libvsomeip-sec does have a configuration for that one
Do not send STOP OFFER when registering an application Fixes an issue where a second registration attempt made by an already registering application would cause services that where offered to be stop offered and not offered again Steps leading to the issue: Daemon receives an error on connection with server and triggers DEREGISTER_ON_ERROR Server simultaneously starts a registration attempt (ST_REGISTERING) Daemon is still processing deregistration and queues registration request from server Daemon finishes deregistration, closes connection with the server, and starts processing pending registration request Connection closed on step 4. triggers error handling on server and a new registration attempt (ST_REGISTERING again) Daemon queues second registration attempt, while processing the first one Daemon finishes processing first registration, services are offered, server changes to ST_REGISTERED Daemon starts processing second registration, services are stop offered Services are not offered again due to server already being on ST_REGISTERED This partially reverts 9abdb5f, now STOP OFFERs are only sent when an application is being deregistered (The revert was already tested and no regression was found)
Consistency across VSOMEIP logging at RMC This PR does not include any functional change. It makes analysis and troubleshooting easier if log messages are standardized. Changes are motivated to promote consistency across logging and support at log analysis. Note: vSomeIP Security logs remain untouched as I'm afraid it may break some BAT Tests which rely somehow on these messages.
Fix missing unavailability on suspend_resume_test_initial Sporadically, the test fails due to a logic limitation. Root Cause: The client waits to receive unavailable followed by available after sending the STR message to the service. In the first iteration (network down → network up), the unavailable/available state transitions sporadically occur instantaneously. Although is_available_ correctly becomes false, the test thread cannot capture this state change because it immediately becomes true again before the condition variable check. Solution: Introduced a new flag was_unavailable_ that acts as a "sticky bit" to track state transitions: Set to true when the service becomes available and the previous state was unavailable. This ensures that the test knows the service went through the unavailable state. The flag is only cleared after sending the STR (Suspend/Resume) message to the service, ensuring proper synchronization
le::status acquires mutex_, same as what is held in le::start, which results in a deadlock if this error happens
Follow-up to a9b4e9f, it is the lower-byte, not the top-byte that needs to be modified (otherwise there are still libvsomeip-sec issues)
local_routing_test_service would receive requests, immediately respond, and immediately stop on the last one There is no guarantee of message delivery on stop, and so test would timeout as local_routing_test_client would never receive responses
Ensure stop_offer/offer completes before processing next command Sporadically, even though STOP_OFFERs and OFFERs are sent in the correct order, the routing manager takes longer to process the STOP_OFFER than the OFFER and ends up sending the availability out of order, that is, it sends the ON_AVAILABLE first and only then the ON_UNAVAILABLE. This issue seems to be due to the premature removal of offer commands from the queue. Early Command Removal: The erase_offer_command() was being called before the asynchronous on_availability() notification completed. Queue Processing Gap: When erase_offer_command() removed the STOP_OFFER from the queue, the next OFFER command would immediately start processing. Async Callback Delay: The STOP_OFFER's asynchronous endpoint cleanup callbacks would still be pending and could execute after the OFFER completed, sending ON_UNAVAILABLE out of order. In order to ensure the correct order, the erase_offer_command() call must be made after all notifications have been sent, only allowing the next command in the queue to be processed after the previous one has finished processing. Key Improvement: Guaranteed Notification Order: By keeping the command in the queue until all notifications complete, we prevent the next OFFER from processing prematurely.
Fix and optimize offer_test_external Test updated to remove the sleeps. Additionally, it was verified that, the way the test was being initialized, the external service (offer_test_service_external) never got to run because when the 3 seconds passed and the slave container was started, the client had already finished execution and immediately sent the kill signal to the slave. Changes: offer_test_external_master_starter.sh.in: -> Remove all sleeps offer_test_external_sd_msg_sender.cpp: ->Mechanism added to wait for the master service to start offering the service offer_test_service_external.cpp: -> Wait until master service is available
Remove the server_endpoint interface This interface is no longer needed after the refactoring of the local communication.
Description Same as 80341ae, but for a bajillion other tests It is always the same pattern, either: client sends a shutdown request and waits before stop, or server sends a shutdown response and waits before stop
Fix E2E tests return codes and add documentation Fixed return codes. Removed some magic numbers from source code. Added documentation.
Enables existent logging These logs are very useful for day-to-day analysis. They allow us to see exactly when the boardnet service is being offered and when it is stopped.
Improve logging for initial event tests For some test cases logs like this would appear when launching initial_event_test_service Unknown argument: CLIENT_SUBSCRIBES_TWICE This is not an error but could be a bit confusing when analyzing the test. Change the logs so that Unknown argument is only logged is a wrong argument is passed
Optimize external_local_payload_test_client_external The test starter uses a function to verify if the service sockets are open, but unlike other starters, it checks after launching the client instead of checking after launching the service. Additionally, to verify if the sockets are open, a 2-second sleep is performed, but I don't think it makes sense to always wait 2 seconds because in the case of sanitizers, it is almost instantaneous. Therefore, to optimize execution times, a while loop was created that checks the socket status every 250ms with a maximum limit of 2 seconds. This way, as soon as the sockets are open, the test can start, avoiding the static 2-second wait.
Eliminate the risk of desynchronisation of the list of servers to be restarted Since 4ef740c, the lock on endpoint_mutex_ isn't hold while restarting the server. So there is a risk that the list of active servers is modified while the endpoint_manager_impl restarts them during a resume. The consequence is that the old server occupies an endpoint and prevents its replacement from taking its place. We can reduce the likelihood of such an event occurring. And when it occurs, we can correct the consequences. Secondly, this problem should not occur because it is not normal for the old server not to have self-destructed, as it should no longer be referenced.
Remove initial/end tag check With 7b3dedc, start and end tags were removed so vsomeip-dissector had to be updated to no longer count on this 8 bytes. Note: this breaks compatibility with pcaps that had versions previous to the introduction of the PR mentioned above, you can use the official wireshark vsomeip dissector for those.
If a user does application::start, then stop, then start, there is no executor_work_guard anymore, so the io threads could exit due to lack of outstanding work It is theoretical, caught in code analysis
Add missing #include <unordered_map> Remove unused internal.hpp Android build is complaining, local_server.hpp:187:10: error: no template named 'unordered_map' in namespace 'std'; did you mean 'unordered_set'? And the missing internal.hpp on Android side.
Replaced all usages of std::lock_guard with std::scoped_lock across the codebase. This aligns the project with modern C++17 best practices and provides cleaner, more flexible mutex handling. The main motivation for this PR is to ensure consistent mutex-handling practices
In case of an error, a lock was incorrectly held, leading to a deadlock In short: lati::accept_cbk receives connection lati::accept_cbk errors with ENOTCONN when reading remote address (could happen if connection is already kaputt) lati::accept_cbk calls ls::accept_cbk.. while incorrectly holding mtx ls::accept_cbk calls lati::async_accept, which tries to acquire mtx deadlock
Otherwise causes failures specifically with builds that only do vsomeip_ctrl no def for std::scoped_lock
Change type for getsockopt Android is complaining, local_socket_uds_impl.cpp:114:101: error: cannot initialize a parameter of type 'socklen_t *' (aka 'int *') with an rvalue of type 'unsigned int *'
After every single kill, the test script must wait, because otherwise the process might receive: signal from kill and then signal from shell termination Unfortunately 2) can cause valgrind issues, but even 1) is problem, as we can lose logs of the process exiting Change all tests to wait after every single kill, check for exit code. Furthermore, fix a tsan issue uncovered by these changes in routingmanagerd
Ensure memory keeps alive until callback is invoked keep a shared_ptr of the local_receive_buffer and copy it into the callback. Because the callback does not keep alive the owner of the buffer as in the other places, the callback needs to ensure that at least the memory used by asio is kept alive. This avoids circular ownership problems, while protecting against memory corruption.
The following (somewhat unlikely) sequence can happen: le is created to connect to vsomeipd, from a guest vsomeipd is unreachable connection attempts timeout, leads to le::connection_cbk connection attempts are exhausted, so le::connection_cbk does le::escalate, calls error handler however, before le::escalate, it does socket_->stop(), which concurrently calls le::connection_cbk with "operation canceled" concurrent le::connection_cbk sets state_e::STOPPED due to state_e::STOPPED, le::escalate does nothing This is possible because we're no longer using strands, and because le::mutex_ is released between socket_->stop() and le::escalate Rework the code to solve the last point, that alone eliminates the problem
Ensure lazy load before using security During the refactoring in: 7b3dedc the add_known_client was moved after the creation of a local_endpoint. But the creation method of the local_endpoint internally uses the security model to ensure this endpoint is allowed to be used. This was done because: It was not understood that add_known_client needs to be called strictly before the check, there was no remove_client method in the rh interface to remove the client when the security check failed. Both of these points are now understood and adjusted in this PR. This bug did not surface, because the lazy load + security check needs to be used for a client-server relation across containers.
As stands, vsomeip_sec_client_t::host is uninitialized with unix sockets, which leads to valgrind issues when using CommonAPI
Trigger on_stop_offer_service coherently after on_unavailable everywhere. It is a consistency fix, first and foremost
Which gives visibility into what steps the bash scripts do, without any other changes. We could even remove quite a few logical "print" statements Without something like this, it can be very difficult to understand which executable exited with an error
Avoid wrong security checks on the sender endpoints of the router and harden the local_server Sequence that revealed all these problems addressed: A container client that did not load the security policy (lazy loaded) tried to connect to the router The router accepted the connection and assigned a client id The client received the client id The client started the receiver (triggered a async_accept) The client dispatched the register_application command + the config command The router enqueued the registration answer process on some other thread The router started to load the security policy for the received config command The routers endpoint was able to connect to the client The router executed a "client" security check which failed The router dropped the sender towards the client The clients error handler is invoked The client invoked rmc::reconnect but a credentials check within fails The client does not stop the receiver The register_application timer expires, not stopping the receiver, but restarting the sender The loop starts from the beginning Eventually the router is no longer rejecting the client (the policy loading was quick enough), but now the pilled up async_accepts fail with "Bad file descriptor". Problems address in this PR: The sender of the router should not perform the security check (be aware it is a router endpoint) The local_server should be able to handle subsequent start, start invocations, Sockets used in the async_accept callback should not be replaced, before the callback is invoked. remove the credential check when re-attempting to connect to the router
vSomeIP-Lib 3.6.2 Release Enable caching of unknown events Adapt wireshark plugin for newer wireshark versions Improve logs for registration process tests: fix offer_test_external Rework "Routing Ready" message misc: SIGABRT on EBADF Fix network_tests/pending_subscription_tests Fix for removed env and tweaks information on JSON files Skip non-JSON files iterating config paths restructure local endpoints conf: fix lazy loading of security policies Bump minimum version of Boost to 1.74 app: SIGABRT on unexpected io exit misc: use std::abort instead of std::terminate Reset sending flag for UDP server endpoint check config for app name and id epoll_wait check for EINVAL and react supporting errno Update documentation le: fix max-connection-attempts lsti: fix policy check do not send STOP OFFER when registering application rmc: Unified log message structure suspend_resume_test_initial – fix missing unavailability le: fix silly deadlock lsti: fix byte order tests: fix local_routing fragility Ensure stop_offer/offer completes before next command Optimize offer_test_external Remove the server_endpoint interface tests: fix more fragility Fix E2E return codes Set log as info for boardnet OFFER/STOP OFFER improve logging for initial event tests Optimize external_local_payload_test_client_external workaround to race condition with restart list dissector: remove initial/end tag check app: fix missing executor guard Add missing include for unordered_map tests: include unit-tests in network-tests replaces std::lock_guard with std::scoped_lock lati: fix accept deadlock build: fix missing headers Change type for getsockopt tests: wait after kill Ensure buffer is not cleaned up while asio might use it le: fix escalate race Ensure lazy load of client, before using security le: fix uninitialized sec_client rmi: fix order of on_*offer_service/on_available tests: add bash -x Avoid security failures on the daemon side
0f7f7b3 to
06c35de
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
No description provided.