Merge Feature/async-api-variable-collectives into FMI origin#22
Merge Feature/async-api-variable-collectives into FMI origin#22
Conversation
| @@ -0,0 +1,8 @@ | |||
| @PACKAGE_INIT@ | |||
There was a problem hiding this comment.
I think it would be easier to keep the TCPunch submodule and open a second PR in the other repository; would that be okay with you?
|
|
||
| //! Send data to peer with id dest, must match a recv call | ||
| virtual void send(channel_data buf, FMI::Utils::peer_num dest) = 0; | ||
| virtual void send(std::shared_ptr<channel_data> buf, FMI::Utils::peer_num dest) = 0; |
There was a problem hiding this comment.
I'm not sure why this is necessary - if you implemented data storage within channel_data as a shared ptr, then why do we need a pointer here? It looks like we have a nested shared pointer, and I'm not sure this is strictly necessary.
| struct channel_data { | ||
| char* buf; | ||
| std::size_t len; | ||
| std::shared_ptr<char[]> buf; |
There was a problem hiding this comment.
I think that having RAII-style memory management is a good idea. Now, if we manage everything internally, then do we need a shared_ptr if we can just implement a destructor that frees memory?
Question is: are there situations where a single pointer is shared between multiple owners (possibly threads), and it's not easy to keep track who owns what?
There was a problem hiding this comment.
Creating shared_ptr only to pass a noop_deleter looks a bit like the wrong abstraction.
| explicit channel_data(std::size_t length) | ||
| : buf(std::shared_ptr<char[]>(new char[length])), len(length) {} | ||
|
|
||
| // From raw pointer with custom deleter (for external buffers) |
There was a problem hiding this comment.
I'm a bit confused because it looks like we have three classes of resources here: owned data, external buffer, and "original reference". Perhaps we need additional documentation to explain what are the semantics of each class.
| FMI::Utils::fmiContext* context, FMI::Utils::Mode mode, | ||
| std::function<void(FMI::Utils::NbxStatus, const std::string&, | ||
| FMI::Utils::fmiContext*)> callback) { | ||
| // ClientServer doesn't support true non-blocking - just call blocking version |
There was a problem hiding this comment.
In theory, one could implement the non-blocking version by having a thread polling on the storage in the background - we can leave a note in the documentation
|
|
||
| hostname = params["host"]; | ||
| port = std::stoi(params["port"]); | ||
| if (model_params["resolve_host_dns"] == "true") { |
There was a problem hiding this comment.
Can you explain what is the purpose of this additional step?
| return it; | ||
| } | ||
|
|
||
| int socketfd = it->first; |
There was a problem hiding this comment.
Is my understanding correct: every time we call channel_event_progress, we check the state of each IOState separately?
I'm thinking if the entire implementation could not be simplified by a single epoll - create epoll structure, add all socket file descriptors, and then poll events in the loop. For that, you can use a blocking epoll with a timeout. This way, we skip all the unnecessary checks, and instead of looping all the time across all sockets, the process will sleep waiting for the next event.
| {"host", "127.0.0.1"}, | ||
| {"port", "10000"}, | ||
| {"max_timeout", "1000"} | ||
| {"max_timeout", "5000"} |
There was a problem hiding this comment.
I think that we still use the "5000" as a magic number somewhere in the code :)
This pull request implements changes made based on including FMI as a Cylon communicator. You can find details about Cylon here: https://cylondata.org. The referenced Cylon is under review to be incorporated into the main branch for an upcoming v2 release: cylondata/cylon#691.
1. Non-Blocking I/O with Callbacks
Added async send/recv operations with callback-based completion notification:
Key components:
Modeenum:BLOCKING/NONBLOCKINGNbxStatusenum: Detailed error codes (SUCCESS, SEND_FAILED, RECEIVE_FAILED, NBX_TIMEOUT, etc.)EventProcessStatusenum: Progress tracking (PROCESSING, EMPTY, NOOP)fmiContextstruct: Completion context for tracking operationschannel_event_progress(): Poll-based progress function2. Variable-Length Collective Operations
Added MPI-style variable-length collectives:
Note:
recvcountsanddisplsare in bytes, not elements (consistent with channel_data's byte-based API).3. Memory Management with shared_ptr
Replaced raw pointers with
std::shared_ptr<channel_data>throughout:Benefits:
noop_deleterfor external buffer management4. CMake Configuration
Added configurable Boost linking:
Build with shared Boost (e.g., from conda):
cmake -DFMI_BOOST_STATIC=OFF -DCMAKE_PREFIX_PATH=$CONDA_PREFIX ..