Merge Feature/async-api-variable-collectives into FMI origin#22

Open
mstaylor wants to merge 16 commits into spcl:main from mstaylor:feature/async-api-variable-collectives

Conversation


@mstaylor mstaylor commented Feb 3, 2026

This pull request implements the changes required to use FMI as a Cylon communicator. You can find details about Cylon here: https://cylondata.org. The referenced Cylon branch is under review for incorporation into the main branch for an upcoming v2 release: cylondata/cylon#691.

1. Non-Blocking I/O with Callbacks

Added async send/recv operations with callback-based completion notification:

// Non-blocking send with callback
ch->send(buf, dest, &ctx, FMI::Utils::NONBLOCKING,
    [](FMI::Utils::NbxStatus status, const std::string& msg, FMI::Utils::fmiContext* ctx) {
        if (status == FMI::Utils::SUCCESS) {
            // Send completed
        }
    });

// Poll for completion
while (!completed) {
    ch->channel_event_progress(FMI::Utils::send);
}

Key components:

  • Mode enum: BLOCKING / NONBLOCKING
  • NbxStatus enum: Detailed error codes (SUCCESS, SEND_FAILED, RECEIVE_FAILED, NBX_TIMEOUT, etc.)
  • EventProcessStatus enum: Progress tracking (PROCESSING, EMPTY, NOOP)
  • fmiContext struct: Completion context for tracking operations
  • channel_event_progress(): Poll-based progress function

2. Variable-Length Collective Operations

Added MPI-style variable-length collectives:

// Gatherv - gather variable amounts from each peer
std::vector<int32_t> recvcounts = {4, 8, 12, 16};  // bytes per peer
std::vector<int32_t> displs = {0, 4, 12, 24};       // byte displacements
ch->gatherv(sendbuf, recvbuf, root, recvcounts, displs);

// Allgather - gather fixed amounts, distribute to all
ch->allgather(sendbuf, recvbuf, root);

// Allgatherv - gather variable amounts, distribute to all
ch->allgatherv(sendbuf, recvbuf, root, recvcounts, displs);

Note: recvcounts and displs are in bytes, not elements (consistent with channel_data's byte-based API).

3. Memory Management with shared_ptr

Replaced raw pointers with std::shared_ptr<channel_data> throughout:

// Automatic memory management
auto buf = std::make_shared<channel_data>(size);

// External buffer with custom deleter (no-op for stack/external memory)
auto buf = std::make_shared<channel_data>(external_ptr, size, noop_deleter);

Benefits:

  • Automatic cleanup, no memory leaks
  • Safe sharing between async operations
  • noop_deleter for external buffer management

4. CMake Configuration

Added configurable Boost linking:

option(FMI_BOOST_STATIC "Use static Boost libraries" ON)

Build with shared Boost (e.g., from conda):

cmake -DFMI_BOOST_STATIC=OFF -DCMAKE_PREFIX_PATH=$CONDA_PREFIX ..
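One way the option can be wired into CMake's FindBoost module (a sketch; the target name `fmi` is a placeholder and the PR's actual CMakeLists may differ):

```cmake
option(FMI_BOOST_STATIC "Use static Boost libraries" ON)

if(FMI_BOOST_STATIC)
  set(Boost_USE_STATIC_LIBS ON)    # link libboost_*.a
else()
  set(Boost_USE_STATIC_LIBS OFF)   # link libboost_*.so (e.g. conda's shared Boost)
endif()

find_package(Boost REQUIRED COMPONENTS system)
target_link_libraries(fmi PRIVATE Boost::system)
```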

@@ -0,0 +1,8 @@
@PACKAGE_INIT@
Contributor:

I think it would be easier to keep the TCPunch submodule and open a second PR in the other repository; would that be okay with you?


//! Send data to peer with id dest, must match a recv call
virtual void send(channel_data buf, FMI::Utils::peer_num dest) = 0;
virtual void send(std::shared_ptr<channel_data> buf, FMI::Utils::peer_num dest) = 0;
Contributor:

I'm not sure why this is necessary - if you implemented data storage within channel_data as a shared ptr, then why do we need a pointer here? It looks like we have a nested shared pointer, and I'm not sure this is strictly necessary.

struct channel_data {
char* buf;
std::size_t len;
std::shared_ptr<char[]> buf;
Contributor:

I think that having RAII-style memory management is a good idea. Now, if we manage everything internally, then do we need a shared_ptr if we can just implement a destructor that frees memory?

Question is: are there situations where a single pointer is shared between multiple owners (possibly threads), and it's not easy to keep track who owns what?

Contributor:

Creating shared_ptr only to pass a noop_deleter looks a bit like the wrong abstraction.

explicit channel_data(std::size_t length)
: buf(std::shared_ptr<char[]>(new char[length])), len(length) {}

// From raw pointer with custom deleter (for external buffers)
Contributor:

I'm a bit confused because it looks like we have three classes of resources here: owned data, external buffer, and "original reference". Perhaps we need additional documentation to explain what are the semantics of each class.

FMI::Utils::fmiContext* context, FMI::Utils::Mode mode,
std::function<void(FMI::Utils::NbxStatus, const std::string&,
FMI::Utils::fmiContext*)> callback) {
// ClientServer doesn't support true non-blocking - just call blocking version
Contributor:

In theory, one could implement the non-blocking version by having a thread poll the storage in the background; we can leave a note about this in the documentation.


hostname = params["host"];
port = std::stoi(params["port"]);
if (model_params["resolve_host_dns"] == "true") {
Contributor:

Can you explain what is the purpose of this additional step?

return it;
}

int socketfd = it->first;
Contributor:

Is my understanding correct: every time we call channel_event_progress, we check the state of each IOState separately?

I'm wondering whether the entire implementation could be simplified with a single epoll: create the epoll structure, add all socket file descriptors, and then poll for events in a loop. For that, you can use a blocking epoll with a timeout. This way we skip the unnecessary checks, and instead of looping across all sockets all the time, the process sleeps waiting for the next event.

{"host", "127.0.0.1"},
{"port", "10000"},
{"max_timeout", "1000"}
{"max_timeout", "5000"}
Contributor:

I think that we still use the "5000" as a magic number somewhere in the code :)
