Conversation

@mdorier (Contributor) commented Nov 26, 2025:

This PR adds documentation for the memory ownership expected of the raft_event structure passed to raft_step, addressing #194.
Hopefully I got it right (please double-check).

@freeekanayaka (Member) left a comment:

Thanks a lot for this work; it actually prompts me to clean up the current behavior a bit :)

See my comments; glad to have your input on them.

* raft_step(), regardless of success or error. The entries must be allocated
* such that all entries in a batch share the same batch pointer, and each
* distinct batch pointer appears only once in the entries array for proper
* cleanup.

Member:

This is correct. However, I think the behavior of the error-handling part is unintended and just a result of the v0-to-v1 API refactoring.

I would like to change this to be consistent with what happens with event->start.metadata, in the sense that the ownership of the entries pointer passed to raft_step() stays in the hands of the caller.

Basically the idea is that all data passed to RAFT_START via a pointer (i.e. start.metadata and start.entries) is passed read-only and ownership is NOT transferred. This should be true for the top-level pointer itself and for nested pointers/fields too.
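
For illustration, caller-side code under this model would look roughly like this (a sketch only; the raft_step() signature with an output raft_update argument, and the event.type and start.n_entries field names, are assumptions):

/* Sketch of the proposed RAFT_START ownership model: everything passed
 * via pointers stays owned by the caller. */
struct raft_event event;
struct raft_update update;

event.type = RAFT_START;            /* assumed field name */
event.start.metadata = &metadata;   /* read-only, not transferred */
event.start.entries = entries;      /* read-only, not transferred */
event.start.n_entries = n;          /* assumed field name */

int rv = raft_step(r, &event, &update);

/* On success and error alike, metadata and the entries array remain the
 * caller's to clean up. */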

I'd prepare a new PR for this and you can update this branch after it gets merged, wdyt?

Contributor Author:

Sounds good. I like the idea of everything being copied except for the actual data of entries; that way I can use an std::vector<raft_entry> as input, which will automatically be freed, as opposed to having to call raft_malloc. Happy to change the documentation when you merge.

Member:

This has been changed as well in #200. I'll address the rest of the cleanups for the other events separately.

* field is transferred to raft (moved into r->configuration). The metadata
* structure itself is NOT freed by raft and must be freed by the caller after
* raft_step() returns, regardless of success or error. The caller must NOT
* close the configuration field after the call, as it has been transferred.

Member:

This is correct. However, I think it would probably be better to make a copy of the nested configuration field, so the ownership of the metadata structure stays entirely in the hands of the caller (see the other comment below).

Contributor Author:

That would make more sense, yes.

Member:

This has now been changed in #200, which I already merged.

Contributor Author:

I have updated the documentation for RAFT_START; can you double-check? (I also detailed a bit more how the batch pointer works -- did I get it right? Do I understand correctly that ALL the entries must come from the same batch, i.e. I cannot have two different batches in the same call to raft_step?)

@freeekanayaka (Member) commented Dec 3, 2025:

> I have updated the documentation for RAFT_START; can you double-check? (I also detailed a bit more how the batch pointer works -- did I get it right? Do I understand correctly that ALL the entries must come from the same batch, i.e. I cannot have two different batches in the same call to raft_step?)

Yes, you are getting it right. The way I'm envisioning this to be used is that user code will load the entries from disk at startup, allocating a chunk of contiguous memory for them (that's the batch pointer). Then it will pass these entries to the RAFT_START event.

EDIT: actually it does not matter that all entries are part of the same batch; user code can load entries in multiple batches if deemed appropriate.
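
For example, bulk loading one batch per file could look roughly like this (a sketch: the on-disk record format and the file_size()/read_whole_file() helpers are made up, and I'm assuming the usual struct raft_entry layout with term/type/buf/batch fields):

#include <stdint.h>
#include <string.h>

/* Sketch: load all entries stored in one file into a single contiguous
 * allocation, which becomes the batch pointer shared by those entries. */
static int load_batch(const char *path, struct raft_entry *entries, unsigned n)
{
    size_t size = file_size(path);        /* hypothetical helper */
    uint8_t *batch = raft_malloc(size);
    if (batch == NULL) {
        return RAFT_NOMEM;
    }
    read_whole_file(path, batch, size);   /* hypothetical helper */

    uint8_t *cursor = batch;
    for (unsigned i = 0; i < n; i++) {
        uint64_t term;
        uint64_t len;
        memcpy(&term, cursor, sizeof term);
        cursor += sizeof term;
        memcpy(&len, cursor, sizeof len);
        cursor += sizeof len;
        entries[i].term = term;
        entries[i].type = RAFT_COMMAND;   /* assuming command entries */
        entries[i].buf.base = cursor;     /* points into the batch */
        entries[i].buf.len = (size_t)len;
        entries[i].batch = batch;         /* shared by all n entries */
        cursor += len;
    }
    return 0;
}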

Member:

See also this docstring:

* An entry that was received from the network as part of an AppendEntries RPC

Contributor Author:

Ah, so my understanding was wrong: we can have multiple batch pointers, and they can repeat; we simply cannot have something like ?

Member:

One more thing: for the RAFT_START event it's ok to have more than one batch of entries (like in the example I wrote earlier, 7 entries in 2 batches), because client code could use several files to store entries, and when bulk loading them at startup a convenient strategy would be to have one batch for each file being loaded.

However, for the RAFT_RECEIVE event it's currently expected that all entries belong to the same batch. That's because client code will need to implement receiving an AppendEntries RPC message over the network, and it will typically have some wire-protocol-level way to know the size of the message being received (for example, this information could be contained in a fixed-size message header). Once the message size is known, client code will typically perform a single memory allocation to receive the whole message data with all the entries it contains.
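
In code, the receive path would look roughly like this (a sketch; the wire format and the recv_exactly()/decode_entries() helpers are made up):

#include <stdint.h>

/* Sketch: a fixed-size header carries the payload size, so a single
 * allocation can receive all the entries of the message. */
static int receive_append_entries(int sock, struct raft_entry **entries,
                                  unsigned *n)
{
    uint64_t payload_size;
    recv_exactly(sock, &payload_size, sizeof payload_size);

    void *batch = raft_malloc((size_t)payload_size);
    if (batch == NULL) {
        return RAFT_NOMEM;
    }
    recv_exactly(sock, batch, (size_t)payload_size);

    /* decode_entries() makes each entry's buf.base point into the batch
     * and sets entry->batch = batch, so the whole message can later be
     * released with a single raft_free(batch). */
    return decode_entries(batch, (size_t)payload_size, entries, n);
}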

Hope this explanation is clear.

Member:

Also, note that the batch pointer is not required to point exactly at the beginning of the data of the first entry in the batch. The memory allocation that was performed for the batch pointer could for example be slightly larger (e.g. when receiving an AppendEntries RPC message the batch pointer could have been allocated to also contain a bit of message metadata). The only important thing is that calling raft_free() on that batch pointer will free all memory used by all entries belonging to that batch.
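
For example (purely illustrative sizes):

/* Sketch: the batch allocation also holds a bit of message metadata in
 * front of the entry data; the batch pointer still refers to the start
 * of the allocation. */
size_t header_size = 32;    /* example metadata prefix */
size_t data_size = 1024;    /* example entry data size */
uint8_t *batch = raft_malloc(header_size + data_size);

struct raft_entry entry;
entry.buf.base = batch + header_size;  /* data starts after the metadata */
entry.buf.len = data_size;
entry.batch = batch;                   /* allocation start, not buf.base */

/* Later, raft_free(entry.batch) releases the metadata prefix and all the
 * entry data in one call. */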

Contributor Author:

Yep, that I understood. Thanks for clarifying about the unique/non-unique batch pointers; I'll double-check what I wrote about that.

* - message->append_entries.entries: OWNERSHIP TRANSFERRED immediately upon
* calling raft_step(). The entries array and the batch memory referenced by
* entries[].batch are owned by raft after the call. The caller must NOT free
* this memory, even if raft_step() returns an error.

Member:

This is correct, even though I'm not totally sure that all error paths currently do the correct cleanup. However, I'm wondering if it would be better to return the ownership to the caller in case of error, so that the caller is free to decide the best course of action and the library does not just destroy that data. Thoughts?

Contributor Author:

I think I suggested in my issue that raft_step could zero out the entries it "consumed", so that the caller can run a cleanup regardless of errors. If entries are now NULL, free will be a no-op.
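
Something like this on the caller side (a sketch of the suggested behavior, not of what the library currently does):

/* If raft_step() zeroed out the entries it consumed, one cleanup loop
 * would be safe on both success and error paths, since raft_free() on a
 * NULL pointer is a no-op (with the default allocator, like free(NULL)).
 * Assuming one batch per entry here, for simplicity. */
for (unsigned i = 0; i < n; i++) {
    raft_free(entries[i].batch);
}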

Member:

I've now changed this in #202 to be consistent with the RAFT_START event. The ownership of the message->append_entries.entries array is always retained by the caller. The ownership of the entries data is transferred to the raft library in case of success, or is returned to the caller in case of error.
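
On the caller side this means roughly (a sketch, assuming a single batch as expected for RAFT_RECEIVE, and an entries array allocated with raft_malloc()):

int rv = raft_step(r, &event, &update);
if (rv != 0 && n > 0) {
    /* Error: ownership of the entries data came back, release the batch. */
    raft_free(entries[0].batch);
}
raft_free(entries); /* the array itself is always the caller's to free */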

* - message->install_snapshot.data.base: OWNERSHIP TRANSFERRED immediately upon
* calling raft_step(). The snapshot data buffer is owned by raft after the
* call. The caller must NOT free this memory, even if raft_step() returns an
* error.

Member:

This looks correct. However, I'd probably change the error handling part, and return full ownership to the caller in case of error, without destroying any data. This would be similar to the comment I made for RAFT_APPEND_ENTRIES.

Member:

I've now changed this in #202 to be consistent with RAFT_START and RAFT_APPEND_ENTRIES. That is, ownership of the snapshot data is transferred to the raft library in case of success, and returned to the caller in case of error.

Note there is also a side change in consequence of this. If the RAFT_UPDATE_ENTRIES flag is set after raft_step() returns, then the ownership of the update->entries.batch array is transferred to the caller, who is now in charge of raft_free()-ing it when appropriate.

Contributor Author:

Regarding the note, I think I need to document the ownership model of the returned raft_update. My initial assumption was that the caller is responsible for everything in it (i.e. RAFT will never give it a handle to something internal), which is consistent with what you're saying about update->entries.batch. Is my assumption correct more generally?

@freeekanayaka (Member) commented Dec 3, 2025:

> Regarding the note, I think I need to document the ownership model of the returned raft_update. My initial assumption was that the caller is responsible for everything in it (i.e. RAFT will never give it a handle to something internal), which is consistent with what you're saying about update->entries.batch. Is my assumption correct more generally?

Yes, that's correct. Roughly it's something like this:

struct raft_update
{
    unsigned flags;
    struct
    {
        raft_index index; /* 0 if no change */
        /* Ownership of the entry array and entries data belongs to the caller. The caller
         * is required to persist the entries to disk and it will also typically want to maintain
         * a cache of them in memory, so it's easy to retrieve them when it comes to send
         * a RAFT_APPEND_ENTRIES message (when the RAFT_UPDATE_MESSAGES flag is set).
         * 
         * NOTE: the raft library never maintains any copy of any entries data itself. So
         * when calling raft_step() with a RAFT_RECEIVE event containing RAFT_APPEND_ENTRIES
         * message, the ownership of the entries data is transferred to raft during the raft_step() call,
         * but raft itself won't copy the data, so in case it sets the RAFT_UPDATE_ENTRIES flag the
         * entries data contained in raft_update->entries.batch will be basically the same memory
         * that was initially passed to the raft_step() call via the RAFT_RECEIVE event. However
         * this detail could be treated as a memory-copy optimization, mostly transparent to
         * the caller. */
        struct raft_entry *batch;
        unsigned n;
    } entries;
    struct
    {
        struct raft_snapshot_metadata metadata; /* Ownership belongs to the caller */
        size_t offset;
        /* Ownership belongs to the caller. Similarly to what happens with raft_update->entries.batch,
         * this field basically echoes back the snapshot data memory that was passed to raft_step()
         * via a RAFT_RECEIVE event containing a RAFT_INSTALL_SNAPSHOT message. */
        struct raft_buffer chunk;
        bool last;
    } snapshot;
    struct
    {
        /* The only message types with nested data are RAFT_APPEND_ENTRIES and
         * RAFT_INSTALL_SNAPSHOT. Since raft does not internally maintain any copy
         * of entries or snapshot data, when handling the raft_update->messages.batch
         * fields the caller must do the following:
         *
         * - if it finds a RAFT_APPEND_ENTRIES message that needs to be sent, it must
         *   fill the raft_message->append_entries.entries field itself, typically retrieving
         *   the relevant entries from the entries cache it maintains. The exact entries
         *   to retrieve are specified by the raft_message->append_entries.prev_log_index
         *   and raft_message->append_entries.n_entries fields.
         *
         * - if it finds a RAFT_INSTALL_SNAPSHOT message that needs to be sent, it must
         *   fill the raft_message->install_snapshot.data field with the relevant snapshot
         *   data, typically retrieving it from disk using the identifying details contained in
         *   raft_message->install_snapshot.last_index/last_term fields.
         *
         * In other words, for these two message types raft does not fill all fields, but
         * rather leaves the responsibility of some of them to the caller.
         *
         * This is done so that raft stays agnostic about how and where actual entries/snapshot
         * data is kept, and leaves freedom to the caller about how to do it. */
        struct raft_message *batch;
        unsigned n;
    } messages;
};
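
In caller code, driving an update would then look roughly like this (a sketch; persist_entries(), write_snapshot_chunk(), fill_entries_from_cache() and send_message() are hypothetical caller-side helpers):

struct raft_update update;
int rv = raft_step(r, &event, &update);
if (rv != 0) {
    return rv;
}
if (update.flags & RAFT_UPDATE_ENTRIES) {
    /* The caller owns update.entries.batch: persist and/or cache it,
     * then raft_free() it when no longer needed. */
    persist_entries(update.entries.batch, update.entries.n);
}
if (update.flags & RAFT_UPDATE_SNAPSHOT) {
    /* The caller owns the chunk and the metadata, including its
     * configuration, which it must eventually close. */
    write_snapshot_chunk(&update.snapshot.chunk, update.snapshot.offset);
    raft_configuration_close(&update.snapshot.metadata.conf);
}
if (update.flags & RAFT_UPDATE_MESSAGES) {
    for (unsigned i = 0; i < update.messages.n; i++) {
        struct raft_message *m = &update.messages.batch[i];
        if (m->type == RAFT_APPEND_ENTRIES) {
            /* Raft leaves append_entries.entries for the caller to
             * fill, typically from its in-memory entries cache. */
            fill_entries_from_cache(m);
        }
        send_message(m);
    }
}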

* - message->install_snapshot.conf: OWNERSHIP TRANSFERRED immediately upon
* calling raft_step(). The configuration is owned by raft after the call.
* The caller must NOT close this configuration, even if raft_step() returns
* an error.

Member:

This looks correct too. However, it feels like it'd be better to make a copy of message.install_snapshot.conf instead, as I'm proposing to do in the RAFT_START event case.

Member:

This was changed as well in #202. Ownership of the snapshot configuration is always retained by the caller.

Note there is also a side change in consequence of this. If the RAFT_UPDATE_SNAPSHOT flag is set after raft_step() returns, then the ownership of the update->snapshot.metadata.conf structure is transferred to the caller, who is now in charge of raft_configuration_close()-ing it when appropriate.

* - event->configuration.index: Scalar value (no ownership concerns)
* - event->configuration.conf: OWNERSHIP TRANSFERRED to raft. The configuration
* will be consumed if needed or closed if not needed. The caller must NOT close
* this configuration after calling raft_step(), regardless of success or error.

Member:

It would probably be best to make a copy, as in the other similar cases. Opinions are welcome.

*
* When raft_step() returns an error:
* - For RAFT_START events: The caller must free the metadata structure itself
* (but NOT the configuration field, which has been transferred). The entries

Member:

As I mentioned, this seems unnecessarily complicated, and I would change the current implementation to make a copy of the configuration field.

* - For RAFT_START events: The caller must free the metadata structure itself
* (but NOT the configuration field, which has been transferred). The entries
* are automatically freed by raft_step() on error, so the caller must NOT
* free them.

Member:

I'd change this too and leave all ownership to the caller.

@freeekanayaka added the "downstream (Trigger downstream tests)" label Nov 28, 2025
* If successfully consumed by RAFT, raft_free will be called on
* entries[0].batch when RAFT no longer needs these entries to be kept in
* memory. Hence the caller must NOT free the entries after a successful call
* to raft_step().

Member:

Very nice description! Thanks.

@mdorier (Contributor) commented Dec 23, 2025:

Is there anything left to update in my documentation or can this PR be merged?

@freeekanayaka (Member):

> Is there anything left to update in my documentation or can this PR be merged?

The documentation for raft_update is missing, but it can be addressed separately, and I actually started a branch for it. I'll go ahead and merge this branch, because it looks overall good, and possibly fix a few details in follow-up PRs.

@freeekanayaka merged commit 148951f into cowsql:main on Dec 27, 2025
42 checks passed