Conversation

@mdorier (Contributor) commented Nov 26, 2025:

This PR adds documentation for the memory ownership expected of the raft_event structure passed to raft_step, addressing #194.
Hopefully I got it right (please double-check).

@freeekanayaka (Member) left a comment:

Thanks a lot for this work; it actually prompts me to clean up the current behavior a bit :)

See my comments; glad to have your input on them.

* raft_step(), regardless of success or error. The entries must be allocated
* such that all entries in a batch share the same batch pointer, and each
* distinct batch pointer appears only once in the entries array for proper
* cleanup.

Member:

This is correct. However, I think the behavior of the error-handling part is unintended and just a result of the v0-to-v1 API refactoring.

I would like to change this to be consistent with what happens with event->start.metadata, in the sense that the ownership of the entries pointer passed to raft_step() stays in the hands of the caller.

Basically the idea is that all data passed to RAFT_START via a pointer (i.e. start.metadata and start.entries) is passed read-only and ownership is NOT transferred. This should be true for the top-level pointer itself and for nested pointers/fields too.
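
For illustration, caller-side code under this model would look roughly like this (a sketch only; the raft_step() signature with an output raft_update argument, and the event.type and start.n_entries field names, are assumptions):

/* Sketch of the proposed RAFT_START ownership model: everything passed
 * via pointers stays owned by the caller. */
struct raft_event event;
struct raft_update update;

event.type = RAFT_START;            /* assumed field name */
event.start.metadata = &metadata;   /* read-only, not transferred */
event.start.entries = entries;      /* read-only, not transferred */
event.start.n_entries = n;          /* assumed field name */

int rv = raft_step(r, &event, &update);

/* On success and error alike, metadata and the entries array remain the
 * caller's to clean up. */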

I'd prepare a new PR for this and you can update this branch after it gets merged, wdyt?

Contributor Author:

Sounds good. I like the idea of everything being copied except for the actual data of entries; that way I can use an std::vector<raft_entry> as input, which will automatically be freed, as opposed to having to call raft_malloc. Happy to change the documentation when you merge.

Member:

This has been changed as well in #200. I'll address the rest of the cleanups for the other events separately.

* field is transferred to raft (moved into r->configuration). The metadata
* structure itself is NOT freed by raft and must be freed by the caller after
* raft_step() returns, regardless of success or error. The caller must NOT
* close the configuration field after the call, as it has been transferred.

Member:

This is correct. However, I think it would probably be better to make a copy of the nested configuration field, so the ownership of the metadata structure stays entirely in the hands of the caller (see the other comment below).

Contributor Author:

That would make more sense, yes.

Member:

This has now been changed in #200, which I already merged.

Contributor Author:

I have updated the documentation for RAFT_START; can you double-check? (I also detailed a bit more how the batch pointer works -- did I get it right? Do I understand correctly that ALL the entries must come from the same batch, i.e. I cannot have two different batches in the same call to raft_step?)

@freeekanayaka (Member) commented Dec 3, 2025:

> I have updated the documentation for RAFT_START; can you double-check? (I also detailed a bit more how the batch pointer works -- did I get it right? Do I understand correctly that ALL the entries must come from the same batch, i.e. I cannot have two different batches in the same call to raft_step?)

Yes, you are getting it right. The way I'm envisioning this to be used is that user code will load the entries from disk at startup, allocating a chunk of contiguous memory for them (that's the batch pointer). Then it will pass these entries to the RAFT_START event.

EDIT: actually it does not matter that all entries are part of the same batch; user code can load entries in multiple batches if deemed appropriate.
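
For example, bulk loading one batch per file could look roughly like this (a sketch: the on-disk record format and the file_size()/read_whole_file() helpers are made up, and I'm assuming the usual struct raft_entry layout with term/type/buf/batch fields):

#include <stdint.h>
#include <string.h>

/* Sketch: load all entries stored in one file into a single contiguous
 * allocation, which becomes the batch pointer shared by those entries. */
static int load_batch(const char *path, struct raft_entry *entries, unsigned n)
{
    size_t size = file_size(path);        /* hypothetical helper */
    uint8_t *batch = raft_malloc(size);
    if (batch == NULL) {
        return RAFT_NOMEM;
    }
    read_whole_file(path, batch, size);   /* hypothetical helper */

    uint8_t *cursor = batch;
    for (unsigned i = 0; i < n; i++) {
        uint64_t term;
        uint64_t len;
        memcpy(&term, cursor, sizeof term);
        cursor += sizeof term;
        memcpy(&len, cursor, sizeof len);
        cursor += sizeof len;
        entries[i].term = term;
        entries[i].type = RAFT_COMMAND;   /* assuming command entries */
        entries[i].buf.base = cursor;     /* points into the batch */
        entries[i].buf.len = (size_t)len;
        entries[i].batch = batch;         /* shared by all n entries */
        cursor += len;
    }
    return 0;
}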

Member:

See also this docstring:

* An entry that was received from the network as part of an AppendEntries RPC

Contributor Author:

Ah, so my understanding was wrong: we can have multiple batch pointers, and they can repeat; we simply cannot have something like ?

Member:

One more thing: for the RAFT_START event it's ok to have more than one batch of entries (like in the example I wrote earlier, 7 entries in 2 batches), because client code could use several files to store entries, and when bulk loading them at startup a convenient strategy would be to have one batch for each file being loaded.

However, for the RAFT_RECEIVE event it's currently expected that all entries belong to the same batch. That's because client code will need to implement receiving an AppendEntries RPC message over the network, and it will typically have some wire-protocol-level way to know the size of the message being received (for example, this information could be contained in a fixed-size message header). Once the message size is known, client code will typically perform a single memory allocation to receive the whole message data with all the entries it contains.
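
In code, the receive path would look roughly like this (a sketch; the wire format and the recv_exactly()/decode_entries() helpers are made up):

#include <stdint.h>

/* Sketch: a fixed-size header carries the payload size, so a single
 * allocation can receive all the entries of the message. */
static int receive_append_entries(int sock, struct raft_entry **entries,
                                  unsigned *n)
{
    uint64_t payload_size;
    recv_exactly(sock, &payload_size, sizeof payload_size);

    void *batch = raft_malloc((size_t)payload_size);
    if (batch == NULL) {
        return RAFT_NOMEM;
    }
    recv_exactly(sock, batch, (size_t)payload_size);

    /* decode_entries() makes each entry's buf.base point into the batch
     * and sets entry->batch = batch, so the whole message can later be
     * released with a single raft_free(batch). */
    return decode_entries(batch, (size_t)payload_size, entries, n);
}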

Hope this explanation is clear.

Member:

Also, note that the batch pointer is not required to point exactly at the beginning of the data of the first entry in the batch. The memory allocation that was performed for the batch pointer could for example be slightly larger (e.g. when receiving an AppendEntries RPC message the batch pointer could have been allocated to also contain a bit of message metadata). The only important thing is that calling raft_free() on that batch pointer will free all memory used by all entries belonging to that batch.
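
For example (purely illustrative sizes):

/* Sketch: the batch allocation also holds a bit of message metadata in
 * front of the entry data; the batch pointer still refers to the start
 * of the allocation. */
size_t header_size = 32;    /* example metadata prefix */
size_t data_size = 1024;    /* example entry data size */
uint8_t *batch = raft_malloc(header_size + data_size);

struct raft_entry entry;
entry.buf.base = batch + header_size;  /* data starts after the metadata */
entry.buf.len = data_size;
entry.batch = batch;                   /* allocation start, not buf.base */

/* Later, raft_free(entry.batch) releases the metadata prefix and all the
 * entry data in one call. */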

Contributor Author:

Yep, that I understood. Thanks for clarifying about the unique/non-unique batch pointers; I'll double-check what I wrote about that.

* - message->append_entries.entries: OWNERSHIP TRANSFERRED immediately upon
* calling raft_step(). The entries array and the batch memory referenced by
* entries[].batch are owned by raft after the call. The caller must NOT free
* this memory, even if raft_step() returns an error.

Member:

This is correct, even though I'm not totally sure that all error paths currently do the correct cleanup. However, I'm wondering if it would be better to return the ownership to the caller in case of error, so that the caller is free to decide the best course of action and the library does not just destroy that data. Thoughts?

Contributor Author:

I think I suggested in my issue that raft_step could zero out the entries it "consumed", so that the caller can run a cleanup regardless of errors. If entries are now NULL, free will be a no-op.
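
Something like this on the caller side (a sketch of the suggested behavior, not of what the library currently does):

/* If raft_step() zeroed out the entries it consumed, one cleanup loop
 * would be safe on both success and error paths, since raft_free() on a
 * NULL pointer is a no-op (with the default allocator, like free(NULL)).
 * Assuming one batch per entry here, for simplicity. */
for (unsigned i = 0; i < n; i++) {
    raft_free(entries[i].batch);
}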

Member:

I've now changed this in #202 to be consistent with the RAFT_START event. The ownership of the message->append_entries.entries array is always retained by the caller. The ownership of the entries data is transferred to the raft library in case of success, or is returned to the caller in case of error.
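
On the caller side this means roughly (a sketch, assuming a single batch as expected for RAFT_RECEIVE, and an entries array allocated with raft_malloc()):

int rv = raft_step(r, &event, &update);
if (rv != 0 && n > 0) {
    /* Error: ownership of the entries data came back, release the batch. */
    raft_free(entries[0].batch);
}
raft_free(entries); /* the array itself is always the caller's to free */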

* - message->install_snapshot.data.base: OWNERSHIP TRANSFERRED immediately upon
* calling raft_step(). The snapshot data buffer is owned by raft after the
* call. The caller must NOT free this memory, even if raft_step() returns an
* error.

Member:

This looks correct. However, I'd probably change the error handling part, and return full ownership to the caller in case of error, without destroying any data. This would be similar to the comment I made for RAFT_APPEND_ENTRIES.

Member:

I've now changed this in #202 to be consistent with RAFT_START and RAFT_APPEND_ENTRIES. That is, ownership of the snapshot data is transferred to the raft library in case of success, and returned to the caller in case of error.

Note there is also a side change in consequence of this. If the RAFT_UPDATE_ENTRIES flag is set after raft_step() returns, then the ownership of the update->entries.batch array is transferred to the caller, who is now in charge of raft_free()-ing it when appropriate.

Contributor Author:

Regarding the note, I think I need to document the ownership model of the returned raft_update. My initial assumption was that the caller is responsible for everything in it (i.e. RAFT will never give it a handle to something internal), which is consistent with what you're saying about update->entries.batch. Is my assumption correct more generally?

@freeekanayaka (Member) commented Dec 3, 2025:

> Regarding the note, I think I need to document the ownership model of the returned raft_update. My initial assumption was that the caller is responsible for everything in it (i.e. RAFT will never give it a handle to something internal), which is consistent with what you're saying about update->entries.batch. Is my assumption correct more generally?

Yes, that's correct. Roughly it's something like this:

struct raft_update
{
    unsigned flags;
    struct
    {
        raft_index index; /* 0 if no change */
        /* Ownership of the entry array and entries data belongs to the caller. The caller
         * is required to persist the entries to disk and it will also typically want to maintain
         * a cache of them in memory, so it's easy to retrieve them when it comes to send
         * a RAFT_APPEND_ENTRIES message (when the RAFT_UPDATE_MESSAGES flag is set).
         * 
         * NOTE: the raft library never maintains any copy of any entries data itself. So
         * when calling raft_step() with a RAFT_RECEIVE event containing RAFT_APPEND_ENTRIES
         * message, the ownership of the entries data is transferred to raft during the raft_step() call,
         * but raft itself won't copy the data, so in case it sets the RAFT_UPDATE_ENTRIES flag the
         * entries data contained in raft_update->entries.batch will be basically the same memory
         * that was initially passed to the raft_step() call via the RAFT_RECEIVE event. However
         * this detail could be treated as a memory-copy optimization, mostly transparent to
         * the caller. */
        struct raft_entry *batch;
        unsigned n;
    } entries;
    struct
    {
        struct raft_snapshot_metadata metadata; /* Ownership belongs to the caller */
        size_t offset;
        /* Ownership belongs to the caller. Similarly to what happens with raft_update->entries.batch,
         * this field basically echoes back the snapshot data memory that was passed to raft_step()
         * via a RAFT_RECEIVE event containing a RAFT_INSTALL_SNAPSHOT message. */
        struct raft_buffer chunk;
        bool last;
    } snapshot;
    struct
    {
        /* The only message types with nested data are RAFT_APPEND_ENTRIES and
         * RAFT_INSTALL_SNAPSHOT. Since raft does not internally maintain any copy
         * of entries or snapshot data, when handling the raft_update->messages.batch
         * fields the caller must do the following:
         *
         * - if it finds a RAFT_APPEND_ENTRIES message that needs to be sent, it must
         *   fill the raft_message->append_entries.entries field itself, typically retrieving
         *   the relevant entries from the entries cache it maintains. The exact entries
         *   to retrieve are specified by the raft_message->append_entries.prev_log_index
         *   and raft_message->append_entries.n_entries fields.
         *
         * - if it finds a RAFT_INSTALL_SNAPSHOT message that needs to be sent, it must
         *   fill the raft_message->install_snapshot.data field with the relevant snapshot
         *   data, typically retrieving it from disk using the identifying details contained in
         *   raft_message->install_snapshot.last_index/last_term fields.
         *
         * In other words, for these two message types raft does not fill all fields, but
         * rather leaves the responsibility of some of them to the caller.
         *
         * This is done so that raft stays agnostic about how and where actual entries/snapshot
         * data is kept, and leaves freedom to the caller about how to do it. */
        struct raft_message *batch;
        unsigned n;
    } messages;
};
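
In caller code, driving an update would then look roughly like this (a sketch; persist_entries(), write_snapshot_chunk(), fill_entries_from_cache() and send_message() are hypothetical caller-side helpers):

struct raft_update update;
int rv = raft_step(r, &event, &update);
if (rv != 0) {
    return rv;
}
if (update.flags & RAFT_UPDATE_ENTRIES) {
    /* The caller owns update.entries.batch: persist and/or cache it,
     * then raft_free() it when no longer needed. */
    persist_entries(update.entries.batch, update.entries.n);
}
if (update.flags & RAFT_UPDATE_SNAPSHOT) {
    /* The caller owns the chunk and the metadata, including its
     * configuration, which it must eventually close. */
    write_snapshot_chunk(&update.snapshot.chunk, update.snapshot.offset);
    raft_configuration_close(&update.snapshot.metadata.conf);
}
if (update.flags & RAFT_UPDATE_MESSAGES) {
    for (unsigned i = 0; i < update.messages.n; i++) {
        struct raft_message *m = &update.messages.batch[i];
        if (m->type == RAFT_APPEND_ENTRIES) {
            /* Raft leaves append_entries.entries for the caller to
             * fill, typically from its in-memory entries cache. */
            fill_entries_from_cache(m);
        }
        send_message(m);
    }
}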

* - message->install_snapshot.conf: OWNERSHIP TRANSFERRED immediately upon
* calling raft_step(). The configuration is owned by raft after the call.
* The caller must NOT close this configuration, even if raft_step() returns
* an error.

Member:

This looks correct too. However, it feels like it'd be better to make a copy of message.install_snapshot.conf instead, as I'm proposing to do in the RAFT_START event case.

Member:

This was changed as well in #202. Ownership of the snapshot configuration is always retained by the caller.

Note there is also a side change in consequence of this. If the RAFT_UPDATE_SNAPSHOT flag is set after raft_step() returns, then the ownership of the update->snapshot.metadata.conf structure is transferred to the caller, who is now in charge of raft_configuration_close()-ing it when appropriate.

* - event->configuration.index: Scalar value (no ownership concerns)
* - event->configuration.conf: OWNERSHIP TRANSFERRED to raft. The configuration
* will be consumed if needed or closed if not needed. The caller must NOT close
* this configuration after calling raft_step(), regardless of success or error.

Member:

It would probably be best to make a copy, as in the other similar cases. Opinions are welcome.

*
* When raft_step() returns an error:
* - For RAFT_START events: The caller must free the metadata structure itself
* (but NOT the configuration field, which has been transferred). The entries

Member:

As I mentioned, this seems unnecessarily complicated, and I would change the current implementation to make a copy of the configuration field.

* - For RAFT_START events: The caller must free the metadata structure itself
* (but NOT the configuration field, which has been transferred). The entries
* are automatically freed by raft_step() on error, so the caller must NOT
* free them.

Member:

I'd change this too and leave all ownership to the caller.

@freeekanayaka added the "downstream (Trigger downstream tests)" label Nov 28, 2025
* If successfully consumed by RAFT, raft_free will be called on
* entries[0].batch when RAFT no longer needs these entries to be kept in
* memory. Hence the caller must NOT free the entries after a successful call
* to raft_step().

Member:

Very nice description! Thanks.

@mdorier (Contributor) commented Dec 23, 2025:

Is there anything left to update in my documentation or can this PR be merged?

@freeekanayaka (Member):

> Is there anything left to update in my documentation or can this PR be merged?

The documentation for raft_update is missing, but it can be addressed separately, and I actually started a branch for it. I'll go ahead and merge this branch, because it looks overall good, and possibly fix a few details in follow-up PRs.

@freeekanayaka merged commit 148951f into cowsql:main on Dec 27, 2025
42 checks passed