draft: RFC for Horizontal scalability for ahnlich (#276)
Horizontal scalability for ahnlich
Test Results: 63 tests, 63 ✅, 2m 39s ⏱️. Results for commit 1f15cc5.
> They are (NOTE: these names are tentative):
> - `LogStore` - This is where the logs from the Raft cluster activities will be stored. Here is an in-memory impl from the openraft guys that they used in their example: <https://github.com/databendlabs/openraft/blob/4f0fd5fa034413d2f367306da4a0016f7603fb7e/examples/mem-log/src/log_store.rs>, and I believe we can easily co-opt this into a write-to-disk-log file service, or any other log storage we settle on.
What are the considerations for storing raft cluster activity on disk vs in-memory?
It's pretty much durability vs speed. On-disk allows for easier recovery from logs and snapshots, but is slower to write and read. In-memory can still recover, but only if there's at least 1 healthy node to recover from; if all nodes go down, all data is lost. It is, however, faster to write and read. I think the question for us to answer is which of these guarantees will be more important to users horizontally scaling their ahnlich.
Currently (without clustering), we run primarily in-memory but we have a background persistence thread that takes snapshots at intervals
Wondering if that behaviour would be easy to also propagate here
Also I'm curious: what API controls potentially loading a snapshot from disk when a replica starts up?
> Currently (without clustering), we run primarily in-memory but we have a background persistence thread that takes snapshots at intervals
> Wondering if that behaviour would be easy to also propagate here
Makes sense. Snapshots are stored on disk, yes?
> Also I'm curious, what API controls potentially loading a snapshot from disk when a replica starts up?
There's an `install_snapshot()` in the `StateMachineStore`:
```rust
#[tracing::instrument(level = "trace", skip(self, snapshot))]
async fn install_snapshot(&mut self, meta: &SnapshotMeta, snapshot: SnapshotData) -> Result<(), io::Error> {
    tracing::info!("install snapshot");

    let new_snapshot = StoredSnapshot {
        meta: meta.clone(),
        data: snapshot,
    };

    // Update the state machine.
    {
        let d: pb::StateMachineData = prost::Message::decode(new_snapshot.data.as_ref())
            .map_err(|e| io::Error::new(io::ErrorKind::InvalidData, e))?;
        let mut state_machine = self.state_machine.lock().await;
        *state_machine = d;
    }

    // Update current snapshot.
    let mut current_snapshot = self.current_snapshot.lock().unwrap();
    *current_snapshot = Some(new_snapshot);
    Ok(())
}
```
Ohhhh nice nice .... we can somewhat reuse our existing snapshot in this case then
> - `StateMachineStore` - This is where the last known state (snapshot) is stored and read from. They have a neat impl here as well: <https://github.com/databendlabs/openraft/blob/4f0fd5fa034413d2f367306da4a0016f7603fb7e/examples/raft-kv-memstore-grpc/src/store/mod.rs>, and I think the bit we need to figure out are where we want to 'store' the state machine.
What is a snapshot in this sense with respect to the state? All the application data in the store in some serializable format?
Affirmative. In their example impl, they just saved it as a `Vec<u8>`:
```rust
openraft::declare_raft_types!(
    /// Declare the type configuration, for example, K/V store.
    pub TypeConfig:
        D = pb::SetRequest,
        R = pb::Response,
        LeaderId = pb::LeaderId,
        Vote = pb::Vote,
        Entry = pb::Entry,
        Node = pb::Node,
        SnapshotData = Vec<u8>, // <- Note here
);
```

Here's the `SnapshotMeta`:
```rust
pub struct SnapshotMeta<NID, N>
where
    NID: NodeId,
    N: Node,
{
    pub last_log_id: Option<LogId<NID>>,
    pub last_membership: StoredMembership<NID, N>,
    pub snapshot_id: SnapshotId,
}
```

Here's the `StoredSnapshot`:
```rust
#[derive(Debug)]
pub struct StoredSnapshot {
    pub meta: SnapshotMeta, // <- Note here
    /// The data of the state machine at the time of this snapshot.
    pub data: SnapshotData, // <- Note here
}
```

Here's the `StateMachineData` proto:
```proto
// All the data in a state machine, including user-defined data and membership data.
message StateMachineData {
  // The last log ID that has been applied to the state machine
  LogId last_applied = 1;
  // User data in a map
  map<string, string> data = 2;
  // The ID of the last membership config log entry that is applied.
  LogId last_membership_log_id = 3;
  // The last membership config that is applied.
  Membership last_membership = 4;
}
```

And here's the usage in the `StateMachineStore`:
```rust
/// Defines a state machine for the Raft cluster. This state machine represents a copy of the
/// data for this node. Additionally, it is responsible for storing the last snapshot of the data.
#[derive(Debug, Default)]
pub struct StateMachineStore {
    /// The Raft state machine.
    pub state_machine: tokio::sync::Mutex<pb::StateMachineData>, // <- Note here
    snapshot_idx: Mutex<u64>,
    /// The last received snapshot.
    current_snapshot: Mutex<Option<StoredSnapshot>>, // <- Note here
}
```
> - `Network` - This is the communication layer for the nodes. They have a tonic gRPC impl here: <https://github.com/databendlabs/openraft/blob/4f0fd5fa034413d2f367306da4a0016f7603fb7e/examples/raft-kv-memstore-grpc/src/network/mod.rs>, we can adopt from.
Nice... how will this play with our already existing public facing gRPC endpoints
From what I can see, it should be fine. We only need to make sure that the channels (endpoints) used for Raft don't collide with any used by the AI and DB clients (or any other services running on the machine), which can be solved with `os_select_port()`.
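As a rough sketch of that idea (assuming `os_select_port()` boils down to letting the OS assign a port; the helper name below is hypothetical), binding to port 0 yields a currently-free port for the Raft channel:

```rust
use std::net::TcpListener;

// Hypothetical helper mirroring the `os_select_port()` idea: bind to port 0
// and let the OS pick a free port, so the Raft endpoint cannot collide with
// the ports already claimed by the AI/DB client services.
fn pick_free_port() -> std::io::Result<u16> {
    let listener = TcpListener::bind("127.0.0.1:0")?;
    let port = listener.local_addr()?.port();
    // Dropping the listener frees the port for the Raft server to bind.
    drop(listener);
    Ok(port)
}

fn main() -> std::io::Result<()> {
    let raft_port = pick_free_port()?;
    println!("raft port: {raft_port}");
    Ok(())
}
```

Note there is a small race window between dropping the probe listener and the Raft server rebinding the port, so handing the bound listener itself to the server would be the safer variant.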
> - `AppService` - This is the client/application gRPC service. Here, the agent/app running the Raft cluster can issue commands to change the state, behaviour and roles of the nodes in the cluster. Impl here: <https://github.com/databendlabs/openraft/blob/4f0fd5fa034413d2f367306da4a0016f7603fb7e/examples/raft-kv-memstore-grpc/src/grpc/app_service.rs>.
> - `Server` - This is the server to add the above services to, and essentially listen for the requests. It's a `tonic` Server, so we're just importing from tonic, adding our services and giving it a port to listen to requests at, like they do in their example here: <https://github.com/databendlabs/openraft/blob/4f0fd5fa034413d2f367306da4a0016f7603fb7e/examples/raft-kv-memstore-grpc/src/app.rs#L47>
Oh neat... this shows that `impl AppService` and `impl RaftService` are the two large parts of the puzzle, where the former is the public-facing API. If we are in clustering mode, we should then ensure that the former uses `self.raft` for write operations (and the direct path otherwise).
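A minimal sketch of that routing (all names here are hypothetical stand-ins, not ahnlich's actual types): the public-facing write handler checks the mode and only submits through the Raft handle when clustering is on.

```rust
// Sketch only: `Mode`, `App`, and the string results are placeholders.
// In a real impl the Cluster arm would go through `self.raft` (e.g. a
// client-write call) so the entry is replicated before being applied.
#[derive(Clone, Copy)]
enum Mode {
    Standalone,
    Cluster,
}

struct App {
    mode: Mode,
}

impl App {
    fn set(&self, key: &str, value: &str) -> String {
        match self.mode {
            // Standalone: apply directly to the local store, as today.
            Mode::Standalone => format!("local write {key}={value}"),
            // Cluster: submit through Raft so the write is replicated
            // and applied via the state machine before acknowledging.
            Mode::Cluster => format!("raft write {key}={value}"),
        }
    }
}

fn main() {
    let app = App { mode: Mode::Cluster };
    println!("{}", app.set("k", "v")); // raft write k=v
}
```

Reads could still be served locally in either mode (with the usual caveats about stale reads on followers).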
> For the App and Raft Service, they have a bunch of types defined in protobuf, as seen here: <https://github.com/databendlabs/openraft/tree/4f0fd5fa034413d2f367306da4a0016f7603fb7e/examples/raft-kv-memstore-grpc/proto>, and they seem to be importing them directly (without a pre-generation step) into their Rust code using tonic, as shown here: <https://github.com/databendlabs/openraft/blob/4f0fd5fa034413d2f367306da4a0016f7603fb7e/examples/raft-kv-memstore-grpc/src/lib.rs#L8>
Yeah, `tonic::include_proto` can do that and is quite helpful... we went the build-file route for ours, as we had a bit of postprocessing to do and that felt more natural in that regard.
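For reference, the build-file route looks roughly like this (the proto path is hypothetical, not ahnlich's actual layout); postprocessing of the generated code can then be hooked in around the build step:

```rust
// build.rs (sketch): generate the tonic/prost code at compile time.
// The path "proto/raft.proto" is a placeholder.
fn main() -> Result<(), Box<dyn std::error::Error>> {
    tonic_build::compile_protos("proto/raft.proto")?;
    Ok(())
}
```

The generated module is then pulled in with `tonic::include_proto!`, keyed by the proto `package` name, which is the route the openraft example uses in its `lib.rs`.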
> `AhnlichRaftService` - this is where we are going to plug in Ahnlich, by adding some service that will instantiate the Raft service based on some config/commands passed by the CLI. I'm not sure exactly what goes here yet, but I think some of the things we would want to do here are:
> - Allow for Raft nodes to be created based on some config/command issued via the CLI
I think first step is deciding if we are in cluster-mode or not ... where the default is NOT
Yep 💯
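Sketched as a CLI decision (the flag name is hypothetical), with non-cluster as the default:

```rust
// Sketch: clustering is opt-in via a flag; the default stays standalone.
// The flag `--enable-cluster` is a placeholder, not ahnlich's actual CLI.
#[derive(Debug, PartialEq)]
enum RunMode {
    Standalone,
    Cluster,
}

fn parse_mode(args: &[&str]) -> RunMode {
    if args.contains(&"--enable-cluster") {
        RunMode::Cluster
    } else {
        RunMode::Standalone // default: NOT clustered
    }
}

fn main() {
    assert_eq!(parse_mode(&[]), RunMode::Standalone);
    assert_eq!(parse_mode(&["--enable-cluster"]), RunMode::Cluster);
    println!("defaults verified");
}
```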
> - Allow for the cluster to be restarted
> The next step I think is answering the following questions.
Also interested in the question of could we reuse a ton of their log and snapshot infrastructure to play nicely with our need for persistence? (even outside clustering mode as we know that in clustering mode that's a given)
I think the in-memory examples are largely helpful, but not all the way, as clients would be able to enable persistence. I see some implementation insight in the rocksdb example, which shows an implementation of `log_store`.
But perhaps, taking a step back, it would largely be a lot to grok at once... and we can figure out a reasonable way to dissect the change such that we first implement clustering and then make the persistence bits nicer, or vice versa.
Yep, I can see how it should be possible. Their examples have placeholders for where they expect us to do any persistence stuff we want to add, for example, these lines from here: https://github.com/databendlabs/openraft/blob/956d6f6a6c344d63a92b5bebf81ee9051a1aace2/examples/raft-kv-memstore-grpc/src/store/mod.rs#L91, see below:
```rust
// Emulation of storing snapshot locally
{
    let mut current_snapshot = self.current_snapshot.lock().unwrap();
    *current_snapshot = Some(stored);
}
```

so basically we should be able to just plug in some persistence function in there.
For the logs, they are saved in memory as a `BTreeMap`:
```rust
pub struct LogStoreInner<C: RaftTypeConfig> {
    /// The last purged log id.
    last_purged_log_id: Option<LogIdOf<C>>,
    /// The Raft log.
    log: BTreeMap<u64, C::Entry>, // <- Note here
    /// The commit log id.
    committed: Option<LogIdOf<C>>,
    /// The current granted vote.
    vote: Option<VoteOf<C>>,
}
```

And appended with this function:
```rust
async fn append<I>(&mut self, entries: I, callback: IOFlushed<C>) -> Result<(), io::Error>
where
    I: IntoIterator<Item = C::Entry>,
{
    // Simple implementation that calls the flush-before-return `append_to_log`.
    for entry in entries {
        self.log.insert(entry.index(), entry); // <- Note here
    }
    callback.io_completed(Ok(())).await;
    Ok(())
}
```

We should be able to add a persistence function call inside `append()`.
Okay so in the case where a replica goes down and comes back up... does it receive the logs from other nodes in the cluster or does it load from its possibly outdated logs?
I'm guessing it's the former... if so, then we may not need to persist Raft logs on disk, and could stick to persisting only the snapshots as we currently do
> I think the in-memory examples are largely helpful but not all the way as clients would be able to enable persistence. I see some implementation insight in the rocksdb example which shows implementation of log_store
> But perhaps taking a step back it would largely be a lot to grok at once... and we can figure out a reasonable way to dissect the change such that we can first of all implement clustering and then try make the persistence bits nicer or vice versa
I agree with this as well, and I'll say we should do the clustering first, then figure out what we want to persist, before adding persistence
> Okay so in the case where a replica goes down and comes back up... does it receive the logs from other nodes in the cluster or does it load from its possibly outdated logs
> I'm guessing it's the former... if so then we may not need to persist Raft logs on disk and we could stick to instead persisting only the snapshots as we currently do
Yeah, it would only make sense to be the former (I would have to look inside openraft-rs itself to be sure how it's updating the log `BTreeMap`), but can you explain a bit how/why this would reduce the need for persisting the logs?
Cuz if it receives the logs from other nodes in the cluster, then on startup it shouldn't absolutely need to recover from disk, as it would get the logs and snapshots, right?
Assuming there's at least one other node in a good state
> Cuz if it receives the logs from other nodes in the cluster, then on startup it shouldn't absolutely need to recover from disk as it would get the logs and snapshots, right?
> Assuming there's at least one other node in a good state
Ahh, I understand now
> 1. Where/how are we storing the logs?
> 2. Where/how are we storing the state machine snapshots?
Where/how we store the logs/state machine would largely depend on how they affect the hot request path, and also on whether or not they are safe to keep in-memory (i.e. acceptable to potentially lose).
💯