
Conversation

@achamayou (Member) commented Jan 6, 2026

This adds an API that allows a ledger backup process without access to the ledger storage directories to efficiently fetch committed ledger chunks for archival/retention purposes.

HEAD/GET /node/ledger-chunk
HEAD/GET /node/ledger-chunk/{chunk_name}

Typical Scenario

```mermaid
sequenceDiagram
  Note over Client: Client asks for chunk starting at index
  Client->>+Backup: GET /node/ledger-chunk?since=index
  Backup->>-Client: 308 Location: /node/ledger-chunk/ledger_startIndex_endIndex.committed
  Note over Backup: Backup node has that chunk
  Client->>+Backup: GET /node/ledger-chunk/ledger_startIndex_endIndex.committed
  Backup->>-Client: 200 <Chunk Contents>
  Client->>+Backup: GET /node/ledger-chunk?since=endIndex+1
  Note over Backup: Backup node does not yet have a committed chunk starting at endIndex+1
  Backup->>-Client: 308 Location: https://primary/node/ledger-chunk?since=endIndex+1
  Client->>+Primary: GET /node/ledger-chunk?since=endIndex+1
  Primary->>-Client: 308 Location: /node/ledger-chunk/ledger_endIndex+1_nextEndIndex.committed
  Client->>+Primary: GET /node/ledger-chunk/ledger_endIndex+1_nextEndIndex.committed
  Note over Primary: But the Primary node has the most recent chunk already
  Primary->>-Client: 200 <Chunk Contents>
```
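To make the flow above concrete, here is a minimal client sketch that follows the redirects with libcurl. Only the endpoint paths are taken from this PR; the node URL, helper names, and error handling are illustrative.

```cpp
// Minimal sketch of a backup client for the sequence above, using libcurl.
// Call curl_global_init(CURL_GLOBAL_DEFAULT) once at process startup.
#include <curl/curl.h>
#include <cstdio>
#include <string>

static size_t write_to_file(char* data, size_t size, size_t nmemb, void* userdata)
{
  // Stream the chunk body straight to the output file.
  return fwrite(data, size, nmemb, static_cast<FILE*>(userdata)) * size;
}

// Fetch the committed chunk starting at `since` from `node_url`, following
// the 308 redirects across nodes. Returns true on success; a 404 means the
// chunk is not committed anywhere yet and the caller should wait and retry.
bool fetch_chunk(const std::string& node_url, size_t since, const std::string& out_path)
{
  CURL* curl = curl_easy_init();
  if (curl == nullptr)
    return false;

  FILE* out = fopen(out_path.c_str(), "wb");
  if (out == nullptr)
  {
    curl_easy_cleanup(curl);
    return false;
  }

  const auto url = node_url + "/node/ledger-chunk?since=" + std::to_string(since);
  curl_easy_setopt(curl, CURLOPT_URL, url.c_str());
  curl_easy_setopt(curl, CURLOPT_FOLLOWLOCATION, 1L); // follow the 308s
  curl_easy_setopt(curl, CURLOPT_MAXREDIRS, 10L);
  curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, write_to_file);
  curl_easy_setopt(curl, CURLOPT_WRITEDATA, out);

  const CURLcode res = curl_easy_perform(curl);
  long status = 0;
  curl_easy_getinfo(curl, CURLINFO_RESPONSE_CODE, &status);

  fclose(out);
  curl_easy_cleanup(curl);
  return res == CURLE_OK && status == 200;
}
```

In practice the client would read endIndex out of the final chunk name (e.g. via CURLINFO_EFFECTIVE_URL) and loop with since=endIndex+1, as in the diagram.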

Alternative Scenario

The initial node the client hits (the Primary in this case) has started from a snapshot and does not have some past chunks. To make this more readable, let's say that the Primary started from snapshot_100.committed and locally has:

ledger_1-50.committed
ledger_101-150.committed

Backup has:

ledger_1-50.committed
ledger_51-100.committed

```mermaid
sequenceDiagram
  Client->>+Primary: GET /node/ledger-chunk?since=51
  Primary->>-Client: 308 Location: https://backup/node/ledger-chunk?since=51
  Client->>+Backup: GET /node/ledger-chunk?since=51
  Backup->>-Client: 308 Location: /node/ledger-chunk/ledger_51-100.committed
  Client->>+Backup: GET /node/ledger-chunk/ledger_51-100.committed
  Backup->>-Client: 200 <Chunk Contents>
  Client->>+Backup: GET /node/ledger-chunk?since=101
  Note over Backup: Backup node does not have 101-150
  Backup->>-Client: 308 Location: https://primary/node/ledger-chunk?since=101
  Client->>+Primary: GET /node/ledger-chunk?since=101
```
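For illustration, here is a sketch of the redirect decision implied by the two scenarios: serve locally when a committed chunk containing the index exists, try the next node in a stable order when the index predates the local startup snapshot, redirect to the primary when the index is beyond the last local committed chunk, and 404 otherwise. All types and helper names below are invented and do not mirror the PR's actual handler.

```cpp
// Sketch of the ?since=index redirect decision; illustrative only.
#include <cstddef>
#include <optional>
#include <string>
#include <variant>

struct ServeLocal { std::string chunk_name; }; // -> 308 Location: /node/ledger-chunk/<chunk_name>
struct RedirectTo { std::string url; };        // -> 308 Location: <url>
struct NotFound {};                            // -> 404, the client waits and retries

using Decision = std::variant<ServeLocal, RedirectTo, NotFound>;

Decision handle_since(
  size_t since,
  size_t local_start_idx,                        // startup snapshot version
  const std::optional<std::string>& local_chunk, // committed chunk containing `since`, if any
  const std::string& next_node_url,              // next node in a stable order
  const std::optional<std::string>& primary_url) // unset if this node is the primary
{
  if (local_chunk.has_value())
  {
    // The chunk is committed locally: point the client at the named file.
    return ServeLocal{*local_chunk};
  }
  if (since < local_start_idx)
  {
    // Predates this node's startup snapshot: try the next node in a stable order.
    return RedirectTo{next_node_url + "/node/ledger-chunk?since=" + std::to_string(since)};
  }
  if (primary_url.has_value())
  {
    // Beyond the last local committed chunk: the primary is the most likely to have it.
    return RedirectTo{*primary_url + "/node/ledger-chunk?since=" + std::to_string(since)};
  }
  // This node is the primary and has not committed that chunk yet.
  return NotFound{};
}
```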
  • Serialise all access to Ledger
  • Check performance impact of serialisation
  • Add fetch file API
  • Test fetch file API
  • File API redirects to primary if requests go beyond the last known local committed file
  • Test primary redirection
  • Separate LedgerChunkRead interface feature
  • Factor out pass with the snapshot handlers
  • /ledger-chunk?since=index needs to redirect to the next node in a stable order when a file is not found and index is strictly less than the locally known start index (i.e. the startup snapshot version)
  • Documentation/Changelog for fetch file API
  • Committed chunks need to be fsync()'ed before they are closed (see the sketch after this list)
  • Final performance check
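Regarding the fsync item above, a minimal sketch using plain POSIX calls; the function name and error handling are illustrative rather than the PR's implementation.

```cpp
// Sketch: make a chunk durable before it is exposed as a committed file.
#include <fcntl.h>
#include <unistd.h>
#include <stdexcept>
#include <string>

void finalise_committed_chunk(const std::string& path)
{
  const int fd = open(path.c_str(), O_WRONLY);
  if (fd < 0)
    throw std::runtime_error("cannot open " + path);

  // Flush the chunk contents to stable storage before closing, so a chunk
  // served by the download API is never missing bytes after a crash.
  if (fsync(fd) != 0)
  {
    close(fd);
    throw std::runtime_error("fsync failed for " + path);
  }
  if (close(fd) != 0)
    throw std::runtime_error("close failed for " + path);
}
```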

@achamayou removed the bench-ab label on Jan 8, 2026
@achamayou changed the title from "[Draft] Thread-safe ledger file access interface" to "[Draft] Ledger Chunk download API" on Jan 12, 2026
if (!node_operation->can_replicate())
{
LOG_INFO_FMT(
"This node cannot serve ledger chunk including index {} - trying "
Member

What happens if even the primary node doesn't have the requested chunk? For instance, if a node joins the network late, it might not have all the ledger files that were committed before it joined. If this node becomes the primary at some point and a client requests an older chunk from before the node joined, redirecting to the primary would not work (assuming that the node that first received the request and redirected to the primary didn't have the file locally either).

Would it make sense to implement a discovery mechanism for ledger files that could allow nodes to redirect requests to the node that actually has the file? This could also help with downloading snapshots and eliminate the need to check each node one by one.

Member Author

That can happen, and it's a 404 right now, which I think is correct. On a typical primary, the only way this can happen is if the primary hasn't committed that chunk yet, in which case 404 is clearly the right answer. The client must wait and retry.

On an isolated primary (i.e. a node that was once primary, but is now unable to replicate because it's partitioned, and someone else has been elected), there is a bit of delay until CheckQuorum kicks in, and causes it to step down, but that's very short.

I am not sure what a separate mechanism would look like, or how it would do better on this, because it seems to me that it would be bound by the same conditions and timeouts. I am not sure I understand why this would help download snapshots?
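As an illustration of that wait-and-retry behaviour, a minimal polling sketch, reusing the illustrative fetch_chunk helper from the client sketch earlier; the retry count and interval are arbitrary.

```cpp
// Sketch: keep polling while the chunk is not committed anywhere yet (404).
#include <chrono>
#include <cstddef>
#include <string>
#include <thread>

bool fetch_chunk(const std::string&, size_t, const std::string&); // from the earlier sketch

bool fetch_chunk_with_retry(
  const std::string& node_url, size_t since, const std::string& out_path)
{
  for (size_t attempt = 0; attempt < 30; ++attempt)
  {
    if (fetch_chunk(node_url, since, out_path))
      return true;
    std::this_thread::sleep_for(std::chrono::seconds(2)); // wait and retry
  }
  return false;
}
```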

}

const auto chunk_path =
read_ledger_subsystem->committed_ledger_path_with_idx(since_idx);
Member

If there are potentially forked files stored across different nodes (with different start and end seqnos), the API may lose its idempotency and return different results for identical queries. Is there a way to address this situation? Should nodes verify with the network to ensure they have the correct chunk (which aligns with the state of the ledger), rather than a leftover from an earlier fork?

I'm not sure if this issue is actually solvable, but I'm wondering if there are scenarios where the backup process might download a forked chunk that doesn't match the rest of the ledger and then get stuck as a result when attempting to find the next chunk.

@achamayou (Member Author) commented Jan 13, 2026

> If there are potentially forked files stored across different nodes (with different start and end seqnos), the API may lose its idempotency and return different results for identical queries. Is there a way to address this situation?

We protect against forks by verifying the consensus algorithm formally, and we connect the algorithm and the implementation with trace validation. Forks caused by a consortium executing competing DRs are out of scope, there is no way to decide which one is "best" or "correct".

> The backup process might download a forked chunk that doesn't match the rest of the ledger and then get stuck as a result when attempting to find the next chunk.

It's likely possible to construct this scenario by executing competing DRs and cherry-picking chunks, but it's also trivially possible to make it unrecoverable by removing alternative chunks from all locations. It's also possible to destroy persistence without executing forks.

The API operates under the assumption that the node is the sole writer, at any given time, to the main ledger storage in the configuration, and it makes basic assumptions about the FS implementation (e.g. reads from a file that has been fsync()ed and is no longer being written to are idempotent). If these assumptions do not hold, neither do the guarantees.

}
}
},
"/node/ledger-chunk": {
Member

Would it be helpful to also implement an endpoint that lists all ledger files stored locally on a node, with pagination? This could help in finding which node has a specific ledger file and comparing local ledger states for debugging purposes.

Member

On top of that, it might be helpful to also have an endpoint that internally calls all the trusted nodes and returns a de-duplicated list of ledger files from the entire ledger.

Essentially, the former endpoint could be node-specific and just used for internal discovery / debugging. The latter would apply to the entire service and be the single source of truth for finding ledger files across all the nodes (no matter which node serves the request).

Member Author

> Would it be helpful to also implement an endpoint that lists all ledger files stored locally on a node, with pagination?

That's very easy to do on the client side, or, for an operator debugging, by mounting the share read-only.

> On top of that, it might be helpful to also have an endpoint that internally calls all the trusted nodes and returns a de-duplicated list of ledger files from the entire ledger.

That's what the archive share is; the network is not going to provide that, because it will GC files locally, and the fault model is that not all nodes are available at all times.

/**
* Returns the path to the committed ledger file containing the given
* index, or nullopt if no such file exists. Only returns paths from the
* main ledger directory.
Member

Would it be an issue to return only the paths from the main ledger directory if we later add a garbage collection process that deletes ledger files from the main directory once they're present in the shared read-only directory? Ideally, the node should check all available directories and locate the file in any of them.

Member

I guess this might be intentional to handle cases where there are corrupted files in the read-only directory. However, ignoring the read-only directory entirely could be problematic, as some files might exist only there and not in the main directory of any node. Maybe the node could perform a quick verification to ensure the file found is valid before returning it.

Member Author

The files from the read-only mount are by definition already archived, and should not be archived again. We are not going to re-serve them here; the purpose of the endpoint is not to serve the de-duplicated archive.

The purpose of this API is to enable ledger file backup, not to act as a long-term static file share for archived ledger files.
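As an illustration of an index-to-path lookup over the main ledger directory only, here is a sketch; the ledger_<start>-<end>.committed name format is taken from the scenarios in the description, and the scanning logic is invented rather than reflecting committed_ledger_path_with_idx's actual implementation.

```cpp
// Illustrative index -> committed-chunk-path lookup over the main ledger
// directory only; names and logic are invented for this sketch.
#include <cstddef>
#include <cstdio>
#include <filesystem>
#include <optional>
#include <string>

std::optional<std::filesystem::path> committed_chunk_containing(
  const std::filesystem::path& main_ledger_dir, size_t idx)
{
  for (const auto& entry : std::filesystem::directory_iterator(main_ledger_dir))
  {
    const auto name = entry.path().filename().string();
    // Only consider committed chunks, e.g. ledger_51-100.committed
    if (name.rfind("ledger_", 0) != 0 || name.find(".committed") == std::string::npos)
      continue;
    size_t start = 0, end = 0;
    if (sscanf(name.c_str(), "ledger_%zu-%zu", &start, &end) == 2 && start <= idx && idx <= end)
      return entry.path();
  }
  return std::nullopt; // not present locally: 404 or redirect elsewhere
}
```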
