[Draft] Ledger Chunk download API #7550
Conversation
```cpp
if (!node_operation->can_replicate())
{
  LOG_INFO_FMT(
    "This node cannot serve ledger chunk including index {} - trying "
```
What happens if even the primary node doesn't have the requested chunk? For instance, if a node joins the network late, it might not have all the ledger files that were committed before it joined. If this node becomes the primary at some point and a client requests an older chunk from before the node joined, redirecting to the primary would not work (assuming that the node that first received the request and redirected to the primary didn't have the file locally either).
Would it make sense to implement a discovery mechanism for ledger files that could allow nodes to redirect requests to the node that actually has the file? This could also help with downloading snapshots and eliminate the need to check each node one by one.
That can happen, and it's a 404 right now, which I think is correct. On a typical primary, the only way this can happen is if the primary hasn't committed that chunk yet, in which case 404 is clearly the right answer. The client must wait and retry.
On an isolated primary (i.e. a node that was once primary, but is now unable to replicate because it's partitioned, and someone else has been elected), there is a bit of delay until CheckQuorum kicks in and causes it to step down, but that's very short.
I am not sure what a separate mechanism would look like, or how it would do better here, because it seems to me that it would be bound by the same conditions and timeouts. I am also not sure I understand why this would help with downloading snapshots.
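To make the wait-and-retry expectation concrete, here is a minimal client-side sketch. The `fetch_chunk()` helper, the endpoint shape, the retry count and the delay are all assumptions for illustration, not part of this PR:

```cpp
#include <chrono>
#include <cstddef>
#include <optional>
#include <string>
#include <thread>
#include <utility>

// Hypothetical transport stub: a real client would issue
// GET /node/ledger-chunk?since=<idx> here and return (status, body).
std::pair<int, std::string> fetch_chunk(size_t since_idx)
{
  (void)since_idx;
  return {404, ""}; // placeholder
}

std::optional<std::string> fetch_with_retry(size_t since_idx)
{
  using namespace std::chrono_literals;
  for (int attempt = 0; attempt < 10; ++attempt)
  {
    auto [status, body] = fetch_chunk(since_idx);
    if (status == 200)
    {
      return body; // raw chunk bytes
    }
    if (status == 404)
    {
      // Chunk not committed yet (or node stepping down): wait and retry
      std::this_thread::sleep_for(500ms);
      continue;
    }
    break; // other statuses are not retried in this sketch
  }
  return std::nullopt;
}
```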
```cpp
}

const auto chunk_path =
  read_ledger_subsystem->committed_ledger_path_with_idx(since_idx);
```
If there are potentially forked files stored across different nodes (with different start and end seqnos), the API may lose its idempotency and return different results for identical queries. Is there a way to address this situation? Should nodes verify with the network to ensure they have the correct chunk (which aligns with the state of the ledger), rather than a leftover from an earlier fork?
I'm not sure if this issue is actually solvable, but I'm wondering if there are scenarios where the backup process might download a forked chunk that doesn't match the rest of the ledger and then get stuck as a result when attempting to find the next chunk.
> If there are potentially forked files stored across different nodes (with different start and end seqnos), the API may lose its idempotency and return different results for identical queries. Is there a way to address this situation?

We protect against forks by formally verifying the consensus algorithm, and we connect the algorithm and the implementation with trace validation. Forks caused by a consortium executing competing DRs are out of scope; there is no way to decide which one is "best" or "correct".

> The backup process might download a forked chunk that doesn't match the rest of the ledger and then get stuck as a result when attempting to find the next chunk.

It's likely possible to construct this scenario by executing competing DRs and cherry-picking chunks, but it's also trivially possible to make it unrecoverable by removing alternative chunks from all locations. It's also possible to destroy persistence without executing forks.
The API operates under the assumption that the node is, at any given time, the sole writer to the main ledger storage in the configuration, and makes basic assumptions about the FS implementation (e.g. reads from a file that has been fsync()ed and is no longer being written are idempotent). If these assumptions do not hold, neither do the guarantees.
```json
      }
    }
  },
  "/node/ledger-chunk": {
```
Would it be helpful to also implement an endpoint that lists all ledger files stored locally on a node, with pagination? This could help in finding which node has a specific ledger file and comparing local ledger states for debugging purposes.
On top of that, it might be helpful to also have an endpoint that internally calls all the trusted nodes and returns a de-duplicated list of ledger files from the entire ledger.
Essentially, the former endpoint could be node-specific and just used for internal discovery / debugging. The latter would apply to the entire service and be the single source of truth for finding ledger files across all the nodes (no matter which node serves the request).
> Would it be helpful to also implement an endpoint that lists all ledger files stored locally on a node, with pagination?

That's very easy to do on the client side, or, for an operator debugging, by mounting the share read-only.
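For illustration, a minimal sketch of that client-side listing, assuming the share is mounted read-only at a hypothetical /mnt/ledger path and relying on committed chunks carrying the .committed suffix:

```cpp
#include <algorithm>
#include <filesystem>
#include <iostream>
#include <vector>

int main()
{
  // Assumed read-only mount point of the node's ledger directory
  const std::filesystem::path ledger_dir = "/mnt/ledger";

  std::vector<std::filesystem::path> committed;
  for (const auto& entry : std::filesystem::directory_iterator(ledger_dir))
  {
    // Committed chunks carry the ".committed" suffix
    if (entry.is_regular_file() && entry.path().extension() == ".committed")
    {
      committed.push_back(entry.path());
    }
  }

  std::sort(committed.begin(), committed.end());
  for (const auto& p : committed)
  {
    std::cout << p.filename().string() << "\n";
  }
  return 0;
}
```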
> On top of that, it might be helpful to also have an endpoint that internally calls all the trusted nodes and returns a de-duplicated list of ledger files from the entire ledger.

That's what the archive share is. The network is not going to provide that, because it will GC files locally, and the fault model is that not all nodes are available at all times.
```cpp
/**
 * Returns the path to the committed ledger file containing the given
 * index, or nullopt if no such file exists. Only returns paths from the
 * main ledger directory.
```
Would it be an issue to return only the paths from the main ledger directory if we later add a garbage collection process that deletes ledger files from the main directory once they're present in the shared read-only directory? Ideally, the node should check all available directories and locate the file in any of them.
I guess this might be intentional to handle cases where there are corrupted files in the read-only directory. However, ignoring the read-only directory entirely could be problematic, as some files might exist only there and not in the main directory of any node. Maybe the node could perform a quick verification to ensure the file found is valid before returning it.
The files from the read-only mount are by definition already archived, and should not be archived again. We are not going to re-serve them here; the purpose of the endpoint is not to serve the de-duplicated archive.
The purpose of this API is to enable ledger file backup, it is not to act as a long-term static file share for archived ledger files.
Adding an API that allows a ledger backup process without access to the ledger storage directories to efficiently fetch committed ledger chunks for archival/retention purposes.
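For illustration, a minimal sketch of the kind of backup loop this API is meant to support. The `fetch_chunk()` wrapper, the response shape and the output naming are assumptions standing in for the real HTTP call, not part of this PR:

```cpp
#include <cstddef>
#include <fstream>
#include <optional>
#include <string>

// Hypothetical shape of a successful GET /node/ledger-chunk?since=<idx>
// response: the raw chunk bytes and the last index the chunk covers.
struct ChunkResponse
{
  std::string bytes;
  size_t end_idx;
};

// Placeholder for the HTTP call; a real client would follow redirects
// and retry on 404 as discussed in the review thread above.
std::optional<ChunkResponse> fetch_chunk(size_t since_idx)
{
  (void)since_idx;
  return std::nullopt; // placeholder
}

void archive_from(size_t start_idx, const std::string& out_dir)
{
  size_t next_idx = start_idx;
  while (auto chunk = fetch_chunk(next_idx))
  {
    std::ofstream out(
      out_dir + "/chunk_since_" + std::to_string(next_idx) + ".committed",
      std::ios::binary);
    out.write(
      chunk->bytes.data(), static_cast<std::streamsize>(chunk->bytes.size()));
    // The next chunk starts immediately after the last index of this one
    next_idx = chunk->end_idx + 1;
  }
}
```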
Typical Scenario
Alternative Scenario
The initial node (Primary in this case) that the client hits has started from a snapshot, and does not have some past chunks. To make this more readable, let's say that Primary started from `snapshot_100.committed` and locally has: …

Backup has: …
`/ledger-chunk?since=index` needs to redirect to the next node in a stable order when a file is not found and the index is strictly less than the locally known start index (i.e. the startup snapshot version).
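A minimal sketch of that redirect rule, with hypothetical helpers and stub values standing in for the actual subsystem calls (names are illustrative only):

```cpp
#include <cstddef>
#include <optional>
#include <string>

// Hypothetical helpers, named for illustration only.
// Path of the local committed chunk containing idx, if any.
std::optional<std::string> local_chunk_path(size_t) { return std::nullopt; } // stub
// First index covered locally, i.e. the startup snapshot version when
// the node joined from a snapshot.
size_t local_start_idx() { return 100; } // stub: joined from snapshot_100
// Next node to try, in a stable order over the current trusted nodes.
std::string next_node_address() { return "node-1.example.com"; } // stub

enum class Outcome { Serve, Redirect, NotFound };

Outcome handle_ledger_chunk(size_t since_idx, std::string& redirect_to)
{
  if (local_chunk_path(since_idx).has_value())
  {
    return Outcome::Serve; // stream the file back to the client
  }
  if (since_idx < local_start_idx())
  {
    // The chunk predates this node's local ledger: another node may
    // still have it, so redirect in a stable order
    redirect_to = next_node_address();
    return Outcome::Redirect;
  }
  // Not committed anywhere yet (or no longer held locally): 404
  return Outcome::NotFound;
}
```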