Skip to content

Core: Make sequence number conflicts retryable in concurrent commits#15126

Open
agnes-xinyi-lu wants to merge 6 commits intoapache:mainfrom
agnes-xinyi-lu:xinyilu/fix2
Open

Core: Make sequence number conflicts retryable in concurrent commits#15126
agnes-xinyi-lu wants to merge 6 commits intoapache:mainfrom
agnes-xinyi-lu:xinyilu/fix2

Conversation

@agnes-xinyi-lu
Copy link
Contributor

@agnes-xinyi-lu agnes-xinyi-lu commented Jan 23, 2026

When multiple processes concurrently commit to different branches of the
same table through the REST catalog, sequence number validation failures
in TableMetadata.addSnapshot() were throwing non-retryable ValidationException
instead of retryable CommitFailedException.
This fix catches the sequence number validation error in CatalogHandlers.commit()
and wraps it in ValidationFailureException(CommitFailedException) to:
- Skip server-side retry (which won't help since sequence number is in the request)
- Return CommitFailedException to the client so it can retry with refreshed metadata

Issue #15001

@amogh-jahagirdar
Copy link
Contributor

Thanks @agnes-xinyi-lu will take a look with fresh eyes tomorrow.

    When multiple processes concurrently commit to different branches of the
    same table through the REST catalog, sequence number validation failures
    in TableMetadata.addSnapshot() were throwing non-retryable ValidationException
    instead of retryable CommitFailedException.

    This fix catches the sequence number validation error in CatalogHandlers.commit()
    and wraps it in ValidationFailureException(CommitFailedException) to:
    - Skip server-side retry (which won't help since sequence number is in the request)
    - Return CommitFailedException to the client so it can retry with refreshed metadata
@github-actions github-actions bot removed the API label Jan 30, 2026
@agnes-xinyi-lu
Copy link
Contributor Author

@amogh-jahagirdar @singhpk234 any more comments? Could we get this in this week?

@Parth-Brahmbhatt
Copy link
Contributor

@amogh-jahagirdar @singhpk234 can you take another look ?

@amogh-jahagirdar amogh-jahagirdar added this to the Iceberg 1.11.0 milestone Feb 10, 2026
Copy link
Contributor

@amogh-jahagirdar amogh-jahagirdar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @agnes-xinyi-lu , I think this looks good, I just have some suggestions on how to better improve the test so we can have some stronger assertions, but I think it's reasonable as is. I think we should get some other reviews as well

I've also went ahead and added it to the 1.11 milestone since I think it's an important improvement.

import org.apache.iceberg.exceptions.ValidationException;

/**
* A {@link ValidationException} that indicates the client can retry with refreshed metadata.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a bit too specific to the REST context because it mentions the client.

I think that this should focus on what the problem is: a validation failed but the commit can be fixed and retried. This is specifically not a conflict.

snapshot.snapshotId());

ValidationException.check(
RetryableValidationException.check(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The first row ID check also needs to be updated.

// Wrap in ValidationFailureException to skip server retry, return to client as
// CommitFailedException so the client can retry with refreshed metadata.
throw new ValidationFailureException(
new CommitFailedException(e, "Commit conflict: %s", e.getMessage()));
Copy link
Contributor

@rdblue rdblue Feb 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should not rethrow a different exception. Instead, add RetryableValidationException to exceptions passed to Tasks.onlyRetryOn.

Also, this commit message is incorrect. We are not throwing CommitFailedException because there was not a commit conflict.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rdblue this exception is only retryable on the client side, server side retries will not help because snapshot is provided by the client side.
If we want to change SnapshotProducer to retry on this RetryableValidationException, then should we create a different status code in the SPEC other than 409 which is CommitFailedException? Or should we keep the wrapping of CommitFailedException(but update message to something more specific like Retryable validation failure) and avoid the spec change?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants

Comments