Core: Make sequence number conflicts retryable in concurrent commits#15126
Core: Make sequence number conflicts retryable in concurrent commits#15126agnes-xinyi-lu wants to merge 6 commits intoapache:mainfrom
Conversation
38e36ca to
59a899f
Compare
core/src/test/java/org/apache/iceberg/rest/TestRestCatalogConcurrentWrites.java
Outdated
Show resolved
Hide resolved
core/src/test/java/org/apache/iceberg/rest/TestRestCatalogConcurrentWrites.java
Outdated
Show resolved
Hide resolved
5f76559 to
b245667
Compare
|
Thanks @agnes-xinyi-lu will take a look with fresh eyes tomorrow. |
core/src/test/java/org/apache/iceberg/rest/TestRESTCatalog.java
Outdated
Show resolved
Hide resolved
When multiple processes concurrently commit to different branches of the
same table through the REST catalog, sequence number validation failures
in TableMetadata.addSnapshot() were throwing non-retryable ValidationException
instead of retryable CommitFailedException.
This fix catches the sequence number validation error in CatalogHandlers.commit()
and wraps it in ValidationFailureException(CommitFailedException) to:
- Skip server-side retry (which won't help since sequence number is in the request)
- Return CommitFailedException to the client so it can retry with refreshed metadata
b245667 to
5ace567
Compare
|
@amogh-jahagirdar @singhpk234 any more comments? Could we get this in this week? |
|
@amogh-jahagirdar @singhpk234 can you take another look ? |
There was a problem hiding this comment.
Thanks @agnes-xinyi-lu , I think this looks good, I just have some suggestions on how to better improve the test so we can have some stronger assertions, but I think it's reasonable as is. I think we should get some other reviews as well
I've also went ahead and added it to the 1.11 milestone since I think it's an important improvement.
| import org.apache.iceberg.exceptions.ValidationException; | ||
|
|
||
| /** | ||
| * A {@link ValidationException} that indicates the client can retry with refreshed metadata. |
There was a problem hiding this comment.
This is a bit too specific to the REST context because it mentions the client.
I think that this should focus on what the problem is: a validation failed but the commit can be fixed and retried. This is specifically not a conflict.
| snapshot.snapshotId()); | ||
|
|
||
| ValidationException.check( | ||
| RetryableValidationException.check( |
There was a problem hiding this comment.
The first row ID check also needs to be updated.
| // Wrap in ValidationFailureException to skip server retry, return to client as | ||
| // CommitFailedException so the client can retry with refreshed metadata. | ||
| throw new ValidationFailureException( | ||
| new CommitFailedException(e, "Commit conflict: %s", e.getMessage())); |
There was a problem hiding this comment.
This should not rethrow a different exception. Instead, add RetryableValidationException to exceptions passed to Tasks.onlyRetryOn.
Also, this commit message is incorrect. We are not throwing CommitFailedException because there was not a commit conflict.
There was a problem hiding this comment.
@rdblue this exception is only retryable on the client side, server side retries will not help because snapshot is provided by the client side.
If we want to change SnapshotProducer to retry on this RetryableValidationException, then should we create a different status code in the SPEC other than 409 which is CommitFailedException? Or should we keep the wrapping of CommitFailedException(but update message to something more specific like Retryable validation failure) and avoid the spec change?
When multiple processes concurrently commit to different branches of the
same table through the REST catalog, sequence number validation failures
in TableMetadata.addSnapshot() were throwing non-retryable ValidationException
instead of retryable CommitFailedException.
This fix catches the sequence number validation error in CatalogHandlers.commit()
and wraps it in ValidationFailureException(CommitFailedException) to:
- Skip server-side retry (which won't help since sequence number is in the request)
- Return CommitFailedException to the client so it can retry with refreshed metadata
Issue #15001