Description
Apache Iceberg version
1.6.1
Query engine
Spark
Please describe the bug 🐞
Problem
When multiple processes concurrently commit to different branches (or write with different WAP IDs) of the same Iceberg table through the REST catalog, some commits fail with a non-retryable ValidationException. The exception is raised on the server side, when TableMetadata is built and addSnapshot is called:
Cannot add snapshot with sequence number X older than last sequence number X
Because the error is a ValidationException rather than a CommitFailedException, it is non-retryable: it bypasses automatic retries and wastes compute resources.
Repro in unit test.
Root Cause
- All snapshots share a global sequence number counter at the table level, but addSnapshot carries no extra update requirement guaranteeing that the new snapshot's sequence number is greater than the table's last sequence number.
- When a commit reaches TableMetadata.addSnapshot(), it fails validation because another concurrent commit to a different branch has already advanced the global sequence number.
- Since no requirement covers the sequence number, the requirement checks pass and the failure only surfaces afterwards, so it is thrown as ValidationException rather than CommitFailedException.
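The race can be illustrated with a toy model. This is a minimal Python sketch, not Iceberg code: `ToyTable`, `ValidationError`, and the branch names are hypothetical stand-ins, and the model ignores details of the real check such as the parent-snapshot condition. It only shows how two writers that start from the same base metadata collide on the table-wide counter.

```python
from dataclasses import dataclass, field


class ValidationError(Exception):
    """Hypothetical stand-in for Iceberg's non-retryable ValidationException."""


@dataclass
class ToyTable:
    # One counter for the whole table, shared by every branch.
    last_sequence_number: int = 0
    # Branch name -> sequence number of the snapshot at its head.
    branches: dict = field(default_factory=dict)

    def add_snapshot(self, branch: str, sequence_number: int) -> None:
        # Simplified version of the validation described above: the new
        # snapshot must be strictly newer than the table-wide counter.
        if sequence_number <= self.last_sequence_number:
            raise ValidationError(
                f"Cannot add snapshot with sequence number {sequence_number} "
                f"older than last sequence number {self.last_sequence_number}"
            )
        self.last_sequence_number = sequence_number
        self.branches[branch] = sequence_number


table = ToyTable(last_sequence_number=5)

# Both writers read the same base metadata and pick the next sequence number.
seq_for_a = table.last_sequence_number + 1
seq_for_b = table.last_sequence_number + 1

# The first commit wins and moves the counter to 6.
table.add_snapshot("branch-a", seq_for_a)

# The second commit targets a different branch but reuses sequence number 6.
try:
    table.add_snapshot("branch-b", seq_for_b)
    error_message = None
except ValidationError as exc:
    error_message = str(exc)

print(error_message)
```

In this model the two writers never touch the same branch, yet the second one still fails, which matches the "sequence number X older than last sequence number X" shape of the reported error.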
Relevant work
A similar issue previously occurred in OSS with replace table; it was fixed/mitigated by checking whether the snapshot has a parent. In this case, however, the commit is a normal table update, and we probably don't want to bypass the check, because the global sequence number is what maintains the order of all snapshots.
Proposed Solution
Add a new AssertLastSequenceNumber update requirement that validates sequence number conflicts before the commit is applied.
(Note that the REST spec needs to be updated to include the new requirement.)
Proposed PR #15002
Behavior After Fix
- Sequence number conflicts are caught early by the AssertLastSequenceNumber requirement.
- Conflicts throw CommitFailedException, which triggers automatic client-side retries.
- Concurrent commits to different branches eventually succeed through the retry mechanism.
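The intended flow can be sketched with another toy model. Again, this is Python and not Iceberg code: `ToyCatalog`, `CommitFailedError`, and `commit_with_retries` are hypothetical names, and treating the requirement as an equality check against the writer's base sequence number is an assumption made for illustration; the actual semantics are defined by the proposed PR.

```python
class CommitFailedError(Exception):
    """Hypothetical stand-in for Iceberg's retryable CommitFailedException."""


class ToyCatalog:
    def __init__(self, last_sequence_number: int) -> None:
        self.last_sequence_number = last_sequence_number
        self.branches = {}

    def commit(self, branch: str, expected_last_sequence_number: int) -> int:
        # Assumed requirement check: reject the commit up front if the table
        # moved since the writer read its base metadata.
        if expected_last_sequence_number != self.last_sequence_number:
            raise CommitFailedError(
                f"expected last sequence number {expected_last_sequence_number}, "
                f"found {self.last_sequence_number}"
            )
        self.last_sequence_number += 1
        self.branches[branch] = self.last_sequence_number
        return self.last_sequence_number


def commit_with_retries(catalog, branch, base_sequence_number, max_attempts=4):
    """Retry on CommitFailedError, refreshing the base before each new attempt."""
    base = base_sequence_number
    for attempt in range(1, max_attempts + 1):
        try:
            catalog.commit(branch, base)
            return attempt
        except CommitFailedError:
            # Refresh: re-read the current table state, then try again.
            base = catalog.last_sequence_number
    raise CommitFailedError("retries exhausted")


catalog = ToyCatalog(last_sequence_number=5)
stale_base = catalog.last_sequence_number  # both writers start from 5

catalog.commit("branch-a", stale_base)  # first writer wins
attempts = commit_with_retries(catalog, "branch-b", stale_base)

print(attempts, catalog.last_sequence_number, catalog.branches)
```

Here the second writer's first attempt is rejected by the requirement rather than by metadata validation, so it refreshes and succeeds on the next attempt instead of failing outright.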
Willingness to contribute
- I can contribute a fix for this bug independently
- I would be willing to contribute a fix for this bug with guidance from the Iceberg community
- I cannot contribute a fix for this bug at this time