[iceberg-core] REST Catalog Concurrent Commit Race Condition #15001

@agnes-xinyi-lu

Apache Iceberg version

1.6.1

Query engine

Spark

Please describe the bug 🐞

Problem

When multiple processes concurrently commit to different branches (or write to different WAPs) of the same Iceberg table through the REST catalog, some commits fail with a non-retryable ValidationException. The exception is thrown on the server side, while building TableMetadata, when addSnapshot is called:
Cannot add snapshot with sequence number X older than last sequence number X

Because the failure surfaces as a ValidationException instead of a CommitFailedException, it is treated as non-retryable. It bypasses automatic retries and wastes compute resources.
Reproduced in a unit test.

Root Cause

  1. All snapshots share a global sequence number counter at the table level, but an addSnapshot update carries no extra requirement guaranteeing that the new snapshot's sequence number is greater than the table's last sequence number.
  2. When a commit reaches TableMetadata.addSnapshot(), it fails validation because another concurrent commit to a different branch has already incremented the global sequence number, so the new snapshot's sequence number is no longer greater than the last sequence number.
  3. This validation failure occurs after the requirement checks pass (no requirement covers the sequence number), so it is thrown as a ValidationException rather than a CommitFailedException.
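
The sequence of events above can be illustrated with a minimal sketch. It is not Iceberg code: `SketchMetadata` and `RaceSketch` are hypothetical stand-ins, `IllegalArgumentException` stands in for ValidationException, and the starting sequence number 5 is arbitrary. It only assumes that each writer derives its snapshot's sequence number from the last sequence number it read.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical, simplified stand-in for table metadata (not Iceberg's TableMetadata).
final class SketchMetadata {
  long lastSequenceNumber;
  // branch name -> sequence number of the branch head
  final Map<String, Long> branchHeads = new HashMap<>();

  SketchMetadata(long lastSequenceNumber) {
    this.lastSequenceNumber = lastSequenceNumber;
  }

  // Models the check described above: the new snapshot's sequence number
  // must be greater than the table-wide last sequence number.
  void addSnapshot(String branch, long sequenceNumber) {
    if (sequenceNumber <= lastSequenceNumber) {
      // Stand-in for the non-retryable ValidationException.
      throw new IllegalArgumentException(String.format(
          "Cannot add snapshot with sequence number %d older than last sequence number %d",
          sequenceNumber, lastSequenceNumber));
    }
    lastSequenceNumber = sequenceNumber;
    branchHeads.put(branch, sequenceNumber);
  }
}

final class RaceSketch {
  // Returns "ok" if both commits apply, otherwise the failure message.
  static String run() {
    SketchMetadata server = new SketchMetadata(5);

    // Both writers read the same base metadata before either commits.
    long seenByA = server.lastSequenceNumber;
    long seenByB = server.lastSequenceNumber;

    // Writer A commits to its branch first; the global counter moves to 6.
    server.addSnapshot("branch-a", seenByA + 1);

    try {
      // Writer B targets a different branch, but its snapshot still carries 6.
      server.addSnapshot("branch-b", seenByB + 1);
      return "ok";
    } catch (IllegalArgumentException e) {
      return e.getMessage();
    }
  }
}
```

Calling `RaceSketch.run()` returns the failure message for writer B even though the two writers never touch the same branch, because the counter is shared at the table level.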

Relevant work

Previously in OSS there was a similar issue with replace table, which was fixed or mitigated by checking whether the snapshot has a parent. This case is a normal table update, though, and we probably don't want to bypass the check, because the global sequence number is what maintains the order of all snapshots.

Proposed Solution

Add a new AssertLastSequenceNumber update requirement that validates sequence number conflicts before the commit is applied.
(Note that the REST spec needs to be updated to include this requirement.)

Proposed PR #15002
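
To make the idea concrete, here is a rough sketch of what such a requirement could look like. It is not the code from the proposed PR: `AssertLastSequenceNumberSketch` and `SketchCommitFailedException` are hypothetical names, and the requirement is reduced to comparing one number so that it compiles without Iceberg on the classpath.

```java
// Hypothetical stand-in for a retryable commit failure (not Iceberg's CommitFailedException).
final class SketchCommitFailedException extends RuntimeException {
  SketchCommitFailedException(String message) {
    super(message);
  }
}

// Hypothetical requirement: the client records the last sequence number it
// saw, and the server verifies it before applying any metadata updates.
final class AssertLastSequenceNumberSketch {
  private final long expectedLastSequenceNumber;

  AssertLastSequenceNumberSketch(long expectedLastSequenceNumber) {
    this.expectedLastSequenceNumber = expectedLastSequenceNumber;
  }

  // Runs during requirement checking, i.e. before addSnapshot is reached.
  void validate(long currentLastSequenceNumber) {
    if (currentLastSequenceNumber != expectedLastSequenceNumber) {
      throw new SketchCommitFailedException(String.format(
          "Requirement failed: last sequence number changed: expected %d != %d",
          expectedLastSequenceNumber, currentLastSequenceNumber));
    }
  }
}
```

Because the mismatch is detected during requirement validation, it can be reported as a retryable commit failure instead of reaching the later sequence number validation.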

Behavior After Fix

  • Sequence number conflicts are caught early by AssertLastSequenceNumber requirement
  • Conflicts throw CommitFailedException which triggers automatic client-side retries
  • Concurrent commits to different branches eventually succeed through the retry mechanism
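
The expected behavior can be sketched as a client-side retry loop. As before, this is an illustration under assumptions rather than Iceberg's retry logic: `SketchServer`, `RetryableConflict`, and `RetrySketch` are hypothetical, and the server state is reduced to the last sequence number.

```java
// Hypothetical stand-in for a retryable commit failure.
final class RetryableConflict extends RuntimeException {
  RetryableConflict(String message) {
    super(message);
  }
}

// Hypothetical server that checks the asserted last sequence number
// before applying the update.
final class SketchServer {
  private long lastSequenceNumber;

  SketchServer(long lastSequenceNumber) {
    this.lastSequenceNumber = lastSequenceNumber;
  }

  synchronized long lastSequenceNumber() {
    return lastSequenceNumber;
  }

  synchronized void commit(long assertedLastSequenceNumber, long newSequenceNumber) {
    if (assertedLastSequenceNumber != lastSequenceNumber) {
      throw new RetryableConflict("last sequence number changed");
    }
    lastSequenceNumber = newSequenceNumber;
  }
}

final class RetrySketch {
  // Returns the number of attempts the commit needed.
  static int commitWithRetry(SketchServer server, long staleBase, int maxAttempts) {
    long base = staleBase;
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      try {
        // The snapshot's sequence number is derived from the base the client saw.
        server.commit(base, base + 1);
        return attempt;
      } catch (RetryableConflict e) {
        // Refresh the base and rebuild the commit on top of it.
        base = server.lastSequenceNumber();
      }
    }
    throw new IllegalStateException("gave up after " + maxAttempts + " attempts");
  }
}
```

With two writers that both start from last sequence number 5, the first commit succeeds on its first attempt and the second succeeds on its second attempt after refreshing, leaving the table at sequence number 7.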

Willingness to contribute

  • I can contribute a fix for this bug independently
  • I would be willing to contribute a fix for this bug with guidance from the Iceberg community
  • I cannot contribute a fix for this bug at this time

Labels: bug
