Core: Detect duplicate DVs for a data file and merge them before committing #15006
amogh-jahagirdar wants to merge 28 commits into apache:main from
Conversation
Still cleaning some things up, so leaving this in draft, but feel free to comment. Basically there are cases in Spark where a data file can be split across multiple tasks, and if deletes happen to touch each of those splits we'd incorrectly produce multiple DVs for a given data file (discovered this recently with a user who had Spark AQE enabled, but I think file splitting can happen in more cases). We currently throw on read in such cases, but ideally we prevent this on write by detecting and merging pre-commit. The reason this is done behind the API is largely to be defensive from a library perspective: if an engine/integration happens to produce multiple DVs, we can at least fix them up pre-commit. If there are too many to reasonably rewrite on a single node, engines could do distributed writes to fix them up before handing the files to the API, but from a library perspective it seems reasonable to pay this overhead to prevent bad commits across any integration.
core/src/main/java/org/apache/iceberg/io/OutputFileFactory.java
Probably keep a boolean in case we detect a duplicate. That way we don't have to pay the price of grouping by referenced file every time to detect possible duplicates; only if we detect one at the time of adding it do we do the dedupe/merge.
We also could just keep a mapping specific to duplicates. That shrinks down how much work we need to do because instead of grouping by every referenced data file in case of duplicates, we just go through the duplicates set. It's maybe a little more memory, but if we consider that we expect duplicates to generally be rare, it feels like a generally better solution.
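For illustration, a rough sketch of that duplicate-specific tracking (the class and method names are hypothetical, not the PR's code): only referenced data files that actually collide are recorded, so the pre-commit pass only has to walk that set.

import java.util.List;
import java.util.Map;
import java.util.Set;
import org.apache.iceberg.DeleteFile;
import org.apache.iceberg.relocated.com.google.common.collect.Lists;
import org.apache.iceberg.relocated.com.google.common.collect.Maps;
import org.apache.iceberg.relocated.com.google.common.collect.Sets;

// Hypothetical sketch, not the implementation in this PR
class DuplicateDvTracker {
  // all DVs keyed by the data file they reference
  private final Map<String, List<DeleteFile>> dvsByReferencedFile = Maps.newHashMap();
  // only the referenced files that received more than one DV
  private final Set<String> duplicatedFiles = Sets.newHashSet();

  void add(DeleteFile dv) {
    List<DeleteFile> dvs =
        dvsByReferencedFile.computeIfAbsent(
            dv.referencedDataFile(), ignored -> Lists.newArrayList());
    dvs.add(dv);
    if (dvs.size() > 1) {
      duplicatedFiles.add(dv.referencedDataFile());
    }
  }

  // the pre-commit fix-up only needs to visit these files
  Set<String> duplicatedFiles() {
    return duplicatedFiles;
  }

  boolean hasDuplicates() {
    return !duplicatedFiles.isEmpty();
  }
}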
    DeleteFileSet deleteFiles =
        newDeleteFilesBySpec.computeIfAbsent(spec.specId(), ignored -> DeleteFileSet.create());
    if (deleteFiles.add(file)) {
      addedFilesSummary.addedFile(spec, file);
Because we may be merging duplicates, we don't update the summary for delete files until after we dedupe and are just about to write the new manifests.
    Pair<List<PositionDelete<?>>, DeleteFile> deletesA =
        deleteFile(tab, dataFileA, new Object[] {"aa"}, new Object[] {"a"});
    Pair<List<PositionDelete<?>>, DeleteFile> deletesB =
        deleteFile(tab, dataFileA, new Object[] {"bb"}, new Object[] {"b"});
This fix surfaced an issue in some of the TestPositionDeletesTable tests where we were setting the wrong data file for a delete file; we'd add a second DV for the same data file, which would then get merged by the new logic and break some of the later assertions.
    // Add Data Files with EQ and POS deletes
    DeleteFile fileADeletes = fileADeletes();
    DeleteFile fileA2Deletes = fileA2Deletes();
    DeleteFile fileBDeletes = fileBDeletes();
This test had to be fixed after the recent changes because the file paths for data files B and B2 were previously set to the same value, so the DVs for both referenced the same file (which probably wasn't the intention of these tests) and counted as duplicates. After this change we merge the DVs in the commit, and the merged DV then gets treated as a dangling delete and fails some of the assertions.
Since these tests are only testing the equality delete case, we can simplify by removing the usage of the fileB deletes; it's a more minimal test that tests the same thing.
Also note, I'd generally take this in a separate PR, but I think there's a good argument that this change should go into a 1.10.2 patch release to prevent invalid table states; in that case we'd need to keep these changes together.
  public static PositionDeleteIndex readDV(
      DeleteFile deleteFile, FileIO fileIO, EncryptionManager encryptionManager) {
    Preconditions.checkArgument(
        ContentFileUtil.isDV(deleteFile), "Delete file must be a deletion vector");
"Cannot read, not a deletion vector: %s", deleteFile.location()?
    Preconditions.checkArgument(
        ContentFileUtil.isDV(deleteFile), "Delete file must be a deletion vector");
    InputFile inputFile =
        EncryptingFileIO.combine(fileIO, encryptionManager).newInputFile(deleteFile);
I think that the caller should combine these so that the FileIO can handle the DeleteFile directly. We don't want utility methods creating a one-time-use EncryptingFileIO. This should just assume that the FileIO has been wrapped appropriately for the table.
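As a minimal sketch of that caller-side wrapping (the helper name is illustrative), assuming the caller has a Table in hand: combine the table's FileIO and EncryptionManager once and pass the wrapped FileIO down, so readDV does not build a one-time-use EncryptingFileIO itself.

import org.apache.iceberg.Table;
import org.apache.iceberg.encryption.EncryptingFileIO;
import org.apache.iceberg.io.FileIO;

// Illustrative helper: wrap once at the table level, reuse everywhere
class TableIOWrapping {
  static FileIO wrappedIO(Table table) {
    return EncryptingFileIO.combine(table.io(), table.encryption());
  }
}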
    }
  }

  public static byte[] readBytes(InputFile inputFile, long offset, int length) {
I think this ends up being confusing because RangeReadable#readFully ends up calling the method above. It isn't clear which one to use, except that this one allocates a byte array. That's not related to the difference in behavior, where this one looks for a specific FileIO capability. It would be reasonable to call IOUtil.readFully(inputFile.newStream(), buffer, offset, buffer.length) to reuse a buffer, even though that would miss the RangeReadable optimization.
To improve this:
- It should also be named readFully to capture the semantics: this will not return after a partial read. (This should also be covered by the method's Javadoc.)
- It should accept a destination buffer so that the difference doesn't make people choose one or the other.
- It should also use the local implementation of readFully(InputStream, byte[]) instead of the Guava one.
- It will need a second offset. The readFully implementation above uses offset as the location in the destination buffer, not in the source stream (which is assumed to be in position).
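To make that concrete, a minimal sketch of a readFully with a destination buffer and a separate file offset (class and parameter names are illustrative, not the final API), assuming a RangeReadable fast path and a plain read loop as the fallback:

import java.io.IOException;
import java.io.UncheckedIOException;
import org.apache.iceberg.io.InputFile;
import org.apache.iceberg.io.RangeReadable;
import org.apache.iceberg.io.SeekableInputStream;

class ReadFullySketch {
  // reads exactly `length` bytes starting at `posInFile` into `buffer` at `bufferOffset`
  static void readFully(
      InputFile inputFile, long posInFile, byte[] buffer, int bufferOffset, int length) {
    try (SeekableInputStream stream = inputFile.newStream()) {
      if (stream instanceof RangeReadable) {
        // range-read optimization when the stream supports it
        ((RangeReadable) stream).readFully(posInFile, buffer, bufferOffset, length);
      } else {
        // local readFully-style loop instead of the Guava helper
        stream.seek(posInFile);
        int read = 0;
        while (read < length) {
          int n = stream.read(buffer, bufferOffset + read, length - read);
          if (n < 0) {
            throw new IOException("Unexpected end of stream");
          }
          read += n;
        }
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}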
      PositionDeleteIndex positionDeleteIndex,
      PartitionSpec spec,
      StructLike partition) {
    throw new UnsupportedOperationException("Delete with positionDeleteIndex is not supported");
Minor: This could be implemented by deleting specific positions from the delete index.
        taskId);
  }

  public static Builder builderFor(
Why doesn't this create a builder and then configure it? Why does all of this need to be passed into the builder constructor? Seems like that defeats the purpose of the builder.
If the idea was to be able to create a builder without passing a table, that seems reasonable. But this should use as many builder methods as possible.
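For comparison, the existing entry point already composes well with builder methods; a hedged sketch of the preferred style (the operation id value is just an example):

import org.apache.iceberg.FileFormat;
import org.apache.iceberg.Table;
import org.apache.iceberg.io.OutputFileFactory;

class OutputFileFactoryUsage {
  static OutputFileFactory puffinFactory(Table table, int partitionId, long taskId) {
    // configure through builder methods rather than a long builder constructor
    return OutputFileFactory.builderFor(table, partitionId, taskId)
        .format(FileFormat.PUFFIN)
        .operationId("merge-dvs") // illustrative operation id
        .build();
  }
}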
import org.apache.iceberg.types.Comparators;
import org.apache.iceberg.util.Tasks;

class DVUtil {
Should DVUtil be in deletes? That would require making some of this public.
      PositionDeleteIndex positionDeleteIndex,
      PartitionSpec spec,
      StructLike partition) {
    throw new UnsupportedOperationException("Delete with positionDeleteIndex is not supported");
Error messages generally shouldn't use identifiers from code. Would this be more helpful to users if you used "Bulk deletes are not supported" or similar?
  private Long newDataFilesDataSequenceNumber;
  private final Map<Integer, DeleteFileSet> newDeleteFilesBySpec = Maps.newHashMap();
  private final Set<String> newDVRefs = Sets.newHashSet();
  private final List<DeleteFile> positionAndEqualityDeletes = Lists.newArrayList();
I see below that this doesn't include DVs. But DVs are still position deletes, so we should probably find a better name.
              file.referencedDataFile(), newFile -> Lists.newArrayList());
      dvsForReferencedFile.add(file);
    } else {
      positionAndEqualityDeletes.add(file);
In v3, position deletes are no longer allowed to be added to the table, so should this have a check that the table is either version 2 or the deletes are equality deletes?
I know that we don't want to add the complexity right now to rewrite v2 position deletes as DVs, but it doesn't seem like we should allow them to be added through the public API.
We should do this in a separate commit. I wouldn't want to backport it to 1.10.x, but we should definitely check, because there isn't a check elsewhere that I can see.
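A rough sketch of the guard discussed above, as a hypothetical follow-up (helper name, placement, and message wording are assumptions): reject new non-DV position deletes on v3 tables while still allowing equality deletes.

import org.apache.iceberg.DeleteFile;
import org.apache.iceberg.FileContent;
import org.apache.iceberg.relocated.com.google.common.base.Preconditions;
import org.apache.iceberg.util.ContentFileUtil;

class AddedDeleteFileChecks {
  // hypothetical check run when a delete file is added; formatVersion comes from table metadata
  static void validateNewDeleteFile(int formatVersion, DeleteFile file) {
    Preconditions.checkArgument(
        formatVersion < 3
            || ContentFileUtil.isDV(file)
            || file.content() == FileContent.EQUALITY_DELETES,
        "Cannot add v2 position delete file to a v%s table: %s",
        formatVersion,
        file.location());
  }
}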
  protected boolean addsDeleteFiles() {
-   return !newDeleteFilesBySpec.isEmpty();
+   return !positionAndEqualityDeletes.isEmpty() || !dvsByReferencedFile.isEmpty();
This isn't entirely correct because it assumes that the lists in dvsByReferencedFile are non-empty. It would probably be better to check:

protected boolean addsDeleteFiles() {
  return !positionAndEqualityDeletes.isEmpty()
      || dvsByReferencedFile.values().stream().anyMatch(dvs -> !dvs.isEmpty());
}

    if (entry.getValue().size() > 1) {
      LOG.warn(
          "Attempted to commit {} duplicate DVs for data file {} in table {}. "
              + "Merging duplicates, and original DVs will be orphaned.",
Will be orphaned? I think this is a bit strong of a warning and could lead to incorrect issue reports.
It would be nice to detect whether the puffin file is actually orphaned by checking whether it has other DVs. It's fine to leave a Puffin file in place if it has live DVs from this commit. And it's fine to delete it if we see that all of its DVs will be merged.
We probably don't need to do all that in this commit, though. I'd probably leave this out and just have a warning about merging, without the part about orphan DVs.
   * @param dvsByFile map of data file location to DVs
   * @return a list containing both any newly merged DVs and any DVs that are already valid
   */
  static List<DeleteFile> mergeAndWriteDvsIfRequired(
Looks like this happens quite a bit. We usually don't follow the Java convention of changing acronyms to camel case. For instance, DV util -> DVUtil instead of DvUtil, and file IO -> fileIO. (The exception is ID, which was added early and does usually change to fieldId.)
            ops().current().specsById());
    // Prevent commiting duplicate V2 deletes by deduping them
    Map<Integer, List<DeleteFile>> newDeleteFilesBySpec =
        Streams.stream(Iterables.concat(finalDVs, DeleteFileSet.of(positionAndEqualityDeletes)))
Why does this use a DeleteFileSet? Do we think that there are cases where not using one is a behavior change? I'd prefer failing if there are duplicate deletes.
    // Prevent commiting duplicate V2 deletes by deduping them
    Map<Integer, List<DeleteFile>> newDeleteFilesBySpec =
        Streams.stream(Iterables.concat(finalDVs, DeleteFileSet.of(positionAndEqualityDeletes)))
            .map(file -> Delegates.pendingDeleteFile(file, file.dataSequenceNumber()))
This doesn't seem correct. The delete files are already wrapped when passed into this class. At a minimum, positionAndEqualityDeletes don't need to be re-wrapped. And merged DVs should already be constructed with the right dataSequenceNumber because that's an internal rewrite.
mergedDVs would be constructed with the right data sequence number, but this was done because we validate that every DeleteFile we want to commit is an instance of PendingDeleteFile, and fail otherwise. I didn't want to do that in DVUtil because I didn't want to make assumptions about how a caller would use it.
But you're right, we don't need this for positionAndEqualityDeletes.
    newDeleteFilesBySpec.forEach(
        (specId, deleteFiles) -> {
          PartitionSpec spec = ops().current().spec(specId);
          deleteFiles.forEach(file -> addedFilesSummary.addedFile(spec, file));
If the cache is invalidated, this will double-count all delete files, which I think we have to fix. That's a bit ugly. The cleanest way I can think of is to keep a summary builder for data files and then combine the current set of deletes with data files here with a new builder, or possibly just clear the current builder and add everything.
    List<DeleteFile> finalDVs =
        DVUtil.mergeAndWriteDvsIfRequired(
            dvsByReferencedFile,
            ThreadPools.getDeleteWorkerPool(),
I'd put this at the end, since it is an optimization and not an argument that changes the result.
    return result;
  }

  private static void validateDVCanBeMerged(
For method names, it's shorter if you avoid past tense. For example: validateCanMerge rather than validateCanBeMerged. I also think since this is DVUtil you can get away with not having DV in each method name.
                    Deletes.readDV(duplicateDvs.get(i), io, encryptionManager));

    // Build a grouping of referenced file to indices of the corresponding duplicate DVs
    Map<String, List<Integer>> dvIndicesByDataFile = Maps.newLinkedHashMap();

        .run(
            i ->
                duplicateDvPositions[i] =
                    Deletes.readDV(duplicateDvs.get(i), io, encryptionManager));
As noted earlier, there should be no need to pass encryptionManager separately.
      for (int i = 1; i < dvIndicesForFile.size(); i++) {
        int dvIndex = dvIndicesForFile.get(i);
        DeleteFile dv = duplicateDvs.get(dvIndex);
        validateDVCanBeMerged(dv, firstDv, partitionComparator);
I think it is a bit late to validate that DVs can be merged because they were all read into memory. I would move this check to the start of the method.
This is a good reason to pass the original map in since it is available to the calling method.
    // Validate and merge per referenced file, caching comparators by spec ID
    Map<Integer, Comparator<StructLike>> comparatorsBySpecId = Maps.newHashMap();
    Map<String, PositionDeleteIndex> result = Maps.newHashMap();
    for (Map.Entry<String, List<Integer>> entry : dvIndicesByDataFile.entrySet()) {
It looks like the reason for the complex logic here is to validate that the DVs can be merged. As I noted below, I think that should be done before reading. If you restructure that way, then all you need is to loop over the PositionDeleteIndex array and duplicateDVs at the same time and either add a new entry for the DeleteFile#referencedDataFile or merge them. I think that would really simplify this part.
I think you're right about this; we can really simplify if we just do the validation in that initial loop where we construct the duplicate mapping. Everything we need to validate is at the metadata level. Then we can work off the assumption later on that everything can be merged, and just loop through the indices and merge.
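A sketch of that simplified shape, assuming the metadata-level validation already happened while building the duplicate mapping, and assuming PositionDeleteIndex exposes position iteration via forEach and accumulation via delete (if it doesn't, the fold would go through whatever merge primitive the index offers):

import java.util.List;
import java.util.Map;
import org.apache.iceberg.DeleteFile;
import org.apache.iceberg.deletes.PositionDeleteIndex;
import org.apache.iceberg.relocated.com.google.common.collect.Maps;

class MergeLoopSketch {
  // walk the duplicate DVs and their in-memory indexes in lockstep,
  // folding every duplicate into a single index per referenced data file
  static Map<String, PositionDeleteIndex> merge(
      List<DeleteFile> duplicateDvs, PositionDeleteIndex[] positions) {
    Map<String, PositionDeleteIndex> mergedByFile = Maps.newLinkedHashMap();
    for (int i = 0; i < duplicateDvs.size(); i++) {
      String referencedFile = duplicateDvs.get(i).referencedDataFile();
      PositionDeleteIndex existing = mergedByFile.get(referencedFile);
      if (existing == null) {
        mergedByFile.put(referencedFile, positions[i]);
      } else {
        // fold the duplicate's deleted positions into the index kept for this file
        positions[i].forEach(existing::delete);
      }
    }
    return mergedByFile;
  }
}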
  // Produces a single Puffin file containing the merged DVs
  private static List<DeleteFile> writeMergedDVs(
      Map<String, PositionDeleteIndex> mergedIndices,

      String referencedLocation = entry.getKey();
      PositionDeleteIndex mergedPositions = entry.getValue();
      List<DeleteFile> duplicateDvs = dvsByFile.get(referencedLocation);
      DeleteFile firstDV = duplicateDvs.get(0);
I'd combine with the previous line. You don't need the duplicates after this, just the first.
        new BaseDVFileWriter(
            // Use an unpartitioned spec for the location provider for the puffin containing
            // all the merged DVs
            OutputFileFactory.builderFor(
This OutputFileFactory has a fake partition spec and IDs. Maybe we don't need to use it, then. All this really needs to do is generate a file name, add the Puffin extension, and call the location provider. I'd rather just do those things directly than fake an OutputFileFactory, which is intended for tasks. This could also use the snapshot ID.
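A minimal sketch of doing those steps directly against the table's LocationProvider, assuming the snapshot id is available; the file-name pattern here is just an example:

import java.util.UUID;
import org.apache.iceberg.FileFormat;
import org.apache.iceberg.io.LocationProvider;

class MergedDvLocations {
  // generate a file name, add the Puffin extension, and ask the location provider
  static String newMergedDvLocation(LocationProvider locations, long snapshotId) {
    String fileName =
        FileFormat.PUFFIN.addExtension(
            String.format("merged-dvs-%d-%s", snapshotId, UUID.randomUUID()));
    return locations.newDataLocation(fileName);
  }
}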
While writers are generally expected to merge DVs for a given data file before attempting to commit, we probably want a safeguard in the commit path in case this assumption is violated. This has been observed when AQE is enabled in Spark and a data file is split across multiple tasks (it really just depends on how files and deletes are split); multiple DVs are then produced for a given data file and committed. Currently, after that commit, reads fail because the DeleteFileIndex detects the duplicates.
Arguably, there should be a safeguard on the commit path that detects duplicates and fixes them up to prevent invalid table states. Doing this behind the API covers any engine integration using the library.
This change updates MergingSnapshotProducer to track duplicate DVs for a data file, merge them, and produce a Puffin file per merged DV. Note that since we generally expect duplicates to be rare, we don't expect many small Puffin files to be produced, so we don't add additional logic to coalesce them into larger files; these can also be compacted later. In the case of large-scale duplicates, engines should arguably fix those up before handing off to the commit path.