fix: handle missing metadata files gracefully in Iceberg analysis#8

Merged
danielbeach merged 1 commit into danielbeach:main from athossampayo:fix/iceberg-missing-metadata-resilience on Oct 20, 2025

@athossampayo (Collaborator) commented Oct 17, 2025

Fix: Handle Missing Metadata Files Gracefully in Iceberg Analysis

This fixes #7

Problem

When analyzing large, actively updated Iceberg tables, the analysis would fail with a generic RuntimeError: Iceberg analysis failed: service error. This error masked the underlying issue: AWS S3 NoSuchKey errors occurring when trying to read metadata files during various analysis phases.

Error Details

RuntimeError: Iceberg analysis failed: service error
ServiceError(ServiceError { source: NoSuchKey(NoSuchKey { message: Some("The specified key does not exist.") }) })

The error occurred during:

  • Schema evolution analysis
  • Time travel metrics calculation
  • Table constraints analysis
  • File compaction Z-order opportunity detection
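The masking described above can be illustrated with a minimal sketch. The `NoSuchKey` and `ServiceError` types below are hypothetical stand-ins, not the actual S3 client types; they only mimic a wrapper whose display text is "service error" while the real cause sits in its source chain. The `describe_chain` helper is likewise an assumption, showing one way to surface the underlying cause.

```rust
use std::error::Error;
use std::fmt;

// Hypothetical stand-in for the object store's "key not found" error.
#[derive(Debug)]
struct NoSuchKey {
    message: String,
}

impl fmt::Display for NoSuchKey {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "NoSuchKey: {}", self.message)
    }
}

impl Error for NoSuchKey {}

// Hypothetical wrapper whose Display output hides the underlying cause.
#[derive(Debug)]
struct ServiceError {
    source: Box<dyn Error + 'static>,
}

impl fmt::Display for ServiceError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "service error")
    }
}

impl Error for ServiceError {
    fn source(&self) -> Option<&(dyn Error + 'static)> {
        Some(&*self.source)
    }
}

/// Joins an error and every error in its source chain into one message.
fn describe_chain(err: &dyn Error) -> String {
    let mut parts = vec![err.to_string()];
    let mut current = err.source();
    while let Some(cause) = current {
        parts.push(cause.to_string());
        current = cause.source();
    }
    parts.join(": ")
}
```

Printing only the outer error yields the vague "service error" text, while walking the chain also reports the missing key.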

Root Cause

This is a race condition common in large, actively updated Iceberg tables:

  1. The analysis starts by listing all objects in the table path
  2. Metadata files (snapshots, manifests) are identified
  3. While the analysis is running, old metadata files are cleaned up by table maintenance operations
  4. When the analysis tries to read these files, they no longer exist
  5. The S3 client returns a NoSuchKey error, causing the entire analysis to fail
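The five steps above can be replayed against an in-memory store. This is only a sketch: `ObjectStore`, `StoreError`, and the metadata key names are hypothetical and stand in for S3 and the real table layout.

```rust
use std::collections::BTreeMap;

/// Hypothetical in-memory stand-in for an object store such as S3.
struct ObjectStore {
    objects: BTreeMap<String, Vec<u8>>,
}

#[derive(Debug, PartialEq)]
enum StoreError {
    NoSuchKey(String),
}

impl ObjectStore {
    /// Returns the keys that currently exist under `prefix`.
    fn list(&self, prefix: &str) -> Vec<String> {
        self.objects
            .keys()
            .filter(|k| k.starts_with(prefix))
            .cloned()
            .collect()
    }

    /// Reads one object, failing with NoSuchKey if it has been removed.
    fn get(&self, key: &str) -> Result<&[u8], StoreError> {
        self.objects
            .get(key)
            .map(|v| v.as_slice())
            .ok_or_else(|| StoreError::NoSuchKey(key.to_string()))
    }

    fn delete(&mut self, key: &str) {
        self.objects.remove(key);
    }
}

/// Lists metadata files, simulates a concurrent cleanup, then reads
/// every key from the now-stale listing.
fn replay_race() -> Vec<Result<usize, StoreError>> {
    let mut store = ObjectStore { objects: BTreeMap::new() };
    store.objects.insert("metadata/v1.metadata.json".to_string(), b"{}".to_vec());
    store.objects.insert("metadata/v2.metadata.json".to_string(), b"{}".to_vec());

    // Steps 1-2: list objects and identify metadata files.
    let listing = store.list("metadata/");

    // Step 3: table maintenance removes an old metadata file.
    store.delete("metadata/v1.metadata.json");

    // Steps 4-5: reading from the stale listing hits NoSuchKey.
    listing
        .iter()
        .map(|key| store.get(key).map(|bytes| bytes.len()))
        .collect()
}
```

The first read fails because the key was valid at listing time but gone at read time; the second still succeeds.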

Solution

Modified the Iceberg analysis to be resilient to missing metadata files:

Changes Made

  1. Graceful Error Handling: Updated analyze_schema_evolution, analyze_time_travel, analyze_table_constraints, and analyze_iceberg_z_order_opportunity methods to handle NoSuchKey errors

  2. Skip Missing Files: Instead of failing the entire analysis, the code now:

    • Catches errors when reading metadata files
    • Skips files that are missing or invalid
    • Continues analyzing available files
  3. User-Friendly Warning: Added an informative warning to the health report recommendations:

    ⚠️  Analysis incomplete: Schema Evolution, Time Travel, Table Constraints 
    sections could not be analyzed due to missing/inaccessible metadata files 
    (common in actively updated tables). Basic metrics are still accurate.
    
  4. Partial Reports: The analysis now returns:

    • Complete basic metrics (file counts, sizes, partitions, unreferenced files, etc.)
    • Partial advanced metrics (only sections where metadata was accessible)
    • Clear warnings about incomplete sections
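A minimal sketch of the skip-and-continue behavior described in these changes, assuming a hypothetical `read` callback in place of the real S3 client and a toy "schema id" payload in place of real Iceberg metadata. The names `ReadError`, `SectionResult`, and `analyze_section` are illustrative and are not taken from src/iceberg.rs.

```rust
/// Hypothetical read failures; the real code deals with S3 client errors.
#[derive(Debug)]
enum ReadError {
    NoSuchKey,
    Other(String),
}

/// Result of one analysis section: the values that could be parsed,
/// plus a count of files that were skipped.
#[derive(Debug, Default)]
struct SectionResult {
    schema_ids: Vec<u32>,
    skipped: usize,
}

impl SectionResult {
    /// A section is partial if at least one file was skipped.
    fn is_partial(&self) -> bool {
        self.skipped > 0
    }
}

/// Reads each metadata file, skipping missing or invalid ones
/// instead of aborting the whole analysis.
fn analyze_section<F>(keys: &[&str], read: F) -> SectionResult
where
    F: Fn(&str) -> Result<String, ReadError>,
{
    let mut result = SectionResult::default();
    for &key in keys {
        // A failed read (for example NoSuchKey) skips this file only.
        let body = match read(key) {
            Ok(body) => body,
            Err(_) => {
                result.skipped += 1;
                continue;
            }
        };
        // An unparsable file is skipped in the same way.
        match body.trim().parse::<u32>() {
            Ok(id) => result.schema_ids.push(id),
            Err(_) => result.skipped += 1,
        }
    }
    result
}
```

Tracking the skip count per section is one way to decide later whether a section should be flagged as incomplete in the report.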

Code Changes

src/iceberg.rs

  • Modified analyze_schema_evolution() to use match statements for error handling
  • Modified analyze_time_travel() to skip missing metadata files
  • Modified analyze_table_constraints() to continue on errors
  • Modified analyze_iceberg_z_order_opportunity() to handle missing files
  • Updated generate_recommendations() to warn about incomplete sections
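As a rough illustration of the last item, the warning could be assembled from the names of the sections that were incomplete. The function names and signatures below are hypothetical and do not reflect the actual generate_recommendations() in src/iceberg.rs; only the warning text follows the message quoted in this PR.

```rust
/// Builds the "analysis incomplete" warning, or None when every
/// section was analyzed successfully.
fn incomplete_warning(incomplete_sections: &[&str]) -> Option<String> {
    if incomplete_sections.is_empty() {
        return None;
    }
    Some(format!(
        "⚠️  Analysis incomplete: {} sections could not be analyzed due to \
         missing/inaccessible metadata files (common in actively updated tables). \
         Basic metrics are still accurate.",
        incomplete_sections.join(", ")
    ))
}

/// Puts the warning (if any) first, followed by the other recommendations.
fn recommendations_with_warning(
    incomplete_sections: &[&str],
    other: Vec<String>,
) -> Vec<String> {
    let mut recommendations = Vec::new();
    if let Some(warning) = incomplete_warning(incomplete_sections) {
        recommendations.push(warning);
    }
    recommendations.extend(other);
    recommendations
}
```

Placing the warning first matches the sample report in this PR, where it appears as recommendation 1.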

Testing

Tested On

  • Production Table
  • Table Size: 93,501 GB across 3,344,871 files
  • Metadata Files: 116 snapshot/manifest files

Results

Before Fix: Analysis failed with RuntimeError: Iceberg analysis failed: service error

After Fix: Analysis completes successfully with:

  • All basic metrics calculated correctly
  • Partial advanced metrics where metadata was accessible
  • Clear warning about incomplete sections
  • Full set of actionable recommendations

Test Output (masked confidential info)

============================================================
Table Health Report: s3://xxxxxxx/schema/table/
Type: iceberg
============================================================

🔴 Overall Health Score: 4.0%

📊 Key Metrics:
  Total Files:         3344871
  Total Size:          93501.51 GB
  Average File Size:   28.62 MB
  Partition Count:     19595

📦 File Compaction Analysis:
  Compaction Opportunity: 0.80
  Small Files Count:     2672541
  Compaction Priority:   CRITICAL

💡 Recommendations:
  1. ⚠️  Analysis incomplete: Schema Evolution, Time Travel, Table Constraints 
     sections could not be analyzed due to missing/inaccessible metadata files 
     (common in actively updated tables). Basic metrics are still accurate.
  2. Found 3344871 unreferenced files. Consider running VACUUM to clean up.
  3. High file compaction opportunity detected. Consider running rewrite_data_files.
  ...
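The key metrics in the report above can be sanity-checked with simple arithmetic. The sketch below assumes 1 GB = 1024 MB and assumes that the compaction opportunity is the share of small files among all files. Both assumptions are consistent with the reported numbers but are not confirmed by the source, and the function names are hypothetical.

```rust
/// Average file size in MB, assuming 1 GB = 1024 MB.
fn average_file_size_mb(total_size_gb: f64, total_files: u64) -> f64 {
    if total_files == 0 {
        return 0.0;
    }
    total_size_gb * 1024.0 / total_files as f64
}

/// Share of small files among all files (assumed definition of the
/// compaction opportunity).
fn small_file_ratio(small_files: u64, total_files: u64) -> f64 {
    if total_files == 0 {
        return 0.0;
    }
    small_files as f64 / total_files as f64
}
```

With the reported values (93501.51 GB, 3344871 files, 2672541 small files), these round to 28.62 MB and 0.80 at two decimal places.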

Impact

User Benefits

  • ✅ Analyses no longer fail on actively updated tables
  • ✅ Users still get valuable insights even with missing metadata
  • ✅ Clear communication about what was analyzed
  • ✅ Better user experience for large production tables

Backward Compatibility

  • ✅ No breaking changes to API
  • ✅ Existing code continues to work
  • ✅ Additional warnings in reports are informative, not disruptive

Related Issues

This fix addresses a common scenario in production environments where:

  • Tables are continuously updated
  • Maintenance operations run concurrently
  • Analysis jobs overlap with cleanup operations
  • Large tables have many metadata files being rotated

Checklist

  • Code changes tested on production-scale table
  • All Rust tests pass (cargo test)
  • All Python tests pass (pytest)
  • No debug logs or temporary code included
  • Error handling is production-ready
  • User-facing messages are clear and helpful
  • No breaking changes to public API

Add resilient error handling for NoSuchKey errors when reading metadata files during:
- Schema evolution analysis
- Time travel metrics
- Table constraints analysis
- File compaction Z-order opportunity detection

This is common in large, actively updated Iceberg tables where old metadata files
are cleaned up while the table is still being queried. The analysis now continues
when metadata files are missing, and adds a warning to the health report about
incomplete sections while still providing all basic metrics.

Fixes race condition where metadata files listed at the start of analysis are
deleted/moved before being read.
@danielbeach (Owner) left a comment

Many thanks for your contribution.

@athossampayo (Collaborator, Author) commented Oct 20, 2025

Would you mind adding a "hacktoberfest-accepted" label or a "hacktoberfest" topic to the repository before merging? Thanks :)

@danielbeach (Owner) commented

> Would you mind adding a "hacktoberfest-accepted" label or a "hacktoberfest" topic to the repository before merging? Thanks :)

Done

@danielbeach danielbeach merged commit a688ae4 into danielbeach:main Oct 20, 2025
1 of 6 checks passed
@athossampayo athossampayo deleted the fix/iceberg-missing-metadata-resilience branch October 20, 2025 14:51
Development

Successfully merging this pull request may close these issues.

Iceberg analysis fails with vague "service error" on large/active tables due to missing metadata files
