Parquet metadata check limit optimization #11782

@jinchengchenghh

Description

The metadata validation config (spark.gluten.sql.fallbackUnexpectedMetadataParquet) currently defaults to false. When it is set to true, each root path is checked up to the file limit (spark.gluten.sql.fallbackUnexpectedMetadataParquet.limit). If there are many partitions, this validation becomes expensive.

A possible solution is to sample the rootPaths and validate only a selected subset of files.

The number of files to sample should be determined by two inputs: the total file limit, and the total number of files under the root paths, with the latter scaled by a percentage.
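To make the proposal concrete, here is a minimal sketch of one way the sample size could be computed. It is not Gluten code: the names `sample_budget`, `sample_files`, `total_limit`, and `sample_fraction` are hypothetical, and it assumes the file listing per root path is already available and that files are drawn uniformly across all root paths.

```python
import math
import random
from typing import Dict, List, Optional


def sample_budget(total_files: int, total_limit: int, sample_fraction: float) -> int:
    """Number of files to validate (hypothetical helper).

    Takes a fraction of all files, capped by the total file limit.
    """
    if total_files <= 0 or total_limit <= 0 or sample_fraction <= 0:
        return 0
    by_fraction = math.ceil(total_files * sample_fraction)
    # Never exceed the limit or the number of files that actually exist.
    return min(total_limit, by_fraction, total_files)


def sample_files(
    files_by_root: Dict[str, List[str]],
    total_limit: int,
    sample_fraction: float,
    seed: Optional[int] = None,
) -> List[str]:
    """Pick files uniformly at random across all root paths, within the budget."""
    all_files = [f for files in files_by_root.values() for f in files]
    budget = sample_budget(len(all_files), total_limit, sample_fraction)
    rng = random.Random(seed)  # seed only to make the sketch reproducible
    return rng.sample(all_files, budget)
```

For example, with 1000 files in total, a limit of 10, and a fraction of 0.05, the percentage alone would select 50 files, so the limit caps the budget at 10. Listing every file under every root path may itself be costly, so a real implementation might sample root paths first and only list files under the chosen ones.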

Gluten version

None

Metadata

Labels

enhancement (New feature or request)