
Add an optional cache for ParquetFragmentStreamer#1492

Open
caciolai wants to merge 4 commits into main from caciolai/parquet_dataset_caching

Conversation

@caciolai
Contributor

@caciolai caciolai commented Feb 9, 2026

What does this PR do? Please describe:

  • When instantiating N data pipelines across different partitions of the same Parquet dataset, ParquetFragmentStreamer currently initializes N separate Dataset instances, duplicating much of the work (filesystem scanning, metadata parsing, etc.). This PR adds an optional cache so that the underlying Dataset can be shared across streamers.
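The sharing the PR aims for can be sketched with a process-wide `functools.lru_cache` around the expensive dataset construction. This is a minimal, hypothetical illustration, not the actual fairseq2 code: `FragmentStreamer`, `_load_dataset`, and the dict standing in for the pyarrow Dataset are invented names.

```python
from functools import lru_cache

LOAD_CALLS = 0


@lru_cache(maxsize=1)
def _load_dataset(path: str) -> dict:
    """Stand-in for the expensive pyarrow dataset construction
    (filesystem scan + Parquet metadata parsing)."""
    global LOAD_CALLS
    LOAD_CALLS += 1
    return {"path": path}


class FragmentStreamer:
    # Hypothetical simplification of ParquetFragmentStreamer: with
    # use_cache=True, all instances pointing at the same path share
    # one dataset object instead of each building their own.
    def __init__(self, path: str, use_cache: bool = False) -> None:
        self.dataset = _load_dataset(path) if use_cache else {"path": path}


streamers = [FragmentStreamer("s3://bucket/ds", use_cache=True) for _ in range(4)]
print(LOAD_CALLS)  # 1: the dataset was constructed only once for all four streamers
```

With `use_cache=False` (the default), each streamer builds its own dataset, preserving the existing code path.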

Does your PR introduce any breaking changes? If yes, please list them:

  • The cache mechanism is controlled by a flag set to False by default
  • Previously existing code path(s) are not affected

Check list:

  • Was the content of this PR discussed and approved via a GitHub issue? (no need for typos or documentation improvements)
  • Did you read the contributor guideline?
  • Did you make sure that your PR does only one thing instead of bundling different changes together?
  • Did you make sure to update the documentation with your changes? (if necessary)
  • Did you write any new necessary tests?
  • Did you verify new and existing tests pass locally with your changes?
  • Did you update the CHANGELOG? (no need for typos, documentation, or minor internal changes)

@caciolai caciolai marked this pull request as ready for review February 9, 2026 18:13
)


@lru_cache(maxsize=1) # TODO: decide on reasonable upper bound
Contributor
i would pass max_cache_size as a param here instead of use_cache
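For context on this suggestion: `functools.lru_cache` fixes `maxsize` at the moment the decorator is applied, so passing a size as an ordinary argument to the decorated function would not resize the cache. That constraint is what the rest of the thread wrestles with. A minimal illustration (the `load` function is hypothetical):

```python
from functools import lru_cache


# maxsize is bound at decoration time; it cannot be changed per call.
@lru_cache(maxsize=2)
def load(path: str) -> str:
    return f"dataset:{path}"


load("a")
load("b")
load("a")  # served from the cache

print(load.cache_info().hits)     # 1
print(load.cache_info().maxsize)  # 2
```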


Contributor Author
yes, but how can you dynamically set (tied to an object instance) a singleton component that you want to exist across instances?

Contributor
maybe as some global context...

Contributor Author

what if different ParquetFragmentStreamer instances ask for different cache sizes? Or were you thinking of something like AssetStore, i.e. a singleton handled by fairseq2 and controlled via the recipe? That would be a deeper change though, I think, as I would need to touch recipe composition
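One hedged sketch of the "global context" direction: a process-wide cache object whose bound is configured once (e.g. at recipe composition time), so individual streamer instances never pass conflicting sizes. `DatasetCache` and `_GLOBAL_DATASET_CACHE` are illustrative names, not existing fairseq2 APIs.

```python
import threading
from collections import OrderedDict
from typing import Callable


class DatasetCache:
    """Hypothetical process-wide LRU cache with a configurable bound."""

    def __init__(self, maxsize: int = 1) -> None:
        self._maxsize = maxsize
        self._entries: OrderedDict = OrderedDict()
        self._lock = threading.Lock()

    def get_or_load(self, key, loader: Callable):
        with self._lock:
            if key in self._entries:
                self._entries.move_to_end(key)  # mark as most recently used
                return self._entries[key]
        # Load outside the lock so slow dataset construction does not
        # block unrelated cache lookups.
        value = loader(key)
        with self._lock:
            self._entries[key] = value
            if len(self._entries) > self._maxsize:
                self._entries.popitem(last=False)  # evict least recently used
        return value


# One shared instance; its size could be set once at startup (e.g. from
# the recipe), sidestepping per-instance maxsize conflicts.
_GLOBAL_DATASET_CACHE = DatasetCache(maxsize=2)
```

A streamer would then call `_GLOBAL_DATASET_CACHE.get_or_load(path, build_dataset)` instead of owning its own cache, so the size question is answered exactly once per process.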

The meta-cla bot added the CLA Signed label (managed by the Facebook bot; authors need to sign the CLA before a PR can be reviewed) on Feb 9, 2026
@caciolai
Contributor Author

thinking that we probably want to also cache
