
Add an optional cache for ParquetFragmentStreamer#1492

Open
caciolai wants to merge 4 commits into main from caciolai/parquet_dataset_caching

Conversation

@caciolai
Contributor

@caciolai caciolai commented Feb 9, 2026

What does this PR do? Please describe:

  • When instantiating N data pipelines across different partitions of the same Parquet dataset, ParquetFragmentStreamer currently initializes N separate Dataset instances, duplicating much of the work (filesystem scanning, metadata parsing, etc.). This PR adds an optional cache so that the underlying Dataset can be shared across streamers.
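The sharing the PR aims for can be sketched with a process-wide `functools.lru_cache` around the expensive dataset construction. This is a minimal, hypothetical illustration, not the actual fairseq2 code: `FragmentStreamer`, `_load_dataset`, and the dict standing in for the pyarrow Dataset are invented names.

```python
from functools import lru_cache

LOAD_CALLS = 0


@lru_cache(maxsize=1)
def _load_dataset(path: str) -> dict:
    """Stand-in for the expensive pyarrow dataset construction
    (filesystem scan + Parquet metadata parsing)."""
    global LOAD_CALLS
    LOAD_CALLS += 1
    return {"path": path}


class FragmentStreamer:
    # Hypothetical simplification of ParquetFragmentStreamer: with
    # use_cache=True, all instances pointing at the same path share
    # one dataset object instead of each building their own.
    def __init__(self, path: str, use_cache: bool = False) -> None:
        self.dataset = _load_dataset(path) if use_cache else {"path": path}


streamers = [FragmentStreamer("s3://bucket/ds", use_cache=True) for _ in range(4)]
print(LOAD_CALLS)  # 1: the dataset was constructed only once for all four streamers
```

With `use_cache=False` (the default), each streamer builds its own dataset, preserving the existing code path.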

Does your PR introduce any breaking changes? If yes, please list them:

  • The cache mechanism is controlled by a flag set to False by default
  • Previously existing code path(s) are not affected

Check list:

  • Was the content of this PR discussed and approved via a GitHub issue? (no need for typos or documentation improvements)
  • Did you read the contributor guideline?
  • Did you make sure that your PR does only one thing instead of bundling different changes together?
  • Did you make sure to update the documentation with your changes? (if necessary)
  • Did you write any new necessary tests?
  • Did you verify new and existing tests pass locally with your changes?
  • Did you update the CHANGELOG? (no need for typos, documentation, or minor internal changes)

@caciolai caciolai marked this pull request as ready for review February 9, 2026 18:13
)


@lru_cache(maxsize=1) # TODO: decide on reasonable upper bound
Contributor
i would pass max_cache_size as a param here instead of use_cache
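For context on this suggestion: `functools.lru_cache` fixes `maxsize` at the moment the decorator is applied, so passing a size as an ordinary argument to the decorated function would not resize the cache. That constraint is what the rest of the thread wrestles with. A minimal illustration (the `load` function is hypothetical):

```python
from functools import lru_cache


# maxsize is bound at decoration time; it cannot be changed per call.
@lru_cache(maxsize=2)
def load(path: str) -> str:
    return f"dataset:{path}"


load("a")
load("b")
load("a")  # served from the cache

print(load.cache_info().hits)     # 1
print(load.cache_info().maxsize)  # 2
```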


Contributor Author
yes, but how can you dynamically set (tied to an object instance) a singleton component that you want to exist across instances?

Contributor
maybe as some global context...

Contributor Author

what if different ParquetFragmentStreamer instances ask for different cache sizes? Or were you thinking of something like AssetStore, i.e. a singleton handled by fairseq2 and controlled via the recipe? That would be a deeper change though, I think, as I would need to touch recipe composition
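One hedged sketch of the "global context" direction: a process-wide cache object whose bound is configured once (e.g. at recipe composition time), so individual streamer instances never pass conflicting sizes. `DatasetCache` and `_GLOBAL_DATASET_CACHE` are illustrative names, not existing fairseq2 APIs.

```python
import threading
from collections import OrderedDict
from typing import Callable


class DatasetCache:
    """Hypothetical process-wide LRU cache with a configurable bound."""

    def __init__(self, maxsize: int = 1) -> None:
        self._maxsize = maxsize
        self._entries: OrderedDict = OrderedDict()
        self._lock = threading.Lock()

    def get_or_load(self, key, loader: Callable):
        with self._lock:
            if key in self._entries:
                self._entries.move_to_end(key)  # mark as most recently used
                return self._entries[key]
        # Load outside the lock so slow dataset construction does not
        # block unrelated cache lookups.
        value = loader(key)
        with self._lock:
            self._entries[key] = value
            if len(self._entries) > self._maxsize:
                self._entries.popitem(last=False)  # evict least recently used
        return value


# One shared instance; its size could be set once at startup (e.g. from
# the recipe), sidestepping per-instance maxsize conflicts.
_GLOBAL_DATASET_CACHE = DatasetCache(maxsize=2)
```

A streamer would then call `_GLOBAL_DATASET_CACHE.get_or_load(path, build_dataset)` instead of owning its own cache, so the size question is answered exactly once per process.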

The meta-cla bot added the CLA Signed label (managed by the Facebook bot; authors need to sign the CLA before a PR can be reviewed) on Feb 9, 2026
@caciolai
Contributor Author

thinking that we probably want to also cache
