Improve embedding data loading robustness and dataset structure #61

Open
clemsgrs wants to merge 2 commits into main from fix/embed-dataloader-shards-and-dataset-structure

@clemsgrs
Owner

Summary

  • add worker-local cache support to legacy TileDataset
  • add speed.use_parquet toggle (legacy .npy dataset vs parquet catalog dataset)
  • switch temp shard format from HDF5 to .npy memmaps to avoid HDF5 lock/permission failures
  • merge rank shards with indexed fill (order-independent, bounded memory)
  • improve RegionUnfolding repr
  • move TileCatalogDataset class to data/dataset.py; keep tile_catalog.py focused on catalog utilities
  • clean up tmp_feature_shards at end of run
  • force out-of-order DataLoader delivery when supported (in_order=False)
  • fix process_list error/traceback dtype handling and stale error clearing on success
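
For readers unfamiliar with the "indexed fill" merge, here is a minimal sketch of the idea, not the code in this PR. It assumes each rank writes two `.npy` files (global row indices and the matching features) and that the final array shape is known up front. The helper names `write_rank_shard` and `merge_rank_shards`, the `rank{N}_indices.npy` / `rank{N}_features.npy` naming, and the `chunk_rows` parameter are all hypothetical.

```python
from pathlib import Path

import numpy as np


def write_rank_shard(shard_dir, rank, indices, features):
    """Save one rank's global row indices and features as plain .npy files.

    File naming here is a hypothetical convention chosen for this sketch.
    """
    shard_dir = Path(shard_dir)
    np.save(shard_dir / f"rank{rank}_indices.npy",
            np.asarray(indices, dtype=np.int64))
    np.save(shard_dir / f"rank{rank}_features.npy",
            np.asarray(features, dtype=np.float32))


def merge_rank_shards(shard_dir, out_path, num_rows, feature_dim, chunk_rows=1024):
    """Merge rank shards into one .npy file by writing rows at their global index.

    Each row carries its own destination index, so the order in which shards
    (or rows within a shard) are processed does not matter. Features are read
    through a memmap and copied in chunks, so peak memory is bounded by
    roughly chunk_rows * feature_dim values plus one shard's index array.
    Returns True if every output row was written at least once.
    """
    shard_dir = Path(shard_dir)
    # Create the output .npy on disk as a writable memmap.
    out = np.lib.format.open_memmap(
        out_path, mode="w+", dtype=np.float32, shape=(num_rows, feature_dim)
    )
    filled = np.zeros(num_rows, dtype=bool)

    for idx_path in sorted(shard_dir.glob("rank*_indices.npy")):
        feat_path = idx_path.with_name(
            idx_path.name.replace("_indices", "_features")
        )
        indices = np.load(idx_path)
        features = np.load(feat_path, mmap_mode="r")  # read lazily from disk
        for start in range(0, len(indices), chunk_rows):
            rows = indices[start:start + chunk_rows]
            out[rows] = features[start:start + chunk_rows]
            filled[rows] = True

    out.flush()
    del out  # release the memmap before callers reopen the file
    return bool(filled.all())
```

A caller would write one shard per rank, call `merge_rank_shards` once on the main process, and then delete the shard directory. The returned flag is a cheap way to detect missing rows before the temporary shards are removed.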

Tests

  • python -m pytest -q -o addopts='' tests/test_regression_bugfixes.py
  • python -m pytest -q -o addopts='' tests/test_tile_catalog.py
