Improve embedding data loading robustness and dataset structure by clemsgrs · Pull Request #61 · clemsgrs/slide2vec

clemsgrs · 2026-02-26T01:59:07Z

Summary

add worker-local cache support to legacy TileDataset
add speed.use_parquet toggle to choose between the legacy .npy dataset and the parquet catalog dataset
switch temp shard format from HDF5 to .npy memmaps to avoid HDF5 lock/permission failures
merge rank shards with indexed fill (order-independent, bounded memory)
improve RegionUnfolding repr
move TileCatalogDataset class to data/dataset.py; keep tile_catalog.py focused on catalog utilities
clean up tmp_feature_shards at end of run
force out-of-order DataLoader delivery when supported (in_order=False)
fix process_list error/traceback dtype handling and clear stale errors on success
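
For readers unfamiliar with the "indexed fill" merge mentioned above, here is a minimal sketch of the idea, not the code from this PR. It assumes each rank writes two .npy files (features plus the global tile index of each row); the file names and the helpers write_shard and merge_shards are hypothetical and chosen only for illustration.

```python
from pathlib import Path

import numpy as np
from numpy.lib.format import open_memmap


def write_shard(shard_dir, rank, indices, features):
    """Hypothetical helper: save one rank's rows and their global tile indices."""
    shard_dir = Path(shard_dir)
    np.save(shard_dir / f"rank{rank}_features.npy",
            np.asarray(features, dtype=np.float32))
    np.save(shard_dir / f"rank{rank}_indices.npy",
            np.asarray(indices, dtype=np.int64))


def merge_shards(shard_dir, out_path, num_tiles, dim, chunk=1024):
    """Hypothetical helper: merge shards by writing each row at its global index.

    The result does not depend on the order in which shards are visited or on
    the order of rows inside a shard. Only `chunk` rows are copied at a time,
    so resident memory stays bounded regardless of the total tile count.
    """
    shard_dir = Path(shard_dir)
    # Disk-backed output array in .npy format, created with the final shape.
    out = open_memmap(str(out_path), mode="w+", dtype=np.float32,
                      shape=(num_tiles, dim))
    filled = np.zeros(num_tiles, dtype=bool)

    for idx_path in sorted(shard_dir.glob("rank*_indices.npy")):
        feat_path = idx_path.with_name(
            idx_path.name.replace("_indices", "_features"))
        indices = np.load(idx_path)
        feats = np.load(feat_path, mmap_mode="r")  # read lazily from disk
        for start in range(0, len(indices), chunk):
            sel = indices[start:start + chunk]
            out[sel] = feats[start:start + chunk]
            filled[sel] = True

    out.flush()
    if not filled.all():
        missing = int((~filled).sum())
        raise ValueError(f"{missing} tiles were not written by any shard")
    return out_path
```

Because every row carries its own destination index, shards produced with out-of-order DataLoader delivery can still be merged into the original tile order, and the filled mask catches tiles that no rank produced.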

clemsgrs added 2 commits February 26, 2026 01:43

Improve embedding data loading robustness and dataset structure

6242203

add pyarrow to requirements

4b863f0