Background: the dataloader slows down over time, especially when training from a large number of slides. Data that is persistent in memory loads quickly (the case for a very small number of slides), but not when training across many slides. One issue is calling `.compute()` inside `__getitem__`, while still needing to apply data augmentations (albumentations) consistently to both the image and its mask for the semantic segmentation task; making the dataloading more daskified therefore adds some complexity:
- https://discuss.pytorch.org/t/deadlock-with-dataloader-and-xarray-dask/9387
- Using dask with PyTorch (train a model) dask/distributed#2581
- https://examples.dask.org/machine-learning/torch-prediction.html
- https://github.com/horovod/horovod
The bottleneck is in `__getitem__`: once the data is actually loaded, it passes through the DL model quickly.
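To make the `__getitem__` bottleneck concrete, here is a minimal sketch of a map-style dataset over lazy dask arrays. All names (`SlideDataset`, `joint_hflip`, the array shapes) are hypothetical, and the joint flip stands in for an albumentations transform that receives both `image=` and `mask=` so the same spatial augmentation hits both:

```python
import numpy as np
import dask.array as da

# Hypothetical lazy slide stack: (N, H, W, C) images and (N, H, W) masks,
# chunked one sample per chunk (as a zarr-backed store might be).
images = da.random.random((8, 64, 64, 3), chunks=(1, 64, 64, 3))
masks = da.random.randint(0, 2, (8, 64, 64), chunks=(1, 64, 64))

def joint_hflip(image, mask):
    """Stand-in for an albumentations transform: the same spatial op
    must be applied to the image and its segmentation mask."""
    return image[:, ::-1, :], mask[:, ::-1]

class SlideDataset:
    """Minimal map-style dataset; __getitem__ mirrors torch.utils.data.Dataset."""
    def __init__(self, images, masks):
        self.images, self.masks = images, masks

    def __len__(self):
        return self.images.shape[0]

    def __getitem__(self, idx):
        # .compute() materializes one chunk per sample -- this per-item
        # round trip through the dask scheduler is the suspected slowdown.
        img = self.images[idx].compute()
        msk = self.masks[idx].compute()
        return joint_hflip(img, msk)

ds = SlideDataset(images, masks)
img, msk = ds[0]
```

Because the augmentation runs on materialized numpy arrays, it stays outside the dask graph; moving it inside (or batching the `.compute()`) is where the complexity mentioned above comes in.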
Potentially nice ideas:
- Daskifying the collate function when collecting data
- Chunk size of zarr/dask arrays
- https://github.com/muammar/ml4chem/blob/5bc7808dc0c3ecd650bc52ebc14c2c6fa4e93ef9/ml4chem/atomistic/models/autoencoders.py#L1062
- https://github.com/muammar/ml4chem/blob/d2dec155f53aedada4b106f2173cf315a8b95b2b/ml4chem/atomistic/models/neuralnetwork.py#L662
- Will add more here to this issue.
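The first two ideas above could be combined: return uncomputed dask slices from `__getitem__` and trigger a single `.compute()` per batch in the collate function, with the array rechunked so one chunk covers one batch. This is a speculative sketch, not a tested fix; `dask_collate` and the shapes are hypothetical, and in a real pipeline it would be passed to `DataLoader(collate_fn=...)`:

```python
import numpy as np
import dask.array as da

# Hypothetical lazy slide stack, rechunked so one chunk spans one batch of 4
# (aligning the zarr/dask chunk size with the batch size, per the idea above).
batch_size = 4
images = da.random.random((8, 64, 64, 3)).rechunk((batch_size, 64, 64, 3))

def dask_collate(lazy_samples):
    """Daskified collate: stack still-lazy samples into one dask array and
    issue a single .compute() per batch instead of one per __getitem__."""
    return da.stack(lazy_samples).compute()

# Samples stay lazy until collate time; only the batch is materialized.
batch = dask_collate([images[i] for i in range(batch_size)])
```

With chunks aligned to batches, the per-batch `.compute()` should touch one chunk rather than `batch_size` separate ones, which is the hoped-for win.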
@lvaickus , can you comment more here?